[honey] Better per-term wdf upper bound
commit caca4e6999620c207584f061647a3ec8a5c96aaf
author Olly Betts <olly@survex.com>
Tue, 13 Mar 2018 02:46:25 +0000 (13 15:46 +1300)
committer Olly Betts <olly@survex.com>
Tue, 13 Mar 2018 03:57:11 +0000 (13 16:57 +1300)
tree 45909647b7203ff354a6ec9b627d0bd8eac2d306
parent eaf81734cbcab99b54320d683d90bdad237c40b7

Previously we used min(cf(term), wdf_upper_bound(db)) as the per-term
upper bound. That's tight for any terms which attain that upper bound,
and for terms with termfreq == 1 (where cf equals the single wdf).
Such terms are common in the database (e.g. 66% of terms for a
database of Wikipedia), but probably much less common in searches.

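We now use max(first_wdf(term), cf(term) - first_wdf(term)) when
termfreq > 1: the first entry has wdf first_wdf(term), and no later
entry can exceed the cf(term) - first_wdf(term) which the remaining
entries share.  This means terms with termfreq == 2 will also attain
their bound (another 11% for the same database), while terms with
higher termfreq but below the global bound will get a tighter bound.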
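
As a rough illustration of the two bounds described above, here is a
minimal standalone sketch.  It is not the code from this commit: the
TermStats struct and both function names are hypothetical, and it
assumes the previous bound is still what applies when termfreq <= 1,
which the message implies but doesn't state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-term statistics (illustrative names, not Xapian's API).
struct TermStats {
    std::uint64_t termfreq;   // number of documents containing the term
    std::uint64_t collfreq;   // cf(term): sum of the term's wdf over all documents
    std::uint64_t first_wdf;  // wdf of the first entry in the term's posting list
};

// Previous bound: min(cf(term), wdf_upper_bound(db)).
inline std::uint64_t
old_wdf_upper_bound(const TermStats& t, std::uint64_t db_wdf_upper_bound)
{
    return std::min(t.collfreq, db_wdf_upper_bound);
}

// New bound: max(first_wdf, cf - first_wdf) when termfreq > 1.
// Assumption: fall back to the previous bound otherwise.
inline std::uint64_t
new_wdf_upper_bound(const TermStats& t, std::uint64_t db_wdf_upper_bound)
{
    // The first entry's wdf is part of cf, so it can't exceed it.
    assert(t.first_wdf <= t.collfreq);
    if (t.termfreq > 1) {
        // The first entry has wdf first_wdf; every later entry is at most
        // the total wdf left over once the first entry is subtracted.
        return std::max(t.first_wdf, t.collfreq - t.first_wdf);
    }
    return old_wdf_upper_bound(t, db_wdf_upper_bound);
}
```

For a term with termfreq == 2, cf == 10 and first_wdf == 3, the second
entry must have wdf 7, so the new bound of 7 is attained, whereas the
old bound would be 10 for any global bound of 10 or more.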
xapian-core/backends/honey/honey_database.cc
xapian-core/backends/honey/honey_postlisttable.cc
xapian-core/backends/honey/honey_postlisttable.h