[glass] Better per-term wdf upper bound
commit b3120f6137dbffb1c5a97f944a724bca80a718fe
author Olly Betts <olly@survex.com>
Wed, 14 Mar 2018 20:44:36 +0000 (15 09:44 +1300)
committer Olly Betts <olly@survex.com>
Tue, 20 Mar 2018 20:20:07 +0000 (21 09:20 +1300)
tree 0675a61b52f4bf2be863e639fdf9a3a8b8ffea57
parent e657c070c7ccee7bd183705318a48f551f1b2b19

Previously we used min(cf(term), wdf_upper_bound(db)) for the
per-term upper bound. That's tight for any term which attains the
database-wide upper bound, and for terms with termfreq == 1, which
are common in the database (e.g. 66% of terms for a database of
Wikipedia), but probably much less common in searches.

We now use max(first_wdf(term), cf(term) - first_wdf(term)) when
termfreq > 1, which means terms with termfreq == 2 will also attain
their bound (another 11% for the same database) while terms with higher
termfreq but below the global bound will get a tighter bound.
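
The two bounds can be compared with a minimal standalone sketch. This is
not the code from glass_database.cc or glass_postlist.cc: the TermStats
struct and the function names are hypothetical, and capping the new bound
with the database-wide bound is an assumption based on the wording above.
The new bound holds because the largest wdf is either the first entry's
wdf, or the wdf of some other entry, and no other entry can exceed
cf(term) - first_wdf(term).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-term statistics (illustrative names, not Xapian's API).
struct TermStats {
    std::uint32_t termfreq;   // number of documents containing the term
    std::uint64_t collfreq;   // cf(term): sum of wdf over all documents
    std::uint32_t first_wdf;  // wdf of the first entry in the posting list
};

// Old bound: min(cf(term), wdf_upper_bound(db)).
std::uint64_t old_wdf_bound(const TermStats& t, std::uint64_t db_bound) {
    return std::min(t.collfreq, db_bound);
}

// New bound: max(first_wdf, cf - first_wdf) when termfreq > 1.
// Assumption: the result is still capped by the database-wide bound.
std::uint64_t new_wdf_bound(const TermStats& t, std::uint64_t db_bound) {
    if (t.termfreq <= 1) {
        // With a single posting the wdf equals cf, so the old bound is exact.
        return std::min(t.collfreq, db_bound);
    }
    assert(t.collfreq >= t.first_wdf);
    // Total wdf of every entry after the first; no single entry can exceed it.
    std::uint64_t rest = t.collfreq - t.first_wdf;
    std::uint64_t bound = std::max<std::uint64_t>(t.first_wdf, rest);
    return std::min(bound, db_bound);
}
```

For a term with termfreq == 2, cf == 10 and first_wdf == 3, the old bound
is 10 while the new bound is 7, which is exactly the wdf of the second
entry, so the bound is attained.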

Applies the same technique as caca4e6999620c207584f061647a3ec8a5c96aaf
did for honey.

(cherry picked from commit 81dd4862b907fa40380996532b54b6633bd355dd)
xapian-core/backends/glass/glass_database.cc
xapian-core/backends/glass/glass_postlist.cc
xapian-core/backends/glass/glass_postlist.h