Public Git Hosting - xapian.git/commit

commit	bc5bd910851fae3a5213ecab5e635288162c5dd4
author	Robert Stepanek <rsto@fastmailteam.com>
	Mon, 8 Jan 2024 12:47:32 +0000 (13:47 +0100)
committer	Robert Stepanek <rsto@fastmailteam.com>
	Mon, 8 Jan 2024 12:47:32 +0000 (13:47 +0100)
tree	e1c02be84d84a318367a320d490467fb506c5c7c
parent	df2ed66eb61493135e48bf704d7cdfb7df670888

Do not treat fullwidth Latin and symbols as unbroken script

Up until now all Unicode codepoints in the Halfwidth and
Fullwidth Forms Block (U+FF00..U+FFEF) were treated as
unbroken script. As a result, terms consisting of fullwidth
Latin characters in this range were not lowercased before
indexing, so queries failed to find such text.

This patch changes word-breaker to only consider halfwidth
Katakana and Hangul characters as unbroken script, handling
all fullwidth Latin characters, numbers and symbols in this
block as broken script.
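
The classification described above can be sketched roughly as follows.
This is an illustration only, not the code from word-breaker.cc: the
function names are hypothetical, and the exact range bounds
(U+FF66..U+FF9F for halfwidth Katakana, U+FFA0..U+FFDC for halfwidth
Hangul) are assumptions taken from the Unicode code chart rather than
from the patch itself.

```cpp
// Hypothetical helper (not Xapian's actual API): decide whether a
// codepoint should be treated as "unbroken script", i.e. text written
// without spaces between words.
inline bool is_unbroken_halfwidth_form(char32_t ch) {
    // Halfwidth Katakana letters and sound marks (assumed bounds).
    if (ch >= 0xFF66 && ch <= 0xFF9F) return true;
    // Halfwidth Hangul letters (assumed bounds; the range contains a
    // few unassigned codepoints, which are accepted here for brevity).
    if (ch >= 0xFFA0 && ch <= 0xFFDC) return true;
    // Everything else, including fullwidth Latin letters, digits and
    // symbols in U+FF00..U+FFEF, goes through ordinary word breaking.
    return false;
}

// Hypothetical helper: simple lowercase mapping for fullwidth Latin
// capitals, U+FF21..U+FF3A -> U+FF41..U+FF5A. Other codepoints are
// returned unchanged.
inline char32_t lowercase_fullwidth_latin(char32_t ch) {
    if (ch >= 0xFF21 && ch <= 0xFF3A)
        return static_cast<char32_t>(ch + 0x20);
    return ch;
}
```

Under these assumptions, fullwidth "A" (U+FF21) is not classified as
unbroken script and so can be lowercased to U+FF41 like any other Latin
term, while halfwidth Katakana "KA" (U+FF76) is still handled as
unbroken script.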

xapian-core/queryparser/word-breaker.cc
xapian-core/tests/api_queryparser.cc
xapian-core/tests/api_termgen.cc