Do not treat fullwidth Latin and symbols as unbroken script
commit bc5bd910851fae3a5213ecab5e635288162c5dd4
author Robert Stepanek <rsto@fastmailteam.com>
Mon, 8 Jan 2024 12:47:32 +0000 (13:47 +0100)
committer Robert Stepanek <rsto@fastmailteam.com>
Mon, 8 Jan 2024 12:47:32 +0000 (13:47 +0100)
tree e1c02be84d84a318367a320d490467fb506c5c7c
parent df2ed66eb61493135e48bf704d7cdfb7df670888

Up until now, all Unicode codepoints in the Halfwidth and
Fullwidth Forms block (U+FF00..U+FFEF) were treated as
unbroken script. As a result, terms consisting of fullwidth
Latin characters in this range were not lowercased before
indexing, so queries failed to find such text.

This patch changes the word breaker to only consider halfwidth
Katakana and Hangul characters as unbroken script, handling
all fullwidth Latin characters, numbers and symbols in this
block as broken script.
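
The classification described above can be sketched as a small range
check. This is an illustration only, not the code from
word-breaker.cc: the helper names are hypothetical, and the exact
sub-ranges (U+FF66..U+FF9F for halfwidth Katakana, U+FFA0..U+FFDC for
halfwidth Hangul) are assumptions taken from the Unicode code chart
rather than from the patch.

```cpp
#include <cstdint>

// Hypothetical helper: true if ch lies in the inclusive range [lo, hi].
constexpr bool in_range(std::uint32_t ch, std::uint32_t lo, std::uint32_t hi) {
    return ch >= lo && ch <= hi;
}

// Hypothetical sketch: within the Halfwidth and Fullwidth Forms block,
// only halfwidth Katakana and halfwidth Hangul count as unbroken script.
// The range boundaries below are assumptions, not taken from the patch.
constexpr bool is_unbroken_halfwidth_form(std::uint32_t ch) {
    return in_range(ch, 0xFF66, 0xFF9F)    // halfwidth Katakana (assumed range)
        || in_range(ch, 0xFFA0, 0xFFDC);   // halfwidth Hangul (assumed range)
}
```

Under these assumptions, fullwidth Latin letters such as U+FF21, fullwidth
digits such as U+FF10 and fullwidth symbols such as U+FFE5 are all
rejected by the check and would therefore go through the normal term
processing, including lowercasing.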
xapian-core/queryparser/word-breaker.cc
xapian-core/tests/api_queryparser.cc
xapian-core/tests/api_termgen.cc