From c49c48afe09a1a78989628bbffd49dd3efc154dd Mon Sep 17 00:00:00 2001
From: Douglas Bagnall
Date: Sat, 20 Apr 2024 09:57:15 +1200
Subject: [PATCH] ldb:utf8: ldb_ascii_toupper() avoids real toupper()
MIME-Version: 1.0
Content-Type: text/plain; charset=utf8
Content-Transfer-Encoding: 8bit

If a non-lowercase ASCII character has an uppercase counterpart in
some locale, toupper() will convert it to an int codepoint. That
codepoint is probably too big to fit in our char return type, so we
would truncate it to 8 bits, turning it into an arbitrary mapping.

It would also behave strangely with a byte with the top bit set, say
0xE2. If char is unsigned on this system, that is 'â', which
uppercases to 'Â', with the codepoint 0xC2. That seems fine in
isolation, but remember this is ldb_utf8.c, and that byte was not a
codepoint but a piece of a multi-byte UTF-8 encoding.

In the more likely case where char is signed, toupper() is being
passed a negative number, for which the result is undefined.

Signed-off-by: Douglas Bagnall
Reviewed-by: Andrew Bartlett

Autobuild-User(master): Andrew Bartlett
Autobuild-Date(master): Tue Apr 23 02:37:25 UTC 2024 on atb-devel-224
---
 lib/ldb/common/ldb_utf8.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/lib/ldb/common/ldb_utf8.c b/lib/ldb/common/ldb_utf8.c
index f45b72dde50..178bdd86de1 100644
--- a/lib/ldb/common/ldb_utf8.c
+++ b/lib/ldb/common/ldb_utf8.c
@@ -136,5 +136,15 @@ int ldb_attr_dn(const char *attr)
 }
 
 _PRIVATE_ char ldb_ascii_toupper(char c) {
-	return ('a' <= c && c <= 'z') ? c ^ 0x20 : toupper(c);
+	/*
+	 * We are aiming for a 1970s C-locale toupper(), when all letters
+	 * were 7-bit and behaved with true American spirit.
+	 *
+	 * For example, we don't want the "i" in "
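
The hazards the commit message describes can be demonstrated with a
short standalone C program. This is an illustrative sketch, not part
of the patch: ascii_toupper() below is a local copy written in the
spirit of the patched ldb_ascii_toupper(), and what the locale-aware
toupper() returns for non-ASCII bytes depends on the platform and
locale.

#include <ctype.h>
#include <stdio.h>

/* ASCII-only uppercasing in the spirit of the patched function:
 * flip bit 0x20 for 'a'..'z' and leave every other byte alone. */
static char ascii_toupper(char c)
{
	return ('a' <= c && c <= 'z') ? c ^ 0x20 : c;
}

int main(void)
{
	char c = (char)0xE2;	/* a UTF-8 lead byte, not a codepoint */

	/*
	 * Where char is signed, c holds a negative value, and passing
	 * it directly to toupper() is undefined behaviour: the
	 * argument must be representable as an unsigned char, or be
	 * EOF.  The cast makes the call well-defined.
	 */
	printf("toupper((unsigned char)0xE2) = 0x%x\n",
	       (unsigned)toupper((unsigned char)c));

	/*
	 * toupper() returns an int; if some locale maps a byte to a
	 * codepoint above 0xFF, storing the result in a char would
	 * truncate it to 8 bits, the arbitrary mapping the commit
	 * message describes.  The ASCII-only version sidesteps all of
	 * this by passing non-ASCII bytes through unchanged:
	 */
	printf("ascii_toupper('a')  = '%c'\n", ascii_toupper('a'));
	printf("ascii_toupper(0xE2) = 0x%x\n",
	       (unsigned char)ascii_toupper(c));
	return 0;
}

Run in the C locale, this prints 0xe2 for both calls that take the
0xE2 byte, and 'A' for 'a': non-ASCII bytes pass through untouched,
which is what you want when those bytes are fragments of UTF-8
sequences rather than codepoints.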