libstdc++: Add Unicode-aware width estimation for std::format
commit 37a4c5c23a270cd9350ba5d56e526371424b5742
author Jonathan Wakely <jwakely@redhat.com>
Sat, 16 Dec 2023 23:30:20 +0000
committer Jonathan Wakely <jwakely@redhat.com>
Mon, 8 Jan 2024 01:14:50 +0000
tree 81225d49f3fc522327374cf5424feabb39d4ea54
parent 74a0dab18292bef54f316eb086112332befbc6a7

This implements the requirements in the following proposals, which
dictate how std::format deals with non-ASCII strings:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf

There are two parts to this. First, the width estimation for strings must
only count the width of the first character in each extended grapheme
cluster. That requires implementing the algorithm for detecting cluster
breaks, which requires a number of lookup tables of the grapheme cluster
break properties (and the Indic_Conjunct_Break and Extended_Pictographic
properties) of every code point. Second, some characters have a field
width of 2, which requires another lookup table of field widths for every
code point.

The tables added in this commit do not contain entries for every code
point from 0 to 0x10FFFF, as that would be very inefficient and use too
much memory. Instead, the tables only contain the code points that form an
"edge" for a property, omitting all the code points that have the same
property as the preceding one. A binary search then finds the closest code
point in the table that is not greater than the one being looked up, and
that entry's property applies.
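
The edge-table lookup can be sketched roughly as follows. This is a
minimal illustration and not the code added by this commit: the names
WidthEdge, toy_width_edges and toy_field_width are hypothetical, and the
table is a small hand-picked subset of wide ranges rather than the
generated data in <bits/unicode-data.h>.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical edge entry: 'first' is the first code point of a run that
// has field width 'width'. The run extends up to the next entry.
struct WidthEdge
{
  char32_t first;
  int width;
};

// Toy table (assumption: illustrative subset only, sorted by 'first').
// An entry is only present where the width changes.
constexpr WidthEdge toy_width_edges[] = {
  { 0x0000,  1 },
  { 0x1100,  2 }, // Hangul Jamo initial consonants
  { 0x1160,  1 },
  { 0x3040,  2 }, // Hiragana through Yi
  { 0xA4D0,  1 },
  { 0xAC00,  2 }, // Hangul syllables
  { 0xD7A4,  1 },
  { 0x1F300, 2 }, // some emoji blocks
  { 0x1F650, 1 },
  { 0x1F900, 2 },
  { 0x1FA00, 1 },
};

// Return the width of the closest edge that is not greater than c.
int toy_field_width(char32_t c)
{
  // upper_bound yields the first edge strictly greater than c, so the
  // entry just before it is the one we want. The table starts at 0, so
  // there is always a preceding entry.
  auto it = std::upper_bound(std::begin(toy_width_edges),
                             std::end(toy_width_edges), c,
                             [](char32_t value, const WidthEdge& e)
                             { return value < e.first; });
  return std::prev(it)->width;
}
```

With this toy table, toy_field_width(U'a') gives 1 and
toy_field_width(0x1F921) gives 2. Each lookup costs O(log n) in the
number of edges, and the table needs one entry per change of property
rather than one per code point.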

The tables are generated by a new Python script added to the
contrib/unicode directory, and a new data file downloaded from the
Unicode Consortium website.

The rules for extended grapheme cluster breaking are implemented for the
latest Unicode standard, version 15.1.0.
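
To illustrate why only the first code point of a cluster contributes to
the width, here is a heavily simplified sketch. It handles just two of
the Unicode break rules (no break before Extend or ZWJ, and no break
between ZWJ and a pictographic character inside an emoji sequence), and
every toy_* function is a hypothetical stand-in that uses coarse
hard-coded ranges instead of the generated property tables. It should
not be read as the implementation in <bits/unicode.h>.

```cpp
#include <string_view>

// Toy property checks (assumptions: coarse approximations only).
constexpr bool toy_is_extend(char32_t c)
{
  return (c >= 0x0300 && c <= 0x036F)     // combining diacritical marks
      || (c >= 0xFE00 && c <= 0xFE0F)     // variation selectors
      || (c >= 0x1F3FB && c <= 0x1F3FF);  // emoji skin tone modifiers
}

constexpr bool toy_is_zwj(char32_t c)
{ return c == 0x200D; }

constexpr bool toy_is_pictographic(char32_t c)
{ return c >= 0x1F300 && c <= 0x1FAFF; }

constexpr int toy_width(char32_t c)
{
  return ((c >= 0x1100 && c <= 0x115F)
          || (c >= 0x1F300 && c <= 0x1F64F)
          || (c >= 0x1F900 && c <= 0x1F9FF)) ? 2 : 1;
}

// Sum the width of the first code point of each (toy) grapheme cluster.
constexpr int toy_estimated_width(std::u32string_view s)
{
  int total = 0;
  bool first = true;
  bool emoji_seq = false; // cluster so far is: pictographic Extend*
  bool emoji_zwj = false; // cluster so far is: pictographic Extend* ZWJ
  for (char32_t c : s)
    {
      if (!first && toy_is_zwj(c))
        {
          // ZWJ never starts a cluster here; remember if it follows emoji.
          emoji_zwj = emoji_seq;
          emoji_seq = false;
        }
      else if (!first && toy_is_extend(c))
        emoji_zwj = false; // stays in the cluster, emoji_seq unchanged
      else if (!first && emoji_zwj && toy_is_pictographic(c))
        {
          // Joined to the previous emoji by ZWJ: same cluster.
          emoji_seq = true;
          emoji_zwj = false;
        }
      else
        {
          // New cluster: only this first code point contributes width.
          total += toy_width(c);
          emoji_seq = toy_is_pictographic(c);
          emoji_zwj = false;
        }
      first = false;
    }
  return total;
}
```

Under these assumptions, U"e\u0301" (e followed by a combining acute
accent) has an estimated width of 1, and the emoji sequence
U"\U0001F468\u200D\U0001F469\u200D\U0001F467" has an estimated width of
2, because the joined code points do not start new clusters.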

libstdc++-v3/ChangeLog:

	* include/Makefile.am: Add new headers.
	* include/Makefile.in: Regenerate.
	* include/bits/unicode.h: New file.
	* include/bits/unicode-data.h: New file.
	* include/std/format: Include <bits/unicode.h>.
	(__literal_encoding_is_utf8): Move to <bits/unicode.h>.
	(_Spec::_M_fill): Change type to char32_t.
	(_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
	instead of a single character.
	(__write_padded): Change __fill_char parameter to char32_t and
	encode it into the output.
	(__formatter_str::format): Use new __unicode::__field_width and
	__unicode::__truncate functions.
	* include/std/ostream: Adjust namespace qualification for
	__literal_encoding_is_utf8.
	* include/std/print: Likewise.
	* src/c++23/print.cc: Add [[unlikely]] attribute to error path.
	* testsuite/ext/unicode/view.cc: New test.
	* testsuite/std/format/functions/format.cc: Add missing examples
	from the standard demonstrating alignment with non-ASCII
	characters. Add examples checking correct handling of extended
	grapheme clusters.

contrib/ChangeLog:

	* unicode/README: Add notes about generating libstdc++ tables.
	* unicode/GraphemeBreakProperty.txt: New file.
	* unicode/emoji-data.txt: New file.
	* unicode/gen_libstdcxx_unicode_data.py: New file.

14 files changed:
contrib/unicode/GraphemeBreakProperty.txt [new file with mode: 0644]
contrib/unicode/README
contrib/unicode/emoji-data.txt [new file with mode: 0644]
contrib/unicode/gen_libstdcxx_unicode_data.py [new file with mode: 0755]
libstdc++-v3/include/Makefile.am
libstdc++-v3/include/Makefile.in
libstdc++-v3/include/bits/unicode-data.h [new file with mode: 0644]
libstdc++-v3/include/bits/unicode.h [new file with mode: 0644]
libstdc++-v3/include/std/format
libstdc++-v3/include/std/ostream
libstdc++-v3/include/std/print
libstdc++-v3/src/c++23/print.cc
libstdc++-v3/testsuite/ext/unicode/view.cc [new file with mode: 0644]
libstdc++-v3/testsuite/std/format/functions/format.cc