* Normalization implementation notes

Unicode normalization is implemented as String.Normalize(), which
supports all of FormD, FormC, FormKD and FormKC.

FormD and FormKD decompose the input string.
FormC and FormKC decompose it the same way and then compose the result.

Mono's Unicode Normalization methods are implemented in
Mono.Globalization.Unicode.Normalization.

*** Normalization array resources

The Normalization implementation relies heavily on array lookups.
Most of the arrays represent data from the UCD (Unicode Character
Database), which is essential to Unicode Normalization.

By default (in the release), the arrays are defined as C arrays in
normalization-table.h and then loaded via icalls (see the static
constructor).

Alternatively, for debugging purposes, you can switch to managed array
lookup instead. The arrays are then defined in
NormalizationGenerated.cs.

Both the .h and -Generated.cs files can be generated by running
create-normalization-source.exe, which reads the UCD and emits them.

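As a rough illustration of what reading the UCD involves, the following C
sketch parses a single line of UnicodeData.txt and extracts the two fields
most relevant to these notes: the combining class (field 3) and the
decomposition mapping (field 5). It is not the code of
create-normalization-source.exe. The ucd_entry type, the MAX_MAPPING bound
and the parse_unicode_data_line function are hypothetical names used only
for this example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_MAPPING 18  /* assumed upper bound on the mapping length */

/* Hypothetical record type, not taken from create-normalization-source. */
typedef struct {
    uint32_t code_point;       /* field 0, hexadecimal */
    uint8_t  combining_class;  /* field 3, decimal */
    int      is_compat;        /* 1 if field 5 starts with a <tag> */
    uint32_t mapping[MAX_MAPPING];
    size_t   mapping_len;      /* 0 if the character has no mapping */
} ucd_entry;

/* Parses one UnicodeData.txt line; returns 0 on success, -1 if malformed. */
int parse_unicode_data_line(const char *line, ucd_entry *out)
{
    const char *field[6];
    const char *p = line;

    /* Find the start of fields 0..5. Fields may be empty, so strtok()
       (which merges adjacent separators) cannot be used here. */
    for (int i = 0; i < 6; i++) {
        field[i] = p;
        p = strchr(p, ';');
        if (p == NULL)
            return -1;
        p++;
    }
    const char *end5 = p - 1;  /* the ';' that closes field 5 */

    memset(out, 0, sizeof *out);
    out->code_point = (uint32_t) strtoul(field[0], NULL, 16);
    out->combining_class = (uint8_t) strtoul(field[3], NULL, 10);

    const char *m = field[5];
    if (*m == '<') {           /* compatibility tag such as <compat> */
        out->is_compat = 1;
        m = strchr(m, '>');
        if (m == NULL || m > end5)
            return -1;
        m++;
    }

    /* Read the space-separated hexadecimal code points of the mapping. */
    while (m < end5) {
        char *stop;
        unsigned long v = strtoul(m, &stop, 16);
        if (stop == m || out->mapping_len == MAX_MAPPING)
            return -1;
        out->mapping[out->mapping_len++] = (uint32_t) v;
        m = stop;
    }
    return 0;
}
```

A generator along these lines would call the parser once per input line and
then lay the collected mappings out into the arrays described below.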
There are 6 arrays in our implementation. Each is listed below with
its type, name and [size]:

- byte props [char.MaxValue]:
  Stores "properties" for each character, where the "properties"
  are a dedicated set of properties for normalization, as defined
  in "DerivedNormalizationProps.txt".
  It is used for quick check (NF*_QC) etc.

- mappedChars:
  Stores all the normalized strings in the mapping entries, expanded
  as an array of chars. The element at 0 is 0. Each of the strings is
  null-terminated (ends with 0). The entries are sorted first by
  the primary composite (source) char, and second by the normalized
  string.

  For example, if the length of the normalized string of the first
  mapping entry is 2, then [1] holds the first character of that
  string, [2] holds the second character, and [3] holds the
  terminating 0.

- short charMapIndex [char.MaxValue]:
  Stores the indexes to the mapping for each primary composite (source)
  Unicode character. If there is no mapping for the character, then

  Note that mapping information is not directly stored in any of the

  mappedChars:  [A1, A2, B1, C1, C2, D1, D2, D3, E1]
  charMapIndex: [0, 2, 3, 5, 8]

- short helperIndex [char.MaxValue]:
  Stores the index to mappedChars of the first character of the
  first entry of the normalized strings for each character (note
  that it is *not* a map from primary composite but from head of

  If there is no mapping for the character, then 0 is returned.

- ushort mapIdxToComposite [maps.Length]:
  Stores the primary composite (source) character for each mapping,
  where the key is the index to mappedChars.
  It is a "reversed" charMapIndex array (which is char-to-mapidx).

  example: char src = (char) mapIdxToComposite [mapIdx];

- byte combiningClass [char.MaxValue]:
  Stores the UCD CombiningClass value for each Unicode character.
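
The relationship between mappedChars, charMapIndex and mapIdxToComposite
can be sketched in C with a toy table. This is an illustration under stated
assumptions, not Mono's actual data or lookup code: the table holds only
three canonical mappings, charMapIndex covers only U+0000..U+00FF, and
index 0 is assumed to mean "no mapping" (suggested by the element at 0 of
mappedChars being 0). The helper functions lookup_mapping, mapping_length
and composite_for are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy data for illustration only; these are not Mono's real tables. */
static const uint16_t mappedChars[] = {
    0,                  /* [0]: element at 0 is 0 */
    0x0041, 0x0300, 0,  /* [1]: U+00C0 -> A, combining grave */
    0x0041, 0x0301, 0,  /* [4]: U+00C1 -> A, combining acute */
    0x0065, 0x0301, 0   /* [7]: U+00E9 -> e, combining acute */
};
#define MAPS_LENGTH (sizeof mappedChars / sizeof mappedChars[0])

/* Source character -> index into mappedChars.
   Covers U+0000..U+00FF only, to keep the example small. */
static const int16_t charMapIndex[0x100] = {
    [0xC0] = 1, [0xC1] = 4, [0xE9] = 7
};

/* Reverse table: index into mappedChars -> source character. */
static const uint16_t mapIdxToComposite[MAPS_LENGTH] = {
    [1] = 0x00C0, [4] = 0x00C1, [7] = 0x00E9
};

/* Hypothetical helper: returns the zero-terminated normalized string
   for c, or NULL when c has no mapping (assumed to be index 0). */
const uint16_t *lookup_mapping(uint16_t c)
{
    if (c >= 0x100)
        return NULL;  /* outside the toy table */
    int16_t idx = charMapIndex[c];
    return idx == 0 ? NULL : &mappedChars[idx];
}

/* Counts the characters before the terminating 0. */
size_t mapping_length(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: the reverse lookup described for
   mapIdxToComposite; returns 0 for an index with no entry. */
uint16_t composite_for(size_t mapIdx)
{
    return mapIdx < MAPS_LENGTH ? mapIdxToComposite[mapIdx] : 0;
}
```

With this data, lookup_mapping (0x00C1) points at mappedChars [4], and
composite_for (4) leads back to U+00C1, which is the char-to-mapidx and
mapidx-to-char pairing the notes describe.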