UTF-8 Support for Jim Tcl
=========================

Author: Steve Bennett <steveb@workware.net.au>
Date: 2 Nov 2010 10:55:52 EST

Traditionally Jim Tcl has supported strings, including binary strings containing
nulls, but it has had no support for multi-byte character encodings.

In some fields, such as when dealing with the web or other user-generated content,
support for multi-byte character encodings is necessary.
In these cases it would be very useful for Jim Tcl to be able to process strings
as multi-byte character strings rather than simply as binary bytes.

Supporting multiple character encodings and translation between those encodings
is beyond the scope of Jim Tcl. Therefore, Jim has been enhanced to add support
for UTF-8, as probably the most popular general purpose multi-byte encoding.

UTF-8 support is optional. It can be enabled at compile time with:

    ./configure --enable-utf8

The Jim Tcl documentation fully documents the UTF-8 support. This README includes
additional background information.

It is important to understand that Unicode is an abstract representation
of the concept of a "character", while UTF-8 is an encoding of
Unicode into bytes. Thus the Unicode codepoint U+00B5 is encoded
in UTF-8 with the byte sequence: 0xc2, 0xb5. This is different from
ASCII, where the same name is used interchangeably for both the character
set and its encoding.
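
The mapping from codepoint to bytes can be sketched in C as follows. This
is only an illustration of the UTF-8 encoding rules for codepoints up to
U+FFFF; the function name sketch_utf8_encode is hypothetical and is not
part of the Jim Tcl API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Jim Tcl API.
 * Encodes a codepoint in the range U+0000..U+FFFF as UTF-8.
 * 'out' must have room for 3 bytes.
 * Returns the number of bytes written, or 0 if cp is out of range. */
size_t sketch_utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        /* 0xxxxxxx: ASCII is encoded as itself */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

Calling sketch_utf8_encode(0x00B5, buf) writes the two bytes 0xc2, 0xb5
given above.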

Even without UTF-8 enabled, it is useful to be able to encode UTF-8 characters
in strings. This can be done with the \uNNNN Unicode escape. This syntax
is compatible with Tcl and is enabled even if UTF-8 is disabled.

Like Tcl, currently only 16-bit Unicode characters can be encoded.

Due to the design of the UTF-8 encoding, most commands continue
to work unchanged with UTF-8 strings. This is due to the following
properties of UTF-8:

* ASCII characters in strings have the same representation in UTF-8
* An ASCII string will never match the middle of a multi-byte UTF-8 sequence
* UTF-8 strings can be sorted as bytes and produce the same result as sorting
  by Unicode codepoint
* UTF-8 strings in Jim continue to be null terminated
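
These properties mean that plain byte-oriented C routines keep working on
UTF-8 data. The sketch below illustrates the second and third properties
with the standard strstr() and strcmp() functions. The wrapper names
sketch_find and sketch_order are hypothetical and exist only for this
example.

```c
#include <string.h>

/* Hypothetical helper, not part of the Jim Tcl API.
 * Byte-wise search for an ASCII needle in a UTF-8 haystack.
 * Every byte of a multi-byte UTF-8 sequence is >= 0x80, so an ASCII
 * needle can only match at a real character boundary.
 * Returns the byte offset of the match, or -1 if there is none. */
long sketch_find(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1;
}

/* Hypothetical helper, not part of the Jim Tcl API.
 * Byte-wise comparison. strcmp() compares bytes as unsigned char, and
 * for valid UTF-8 that ordering matches the ordering by codepoint. */
int sketch_order(const char *a, const char *b)
{
    return strcmp(a, b);
}
```

For example, searching "\xc2\xb5m" (U+00B5 followed by "m") for "m" finds
it at byte offset 2, while searching for "5" finds nothing even though the
second byte is 0xb5. Likewise "z" (U+007A) sorts before "\xc2\xb5" (U+00B5).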

Commands Supporting UTF-8
-------------------------
The following commands have been enhanced to support UTF-8 strings.

* array {get,names,unset}
* lsearch -glob, -regexp
* switch -glob, -regexp
* string index, range, length, compare, equal, first, last, map, match, reverse, tolower, toupper
* string bytelength (new)
* info procs, commands, vars, globals, locals

Jim Tcl has no support for UTF-8 character classes. Thus [:alpha:]
will match [a-zA-Z], but not non-ASCII alphabetic characters. The
same is true for 'string is'.

Normally, Jim Tcl uses the system-supplied POSIX-compatible regex
implementation.

Typically, systems do not provide a UTF-8 capable regex implementation.
Therefore, when UTF-8 support is enabled, the built-in regex
implementation, which includes UTF-8 support, is used instead.

Case folding is much more complex under Unicode than under ASCII.
For example it is possible for a character to change the number of
bytes required for representation when converting from one case to
another. Jim Tcl supports only "simple" case folding, where case
is folded only where the number of bytes does not change.
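
The change in byte count can be checked with a small C sketch. The
function names below are hypothetical and this is not how Jim Tcl's case
folding is implemented; it only computes how many UTF-8 bytes a codepoint
needs and whether a given case mapping preserves that width.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the Jim Tcl API.
 * Number of UTF-8 bytes needed to encode a codepoint. */
int sketch_utf8_width(uint32_t cp)
{
    if (cp < 0x80) {
        return 1;
    }
    if (cp < 0x800) {
        return 2;
    }
    if (cp < 0x10000) {
        return 3;
    }
    return 4;
}

/* Hypothetical helper, not part of the Jim Tcl API.
 * Returns 1 if mapping 'from' to 'to' keeps the same encoded width. */
int sketch_same_width(uint32_t from, uint32_t to)
{
    return sketch_utf8_width(from) == sketch_utf8_width(to);
}
```

For example, U+00E9 and its uppercase form U+00C9 both need two bytes, so
sketch_same_width(0x00E9, 0x00C9) returns 1. The Kelvin sign U+212A needs
three bytes while its lowercase mapping U+006B ("k") needs one, and the
dotless i U+0131 needs two bytes while its uppercase mapping U+0049 ("I")
needs one, so both of those pairs return 0.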

Case folding tables are automatically generated from the official
Unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt

Working with Binary Data and non-UTF-8 encodings
------------------------------------------------
Almost all Jim commands will work identically with binary data and
UTF-8 encoded data, including read, gets, puts and 'string eq'. Only
certain string manipulation commands operate differently. For example,
'string index' will return UTF-8 characters, not bytes.

If it is necessary to manipulate strings containing binary, non-ASCII
data (bytes >= 0x80), there are two options:

1. Build Jim without UTF-8 support
2. Arrange to encode and decode binary data or data in other encodings
   to UTF-8 before manipulation.
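
One possible way to carry out the second option is sketched below in C.
It is an assumption for illustration, not something Jim Tcl provides: each
raw byte is treated as the codepoint U+0000..U+00FF and encoded as UTF-8,
and the reverse function undoes that mapping. The names
sketch_bytes_to_utf8 and sketch_utf8_to_bytes are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Jim Tcl API.
 * Encodes n raw bytes as UTF-8, treating each byte as a codepoint in
 * U+0000..U+00FF. 'out' must have room for 2 * n bytes.
 * Returns the number of bytes written. */
size_t sketch_bytes_to_utf8(const unsigned char *in, size_t n,
                            unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];
        } else {
            out[o++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[o++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return o;
}

/* Hypothetical helper, not part of the Jim Tcl API.
 * Reverses sketch_bytes_to_utf8(). Assumes the input was produced by
 * that function. 'out' must have room for n bytes.
 * Returns the number of bytes written. */
size_t sketch_utf8_to_bytes(const unsigned char *in, size_t n,
                            unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[o++] = in[i];
        } else if (i + 1 < n) {
            /* Two-byte sequence: recombine the 2 + 6 payload bits */
            out[o++] = (unsigned char)(((in[i] & 0x03) << 6)
                                       | (in[i + 1] & 0x3F));
            i++;
        }
    }
    return o;
}
```

Lengths are passed explicitly rather than relying on null termination,
since binary data may contain null bytes.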

Jim_Utf8Length() will calculate the character length of the string and cache
it for later access. It uses utf8_strlen(), which relies on the string being
null terminated (which it always will be).

It is possible to tell if a string is ASCII-only because in that case its
character length equals its byte length (length == bytelength).
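
The idea can be sketched in C as follows. This is not the actual source of
Jim_Utf8Length() or utf8_strlen(); the names sketch_utf8_charlen and
sketch_is_ascii_only are hypothetical, and the code assumes valid,
null-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Jim Tcl's utf8_strlen().
 * Counts characters in a null-terminated, valid UTF-8 string by
 * counting every byte that is not a continuation byte (10xxxxxx). */
size_t sketch_utf8_charlen(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Hypothetical helper, not part of the Jim Tcl API.
 * A valid UTF-8 string is ASCII-only exactly when its character
 * length equals its byte length. */
int sketch_is_ascii_only(const char *s)
{
    return sketch_utf8_charlen(s) == strlen(s);
}
```

For "\xc2\xb5m" the character length is 2 and the byte length is 3, so the
string is not ASCII-only. For "abc" both lengths are 3.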

It is possible to provide optimised versions of various routines for
the ASCII-only case. Currently this is done only for 'string index' and 'string range'.
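
A minimal sketch of such an optimisation for character indexing is shown
below. It is not the Jim Tcl implementation; the function name
sketch_utf8_index and its parameters are hypothetical. It assumes valid,
null-terminated UTF-8 and that the caller already knows both the byte
length and the (cached) character length.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Jim Tcl API.
 * Returns a pointer to the first byte of the character at character
 * index 'idx', or NULL if idx is out of range. */
const char *sketch_utf8_index(const char *s, size_t bytelen,
                              size_t charlen, size_t idx)
{
    if (idx >= charlen) {
        return NULL;
    }
    if (charlen == bytelen) {
        /* ASCII-only fast path: character index equals byte offset */
        return s + idx;
    }
    /* General case: step over one character at a time */
    const char *p = s;
    while (idx > 0) {
        p++;
        /* Skip the continuation bytes (10xxxxxx) of this character */
        while (((unsigned char)*p & 0xC0) == 0x80) {
            p++;
        }
        idx--;
    }
    return p;
}
```

The fast path is a constant-time pointer addition, while the general case
has to scan from the start of the string.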