README.utf-8

   1 UTF-8 Support for Jim Tcl
   2 =========================
   3
   4 Author: Steve Bennett <steveb@workware.net.au>
   5 Date: 2 Nov 2010 10:55:52 EST
   6
   7 OVERVIEW
   8 --------
   9 Early versions of Jim Tcl supported strings, including binary strings containing
  10 nulls, but had no support for multi-byte character encodings.
  11
  12 In some fields, such as when dealing with the web, or other user-generated content,
  13 support for multi-byte character encodings is necessary.
  14 In these cases it would be very useful for Jim Tcl to be able to process strings
  15 as multi-byte character strings rather than simply binary bytes.
  16
  17 Supporting multiple character encodings and translation between those encodings
  18 is beyond the scope of Jim Tcl. Therefore, Jim has been enhanced to add support
  19 for UTF-8, as the most popular general purpose multi-byte encoding.
  20
  21 UTF-8 support is optional. It can be enabled at compile time with:
  22
  23   ./configure --enable-utf8
  24
  25 The Jim Tcl documentation fully documents the UTF-8 support. This README includes
  26 additional background information.
  27
  28 Unicode vs UTF-8
  29 ----------------
  30 It is important to understand that Unicode is an abstract representation
  31 of the concept of a "character", while UTF-8 is an encoding of
  32 Unicode into bytes.  Thus the Unicode codepoint U+00B5 is encoded
  33 in UTF-8 with the byte sequence: 0xc2, 0xb5. This is different from
  34 ASCII, where the same name is used interchangeably for a character value
  35 and its encoding.
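
The distinction between a codepoint and its byte encoding can be sketched in
C. This is an illustration only, not Jim Tcl's own code: the function name
sketch_utf8_encode is hypothetical, and the sketch does not reject surrogate
codepoints (U+D800 to U+DFFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Jim Tcl: encode one Unicode codepoint
 * (up to U+10FFFF) as UTF-8 into out[], which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is out of range.
 * Assumption: surrogates are not filtered out in this sketch. */
static size_t sketch_utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                 /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {            /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling sketch_utf8_encode(0xB5, buf) returns 2 and stores 0xc2, 0xb5 in buf,
matching the byte sequence given above.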
  36
  37 Unicode Escapes
  38 ---------------
  39 Even without UTF-8 enabled, it is useful to be able to encode Unicode characters
  40 in strings. This can be done with the \uNNNN Unicode escape. This syntax
  41 is compatible with Tcl and is enabled even if UTF-8 is disabled.
  42
  43 Unlike Tcl, Jim Tcl supports Unicode characters up to 21 bits.
  44 In addition to \uNNNN, Jim Tcl also supports variable length Unicode
  45 character specifications with \u{NNNNNN}, where there may be anywhere between
  46 1 and 6 hex digits within the braces, e.g. \u{24B62}
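
As a rough illustration of the braced form, the following C sketch reads the
"{...}" part of such an escape and produces the codepoint. It is not taken
from the Jim Tcl parser: the name sketch_parse_braced_hex is hypothetical, and
the sketch assumes the caller has already consumed the leading \u.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parser, not Jim Tcl's own code. s points just past "\u".
 * Accepts "{H...}" with 1 to 6 hex digits, stores the codepoint in *cp and
 * returns the number of characters consumed (braces included), or 0 if the
 * text is not a valid braced escape. */
static size_t sketch_parse_braced_hex(const char *s, uint32_t *cp)
{
    uint32_t value = 0;
    size_t i = 1;                    /* index of the first digit after '{' */

    assert(s != NULL && cp != NULL);
    if (s[0] != '{') {
        return 0;
    }
    /* Read at most 6 hex digits. */
    while (i <= 6 && isxdigit((unsigned char)s[i])) {
        int c = (unsigned char)s[i];
        int digit = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
        value = (value << 4) | (uint32_t)digit;
        i++;
    }
    /* Need at least one digit, and the closing brace right after the digits. */
    if (i == 1 || s[i] != '}') {
        return 0;
    }
    *cp = value;
    return i + 1;
}
```

With the input "{24B62}" the sketch consumes 7 characters and yields the
codepoint 0x24B62. An empty "{}" or more than 6 digits is rejected.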
  47
  48 UTF-8 Properties
  49 ----------------
  50 Due to the design of the UTF-8 encoding, many (most) commands continue
  51 to work with UTF-8 strings. This is due to the following properties of UTF-8:
  52
  53 * ASCII characters in strings have the same representation in UTF-8
  54 * An ASCII string will never match the middle of a multi-byte UTF-8 sequence
  55 * UTF-8 strings can be sorted as bytes and produce the same result as sorting
  56   by characters
  57 * UTF-8 strings in Jim continue to be null terminated
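
Two of these properties can be sketched with ordinary C string functions. The
helper names below are hypothetical and the code assumes valid, null
terminated UTF-8 input; it illustrates the properties rather than how Jim Tcl
implements any command.

```c
#include <assert.h>
#include <string.h>

/* Returns nonzero if a sorts before b when compared byte by byte.
 * strcmp() compares bytes as unsigned char, so for valid UTF-8 this order
 * is the same as ordering by codepoint. */
static int sketch_bytes_before(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) < 0;
}

/* Returns nonzero if the ASCII string needle occurs in the UTF-8 haystack.
 * A plain byte search is enough: every byte of a multi-byte sequence is
 * >= 0x80, so it can never be mistaken for an ASCII byte. */
static int sketch_contains_ascii(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    return strstr(haystack, needle) != NULL;
}
```

For example, "z" (U+007A) sorts before "\xc2\xb5" (U+00B5), which in turn
sorts before "\xe2\x82\xac" (U+20AC), both by bytes and by codepoint.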
  58
  59 Commands Supporting UTF-8
  60 -------------------------
  61 The following commands have been enhanced to support UTF-8 strings.
  62
  63 * array {get,names,unset}
  64 * case
  65 * glob
  66 * lsearch -glob, -regexp
  67 * switch -glob, -regexp
  68 * regexp, regsub
  69 * format
  70 * scan
  71 * split
  72 * string index, range, length, compare, equal, first, last, map, match, reverse, tolower, toupper
  73 * string bytelength (new)
  74 * info procs, commands, vars, globals, locals
  75
  76 Character Classes
  77 -----------------
  78 Jim Tcl has no support for UTF-8 character classes.  Thus [:alpha:]
  79 will match [a-zA-Z], but not non-ASCII alphabetic characters.  The
  80 same is true for 'string is'.
  81
  82 Regular Expressions
  83 -------------------
  84 Normally, Jim Tcl uses the system-supplied POSIX-compatible regex
  85 implementation.
  86
  87 Systems typically do not provide a UTF-8 capable regex implementation.
  88 Therefore, when UTF-8 support is enabled, the built-in regex
  89 implementation, which includes UTF-8 support, is used instead.
  90
  91 Case Insensitivity
  92 ------------------
  93 Case folding is much more complex under Unicode than under ASCII.
  94 For example it is possible for a character to change the number of
  95 bytes required for representation when converting from one case to
  96 another. Jim Tcl supports only "simple" case folding, where case
  97 is folded only where the number of bytes does not change.
  98
  99 Case folding tables are automatically generated from the official
  100 Unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt
 101
 102 Working with Binary Data and non-UTF-8 encodings
 103 ------------------------------------------------
 104 Almost all Jim commands will work identically with binary data and
 105 UTF-8 encoded data, including read, gets, puts and 'string eq'.  It
 106 is only certain string manipulation commands that behave differently.
 107 For example, 'string index' will return UTF-8 characters, not bytes.
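
The difference between indexing by character and indexing by byte can be
sketched in C as follows. The function sketch_utf8_char_offset is a
hypothetical name and not the routine Jim Tcl uses; it assumes a valid, null
terminated UTF-8 string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Jim Tcl's implementation: return the byte offset
 * at which character number `index` (0-based) starts in the null terminated
 * UTF-8 string s, or -1 if the string has fewer characters than that. */
static long sketch_utf8_char_offset(const char *s, size_t index)
{
    size_t i = 0;

    assert(s != NULL);
    while (s[i] != '\0') {
        /* Bytes of the form 10xxxxxx continue a character; any other byte
         * starts a new one. */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (index == 0) {
                return (long)i;
            }
            index--;
        }
        i++;
    }
    return -1;
}
```

In the string "a", U+00B5, "b" (bytes 0x61 0xc2 0xb5 0x62), character 2 is
"b" and starts at byte offset 3, whereas byte 2 is 0xb5, the second half of
the encoded U+00B5.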
 108
 109 If it is necessary to manipulate strings containing binary, non-ASCII
 110 data (bytes >= 0x80), there are two options.
 111
 112 1. Build Jim without UTF-8 support
  113 2. Use 'string byterange' and 'string bytelength', together with 'pack',
  114    'unpack' and 'binary', to operate on strings as bytes rather than characters.
 115
 116 Internal Details
 117 ----------------
  118 Jim_Utf8Length() will calculate the character length of the string and cache
  119 it for later access. It uses utf8_strlen(), which relies on the string being null
  120 terminated (which it always will be).
 121
  122 It is possible to tell if a string is ASCII-only, because in that case length == bytelength.
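
A minimal sketch of both ideas, assuming valid UTF-8 input. The functions
below are hypothetical stand-ins and not the actual utf8_strlen() or
Jim_Utf8Length() from the Jim Tcl sources. Like the description above, they
rely on null termination, so they stop at the first null byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sketch of a character count (not Jim Tcl's utf8_strlen()): count every
 * byte that is not a continuation byte of the form 10xxxxxx. */
static size_t sketch_utf8_strlen(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* For valid UTF-8, the string is ASCII-only exactly when the character
 * length equals the byte length. */
static int sketch_is_ascii_only(const char *s)
{
    return sketch_utf8_strlen(s) == strlen(s);
}
```

For the bytes 0x61 0xc2 0xb5 0x62 the character length is 3 and the byte
length is 4, so the string is not ASCII-only. For "abc" both lengths are 3.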
 123
  124 It is possible to provide optimised versions of various routines for
  125 the ASCII-only case. Both 'string index' and 'string range' currently
  126 perform such optimisation.