mingw/html/ext/Encode/lib/Encode/Supported.html

   1 <?xml version="1.0" ?>
   2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   3 <html xmlns="http://www.w3.org/1999/xhtml">
   4 <head>
   5 <title>Encode::Supported -- Encodings supported by Encode</title>
   6 <meta http-equiv="content-type" content="text/html; charset=utf-8" />
   7 <link rev="made" href="mailto:" />
   8 </head>
   9
  10 <body style="background-color: white">
  11 <table border="0" width="100%" cellspacing="0" cellpadding="3">
  12 <tr><td class="block" style="background-color: #cccccc" valign="middle">
  13 <big><strong><span class="block">&nbsp;Encode::Supported -- Encodings supported by Encode</span></strong></big>
  14 </td></tr>
  15 </table>
  16
  17 <p><a name="__index__"></a></p>
  18 <!-- INDEX BEGIN -->
  19
  20 <ul>
  21
  22         <li><a href="#name">NAME</a></li>
  23         <li><a href="#description">DESCRIPTION</a></li>
  24         <ul>
  25
  26                 <li><a href="#encoding_names">Encoding Names</a></li>
  27         </ul>
  28
  29         <li><a href="#supported_encodings">Supported Encodings</a></li>
  30         <ul>
  31
  32                 <li><a href="#builtin_encodings">Built-in Encodings</a></li>
  33                 <li><a href="#encode__unicode__other_unicode_encodings">Encode::Unicode - other Unicode encodings</a></li>
  34                 <li><a href="#encode__byte__extended_ascii">Encode::Byte - Extended ASCII</a></li>
  35                 <li><a href="#cjk__chinese__japanese__korean__multibyte_">CJK: Chinese, Japanese, Korean (Multibyte)</a></li>
  36                 <li><a href="#miscellaneous_encodings">Miscellaneous encodings</a></li>
  37         </ul>
  38
  39         <li><a href="#unsupported_encodings">Unsupported encodings</a></li>
  40         <li><a href="#encoding_vs__charset__terminology">Encoding vs. Charset - terminology</a></li>
  41         <li><a href="#encoding_classification__by_anton_tagunov_and_dan_kogai_">Encoding Classification (by Anton Tagunov and Dan Kogai)</a></li>
  42         <ul>
  43
  44                 <li><a href="#microsoftrelated_naming_mess">Microsoft-related naming mess</a></li>
  45         </ul>
  46
  47         <li><a href="#glossary">Glossary</a></li>
  48         <li><a href="#see_also">See Also</a></li>
  49         <li><a href="#references">References</a></li>
  50         <ul>
  51
  52                 <li><a href="#other_notable_sites">Other Notable Sites</a></li>
  53                 <li><a href="#offline_sources">Offline sources</a></li>
  54         </ul>
  55
  56 </ul>
  57 <!-- INDEX END -->
  58
  59 <hr />
  60 <p>
  61 </p>
  62 <h1><a name="name">NAME</a></h1>
  63 <p>Encode::Supported -- Encodings supported by Encode</p>
  64 <p>
  65 </p>
  66 <hr />
  67 <h1><a name="description">DESCRIPTION</a></h1>
  68 <p>
  69 </p>
  70 <h2><a name="encoding_names">Encoding Names</a></h2>
  71 <p>Encoding names are case insensitive. White space in names
  72 is ignored.  In addition, an encoding may have aliases.
  73 Each encoding has one ``canonical'' name.  The ``canonical''
  74 name is chosen from the names of the encoding by picking
  75 the first in the following sequence (with a few exceptions).</p>
  76 <ul>
  77 <li>
  78 <p>The name used by the Perl community.  That includes 'utf8' and 'ascii'.
  79 Unlike aliases, canonical names directly reach the method so such
  80 frequently used words like 'utf8' don't need to do alias lookups.</p>
  81 </li>
  82 <li>
  83 <p>The MIME name as defined in IETF RFCs.  This includes all ``iso-''s.</p>
  84 </li>
  85 <li>
  86 <p>The name in the IANA registry.</p>
  87 </li>
  88 <li>
  89 <p>The name used by the organization that defined it.</p>
  90 </li>
  91 </ul>
  92 <p>When a <em>de jure</em> canonical name differs from the one used by the
  93 Encode module, it is always available as an alias, provided the encoding
  94 is implemented at all.  So you can safely tell whether a given encoding
  95 is implemented just by passing its canonical name.</p>
  96 <p>Because of all the alias issues, and because in the general case
  97 encodings have state, ``Encode'' uses an encoding object internally
  98 once an operation is in progress.</p>
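<p>The lookup described above can be sketched as follows.  This is a minimal
illustration, not part of the original text: it assumes the Encode module is
installed, uses names that appear in this document plus one deliberately bogus
name, and relies on <code>Encode::find_encoding()</code> returning an encoding
object whose <code>name</code> method yields the canonical name.</p>

```perl
use strict;
use warnings;
use Encode ();

# Names taken from this document, plus one deliberately bogus name.
my @names = ("latin1", "ISO-646-US", "UTF-8", "no-such-encoding");

my %canonical;
for my $name (@names) {
    # find_encoding() returns an encoding object, or undef if nothing matches.
    my $enc = Encode::find_encoding($name);
    $canonical{$name} = $enc ? $enc->name : undef;
    printf "%-18s %s\n", $name,
        defined $canonical{$name} ? $canonical{$name} : "(not implemented)";
}
```

<p>The canonical name reported for a given alias may vary between Encode
versions, so treat the printed names as informational.</p>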
  99 <p>
 100 </p>
 101 <hr />
 102 <h1><a name="supported_encodings">Supported Encodings</a></h1>
 103 <p>As of Perl 5.8.0, at least the following encodings are recognized.
 104 Note that unless otherwise specified, they are all case insensitive
 105 (via alias) and all occurrences of spaces are replaced with '-'.
 106 In other words, ``ISO 8859 1'' and ``iso-8859-1'' are identical.</p>
 107 <p>Encodings are categorized and implemented in several different modules
 108 but you don't have to <code>use Encode::XX</code> to make them available for
 109 most cases.  Encode.pm will automatically load those modules on demand.</p>
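<p>As a rough illustration of on-demand loading, the sketch below compares the
encodings that are already loaded with those Encode can load, then uses
<code>euc-jp</code> without an explicit <code>use Encode::JP</code>.  The choice
of <code>euc-jp</code> and of the sample character is arbitrary, and the counts
printed depend on the installed Encode version.</p>

```perl
use strict;
use warnings;
use Encode ();

# Encodings whose implementation is already loaded.
my @loaded = Encode->encodings();

# Every encoding Encode knows how to load on demand.
my @all = Encode->encodings(":all");

printf "loaded now: %d, available on demand: %d\n",
    scalar @loaded, scalar @all;

# Using an encoding is enough to pull in its module.
my $bytes = Encode::encode("euc-jp", "\x{3042}");
my $jp_loaded = grep { $_ eq "euc-jp" } Encode->encodings();
print "euc-jp loaded after first use: ", ($jp_loaded ? "yes" : "no"), "\n";
```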
 110 <p>
 111 </p>
 112 <h2><a name="builtin_encodings">Built-in Encodings</a></h2>
 113 <p>The following encodings are always available.</p>
 114 <pre>
 115   Canonical     Aliases                      Comments &amp; References
 116   ----------------------------------------------------------------
 117   ascii         US-ascii ISO-646-US                         [ECMA]
 118   ascii-ctrl                                      Special Encoding
 119   iso-8859-1    latin1                                       [ISO]
 120   null                                            Special Encoding
 121   utf8          UTF-8                                    [RFC2279]
 122   ----------------------------------------------------------------</pre>
 123 <p><em>null</em> and <em>ascii-ctrl</em> are special.  ``null'' fails for every
 124 character, so when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
 125 CHARACTERS will fall back to character references.  Ditto for
 126 ``ascii-ctrl'' except for control characters.  For fallback modes, see
 127 <a href="file://C|\msysgit\mingw\html/lib/Encode.html">the Encode manpage</a>.</p>
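<p>To make the fallback modes concrete, here is a small sketch that encodes one
non-ASCII character to <code>ascii</code> under each of the three modes named
above.  The sample string and the hash names are illustrative assumptions; the
<code>FB_*</code> constants are the ones documented in the Encode manpage.</p>

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # U+00E9 has no ASCII representation

my %fallback = (
    PERLQQ   => Encode::FB_PERLQQ(),
    HTMLCREF => Encode::FB_HTMLCREF(),
    XMLCREF  => Encode::FB_XMLCREF(),
);

my %result;
for my $mode (sort keys %fallback) {
    # Work on a copy: encode() may modify its input when a check value is given.
    my $copy = $text;
    $result{$mode} = encode("ascii", $copy, $fallback{$mode});
    print "$mode: $result{$mode}\n";
}
```

<p>Substituting ``null'' for ``ascii'' should, going by the description above,
turn every character of the input into a reference rather than only the
non-ASCII one.</p>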
 128 <p>
 129 </p>
 130 <h2><a name="encode__unicode__other_unicode_encodings">Encode::Unicode -- other Unicode encodings</a></h2>
 131 <p>Unicode coding schemes other than native utf8 are supported by
 132 Encode::Unicode, which will be autoloaded on demand.</p>
 133 <pre>
 134   ----------------------------------------------------------------
 135   UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
 136   UCS-2LE                                                     [UC]
 137   UTF-16                                                      [UC]
 138   UTF-16BE                                                    [UC]
 139   UTF-16LE                                                    [UC]
 140   UTF-32                                                      [UC]
 141   UTF-32BE      UCS-4                                         [UC]
 142   UTF-32LE                                                    [UC]
 143   UTF-7                                                  [RFC2152]
 144   ----------------------------------------------------------------</pre>
 145 <p>To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another,
 146 see <a href="file://C|\msysgit\mingw\html/lib/Encode/Unicode.html">the Encode::Unicode manpage</a>.</p>
 147 <p>UTF-7 is a special encoding which ``re-encodes'' UTF-16BE into a 7-bit
 148 encoding.  It is implemented separately by Encode::Unicode::UTF7.</p>
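<p>One quick way to see how these schemes differ is to encode a single
character with each and print the resulting bytes in hex.  The sketch below
does that; the sample character is an arbitrary choice and the hash name is
illustrative only.</p>

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{263A}";    # WHITE SMILING FACE, chosen arbitrarily

my %hex;
for my $name (qw(UCS-2BE UTF-16BE UTF-16LE UTF-16 UTF-32BE UTF-32LE)) {
    # unpack("H*") renders the encoded octets as a hex string.
    $hex{$name} = unpack("H*", encode($name, $char));
    printf "%-9s %s\n", $name, $hex{$name};
}
```

<p>The BE and LE variants differ only in byte order; the handling of the
unmarked names such as UTF-16 is covered in the Encode::Unicode manpage.</p>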
 149 <p>
 150 </p>
 151 <h2><a name="encode__byte__extended_ascii">Encode::Byte -- Extended ASCII</a></h2>
 152 <p>Encode::Byte implements most single-byte encodings except for
 153 Symbols and EBCDIC. The following encodings are based on single-byte
 154 encodings implemented as extended ASCII.  Most of them map
 155 \x80-\xff (upper half) to non-ASCII characters.</p>
 156 <dl>
 157 <dt><strong><a name="item_iso_2d8859_and_corresponding_vendor_mappings">ISO-8859 and corresponding vendor mappings</a></strong>
 158
 159 <dd>
 160 <p>Since there are so many, they are presented in table format with
 161 languages and corresponding encoding names by vendors.  Note that
 162 the table is sorted in order of ISO-8859 and the corresponding vendor
 163 mappings are slightly different from those of ISO.  See
 164 <a href="http://czyborra.com/charsets/iso8859.html">http://czyborra.com/charsets/iso8859.html</a> for details.</p>
 165 </dd>
 166 <dd>
 167 <pre>
 168   Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
 169   ----------------------------------------------------------------
 170   N. America    (ASCII)         cp437        AdobeStandardEncoding
 171                                 cp863 (DOSCanadaF)
 172   W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
 173                                                          hp-roman8
 174                                 cp860 (DOSPortuguese)
 175   Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
 176                                                 MacCroatian
 177                                                 MacRomanian
 178                                                 MacRumanian
 179   Latin3[1]     iso-8859-3
 180   Latin4[2]     iso-8859-4
 181   Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
 182     (See also next section)     cp866           MacUkrainian
 183   Arabic        iso-8859-6      cp864   cp1256  MacArabic
 184                                 cp1006          MacFarsi
 185   Greek         iso-8859-7      cp737   cp1253  MacGreek
 186                                 cp869 (DOSGreek2)
 187   Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
 188   Turkish       iso-8859-9      cp857   cp1254  MacTurkish
 189   Nordics       iso-8859-10     cp865
 190                                 cp861           MacIcelandic
 191                                                 MacSami
 192   Thai          iso-8859-11[3]  cp874           MacThai
 193   (iso-8859-12 is nonexistent. Reserved for Indics?)
 194   Baltics       iso-8859-13     cp775   cp1257
 195   Celtics       iso-8859-14
 196   Latin9 [4]    iso-8859-15
 197   Latin10       iso-8859-16
 198   Vietnamese    viscii                  cp1258  MacVietnamese
 199   ----------------------------------------------------------------</pre>
 200 </dd>
 201 <dd>
 202 <pre>
 203   [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
 204   [2] Baltics.  Now on 8859-10, except for Latvian.
 205   [3] TIS 620 +  Non-Breaking Space (0xA0 / U+00A0)
 206   [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
 207       letters that are missing from 8859-1 were added.</pre>
 208 </dd>
 209 <dd>
 210 <p>All cp* are also available as ibm-*, ms-*, and windows-* .  See also
 211 <a href="http://czyborra.com/charsets/codepages.html">http://czyborra.com/charsets/codepages.html</a>.</p>
 212 </dd>
 213 <dd>
 214 <p>Macintosh encodings don't seem to be registered in such entities as
 215 IANA.  ``Canonical'' names in Encode are based upon Apple's Tech Note
 216 1150.  See <a href="http://developer.apple.com/technotes/tn/tn1150.html">http://developer.apple.com/technotes/tn/tn1150.html</a>
 217 for details.</p>
 218 </dd>
 219 </li>
 220 <dt><strong><a name="item_koi8__2d_de_facto_standard_for_the_cyrillic_world">KOI8 - De Facto Standard for the Cyrillic world</a></strong>
 221
 222 <dd>
 223 <p>Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
 224 popular on the Net.  <a href="file://C|\msysgit\mingw\html/lib/Encode.html">Encode</a> comes with the following KOI charsets.
 225 For gory details, see <a href="http://czyborra.com/charsets/cyrillic.html">http://czyborra.com/charsets/cyrillic.html</a>.</p>
 226 </dd>
 227 <dd>
 228 <pre>
 229   ----------------------------------------------------------------
 230   koi8-f
 231   koi8-r cp878                                           [RFC1489]
 232   koi8-u                                                 [RFC2319]
 233   ----------------------------------------------------------------</pre>
 234 </dd>
 235 </li>
 236 <dt><strong><a name="item_gsm0338__2d_hentai_latin_1">gsm0338 - Hentai Latin 1</a></strong>
 237
 238 <dd>
 239 <p>GSM0338 is for GSM handsets. Though it shares alphanumerals with
 240 ASCII, control character ranges and other parts are mapped very
 241 differently, mainly to store Greek characters.  There are also escape
 242 sequences (starting with 0x1B) to cover e.g. the Euro sign.  Some
 243 special cases like a trailing 0x00 byte or a lone 0x1B byte are not
 244 well-defined and <code>decode()</code> will return an empty string for them.
 245 One possible workaround is</p>
 246 </dd>
 247 <dd>
 248 <pre>
 249    $gsm =~ s/\x00\z/\x00\x00/;
 250    $uni = decode(&quot;gsm0338&quot;, $gsm);
 251    $uni .= &quot;\xA0&quot; if $gsm =~ /\x1B\z/;</pre>
 252 </dd>
 253 <dd>
 254 <p>Note that the Encode implementation of GSM0338 does not implement the
 255 reuse of Latin capital letters as Greek capital letters.  For example,
 256 0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
 257 LETTER ZETA).</p>
 258 </dd>
 259 <dd>
 260 <p>The GSM0338 is also covered in Encode::Byte even though it is not
 261 an ``extended ASCII'' encoding.</p>
 262 </dd>
 263 </li>
 264 </dl>
 265 <p>
 266 </p>
 267 <h2><a name="cjk__chinese__japanese__korean__multibyte_">CJK: Chinese, Japanese, Korean (Multibyte)</a></h2>
 268 <p>Note that Vietnamese is listed above.  Also read ``Encoding vs. Charset''
 269 below.  Also note that these are implemented in distinct modules by
 270 country, due to size concerns (simplified Chinese is mapped
 271 to 'CN', continental China, while traditional Chinese is mapped to
 272 'TW', Taiwan).  Please refer to their respective documentation pages.</p>
 273 <dl>
 274 <dt><strong><a name="item_encode_3a_3acn__2d_2d_continental_china">Encode::CN -- Continental China</a></strong>
 275
 276 <dd>
 277 <pre>
 278   Standard      DOS/Win Macintosh                Comment/Reference
 279   ----------------------------------------------------------------
 280   euc-cn [1]            MacChineseSimp
 281   (gbk)         cp936 [2]
 282   gb12345-raw                      { GB12345 without CES }
 283   gb2312-raw                       { GB2312  without CES }
 284   hz
 285   iso-ir-165
 286   ----------------------------------------------------------------</pre>
 287 </dd>
 288 <dd>
 289 <pre>
 290   [1] GB2312 is aliased to this.  See ``Microsoft-related naming mess''.
 291   [2] gbk is aliased to this.  See ``Microsoft-related naming mess''.</pre>
 292 </dd>
 293 <dt><strong><a name="item_encode_3a_3ajp__2d_2d_japan">Encode::JP -- Japan</a></strong>
 294
 295 <dd>
 296 <pre>
 297   Standard      DOS/Win Macintosh                Comment/Reference
 298   ----------------------------------------------------------------
 299   euc-jp
 300   shiftjis      cp932   macJapanese
 301   7bit-jis
 302   iso-2022-jp                                            [RFC1468]
 303   iso-2022-jp-1                                          [RFC2237]
 304   jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
 305   jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
 306   jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
 307   ----------------------------------------------------------------</pre>
 308 </dd>
 309 <dt><strong><a name="item_encode_3a_3akr__2d_2d_korea">Encode::KR -- Korea</a></strong>
 310
 311 <dd>
 312 <pre>
 313   Standard      DOS/Win Macintosh                Comment/Reference
 314   ----------------------------------------------------------------
 315   euc-kr                MacKorean                        [RFC1557]
 316                 cp949 [1]
 317   iso-2022-kr                                            [RFC1557]
 318   johab                                  [KS X 1001:1998, Annex 3]
 319   ksc5601-raw                              { KSC5601 without CES }
 320   ----------------------------------------------------------------</pre>
 321 </dd>
 322 <dd>
 323 <pre>
 324   [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
 325   See below.</pre>
 326 </dd>
 327 <dt><strong><a name="item_encode_3a_3atw__2d_2d_taiwan">Encode::TW -- Taiwan</a></strong>
 328
 329 <dd>
 330 <pre>
 331   Standard      DOS/Win Macintosh                Comment/Reference
 332   ----------------------------------------------------------------
 333   big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
 334   big5-hkscs
 335   ----------------------------------------------------------------</pre>
 336 </dd>
 337 <dt><strong><a name="item_encode_3a_3ahanextra__2d_2d_more_chinese_via_cpan">Encode::HanExtra -- More Chinese via CPAN</a></strong>
 338
 339 <dd>
 340 <p>Due to size concerns, the additional Chinese encodings below are
 341 distributed separately on CPAN, under the name Encode::HanExtra.</p>
 342 </dd>
 343 <dd>
 344 <pre>
 345   Standard      DOS/Win Macintosh                Comment/Reference
 346   ----------------------------------------------------------------
 347   big5ext                                   CMEX's Big5e Extension
 348   big5plus                                  CMEX's Big5+ Extension
 349   cccii         Chinese Character Code for Information Interchange
 350   euc-tw                                  EUC (Extended Unix Code)
 351   gb18030                          GBK with Traditional Characters
 352   ----------------------------------------------------------------</pre>
 353 </dd>
 354 </li>
 355 <dt><strong><a name="item_encode_3a_3ajis2k__2d_2d_jis_x_0213_encodings_via_">Encode::JIS2K -- JIS X 0213 encodings via CPAN</a></strong>
 356
 357 <dd>
 358 <p>Due to size concerns, additional Japanese encodings below are
 359 distributed separately on CPAN, under the name Encode::JIS2K.</p>
 360 </dd>
 361 <dd>
 362 <pre>
 363   Standard      DOS/Win Macintosh                Comment/Reference
 364   ----------------------------------------------------------------
 365   euc-jisx0213
 366   shiftjisx0213
 367   iso-2022-jp-3
 368   jis0213-1-raw
 369   jis0213-2-raw
 370   ----------------------------------------------------------------</pre>
 371 </dd>
 372 </li>
 373 </dl>
 374 <p>
 375 </p>
 376 <h2><a name="miscellaneous_encodings">Miscellaneous encodings</a></h2>
 377 <dl>
 378 <dt><strong><a name="item_encode_3a_3aebcdic">Encode::EBCDIC</a></strong>
 379
 380 <dd>
 381 <p>See <a href="file://C|\msysgit\mingw\html/pod/perlebcdic.html">the perlebcdic manpage</a> for details.</p>
 382 </dd>
 383 <dd>
 384 <pre>
 385   ----------------------------------------------------------------
 386   cp37
 387   cp500
 388   cp875
 389   cp1026
 390   cp1047
 391   posix-bc
 392   ----------------------------------------------------------------</pre>
 393 </dd>
 394 </li>
 395 <dt><strong><a name="item_encode_3a_3asymbols">Encode::Symbols</a></strong>
 396
 397 <dd>
 398 <p>For symbols and dingbats.</p>
 399 </dd>
 400 <dd>
 401 <pre>
 402   ----------------------------------------------------------------
 403   symbol
 404   dingbats
 405   MacDingbats
 406   AdobeZdingbat
 407   AdobeSymbol
 408   ----------------------------------------------------------------</pre>
 409 </dd>
 410 </li>
 411 <dt><strong><a name="item_encode_3a_3amime_3a_3aheader">Encode::MIME::Header</a></strong>
 412
 413 <dd>
 414 <p>Strictly speaking, the MIME header encoding documented in RFC 2047 is
 415 more of an encapsulation than an encoding.  However, supporting it is
 416 imperative in the modern world, so it is supported.</p>
 417 </dd>
 418 <dd>
 419 <pre>
 420   ----------------------------------------------------------------
 421   MIME-Header                                            [RFC2047]
 422   MIME-B                                                 [RFC2047]
 423   MIME-Q                                                 [RFC2047]
 424   ----------------------------------------------------------------</pre>
 425 </dd>
 426 </li>
 427 <dt><strong><a name="item_encode_3a_3aguess">Encode::Guess</a></strong>
 428
 429 <dd>
 430 <p>This one is not the name of an encoding but a utility that lets you pick
 431 the most appropriate encoding for a piece of data out of given <em>suspects</em>.  See
 432 <a href="file://C|\msysgit\mingw\html/lib/Encode/Guess.html">the Encode::Guess manpage</a> for details.</p>
 433 </dd>
 434 </li>
 435 </dl>
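<p>As an illustration of the Encode::Guess utility mentioned above, the sketch
below builds a short <code>euc-jp</code> byte string and asks
<code>guess_encoding()</code> to identify it.  The sample data and the
single-entry suspect list are assumptions made for the example; with a longer
suspect list the same bytes may be ambiguous, in which case the function
returns an error message instead of an object.</p>

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Sample data: two hiragana letters encoded as euc-jp (an arbitrary choice).
my $data = encode("euc-jp", "\x{3042}\x{3044}");

# The suspect list is an assumption about what the data might be.
my $guess = guess_encoding($data, qw(euc-jp));

# A reference means one encoding matched; a plain string is an error message.
my $verdict = ref($guess) ? $guess->name : "no unique guess: $guess";
print "$verdict\n";
```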
 436 <p>
 437 </p>
 438 <hr />
 439 <h1><a name="unsupported_encodings">Unsupported encodings</a></h1>
 440 <p>The following encodings are not supported as yet; some because they
 441 are rarely used, some because of technical difficulties.  They may
 442 be supported by external modules via CPAN in the future, however.</p>
 443 <dl>
 444 <dt><strong><a name="item_iso_2d2022_2djp_2d2__5brfc1554_5d">ISO-2022-JP-2 [RFC1554]</a></strong>
 445
 446 <dd>
 447 <p>Not very popular yet.  Needs Unicode Database or equivalent to
 448 implement <code>encode()</code> (because it includes JIS X 0208/0212, KSC5601, and
 449 GB2312 simultaneously, whose code points in Unicode overlap.  So you
 450 need to look up the database to determine to which character set a given
 451 Unicode character should belong).</p>
 452 </dd>
 453 </li>
 454 <dt><strong><a name="item_iso_2d2022_2dcn__5brfc1922_5d">ISO-2022-CN [RFC1922]</a></strong>
 455
 456 <dd>
 457 <p>Not very popular.  Needs CNS 11643-1 and -2 which are not available in
 458 this module.  CNS 11643 is supported (via euc-tw) in Encode::HanExtra.
 459 Autrijus Tang may add support for this encoding in his module in the future.</p>
 460 </dd>
 461 </li>
 462 <dt><strong><a name="item_various_hp_2dux_encodings">Various HP-UX encodings</a></strong>
 463
 464 <dd>
 465 <p>The following are unsupported due to the lack of mapping data.</p>
 466 </dd>
 467 <dd>
 468 <pre>
 469   '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
 470   '15' - japanese15, korean15, and roi15</pre>
 471 </dd>
 472 </li>
 473 <dt><strong><a name="item_cyrillic_encoding_iso_2dir_2d111">Cyrillic encoding ISO-IR-111</a></strong>
 474
 475 <dd>
 476 <p>Anton Tagunov doubts its usefulness.</p>
 477 </dd>
 478 </li>
 479 <dt><strong><a name="item_iso_2d8859_2d8_2d1__5bhebrew_5d">ISO-8859-8-1 [Hebrew]</a></strong>
 480
 481 <dd>
 482 <p>Nobody on the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
 483 MacHebrew are supported only because there were mappings
 484 available at <a href="http://www.unicode.org/">http://www.unicode.org/</a>).  Contributions welcome.</p>
 485 </dd>
 486 </li>
 487 <dt><strong><a name="item_isiri_3342_2c_iran_system_2c_isiri_2900__5bfarsi_5">ISIRI 3342, Iran System, ISIRI 2900 [Farsi]</a></strong>
 488
 489 <dd>
 490 <p>Ditto.</p>
 491 </dd>
 492 </li>
 493 <dt><strong><a name="item_thai_encoding_tcvn">Thai encoding TCVN</a></strong>
 494
 495 <dd>
 496 <p>Ditto.</p>
 497 </dd>
 498 </li>
 499 <dt><strong><a name="item_vietnamese_encodings_vps">Vietnamese encodings VPS</a></strong>
 500
 501 <dd>
 502 <p>Though Jungshik Shin has reported that Mozilla supports this encoding,
 503 the report came too late for us to add it before 5.8.0.  In the future, it
 504 may be available via a separate module.  See
 505 <a href="http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf">http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf</a>
 506 and
 507 <a href="http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut">http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut</a>
 508 if you are interested in helping us.</p>
 509 </dd>
 510 </li>
 511 <dt><strong><a name="item_various_mac_encodings">Various Mac encodings</a></strong>
 512
 513 <dd>
 514 <p>The following are unsupported due to the lack of mapping data.</p>
 515 </dd>
 516 <dd>
 517 <pre>
 518   MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
 519   MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
 520   MacLaotian,   MacMalayalam, MacMongolian, MacOriya
 521   MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
 522   MacVietnamese</pre>
 523 </dd>
 524 <dd>
 525 <p>The rest which are already available are based upon the vendor mappings
 526 at <a href="http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/">http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/</a> .</p>
 527 </dd>
 528 </li>
 529 <dt><strong><a name="item__28mac_29_indic_encodings">(Mac) Indic encodings</a></strong>
 530
 531 <dd>
 532 <p>The maps for the following are available at <a href="http://www.unicode.org/">http://www.unicode.org/</a>
 533 but remain unsupported because those encodings need an algorithmic
 534 approach, currently unsupported by <em>enc2xs</em>:</p>
 535 </dd>
 536 <dd>
 537 <pre>
 538   MacDevanagari
 539   MacGurmukhi
 540   MacGujarati</pre>
 541 </dd>
 542 <dd>
 543 <p>For details, please see <code>Unicode mapping issues and notes:</code> at
 544 <a href="http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT">http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT</a> .</p>
 545 </dd>
 546 <dd>
 547 <p>I believe this issue is prevalent not only for Mac Indics but also in
 548 other Indic encodings, but the above were the only Indic encoding
 549 maps that I could find at <a href="http://www.unicode.org/">http://www.unicode.org/</a> .</p>
 550 </dd>
 551 </li>
 552 </dl>
 553 <p>
 554 </p>
 555 <hr />
 556 <h1><a name="encoding_vs__charset__terminology">Encoding vs. Charset -- terminology</a></h1>
 557 <p>We are used to using the term (character) <em>encoding</em> and <em>character
 558 set</em> interchangeably.  But just as confusing the terms byte and
 559 character is dangerous and the terms should be differentiated when
 560 needed, we need to differentiate <em>encoding</em> and <em>character set</em>.</p>
 561 <p>To understand that, here is a description of how we make computers
 562 grok our characters.</p>
 563 <ul>
 564 <li>
 565 <p>First we start with which characters to include.  We call this
 566 collection of characters <em>character repertoire</em>.</p>
 567 </li>
 568 <li>
 569 <p>Then we have to give each character a unique ID so your computer can
 570 tell the difference between 'a' and 'A'.  This itemized character
 571 repertoire is now a <em>character set</em>.</p>
 572 </li>
 573 <li>
 574 <p>If your computer can grok the character set without further
 575 processing, you can go ahead and use it.  This is called a <em>coded
 576 character set</em> (CCS) or <em>raw character encoding</em>.  ASCII is used this
 577 way for most cases.</p>
 578 </li>
 579 <li>
 580 <p>But in many cases, especially multi-byte CJK encodings, you have to
 581 tweak a little more.  Your network connection may not accept any data
 582 with the Most Significant Bit set, and your computer may not be able to
 583 tell if a given byte is a whole character or just half of it.  So you
 584 have to <em>encode</em> the character set to use it.</p>
 585 <p>A <em>character encoding scheme</em> (CES) determines how to encode a given
 586 character set, or a set of multiple character sets.  7bit ISO-2022 is
 587 an example of a CES.  You switch between character sets via <em>escape
 588 sequences</em>.</p>
 589 </li>
 590 </ul>
 591 <p>Technically, or mathematically, speaking, a character set encoded in
 592 such a CES that maps character by character may form a CCS.  EUC is such
 593 an example.  The CES of EUC is as follows:</p>
 594 <ul>
 595 <li>
 596 <p>Map ASCII unchanged.</p>
 597 </li>
 598 <li>
 599 <p>Map a character set that consists of 94 or 96 to the power of N
 600 members by adding 0x80 to each byte.</p>
 601 </li>
 602 <li>
 603 <p>You can also use 0x8e and 0x8f to indicate that the following sequence of
 604 characters belongs to yet another character set.  To each following byte
 605 is added the value 0x80.</p>
 606 </li>
 607 </ul>
 608 <p>By carefully looking at the encoded byte sequence, you can find that the
 609 byte sequence forms a unique number.  In that sense, EUC is a CCS
 610 generated by the CES above from up to four CCSs (complicated?).  UTF-8
 611 falls into this category.  See <em>perlunicode/``UTF-8''</em> to find out how
 612 UTF-8 maps Unicode to a byte sequence.</p>
 613 <p>You may also have found out by now why 7bit ISO-2022 cannot comprise
 614 a CCS.  If you look at a byte sequence \x21\x21, you can't tell if
 615 it is two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1
 616 so you have no trouble differentiating between ``!!'' and ``&nbsp;&nbsp;''.</p>
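<p>The relationship between a raw character set and its EUC form can be
checked with a short sketch.  It assumes Encode::JP is available (it is loaded
on demand) and uses the <code>jis0208-raw</code> and <code>euc-jp</code>
encodings listed earlier; the variable names are illustrative.</p>

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

my $raw = encode("jis0208-raw", $space);    # coded character set, no CES
my $euc = encode("euc-jp",      $space);    # same character under the EUC CES

# Apply the EUC rule by hand: add 0x80 to each byte of the raw code.
my $by_hand = join "", map { chr(ord($_) + 0x80) } split //, $raw;

printf "raw: %s  euc-jp: %s  raw+0x80: %s\n",
    map { unpack("H*", $_) } $raw, $euc, $by_hand;
```

<p>The raw bytes are indistinguishable from two ASCII exclamation marks, which
is the ambiguity described above, while the EUC bytes are not.</p>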
 617 <p>
 618 </p>
 619 <hr />
 620 <h1><a name="encoding_classification__by_anton_tagunov_and_dan_kogai_">Encoding Classification (by Anton Tagunov and Dan Kogai)</a></h1>
 621 <p>This section tries to classify the supported encodings by their
 622 applicability for information exchange over the Internet and to
 623 choose the most suitable aliases to name them in the context of
 624 such communication.</p>
 625 <ul>
 626 <li>
 627 <p>To (en|de)code encodings marked by <code>(**)</code>, you need
 628 <code>Encode::HanExtra</code>, available from CPAN.</p>
 629 </li>
 630 </ul>
 631 <p>Encoding names</p>
 632 <pre>
 633   US-ASCII    UTF-8    ISO-8859-*  KOI8-R
 634   Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
 635   EUC-KR      Big5     GB2312</pre>
 636 <p>are registered with IANA as preferred MIME names and may
 637 be used over the Internet.</p>
 638 <p><a href="#item_shift_jis"><code>Shift_JIS</code></a> has been officialized by JIS X 0208:1997.
 639 <a href="#microsoftrelated_naming_mess">Microsoft-related naming mess</a> gives details.</p>
 640 <p><a href="#item_gb2312"><code>GB2312</code></a> is the IANA name for <code>EUC-CN</code>.
 641 See <a href="#microsoftrelated_naming_mess">Microsoft-related naming mess</a> for details.</p>
 642 <p><code>GB_2312-80</code> <em>raw</em> encoding is available as <code>gb2312-raw</code>
 643 with Encode. See <a href="file://C|\msysgit\mingw\html/lib/Encode/CN.html">the Encode::CN manpage</a> for details.</p>
 644 <pre>
 645   EUC-CN
 646   KOI8-U        [RFC2319]</pre>
 647 <p>have not been registered with IANA (as of March 2002) but
 648 seem to be supported by major web browsers.
 649 The IANA name for <code>EUC-CN</code> is <a href="#item_gb2312"><code>GB2312</code></a>.</p>
 650 <pre>
 651   KS_C_5601-1987</pre>
 652 <p>is heavily misused.
 653 See <a href="#microsoftrelated_naming_mess">Microsoft-related naming mess</a> for details.</p>
 654 <p><a href="#item_ks_c_5601_2d1987"><code>KS_C_5601-1987</code></a> <em>raw</em> encoding is available as <code>ksc5601-raw</code>
 655 with Encode. See <a href="file://C|\msysgit\mingw\html/lib/Encode/KR.html">the Encode::KR manpage</a> for details.</p>
 656 <pre>
 657   UTF-16 UTF-16BE UTF-16LE</pre>
 658 <p>are IANA-registered <a href="#item_charset"><code>charset</code></a>s. See [RFC 2781] for details.
 659 Jungshik Shin reports that UTF-16 with a BOM is well accepted
 660 by MS IE 5/6 and NS 4/6. Beware however that</p>
 661 <ul>
 662 <li>
 663 <p><a href="#item_utf_2d16"><code>UTF-16</code></a> support in any software you're going to be
 664 using/interoperating with has probably been less tested
 665 than <code>UTF-8</code> support</p>
 666 </li>
 667 <li>
 668 <p><code>UTF-8</code> coded data seamlessly passes traditional
 669 command piping (<code>cat</code>, <code>more</code>, etc.) while <a href="#item_utf_2d16"><code>UTF-16</code></a> coded
 670 data is likely to cause confusion (with its zero bytes,
 671 for example)</p>
 672 </li>
 673 <li>
 674 <p>it is beyond the power of words to describe the way HTML browsers
 675 encode non-<code>ASCII</code> form data. To get a general impression, visit
 676 <a href="http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html">http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html</a>.
 677 While encoding of form data has stabilized for <code>UTF-8</code> encoded pages
 678 (at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
 679 expect fun (and cross-browser discrepancies) with <a href="#item_utf_2d16"><code>UTF-16</code></a> encoded
 680 pages!</p>
 681 </li>
 682 </ul>
 683 <p>The rule of thumb is to use <code>UTF-8</code> unless you know what
 684 you're doing and unless you really benefit from using <a href="#item_utf_2d16"><code>UTF-16</code></a>.</p>
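<p>The zero-byte and BOM points above can be checked with a short sketch.
It uses Python's built-in codecs as a stand-in for Encode, which is an
assumption made purely for illustration; the variable names are hypothetical.</p>

```python
# Assumption: Python's built-in codecs stand in for Perl's Encode here.
text = "Perl"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")        # native byte order, BOM prepended
utf16be = text.encode("utf-16-be")   # explicit byte order, no BOM

# ASCII text is byte-for-byte unchanged in UTF-8, so byte-oriented
# tools such as cat and more can pass it through untouched.
utf8_is_ascii = utf8 == b"Perl"

# In UTF-16 every ASCII character carries an extra zero byte.
zero_bytes = utf16be.count(0)

# The generic "utf-16" codec writes a byte order mark first.
has_bom = utf16[:2] in (b"\xff\xfe", b"\xfe\xff")

print(utf8_is_ascii, zero_bytes, has_bom)
```

<p>The four zero bytes in a four-letter ASCII word are what tends to confuse
tools that treat a zero byte as a string terminator.</p>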
 685 <pre>
 686   ISO-IR-165    [RFC1345]
 687   VISCII
 688   GB 12345
 689   GB 18030 (**)  (see links below)
 690   EUC-TW   (**)</pre>
 691 <p>are perfectly valid encodings but are not registered with IANA.
 692 The names under which they are listed here are probably the
 693 most widely known names for these encodings and are the recommended
 694 names.</p>
 695 <pre>
 696   BIG5PLUS (**)</pre>
 697 <p>is a proprietary name.</p>
 698 <p>
 699 </p>
 700 <h2><a name="microsoftrelated_naming_mess">Microsoft-related naming mess</a></h2>
 701 <p>Microsoft products misuse the following names:</p>
 702 <dl>
 703 <dt><strong><a name="item_ks_c_5601_2d1987">KS_C_5601-1987</a></strong>
 704
 705 <dd>
 706 <p>Microsoft extension to <code>EUC-KR</code>.</p>
 707 </dd>
 708 <dd>
 709 <p>Proper names: <code>CP949</code>, <code>UHC</code>, <code>x-windows-949</code> (as used by Mozilla).</p>
 710 </dd>
 711 <dd>
 712 <p>See <a href="http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html">http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html</a>
 713 for details.</p>
 714 </dd>
 715 <dd>
 716 <p>Encode aliases <a href="#item_ks_c_5601_2d1987"><code>KS_C_5601-1987</code></a> to <code>cp949</code> to reflect this common
 717 misuse. The <em>raw</em> <a href="#item_ks_c_5601_2d1987"><code>KS_C_5601-1987</code></a> encoding is available as
 718 <code>ksc5601-raw</code>.</p>
 719 </dd>
 720 <dd>
 721 <p>See <a href="file://C|\msysgit\mingw\html/lib/Encode/KR.html">the Encode::KR manpage</a> for details.</p>
 722 </dd>
 723 </li>
 724 <dt><strong><a name="item_gb2312">GB2312</a></strong>
 725
 726 <dd>
 727 <p>Microsoft extension to <code>EUC-CN</code>.</p>
 728 </dd>
 729 <dd>
 730 <p>Proper names: <code>CP936</code>, <code>GBK</code>.</p>
 731 </dd>
 732 <dd>
 733 <p><a href="#item_gb2312"><code>GB2312</code></a> has been registered in the <code>EUC-CN</code> meaning at
 734 IANA. This has partially repaired the situation: Microsoft's
 735 <a href="#item_gb2312"><code>GB2312</code></a> has become a superset of the official <a href="#item_gb2312"><code>GB2312</code></a>.</p>
 736 </dd>
 737 <dd>
 738 <p>Encode aliases <a href="#item_gb2312"><code>GB2312</code></a> to <code>euc-cn</code> in full agreement with
 739 IANA registration. <code>cp936</code> is supported separately.
 740 <em>Raw</em> <code>GB_2312-80</code> encoding is available as <code>gb2312-raw</code>.</p>
 741 </dd>
 742 <dd>
 743 <p>See <a href="file://C|\msysgit\mingw\html/lib/Encode/CN.html">the Encode::CN manpage</a> for details.</p>
 744 </dd>
 745 </li>
 746 <dt><strong><a name="item_big5">Big5</a></strong>
 747
 748 <dd>
 749 <p>Microsoft extension to <a href="#item_big5"><code>Big5</code></a>.</p>
 750 </dd>
 751 <dd>
 752 <p>Proper name: <code>CP950</code>.</p>
 753 </dd>
 754 <dd>
 755 <p>Encode separately supports <a href="#item_big5"><code>Big5</code></a> and <code>cp950</code>.</p>
 756 </dd>
 757 </li>
 758 <dt><strong><a name="item_shift_jis">Shift_JIS</a></strong>
 759
 760 <dd>
 761 <p>Microsoft's understanding of <a href="#item_shift_jis"><code>Shift_JIS</code></a>.</p>
 762 </dd>
 763 <dd>
 764 <p>JIS has not endorsed the full Microsoft standard, however.
 765 The official <a href="#item_shift_jis"><code>Shift_JIS</code></a> includes only JIS X 0201 and JIS X 0208
 766 character sets, while Microsoft has always used <a href="#item_shift_jis"><code>Shift_JIS</code></a>
 767 to encode a wider character repertoire. See <a href="#item_iana"><code>IANA</code></a> registration for
 768 <code>Windows-31J</code>.</p>
 769 </dd>
 770 <dd>
 771 <p>As the historical predecessor, Microsoft's variant
 772 probably has a stronger claim to the name, though it may be objected
 773 that Microsoft shouldn't have used JIS as part of the name
 774 in the first place.</p>
 775 </dd>
 776 <dd>
 777 <p>Unambiguous name: <code>CP932</code>. <a href="#item_iana"><code>IANA</code></a> name (also used by Mozilla, and
 778 provided as an alias by Encode): <code>Windows-31J</code>.</p>
 779 </dd>
 780 <dd>
 781 <p>Encode separately supports <a href="#item_shift_jis"><code>Shift_JIS</code></a> and <code>cp932</code>.</p>
 782 </dd>
 783 </li>
 784 </dl>
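<p>The practical effect of these naming differences can be sketched by checking
which characters each codec accepts. The example below uses Python's built-in
codecs, not Encode, and the helper name <code>can_encode</code> is hypothetical.
It assumes that Python's <code>gb2312</code> and <code>shift_jis</code> codecs
follow the official character sets, while <code>gbk</code> and <code>cp932</code>
follow the Microsoft supersets.</p>

```python
def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# U+9F8D is a traditional-form ideograph: absent from GB 2312, present in GBK.
dragon = "\u9f8d"
# U+2460 (circled digit one) is a vendor extension, not part of JIS X 0208.
circled_one = "\u2460"

results = {
    "gb2312": can_encode(dragon, "gb2312"),
    "gbk": can_encode(dragon, "gbk"),
    "shift_jis": can_encode(circled_one, "shift_jis"),
    "cp932": can_encode(circled_one, "cp932"),
}
print(results)
```

<p>Data labelled with the official name but produced by a Microsoft product may
therefore contain characters that a strict decoder for the official encoding
rejects.</p>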
 785 <p>
 786 </p>
 787 <hr />
 788 <h1><a name="glossary">Glossary</a></h1>
 789 <dl>
 790 <dt><strong><a name="item_character_repertoire">character repertoire</a></strong>
 791
 792 <dd>
 793 <p>A collection of unique characters.  A <em>character</em> set in the strictest
 794 sense. At this stage, characters are not numbered.</p>
 795 </dd>
 796 </li>
 797 <dt><strong><a name="item_set">coded character set (CCS)</a></strong>
 798
 799 <dd>
 800 <p>A character set that is mapped in a way computers can use directly.
 801 Many character encodings, including EUC, fall in this category.</p>
 802 </dd>
 803 </li>
 804 <dt><strong><a name="item_scheme">character encoding scheme (CES)</a></strong>
 805
 806 <dd>
 807 <p>An algorithm to map a character set to a byte sequence.  You don't
 808 have to be able to tell which character set a given byte sequence
 809 belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is an
 810 example of something that is both a CCS and a CES.</p>
 811 </dd>
 812 </li>
 813 <dt><strong><a name="item_charset">charset (in MIME context)</a></strong>
 814
 815 <dd>
 816 <p>has long been used in the meaning of <code>encoding</code>, CES.</p>
 817 </dd>
 818 <dd>
 819 <p>While the word combination <code>character set</code> has lost this meaning
 820 in MIME context since [RFC 2130], the <a href="#item_charset"><code>charset</code></a> abbreviation has
 821 retained it. This is how [RFC 2277] and [RFC 2278] bless <a href="#item_charset"><code>charset</code></a>:</p>
 822 </dd>
 823 <dd>
 824 <pre>
 825  This document uses the term &quot;charset&quot; to mean a set of rules for
 826  mapping from a sequence of octets to a sequence of characters, such
 827  as the combination of a coded character set and a character encoding
 828  scheme; this is also what is used as an identifier in MIME &quot;charset=&quot;
 829  parameters, and registered in the IANA charset registry ...  (Note
 830  that this is NOT a term used by other standards bodies, such as ISO).
 831  [RFC 2277]</pre>
 832 </dd>
 833 </li>
 834 <dt><strong><a name="item_euc">EUC</a></strong>
 835
 836 <dd>
 837 <p>Extended Unix Code.  See ISO-2022.</p>
 838 </dd>
 839 </li>
 840 <dt><strong><a name="item_iso_2d2022">ISO-2022</a></strong>
 841
 842 <dd>
 843 <p>A CES that was carefully designed to coexist with ASCII.  There is a
 844 7-bit version and an 8-bit version.</p>
 845 </dd>
 846 <dd>
 847 <p>The 7-bit version switches character sets via escape sequences, so it
 848 cannot form a CCS.  Since this is more difficult to handle in programs
 849 than the 8-bit version, the 7-bit version is not very popular except for
 850 iso-2022-jp, the <em>de facto</em> standard CES for e-mails.</p>
 851 </dd>
 852 <dd>
 853 <p>The 8-bit version can form a CCS.  EUC and ISO-8859 are two examples
 854 thereof.  Pre-5.6 perl could use them as string literals.</p>
 855 </dd>
 856 </li>
 857 <dt><strong><a name="item_ucs">UCS</a></strong>
 858
 859 <dd>
 860 <p>Short for <em>Universal Character Set</em>.  When you say just UCS, it means
 861 <em>Unicode</em>.</p>
 862 </dd>
 863 </li>
 864 <dt><strong><a name="item_ucs_2d2">UCS-2</a></strong>
 865
 866 <dd>
 867 <p>ISO/IEC 10646 encoding form: Universal Character Set coded in two
 868 octets.</p>
 869 </dd>
 870 </li>
 871 <dt><strong><a name="item_unicode">Unicode</a></strong>
 872
 873 <dd>
 874 <p>A character set that aims to include all character repertoires of the
 875 world.  Many character sets in various national as well as industrial
 876 standards have become, in a way, just subsets of Unicode.</p>
 877 </dd>
 878 </li>
 879 <dt><strong><a name="item_utf">UTF</a></strong>
 880
 881 <dd>
 882 <p>Short for <em>Unicode Transformation Format</em>.  Determines how to map a
 883 Unicode character into a byte sequence.</p>
 884 </dd>
 885 </li>
 886 <dt><strong><a name="item_utf_2d16">UTF-16</a></strong>
 887
 888 <dd>
 889 <p>A UTF that uses 16-bit code units.  It can be either big endian or little
 890 endian.  The big endian version is called UTF-16BE (equal to UCS-2 +
 891 surrogate support) and the little endian version is called UTF-16LE.</p>
 892 </dd>
 893 </li>
 894 </dl>
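<p>The relationship between UCS-2, surrogates, and the two byte orders described
in the UTF-16 entry can be illustrated with a small sketch. It is written with
Python's built-in codecs, not Encode, and the function name
<code>surrogate_pair</code> is hypothetical.</p>

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + offset // 0x400
    low = 0xDC00 + offset % 0x400
    return high, low

clef = 0x1D11E  # MUSICAL SYMBOL G CLEF, outside the 16-bit range
high, low = surrogate_pair(clef)

big_endian = chr(clef).encode("utf-16-be")
little_endian = chr(clef).encode("utf-16-le")

# The same two 16-bit units appear in both forms; only the byte order differs.
expected_be = high.to_bytes(2, "big") + low.to_bytes(2, "big")
expected_le = high.to_bytes(2, "little") + low.to_bytes(2, "little")
print(big_endian == expected_be, little_endian == expected_le)
```

<p>A character inside the 16-bit range needs no surrogates, which is why
UTF-16BE and UCS-2 coincide for such characters.</p>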
 895 <p>
 896 </p>
 897 <hr />
 898 <h1><a name="see_also">See Also</a></h1>
 899 <p><a href="file://C|\msysgit\mingw\html/lib/Encode.html">the Encode manpage</a>,
 900 <a href="file://C|\msysgit\mingw\html/lib/Encode/Byte.html">the Encode::Byte manpage</a>,
 901 <a href="file://C|\msysgit\mingw\html/lib/Encode/CN.html">the Encode::CN manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/Encode/JP.html">the Encode::JP manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/Encode/KR.html">the Encode::KR manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/Encode/TW.html">the Encode::TW manpage</a>,
 902 <a href="file://C|\msysgit\mingw\html/lib/Encode/EBCDIC.html">the Encode::EBCDIC manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/Encode/Symbol.html">the Encode::Symbol manpage</a>,
 903 <a href="file://C|\msysgit\mingw\html/lib/Encode/MIME/Header.html">the Encode::MIME::Header manpage</a>, <a href="file://C|\msysgit\mingw\html/lib/Encode/Guess.html">the Encode::Guess manpage</a></p>
 904 <p>
 905 </p>
 906 <hr />
 907 <h1><a name="references">References</a></h1>
 908 <dl>
 909 <dt><strong><a name="item_ecma">ECMA</a></strong>
 910
 911 <dd>
 912 <p>European Computer Manufacturers Association
 913 <a href="http://www.ecma.ch">http://www.ecma.ch</a></p>
 914 </dd>
 915 <dl>
 916 <dt><strong><a name="item_035">ECMA-035 (eq <a href="#item_iso_2d2022"><code>ISO-2022</code></a>)</a></strong>
 917
 918 <dd>
 919 <p><a href="http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM">http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM</a></p>
 920 </dd>
 921 <dd>
 922 <p>The specification of ISO-2022 is available from the link above.</p>
 923 </dd>
 924 </li>
 925 </dl>
 926 <dt><strong><a name="item_iana">IANA</a></strong>
 927
 928 <dd>
 929 <p>Internet Assigned Numbers Authority
 930 <a href="http://www.iana.org/">http://www.iana.org/</a></p>
 931 </dd>
 932 <dl>
 933 <dt><strong><a name="item_assigned_charset_names_by_iana">Assigned Charset Names by IANA</a></strong>
 934
 935 <dd>
 936 <p><a href="http://www.iana.org/assignments/character-sets">http://www.iana.org/assignments/character-sets</a></p>
 937 </dd>
 938 <dd>
 939 <p>Most of the <code>canonical names</code> in Encode derive from this list,
 940 so you can directly apply the string you have extracted from the MIME
 941 headers of mail messages and web pages.</p>
 942 </dd>
 943 </li>
 944 </dl>
 945 <dt><strong><a name="item_iso">ISO</a></strong>
 946
 947 <dd>
 948 <p>International Organization for Standardization
 949 <a href="http://www.iso.ch/">http://www.iso.ch/</a></p>
 950 </dd>
 951 </li>
 952 <dt><strong><a name="item_rfc">RFC</a></strong>
 953
 954 <dd>
 955 <p>Request For Comments -- need I say more?
 956 <a href="http://www.rfc-editor.org/">http://www.rfc-editor.org/</a>, <a href="http://www.rfc.net/">http://www.rfc.net/</a>,
 957 <a href="http://www.faqs.org/rfcs/">http://www.faqs.org/rfcs/</a></p>
 958 </dd>
 959 </li>
 960 <dt><strong><a name="item_uc">UC</a></strong>
 961
 962 <dd>
 963 <p>Unicode Consortium
 964 <a href="http://www.unicode.org/">http://www.unicode.org/</a></p>
 965 </dd>
 966 <dl>
 967 <dt><strong><a name="item_unicode_glossary">Unicode Glossary</a></strong>
 968
 969 <dd>
 970 <p><a href="http://www.unicode.org/glossary/">http://www.unicode.org/glossary/</a></p>
 971 </dd>
 972 <dd>
 973 <p>The glossary of this document is based upon this site.</p>
 974 </dd>
 975 </li>
 976 </dl>
 977 </dl>
 978 <p>
 979 </p>
 980 <h2><a name="other_notable_sites">Other Notable Sites</a></h2>
 981 <dl>
 982 <dt><strong><a name="item_czyborra_2ecom">czyborra.com</a></strong>
 983
 984 <dd>
 985 <p><a href="http://czyborra.com/">http://czyborra.com/</a></p>
 986 </dd>
 987 <dd>
 988 <p>Contains a lot of useful information, especially gory details of ISO
 989 vs. vendor mappings.</p>
 990 </dd>
 991 </li>
 992 <dt><strong><a name="item_cjk_2einf">CJK.inf</a></strong>
 993
 994 <dd>
 995 <p><a href="http://www.oreilly.com/people/authors/lunde/cjk_inf.html">http://www.oreilly.com/people/authors/lunde/cjk_inf.html</a></p>
 996 </dd>
 997 <dd>
 998 <p>Somewhat obsolete (last update in 1996), but still useful.  Also try</p>
 999 </dd>
1000 <dd>
1001 <p><a href="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf">ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf</a></p>
1002 </dd>
1003 <dd>
1004 <p>There you will find brief information on <code>EUC-CN</code> and <code>GBK</code>, and mostly on <code>GB 18030</code>.</p>
1005 </dd>
1006 </li>
1007 <dt><strong><a name="item_jungshik_shin_27s_hangul_faq">Jungshik Shin's Hangul FAQ</a></strong>
1008
1009 <dd>
1010 <p><a href="http://jshin.net/faq">http://jshin.net/faq</a></p>
1011 </dd>
1012 <dd>
1013 <p>And especially its subject 8.</p>
1014 </dd>
1015 <dd>
1016 <p><a href="http://jshin.net/faq/qa8.html">http://jshin.net/faq/qa8.html</a></p>
1017 </dd>
1018 <dd>
1019 <p>A comprehensive overview of the Korean (<code>KS *</code>) standards.</p>
1020 </dd>
1021 </li>
1022 <dt><strong><a name="item_debian_2eorg_3a__22introduction_to_i18n_22">debian.org: ``Introduction to i18n''</a></strong>
1023
1024 <dd>
1025 <p>A brief description for most of the mentioned CJK encodings is
1026 contained in
1027 <a href="http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html">http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html</a></p>
1028 </dd>
1029 </li>
1030 </dl>
1031 <p>
1032 </p>
1033 <h2><a name="offline_sources">Offline sources</a></h2>
1034 <dl>
1035 <dt><strong><a name="item_cjkv_information_processing_by_ken_lunde"><code>CJKV Information Processing</code> by Ken Lunde</a></strong>
1036
1037 <dd>
1038 <p>CJKV Information Processing
1039 1999 O'Reilly &amp; Associates, ISBN : 1-56592-224-7</p>
1040 </dd>
1041 <dd>
1042 <p>The modern successor of <a href="#item_cjk_2einf"><code>CJK.inf</code></a>.</p>
1043 </dd>
1044 <dd>
1045 <p>Features comprehensive coverage of CJKV character sets and
1046 encodings along with many other issues faced by anyone trying
1047 to better support CJKV languages/scripts in all the areas of
1048 information processing.</p>
1049 </dd>
1050 <dd>
1051 <p>To purchase this book, visit
1052 <a href="http://www.oreilly.com/catalog/cjkvinfo/">http://www.oreilly.com/catalog/cjkvinfo/</a>
1053 or your favourite bookstore.</p>
1054 </dd>
1055 </li>
1056 </dl>
1062
1063 </body>
1064
1065 </html>