docs/Samba-HOWTO-Collection/Unicode.xml

   1 <?xml version="1.0" encoding="iso-8859-1"?>
   2 <!DOCTYPE book PUBLIC "-//Samba-Team//DTD DocBook V4.2-Based Variant V1.0//EN" "http://www.samba.org/samba/DTD/samba-doc">
   3 <chapter id="unicode">
   4 <chapterinfo>
   5         &author.jelmer;
   6         &author.jht;
   7         <author>
   8                 <firstname>TAKAHASHI</firstname><surname>Motonobu</surname>
   9                 <affiliation>
  10                 <address><email>monyo@home.monyo.com</email></address>
  11                 </affiliation>
  12                 <contrib>Japanese character support</contrib>
  13         </author>
  14         <pubdate>25 March 2003</pubdate>
  15 </chapterinfo>
  16
  17 <title>Unicode/Charsets</title>
  18
  19 <sect1>
  20 <title>Features and Benefits</title>
  21
  22 <para>
  23 Every industry eventually matures. One of the great areas of maturation is in
  24 the focus that has been given over the past decade to make it possible for anyone
  25 anywhere to use a computer. It has not always been that way; in fact, not so long
  26 ago it was common for software to be written for exclusive use in the country of
  27 origin.
  28 </para>
  29
  30 <para>
  31 Of all the effort that has been brought to bear on providing native
  32 language support for all computer users, the efforts of the
  33 <ulink url="http://www.openi18n.org/">Openi18n organization</ulink>
  34 are deserving of special mention.
  35 </para>
  36
  37 <para>
  38 Samba-2.x supported a single locale through a mechanism called
  39 <emphasis>codepages</emphasis>. Samba-3 is destined to become a truly trans-global
  40 file and printer-sharing platform.
  41 </para>
  42
  43 </sect1>
  44
  45 <sect1>
  46 <title>What Are Charsets and Unicode?</title>
  47
  48 <para>
  49 Computers communicate in numbers. In texts, each number will be
  50 translated to a corresponding letter. The meaning that will be assigned
  51 to a certain number depends on the <emphasis>character set (charset)
  52 </emphasis> that is used.
  53 </para>
  54
  55 <para>
  56 A charset can be seen as a table that is used to translate numbers to
  57 letters. Not all computers use the same charset (there are charsets
  58 with German umlauts, Japanese characters, and so on). The American Standard Code
  59 for Information Interchange (ASCII) encoding system has been the normative character
  60 encoding scheme used by computers to date. ASCII itself defines 128 characters, and the single-byte
  61 charsets that extend it contain up to 256. With these encodings, each character takes exactly one byte.
  62 </para>
  63
  64 <para>
  65 There are also charsets that support extended characters, but those need more
  66 storage space per character than ASCII encoding does. A charset that uses two bytes
  67 per character can contain <command>256 * 256 = 65536</command> characters, far more than
  68 any single-byte charset. They are called multi-byte charsets because they use
  69 more than one byte to store one character.
  70 </para>
  71
  72 <para>
  73 One standardized multi-byte charset encoding scheme is known as
  74 <ulink url="http://www.unicode.org/">Unicode</ulink>.  A big advantage of using
  75 Unicode is that you only need one charset. There is no need to make sure two
  76 computers use the same charset when they are communicating.
  77 </para>
  78
  79 <para>Old Windows clients use single-byte charsets, which Microsoft calls
  80 <parameter>codepages</parameter>. However, there is no support for
  81 negotiating the charset to be used in the SMB/CIFS protocol. Thus, you
  82 have to make sure you are using the same charset when talking to an older client.
  83 Newer clients (Windows NT, 200x, XP) talk Unicode over the wire.
  84 </para>
  85 </sect1>
  86
  87 <sect1>
  88 <title>Samba and Charsets</title>
  89
  90 <para>
  91 As of Samba-3, Samba can (and will) talk Unicode over the wire. Internally,
  92 Samba knows of three kinds of character sets:
  93 </para>
  94
  95 <variablelist>
  96         <varlistentry>
  97                 <term><smbconfoption><name>unix charset</name></smbconfoption></term>
  98                 <listitem><para>
  99                 This is the charset used internally by your operating system.
 100                 The default is <constant>UTF-8</constant>, which is fine for most
 101                 systems and covers all characters in all languages. The default
 102                 in previous Samba releases was to save filenames in the encoding of the
 103                 clients, for example cp850 for Western European countries.
 104                 </para></listitem>
 105         </varlistentry>
 106
 107         <varlistentry>
 108                 <term><smbconfoption><name>display charset</name></smbconfoption></term>
 109                 <listitem><para>This is the charset Samba will use to print messages
 110                 on your screen. It should generally be the same as the <parameter>unix charset</parameter>.
 111                 </para></listitem>
 112         </varlistentry>
 113
 114         <varlistentry>
 115                 <term><smbconfoption><name>dos charset</name></smbconfoption></term>
 116                 <listitem><para>This is the charset Samba uses when communicating with
 117                 DOS and Windows 9x/Me clients. It will talk Unicode to all newer clients.
 118                 The default depends on the charsets you have installed on your system.
 119                 Run <command>testparm -v | grep <quote>dos charset</quote></command> to see
 120                 what the default is on your system.
 121                 </para></listitem>
 122         </varlistentry>
 123 </variablelist>
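
<para>
As a minimal sketch, the three parameters can be set explicitly in the
<quote>[global]</quote> section. The values below are illustrative assumptions rather than
recommendations: UTF-8 is the default named above for the <parameter>unix charset</parameter>, and
CP850 is one Western European codepage. Use names that your iconv implementation actually provides.
</para>

<smbconfblock>
<smbconfsection>[global]</smbconfsection>
<smbconfoption><name>unix charset</name><value>UTF-8</value></smbconfoption>
<smbconfoption><name>display charset</name><value>UTF-8</value></smbconfoption>
<smbconfoption><name>dos charset</name><value>CP850</value></smbconfoption>
</smbconfblock>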
 124
 125 </sect1>
 126
 127 <sect1>
 128 <title>Conversion from Old Names</title>
 129
 130 <para>Because previous Samba versions did not do any charset conversion,
 131 characters in filenames are usually not valid in the UNIX charset, only
 132 in the local charset used by the DOS/Windows clients.</para>
 133
 134 <para>Bjoern Jacke has written a utility named <ulink url="http://j3e.de/linux/convmv/">convmv</ulink>
 135 that can convert whole directory structures to different charsets with a single command.
 136 </para>
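
<para>
A possible invocation is sketched below. It assumes that the old filenames are in CP850 and
that the share lives in <filename>/srv/samba/share</filename>; both are hypothetical, so substitute
the codepage your clients really used and your own path. By default convmv only prints what
it would do, and it renames files only when <command>--notest</command> is given.
</para>

<programlisting>
# Dry run: list the renames that would be made, without changing anything
convmv -f cp850 -t utf8 -r /srv/samba/share

# Perform the renames once the dry-run output looks right
convmv -f cp850 -t utf8 -r --notest /srv/samba/share
</programlisting>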
 137
 138 </sect1>
 139
 140 <sect1>
 141 <title>Japanese Charsets</title>
 142
 143 <para>
 144 Setting up Japanese charsets is quite difficult. This is mainly because:
 145 </para>
 146
 147 <itemizedlist>
 148         <listitem><para>The Windows character set is extended from the original legacy Japanese
 149                 standard (JIS X 0208) and is not standardized. This means that the strictly
 150                 standardized implementation cannot support the full Windows character set.
 151         </para></listitem>
 152
 153         <listitem><para> Mainly for historical reasons, there are several encoding methods in
 154                 Japanese, which are not fully compatible with each other. There are
 155                 two major encoding methods. One is the Shift_JIS series, which is used in Windows
 156                 and some UNIX systems. The other is the EUC-JP series, used in most UNIX systems
 157                 and Linux. Moreover, Samba previously also offered several unique encoding
 158                 methods, named CAP and HEX, to keep interoperability with CAP/NetAtalk and
 159                 UNIX systems that can't use Japanese filenames.  Some implementations of the
 160                 EUC-JP series can't support the full Windows character set.
 161         </para></listitem>
 162
 163         <listitem><para>There are several code conversion tables between Unicode and legacy
 164                 Japanese character sets. One is compatible with Windows, another
 165                 is based on the reference of the Unicode consortium, and others are
 166                 mixed implementations. The Unicode consortium does not officially
 167                 define any conversion tables between Unicode and legacy character
 168                 sets, so there cannot be a standard one.
 169         </para></listitem>
 170
 171         <listitem><para>The character sets and conversion tables available in iconv() depend
 172                 on the iconv library that is available. In addition, the Japanese locale
 173                 names may be different on different systems.  This means that the value of
 174                 the charset parameters depends on the implementation of iconv() you are using.
 175                 </para>
 176
 177                 <para>Although Windows uses the fixed 2-byte UCS-2 encoding internally,
 178                 Shift_JIS series encoding is usually used in Japanese environments,
 179                 just as ASCII encoding is in English environments.
 180         </para></listitem>
 181 </itemizedlist>
 182
 183 <sect2><title>Basic Parameter Setting</title>
 184
 185         <para>
 186         <smbconfoption><name>dos charset</name></smbconfoption> and
 187         <smbconfoption><name>display charset</name></smbconfoption>
 188         should be set to the locale compatible with the character set
 189         and encoding method used on Windows. This is usually CP932
 190         but sometimes has a different name.
 191         </para>
 192
 193         <para>
 194         <smbconfoption><name>unix charset</name></smbconfoption> can be the Shift_JIS series,
 195         the EUC-JP series, or UTF-8. UTF-8 is always available, but the availability of the other
 196         locales, and the names they go by, depend on the system.
 197         </para>
 198
 199         <para>
 200         Additionally, you can consider using the Shift_JIS series as the
 201         value of the <smbconfoption><name>unix charset</name></smbconfoption>
 202         parameter together with the vfs_cap module, which does the same thing as
 203         setting <quote>coding system = CAP</quote> in the Samba 2.2 series.
 204         </para>
 205
 206         <para>
 207         Which value to choose for <smbconfoption><name>unix charset</name></smbconfoption>
 208         is a difficult question. Here is a list of the details, advantages, and
 209         disadvantages of each value.
 210         </para>
 211
 212         <variablelist>
 213                 <varlistentry><term>Shift_JIS series</term>
 214                         <listitem><para>
 215                         Shift_JIS series means a locale which is equivalent to <constant>Shift_JIS</constant>,
 216                         used as a standard on Japanese Windows. In the case of <constant>Shift_JIS</constant>,
 217                         for example, if a Japanese file name consisting of 0x8ba4 and 0x974c
 218                         (a 4-byte Japanese character string meaning <quote>share</quote>) and <quote>.txt</quote>
 219                         is written from Windows on Samba, the file name on UNIX becomes
 220                         0x8ba4, 0x974c, <quote>.txt</quote> (an 8-byte binary string), the same as on Windows.
 221                         </para>
 222
 223                         <para>The Shift_JIS series is usually used as the Japanese locale on some
 224                         commercial UNIX systems, such as HP-UX and AIX (although it is also possible
 225                         to use the EUC-JP series there). If you use the Shift_JIS series on these
 226                         platforms, Japanese file names created from Windows can also be referred to
 227                         on UNIX.</para>
 228
 229                         <para>
 230                         If your UNIX is already working with Shift_JIS and there is a user
 231                         who needs to use Japanese file names written from Windows, the
 232                         Shift_JIS series is the best choice.  However, broken file names
 233                         may be displayed, and some commands that cannot handle non-ASCII
 234                         filenames may abort while parsing filenames. In particular, file names
 235                         may contain <quote>\ (0x5c)</quote> bytes, which need to be handled carefully,
 236                         so it is better not to touch file names written from Windows on UNIX.
 237                         </para>
 238
 239                         <para>
 240                         Note that most Japanized free software actually works with EUC-JP
 241                         only. You should verify whether the Japanized free software you use
 242                         can work with Shift_JIS.
 243                         </para>
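
                        <para>
                        To illustrate the <quote>\ (0x5c)</quote> problem, here are two characters
                        whose second Shift_JIS byte is 0x5c. They are examples chosen for this
                        sketch and are identified by their Unicode code points. A tool that
                        treats the file name as ASCII sees a backslash in the middle of each character.
                        </para>

<programlisting>
Character              Shift_JIS bytes   Second byte read as ASCII
U+30BD (katakana SO)   0x83 0x5c         \
U+8868                 0x95 0x5c         \
</programlisting>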
 244                         </listitem>
 245                 </varlistentry>
 246
 247                 <varlistentry><term>EUC-JP series</term>
 248                         <listitem><para>
 249                         EUC-JP series means a locale which is equivalent to the industry
 250                         standard called EUC-JP, widely used in Japanese UNIX (although EUC
 251                         contains specifications for languages other than Japanese, such as
 252                         EUC-KR). In the case of EUC-JP series, for example, if a Japanese
 253                         file name consisting of 0x8ba4 and 0x974c and <quote>.txt</quote> is written from
 254                         Windows on Samba, the file name on UNIX becomes 0xb6a6, 0xcdad,
 255                         <quote>.txt</quote> (an 8-byte binary string).
 256                         </para>
 257
 258                         <para>
 259                         EUC-JP is usually used as the Japanese locale on open source UNIX
 260                         systems such as Linux and FreeBSD, and on commercial UNIX systems such
 261                         as Solaris, IRIX, and Tru64 UNIX (although Solaris can also use
 262                         Shift_JIS and UTF-8, and Tru64 UNIX can also use Shift_JIS). If you use
 263                         the EUC-JP series, most Japanese file names created from Windows can
 264                         also be referred to on UNIX. Also, most Japanized free software works
 265                         mainly with EUC-JP only.
 266                         </para>
 267
 268                         <para>
 269                         It is recommended to choose EUC-JP series when using Japanese file
 270                         names on these UNIX systems.
 271                         </para>
 272
 273                         <para>
 274                         Although there is no character which needs to be carefully treated
 275                         like <quote>\ (0x5c)</quote>, broken file names may be displayed, and some
 276                         commands that cannot handle non-ASCII filenames may abort
 277                         while parsing filenames.
 278                         </para>
 279
 280                         <para>
 281                         Moreover, if you built Samba using a separately installed libiconv,
 282                         the eucJP-ms locale included in libiconv and the EUC-JP series locale
 283                         included in the OS may not be compatible. In this case, you may need to
 284                         avoid using incompatible characters in file names.
 285                         </para>
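
                        <para>
                        A minimal sketch of the corresponding settings follows. It assumes an
                        iconv implementation that provides the names CP932 and eucJP-ms. The
                        names may differ on your system.
                        </para>

<smbconfblock>
<smbconfsection>[global]</smbconfsection>
<smbconfoption><name>dos charset</name><value>CP932</value></smbconfoption>
<smbconfoption><name>unix charset</name><value>eucJP-ms</value></smbconfoption>
<smbconfoption><name>display charset</name><value>CP932</value></smbconfoption>
</smbconfblock>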
 286                         </listitem>
 287                 </varlistentry>
 288
 289                 <varlistentry><term>UTF-8</term>
 290                         <listitem><para>
 291                         UTF-8 means a locale which is equivalent to UTF-8, the international
 292                         standard defined by the Unicode consortium. In UTF-8, a <parameter>character</parameter> is
 293                         expressed using 1 to 4 bytes. In the case of Japanese, most characters
 294                         are expressed using 3 bytes. Since Windows uses Shift_JIS, where a
 295                         character is expressed with 1 or 2 bytes, to express
 296                         Japanese, the byte length of a UTF-8 string is roughly 1.5 times
 297                         that of the original Shift_JIS string. In the case of UTF-8,
 298                         for example, if a Japanese file name consisting of 0x8ba4 and 0x974c and
 299                         <quote>.txt</quote> is written from Windows on Samba, the file name on UNIX
 300                         becomes 0xe585, 0xb1e6, 0x9c89, <quote>.txt</quote> (a 10-byte binary string).
 301                         </para>
 302
 303                         <para>
 304                         For systems where iconv() is not available or where iconv()'s locales
 305                         are not compatible with Windows, UTF-8 is the only locale available.
 306                         </para>
 307
 308                         <para>
 309                         There are no systems that use UTF-8 as default locale for Japanese.
 310                         </para>
 311
 312                         <para>
 313                         Some broken file names may be displayed, and some commands that
 314                         cannot handle non-ASCII filenames may abort while parsing
 315                         filenames. Unlike Shift_JIS, UTF-8 never uses <quote>\ (0x5c)</quote> as part of a
 316                         multi-byte character, but it is still better not to touch file names
 317                         written from Windows on UNIX.
 318                         </para>
 319
 320                         <para>
 321                         In addition, although it is not directly concerned with Samba,
 322                         there is a subtle difference between the iconv() function, which is
 323                         generally used on UNIX, and the functions used on other platforms,
 324                         such as Windows and Java, in the conversion table between
 325                         Shift_JIS and Unicode, so you should handle UTF-8 carefully.
 326                         </para>
 327
 328                         <para>
 329                         Although Mac OS X uses UTF-8 as its encoding method for filenames,
 330                         it uses an extended UTF-8 specification that Samba cannot handle, so
 331                         the UTF-8 locale is not available for Mac OS X.
 332                         </para>
 333                         </listitem>
 334                 </varlistentry>
 335
 336                 <varlistentry><term>Shift_JIS series + vfs_cap (CAP encoding)</term>
 337                         <listitem><para>
 338                         CAP encoding means a specification used in CAP and NetAtalk, file
 339                         server software for Macintosh. In the case of CAP encoding, for
 340                         example, if a Japanese file name consisting of 0x8ba4 and 0x974c and
 341                         <quote>.txt</quote> is written from Windows on Samba, the file name on UNIX
 342                         becomes <quote>:8b:a4:97L.txt</quote> (a 14-byte ASCII string).
 343                         </para>
 344
 345                         <para>
 346                         In CAP encoding, a byte which cannot be expressed as an ASCII
 347                         character (0x80 or above) is encoded in the form <quote>:xx</quote>. You still need to take
 348                         care when a filename contains <quote>\(0x5c)</quote>, but filenames are not
 349                         broken on a system which cannot handle non-ASCII filenames.
 350                         </para>
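
                        <para>
                        Applied byte by byte to the example file name, the rule works out as
                        sketched below. Only 0x4c is below 0x80, so it is kept as the ASCII
                        letter <quote>L</quote>, while the other three bytes each become a
                        three-character <quote>:xx</quote> sequence.
                        </para>

<programlisting>
Shift_JIS byte   0x8b   0xa4   0x97   0x4c   ".txt"
CAP output       :8b    :a4    :97    L      .txt
Output length    3    + 3    + 3    + 1    + 4      = 14 bytes
</programlisting>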
 351
 352                         <para>
 353                         The greatest merit of CAP encoding is the compatibility of encoded
 354                         filenames with CAP or NetAtalk, file server software for Macintosh.
 355                         Since they usually write file names on UNIX with CAP encoding, if a
 356                         directory is shared with both Samba and NetAtalk, you need to use
 357                         CAP encoding to avoid breaking non-ASCII filenames.
 358                         </para>
 359
 360                         <para>
 361                         However, recently there are some systems where NetAtalk has been
 362                         patched to write filenames with EUC-JP (for example, Vine Linux, which originated in Japan).
 363                         There you need to choose the EUC-JP series instead of CAP encoding.
 364                         </para>
 365
 366                         <para>
 367                         vfs_cap itself can also be used with non-Shift_JIS series locales on
 368                         systems that cannot handle non-ASCII characters or systems that
 369                         share files with NetAtalk.
 370                         </para>
 371
 372                         <para>
 373                         To use CAP encoding on Samba-3, you should use the unix charset parameter and VFS
 374                         as follows:
 375                         </para>
 376
 377 <smbconfexample><title>VFS CAP</title>
 378 <smbconfsection>[global]</smbconfsection>
 379 <smbconfoption><name>dos charset</name><value>CP932<footnote><para>The locale name <quote>CP932</quote> may be different on your system.</para></footnote></value></smbconfoption>
 380 <smbconfoption><name>unix charset</name><value>CP932</value></smbconfoption>
 381
 382 <member><para>...</para></member>
 383
 384 <smbconfsection>[cap-share]</smbconfsection>
 385 <smbconfoption><name>vfs objects</name><value>cap</value></smbconfoption>
 386 </smbconfexample>
 387
 388                         <para>
 389                         If you are using GNU libiconv, set unix charset to CP932. With this setting,
 390                         filenames in the <quote>cap-share</quote> share are written with CAP encoding.
 391                         </para>
 392                         </listitem>
 393                 </varlistentry>
 394         </variablelist>
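
        <para>
        The examples above can be summarized side by side. The table below only restates the
        byte values quoted in the text for the two characters meaning <quote>share</quote>,
        followed by <quote>.txt</quote>. The step from 4 bytes in Shift_JIS to 6 bytes in UTF-8
        is the factor of roughly 1.5 mentioned for UTF-8.
        </para>

<programlisting>
unix charset         stored form of the two characters   total with ".txt"
Shift_JIS series     8b a4 97 4c         (4 bytes)       8 bytes
EUC-JP series        b6 a6 cd ad         (4 bytes)       8 bytes
UTF-8                e5 85 b1 e6 9c 89   (6 bytes)       10 bytes
Shift_JIS + vfs_cap  :8b:a4:97L          (10 bytes)      14 bytes
</programlisting>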
 395
 396 </sect2>
 397
 398 <sect2><title>Individual Implementations</title>
 399
 400 <para>
 401 Here is some additional information regarding individual implementations:
 402 </para>
 403
 404         <variablelist>
 405                 <varlistentry><term>GNU libiconv</term>
 406                         <listitem><para>
 407                         To handle Japanese correctly, you should apply the patch
 408                         <ulink url="http://www2d.biglobe.ne.jp/~msyk/software/libiconv-patch.html">libiconv-1.8-cp932-patch.diff.gz</ulink>
 409                         to libiconv-1.8.
 410                         </para>
 411
 412                         <para>
 413                         Using the patched libiconv-1.8, these settings are available:
 414                         </para>
 415
 416
 417 <!-- FIXME: Convert to diagram ? -->
 418 <programlisting>
 419 dos charset = CP932
 420 unix charset = CP932 / eucJP-ms / UTF-8
 421                 |       |
 422                 |       +-- EUC-JP series
 423                 +-- Shift_JIS series
 424 display charset = CP932
 425 </programlisting>
 426
 427                         <para>
 428                         Other Japanese locales (for example Shift_JIS and EUC-JP) should not
 429                         be used because they lack compatibility with Windows.
 430                         </para>
 431                         </listitem>
 432                 </varlistentry>
 433
 434                 <varlistentry><term>GNU glibc</term>
 435                         <listitem><para>
 436                         To handle Japanese correctly, you should apply a <ulink url="http://www2d.biglobe.ne.jp/~msyk/software/glibc/">patch</ulink>
 437                         to glibc-2.2.5/2.3.1/2.3.2 or should use the patch-merged versions, glibc-2.3.3 or later.
 438                         </para>
 439
 440                         <para>
 441                         Using the above glibc, these settings are available:
 442                         </para>
 443
 444 <smbconfblock>
 445 <smbconfoption><name>dos charset</name><value>CP932</value></smbconfoption>
 446 <smbconfoption><name>unix charset</name><value>CP932 / eucJP-ms / UTF-8</value></smbconfoption>
 447 <smbconfoption><name>display charset</name><value>CP932</value></smbconfoption>
 448 </smbconfblock>
 449
 450                         <para>
 451                         Other Japanese locales (for example Shift_JIS and EUC-JP) should not
 452                         be used because they lack compatibility with Windows.
 453                         </para>
 454                         </listitem>
 455                 </varlistentry>
 456         </variablelist>
 457
 458 </sect2>
 459
 460 <sect2>
 461         <title>Migration from Samba-2.2 Series</title>
 462
 463 <para>
 464 In the Samba-2.2 series and earlier, the <quote>coding system</quote> parameter played the role of the
 465 <smbconfoption><name>unix charset</name></smbconfoption> parameter of the Samba-3 series.
 466 <link linkend="japancharsets">The following table</link> shows the mapping when migrating from the Samba-2.2 series to Samba-3.
 467 </para>
 468
 469         <table frame="all" id="japancharsets">
 470                 <title>Japanese Character Sets in Samba-2.2 and Samba-3</title>
 471
 472                 <tgroup cols="2" align="center">
 473                         <colspec align="center"/>
 474                         <colspec align="center"/>
 475                         <thead>
 476                                 <row><entry>Samba-2.2 Coding System</entry><entry>Samba-3 unix charset</entry></row>
 477                         </thead>
 478                         <tbody>
 479                                 <row><entry>SJIS</entry><entry>Shift_JIS series</entry></row>
 480                                 <row><entry>EUC</entry><entry>EUC-JP series</entry></row>
 481                                 <row><entry>EUC3<footnote><para>Only exists in Japanese Samba version</para></footnote></entry><entry>EUC-JP series</entry></row>
 482                                 <row><entry>CAP</entry><entry>Shift_JIS series + VFS</entry></row>
 483                                 <row><entry>HEX</entry><entry>currently none</entry></row>
 484                                 <row><entry>UTF8</entry><entry>UTF-8</entry></row>
 485                                 <row><entry>UTF8-Mac<footnote><para>Only exists in Japanese Samba version</para></footnote></entry><entry>currently none</entry></row>
 486                                 <row><entry>others</entry><entry>none</entry></row>
 487                         </tbody>
 488                 </tgroup>
 489         </table>
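
        <para>
        As a hedged illustration of the EUC row of the table, a Samba-2.2 configuration containing
        the following line:
        </para>

<programlisting>
coding system = EUC
</programlisting>

        <para>
        would correspond roughly to these Samba-3 settings. The names CP932 and eucJP-ms are
        assumptions about the iconv implementation in use and may differ on your system.
        </para>

<smbconfblock>
<smbconfsection>[global]</smbconfsection>
<smbconfoption><name>dos charset</name><value>CP932</value></smbconfoption>
<smbconfoption><name>unix charset</name><value>eucJP-ms</value></smbconfoption>
</smbconfblock>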
 490
 491 </sect2>
 492
 493 </sect1>
 494
 495 <sect1>
 496         <title>Common Errors</title>
 497
 498         <sect2>
 499                 <title>CP850.so Can't Be Found</title>
 500
 501                 <para><quote>Samba is complaining about a missing <filename>CP850.so</filename> file.</quote></para>
 502
 503                 <para><emphasis>Answer:</emphasis> CP850 is the default <smbconfoption><name>dos charset</name></smbconfoption>.
 504                 The <smbconfoption><name>dos charset</name></smbconfoption> is used to convert data to the codepage used by your DOS clients.
 505                 If you do not have any DOS clients, you can safely ignore this message. </para>
 506
 507                 <para>CP850 should be supported by your local iconv implementation. Make sure you have all the required packages installed.
 508                 If you compiled Samba from source, make sure that configure found iconv.</para>
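
                <para>One way to check whether your iconv knows CP850 is sketched below. It assumes
                that the <command>iconv</command> command-line tool belonging to the same iconv
                library is installed. If the command prints nothing, the charset is probably missing.</para>

<programlisting>
# List the charsets known to iconv and look for codepage 850
iconv -l | grep -i 850
</programlisting>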
 509         </sect2>
 510 </sect1>
 511
 512 </chapter>