doc/src/sgml/charset.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3 <chapter id="charset">
   4  <title>Localization</>
   5
   6  <para>
   7   This chapter describes the available localization features from the
   8   point of view of the administrator.
   9   <productname>PostgreSQL</productname> supports localization with
  10   two approaches:
  11
  12    <itemizedlist>
  13     <listitem>
  14      <para>
  15       Using the locale features of the operating system to provide
  16       locale-specific collation order, number formatting, translated
  17       messages, and other aspects.
  18      </para>
  19     </listitem>
  20
  21     <listitem>
  22      <para>
  23       Providing a number of different character sets defined in the
  24       <productname>PostgreSQL</productname> server, including
  25       multiple-byte character sets, to support storing text in all
  26       kinds of languages, and providing character set translation between
  27       client and server.
  28      </para>
  29     </listitem>
  30    </itemizedlist>
  31   </para>
  32
  33
  34  <sect1 id="locale">
  35   <title>Locale Support</title>
  36
  37   <indexterm zone="locale"><primary>locale</></>
  38
  39   <para>
  40    <firstterm>Locale</> support refers to an application respecting
  41    cultural preferences regarding alphabets, sorting, number
  42    formatting, etc.  <productname>PostgreSQL</> uses the standard ISO
  43    C and <acronym>POSIX</acronym> locale facilities provided by the server operating
  44    system.  For additional information refer to the documentation of your
  45    system.
  46   </para>
  47
  48   <sect2>
  49    <title>Overview</>
  50
  51    <para>
  52     Locale support is automatically initialized when a database
  53     cluster is created using <command>initdb</command>.
  54     <command>initdb</command> will initialize the database cluster
  55     with the locale setting of its execution environment by default,
  56     so if your system is already set to use the locale that you want
  57     in your database cluster then there is nothing else you need to
  58     do.  If you want to use a different locale (or you are not sure
  59     which locale your system is set to), you can tell
  60     <command>initdb</command> exactly which locale to use by
  61     specifying the <option>--locale</option> option.  For example:
  62 <screen>
  63 initdb --locale=sv_SE
  64 </screen>
  65    </para>
  66
  67    <para>
  68     This example for Unix systems sets the locale to Swedish
  69     (<literal>sv</>) as spoken
  70     in Sweden (<literal>SE</>).  Other possibilities might be
  71     <literal>en_US</> (U.S. English) and <literal>fr_CA</> (French
  72     Canadian).  If more than one character set can be used for a
  73     locale, the character set is appended to the locale name, as in
  74     <literal>cs_CZ.ISO8859-2</>.  Which locales are available under which
  75     names on your system depends on what was provided by the operating
  76     system vendor and what was installed.  On most Unix systems, the command
  77     <literal>locale -a</> will provide a list of available locales.
  78     Windows uses more verbose names, such as <literal>German_Germany</>
  79     or <literal>Swedish_Sweden.1252</>.
  80    </para>
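   <para>
    As a rough illustration of how such names are put together, the
    following shell sketch splits a Unix-style locale name into language,
    territory, and character set.  The function
    <literal>parse_locale_name</literal> is a hypothetical helper written
    for this example; it is not part of <productname>PostgreSQL</> or the
    operating system, and it assumes the plain
    <literal>language_TERRITORY.codeset</literal> form without modifiers.
   </para>

```shell
# Split a locale name such as cs_CZ.ISO8859-2 into its parts.
# parse_locale_name is a hypothetical helper for illustration only.
parse_locale_name() {
    name=$1
    codeset=
    territory=
    # Everything after the first "." is the character set, if present.
    case $name in
        *.*) codeset=${name#*.}; name=${name%%.*} ;;
    esac
    # Everything after the first "_" is the territory, if present.
    case $name in
        *_*) territory=${name#*_}; name=${name%%_*} ;;
    esac
    printf 'language=%s territory=%s codeset=%s\n' "$name" "$territory" "$codeset"
}

parse_locale_name sv_SE
parse_locale_name cs_CZ.ISO8859-2
```

   <para>
    The names that are actually accepted are whatever
    <literal>locale -a</> reports on the system in question; the sketch
    only shows the naming pattern.
   </para>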
  81
  82    <para>
  83     Occasionally it is useful to mix rules from several locales, e.g.,
  84     use English collation rules but Spanish messages.  To support that, a
  85     set of locale subcategories exists, each controlling only a certain
  86     aspect of the localization rules:
  87
  88     <informaltable>
  89      <tgroup cols="2">
  90       <tbody>
  91        <row>
  92         <entry><envar>LC_COLLATE</></>
  93         <entry>String sort order</>
  94        </row>
  95        <row>
  96         <entry><envar>LC_CTYPE</></>
  97         <entry>Character classification (What is a letter? Its upper-case equivalent?)</>
  98        </row>
  99        <row>
 100         <entry><envar>LC_MESSAGES</></>
 101         <entry>Language of messages</>
 102        </row>
 103        <row>
 104         <entry><envar>LC_MONETARY</></>
 105         <entry>Formatting of currency amounts</>
 106        </row>
 107        <row>
 108         <entry><envar>LC_NUMERIC</></>
 109         <entry>Formatting of numbers</>
 110        </row>
 111        <row>
 112         <entry><envar>LC_TIME</></>
 113         <entry>Formatting of dates and times</>
 114        </row>
 115       </tbody>
 116      </tgroup>
 117     </informaltable>
 118
 119     The category names translate into names of
 120     <command>initdb</command> options to override the locale choice
 121     for a specific category.  For instance, to set the locale to
 122     French Canadian, but use U.S. rules for formatting currency, use
 123     <literal>initdb --locale=fr_CA --lc-monetary=en_US</literal>.
 124    </para>
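   <para>
    The mapping from category name to option name can be sketched in
    shell as follows.  The function
    <literal>category_to_initdb_option</literal> is a hypothetical helper
    made up for this example; the script only prints the
    <command>initdb</command> command line and does not run it.
   </para>

```shell
# Turn a locale category name into the matching initdb option name,
# for example LC_MONETARY becomes --lc-monetary.
# category_to_initdb_option is a hypothetical helper for illustration only.
category_to_initdb_option() {
    # Lower-case the name and turn the underscore into a hyphen.
    lower=$(printf '%s' "$1" | LC_ALL=C tr 'A-Z_' 'a-z-')
    printf '%s\n' "--$lower"
}

# Print (not run) the command from the text: French Canadian overall,
# but U.S. rules for formatting currency.
printf 'initdb --locale=%s %s=%s\n' \
    fr_CA "$(category_to_initdb_option LC_MONETARY)" en_US
```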
 125
 126    <para>
 127     If you want the system to behave as if it had no locale support,
 128     use the special locale <literal>C</> or <literal>POSIX</>.
 129    </para>
 130
 131    <para>
 132     The nature of some locale categories is that their value has to be
 133     fixed when the database is created.  You can use different settings
 134     for different databases, but once a database is created, you cannot
 135     change them for that database anymore.  These categories are
 136     <literal>LC_COLLATE</literal> and <literal>LC_CTYPE</literal>.  They affect
 137     the sort order of indexes, so they must be kept fixed, or indexes on
 138     text columns will become corrupt.  The default values for these
 139     categories are defined when <command>initdb</command> is run, and
 140     those values are used when new databases are created, unless
 141     specified otherwise in the <command>CREATE DATABASE</command> command.
 142    </para>
 143
 144    <para>
 145     The other locale categories can be changed as desired whenever the
 146     server is running by setting the run-time configuration variables
 147     that have the same name as the locale categories (see <xref
 148     linkend="runtime-config-client-format"> for details).  The defaults
 149     that are chosen by <command>initdb</command> are actually only written into
 150     the configuration file <filename>postgresql.conf</filename> to
 151     serve as defaults when the server is started.  If you delete these
 152     assignments from <filename>postgresql.conf</filename> then the
 153     server will inherit the settings from its execution environment.
 154    </para>
 155
 156    <para>
 157     Note that the locale behavior of the server is determined by the
 158     environment variables seen by the server, not by the environment
 159     of any client.  Therefore, be careful to configure the correct locale settings
 160     before starting the server.  A consequence of this is that if
 161     client and server are set up in different locales, messages might
 162     appear in different languages depending on where they originated.
 163    </para>
 164
 165    <note>
 166     <para>
 167      When we speak of inheriting the locale from the execution
 168      environment, this means the following on most operating systems:
 169      For a given locale category, say the collation, the following
 170      environment variables are consulted in this order until one is
 171      found to be set: <envar>LC_ALL</envar>, <envar>LC_COLLATE</envar>
 172      (the variable corresponding to the respective category),
 173      <envar>LANG</envar>.  If none of these environment variables are
 174      set then the locale defaults to <literal>C</literal>.
 175     </para>
 176
 177     <para>
 178      Some message localization libraries also look at the environment
 179      variable <envar>LANGUAGE</envar>, which overrides all other locale
 180      settings for the purpose of setting the language of messages.  If
 181      in doubt, please refer to the documentation of your operating
 182      system, in particular the documentation about
 183      <application>gettext</>, for more information.
 184     </para>
 185    </note>
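   <para>
    The lookup order described in the note can be sketched as a small
    shell function.  The name <literal>effective_locale</literal> is
    hypothetical, and the sketch assumes that an empty variable counts as
    unset; individual operating systems may differ in such details.
   </para>

```shell
# Report which locale a category would inherit from the environment:
# LC_ALL first, then the category's own variable, then LANG, else C.
# effective_locale is a hypothetical helper for illustration only.
# Assumption: an empty variable is treated the same as an unset one.
effective_locale() {
    category=$1
    # Fetch the value of the variable whose name is stored in $category.
    eval "category_value=\${$category:-}"
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' C
    fi
}

# Show what the collation category resolves to in the current environment.
effective_locale LC_COLLATE
```

   <para>
    Running the function in the environment from which the server is
    started shows which setting would be inherited for a category when
    <filename>postgresql.conf</filename> does not assign it.
   </para>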
 186
 187    <para>
 188     To enable messages to be translated to the user's preferred language,
 189     <acronym>NLS</acronym> must have been enabled at build time.  This
 190     choice is independent of the other locale support.
 191    </para>
 192   </sect2>
 193
 194   <sect2>
 195    <title>Behavior</>
 196
 197    <para>
 198     The locale settings influence the following SQL features:
 199
 200     <itemizedlist>
 201      <listitem>
 202       <para>
 203        Sort order in queries using <literal>ORDER BY</> on textual data
 204        <indexterm><primary>ORDER BY</><secondary>and locales</></indexterm>
 205       </para>
 206      </listitem>
 207
 208      <listitem>
 209       <para>
 210        The ability to use indexes with <literal>LIKE</> clauses
 211        <indexterm><primary>LIKE</><secondary>and locales</></indexterm>
 212       </para>
 213      </listitem>
 214
 215      <listitem>
 216       <para>
 217        The <function>upper</>, <function>lower</>, and <function>initcap</>
 218        functions
 219        <indexterm><primary>upper</><secondary>and locales</></indexterm>
 220        <indexterm><primary>lower</><secondary>and locales</></indexterm>
 221       </para>
 222      </listitem>
 223
 224      <listitem>
 225       <para>
 226        The <function>to_char</> family of functions
 227        <indexterm><primary>to_char</><secondary>and locales</></indexterm>
 228       </para>
 229      </listitem>
 230     </itemizedlist>
 231    </para>
 232
 233    <para>
 234     The drawback of using locales other than <literal>C</> or
 235     <literal>POSIX</> in <productname>PostgreSQL</> is the performance
 236     impact: it slows character handling and prevents ordinary indexes
 237     from being used by <literal>LIKE</>.  For this reason, use locales
 238     only if you actually need them.
 239    </para>
 240
 241    <para>
 242     As a workaround to allow <productname>PostgreSQL</> to use indexes
 243     with <literal>LIKE</> clauses under a non-C locale, several custom
 244     operator classes exist. These allow the creation of an index that
 245     performs a strict character-by-character comparison, ignoring
 246     locale comparison rules. Refer to <xref linkend="indexes-opclass">
 247     for more information.
 248    </para>
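   <para>
    A minimal sketch of this workaround, kept in shell so that it can be
    tried without a running server: the hypothetical function
    <literal>pattern_index_sql</literal> prints a
    <command>CREATE INDEX</command> statement that uses the
    <literal>text_pattern_ops</literal> operator class.  The table name
    <literal>customers</literal>, the column name <literal>name</literal>,
    and the generated index name are placeholders, and the column is
    assumed to be of type <type>text</type>.
   </para>

```shell
# Print a CREATE INDEX statement that uses a pattern operator class.
# pattern_index_sql is a hypothetical helper; the table and column names
# passed to it are placeholders chosen by the caller.
pattern_index_sql() {
    table=$1
    column=$2
    printf 'CREATE INDEX %s_%s_pattern_idx ON %s (%s text_pattern_ops);\n' \
        "$table" "$column" "$table" "$column"
}

# Emit the statement for a text column called "name" in table "customers".
pattern_index_sql customers name
```

   <para>
    The printed statement could then be passed to
    <command>psql</command>.  Such an index serves pattern matching only;
    ordinary comparisons and <literal>ORDER BY</> still need an index
    with the default operator class.
   </para>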
 249   </sect2>
 250
 251   <sect2>
 252    <title>Problems</>
 253
 254    <para>
 255     If locale support doesn't work according to the explanation above,
 256     check that the locale support in your operating system is
 257     correctly configured.  To check what locales are installed on your
 258     system, you can use the command <literal>locale -a</literal> if
 259     your operating system provides it.
 260    </para>
 261
 262    <para>
 263     Check that <productname>PostgreSQL</> is actually using the locale
 264     that you think it is.  The default <envar>LC_COLLATE</> and <envar>LC_CTYPE</>
 265     settings are determined at <command>initdb</> time and cannot be
 266     changed without repeating <command>initdb</>.  Other locale
 267     settings including <envar>LC_MESSAGES</> and <envar>LC_MONETARY</>
 268     are initially determined by the environment the server is started
 269     in, but can be changed on-the-fly.  You can check the active locale
 270     settings using the <command>SHOW</> command.
 271    </para>
 272
 273    <para>
 274     The directory <filename>src/test/locale</> in the source
 275     distribution contains a test suite for
 276     <productname>PostgreSQL</>'s locale support.
 277    </para>
 278
 279    <para>
 280     Client applications that handle server-side errors by parsing the
 281     text of the error message will obviously have problems when the
 282     server's messages are in a different language.  Authors of such
 283     applications are advised to make use of the error code scheme
 284     instead.
 285    </para>
 286
 287    <para>
 288     Maintaining catalogs of message translations requires the ongoing
 289     efforts of many volunteers who want to see
 290     <productname>PostgreSQL</> speak their preferred language well.
 291     If messages in your language are currently not available or not fully
 292     translated, your assistance would be appreciated.  If you want to
 293     help, refer to <xref linkend="nls"> or write to the developers'
 294     mailing list.
 295    </para>
 296   </sect2>
 297  </sect1>
 298
 299
 300  <sect1 id="multibyte">
 301   <title>Character Set Support</title>
 302
 303   <indexterm zone="multibyte"><primary>character set</></>
 304
 305   <para>
 306    The character set support in <productname>PostgreSQL</productname>
 307    allows you to store text in a variety of character sets (also called
 308    encodings), including
 309    single-byte character sets such as the ISO 8859 series and
 310    multiple-byte character sets such as <acronym>EUC</> (Extended Unix
 311    Code), UTF-8, and Mule internal code.  All supported character sets
 312    can be used transparently by clients, but a few are not supported
 313    for use within the server (that is, as a server-side encoding).
 314    The default character set is selected while
 315    initializing your <productname>PostgreSQL</productname> database
 316    cluster using <command>initdb</>.  It can be overridden when you
 317    create a database, so you can have multiple
 318    databases each with a different character set.
 319   </para>
 320
 321   <para>
 322    An important restriction, however, is that each database's character set
 323    must be compatible with the database's <envar>LC_CTYPE</> setting.
 324    When <envar>LC_CTYPE</> is <literal>C</> or <literal>POSIX</>, any
 325    character set is allowed, but for other settings of <envar>LC_CTYPE</>
 326    there is only one character set that will work correctly.
 327   </para>
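  <para>
   The rule above can be expressed as a small check.  The function
   <literal>ctype_allows_any_encoding</literal> is a hypothetical helper
   for illustration; it only encodes the rule as stated here and does
   not ask the operating system which character set a locale expects.
  </para>

```shell
# Succeed when any database character set may be combined with the given
# LC_CTYPE value, which per the text is the case only for C and POSIX.
# ctype_allows_any_encoding is a hypothetical helper for illustration only.
ctype_allows_any_encoding() {
    case $1 in
        C|POSIX) return 0 ;;
        *) return 1 ;;
    esac
}

for ctype in C POSIX sv_SE; do
    if ctype_allows_any_encoding "$ctype"; then
        printf '%s: any character set\n' "$ctype"
    else
        printf '%s: only the matching character set\n' "$ctype"
    fi
done
```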
 328
 329    <sect2 id="multibyte-charset-supported">
 330     <title>Supported Character Sets</title>
 331
 332     <para>
 333      <xref linkend="charset-table"> shows the character sets available
 334      for use in <productname>PostgreSQL</productname>.
 335     </para>
 336
 337      <table id="charset-table">
 338       <title><productname>PostgreSQL</productname> Character Sets</title>
 339       <tgroup cols="6">
 340        <thead>
 341         <row>
 342          <entry>Name</entry>
 343          <entry>Description</entry>
 344          <entry>Language</entry>
 345          <entry>Server?</entry>
 346          <!--
 347           The Bytes/Char field is populated by looking at the values returned
 348           by pg_wchar_table.mblen function for each encoding.
 349          -->
 350          <entry>Bytes/Char</entry>
 351          <entry>Aliases</entry>
 352         </row>
 353        </thead>
 354        <tbody>
 355         <row>
 356          <entry><literal>BIG5</literal></entry>
 357          <entry>Big Five</entry>
 358          <entry>Traditional Chinese</entry>
 359          <entry>No</entry>
 360          <entry>1-2</entry>
 361          <entry><literal>WIN950</>, <literal>Windows950</></entry>
 362         </row>
 363         <row>
 364          <entry><literal>EUC_CN</literal></entry>
 365          <entry>Extended UNIX Code-CN</entry>
 366          <entry>Simplified Chinese</entry>
 367          <entry>Yes</entry>
 368          <entry>1-3</entry>
 369          <entry></entry>
 370         </row>
 371         <row>
 372          <entry><literal>EUC_JP</literal></entry>
 373          <entry>Extended UNIX Code-JP</entry>
 374          <entry>Japanese</entry>
 375          <entry>Yes</entry>
 376          <entry>1-3</entry>
 377          <entry></entry>
 378         </row>
 379         <row>
 380          <entry><literal>EUC_JIS_2004</literal></entry>
 381          <entry>Extended UNIX Code-JP, JIS X 0213</entry>
 382          <entry>Japanese</entry>
 383          <entry>Yes</entry>
 384          <entry>1-3</entry>
 385          <entry></entry>
 386         </row>
 387         <row>
 388          <entry><literal>EUC_KR</literal></entry>
 389          <entry>Extended UNIX Code-KR</entry>
 390          <entry>Korean</entry>
 391          <entry>Yes</entry>
 392          <entry>1-3</entry>
 393          <entry></entry>
 394         </row>
 395         <row>
 396          <entry><literal>EUC_TW</literal></entry>
 397          <entry>Extended UNIX Code-TW</entry>
 398          <entry>Traditional Chinese, Taiwanese</entry>
 399          <entry>Yes</entry>
 400          <entry>1-3</entry>
 401          <entry></entry>
 402         </row>
 403         <row>
 404          <entry><literal>GB18030</literal></entry>
 405          <entry>National Standard</entry>
 406          <entry>Chinese</entry>
 407          <entry>No</entry>
 408          <entry>1-2</entry>
 409          <entry></entry>
 410         </row>
 411         <row>
 412          <entry><literal>GBK</literal></entry>
 413          <entry>Extended National Standard</entry>
 414          <entry>Simplified Chinese</entry>
 415          <entry>No</entry>
 416          <entry>1-2</entry>
 417          <entry><literal>WIN936</>, <literal>Windows936</></entry>
 418         </row>
 419         <row>
 420          <entry><literal>ISO_8859_5</literal></entry>
 421          <entry>ISO 8859-5, <acronym>ECMA</> 113</entry>
 422          <entry>Latin/Cyrillic</entry>
 423          <entry>Yes</entry>
 424          <entry>1</entry>
 425          <entry></entry>
 426         </row>
 427         <row>
 428          <entry><literal>ISO_8859_6</literal></entry>
 429          <entry>ISO 8859-6, <acronym>ECMA</> 114</entry>
 430          <entry>Latin/Arabic</entry>
 431          <entry>Yes</entry>
 432          <entry>1</entry>
 433          <entry></entry>
 434         </row>
 435         <row>
 436          <entry><literal>ISO_8859_7</literal></entry>
 437          <entry>ISO 8859-7, <acronym>ECMA</> 118</entry>
 438          <entry>Latin/Greek</entry>
 439          <entry>Yes</entry>
 440          <entry>1</entry>
 441          <entry></entry>
 442         </row>
 443         <row>
 444          <entry><literal>ISO_8859_8</literal></entry>
 445          <entry>ISO 8859-8, <acronym>ECMA</> 121</entry>
 446          <entry>Latin/Hebrew</entry>
 447          <entry>Yes</entry>
 448          <entry>1</entry>
 449          <entry></entry>
 450         </row>
 451         <row>
 452          <entry><literal>JOHAB</literal></entry>
 453          <entry><acronym>JOHAB</></entry>
 454          <entry>Korean (Hangul)</entry>
 455          <entry>No</entry>
 456          <entry>1-3</entry>
 457          <entry></entry>
 458         </row>
 459         <row>
 460          <entry><literal>KOI8</literal></entry>
 461          <entry><acronym>KOI</acronym>8-R(U)</entry>
 462          <entry>Cyrillic</entry>
 463          <entry>Yes</entry>
 464          <entry>1</entry>
 465          <entry><literal>KOI8R</></entry>
 466         </row>
 467         <row>
 468          <entry><literal>LATIN1</literal></entry>
 469          <entry>ISO 8859-1, <acronym>ECMA</> 94</entry>
 470          <entry>Western European</entry>
 471          <entry>Yes</entry>
 472          <entry>1</entry>
 473          <entry><literal>ISO88591</></entry>
 474         </row>
 475         <row>
 476          <entry><literal>LATIN2</literal></entry>
 477          <entry>ISO 8859-2, <acronym>ECMA</> 94</entry>
 478          <entry>Central European</entry>
 479          <entry>Yes</entry>
 480          <entry>1</entry>
 481          <entry><literal>ISO88592</></entry>
 482         </row>
 483         <row>
 484          <entry><literal>LATIN3</literal></entry>
 485          <entry>ISO 8859-3, <acronym>ECMA</> 94</entry>
 486          <entry>South European</entry>
 487          <entry>Yes</entry>
 488          <entry>1</entry>
 489          <entry><literal>ISO88593</></entry>
 490         </row>
 491         <row>
 492          <entry><literal>LATIN4</literal></entry>
 493          <entry>ISO 8859-4, <acronym>ECMA</> 94</entry>
 494          <entry>North European</entry>
 495          <entry>Yes</entry>
 496          <entry>1</entry>
 497          <entry><literal>ISO88594</></entry>
 498         </row>
 499         <row>
 500          <entry><literal>LATIN5</literal></entry>
 501          <entry>ISO 8859-9, <acronym>ECMA</> 128</entry>
 502          <entry>Turkish</entry>
 503          <entry>Yes</entry>
 504          <entry>1</entry>
 505          <entry><literal>ISO88599</></entry>
 506         </row>
 507         <row>
 508          <entry><literal>LATIN6</literal></entry>
 509          <entry>ISO 8859-10, <acronym>ECMA</> 144</entry>
 510          <entry>Nordic</entry>
 511          <entry>Yes</entry>
 512          <entry>1</entry>
 513          <entry><literal>ISO885910</></entry>
 514         </row>
 515         <row>
 516          <entry><literal>LATIN7</literal></entry>
 517          <entry>ISO 8859-13</entry>
 518          <entry>Baltic</entry>
 519          <entry>Yes</entry>
 520          <entry>1</entry>
 521          <entry><literal>ISO885913</></entry>
 522         </row>
 523         <row>
 524          <entry><literal>LATIN8</literal></entry>
 525          <entry>ISO 8859-14</entry>
 526          <entry>Celtic</entry>
 527          <entry>Yes</entry>
 528          <entry>1</entry>
 529          <entry><literal>ISO885914</></entry>
 530         </row>
 531         <row>
 532          <entry><literal>LATIN9</literal></entry>
 533          <entry>ISO 8859-15</entry>
 534          <entry>LATIN1 with Euro and accents</entry>
 535          <entry>Yes</entry>
 536          <entry>1</entry>
 537          <entry><literal>ISO885915</></entry>
 538         </row>
 539         <row>
 540          <entry><literal>LATIN10</literal></entry>
 541          <entry>ISO 8859-16, <acronym>ASRO</> SR 14111</entry>
 542          <entry>Romanian</entry>
 543          <entry>Yes</entry>
 544          <entry>1</entry>
 545          <entry><literal>ISO885916</></entry>
 546         </row>
 547         <row>
 548          <entry><literal>MULE_INTERNAL</literal></entry>
 549          <entry>Mule internal code</entry>
 550          <entry>Multilingual Emacs</entry>
 551          <entry>Yes</entry>
 552          <entry>1-4</entry>
 553          <entry></entry>
 554         </row>
 555         <row>
 556          <entry><literal>SJIS</literal></entry>
 557          <entry>Shift JIS</entry>
 558          <entry>Japanese</entry>
 559          <entry>No</entry>
 560          <entry>1-2</entry>
 561          <entry><literal>Mskanji</>, <literal>ShiftJIS</>, <literal>WIN932</>, <literal>Windows932</></entry>
 562         </row>
 563         <row>
 564          <entry><literal>SHIFT_JIS_2004</literal></entry>
 565          <entry>Shift JIS, JIS X 0213</entry>
 566          <entry>Japanese</entry>
 567          <entry>No</entry>
 568          <entry>1-2</entry>
 569          <entry></entry>
 570         </row>
 571         <row>
 572          <entry><literal>SQL_ASCII</literal></entry>
 573          <entry>unspecified (see text)</entry>
 574          <entry><emphasis>any</></entry>
 575          <entry>Yes</entry>
 576          <entry>1</entry>
 577          <entry></entry>
 578         </row>
 579         <row>
 580          <entry><literal>UHC</literal></entry>
 581          <entry>Unified Hangul Code</entry>
 582          <entry>Korean</entry>
 583          <entry>No</entry>
 584          <entry>1-2</entry>
 585          <entry><literal>WIN949</>, <literal>Windows949</></entry>
 586         </row>
 587         <row>
 588          <entry><literal>UTF8</literal></entry>
 589          <entry>Unicode, 8-bit</entry>
 590          <entry><emphasis>all</></entry>
 591          <entry>Yes</entry>
 592          <entry>1-4</entry>
 593          <entry><literal>Unicode</></entry>
 594         </row>
 595         <row>
 596          <entry><literal>WIN866</literal></entry>
 597          <entry>Windows CP866</entry>
 598          <entry>Cyrillic</entry>
 599          <entry>Yes</entry>
 600          <entry>1</entry>
 601          <entry><literal>ALT</></entry>
 602         </row>
 603         <row>
 604          <entry><literal>WIN874</literal></entry>
 605          <entry>Windows CP874</entry>
 606          <entry>Thai</entry>
 607          <entry>Yes</entry>
 608          <entry>1</entry>
 609          <entry></entry>
 610         </row>
 611         <row>
 612          <entry><literal>WIN1250</literal></entry>
 613          <entry>Windows CP1250</entry>
 614          <entry>Central European</entry>
 615          <entry>Yes</entry>
 616          <entry>1</entry>
 617          <entry></entry>
 618         </row>
 619         <row>
 620          <entry><literal>WIN1251</literal></entry>
 621          <entry>Windows CP1251</entry>
 622          <entry>Cyrillic</entry>
 623          <entry>Yes</entry>
 624          <entry>1</entry>
 625          <entry><literal>WIN</></entry>
 626         </row>
 627         <row>
 628          <entry><literal>WIN1252</literal></entry>
 629          <entry>Windows CP1252</entry>
 630          <entry>Western European</entry>
 631          <entry>Yes</entry>
 632          <entry>1</entry>
 633          <entry></entry>
 634         </row>
 635         <row>
 636          <entry><literal>WIN1253</literal></entry>
 637          <entry>Windows CP1253</entry>
 638          <entry>Greek</entry>
 639          <entry>Yes</entry>
 640          <entry>1</entry>
 641          <entry></entry>
 642         </row>
 643         <row>
 644          <entry><literal>WIN1254</literal></entry>
 645          <entry>Windows CP1254</entry>
 646          <entry>Turkish</entry>
 647          <entry>Yes</entry>
 648          <entry>1</entry>
 649          <entry></entry>
 650         </row>
 651         <row>
 652          <entry><literal>WIN1255</literal></entry>
 653          <entry>Windows CP1255</entry>
 654          <entry>Hebrew</entry>
 655          <entry>Yes</entry>
 656          <entry>1</entry>
 657          <entry></entry>
 658         </row>
 659         <row>
 660          <entry><literal>WIN1256</literal></entry>
 661          <entry>Windows CP1256</entry>
 662          <entry>Arabic</entry>
 663          <entry>Yes</entry>
 664          <entry>1</entry>
 665          <entry></entry>
 666         </row>
 667         <row>
 668          <entry><literal>WIN1257</literal></entry>
 669          <entry>Windows CP1257</entry>
 670          <entry>Baltic</entry>
 671          <entry>Yes</entry>
 672          <entry>1</entry>
 673          <entry></entry>
 674         </row>
 675         <row>
 676          <entry><literal>WIN1258</literal></entry>
 677          <entry>Windows CP1258</entry>
 678          <entry>Vietnamese</entry>
 679          <entry>Yes</entry>
 680          <entry>1</entry>
 681          <entry><literal>ABC</>, <literal>TCVN</>, <literal>TCVN5712</>, <literal>VSCII</></entry>
 682         </row>
 683        </tbody>
 684       </tgroup>
 685      </table>
 686
 687      <para>
 688       Not all <acronym>API</>s support all the listed character sets. For example, the
 689       <productname>PostgreSQL</>
 690       JDBC driver does not support <literal>MULE_INTERNAL</>, <literal>LATIN6</>,
 691       <literal>LATIN8</>, and <literal>LATIN10</>.
 692      </para>
 693
 694      <para>
 695       The <literal>SQL_ASCII</> setting behaves considerably differently
 696       from the other settings.  When the server character set is
 697       <literal>SQL_ASCII</>, the server interprets byte values 0-127
 698       according to the ASCII standard, while byte values 128-255 are taken
 699       as uninterpreted characters.  No encoding conversion will be done when
 700       the setting is <literal>SQL_ASCII</>.  Thus, this setting is not so
 701       much a declaration that a specific encoding is in use, as a declaration
 702       of ignorance about the encoding.  In most cases, if you are
 703       working with any non-ASCII data, it is unwise to use the
 704       <literal>SQL_ASCII</> setting, because
 705       <productname>PostgreSQL</productname> will be unable to help you by
 706       converting or validating non-ASCII characters.
 707      </para>
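     <para>
      To judge whether data would be affected, it can help to count the
      bytes that fall outside the ASCII range.  The following shell
      sketch does so for a single string; the function
      <literal>count_non_ascii_bytes</literal> is a hypothetical helper
      and not a <productname>PostgreSQL</productname> tool.
     </para>

```shell
# Count the bytes in a string that lie in the range 128-255, that is,
# the bytes an SQL_ASCII server would leave uninterpreted.
# count_non_ascii_bytes is a hypothetical helper for illustration only.
count_non_ascii_bytes() {
    # Delete bytes 0-127 and count what is left; LC_ALL=C makes tr work
    # on bytes, and the final tr strips padding that some wc versions add.
    printf '%s' "$1" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

count_non_ascii_bytes 'plain text'
# The octal escapes below are the two UTF-8 bytes of an accented "e".
count_non_ascii_bytes "$(printf 'caf\303\251')"
```

     <para>
      A result of zero means the string is pure ASCII; anything else is
      data that the server would store without conversion or validation
      under <literal>SQL_ASCII</>.
     </para>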
 708     </sect2>
 709
 710    <sect2>
 711     <title>Setting the Character Set</title>
 712
 713     <para>
 714      <command>initdb</> defines the default character set
 715      for a <productname>PostgreSQL</productname> cluster. For example,
 716
 717 <screen>
 718 initdb -E EUC_JP
 719 </screen>
 720
 721      sets the default character set (encoding) to
 722      <literal>EUC_JP</literal> (Extended Unix Code for Japanese).  You
 723      can use <option>--encoding</option> instead of
 724      <option>-E</option> if you prefer the long option form.
 725      If no <option>-E</> or <option>--encoding</option> option is
 726      given, <command>initdb</> attempts to determine the appropriate
 727      encoding to use based on the specified or default locale.
 728     </para>
 729
 730     <para>
 731      You can specify a non-default encoding at database creation time,
 732      provided that the encoding is compatible with the selected locale:
 733
 734 <screen>
 735 createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean
 736 </screen>
 737
 738      This will create a database named <literal>korean</literal> that
 739      uses the character set <literal>EUC_KR</literal>, and the locale <literal>ko_KR.euckr</literal>.
 740      Another way to accomplish this is to use this SQL command:
 741
 742 <programlisting>
 743 CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
 744 </programlisting>
 745
 746      The encoding for a database is stored in the system catalog
 747      <literal>pg_database</literal>.  You can see it by using the
 748      <option>-l</option> option or the <command>\l</command> command
 749      of <command>psql</command>.
 750
 751 <screen>
 752 $ <userinput>psql -l</userinput>
 753                                          List of databases
 754    Name    |  Owner   | Encoding  |  Collation  |    Ctype    |          Access Privileges
 755 -----------+----------+-----------+-------------+-------------+-------------------------------------
 756  clocaledb | hlinnaka | SQL_ASCII | C           | C           |
 757  englishdb | hlinnaka | UTF8      | en_GB.UTF8  | en_GB.UTF8  |
 758  japanese  | hlinnaka | UTF8      | ja_JP.UTF8  | ja_JP.UTF8  |
 759  korean    | hlinnaka | EUC_KR    | ko_KR.euckr | ko_KR.euckr |
 760  postgres  | hlinnaka | UTF8      | fi_FI.UTF8  | fi_FI.UTF8  |
 761  template0 | hlinnaka | UTF8      | fi_FI.UTF8  | fi_FI.UTF8  | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
 762  template1 | hlinnaka | UTF8      | fi_FI.UTF8  | fi_FI.UTF8  | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
 763 (7 rows)
 764 </screen>
 765     </para>
 766
 767     <important>
 768      <para>
 769       On most modern operating systems, <productname>PostgreSQL</productname>
 770       can determine which character set is implied by an <envar>LC_CTYPE</>
 771       setting, and it will enforce that only the correct database encoding is
 772       used.  On older systems it is your responsibility to ensure that you use
 773       the encoding expected by the locale you have selected.  A mistake in
 774       this area is likely to lead to strange misbehavior of locale-dependent
 775       operations such as sorting.
 776      </para>
 777
 778      <para>
 779       <productname>PostgreSQL</productname> will allow superusers to create
 780       databases with <literal>SQL_ASCII</> encoding even when
 781       <envar>LC_CTYPE</> is not <literal>C</> or <literal>POSIX</>.  As noted
 782       above, <literal>SQL_ASCII</> does not enforce that the data stored in
 783       the database has any particular encoding, and so this choice poses risks
 784       of locale-dependent misbehavior.  Using this combination of settings is
 785       deprecated and may someday be forbidden altogether.
 786      </para>
 787     </important>
 788    </sect2>
 789
 790    <sect2>
 791     <title>Automatic Character Set Conversion Between Server and Client</title>
 792
 793     <para>
 794      <productname>PostgreSQL</productname> supports automatic
 795      character set conversion between server and client for certain
 796      character set combinations. The conversion information is stored in the
 797      <literal>pg_conversion</> system catalog.  <productname>PostgreSQL</>
 798      comes with some predefined conversions, as shown in <xref
 799      linkend="multibyte-translation-table">. You can create a new
 800      conversion using the SQL command <command>CREATE CONVERSION</command>.
 801     </para>
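
    <para>
     As a rough illustration, the default conversions known to a particular
     server can be listed with a query along the following lines.  This is
     only a sketch: the column aliases are arbitrary, and the rows returned
     depend on the installed version and on any conversions added locally.

<programlisting>
-- List the default conversion for each source/destination encoding pair.
SELECT conname,
       pg_encoding_to_char(conforencoding) AS source_encoding,
       pg_encoding_to_char(contoencoding) AS destination_encoding
  FROM pg_conversion
 WHERE condefault
 ORDER BY source_encoding, destination_encoding;
</programlisting>
    </para>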
 802
 803      <table id="multibyte-translation-table">
 804       <title>Client/Server Character Set Conversions</title>
 805       <tgroup cols="2">
 806        <thead>
 807         <row>
 808          <entry>Server Character Set</entry>
 809          <entry>Available Client Character Sets</entry>
 810         </row>
 811        </thead>
 812        <tbody>
 813         <row>
 814          <entry><literal>BIG5</literal></entry>
 815          <entry><emphasis>not supported as a server encoding</emphasis>
 816          </entry>
 817         </row>
 818         <row>
 819          <entry><literal>EUC_CN</literal></entry>
 820          <entry><emphasis>EUC_CN</emphasis>,
 821          <literal>MULE_INTERNAL</literal>,
 822          <literal>UTF8</literal>
 823          </entry>
 824         </row>
 825         <row>
 826          <entry><literal>EUC_JP</literal></entry>
 827          <entry><emphasis>EUC_JP</emphasis>,
 828          <literal>MULE_INTERNAL</literal>,
 829          <literal>SJIS</literal>,
 830          <literal>UTF8</literal>
 831          </entry>
 832         </row>
 833         <row>
 834          <entry><literal>EUC_KR</literal></entry>
 835          <entry><emphasis>EUC_KR</emphasis>,
 836          <literal>MULE_INTERNAL</literal>,
 837          <literal>UTF8</literal>
 838          </entry>
 839         </row>
 840         <row>
 841          <entry><literal>EUC_TW</literal></entry>
 842          <entry><emphasis>EUC_TW</emphasis>,
 843          <literal>BIG5</literal>,
 844          <literal>MULE_INTERNAL</literal>,
 845          <literal>UTF8</literal>
 846          </entry>
 847         </row>
 848         <row>
 849          <entry><literal>GB18030</literal></entry>
 850          <entry><emphasis>not supported as a server encoding</emphasis>
 851          </entry>
 852         </row>
 853         <row>
 854          <entry><literal>GBK</literal></entry>
 855          <entry><emphasis>not supported as a server encoding</emphasis>
 856          </entry>
 857         </row>
 858         <row>
 859          <entry><literal>ISO_8859_5</literal></entry>
 860          <entry><emphasis>ISO_8859_5</emphasis>,
 861          <literal>KOI8</literal>,
 862          <literal>MULE_INTERNAL</literal>,
 863          <literal>UTF8</literal>,
 864          <literal>WIN866</literal>,
 865          <literal>WIN1251</literal>
 866          </entry>
 867         </row>
 868         <row>
 869          <entry><literal>ISO_8859_6</literal></entry>
 870          <entry><emphasis>ISO_8859_6</emphasis>,
 871          <literal>UTF8</literal>
 872          </entry>
 873         </row>
 874         <row>
 875          <entry><literal>ISO_8859_7</literal></entry>
 876          <entry><emphasis>ISO_8859_7</emphasis>,
 877          <literal>UTF8</literal>
 878          </entry>
 879         </row>
 880         <row>
 881          <entry><literal>ISO_8859_8</literal></entry>
 882          <entry><emphasis>ISO_8859_8</emphasis>,
 883          <literal>UTF8</literal>
 884          </entry>
 885         </row>
 886         <row>
 887          <entry><literal>JOHAB</literal></entry>
 888          <entry><emphasis>JOHAB</emphasis>,
 889          <literal>UTF8</literal>
 890          </entry>
 891         </row>
 892         <row>
 893          <entry><literal>KOI8</literal></entry>
 894          <entry><emphasis>KOI8</emphasis>,
 895          <literal>ISO_8859_5</literal>,
 896          <literal>MULE_INTERNAL</literal>,
 897          <literal>UTF8</literal>,
 898          <literal>WIN866</literal>,
 899          <literal>WIN1251</literal>
 900          </entry>
 901         </row>
 902         <row>
 903          <entry><literal>LATIN1</literal></entry>
 904          <entry><emphasis>LATIN1</emphasis>,
 905          <literal>MULE_INTERNAL</literal>,
 906          <literal>UTF8</literal>
 907          </entry>
 908         </row>
 909         <row>
 910          <entry><literal>LATIN2</literal></entry>
 911          <entry><emphasis>LATIN2</emphasis>,
 912          <literal>MULE_INTERNAL</literal>,
 913          <literal>UTF8</literal>,
 914          <literal>WIN1250</literal>
 915          </entry>
 916         </row>
 917         <row>
 918          <entry><literal>LATIN3</literal></entry>
 919          <entry><emphasis>LATIN3</emphasis>,
 920          <literal>MULE_INTERNAL</literal>,
 921          <literal>UTF8</literal>
 922          </entry>
 923         </row>
 924         <row>
 925          <entry><literal>LATIN4</literal></entry>
 926          <entry><emphasis>LATIN4</emphasis>,
 927          <literal>MULE_INTERNAL</literal>,
 928          <literal>UTF8</literal>
 929          </entry>
 930         </row>
 931         <row>
 932          <entry><literal>LATIN5</literal></entry>
 933          <entry><emphasis>LATIN5</emphasis>,
 934          <literal>UTF8</literal>
 935          </entry>
 936         </row>
 937         <row>
 938          <entry><literal>LATIN6</literal></entry>
 939          <entry><emphasis>LATIN6</emphasis>,
 940          <literal>UTF8</literal>
 941          </entry>
 942         </row>
 943         <row>
 944          <entry><literal>LATIN7</literal></entry>
 945          <entry><emphasis>LATIN7</emphasis>,
 946          <literal>UTF8</literal>
 947          </entry>
 948         </row>
 949         <row>
 950          <entry><literal>LATIN8</literal></entry>
 951          <entry><emphasis>LATIN8</emphasis>,
 952          <literal>UTF8</literal>
 953          </entry>
 954         </row>
 955         <row>
 956          <entry><literal>LATIN9</literal></entry>
 957          <entry><emphasis>LATIN9</emphasis>,
 958          <literal>UTF8</literal>
 959          </entry>
 960         </row>
 961         <row>
 962          <entry><literal>LATIN10</literal></entry>
 963          <entry><emphasis>LATIN10</emphasis>,
 964          <literal>UTF8</literal>
 965          </entry>
 966         </row>
 967         <row>
 968          <entry><literal>MULE_INTERNAL</literal></entry>
 969          <entry><emphasis>MULE_INTERNAL</emphasis>,
 970           <literal>BIG5</literal>,
 971           <literal>EUC_CN</literal>,
 972           <literal>EUC_JP</literal>,
 973           <literal>EUC_KR</literal>,
 974           <literal>EUC_TW</literal>,
 975           <literal>ISO_8859_5</literal>,
 976           <literal>KOI8</literal>,
 977           <literal>LATIN1</literal> to <literal>LATIN4</literal>,
 978           <literal>SJIS</literal>,
 979           <literal>WIN866</literal>,
 980           <literal>WIN1250</literal>,
 981           <literal>WIN1251</literal>
 982          </entry>
 983         </row>
 984         <row>
 985          <entry><literal>SJIS</literal></entry>
 986          <entry><emphasis>not supported as a server encoding</emphasis>
 987          </entry>
 988         </row>
 989         <row>
 990          <entry><literal>SQL_ASCII</literal></entry>
 991          <entry><emphasis>any (no conversion will be performed)</emphasis>
 992          </entry>
 993         </row>
 994         <row>
 995          <entry><literal>UHC</literal></entry>
 996          <entry><emphasis>not supported as a server encoding</emphasis>
 997          </entry>
 998         </row>
 999         <row>
1000          <entry><literal>UTF8</literal></entry>
1001          <entry><emphasis>all supported encodings</emphasis>
1002          </entry>
1003         </row>
1004         <row>
1005          <entry><literal>WIN866</literal></entry>
1006          <entry><emphasis>WIN866</emphasis>,
1007           <literal>ISO_8859_5</literal>,
1008           <literal>KOI8</literal>,
1009           <literal>MULE_INTERNAL</literal>,
1010           <literal>UTF8</literal>,
1011           <literal>WIN1251</literal>
1012          </entry>
1013         </row>
1014         <row>
1015          <entry><literal>WIN874</literal></entry>
1016          <entry><emphasis>WIN874</emphasis>,
1017          <literal>UTF8</literal>
1018          </entry>
1019         </row>
1020         <row>
1021          <entry><literal>WIN1250</literal></entry>
1022          <entry><emphasis>WIN1250</emphasis>,
1023           <literal>LATIN2</literal>,
1024           <literal>MULE_INTERNAL</literal>,
1025           <literal>UTF8</literal>
1026          </entry>
1027         </row>
1028         <row>
1029          <entry><literal>WIN1251</literal></entry>
1030          <entry><emphasis>WIN1251</emphasis>,
1031           <literal>ISO_8859_5</literal>,
1032           <literal>KOI8</literal>,
1033           <literal>MULE_INTERNAL</literal>,
1034           <literal>UTF8</literal>,
1035           <literal>WIN866</literal>
1036          </entry>
1037         </row>
1038         <row>
1039          <entry><literal>WIN1252</literal></entry>
1040          <entry><emphasis>WIN1252</emphasis>,
1041           <literal>UTF8</literal>
1042          </entry>
1043         </row>
1044         <row>
1045          <entry><literal>WIN1253</literal></entry>
1046          <entry><emphasis>WIN1253</emphasis>,
1047           <literal>UTF8</literal>
1048          </entry>
1049         </row>
1050         <row>
1051          <entry><literal>WIN1254</literal></entry>
1052          <entry><emphasis>WIN1254</emphasis>,
1053           <literal>UTF8</literal>
1054          </entry>
1055         </row>
1056         <row>
1057          <entry><literal>WIN1255</literal></entry>
1058          <entry><emphasis>WIN1255</emphasis>,
1059           <literal>UTF8</literal>
1060          </entry>
1061         </row>
1062         <row>
1063          <entry><literal>WIN1256</literal></entry>
1064          <entry><emphasis>WIN1256</emphasis>,
1065          <literal>UTF8</literal>
1066          </entry>
1067         </row>
1068         <row>
1069          <entry><literal>WIN1257</literal></entry>
1070          <entry><emphasis>WIN1257</emphasis>,
1071           <literal>UTF8</literal>
1072          </entry>
1073         </row>
1074         <row>
1075          <entry><literal>WIN1258</literal></entry>
1076          <entry><emphasis>WIN1258</emphasis>,
1077          <literal>UTF8</literal>
1078          </entry>
1079         </row>
1080        </tbody>
1081       </tgroup>
1082      </table>
1083
1084     <para>
1085      To enable automatic character set conversion, you have to
1086      tell <productname>PostgreSQL</productname> the character set
1087      (encoding) you would like to use in the client. There are several
1088      ways to accomplish this:
1089
1090      <itemizedlist>
1091       <listitem>
1092        <para>
1093         Using the <command>\encoding</command> command in
1094         <application>psql</application>.
1095         <command>\encoding</command> allows you to change client
1096         encoding on the fly. For
1097         example, to change the encoding to <literal>SJIS</literal>, type:
1098
1099 <programlisting>
1100 \encoding SJIS
1101 </programlisting>
1102        </para>
1103       </listitem>
1104
1105       <listitem>
1106        <para>
1107         <application>libpq</> (<xref linkend="libpq-control">) has functions to control the client encoding.
1108        </para>
1109       </listitem>
1110
1111       <listitem>
1112        <para>
1113         Using <command>SET client_encoding TO</command>.
1114
1115         This SQL command sets the client encoding for the current session:
1116
1117 <programlisting>
1118 SET CLIENT_ENCODING TO '<replaceable>value</>';
1119 </programlisting>
1120
1121         You can also use the standard SQL syntax <literal>SET NAMES</literal>
1122         for this purpose:
1123
1124 <programlisting>
1125 SET NAMES '<replaceable>value</>';
1126 </programlisting>
1127
1128         To query the current client encoding:
1129
1130 <programlisting>
1131 SHOW client_encoding;
1132 </programlisting>
1133
1134         To return to the default encoding:
1135
1136 <programlisting>
1137 RESET client_encoding;
1138 </programlisting>
1139        </para>
1140       </listitem>
1141
1142       <listitem>
1143        <para>
1144         Using <envar>PGCLIENTENCODING</envar>. If the environment variable
1145         <envar>PGCLIENTENCODING</envar> is defined in the client's
1146         environment, that client encoding is automatically selected
1147         when a connection to the server is made.  (This can
1148         subsequently be overridden using any of the other methods
1149         mentioned above.)
1150        </para>
1151       </listitem>
1152
1153       <listitem>
1154        <para>
1155         Using the configuration variable <xref
1156         linkend="guc-client-encoding">. If the
1157         <varname>client_encoding</> variable is set, that client
1158         encoding is automatically selected when a connection to the
1159         server is made.  (This can subsequently be overridden using any
1160         of the other methods mentioned above.)
1161        </para>
1162       </listitem>
1163
1164      </itemizedlist>
1165     </para>
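
    <para>
     Because <varname>client_encoding</varname> is an ordinary configuration
     variable, a default can also be attached to a database or a role, so
     that it applies to new sessions unless the client requests an encoding
     itself.  The sketch below assumes a database named
     <literal>japanese</literal> and a hypothetical role
     <literal>alice</literal>; substitute your own names.

<programlisting>
-- New sessions in this database start with SJIS as the client encoding.
ALTER DATABASE japanese SET client_encoding TO 'SJIS';

-- New sessions of this (hypothetical) role start with UTF8.
ALTER ROLE alice SET client_encoding TO 'UTF8';

-- Check which client encoding the current session ended up with.
SHOW client_encoding;
</programlisting>

     Such a default can still be overridden within a session by
     <command>SET client_encoding</command>.
    </para>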
1166
1167     <para>
1168      If the conversion of a particular character is not possible,
1169      an error is reported.  For example, suppose you chose
1170      <literal>EUC_JP</literal> for the server and <literal>LATIN1</literal>
1171      for the client: Japanese characters have no representation in
1172      <literal>LATIN1</literal>, so returning them to the client fails.
1173     </para>
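
    <para>
     The same kind of failure can be reproduced inside a single session with
     <function>convert_to</function>, which converts a string from the
     database encoding to a named encoding.  This is a minimal sketch that
     assumes a <literal>UTF8</literal> database, where
     <function>chr</function> interprets its argument as a Unicode code
     point; the exact error text varies between versions and is not shown.

<programlisting>
-- U+00E9 (e with acute accent) exists in LATIN1, so this succeeds.
SELECT convert_to(chr(233), 'LATIN1');

-- U+65E5 (a kanji) has no LATIN1 equivalent, so this raises an error.
SELECT convert_to(chr(26085), 'LATIN1');
</programlisting>
    </para>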
1174
1175     <para>
1176      If the client character set is defined as <literal>SQL_ASCII</>,
1177      encoding conversion is disabled, regardless of the server's character
1178      set.  Just as for the server, use of <literal>SQL_ASCII</> is unwise
1179      unless you are working with all-ASCII data.
1180     </para>
1181    </sect2>
1182
1183    <sect2>
1184     <title>Further Reading</title>
1185
1186     <para>
1187      These sources are good starting points for learning about the various
1188      kinds of encoding systems.
1189
1190      <variablelist>
1191       <varlistentry>
1192        <term><ulink url="http://www.i18ngurus.com/docs/984813247.html"></ulink></term>
1193
1194        <listitem>
1195         <para>
1196          An extensive collection of documents about character sets, encodings,
1197          and code pages.
1198         </para>
1199        </listitem>
1200       </varlistentry>
1201
1202       <varlistentry>
1203        <term><ulink url="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf"></ulink></term>
1204
1205        <listitem>
1206         <para>
1207          Detailed explanations of <literal>EUC_JP</literal>,
1208          <literal>EUC_CN</literal>, <literal>EUC_KR</literal>, and
1209          <literal>EUC_TW</literal> appear in section 3.2.
1210         </para>
1211        </listitem>
1212       </varlistentry>
1213
1214       <varlistentry>
1215        <term><ulink url="http://www.unicode.org/"></ulink></term>
1216
1217        <listitem>
1218         <para>
1219          The web site of the Unicode Consortium.
1220         </para>
1221        </listitem>
1222       </varlistentry>
1223
1224       <varlistentry>
1225        <term>RFC 3629</term>
1226
1227        <listitem>
1228         <para>
1229          <acronym>UTF</acronym>-8 is defined here.
1230         </para>
1231        </listitem>
1232       </varlistentry>
1233      </variablelist>
1234     </para>
1235    </sect2>
1236
1237   </sect1>
1238
1239 </chapter>