manual/message.texi

   1 @node Message Translation, Searching and Sorting, Locales, Top
   2 @c %MENU% How to make the program speak the user's language
   3 @chapter Message Translation
   4
   5 The program's interface with the user should be designed to ease the user's
   6 task.  One way to ease the user's task is to use messages in whatever
   7 language the user prefers.
   8
   9 Printing messages in different languages can be implemented in different
  10 ways.  One could add all the different languages in the source code and
  11 choose among the variants every time a message has to be printed.  This is
  12 certainly not a good solution since extending the set of languages is
  13 cumbersome (the code must be changed) and the code itself can become
  14 really big with dozens of message sets.
  15
  16 A better solution is to keep the message sets for each language
  17 in separate files which are loaded at runtime depending on the language
  18 selection of the user.
  19
  20 @Theglibc{} provides two different sets of functions to support
  21 message translation.  The problem is that neither of the interfaces is
  22 officially defined by the POSIX standard.  The @code{catgets} family of
  23 functions is defined in the X/Open standard but this is derived from
  24 industry decisions and therefore not necessarily based on reasonable
  25 decisions.
  26
  27 As mentioned above, the message catalog handling provides easy
  28 extendability by using external data files which contain the message
  29 translations.  I.e., these files contain for each of the messages used
  30 in the program a translation for the appropriate language.  So the tasks
  31 of the message handling functions are
  32
  33 @itemize @bullet
  34 @item
  35 locate the external data file with the appropriate translations
  36 @item
  37 load the data and make it possible to address the messages
  38 @item
  39 map a given key to the translated message
  40 @end itemize
  41
  42 The two approaches mainly differ in the implementation of this last
  43 step.  Decisions made in the last step influence the rest of the design.
  44
  45 @menu
  46 * Message catalogs a la X/Open::  The @code{catgets} family of functions.
  47 * The Uniforum approach::         The @code{gettext} family of functions.
  48 @end menu
  49
  50
  51 @node Message catalogs a la X/Open
  52 @section X/Open Message Catalog Handling
  53
  54 The @code{catgets} functions are based on the simple scheme:
  55
  56 @quotation
  57 Associate every message to translate in the source code with a unique
  58 identifier.  To retrieve a message from a catalog file solely the
  59 identifier is used.
  60 @end quotation
  61
  62 This means for the author of the program that s/he will have to make
  63 sure the meaning of the identifier in the program code and in the
  64 message catalogs is always the same.
  65
  66 Before a message can be translated the catalog file must be located.
  67 The user of the program must be able to guide the responsible function
  68 to find whatever catalog the user wants.  This is separated from what
  69 the programmer had in mind.
  70
  71 All the types, constants and functions for the @code{catgets} functions
  72 are defined/declared in the @file{nl_types.h} header file.
  73
  74 @menu
  75 * The catgets Functions::      The @code{catgets} function family.
  76 * The message catalog files::  Format of the message catalog files.
  77 * The gencat program::         How to generate message catalog files which
  78                                 can be used by the functions.
  79 * Common Usage::               How to use the @code{catgets} interface.
  80 @end menu
  81
  82
  83 @node The catgets Functions
  84 @subsection The @code{catgets} function family
  85
  86 @comment nl_types.h
  87 @comment X/Open
  88 @deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
  89 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}}
  90 @c catopen @mtsenv @ascuheap @acsmem
  91 @c  strchr ok
  92 @c  setlocale(,NULL) ok
  93 @c  getenv @mtsenv
  94 @c  strlen ok
  95 @c  alloca ok
  96 @c  stpcpy ok
  97 @c  malloc @ascuheap @acsmem
  98 @c  __open_catalog @ascuheap @acsmem
  99 @c   strchr ok
 100 @c   open_not_cancel_2 @acsfd
 101 @c   strlen ok
 102 @c   ENOUGH ok
 103 @c    alloca ok
 104 @c    memcpy ok
 105 @c   fxstat64 ok
 106 @c   __set_errno ok
 107 @c   mmap @acsmem
 108 @c   malloc dup @ascuheap @acsmem
 109 @c   read_not_cancel ok
 110 @c   free dup @ascuheap @acsmem
 111 @c   munmap ok
 112 @c   close_not_cancel_no_status ok
 113 @c  free @ascuheap @acsmem
 114 The @code{catopen} function tries to locate the message data file named
 115 @var{cat_name} and loads it when found.  The return value is of an
 116 opaque type and can be used in calls to the other functions to refer to
 117 this loaded catalog.
 118
 119 The return value is @code{(nl_catd) -1} in case the function failed and
 120 no catalog was loaded.  The global variable @var{errno} contains a code
 121 for the error causing the failure.  But even if the function call
 122 succeeded this does not mean that all messages can be translated.
 123
 124 Locating the catalog file must happen in a way which lets the user of
 125 the program influence the decision.  It is up to the user to decide
 126 about the language to use and sometimes it is useful to use alternate
 127 catalog files.  All this can be specified by the user by setting some
 128 environment variables.
 129
 130 The first problem is to find out where all the message catalogs are
 131 stored.  Every program could have its own place to keep all the
 132 different files but usually the catalog files are grouped by languages
 133 and the catalogs for all programs are kept in the same place.
 134
 135 @cindex NLSPATH environment variable
 136 To tell the @code{catopen} function where the catalog for the program
 137 can be found the user can set the environment variable @code{NLSPATH} to
 138 a value which describes her/his choice.  Since this value must be usable
 139 for different languages and locales it cannot be a simple string.
 140 Instead it is a format string (similar to @code{printf}'s).  An example
 141 is
 142
 143 @smallexample
 144 /usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
 145 @end smallexample
 146
 147 First one can see that more than one directory can be specified (with
 148 the usual syntax of separating them by colons).  The next things to
 149 observe are the format elements, @code{%L} and @code{%N} in this case.
 150 The @code{catopen} function knows about several of them and each one
 151 is replaced by a different value.
 152
 153 @table @code
 154 @item %N
 155 This format element is substituted with the name of the catalog file.
 156 This is the value of the @var{cat_name} argument given to
 157 @code{catopen}.
 158
 159 @item %L
 160 This format element is substituted with the name of the currently
 161 selected locale for translating messages.  How this is determined is
 162 explained below.
 163
 164 @item %l
 165 (This is the lowercase ell.) This format element is substituted with the
 166 language element of the locale name.  The string describing the selected
 167 locale is expected to have the form
 168 @code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
 169 first part @var{lang}.
 170
 171 @item %t
 172 This format element is substituted by the territory part @var{terr} of
 173 the name of the currently selected locale.  See the explanation of the
 174 format above.
 175
 176 @item %c
 177 This format element is substituted by the codeset part @var{codeset} of
 178 the name of the currently selected locale.  See the explanation of the
 179 format above.
 180
 181 @item %%
 182 Since @code{%} is used as a meta character there must be a way to
 183 express the @code{%} character in the result itself.  Using @code{%%}
 184 does this just like it works for @code{printf}.
 185 @end table
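
The substitution rules in the table above can be sketched in C as
follows.  This is only an illustration under stated assumptions, not the
code used by @theglibc{}: the function name @code{expand_nls_template},
its helper @code{append}, and the choice to silently drop unknown format
elements are all hypothetical.  The sketch expands a single template;
splitting the value of @code{NLSPATH} at the colons is left out.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of any library: append the first N
   bytes of SRC to the string in OUT, which has room for OUTSIZE bytes
   in total.  Return 0 on success, -1 if the result would not fit.  */
static int
append (char *out, size_t outsize, const char *src, size_t n)
{
  size_t used = strlen (out);
  if (n >= outsize - used)
    return -1;
  memcpy (out + used, src, n);
  out[used + n] = '\0';
  return 0;
}

/* Expand one NLSPATH template TMPL for the catalog NAME and the locale
   name LOCALE (of the form lang[_terr[.codeset]]) into OUT.  Return 0
   on success, -1 if OUT is too small.  */
int
expand_nls_template (const char *tmpl, const char *name,
                     const char *locale, char *out, size_t outsize)
{
  assert (tmpl != NULL && name != NULL && locale != NULL && outsize > 0);

  /* Split the locale name into its three parts.  */
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = locale[lang_len] == '_' ? locale + lang_len + 1 : "";
  size_t terr_len = strcspn (terr, ".");
  const char *dot = strchr (locale, '.');
  const char *codeset = dot != NULL ? dot + 1 : "";

  out[0] = '\0';
  for (const char *p = tmpl; *p != '\0'; ++p)
    {
      int rc;
      if (*p != '%' || p[1] == '\0')
        rc = append (out, outsize, p, 1);        /* Ordinary character.  */
      else
        switch (*++p)
          {
          case 'N': rc = append (out, outsize, name, strlen (name)); break;
          case 'L': rc = append (out, outsize, locale, strlen (locale)); break;
          case 'l': rc = append (out, outsize, locale, lang_len); break;
          case 't': rc = append (out, outsize, terr, terr_len); break;
          case 'c': rc = append (out, outsize, codeset, strlen (codeset)); break;
          case '%': rc = append (out, outsize, "%", 1); break;
          default:  rc = 0; break;   /* Assumption: drop unknown elements.  */
          }
      if (rc != 0)
        return -1;
    }
  return 0;
}
```

With the template @file{/usr/share/locale/%L/LC_MESSAGES/%N}, the
catalog name @file{hello.cat} and the locale name @code{de_DE.UTF-8} the
sketch produces
@file{/usr/share/locale/de_DE.UTF-8/LC_MESSAGES/hello.cat}.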
 186
 187
 188 Using @code{NLSPATH} allows arbitrary directories to be searched for
 189 message catalogs while still allowing different languages to be used.
 190 If the @code{NLSPATH} environment variable is not set, the default value
 191 is
 192
 193 @smallexample
 194 @var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
 195 @end smallexample
 196
 197 @noindent
 198 where @var{prefix} is given to @code{configure} while installing @theglibc{}
 199 (this value is in many cases @code{/usr} or the empty string).
 200
 201 The remaining problem is to decide which locale must be used.  Its name
 202 decides about the substitution of the format elements mentioned above.
 203 First of all the user can specify a path in the message catalog name
 204 (i.e., the name contains a slash character).  In this situation the
 205 @code{NLSPATH} environment variable is not used.  The catalog must exist
 206 as specified in the program, perhaps relative to the current working
 207 directory.  This situation is not desirable and catalog names never
 208 should be written this way.  Besides this, this behavior is not portable
 209 to all other platforms providing the @code{catgets} interface.
 210
 211 @cindex LC_ALL environment variable
 212 @cindex LC_MESSAGES environment variable
 213 @cindex LANG environment variable
 214 Otherwise the values of environment variables from the standard
 215 environment are examined (@pxref{Standard Environment}).  Which
 216 variables are examined is decided by the @var{flag} parameter of
 217 @code{catopen}.  If the value is @code{NL_CAT_LOCALE} (which is defined
 218 in @file{nl_types.h}) then the @code{catopen} function uses the name of
 219 the locale currently selected for the @code{LC_MESSAGES} category.
 220
 221 If @var{flag} is zero the @code{LANG} environment variable is examined.
 222 This is a left-over from the early days when the concept of locales
 223 had not even reached the level of POSIX locales.
 224
 225 The environment variable and the locale name should have a value of the
 226 form @code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above.
 227 If no environment variable is set the @code{"C"} locale is used which
 228 prevents any translation.
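
The selection just described can be sketched as a small helper.  The
name @code{catalog_locale_name} is hypothetical, and treating an empty
@code{LANG} value like an unset one is an assumption of this sketch, not
something the text above specifies.

```c
#include <assert.h>
#include <locale.h>
#include <nl_types.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the locale name that would be
   substituted for %L, following the description above.  */
const char *
catalog_locale_name (int flag)
{
  assert (flag == 0 || flag == NL_CAT_LOCALE);

  if (flag == NL_CAT_LOCALE)
    /* Name of the locale currently selected for LC_MESSAGES.  */
    return setlocale (LC_MESSAGES, NULL);

  /* flag == 0: only the LANG environment variable is examined.  */
  const char *lang = getenv ("LANG");
  return (lang != NULL && strlen (lang) > 0) ? lang : "C";
}
```

Keeping this decision in one place makes it easy to see that the two
values of @var{flag} consult different sources for the locale name.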
 229
 230 The return value of @code{catgets} (described below) is in any case a
 231 valid string.  Either it is a translation from a message catalog or it
 232 is the same as the @var{string} parameter.  So a piece of code to decide
 233 whether a translation actually happened must look like this:
 234
 235 @smallexample
 236 @{
 237   char *trans = catgets (desc, set, msg, input_string);
 238   if (trans == input_string)
 239     @{
 240       /* Something went wrong.  */
 241     @}
 242 @}
 243 @end smallexample
 244
 245 @noindent
 246 When an error occurs the global variable @var{errno} is set to
 247
 248 @table @var
 249 @item EBADF
 250 The catalog does not exist.
 251 @item ENOMSG
 252 The set/message tuple does not name an existing element in the
 253 message catalog.
 254 @end table
 255
 256 While it sometimes can be useful to test for errors, programs normally
 257 will avoid any test.  If the translation is not available it is no big
 258 problem if the original, untranslated message is printed.  Either the
 259 user understands this as well or s/he will look for the reason why the
 260 messages are not translated.
 261 @end deftypefun
 262
 263 Please note that if @var{flag} is zero the selected locale does not
 264 depend on a call to the @code{setlocale} function.  It is not necessary
 265 that the locale data files for this locale exist and calling
 266 @code{setlocale} succeeds.  In this case the @code{catopen} function
 267 directly reads the value of the @code{LANG} environment variable.
 268
 269
 270 @deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
 271 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
 272 The function @code{catgets} has to be used to access the message catalog
 273 previously opened using the @code{catopen} function.  The
 274 @var{catalog_desc} parameter must be a value previously returned by
 275 @code{catopen}.
 276
 277 The next two parameters, @var{set} and @var{message}, reflect the
 278 internal organization of the message catalog files.  This will be
 279 explained in detail below.  For now it is interesting to know that a
 280 catalog can consist of several sets and the messages in each set are
 281 individually numbered.  Neither the set numbers nor the message numbers
 282 need to be consecutive.  They can be arbitrarily chosen.
 283 But each message (unless equal to another one) must have its own unique
 284 pair of set and message numbers.
 285
 286 Since it is not guaranteed that the message catalog for the language
 287 selected by the user exists the last parameter @var{string} helps to
 288 handle this case gracefully.  If no matching string can be found
 289 @var{string} is returned.  This means for the programmer that
 290
 291 @itemize @bullet
 292 @item
 293 the @var{string} parameters should contain reasonable text (this also
 294 helps to understand the program since otherwise there would be no hint
 295 on the string which is expected to be returned).
 296 @item
 297 all @var{string} arguments should be written in the same language.
 298 @end itemize
 299 @end deftypefun
 300
 301 It is somewhat uncomfortable to write a program using the @code{catgets}
 302 functions if no supporting functionality is available.  Since each
 303 set/message number tuple must be unique the programmer must keep lists
 304 of the messages at the same time the code is written.  And the work
 305 between several people working on the same project must be coordinated.
 306 We will see how some of these problems can be relaxed a bit (@pxref{Common
 307 Usage}).
 308
 309 @deftypefun int catclose (nl_catd @var{catalog_desc})
 310 @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acucorrupt{} @acsmem{}}}
 311 @c catclose @ascuheap @acucorrupt @acsmem
 312 @c  __set_errno ok
 313 @c  munmap ok
 314 @c  free @ascuheap @acsmem
 315 The @code{catclose} function can be used to free the resources
 316 associated with a message catalog which previously was opened by a call
 317 to @code{catopen}.  If the resources can be successfully freed the
 318 function returns @code{0}.  Otherwise it returns @code{@minus{}1} and the
 319 global variable @var{errno} is set.  Errors can occur if the catalog
 320 descriptor @var{catalog_desc} is not valid in which case @var{errno} is
 321 set to @code{EBADF}.
 322 @end deftypefun
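
The three functions are typically used together.  The following sketch
shows one way to do this with explicit error checks.  The helper name
@code{lookup_message} and its interface are hypothetical; the sketch
copies the text into a caller-supplied buffer before closing the
catalog, on the assumption that the string returned by @code{catgets}
may not remain usable afterwards.

```c
#include <assert.h>
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: look up message MSG of set SET in the catalog
   named CAT_NAME and copy the result to BUF.  Return 1 if a
   translation was found, 0 if FALLBACK was used instead.  */
int
lookup_message (const char *cat_name, int set, int msg,
                const char *fallback, char *buf, size_t bufsize)
{
  assert (cat_name != NULL && fallback != NULL && bufsize > 0);

  nl_catd desc = catopen (cat_name, NL_CAT_LOCALE);
  if (desc == (nl_catd) -1)
    {
      /* No catalog was loaded; errno holds the reason.  */
      snprintf (buf, bufsize, "%s", fallback);
      return 0;
    }

  /* catgets returns FALLBACK itself when the lookup fails.  */
  const char *text = catgets (desc, set, msg, fallback);
  int translated = text != fallback;

  /* Copy before closing: the returned string may point into the
     catalog data.  */
  snprintf (buf, bufsize, "%s", text);

  if (catclose (desc) != 0)
    perror ("catclose");
  return translated;
}
```

A real program would normally open the catalog once at startup instead
of once per message; the sketch trades that efficiency for brevity.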
 323
 324
 325 @node The message catalog files
 326 @subsection  Format of the message catalog files
 327
 328 The only reasonable way to translate all the messages of a program and
 329 store the result in a message catalog file which can be read by the
 330 @code{catopen} function is to give all the message text to the
 331 translator and let her/him translate them all.  I.e., we must have a
 332 file with entries which associate the set/message tuple with a specific
 333 translation.  This file format is specified in the X/Open standard and
 334 is as follows:
 335
 336 @itemize @bullet
 337 @item
 338 Lines containing only whitespace characters or empty lines are ignored.
 339
 340 @item
 341 Lines which contain as the first non-whitespace character a @code{$}
 342 followed by a whitespace character are comments and are also ignored.
 343
 344 @item
 345 If a line contains as the first non-whitespace characters the sequence
 346 @code{$set} followed by a whitespace character an additional argument
 347 is required to follow.  This argument can either be:
 348
 349 @itemize @minus
 350 @item
 351 a number.  In this case the value of this number determines the set
 352 to which the following messages are added.
 353
 354 @item
 355 an identifier consisting of alphanumeric characters plus the underscore
 356 character.  In this case the set is automatically assigned a number.
 357 This value is one more than the largest set number seen so far.
 358
 359 How to use the symbolic names is explained in section @ref{Common Usage}.
 360
 361 It is an error if a symbol name appears more than once.  All following
 362 messages are placed in a set with this number.
 363 @end itemize
 364
 365 @item
 366 If a line contains as the first non-whitespace characters the sequence
 367 @code{$delset} followed by a whitespace character an additional argument
 368 is required to follow.  This argument can either be:
 369
 370 @itemize @minus
 371 @item
 372 a number.  In this case the value of this number determines the set
 373 which will be deleted.
 374
 375 @item
 376 an identifier consisting of alphanumeric characters plus the underscore
 377 character.  This symbolic identifier must match a name for a set which
 378 previously was defined.  It is an error if the name is unknown.
 379 @end itemize
 380
 381 In both cases all messages in the specified set will be removed.  They
 382 will not appear in the output.  But if this set is later selected again
 383 with a @code{$set} command, messages can be added again and these
 384 messages will appear in the output.
 385
 386 @item
 387 If a line contains after leading whitespace the sequence
 388 @code{$quote}, the quoting character used for this input file is
 389 changed to the first non-whitespace character following
 390 @code{$quote}.  If no non-whitespace character is present before the
 391 line ends quoting is disabled.
 392
 393 By default no quoting character is used.  In this mode strings are
 394 terminated with the first unescaped line break.  If there is a
 395 @code{$quote} sequence present newline need not be escaped.  Instead a
 396 string is terminated with the first unescaped appearance of the quote
 397 character.
 398
 399 A common usage of this feature would be to set the quote character to
 400 @code{"}.  Then any appearance of the @code{"} in the strings must
 401 be escaped using the backslash (i.e., @code{\"} must be written).
 402
 403 @item
 404 Any other line must start with a number or an alphanumeric identifier
 405 (with the underscore character included).  The following characters
 406 (starting after the first whitespace character) will form the string
 407 which gets associated with the currently selected set and the message
 408 number represented by the number or identifier respectively.
 409
 410 If the start of the line is a number the message number is obvious.  It
 411 is an error if the same message number already appeared for this set.
 412
 413 If the leading token was an identifier the message number gets
 414 automatically assigned.  The value is the current maximum message
 415 number for this set plus one.  It is an error if the identifier was
 416 already used for a message in this set.  It is OK to reuse the
 417 identifier for a message in another set.  How to use the symbolic
 418 identifiers will be explained below (@pxref{Common Usage}).  There is
 419 one limitation with the identifier: it must not be @code{Set}.  The
 420 reason will be explained below.
 421
 422 The text of the messages can contain escape characters.  The usual bunch
 423 of characters known from the @w{ISO C} language are recognized
 424 (@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
 425 @code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
 426 a character code).
 427 @end itemize
 428
 429 @strong{Important:} The handling of identifiers instead of numbers for
 430 the set and messages is a GNU extension.  Systems strictly following the
 431 X/Open specification do not have this feature.  An example for a message
 432 catalog file is this:
 433
 434 @smallexample
 435 $ This is a leading comment.
 436 $quote "
 437
 438 $set SetOne
 439 1 Message with ID 1.
 440 two "   Message with ID \"two\", which gets the value 2 assigned"
 441
 442 $set SetTwo
 443 $ Since the last set got the number 1 assigned this set has number 2.
 444 4000 "The numbers can be arbitrary, they need not start at one."
 445 @end smallexample
 446
 447 This small example shows various aspects:
 448 @itemize @bullet
 449 @item
 450 Lines 1 and 9 are comments since they start with @code{$} followed by
 451 a whitespace.
 452 @item
 453 The quoting character is set to @code{"}.  Otherwise the quotes in the
 454 message definition would have to be omitted and in this case the
 455 message with the identifier @code{two} would lose its leading whitespace.
 456 @item
 457 Mixing numbered messages with messages having symbolic names is no
 458 problem and the numbering happens automatically.
 459 @end itemize
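
The line types of the source format can be told apart with very little
code.  The following sketch classifies a single input line; it is not
taken from @code{gencat}, and the names @code{line_kind},
@code{is_directive} and @code{classify_line} are hypothetical.  As a
simplification it treats every other line starting with @code{$} as a
comment and does not check that the required arguments are present.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum line_kind
{
  LINE_BLANK, LINE_COMMENT, LINE_SET, LINE_DELSET, LINE_QUOTE, LINE_MESSAGE
};

/* Return nonzero if the directive WORD starts at P and is followed by
   whitespace or the end of the line.  */
static int
is_directive (const char *p, const char *word)
{
  size_t len = strlen (word);
  return strncmp (p, word, len) == 0
         && (p[len] == '\0' || isspace ((unsigned char) p[len]));
}

/* Classify one line of a message catalog source file.  */
enum line_kind
classify_line (const char *line)
{
  assert (line != NULL);

  /* Leading whitespace does not matter for the classification.  */
  while (isspace ((unsigned char) *line))
    ++line;

  if (*line == '\0')
    return LINE_BLANK;
  if (*line != '$')
    return LINE_MESSAGE;      /* Number or identifier, then the text.  */
  if (is_directive (line, "$set"))
    return LINE_SET;
  if (is_directive (line, "$delset"))
    return LINE_DELSET;
  if (is_directive (line, "$quote"))
    return LINE_QUOTE;
  return LINE_COMMENT;        /* Simplification: any other `$' line.  */
}
```

Applied to the example catalog above, the first line is classified as a
comment, the second as a @code{$quote} directive and the line
@samp{$set SetOne} as a set selection.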
 460
 461
 462 While this file format is pretty easy it is not the best possible for
 463 use in a running program.  The @code{catopen} function would have to
 464 parse the file and handle syntactic errors gracefully.  This is not so
 465 easy and the whole process is pretty slow.  Therefore the @code{catgets}
 466 functions expect the data in another more compact and ready-to-use file
 467 format.  Files in this format are generated by the special program
 468 @code{gencat}, which is explained in detail in the next section.
 469
 470 Files in this other format are not human readable.  To be easy to use by
 471 programs it is a binary file.  But the format is byte order independent
 472 so translation files can be shared by systems of arbitrary architecture
 473 (as long as they use @theglibc{}).
 474
 475 Details about the binary file format are not important to know since
 476 these files are always created by the @code{gencat} program.  The
 477 sources of @theglibc{} also provide the sources for the
 478 @code{gencat} program and so the interested reader can look through
 479 these source files to learn about the file format.
 480
 481
 482 @node The gencat program
 483 @subsection Generate Message Catalog Files
 484
 485 @cindex gencat
 486 The @code{gencat} program is specified in the X/Open standard and the
 487 GNU implementation follows this specification and so processes
 488 all correctly formed input files.  Additionally some extensions are
 489 implemented which help to work in a more reasonable way with the
 490 @code{catgets} functions.
 491
 492 The @code{gencat} program can be invoked in two ways:
 493
 494 @example
 495 gencat [@var{Option} @dots{}] [@var{Output-File} [@var{Input-File} @dots{}]]
 496 @end example
 497
 498 This is the interface defined in the X/Open standard.  If no
 499 @var{Input-File} parameter is given, input will be read from standard
 500 input.  Multiple input files will be read as if they were concatenated.
 501 If @var{Output-File} is also missing, the output will be written to
 502 standard output.  To provide the interface users know from other
 503 programs, a second invocation form is supported:
 504
 505 @smallexample
 506 gencat [@var{Option} @dots{}] -o @var{Output-File} [@var{Input-File} @dots{}]
 507 @end smallexample
 508
 509 The option @samp{-o} is used to specify the output file and all file
 510 arguments are used as input files.
 511
 512 Besides this, one can use @file{-} or @file{/dev/stdin} for
 513 @var{Input-File} to denote the standard input.  Correspondingly one can
 514 use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
 515 standard output.  Using @file{-} as a file name is allowed in X/Open
 516 while using the device names is a GNU extension.
 517
 518 The @code{gencat} program works by concatenating all input files and
 519 then @strong{merging} the resulting collection of message sets with a
 520 possibly existing output file.  This is done by removing all messages
 521 with set/message number tuples matching any of the generated messages
 522 from the output file and then adding all the new messages.  To
 523 regenerate a catalog file while ignoring the old contents therefore
 524 requires removing the output file if it exists.  If the output is
 525 written to standard output no merging takes place.
 526
 527 @noindent
 528 The following table shows the options understood by the @code{gencat}
 529 program.  The X/Open standard does not specify any options for the
 530 program so all of these are GNU extensions.
 531
 532 @table @samp
 533 @item -V
 534 @itemx --version
 535 Print the version information and exit.
 536 @item -h
 537 @itemx --help
 538 Print a usage message listing all available options, then exit successfully.
 539 @item --new
 540 Do not merge the new messages from the input files with the old content
 541 of the output file.  The old content of the output file is discarded.
 542 @item -H
 543 @itemx --header=@var{name}
 544 This option is used to emit the symbolic names given to sets and
 545 messages in the input files for use in the program.  Details about how
 546 to use this are given in the next section.  The @var{name} parameter to
 547 this option specifies the name of the output file.  It will contain a
 548 number of C preprocessor @code{#define}s to associate a name with a
 549 number.
 550
 551 Please note that the generated file only contains the symbols from the
 552 input files.  If the output is merged with the previous content of the
 553 output file the possibly existing symbols from the file(s) which
 554 generated the old output files are not in the generated header file.
 555 @end table
 556
 557
 558 @node Common Usage
 559 @subsection How to use the @code{catgets} interface
 560
 561 The @code{catgets} functions can be used in two different ways: by
 562 strictly following the X/Open specification and not relying on the
 563 extensions, or by using the GNU extensions.  We will take a look at the
 564 former method first to understand the benefits of the extensions.
 565
 566 @subsubsection Not using symbolic names
 567
 568 Since the X/Open format of the message catalog files does not allow
 569 symbol names we have to work with numbers all the time.  When we start
 570 writing a program we have to replace all appearances of translatable
 571 strings with something like
 572
 573 @smallexample
 574 catgets (catdesc, set, msg, "string")
 575 @end smallexample
 576
 577 @noindent
 578 @var{catdesc} is retrieved from a call to @code{catopen} which is
 579 normally done once at the program start.  The @code{"string"} is the
 580 string we want to translate.  The problems start with the set and
 581 message numbers.
 582
 583 In a bigger program several programmers usually work at the same time on
 584 the program and so coordinating the number allocation is crucial.
 585 Though no two different strings may be indexed by the same tuple of
 586 numbers it is highly desirable to reuse the numbers for equal strings
 587 with equal translations (please note that there might be strings which
 588 are equal in one language but have different translations due to
 589 different contexts).
 590
 591 The allocation process can be relaxed a bit by using different set
 592 numbers for different parts of the program, which reduces the number of
 593 developers who have to coordinate the allocation.  But lists must still
 594 be kept to track the allocation and errors can easily happen.  These
 595 errors cannot be discovered by the compiler or the @code{catgets}
 596 functions.  Only the user of the program might see wrong messages.  In
 597 the worst case the wrong messages look plausible and are not recognized
 598 as wrong.  Think about the translations for @code{"true"} and
 599 @code{"false"} being exchanged.  This could result in a disaster.
 600
 601
 602 @subsubsection Using symbolic names
 603
 604 The problems mentioned in the last section derive from the fact that:
 605
 606 @enumerate
 607 @item
 608 the numbers are allocated once and due to the possibly frequent use of
 609 them it is difficult to change a number later.
 610 @item
 611 the numbers do not allow guessing anything about the string and
 612 therefore collisions can easily happen.
 613 @end enumerate
 614
 615 By consistently using symbolic names and by providing a method which maps
 616 the string content to a symbolic name (however this will happen) one can
 617 prevent both problems above.  The cost of this is that the programmer
 618 has to write a complete message catalog file while s/he is writing the
 619 program itself.
 620
 621 This is necessary since the symbolic names must be mapped to numbers
 622 before the program sources can be compiled.  In the last section it was
 623 described how to generate a header containing the mapping of the names.
 624 E.g., for the example message file given in the last section we could
 625 call the @code{gencat} program as follows (assume @file{ex.msg} contains
 626 the sources).
 627
 628 @smallexample
 629 gencat -H ex.h -o ex.cat ex.msg
 630 @end smallexample
 631
 632 @noindent
 633 This generates a header file with the following content:
 634
 635 @smallexample
 636 #define SetTwoSet 0x2   /* ex.msg:8 */
 637
 638 #define SetOneSet 0x1   /* ex.msg:4 */
 639 #define SetOnetwo 0x2   /* ex.msg:6 */
 640 @end smallexample
 641
 642 As can be seen the various symbols given in the source file are mangled
 643 to generate unique identifiers and these identifiers get numbers
 644 assigned.  Reading the source file and knowing about the rules allows
 645 one to predict the content of the header file (it is deterministic)
 646 but this is not necessary.  The @code{gencat} program can take care of
 647 everything.  All the programmer has to do is to put the generated header
 648 file in the dependency list of the source files of her/his project and
 649 add a rule to regenerate the header if any of the input files change.
 650
 651 One word about the symbol mangling.  Every symbol consists of two parts:
 652 the name of the message set plus the name of the message or the special
 653 string @code{Set}.  So @code{SetOnetwo} means this macro can be used to
 654 access the translation with identifier @code{two} in the message set
 655 @code{SetOne}.
 656
 657 The other names denote the names of the message sets.  The special
 658 string @code{Set} is used in the place of the message identifier.
 659
 660 If in the code the second string of the set @code{SetOne} is used the C
 661 code should look like this:
 662
 663 @smallexample
 664 catgets (catdesc, SetOneSet, SetOnetwo,
 665          "   Message with ID \"two\", which gets the value 2 assigned")
 666 @end smallexample
 667
 668 Writing the call this way allows changing the message number and even
 669 the set number without requiring any change in the C source
 670 code.  (The text of the string is normally not the same; this is only
 671 for this example.)
 672
 673
 674 @subsubsection How this allows development
 675
 676 To illustrate the usual way to work with the symbolic names
 677 here is a little example.  Assume we want to write the very complex and
 678 famous greeting program.  We start by writing the code as usual:
 679
 680 @smallexample
 681 #include <stdio.h>
 682 int
 683 main (void)
 684 @{
 685   printf ("Hello, world!\n");
 686   return 0;
 687 @}
 688 @end smallexample
 689
 690 Now we want to internationalize the message and therefore replace the
 691 message with whatever the user wants.
 692
 693 @smallexample
 694 #include <nl_types.h>
 695 #include <stdio.h>
 696 #include "msgnrs.h"
 697 int
 698 main (void)
 699 @{
 700   nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
 701   printf ("%s", catgets (catdesc, SetMainSet, SetMainHello,
 702                          "Hello, world!\n"));
 703   catclose (catdesc);
 704   return 0;
 705 @}
 706 @end smallexample
 707
 708 We see how the catalog object is opened and the returned descriptor used
 709 in the other function calls.  It is not really necessary to check for
 710 failure of any of the functions since even in these situations the
 711 functions will behave reasonably.  They simply will not return a
 712 translation.
 713
 714 What remains unspecified here are the constants @code{SetMainSet} and
 715 @code{SetMainHello}.  These are the symbolic names describing the
 716 message.  To get the actual definitions which match the information in
 717 the catalog file we have to create the message catalog source file and
 718 process it using the @code{gencat} program.
 719
 720 @smallexample
 721 $ Messages for the famous greeting program.
 722 $quote "
 723
 724 $set Main
 725 Hello "Hallo, Welt!\n"
 726 @end smallexample
 727
 728 Now we can start building the program (assume the message catalog source
 729 file is named @file{hello.msg} and the program source file @file{hello.c}):
 730
 731 @smallexample
 732 % gencat -H msgnrs.h -o hello.cat hello.msg
 733 % cat msgnrs.h
 734 #define MainSet 0x1     /* hello.msg:4 */
 735 #define MainHello 0x1   /* hello.msg:5 */
 736 % gcc -o hello hello.c -I.
 737 % cp hello.cat /usr/share/locale/de/LC_MESSAGES
 738 % echo $LC_ALL
 739 de
 740 % ./hello
 741 Hallo, Welt!
 742 %
 743 @end smallexample
 744
 745 The call of the @code{gencat} program creates the missing header file
 746 @file{msgnrs.h} as well as the message catalog binary.  The former is
used in the compilation of @file{hello.c} while the latter is placed in a
 748 directory in which the @code{catopen} function will try to locate it.
 749 Please check the @code{LC_ALL} environment variable and the default path
 750 for @code{catopen} presented in the description above.
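
As a minimal sketch of how the functions behave when the catalog cannot
be opened, the following helper returns the default string in that
case.  The name @code{lookup_message} is hypothetical and not part of
any library; only the failure value @code{(nl_catd) -1} of
@code{catopen} and the @code{catgets} call itself come from the
interface.

@smallexample
#include <nl_types.h>

/* Hypothetical helper: return the catalog text for message MSG in
   set SET, or FALLBACK if the catalog is unavailable.  */
const char *
lookup_message (nl_catd cat, int set, int msg, const char *fallback)
@{
  if (cat == (nl_catd) -1)
    return fallback;            /* catopen failed earlier */
  return catgets (cat, set, msg, fallback);
@}
@end smallexample

@noindent
The explicit test is not strictly needed, since @code{catgets} already
returns its last argument on failure; it only makes the fallback path
visible.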
 751
 752
 753 @node The Uniforum approach
 754 @section The Uniforum approach to Message Translation
 755
 756 Sun Microsystems tried to standardize a different approach to message
 757 translation in the Uniforum group.  There never was a real standard
 758 defined but still the interface was used in Sun's operating systems.
 759 Since this approach fits better in the development process of free
 760 software it is also used throughout the GNU project and the GNU
 761 @file{gettext} package provides support for this outside @theglibc{}.
 762
 763 The code of the @file{libintl} from GNU @file{gettext} is the same as
 764 the code in @theglibc{}.  So the documentation in the GNU
 765 @file{gettext} manual is also valid for the functionality here.  The
 766 following text will describe the library functions in detail.  But the
 767 numerous helper programs are not described in this manual.  Instead
 768 people should read the GNU @file{gettext} manual
 769 (@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
 770 We will only give a short overview.
 771
 772 Though the @code{catgets} functions are available by default on more
 773 systems the @code{gettext} interface is at least as portable as the
 774 former.  The GNU @file{gettext} package can be used wherever the
 775 functions are not available.
 776
 777
 778 @menu
 779 * Message catalogs with gettext::  The @code{gettext} family of functions.
 780 * Helper programs for gettext::    Programs to handle message catalogs
 781                                     for @code{gettext}.
 782 @end menu
 783
 784
 785 @node Message catalogs with gettext
 786 @subsection The @code{gettext} family of functions
 787
The paradigm underlying the @code{gettext} approach to message
translation is different from that of the @code{catgets} functions, but
the basic functionality is equivalent.  There are functions of the
following categories:
 792
 793 @menu
 794 * Translation with gettext::       What has to be done to translate a message.
 795 * Locating gettext catalog::       How to determine which catalog to be used.
 796 * Advanced gettext functions::     Additional functions for more complicated
 797                                     situations.
 798 * Charset conversion in gettext::  How to specify the output character set
 799                                     @code{gettext} uses.
 800 * GUI program problems::           How to use @code{gettext} in GUI programs.
 801 * Using gettextized software::     The possibilities of the user to influence
 802                                     the way @code{gettext} works.
 803 @end menu
 804
 805 @node Translation with gettext
 806 @subsubsection What has to be done to translate a message?
 807
 808 The @code{gettext} functions have a very simple interface.  The most
 809 basic function just takes the string which shall be translated as the
 810 argument and it returns the translation.  This is fundamentally
 811 different from the @code{catgets} approach where an extra key is
 812 necessary and the original string is only used for the error case.
 813
 814 If the string which has to be translated is the only argument this of
 815 course means the string itself is the key.  I.e., the translation will
 816 be selected based on the original string.  The message catalogs must
 817 therefore contain the original strings plus one translation for any such
 818 string.  The task of the @code{gettext} function is to compare the
 819 argument string with the available strings in the catalog and return the
appropriate translation.  Of course this lookup is optimized so that
it is not more expensive than an access using an atomic key
like in @code{catgets}.
 823
 824 The @code{gettext} approach has some advantages but also some
 825 disadvantages.  Please see the GNU @file{gettext} manual for a detailed
 826 discussion of the pros and cons.
 827
 828 All the definitions and declarations for @code{gettext} can be found in
 829 the @file{libintl.h} header file.  On systems where these functions are
 830 not part of the C library they can be found in a separate library named
 831 @file{libintl.a} (or accordingly different for shared libraries).
 832
 833 @comment libintl.h
 834 @comment GNU
 835 @deftypefun {char *} gettext (const char *@var{msgid})
 836 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
 837 @c Wrapper for dcgettext.
 838 The @code{gettext} function searches the currently selected message
 839 catalogs for a string which is equal to @var{msgid}.  If there is such a
 840 string available it is returned.  Otherwise the argument string
 841 @var{msgid} is returned.
 842
 843 Please note that although the return value is @code{char *} the
 844 returned string must not be changed.  This broken type results from the
 845 history of the function and does not reflect the way the function should
 846 be used.
 847
 848 Please note that above we wrote ``message catalogs'' (plural).  This is
 849 a specialty of the GNU implementation of these functions and we will
 850 say more about this when we talk about the ways message catalogs are
 851 selected (@pxref{Locating gettext catalog}).
 852
 853 The @code{gettext} function does not modify the value of the global
 854 @var{errno} variable.  This is necessary to make it possible to write
 855 something like
 856
 857 @smallexample
 858   printf (gettext ("Operation failed: %m\n"));
 859 @end smallexample
 860
 861 Here the @var{errno} value is used in the @code{printf} function while
 862 processing the @code{%m} format element and if the @code{gettext}
 863 function would change this value (it is called before @code{printf} is
 864 called) we would get a wrong message.
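
This guarantee can be checked with a small sketch like the following.
The function name @code{gettext_keeps_errno} is made up for this
illustration.

@smallexample
#include <errno.h>
#include <libintl.h>

/* Hypothetical check: set errno, call gettext, and report whether
   errno still has the same value afterwards.  */
int
gettext_keeps_errno (int value)
@{
  errno = value;
  (void) gettext ("Operation failed");
  return errno == value;
@}
@end smallexample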
 865
Since @code{gettext} does not report errors, there is no easy way to
detect a missing message catalog besides comparing the argument string
with the result.  But it is normally the task of the user to react to
missing catalogs.  The program cannot guess when a message catalog is
really necessary since for a user who speaks the language the program
was developed in, the message does not need any translation.
 871 @end deftypefun
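
The comparison of argument and result described for @code{gettext} can
be sketched as follows.  The helper name @code{was_translated} is
hypothetical.  The result is only a heuristic: it is also zero when a
catalog exists but maps the message to an identical string.

@smallexample
#include <libintl.h>
#include <string.h>

/* Hypothetical helper: nonzero if gettext returned a string that
   differs from MSGID, i.e. some catalog supplied a translation.  */
int
was_translated (const char *msgid)
@{
  return strcmp (gettext (msgid), msgid) != 0;
@}
@end smallexample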
 872
 873 The remaining two functions to access the message catalog add some
 874 functionality to select a message catalog which is not the default one.
 875 This is important if parts of the program are developed independently.
 876 Every part can have its own message catalog and all of them can be used
 877 at the same time.  The C library itself is an example: internally it
 878 uses the @code{gettext} functions but since it must not depend on a
 879 currently selected default message catalog it must specify all ambiguous
 880 information.
 881
 882 @comment libintl.h
 883 @comment GNU
 884 @deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid})
 885 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
 886 @c Wrapper for dcgettext.
 887 The @code{dgettext} function acts just like the @code{gettext}
 888 function.  It only takes an additional first argument @var{domainname}
 889 which guides the selection of the message catalogs which are searched
 890 for the translation.  If the @var{domainname} parameter is the null
 891 pointer the @code{dgettext} function is exactly equivalent to
 892 @code{gettext} since the default value for the domain name is used.
 893
 894 As for @code{gettext} the return value type is @code{char *} which is an
 895 anachronism.  The returned string must never be modified.
 896 @end deftypefun
 897
 898 @comment libintl.h
 899 @comment GNU
 900 @deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category})
 901 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
 902 @c dcgettext @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 903 @c  dcigettext @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 904 @c   libc_rwlock_rdlock @asulock @aculock
 905 @c   current_locale_name ok [protected from @mtslocale]
 906 @c   tfind ok
 907 @c   libc_rwlock_unlock ok
 908 @c   plural_lookup ok
 909 @c    plural_eval ok
 910 @c    rawmemchr ok
 911 @c   DETERMINE_SECURE ok, nothing
 912 @c   strcmp ok
 913 @c   strlen ok
 914 @c   getcwd @ascuheap @acsmem @acsfd
 915 @c   strchr ok
 916 @c   stpcpy ok
 917 @c   category_to_name ok
 918 @c   guess_category_value @mtsenv
 919 @c    getenv @mtsenv
 920 @c    current_locale_name dup ok [protected from @mtslocale by dcigettext]
 921 @c    strcmp ok
 922 @c   ENABLE_SECURE ok
 923 @c   _nl_find_domain @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 924 @c    libc_rwlock_rdlock dup @asulock @aculock
 925 @c    _nl_make_l10nflist dup @ascuheap @acsmem
 926 @c    libc_rwlock_unlock dup ok
 927 @c    _nl_load_domain @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 928 @c     libc_lock_lock_recursive @aculock
 929 @c     libc_lock_unlock_recursive @aculock
 930 @c     open->open_not_cancel_2 @acsfd
 931 @c     fstat ok
 932 @c     mmap dup @acsmem
 933 @c     close->close_not_cancel_no_status @acsfd
 934 @c     malloc dup @ascuheap @acsmem
 935 @c     read->read_not_cancel ok
 936 @c     munmap dup @acsmem
 937 @c     W dup ok
 938 @c     strlen dup ok
 939 @c     get_sysdep_segment_value ok
 940 @c     memcpy dup ok
 941 @c     hash_string dup ok
 942 @c     free dup @ascuheap @acsmem
 943 @c     libc_rwlock_init ok
 944 @c     _nl_find_msg dup @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 945 @c     libc_rwlock_fini ok
 946 @c     EXTRACT_PLURAL_EXPRESSION @ascuheap @acsmem
 947 @c      strstr dup ok
 948 @c      isspace ok
 949 @c      strtoul ok
 950 @c      PLURAL_PARSE @ascuheap @acsmem
 951 @c       malloc dup @ascuheap @acsmem
 952 @c       free dup @ascuheap @acsmem
 953 @c      INIT_GERMANIC_PLURAL ok, nothing
 954 @c        the pre-C99 variant is @acucorrupt [protected from @mtuinit by dcigettext]
 955 @c    _nl_expand_alias dup @ascuheap @asulock @acsmem @acsfd @aculock
 956 @c    _nl_explode_name dup @ascuheap @acsmem
 957 @c    libc_rwlock_wrlock dup @asulock @aculock
 958 @c    free dup @asulock @aculock @acsfd @acsmem
 959 @c   _nl_find_msg @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 960 @c    _nl_load_domain dup @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 961 @c    strlen ok
 962 @c    hash_string ok
 963 @c    W ok
 964 @c     SWAP ok
 965 @c      bswap_32 ok
 966 @c    strcmp ok
 967 @c    get_output_charset @mtsenv @ascuheap @acsmem
 968 @c     getenv dup @mtsenv
 969 @c     strlen dup ok
 970 @c     malloc dup @ascuheap @acsmem
 971 @c     memcpy dup ok
 972 @c    libc_rwlock_rdlock dup @asulock @aculock
 973 @c    libc_rwlock_unlock dup ok
 974 @c    libc_rwlock_wrlock dup @asulock @aculock
 975 @c    realloc @ascuheap @acsmem
 976 @c    strdup @ascuheap @acsmem
 977 @c    strstr ok
 978 @c    strcspn ok
 979 @c    mempcpy dup ok
 980 @c    norm_add_slashes dup ok
 981 @c    gconv_open @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsmem @acsfd
 982 @c     [protected from @mtslocale by dcigettext locale lock]
 983 @c    free dup @ascuheap @acsmem
 984 @c    libc_lock_lock @asulock @aculock
 985 @c    calloc @ascuheap @acsmem
 986 @c    gconv dup @acucorrupt [protected from @mtsrace and @asucorrupt by lock]
 987 @c    libc_lock_unlock ok
 988 @c   malloc @ascuheap @acsmem
 989 @c   mempcpy ok
 990 @c   memcpy ok
 991 @c   strcpy ok
 992 @c   libc_rwlock_wrlock @asulock @aculock
 993 @c   tsearch @ascuheap @acucorrupt @acsmem [protected from @mtsrace and @asucorrupt]
 994 @c    transcmp ok
 995 @c     strmp dup ok
 996 @c   free @ascuheap @acsmem
The @code{dcgettext} function adds another argument to those which
@code{dgettext} takes.  This argument @var{category} specifies the last
piece of information needed to locate the message catalog.  I.e., the
domain name and the locale category exactly specify which message
catalog has to be used (relative to a given directory, see below).
1002
1003 The @code{dgettext} function can be expressed in terms of
1004 @code{dcgettext} by using
1005
1006 @smallexample
1007 dcgettext (domain, string, LC_MESSAGES)
1008 @end smallexample
1009
1010 @noindent
1011 instead of
1012
1013 @smallexample
1014 dgettext (domain, string)
1015 @end smallexample
1016
1017 This also shows which values are expected for the third parameter.  One
1018 has to use the available selectors for the categories available in
1019 @file{locale.h}.  Normally the available values are @code{LC_CTYPE},
1020 @code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
1021 @code{LC_NUMERIC}, and @code{LC_TIME}.  Please note that @code{LC_ALL}
1022 must not be used and even though the names might suggest this, there is
1023 no relation to the environment variable of this name.
1024
1025 The @code{dcgettext} function is only implemented for compatibility with
1026 other systems which have @code{gettext} functions.  There is not really
1027 any situation where it is necessary (or useful) to use a different value
1028 than @code{LC_MESSAGES} for the @var{category} parameter.  We are
1029 dealing with messages here and any other choice can only be irritating.
1030
1031 As for @code{gettext} the return value type is @code{char *} which is an
1032 anachronism.  The returned string must never be modified.
1033 @end deftypefun
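
A library that must not depend on the default domain could wrap
@code{dgettext} roughly as in the following sketch.  The domain name
@code{mylib} and both function names are assumptions made for this
example.  The second function spells out the equivalent
@code{dcgettext} call with @code{LC_MESSAGES}.

@smallexample
#include <libintl.h>
#include <locale.h>

#define MYLIB_DOMAIN "mylib"    /* hypothetical domain name */

/* Look up MSGID in the library's own domain.  */
const char *
mylib_message (const char *msgid)
@{
  return dgettext (MYLIB_DOMAIN, msgid);
@}

/* The same lookup with the category given explicitly.  */
const char *
mylib_message_explicit (const char *msgid)
@{
  return dcgettext (MYLIB_DOMAIN, msgid, LC_MESSAGES);
@}
@end smallexample

@noindent
Without an installed catalog for the domain both functions return
@var{msgid} unchanged.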
1034
When using the three functions above in a program it is a frequent case
that the @var{msgid} argument is a constant string.  So it is worthwhile to
optimize this case.  A moment's thought shows that
as long as no new message catalog is loaded the translation of a message
will not change.  This optimization is actually implemented by the
@code{gettext}, @code{dgettext} and @code{dcgettext} functions.
1041
1042
1043 @node Locating gettext catalog
1044 @subsubsection How to determine which catalog to be used
1045
The functions to retrieve the translations for a given message have a
remarkably simple interface.  But to give the user of the program
the opportunity to select exactly the translation s/he wants, and
to give the programmer the possibility to influence how the catalog
files are located, there is a quite complicated
underlying mechanism which controls all this.  The code is complicated,
but the use is easy.
1053
1054 Basically we have two different tasks to perform which can also be
1055 performed by the @code{catgets} functions:
1056
1057 @enumerate
1058 @item
1059 Locate the set of message catalogs.  There are a number of files for
1060 different languages which all belong to the package.  Usually they
1061 are all stored in the filesystem below a certain directory.
1062
1063 There can be arbitrarily many packages installed and they can follow
1064 different guidelines for the placement of their files.
1065
1066 @item
1067 Relative to the location specified by the package the actual translation
1068 files must be searched, based on the wishes of the user.  I.e., for each
1069 language the user selects the program should be able to locate the
1070 appropriate file.
1071 @end enumerate
1072
1073 This is the functionality required by the specifications for
1074 @code{gettext} and this is also what the @code{catgets} functions are
1075 able to do.  But there are some problems unresolved:
1076
1077 @itemize @bullet
1078 @item
1079 The language to be used can be specified in several different ways.
1080 There is no generally accepted standard for this and the user always
1081 expects the program to understand what s/he means.  E.g., to select the
1082 German translation one could write @code{de}, @code{german}, or
1083 @code{deutsch} and the program should always react the same.
1084
1085 @item
1086 Sometimes the specification of the user is too detailed.  If s/he, e.g.,
1087 specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany,
1088 coded using the @w{ISO 8859-1} character set there is the possibility
1089 that a message catalog matching this exactly is not available.  But
1090 there could be a catalog matching @code{de} and if the character set
1091 used on the machine is always @w{ISO 8859-1} there is no reason why this
latter message catalog should not be used.  (We call this @dfn{message
1093 inheritance}.)
1094
1095 @item
1096 If a catalog for a wanted language is not available it is not always the
1097 second best choice to fall back on the language of the developer and
1098 simply not translate any message.  Instead a user might be better able
1099 to read the messages in another language and so the user of the program
1100 should be able to define a precedence order of languages.
1101 @end itemize
1102
We can divide the configuration actions into two parts: one is
1104 performed by the programmer, the other by the user.  We will start with
1105 the functions the programmer can use since the user configuration will
1106 be based on this.
1107
As already mentioned in the descriptions of the functions in the previous
sections, separate sets of messages can be selected by a @dfn{domain name}.  This is a
1110 simple string which should be unique for each program part that uses a
1111 separate domain.  It is possible to use in one program arbitrarily many
1112 domains at the same time.  E.g., @theglibc{} itself uses a domain
1113 named @code{libc} while the program using the C Library could use a
1114 domain named @code{foo}.  The important point is that at any time
1115 exactly one domain is active.  This is controlled with the following
1116 function.
1117
1118 @comment libintl.h
1119 @comment GNU
1120 @deftypefun {char *} textdomain (const char *@var{domainname})
1121 @safety{@prelim{}@mtsafe{}@asunsafe{@asulock{} @ascuheap{}}@acunsafe{@aculock{} @acsmem{}}}
1122 @c textdomain @asulock @ascuheap @aculock @acsmem
1123 @c  libc_rwlock_wrlock @asulock @aculock
1124 @c  strcmp ok
1125 @c  strdup @ascuheap @acsmem
1126 @c  free @ascuheap @acsmem
1127 @c  libc_rwlock_unlock ok
1128 The @code{textdomain} function sets the default domain, which is used in
1129 all future @code{gettext} calls, to @var{domainname}.  Please note that
1130 @code{dgettext} and @code{dcgettext} calls are not influenced if the
1131 @var{domainname} parameter of these functions is not the null pointer.
1132
1133 Before the first call to @code{textdomain} the default domain is
1134 @code{messages}.  This is the name specified in the specification of
1135 the @code{gettext} API.  This name is as good as any other name.  No
1136 program should ever really use a domain with this name since this can
1137 only lead to problems.
1138
1139 The function returns the value which is from now on taken as the default
domain.  If the system ran out of memory the returned value is
1141 @code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}.
1142 Despite the return value type being @code{char *} the return string must
1143 not be changed.  It is allocated internally by the @code{textdomain}
1144 function.
1145
1146 If the @var{domainname} parameter is the null pointer no new default
1147 domain is set.  Instead the currently selected default domain is
1148 returned.
1149
1150 If the @var{domainname} parameter is the empty string the default domain
1151 is reset to its initial value, the domain with the name @code{messages}.
1152 This possibility is questionable to use since the domain @code{messages}
1153 really never should be used.
1154 @end deftypefun
1155
1156 @comment libintl.h
1157 @comment GNU
1158 @deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname})
1159 @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}}
1160 @c bindtextdomain @ascuheap @acsmem
1161 @c  set_binding_values @ascuheap @acsmem
1162 @c   libc_rwlock_wrlock dup @asulock @aculock
1163 @c   strcmp dup ok
1164 @c   strdup dup @ascuheap @acsmem
1165 @c   free dup @ascuheap @acsmem
1166 @c   malloc dup @ascuheap @acsmem
1167 The @code{bindtextdomain} function can be used to specify the directory
1168 which contains the message catalogs for domain @var{domainname} for the
1169 different languages.  To be correct, this is the directory where the
1170 hierarchy of directories is expected.  Details are explained below.
1171
1172 For the programmer it is important to note that the translations which
1173 come with the program have to be placed in a directory hierarchy starting
1174 at, say, @file{/foo/bar}.  Then the program should make a
1175 @code{bindtextdomain} call to bind the domain for the current program to
1176 this directory.  So it is made sure the catalogs are found.  A correctly
1177 running program does not depend on the user setting an environment
1178 variable.
1179
1180 The @code{bindtextdomain} function can be used several times and if the
1181 @var{domainname} argument is different the previously bound domains
1182 will not be overwritten.
1183
If a program which uses @code{bindtextdomain} at some point
uses the @code{chdir} function to change the current working
directory, it is important that the @var{dirname} string is an
absolute pathname.  Otherwise the addressed directory might vary over
time.
1189
1190 If the @var{dirname} parameter is the null pointer @code{bindtextdomain}
1191 returns the currently selected directory for the domain with the name
1192 @var{domainname}.
1193
The @code{bindtextdomain} function returns a pointer to a string
containing the name of the selected directory.  The string is
allocated internally in the function and must not be changed by the
user.  If the system ran out of memory during the execution of
@code{bindtextdomain} the return value is @code{NULL} and the global
variable @var{errno} is set accordingly.
1200 @end deftypefun
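
A typical start-up sequence using @code{bindtextdomain} and
@code{textdomain} might look like the following sketch.  The function
name @code{init_translations} is hypothetical, and the
@code{setlocale} call is included on the assumption that the program
has not yet selected the user's locale.

@smallexample
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: bind DOMAIN to the catalog hierarchy below
   LOCALEDIR (an absolute path) and make it the default domain.
   Returns 0 on success, -1 if either call reports failure.  */
int
init_translations (const char *domain, const char *localedir)
@{
  setlocale (LC_ALL, "");       /* assumption: use the user's locale */
  if (bindtextdomain (domain, localedir) == NULL)
    return -1;
  if (textdomain (domain) == NULL)
    return -1;
  return 0;
@}
@end smallexample

@noindent
A call such as @code{init_translations ("foo", "/foo/bar")} early in
@code{main} would then let plain @code{gettext} calls use the domain
@code{foo}.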
1201
1202
1203 @node Advanced gettext functions
1204 @subsubsection Additional functions for more complicated situations
1205
1206 The functions of the @code{gettext} family described so far (and all the
1207 @code{catgets} functions as well) have one problem in the real world
1208 which has been neglected completely in all existing approaches.  What
1209 is meant here is the handling of plural forms.
1210
1211 Looking through Unix source code before the time anybody thought about
1212 internationalization (and, sadly, even afterwards) one can often find
1213 code similar to the following:
1214
1215 @smallexample
1216    printf ("%d file%s deleted", n, n == 1 ? "" : "s");
1217 @end smallexample
1218
1219 @noindent
After the first complaints from people internationalizing the code,
programmers either completely avoided formulations like this or used
strings like @code{"file(s)"}.  Both look unnatural and should be
avoided.  The first attempts to solve the problem correctly looked like
this:
1224
1225 @smallexample
1226    if (n == 1)
1227      printf ("%d file deleted", n);
1228    else
1229      printf ("%d files deleted", n);
1230 @end smallexample
1231
But this does not solve the problem.  It helps languages where the
plural form of a noun is not simply constructed by adding an `s' but
that is all.  Once again people fell into the trap of believing the
rules their language uses are universal.  But the handling of plural
forms differs widely between the language families.  Two things can
differ between (and even within) language families:
1238
1239 @itemize @bullet
1240 @item
The way plural forms are built differs.  This is a problem with
languages which have many irregularities.  German, for instance, is a
drastic case.  Though English and German are part of the same language
family (Germanic), the almost regular forming of plural noun forms
(appending an `s') is hardly found in German.
1246
1247 @item
The number of plural forms differs.  This is somewhat surprising for
those who only have experience with Romance and Germanic languages
since here the number is the same (there are two).

But other language families have only one form or many forms.  More
information on this is given in a separate section.
1254 @end itemize
1255
1256 The consequence of this is that application writers should not try to
1257 solve the problem in their code.  This would be localization since it is
1258 only usable for certain, hardcoded language environments.  Instead the
1259 extended @code{gettext} interface should be used.
1260
Instead of the one key string, these extra functions take two
strings and a numerical argument.  The idea behind this is that using
the numerical argument and the first string as a key, the implementation
can select the right plural form using rules specified by the
translator.  The two string arguments are then used to provide a return
value in case no message catalog is found (similar to the normal
@code{gettext} behavior).  In this case the rules for Germanic languages
are used and it is assumed that the first string argument is the singular
form, the second the plural form.
1270
1271 This has the consequence that programs without language catalogs can
1272 display the correct strings only if the program itself is written using
1273 a Germanic language.  This is a limitation but since @theglibc{}
1274 (as well as the GNU @code{gettext} package) is written as part of the
1275 GNU package and the coding standards for the GNU project require programs
1276 to be written in English, this solution nevertheless fulfills its
1277 purpose.
1278
1279 @comment libintl.h
1280 @comment GNU
1281 @deftypefun {char *} ngettext (const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
1282 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
1283 @c Wrapper for dcngettext.
1284 The @code{ngettext} function is similar to the @code{gettext} function
1285 as it finds the message catalogs in the same way.  But it takes two
1286 extra arguments.  The @var{msgid1} parameter must contain the singular
1287 form of the string to be converted.  It is also used as the key for the
1288 search in the catalog.  The @var{msgid2} parameter is the plural form.
1289 The parameter @var{n} is used to determine the plural form.  If no
1290 message catalog is found @var{msgid1} is returned if @code{n == 1},
otherwise @var{msgid2}.
1292
1293 An example for the use of this function is:
1294
1295 @smallexample
1296   printf (ngettext ("%d file removed", "%d files removed", n), n);
1297 @end smallexample
1298
1299 Please note that the numeric value @var{n} has to be passed to the
1300 @code{printf} function as well.  It is not sufficient to pass it only to
1301 @code{ngettext}.
1302 @end deftypefun
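
A slightly more complete sketch of @code{ngettext} formats the message
into a caller-supplied buffer.  The function name
@code{format_removed} is hypothetical.  The format uses @code{%lu}
because the count is passed as @code{unsigned long int}, the type
@code{ngettext} expects.

@smallexample
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: write "N file(s) removed" in the right
   plural form into BUF.  Returns what snprintf returns.  */
int
format_removed (char *buf, size_t size, unsigned long int n)
@{
  return snprintf (buf, size,
                   ngettext ("%lu file removed",
                             "%lu files removed", n),
                   n);
@}
@end smallexample

@noindent
When no message catalog is found the singular string is chosen only
for @code{n == 1}, as described above.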
1303
1304 @comment libintl.h
1305 @comment GNU
1306 @deftypefun {char *} dngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
1307 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
1308 @c Wrapper for dcngettext.
The @code{dngettext} function is similar to the @code{dgettext} function in the
1310 way the message catalog is selected.  The difference is that it takes
1311 two extra parameters to provide the correct plural form.  These two
1312 parameters are handled in the same way @code{ngettext} handles them.
1313 @end deftypefun
1314
1315 @comment libintl.h
1316 @comment GNU
1317 @deftypefun {char *} dcngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n}, int @var{category})
1318 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
1319 @c Wrapper for dcigettext.
The @code{dcngettext} function is similar to the @code{dcgettext} function in the
1321 way the message catalog is selected.  The difference is that it takes
1322 two extra parameters to provide the correct plural form.  These two
1323 parameters are handled in the same way @code{ngettext} handles them.
1324 @end deftypefun
1325
1326 @subsubheading The problem of plural forms
1327
1328 A description of the problem can be found at the beginning of the last
1329 section.  Now there is the question how to solve it.  Without the input
1330 of linguists (which was not available) it was not possible to determine
1331 whether there are only a few different forms in which plural forms are
1332 formed or whether the number can increase with every new supported
1333 language.
1334
1335 Therefore the solution implemented is to allow the translator to specify
1336 the rules of how to select the plural form.  Since the formula varies
1337 with every language this is the only viable solution except for
1338 hardcoding the information in the code (which still would require the
1339 possibility of extensions to not prevent the use of new languages).  The
1340 details are explained in the GNU @code{gettext} manual.  Here only a
1341 bit of information is provided.
1342
1343 The information about the plural form selection has to be stored in the
1344 header entry (the one with the empty @code{msgid} string).  It looks
1345 like this:
1346
1347 @smallexample
1348 Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1;
1349 @end smallexample
1350
1351 The @code{nplurals} value must be a decimal number which specifies how
1352 many different plural forms exist for this language.  The string
1353 following @code{plural} is an expression using the C language
1354 syntax.  Exceptions are that no negative numbers are allowed, numbers
1355 must be decimal, and the only variable allowed is @code{n}.  This
1356 expression will be evaluated whenever one of the functions
1357 @code{ngettext}, @code{dngettext}, or @code{dcngettext} is called.  The
1358 numeric value passed to these functions is then substituted for all uses
1359 of the variable @code{n} in the expression.  The resulting value then
1360 must be greater or equal to zero and smaller than the value given as the
1361 value of @code{nplurals}.
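
As a minimal sketch of how a caller uses this, the hypothetical helper
@code{format_removed} below selects the message through @code{ngettext}
and then formats the count into it.  The helper name and the message
strings are assumptions made for illustration only.  When no catalog is
found, @code{ngettext} falls back to the English rule and returns the
first string if @var{n} is one and the second string otherwise.

```c
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: format a "files removed" message for N files
   into BUF.  ngettext selects the plural form; N is passed a second
   time to snprintf to fill in the %lu conversion.  */
int
format_removed (char *buf, size_t size, unsigned long int n)
{
  const char *fmt = ngettext ("%lu file removed", "%lu files removed", n);
  return snprintf (buf, size, fmt, n);
}
```

Note that the count appears twice: once to select the form and once as
the value to print.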
1362
1363 @noindent
The following rules are known at this point.  The languages are listed
together with their families.  But this does not necessarily mean the
information can be generalized for the whole family (as can be easily
seen in the table below).@footnote{Additions are welcome.  Send
appropriate information to @email{bug-glibc-manual@@gnu.org}.}
1369
1370 @table @asis
1371 @item Only one form:
1372 Some languages only require one single form.  There is no distinction
1373 between the singular and plural form.  An appropriate header entry
1374 would look like this:
1375
1376 @smallexample
1377 Plural-Forms: nplurals=1; plural=0;
1378 @end smallexample
1379
1380 @noindent
1381 Languages with this property include:
1382
1383 @table @asis
1384 @item Finno-Ugric family
1385 Hungarian
1386 @item Asian family
1387 Japanese, Korean
1388 @item Turkic/Altaic family
1389 Turkish
1390 @end table
1391
1392 @item Two forms, singular used for one only
1393 This is the form used in most existing programs since it is what English
1394 uses.  A header entry would look like this:
1395
1396 @smallexample
1397 Plural-Forms: nplurals=2; plural=n != 1;
1398 @end smallexample
1399
(Note: this uses the feature of C expressions that boolean expressions
have the value zero or one.)
1402
1403 @noindent
1404 Languages with this property include:
1405
1406 @table @asis
1407 @item Germanic family
1408 Danish, Dutch, English, German, Norwegian, Swedish
1409 @item Finno-Ugric family
1410 Estonian, Finnish
1411 @item Latin/Greek family
1412 Greek
1413 @item Semitic family
1414 Hebrew
1415 @item Romance family
1416 Italian, Portuguese, Spanish
1417 @item Artificial
1418 Esperanto
1419 @end table
1420
1421 @item Two forms, singular used for zero and one
1422 Exceptional case in the language family.  The header entry would be:
1423
1424 @smallexample
1425 Plural-Forms: nplurals=2; plural=n>1;
1426 @end smallexample
1427
1428 @noindent
1429 Languages with this property include:
1430
1431 @table @asis
@item Romance family
1433 French, Brazilian Portuguese
1434 @end table
1435
1436 @item Three forms, special case for zero
1437 The header entry would be:
1438
1439 @smallexample
1440 Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n != 0 ? 1 : 2;
1441 @end smallexample
1442
1443 @noindent
1444 Languages with this property include:
1445
1446 @table @asis
1447 @item Baltic family
1448 Latvian
1449 @end table
1450
1451 @item Three forms, special cases for one and two
1452 The header entry would be:
1453
1454 @smallexample
1455 Plural-Forms: nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2;
1456 @end smallexample
1457
1458 @noindent
1459 Languages with this property include:
1460
1461 @table @asis
1462 @item Celtic
1463 Gaeilge (Irish)
1464 @end table
1465
1466 @item Three forms, special case for numbers ending in 1[2-9]
1467 The header entry would look like this:
1468
1469 @smallexample
1470 Plural-Forms: nplurals=3; \
1471     plural=n%10==1 && n%100!=11 ? 0 : \
1472            n%10>=2 && (n%100<10 || n%100>=20) ? 1 : 2;
1473 @end smallexample
1474
1475 @noindent
1476 Languages with this property include:
1477
1478 @table @asis
1479 @item Baltic family
1480 Lithuanian
1481 @end table
1482
1483 @item Three forms, special cases for numbers ending in 1 and 2, 3, 4, except those ending in 1[1-4]
1484 The header entry would look like this:
1485
1486 @smallexample
1487 Plural-Forms: nplurals=3; \
1488     plural=n%100/10==1 ? 2 : n%10==1 ? 0 : (n+9)%10>3 ? 2 : 1;
1489 @end smallexample
1490
1491 @noindent
1492 Languages with this property include:
1493
1494 @table @asis
1495 @item Slavic family
1496 Croatian, Czech, Russian, Ukrainian
1497 @end table
1498
1499 @item Three forms, special cases for 1 and 2, 3, 4
1500 The header entry would look like this:
1501
1502 @smallexample
1503 Plural-Forms: nplurals=3; \
1504     plural=(n==1) ? 1 : (n>=2 && n<=4) ? 2 : 0;
1505 @end smallexample
1506
1507 @noindent
1508 Languages with this property include:
1509
1510 @table @asis
1511 @item Slavic family
1512 Slovak
1513 @end table
1514
1515 @item Three forms, special case for one and some numbers ending in 2, 3, or 4
1516 The header entry would look like this:
1517
1518 @smallexample
1519 Plural-Forms: nplurals=3; \
1520     plural=n==1 ? 0 : \
1521            n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
1522 @end smallexample
1523
1524 @noindent
1525 Languages with this property include:
1526
1527 @table @asis
1528 @item Slavic family
1529 Polish
1530 @end table
1531
1532 @item Four forms, special case for one and all numbers ending in 02, 03, or 04
1533 The header entry would look like this:
1534
1535 @smallexample
1536 Plural-Forms: nplurals=4; \
1537     plural=n%100==1 ? 0 : n%100==2 ? 1 : n%100==3 || n%100==4 ? 2 : 3;
1538 @end smallexample
1539
1540 @noindent
1541 Languages with this property include:
1542
1543 @table @asis
1544 @item Slavic family
1545 Slovenian
1546 @end table
1547 @end table
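
Because the @code{plural} expressions use C syntax, they can be tried
out directly in C.  The following sketch copies three of the expressions
from the table above into functions; the function names are hypothetical
and exist only for this illustration.  Each function returns the index
of the plural form selected for @var{n}.

```c
/* Each function mirrors one Plural-Forms expression from the table
   and returns the index of the plural form to use for N.  */

/* Two forms, singular used for one only (e.g., English).  */
unsigned long int
plural_english (unsigned long int n)
{
  return n != 1;
}

/* Three forms, as listed for Polish.  */
unsigned long int
plural_polish (unsigned long int n)
{
  return (n == 1) ? 0
         : (n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20)) ? 1
         : 2;
}

/* Four forms, as listed for Slovenian.  */
unsigned long int
plural_slovenian (unsigned long int n)
{
  return (n % 100 == 1) ? 0
         : (n % 100 == 2) ? 1
         : (n % 100 == 3 || n % 100 == 4) ? 2
         : 3;
}
```

With these expressions, 22 selects index 1 for Polish while 12 selects
index 2, which shows why a single test for ``one versus many'' is not
sufficient.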
1548
1549
1550 @node Charset conversion in gettext
1551 @subsubsection How to specify the output character set @code{gettext} uses
1552
1553 @code{gettext} not only looks up a translation in a message catalog, it
1554 also converts the translation on the fly to the desired output character
1555 set.  This is useful if the user is working in a different character set
1556 than the translator who created the message catalog, because it avoids
1557 distributing variants of message catalogs which differ only in the
1558 character set.
1559
1560 The output character set is, by default, the value of @code{nl_langinfo
1561 (CODESET)}, which depends on the @code{LC_CTYPE} part of the current
1562 locale.  But programs which store strings in a locale independent way
(e.g., UTF-8) can request that @code{gettext} and related functions
1564 return the translations in that encoding, by use of the
1565 @code{bind_textdomain_codeset} function.
1566
1567 Note that the @var{msgid} argument to @code{gettext} is not subject to
1568 character set conversion.  Also, when @code{gettext} does not find a
1569 translation for @var{msgid}, it returns @var{msgid} unchanged --
1570 independently of the current output character set.  It is therefore
1571 recommended that all @var{msgid}s be US-ASCII strings.
1572
1573 @comment libintl.h
1574 @comment GNU
1575 @deftypefun {char *} bind_textdomain_codeset (const char *@var{domainname}, const char *@var{codeset})
1576 @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}}
1577 @c bind_textdomain_codeset @ascuheap @acsmem
1578 @c  set_binding_values dup @ascuheap @acsmem
1579 The @code{bind_textdomain_codeset} function can be used to specify the
1580 output character set for message catalogs for domain @var{domainname}.
1581 The @var{codeset} argument must be a valid codeset name which can be used
1582 for the @code{iconv_open} function, or a null pointer.
1583
1584 If the @var{codeset} parameter is the null pointer,
1585 @code{bind_textdomain_codeset} returns the currently selected codeset
1586 for the domain with the name @var{domainname}.  It returns @code{NULL} if
1587 no codeset has yet been selected.
1588
1589 The @code{bind_textdomain_codeset} function can be used several times.
1590 If used multiple times with the same @var{domainname} argument, the
1591 later call overrides the settings made by the earlier one.
1592
1593 The @code{bind_textdomain_codeset} function returns a pointer to a
1594 string containing the name of the selected codeset.  The string is
1595 allocated internally in the function and must not be changed by the
user.  If the system ran out of memory during the execution of
@code{bind_textdomain_codeset}, the return value is @code{NULL} and the
global variable @code{errno} is set accordingly.
1599 @end deftypefun
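
The two uses of the function, setting and querying, might be wrapped as
in the following sketch.  The helper names and the choice of UTF-8 are
assumptions for illustration; a program would pass its own domain name.

```c
#include <libintl.h>
#include <stddef.h>

/* Hypothetical helper: ask gettext to return translations for DOMAIN
   in UTF-8, regardless of the LC_CTYPE locale.  Returns 0 on success
   and -1 if bind_textdomain_codeset reported a failure.  */
int
request_utf8_messages (const char *domain)
{
  return bind_textdomain_codeset (domain, "UTF-8") != NULL ? 0 : -1;
}

/* Return the codeset currently selected for DOMAIN, or NULL if none
   has been selected yet.  The result must not be modified.  */
const char *
selected_codeset (const char *domain)
{
  return bind_textdomain_codeset (domain, NULL);
}
```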
1600
1601
1602 @node GUI program problems
1603 @subsubsection How to use @code{gettext} in GUI programs
1604
1605 One place where the @code{gettext} functions, if used normally, have big
1606 problems is within programs with graphical user interfaces (GUIs).  The
1607 problem is that many of the strings which have to be translated are very
short.  They have to appear in pull-down menus, which restricts the
length.  But strings which do not contain entire sentences or at
least large fragments of a sentence may appear in more than one
situation in the program and might need different translations.  This is
1612 especially true for the one-word strings which are frequently used in
1613 GUI programs.
1614
1615 As a consequence many people say that the @code{gettext} approach is
1616 wrong and instead @code{catgets} should be used which indeed does not
1617 have this problem.  But there is a very simple and powerful method to
handle this kind of problem with the @code{gettext} functions.
1619
1620 @noindent
1621 As an example consider the following fictional situation.  A GUI program
1622 has a menu bar with the following entries:
1623
1624 @smallexample
1625 +------------+------------+--------------------------------------+
1626 | File       | Printer    |                                      |
1627 +------------+------------+--------------------------------------+
1628 | Open     | | Select   |
1629 | New      | | Open     |
1630 +----------+ | Connect  |
1631              +----------+
1632 @end smallexample
1633
1634 To have the strings @code{File}, @code{Printer}, @code{Open},
1635 @code{New}, @code{Select}, and @code{Connect} translated there has to be
1636 at some point in the code a call to a function of the @code{gettext}
1637 family.  But in two places the string passed into the function would be
1638 @code{Open}.  The translations might not be the same and therefore we
1639 are in the dilemma described above.
1640
1641 One solution to this problem is to artificially extend the strings
1642 to make them unambiguous.  But what would the program do if no
1643 translation is available?  The extended string is not what should be
1644 printed.  So we should use a slightly modified version of the functions.
1645
1646 To extend the strings a uniform method should be used.  E.g., in the
1647 example above, the strings could be chosen as
1648
1649 @smallexample
1650 Menu|File
1651 Menu|Printer
1652 Menu|File|Open
1653 Menu|File|New
1654 Menu|Printer|Select
1655 Menu|Printer|Open
1656 Menu|Printer|Connect
1657 @end smallexample
1658
1659 Now all the strings are different and if now instead of @code{gettext}
1660 the following little wrapper function is used, everything works just
1661 fine:
1662
1663 @cindex sgettext
1664 @smallexample
1665   char *
1666   sgettext (const char *msgid)
1667   @{
1668     char *msgval = gettext (msgid);
1669     if (msgval == msgid)
1670       msgval = strrchr (msgid, '|') + 1;
1671     return msgval;
1672   @}
1673 @end smallexample
1674
1675 What this little function does is to recognize the case when no
1676 translation is available.  This can be done very efficiently by a
1677 pointer comparison since the return value is the input value.  If there
1678 is no translation we know that the input string is in the format we used
1679 for the Menu entries and therefore contains a @code{|} character.  We
1680 simply search for the last occurrence of this character and return a
1681 pointer to the character following it.  That's it!
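
The wrapper relies on every key containing a @code{|} character.  If a
program cannot guarantee this, a slightly more defensive variant is
possible.  The following is only a sketch; the name
@code{sgettext_checked} is hypothetical and not part of any library.

```c
#include <libintl.h>
#include <string.h>

/* Variant of sgettext that also accepts keys without a '|'
   separator.  If gettext returns its argument unchanged, no
   translation was found and the context prefix, if any, is
   stripped.  */
const char *
sgettext_checked (const char *msgid)
{
  const char *msgval = gettext (msgid);
  if (msgval == msgid)
    {
      const char *sep = strrchr (msgid, '|');
      if (sep != NULL)
        msgval = sep + 1;
    }
  return msgval;
}
```

Without a translation, the key @code{Menu|File|Open} yields
@code{Open}, and a plain key such as @code{Open} is returned unchanged.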
1682
1683 If one now consistently uses the extended string form and replaces
1684 the @code{gettext} calls with calls to @code{sgettext} (this is normally
1685 limited to very few places in the GUI implementation) then it is
1686 possible to produce a program which can be internationalized.
1687
With advanced compilers (such as GNU C) one can write the
@code{sgettext} function as an inline function or as a macro like this:
1690
1691 @cindex sgettext
1692 @smallexample
1693 #define sgettext(msgid) \
1694   (@{ const char *__msgid = (msgid);            \
     char *__msgval = gettext (__msgid);       \
1696      if (__msgval == __msgid)                  \
1697        __msgval = strrchr (__msgid, '|') + 1;  \
1698      __msgval; @})
1699 @end smallexample
1700
1701 The other @code{gettext} functions (@code{dgettext}, @code{dcgettext}
1702 and the @code{ngettext} equivalents) can and should have corresponding
1703 functions as well which look almost identical, except for the parameters
1704 and the call to the underlying function.
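
As a sketch of what such corresponding functions might look like, the
following shows counterparts for @code{dgettext} and @code{ngettext}.
The names @code{sdgettext}, @code{sngettext}, and @code{strip_context}
are hypothetical and chosen only for this illustration.

```c
#include <libintl.h>
#include <string.h>

/* Strip the context prefix, if any, from an untranslated key.  */
static const char *
strip_context (const char *msgid)
{
  const char *sep = strrchr (msgid, '|');
  return sep != NULL ? sep + 1 : msgid;
}

/* Hypothetical counterpart of sgettext for dgettext.  */
const char *
sdgettext (const char *domainname, const char *msgid)
{
  const char *msgval = dgettext (domainname, msgid);
  return msgval == msgid ? strip_context (msgid) : msgval;
}

/* Hypothetical counterpart of sgettext for ngettext.  Without a
   translation ngettext returns one of its two string arguments, so
   both pointers have to be checked.  */
const char *
sngettext (const char *msgid1, const char *msgid2, unsigned long int n)
{
  const char *msgval = ngettext (msgid1, msgid2, n);
  if (msgval == msgid1 || msgval == msgid2)
    msgval = strip_context (msgval);
  return msgval;
}
```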
1705
Now there is of course the question why such functions do not exist in
@theglibc{}.  There are two parts to the answer to this question.
1708
1709 @itemize @bullet
1710 @item
1711 They are easy to write and therefore can be provided by the project they
1712 are used in.  This is not an answer by itself and must be seen together
1713 with the second part which is:
1714
1715 @item
1716 There is no way the C library can contain a version which can work
1717 everywhere.  The problem is the selection of the character to separate
1718 the prefix from the actual string in the extended string.  The
1719 examples above used @code{|} which is a quite good choice because it
1720 resembles a notation frequently used in this context and it also is a
1721 character not often used in message strings.
1722
But what if the character is used in message strings?  Or what if the
chosen character is not available in the character set on the machine one
compiles on (e.g., @code{|} is not required to exist for @w{ISO C}; this is
why the @file{iso646.h} file exists in @w{ISO C} programming environments)?
1727 @end itemize
1728
There is only one more comment left to make.  The wrapper function above
requires that the translation strings are not extended themselves.
1731 This is only logical.  There is no need to disambiguate the strings
1732 (since they are never used as keys for a search) and one also saves
1733 quite some memory and disk space by doing this.
1734
1735
1736 @node Using gettextized software
1737 @subsubsection User influence on @code{gettext}
1738
The last sections described what the programmer can do to
internationalize the messages of the program.  But it is ultimately up to
the user to select the messages s/he wants to see.  S/He must understand
them.
1743
1744 The POSIX locale model uses the environment variables @code{LC_COLLATE},
1745 @code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, @code{LC_NUMERIC},
1746 and @code{LC_TIME} to select the locale which is to be used.  This way
1747 the user can influence lots of functions.  As we mentioned above, the
1748 @code{gettext} functions also take advantage of this.
1749
1750 To understand how this happens it is necessary to take a look at the
1751 various components of the filename which gets computed to locate a
1752 message catalog.  It is composed as follows:
1753
1754 @smallexample
1755 @var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo
1756 @end smallexample
1757
1758 The default value for @var{dir_name} is system specific.  It is computed
1759 from the value given as the prefix while configuring the C library.
1760 This value normally is @file{/usr} or @file{/}.  For the former the
1761 complete @var{dir_name} is:
1762
1763 @smallexample
1764 /usr/share/locale
1765 @end smallexample
1766
1767 We can use @file{/usr/share} since the @file{.mo} files containing the
1768 message catalogs are system independent, so all systems can use the same
1769 files.  If the program executed the @code{bindtextdomain} function for
1770 the message domain that is currently handled, the @code{dir_name}
1771 component is exactly the value which was given to the function as
1772 the second parameter.  I.e., @code{bindtextdomain} allows overwriting
1773 the only system dependent and fixed value to make it possible to
1774 address files anywhere in the filesystem.
1775
1776 The @var{category} is the name of the locale category which was selected
1777 in the program code.  For @code{gettext} and @code{dgettext} this is
1778 always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the
value of the third parameter.  As mentioned above, using a category
other than @code{LC_MESSAGES} should be avoided.
1781
The @var{locale} component is computed based on the category used.  Just
as for the @code{setlocale} function, this is where the user's selection
comes into play.  Some environment variables are examined in a fixed order
and the first environment variable set determines the return value of
the lookup process.  In detail, for the category @code{LC_xxx} the
following variables are examined in this order:
1788
1789 @table @code
1790 @item LANGUAGE
1791 @item LC_ALL
1792 @item LC_xxx
1793 @item LANG
1794 @end table
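
The lookup order in the list above can be sketched with @code{getenv}.
This is only an illustration of the order; the function name
@code{first_locale_setting} is hypothetical, treating empty values as
unset is an assumption of the sketch, and the lookup inside the library
may apply further conditions that are not modeled here.

```c
#include <stdlib.h>

/* Return the value of the first variable that is set, following the
   order LANGUAGE, LC_ALL, CATEGORY_VAR (e.g., "LC_MESSAGES"), LANG.
   Empty values are treated as unset; that is an assumption of this
   sketch.  Returns NULL if none of the variables is set.  */
const char *
first_locale_setting (const char *category_var)
{
  const char *names[] = { "LANGUAGE", "LC_ALL", category_var, "LANG" };
  for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
    {
      const char *value = getenv (names[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    }
  return NULL;
}
```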
1795
1796 This looks very familiar.  With the exception of the @code{LANGUAGE}
1797 environment variable this is exactly the lookup order the
1798 @code{setlocale} function uses.  But why introduce the @code{LANGUAGE}
1799 variable?
1800
The reason is that the syntax of the values these variables can have is
different from what is expected by the @code{setlocale} function.  If we
set @code{LC_ALL} to a value following the extended syntax, the
@code{setlocale} function would never be able to use the value of this
variable.  An additional variable removes this problem, and it also lets
us select the language independently of the locale setting, which is
sometimes useful.
1808
While for the @code{LC_xxx} variables the value should consist of
exactly one specification of a locale, the @code{LANGUAGE} variable's
value can consist of a colon-separated list of locale names.  The
1812 attentive reader will realize that this is the way we manage to
1813 implement one of our additional demands above: we want to be able to
1814 specify an ordered list of languages.
1815
1816 Back to the constructed filename we have only one component missing.
1817 The @var{domain_name} part is the name which was either registered using
1818 the @code{textdomain} function or which was given to @code{dgettext} or
1819 @code{dcgettext} as the first parameter.  Now it becomes obvious that a
1820 good choice for the domain name in the program code is a string which is
1821 closely related to the program/package name.  E.g., for @theglibc{}
1822 the domain name is @code{libc}.
1823
1824 @noindent
1825 A limited piece of example code should show how the program is supposed
1826 to work:
1827
1828 @smallexample
1829 @{
1830   setlocale (LC_ALL, "");
1831   textdomain ("test-package");
1832   bindtextdomain ("test-package", "/usr/local/share/locale");
1833   puts (gettext ("Hello, world!"));
1834 @}
1835 @end smallexample
1836
1837 At the program start the default domain is @code{messages}, and the
default locale is @code{"C"}.  The @code{setlocale} call sets the locale
1839 according to the user's environment variables; remember that correct
1840 functioning of @code{gettext} relies on the correct setting of the
1841 @code{LC_MESSAGES} locale (for looking up the message catalog) and
1842 of the @code{LC_CTYPE} locale (for the character set conversion).
1843 The @code{textdomain} call changes the default domain to
1844 @code{test-package}.  The @code{bindtextdomain} call specifies that
1845 the message catalogs for the domain @code{test-package} can be found
1846 below the directory @file{/usr/local/share/locale}.
1847
1848 If the user sets in her/his environment the variable @code{LANGUAGE}
1849 to @code{de} the @code{gettext} function will try to use the
1850 translations from the file
1851
1852 @smallexample
1853 /usr/local/share/locale/de/LC_MESSAGES/test-package.mo
1854 @end smallexample
1855
1856 From the above descriptions it should be clear which component of this
1857 filename is determined by which source.
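
To make the composition explicit, the following sketch assembles the
name from its four components.  The function @code{catalog_file_name}
is hypothetical; it only mirrors the pattern shown earlier and is not
how the library itself is implemented.

```c
#include <stdio.h>

/* Compose DIR_NAME/LOCALE/CATEGORY/DOMAIN_NAME.mo into BUF.
   CATEGORY is the full category name such as "LC_MESSAGES".
   Returns 0 on success and -1 if BUF is too small.  */
int
catalog_file_name (char *buf, size_t size, const char *dir_name,
                   const char *locale, const char *category,
                   const char *domain_name)
{
  int len = snprintf (buf, size, "%s/%s/%s/%s.mo",
                      dir_name, locale, category, domain_name);
  return (len >= 0 && (size_t) len < size) ? 0 : -1;
}
```

Called with @file{/usr/local/share/locale}, @code{de},
@code{LC_MESSAGES}, and @code{test-package}, it produces the file name
shown above.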
1858
1859 In the above example we assumed the @code{LANGUAGE} environment
1860 variable to be @code{de}.  This might be an appropriate selection but what
1861 happens if the user wants to use @code{LC_ALL} because of the wider
1862 usability and here the required value is @code{de_DE.ISO-8859-1}?  We
1863 already mentioned above that a situation like this is not infrequent.
1864 E.g., a person might prefer reading a dialect and if this is not
1865 available fall back on the standard language.
1866
The @code{gettext} functions know about situations like this and can
handle them gracefully.  The functions recognize the format of the value
of the environment variable.  They can split the value into different
pieces and, by leaving out one or the other part, construct new
values.  This happens of course in a predictable way.  To understand
1872 this one must know the format of the environment variable value.  There
1873 is one more or less standardized form, originally from the X/Open
1874 specification:
1875
1876 @code{language[_territory[.codeset]][@@modifier]}
1877
Less specific locale names are derived by stripping components in the
order of the following list:
1880
1881 @enumerate
1882 @item
1883 @code{codeset}
1884 @item
1885 @code{normalized codeset}
1886 @item
1887 @code{territory}
1888 @item
1889 @code{modifier}
1890 @end enumerate
1891
1892 The @code{language} field will never be dropped for obvious reasons.
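
Before any component can be left out, the value has to be split into
its parts.  The following sketch shows one way to do this for the form
given above.  The structure and function names are hypothetical, the
fixed buffer sizes are an arbitrary choice, and the sketch does not
reproduce the order in which the library tries the reduced names.

```c
#include <string.h>

/* Components of a name of the form
   language[_territory[.codeset]][@modifier].
   Missing components are stored as empty strings.  */
struct locale_parts
{
  char language[32];
  char territory[32];
  char codeset[32];
  char modifier[32];
};

/* Copy LEN bytes starting at SRC into DST (capacity 32), truncating
   if necessary, and terminate the result.  */
static void
copy_part (char *dst, const char *src, size_t len)
{
  if (len > 31)
    len = 31;
  memcpy (dst, src, len);
  dst[len] = '\0';
}

/* Split NAME into its components.  */
void
split_locale_name (const char *name, struct locale_parts *parts)
{
  const char *p = name;
  size_t len;

  memset (parts, 0, sizeof *parts);

  /* The language runs up to the first '_', '.', or '@'.  */
  len = strcspn (p, "_.@");
  copy_part (parts->language, p, len);
  p += len;

  if (*p == '_')
    {
      ++p;
      len = strcspn (p, ".@");
      copy_part (parts->territory, p, len);
      p += len;
    }
  if (*p == '.')
    {
      ++p;
      len = strcspn (p, "@");
      copy_part (parts->codeset, p, len);
      p += len;
    }
  if (*p == '@')
    copy_part (parts->modifier, p + 1, strlen (p + 1));
}
```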
1893
1894 The only new thing is the @code{normalized codeset} entry.  This is
1895 another goodie which is introduced to help reduce the chaos which
1896 derives from the inability of people to standardize the names of
1897 character sets.  Instead of @w{ISO-8859-1} one can often see @w{8859-1},
1898 @w{88591}, @w{iso8859-1}, or @w{iso_8859-1}.  The @code{normalized
1899 codeset} value is generated from the user-provided character set name by
1900 applying the following rules:
1901
1902 @enumerate
1903 @item
1904 Remove all characters besides numbers and letters.
1905 @item
1906 Fold letters to lowercase.
1907 @item
If the name only contains digits, prepend the string @code{"iso"}.
1909 @end enumerate
1910
1911 @noindent
1912 So all of the above names will be normalized to @code{iso88591}.  This
1913 allows the program user much more freedom in choosing the locale name.
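
The three rules can be sketched in a few lines of C.  The function name
@code{normalize_codeset} and its buffer-based interface are assumptions
for this illustration and need not match what the library does
internally.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Write the normalized form of the codeset NAME into OUT, which has
   room for SIZE bytes.  Returns 0 on success, -1 if OUT is too small.  */
int
normalize_codeset (const char *name, char *out, size_t size)
{
  size_t kept = 0;
  int only_digits = 1;

  /* First pass: count the characters that survive rule 1 and check
     whether all of them are digits.  */
  for (const char *p = name; *p != '\0'; ++p)
    if (isalnum ((unsigned char) *p))
      {
        ++kept;
        if (!isdigit ((unsigned char) *p))
          only_digits = 0;
      }

  size_t prefix = only_digits ? 3 : 0;
  if (kept + prefix + 1 > size)
    return -1;

  /* Rule 3: prepend "iso" to names consisting only of digits.  */
  if (only_digits)
    memcpy (out, "iso", 3);

  /* Rules 1 and 2: keep letters and digits, folded to lowercase.  */
  char *q = out + prefix;
  for (const char *p = name; *p != '\0'; ++p)
    if (isalnum ((unsigned char) *p))
      *q++ = (char) tolower ((unsigned char) *p);
  *q = '\0';
  return 0;
}
```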
1914
1915 Even this extended functionality still does not help to solve the
1916 problem that completely different names can be used to denote the same
1917 locale (e.g., @code{de} and @code{german}).  To be of help in this
1918 situation the locale implementation and also the @code{gettext}
1919 functions know about aliases.
1920
1921 The file @file{/usr/share/locale/locale.alias} (replace @file{/usr} with
1922 whatever prefix you used for configuring the C library) contains a
1923 mapping of alternative names to more regular names.  The system manager
1924 is free to add new entries to fill her/his own needs.  The selected
1925 locale from the environment is compared with the entries in the first
1926 column of this file ignoring the case.  If they match, the value of the
1927 second column is used instead for the further handling.
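
The matching step can be sketched as a case-insensitive table lookup.
The table entries below are illustrative assumptions and need not match
the contents of the alias file on any particular system; the names
@code{locale_alias}, @code{example_aliases}, and @code{expand_alias} are
hypothetical.

```c
#include <stddef.h>
#include <strings.h>

/* One line of an alias table: alternative name and regular name.  */
struct locale_alias
{
  const char *alias;
  const char *value;
};

/* Illustrative entries only; the real file is maintained by the
   system manager and may differ.  */
static const struct locale_alias example_aliases[] =
{
  { "german", "de_DE.ISO-8859-1" },
  { "french", "fr_FR.ISO-8859-1" },
};

/* Return the regular name for NAME if it matches an alias, ignoring
   case; otherwise return NAME itself.  */
const char *
expand_alias (const char *name)
{
  size_t count = sizeof example_aliases / sizeof example_aliases[0];
  for (size_t i = 0; i < count; ++i)
    if (strcasecmp (name, example_aliases[i].alias) == 0)
      return example_aliases[i].value;
  return name;
}
```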
1928
1929 In the description of the format of the environment variables we already
1930 mentioned the character set as a factor in the selection of the message
1931 catalog.  In fact, only catalogs which contain text written using the
1932 character set of the system/program can be used (directly; there will
1933 come a solution for this some day).  This means for the user that s/he
1934 will always have to take care of this.  If in the collection of the
1935 message catalogs there are files for the same language but coded using
1936 different character sets the user has to be careful.
1937
1938
1939 @node Helper programs for gettext
1940 @subsection Programs to handle message catalogs for @code{gettext}
1941
1942 @Theglibc{} does not contain the source code for the programs to
1943 handle message catalogs for the @code{gettext} functions.  As part of
1944 the GNU project the GNU gettext package contains everything the
1945 developer needs.  The functionality provided by the tools in this
1946 package by far exceeds the abilities of the @code{gencat} program
1947 described above for the @code{catgets} functions.
1948
1949 There is a program @code{msgfmt} which is the equivalent program to the
1950 @code{gencat} program.  It generates from the human-readable and
1951 -editable form of the message catalog a binary file which can be used by
1952 the @code{gettext} functions.  But there are several more programs
1953 available.
1954
1955 The @code{xgettext} program can be used to automatically extract the
1956 translatable messages from a source file.  I.e., the programmer need not
1957 take care of the translations and the list of messages which have to be
translated.  S/He will simply wrap the translatable strings in calls to
@code{gettext} et al.@: and the rest will be done by @code{xgettext}.  This
1960 program has a lot of options which help to customize the output or
1961 help to understand the input better.
1962
1963 Other programs help to manage the development cycle when new messages appear
1964 in the source files or when a new translation of the messages appears.
1965 Here it should only be noted that using all the tools in GNU gettext it
1966 is possible to @emph{completely} automate the handling of message
1967 catalogs.  Besides marking the translatable strings in the source code and
1968 generating the translations the developers do not have anything to do
1969 themselves.