manual/message.texi

   1 @node Message Translation, Searching and Sorting, Locales, Top
   2 @c %MENU% How to make the program speak the user's language
   3 @chapter Message Translation
   4
   5 The program's interface with the user should be designed to ease the user's
   6 task.  One way to ease the user's task is to use messages in whatever
   7 language the user prefers.
   8
   9 Printing messages in different languages can be implemented in different
  10 ways.  One could add all the different languages in the source code and
  11 choose among the variants every time a message has to be printed.  This is
  12 certainly not a good solution since extending the set of languages is
  13 cumbersome (the code must be changed) and the code itself can become
  14 really big with dozens of message sets.
  15
  16 A better solution is to keep the message sets for each language
  17 in separate files which are loaded at runtime depending on the language
  18 selection of the user.
  19
  20 @Theglibc{} provides two different sets of functions to support
  21 message translation.  The problem is that neither of the interfaces is
  22 officially defined by the POSIX standard.  The @code{catgets} family of
  23 functions is defined in the X/Open standard but this is derived from
  24 industry decisions and therefore not necessarily based on reasonable
  25 decisions.
  26
  27 As mentioned above, the message catalog handling provides easy
  28 extendability by using external data files which contain the message
  29 translations.  I.e., these files contain for each of the messages used
  30 in the program a translation for the appropriate language.  So the tasks
  31 of the message handling functions are
  32
  33 @itemize @bullet
  34 @item
  35 locate the external data file with the appropriate translations
  36 @item
  37 load the data and make it possible to address the messages
  38 @item
  39 map a given key to the translated message
  40 @end itemize
  41
  42 The two approaches mainly differ in the implementation of this last
  43 step.  Decisions made in the last step influence the rest of the design.
  44
  45 @menu
  46 * Message catalogs a la X/Open::  The @code{catgets} family of functions.
  47 * The Uniforum approach::         The @code{gettext} family of functions.
  48 @end menu
  49
  50
  51 @node Message catalogs a la X/Open
  52 @section X/Open Message Catalog Handling
  53
  54 The @code{catgets} functions are based on the simple scheme:
  55
  56 @quotation
  57 Associate every message to translate in the source code with a unique
  58 identifier.  To retrieve a message from a catalog file solely the
  59 identifier is used.
  60 @end quotation
  61
  62 This means for the author of the program that s/he will have to make
  63 sure the meaning of the identifier in the program code and in the
  64 message catalogs is always the same.
  65
  66 Before a message can be translated the catalog file must be located.
  67 The user of the program must be able to guide the responsible function
  68 to find whatever catalog the user wants.  This choice is independent of
  69 what the programmer had in mind.
  70
  71 All the types, constants and functions for the @code{catgets} functions
  72 are defined/declared in the @file{nl_types.h} header file.
  73
  74 @menu
  75 * The catgets Functions::      The @code{catgets} function family.
  76 * The message catalog files::  Format of the message catalog files.
  77 * The gencat program::         How to generate message catalogs files which
  78                                 can be used by the functions.
  79 * Common Usage::               How to use the @code{catgets} interface.
  80 @end menu
  81
  82
  83 @node The catgets Functions
  84 @subsection The @code{catgets} function family
  85
  86 @deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
  87 @standards{X/Open, nl_types.h}
  88 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}}
  89 @c catopen @mtsenv @ascuheap @acsmem
  90 @c  strchr ok
  91 @c  setlocale(,NULL) ok
  92 @c  getenv @mtsenv
  93 @c  strlen ok
  94 @c  alloca ok
  95 @c  stpcpy ok
  96 @c  malloc @ascuheap @acsmem
  97 @c  __open_catalog @ascuheap @acsmem
  98 @c   strchr ok
  99 @c   open_not_cancel_2 @acsfd
 100 @c   strlen ok
 101 @c   ENOUGH ok
 102 @c    alloca ok
 103 @c    memcpy ok
 104 @c   fxstat64 ok
 105 @c   __set_errno ok
 106 @c   mmap @acsmem
 107 @c   malloc dup @ascuheap @acsmem
 108 @c   read_not_cancel ok
 109 @c   free dup @ascuheap @acsmem
 110 @c   munmap ok
 111 @c   close_not_cancel_no_status ok
 112 @c  free @ascuheap @acsmem
 113 The @code{catopen} function tries to locate the message data file named
 114 @var{cat_name} and loads it when found.  The return value is of an
 115 opaque type and can be used in calls to the other functions to refer to
 116 this loaded catalog.
 117
 118 The return value is @code{(nl_catd) -1} in case the function failed and
 119 no catalog was loaded.  The global variable @var{errno} contains a code
 120 for the error causing the failure.  But even if the function call
 121 succeeded this does not mean that all messages can be translated.
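
The check described above can be sketched as follows.  This is a minimal
illustration, not part of the interface: the helper name
@code{open_catalog} and its return convention are assumptions made for
this example.

```c
#include <errno.h>
#include <nl_types.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: try to open CAT_NAME and store the descriptor in
   *RESULT.  Returns 1 when a catalog was loaded and 0 otherwise.  */
int
open_catalog (const char *cat_name, nl_catd *result)
{
  *result = catopen (cat_name, NL_CAT_LOCALE);
  if (*result == (nl_catd) -1)
    {
      /* No catalog was loaded; errno describes the reason.  */
      fprintf (stderr, "cannot open catalog %s: %s\n",
               cat_name, strerror (errno));
      return 0;
    }
  return 1;
}
```

A caller can keep running after a failure, since the later lookups then
simply yield the untranslated strings.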
 122
 123 Locating the catalog file must happen in a way which lets the user of
 124 the program influence the decision.  It is up to the user to decide
 125 about the language to use and sometimes it is useful to use alternate
 126 catalog files.  All this can be specified by the user by setting some
 127 environment variables.
 128
 129 The first problem is to find out where all the message catalogs are
 130 stored.  Every program could have its own place to keep all the
 131 different files but usually the catalog files are grouped by languages
 132 and the catalogs for all programs are kept in the same place.
 133
 134 @cindex NLSPATH environment variable
 135 To tell the @code{catopen} function where the catalog for the program
 136 can be found the user can set the environment variable @code{NLSPATH} to
 137 a value which describes her/his choice.  Since this value must be usable
 138 for different languages and locales it cannot be a simple string.
 139 Instead it is a format string (similar to @code{printf}'s).  An example
 140 is
 141
 142 @smallexample
 143 /usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
 144 @end smallexample
 145
 146 First one can see that more than one directory can be specified (with
 147 the usual syntax of separating them by colons).  The next things to
 148 observe are the format elements, @code{%L} and @code{%N} in this case.
 149 The @code{catopen} function knows about several of them and each is
 150 replaced by a different value.
 151
 152 @table @code
 153 @item %N
 154 This format element is substituted with the name of the catalog file.
 155 This is the value of the @var{cat_name} argument given to
 156 @code{catopen}.
 157
 158 @item %L
 159 This format element is substituted with the name of the currently
 160 selected locale for translating messages.  How this is determined is
 161 explained below.
 162
 163 @item %l
 164 (This is the lowercase ell.) This format element is substituted with the
 165 language element of the locale name.  The string describing the selected
 166 locale is expected to have the form
 167 @code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
 168 first part @var{lang}.
 169
 170 @item %t
 171 This format element is substituted by the territory part @var{terr} of
 172 the name of the currently selected locale.  See the explanation of the
 173 format above.
 174
 175 @item %c
 176 This format element is substituted by the codeset part @var{codeset} of
 177 the name of the currently selected locale.  See the explanation of the
 178 format above.
 179
 180 @item %%
 181 Since @code{%} is used as a meta character there must be a way to
 182 express the @code{%} character in the result itself.  Using @code{%%}
 183 does this just like it works for @code{printf}.
 184 @end table
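
To make the substitution rules in the table concrete, here is a small
sketch that expands a single @code{NLSPATH} template.  It is
illustrative only and is not how @code{catopen} is implemented; the
function name @code{expand_nlspath_element}, its parameters, and the
choice to reject unknown format elements are assumptions of this
example.

```c
#include <stddef.h>
#include <string.h>

/* Append SRC to the NUL-terminated buffer DST of capacity CAP.
   Returns 0 on success and -1 if the result would not fit.  */
static int
append (char *dst, size_t cap, const char *src)
{
  size_t used = strlen (dst);
  size_t len = strlen (src);
  if (len >= cap - used)
    return -1;
  memcpy (dst + used, src, len + 1);
  return 0;
}

/* Illustrative only: expand one NLSPATH template TMPL into OUT.
   NAME replaces %N, LOCALE replaces %L, and LANG, TERR, and CODESET
   replace %l, %t, and %c.  Returns 0 on success, -1 on overflow or on
   an unknown format element.  */
int
expand_nlspath_element (const char *tmpl, const char *name,
                        const char *locale, const char *lang,
                        const char *terr, const char *codeset,
                        char *out, size_t cap)
{
  if (cap == 0)
    return -1;
  out[0] = '\0';
  for (const char *p = tmpl; *p != '\0'; ++p)
    {
      char lit[2] = { *p, '\0' };
      const char *piece = lit;   /* Ordinary characters are copied.  */
      if (*p == '%')
        {
          ++p;
          switch (*p)
            {
            case 'N': piece = name; break;
            case 'L': piece = locale; break;
            case 'l': piece = lang; break;
            case 't': piece = terr; break;
            case 'c': piece = codeset; break;
            case '%': piece = "%"; break;
            default: return -1;   /* Includes a trailing lone '%'.  */
            }
        }
      if (append (out, cap, piece) != 0)
        return -1;
    }
  return 0;
}
```

For the template @code{/usr/share/locale/%L/LC_MESSAGES/%N}, the catalog
name @code{hello.cat}, and the locale @code{de}, this sketch yields
@file{/usr/share/locale/de/LC_MESSAGES/hello.cat}.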
 185
 186
 187 Using @code{NLSPATH} allows arbitrary directories to be searched for
 188 message catalogs while still allowing different languages to be used.
 189 If the @code{NLSPATH} environment variable is not set, the default value
 190 is
 191
 192 @smallexample
 193 @var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
 194 @end smallexample
 195
 196 @noindent
 197 where @var{prefix} is given to @code{configure} while installing @theglibc{}
 198 (this value is in many cases @code{/usr} or the empty string).
 199
 200 The remaining problem is to decide which locale must be used.  Its name
 201 determines the substitution of the format elements mentioned above.
 202 First of all the user can specify a path in the message catalog name
 203 (i.e., the name contains a slash character).  In this situation the
 204 @code{NLSPATH} environment variable is not used.  The catalog must exist
 205 as specified in the program, perhaps relative to the current working
 206 directory.  This situation is not desirable and catalog names should
 207 never be written this way.  Besides, this behavior is not portable
 208 to all other platforms providing the @code{catgets} interface.
 209
 210 @cindex LC_ALL environment variable
 211 @cindex LC_MESSAGES environment variable
 212 @cindex LANG environment variable
 213 Otherwise the values of environment variables from the standard
 214 environment are examined (@pxref{Standard Environment}).  Which
 215 variables are examined is decided by the @var{flag} parameter of
 216 @code{catopen}.  If the value is @code{NL_CAT_LOCALE} (which is defined
 217 in @file{nl_types.h}) then the @code{catopen} function uses the name of
 218 the locale currently selected for the @code{LC_MESSAGES} category.
 219
 220 If @var{flag} is zero the @code{LANG} environment variable is examined.
 221 This is a left-over from the early days when the concept of locales
 222 had not even reached the level of POSIX locales.
 223
 224 The environment variable and the locale name should have a value of the
 225 form @code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above.
 226 If no environment variable is set the @code{"C"} locale is used which
 227 prevents any translation.
 228
 229 The return value of @code{catgets} is in any case a valid string.  Either
 230 it is a translation from a message catalog or it is the same as the
 231 @var{string} parameter.  So a piece of code to decide whether a
 232 translation actually happened must look like this:
 233
 234 @smallexample
 235 @{
 236   char *trans = catgets (desc, set, msg, input_string);
 237   if (trans == input_string)
 238     @{
 239       /* Something went wrong.  */
 240     @}
 241 @}
 242 @end smallexample
 243
 244 @noindent
 245 When an error occurs the global variable @var{errno} is set to
 246
 247 @table @var
 248 @item EBADF
 249 The catalog does not exist.
 250 @item ENOMSG
 251 The set/message tuple does not name an existing element in the
 252 message catalog.
 253 @end table
 254
 255 While it sometimes can be useful to test for errors programs normally
 256 will avoid any test.  If the translation is not available it is no big
 257 problem if the original, untranslated message is printed.  Either the
 258 user understands this as well or s/he will look for the reason why the
 259 messages are not translated.
 260 @end deftypefun
 261
 262 Please note that the currently selected locale does not depend on a call
 263 to the @code{setlocale} function.  It is not necessary that the locale
 264 data files for this locale exist and calling @code{setlocale} succeeds.
 265 The @code{catopen} function directly reads the values of the environment
 266 variables.
 267
 268
 269 @deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
 270 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
 271 The function @code{catgets} has to be used to access the message catalog
 272 previously opened using the @code{catopen} function.  The
 273 @var{catalog_desc} parameter must be a value previously returned by
 274 @code{catopen}.
 275
 276 The next two parameters, @var{set} and @var{message}, reflect the
 277 internal organization of the message catalog files.  This will be
 278 explained in detail below.  For now it is interesting to know that a
 279 catalog can consist of several sets and the messages in each set are
 280 individually numbered.  Neither the set numbers nor the
 281 message numbers need be consecutive.  They can be arbitrarily chosen.
 282 But each message (unless equal to another one) must have its own unique
 283 pair of set and message numbers.
 284
 285 Since it is not guaranteed that the message catalog for the language
 286 selected by the user exists the last parameter @var{string} helps to
 287 handle this case gracefully.  If no matching string can be found
 288 @var{string} is returned.  This means for the programmer that
 289
 290 @itemize @bullet
 291 @item
 292 the @var{string} parameters should contain reasonable text (this also
 293 helps to understand the program since otherwise there would be no hint
 294 on the string which is expected to be returned).
 295 @item
 296 all @var{string} arguments should be written in the same language.
 297 @end itemize
 298 @end deftypefun
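
The fallback behavior can be wrapped in a small helper.  The following
is only a sketch: the name @code{lookup_message} and the
@var{translated} output parameter are inventions of this example.  It
relies on the fact, stated above, that @code{catgets} returns the
@var{string} argument itself when no matching message is found.

```c
#include <nl_types.h>

/* Hypothetical wrapper: look up message MSG of set SET in CATALOG and
   store in *TRANSLATED whether a catalog entry was found.  The pointer
   comparison works because catgets returns FALLBACK itself on failure.  */
const char *
lookup_message (nl_catd catalog, int set, int msg, const char *fallback,
                int *translated)
{
  char *text = catgets (catalog, set, msg, fallback);
  *translated = text != fallback;
  return text;
}
```

Because the fallback is what the user sees when no catalog is
available, every @var{fallback} argument should be readable text in one
common language.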
 299
 300 It is somewhat uncomfortable to write a program using the @code{catgets}
 301 functions if no supporting functionality is available.  Since each
 302 set/message number tuple must be unique the programmer must keep lists
 303 of the messages at the same time the code is written.  And the work
 304 between several people working on the same project must be coordinated.
 305 We will see how some of these problems can be relaxed a bit (@pxref{Common
 306 Usage}).
 307
 308 @deftypefun int catclose (nl_catd @var{catalog_desc})
 309 @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acucorrupt{} @acsmem{}}}
 310 @c catclose @ascuheap @acucorrupt @acsmem
 311 @c  __set_errno ok
 312 @c  munmap ok
 313 @c  free @ascuheap @acsmem
 314 The @code{catclose} function can be used to free the resources
 315 associated with a message catalog which previously was opened by a call
 316 to @code{catopen}.  If the resources can be successfully freed the
 317 function returns @code{0}.  Otherwise it returns @code{@minus{}1} and the
 318 global variable @var{errno} is set.  Errors can occur if the catalog
 319 descriptor @var{catalog_desc} is not valid in which case @var{errno} is
 320 set to @code{EBADF}.
 321 @end deftypefun
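
A cautious way to release a catalog might look like the sketch below.
The helper name @code{close_catalog} is hypothetical, and skipping the
call for a descriptor of @code{(nl_catd) -1} is a design choice of this
example rather than a requirement of the interface.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: release CATALOG if it was opened successfully.
   Returns 0 on success or when there is nothing to close, -1 on error.  */
int
close_catalog (nl_catd catalog)
{
  if (catalog == (nl_catd) -1)
    return 0;   /* catopen failed earlier; nothing to release.  */
  if (catclose (catalog) != 0)
    {
      perror ("catclose");
      return -1;
    }
  return 0;
}
```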
 322
 323
 324 @node The message catalog files
 325 @subsection  Format of the message catalog files
 326
 327 The only reasonable way to translate all the messages of a program and
 328 store the result in a message catalog file which can be read by the
 329 @code{catopen} function is to hand all the message text to the
 330 translator and let her/him translate them all.  I.e., we must have a
 331 file with entries which associate the set/message tuple with a specific
 332 translation.  This file format is specified in the X/Open standard and
 333 is as follows:
 334
 335 @itemize @bullet
 336 @item
 337 Lines containing only whitespace characters or empty lines are ignored.
 338
 339 @item
 340 Lines which contain as the first non-whitespace character a @code{$}
 341 followed by a whitespace character are comments and are also ignored.
 342
 343 @item
 344 If a line contains as the first non-whitespace characters the sequence
 345 @code{$set} followed by a whitespace character an additional argument
 346 is required to follow.  This argument can either be:
 347
 348 @itemize @minus
 349 @item
 350 a number.  In this case the value of this number determines the set
 351 to which the following messages are added.
 352
 353 @item
 354 an identifier consisting of alphanumeric characters plus the underscore
 355 character.  In this case the set is automatically assigned a number.
 356 This value is one more than the largest set number which has appeared so far.
 357
 358 How to use the symbolic names is explained in section @ref{Common Usage}.
 359
 360 It is an error if a symbol name appears more than once.  All following
 361 messages are placed in a set with this number.
 362 @end itemize
 363
 364 @item
 365 If a line contains as the first non-whitespace characters the sequence
 366 @code{$delset} followed by a whitespace character an additional argument
 367 is required to follow.  This argument can either be:
 368
 369 @itemize @minus
 370 @item
 371 a number.  In this case the value of this number determines the set
 372 which will be deleted.
 373
 374 @item
 375 an identifier consisting of alphanumeric characters plus the underscore
 376 character.  This symbolic identifier must match a name for a set which
 377 previously was defined.  It is an error if the name is unknown.
 378 @end itemize
 379
 380 In both cases all messages in the specified set will be removed.  They
 381 will not appear in the output.  But if this set is selected again later
 382 with a @code{$set} command, messages can be added again and these
 383 messages will appear in the output.
 384
 385 @item
 386 If a line contains after leading whitespaces the sequence
 387 @code{$quote}, the quoting character used for this input file is
 388 changed to the first non-whitespace character following
 389 @code{$quote}.  If no non-whitespace character is present before the
 390 line ends quoting is disabled.
 391
 392 By default no quoting character is used.  In this mode strings are
 393 terminated with the first unescaped line break.  If there is a
 394 @code{$quote} sequence present newline need not be escaped.  Instead a
 395 string is terminated with the first unescaped appearance of the quote
 396 character.
 397
 398 A common usage of this feature would be to set the quote character to
 399 @code{"}.  Then any appearance of the @code{"} in the strings must
 400 be escaped using the backslash (i.e., @code{\"} must be written).
 401
 402 @item
 403 Any other line must start with a number or an alphanumeric identifier
 404 (with the underscore character included).  The following characters
 405 (starting after the first whitespace character) will form the string
 406 which gets associated with the currently selected set and the message
 407 number represented by the number or identifier, respectively.
 408
 409 If the start of the line is a number the message number is obvious.  It
 410 is an error if the same message number already appeared for this set.
 411
 412 If the leading token was an identifier the message number gets
 413 automatically assigned.  The value is the current maximum message
 414 number for this set plus one.  It is an error if the identifier was
 415 already used for a message in this set.  It is OK to reuse the
 416 identifier for a message in another set.  How to use the symbolic
 417 identifiers will be explained below (@pxref{Common Usage}).  There is
 418 one limitation with the identifier: it must not be @code{Set}.  The
 419 reason will be explained below.
 420
 421 The text of the messages can contain escape characters.  The usual bunch
 422 of characters known from the @w{ISO C} language are recognized
 423 (@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
 424 @code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
 425 a character code).
 426 @end itemize
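
The escape sequences listed in the last item can be decoded along the
following lines.  This is an illustrative sketch and not the code used
by @code{gencat}; the function name @code{decode_escapes} is
hypothetical, and passing through the character after an unknown
escape (which also covers an escaped quote character) is an assumption
of this example.

```c
#include <stddef.h>

/* Illustrative decoder for the escapes listed above; not gencat's code.
   Copies SRC to DST (capacity CAP), replacing escape sequences.
   Returns the decoded length, or (size_t) -1 if DST is too small.  */
size_t
decode_escapes (const char *src, char *dst, size_t cap)
{
  size_t n = 0;
  while (*src != '\0')
    {
      char ch = *src++;
      if (ch == '\\' && *src != '\0')
        {
          char e = *src++;
          switch (e)
            {
            case 'n': ch = '\n'; break;
            case 't': ch = '\t'; break;
            case 'v': ch = '\v'; break;
            case 'b': ch = '\b'; break;
            case 'r': ch = '\r'; break;
            case 'f': ch = '\f'; break;
            case '\\': ch = '\\'; break;
            default:
              if (e >= '0' && e <= '7')
                {
                  /* Up to three octal digits form one character code.  */
                  unsigned int value = (unsigned int) (e - '0');
                  for (int i = 0; i < 2 && *src >= '0' && *src <= '7'; ++i)
                    value = value * 8 + (unsigned int) (*src++ - '0');
                  ch = (char) value;
                }
              else
                ch = e;   /* Assumption: unknown escapes yield the character.  */
              break;
            }
        }
      /* Keep room for this character and the terminating NUL.  */
      if (n + 1 >= cap)
        return (size_t) -1;
      dst[n++] = ch;
    }
  if (cap == 0)
    return (size_t) -1;
  dst[n] = '\0';
  return n;
}
```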
 427
 428 @strong{Important:} The handling of identifiers instead of numbers for
 429 the set and messages is a GNU extension.  Systems strictly following the
 430 X/Open specification do not have this feature.  An example for a message
 431 catalog file is this:
 432
 433 @smallexample
 434 $ This is a leading comment.
 435 $quote "
 436
 437 $set SetOne
 438 1 Message with ID 1.
 439 two "   Message with ID \"two\", which gets the value 2 assigned"
 440
 441 $set SetTwo
 442 $ Since the last set got the number 1 assigned this set has number 2.
 443 4000 "The numbers can be arbitrary, they need not start at one."
 444 @end smallexample
 445
 446 This small example shows various aspects:
 447 @itemize @bullet
 448 @item
 449 Lines 1 and 9 are comments since they start with @code{$} followed by
 450 a whitespace.
 451 @item
 452 The quoting character is set to @code{"}.  Otherwise the quotes in the
 453 message definition would have to be omitted and in this case the
 454 message with the identifier @code{two} would lose its leading whitespace.
 455 @item
 456 Mixing numbered messages with messages having symbolic names is no
 457 problem and the numbering happens automatically.
 458 @end itemize
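
The automatic numbering of symbolic names follows the rule given
earlier: a new name gets one more than the largest number used so far.
A minimal sketch of that rule, with the hypothetical function name
@code{next_symbolic_number}, could be:

```c
#include <stddef.h>

/* Illustrative only: the number assigned to a symbolic set or message
   name is one more than the largest number used so far.  USED holds the
   COUNT numbers already assigned; with none assigned the result is 1.  */
int
next_symbolic_number (const int *used, size_t count)
{
  int largest = 0;
  for (size_t i = 0; i < count; ++i)
    if (used[i] > largest)
      largest = used[i];
  return largest + 1;
}
```

In the example catalog, set @code{SetOne} already contains message 1
when the identifier @code{two} is read, so @code{two} is assigned the
number 2.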
 459
 460
 461 While this file format is pretty easy it is not the best possible for
 462 use in a running program.  The @code{catopen} function would have to
 463 parse the file and handle syntactic errors gracefully.  This is not so
 464 easy and the whole process is pretty slow.  Therefore the @code{catgets}
 465 functions expect the data in another more compact and ready-to-use file
 466 format.  There is a special program @code{gencat} which is explained in
 467 detail in the next section.
 468
 469 Files in this other format are not human readable.  To be easy to use by
 470 programs it is a binary file.  But the format is byte order independent
 471 so translation files can be shared by systems of arbitrary architecture
 472 (as long as they use @theglibc{}).
 473
 474 Details about the binary file format are not important to know since
 475 these files are always created by the @code{gencat} program.  The
 476 sources of @theglibc{} also provide the sources for the
 477 @code{gencat} program and so the interested reader can look through
 478 these source files to learn about the file format.
 479
 480
 481 @node The gencat program
 482 @subsection Generate Message Catalogs files
 483
 484 @cindex gencat
 485 The @code{gencat} program is specified in the X/Open standard and the
 486 GNU implementation follows this specification and so processes
 487 all correctly formed input files.  Additionally some extensions are
 488 implemented which help to work in a more reasonable way with the
 489 @code{catgets} functions.
 490
 491 The @code{gencat} program can be invoked in two ways:
 492
 493 @example
 494 `gencat [@var{Option} @dots{}] [@var{Output-File} [@var{Input-File} @dots{}]]`
 495 @end example
 496
 497 This is the interface defined in the X/Open standard.  If no
 498 @var{Input-File} parameter is given, input will be read from standard
 499 input.  Multiple input files will be read as if they were concatenated.
 500 If @var{Output-File} is also missing, the output will be written to
 501 standard output.  To provide the interface one is used to from other
 502 programs a second interface is provided.
 503
 504 @smallexample
 505 `gencat [@var{Option} @dots{}] -o @var{Output-File} [@var{Input-File} @dots{}]`
 506 @end smallexample
 507
 508 The option @samp{-o} is used to specify the output file and all file
 509 arguments are used as input files.
 510
 511 Besides this one can use @file{-} or @file{/dev/stdin} for
 512 @var{Input-File} to denote the standard input.  Correspondingly one can
 513 use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
 514 standard output.  Using @file{-} as a file name is allowed in X/Open
 515 while using the device names is a GNU extension.
 516
 517 The @code{gencat} program works by concatenating all input files and
 518 then @strong{merging} the resulting collection of message sets with a
 519 possibly existing output file.  This is done by removing all messages
 520 with set/message number tuples matching any of the generated messages
 521 from the output file and then adding all the new messages.  To
 522 regenerate a catalog file while ignoring the old contents therefore
 523 requires removing the output file if it exists.  If the output is
 524 written to standard output no merging takes place.
 525
 526 @noindent
 527 The following table shows the options understood by the @code{gencat}
 528 program.  The X/Open standard does not specify any options for the
 529 program so all of these are GNU extensions.
 530
 531 @table @samp
 532 @item -V
 533 @itemx --version
 534 Print the version information and exit.
 535 @item -h
 536 @itemx --help
 537 Print a usage message listing all available options, then exit successfully.
 538 @item --new
 539 Do not merge the new messages from the input files with the old content
 540 of the output file.  The old content of the output file is discarded.
 541 @item -H
 542 @itemx --header=@var{name}
 543 This option is used to emit the symbolic names given to sets and
 544 messages in the input files for use in the program.  Details about how
 545 to use this are given in the next section.  The @var{name} parameter to
 546 this option specifies the name of the output file.  It will contain a
 547 number of C preprocessor @code{#define}s to associate a name with a
 548 number.
 549
 550 Please note that the generated file only contains the symbols from the
 551 input files.  If the output is merged with the previous content of the
 552 output file the possibly existing symbols from the file(s) which
 553 generated the old output files are not in the generated header file.
 554 @end table
 555
 556
 557 @node Common Usage
 558 @subsection How to use the @code{catgets} interface
 559
 560 The @code{catgets} functions can be used in two different ways: by
 561 following the X/Open specification strictly and not relying on the
 562 extensions, or by using the GNU extensions.  We will take a look at the
 563 former method first to understand the benefits of the extensions.
 564
 565 @subsubsection Not using symbolic names
 566
 567 Since the X/Open format of the message catalog files does not allow
 568 symbol names we have to work with numbers all the time.  When we start
 569 writing a program we have to replace all appearances of translatable
 570 strings with something like
 571
 572 @smallexample
 573 catgets (catdesc, set, msg, "string")
 574 @end smallexample
 575
 576 @noindent
 577 @code{catdesc} is retrieved from a call to @code{catopen} which is
 578 normally done once at the program start.  The @code{"string"} is the
 579 string we want to translate.  The problems start with the set and
 580 message numbers.
 581
 582 In a bigger program several programmers usually work at the same time on
 583 the program and so coordinating the number allocation is crucial.
 584 Though no two different strings may be indexed by the same tuple of
 585 numbers it is highly desirable to reuse the numbers for equal strings
 586 with equal translations (please note that there might be strings which
 587 are equal in one language but have different translations due to
 588 different contexts).
 589
 590 The allocation process can be relaxed a bit by using different set
 591 numbers for different parts of the program.  So the number of developers
 592 who have to coordinate the allocation can be reduced.  But lists must
 593 still be kept to track the allocation and errors can easily happen.  These errors
 594 cannot be discovered by the compiler or the @code{catgets} functions.
 595 Only the user of the program might see wrong messages printed.  In the
 596 worst cases the messages are so misleading that they cannot be
 597 recognized as wrong.  Think about the translations for @code{"true"} and
 598 @code{"false"} being exchanged.  This could result in a disaster.
 599
 600
 601 @subsubsection Using symbolic names
 602
 603 The problems mentioned in the last section derive from the fact that:
 604
 605 @enumerate
 606 @item
 607 the numbers are allocated once and due to the possibly frequent use of
 608 them it is difficult to change a number later.
 609 @item
 610 the numbers do not allow guessing anything about the string and
 611 therefore collisions can easily happen.
 612 @end enumerate
 613
 614 By consistently using symbolic names and by providing a method which maps
 615 the string content to a symbolic name (however this will happen) one can
 616 prevent both problems above.  The cost of this is that the programmer
 617 has to write a complete message catalog file while s/he is writing the
 618 program itself.
 619
 620 This is necessary since the symbolic names must be mapped to numbers
 621 before the program sources can be compiled.  In the last section it was
 622 described how to generate a header containing the mapping of the names.
 623 E.g., for the example message file given in the last section we could
 624 call the @code{gencat} program as follows (assume @file{ex.msg} contains
 625 the sources).
 626
 627 @smallexample
 628 gencat -H ex.h -o ex.cat ex.msg
 629 @end smallexample
 630
 631 @noindent
 632 This generates a header file with the following content:
 633
 634 @smallexample
 635 #define SetTwoSet 0x2   /* ex.msg:8 */
 636
 637 #define SetOneSet 0x1   /* ex.msg:4 */
 638 #define SetOnetwo 0x2   /* ex.msg:6 */
 639 @end smallexample
 640
 641 As can be seen the various symbols given in the source file are mangled
 642 to generate unique identifiers and these identifiers get numbers
 643 assigned.  Reading the source file and knowing about the rules
 644 allows one to predict the content of the header file (it is deterministic)
 645 but this is not necessary.  The @code{gencat} program can take care of
 646 everything.  All the programmer has to do is to put the generated header
 647 file in the dependency list of the source files of her/his project and
 648 add a rule to regenerate the header if any of the input files change.
 649
 650 One word about the symbol mangling.  Every symbol consists of two parts:
 651 the name of the message set plus the name of the message or the special
 652 string @code{Set}.  So @code{SetOnetwo} means this macro can be used to
 653 access the translation with identifier @code{two} in the message set
 654 @code{SetOne}.
 655
 656 The other names denote the names of the message sets.  The special
 657 string @code{Set} is used in the place of the message identifier.
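
The mangling rule just described is simple enough to sketch in a few
lines.  The function below is illustrative and not taken from
@code{gencat}; the name @code{mangled_name} and the convention of
passing a null pointer for the set's own macro are assumptions of this
example.

```c
#include <stdio.h>

/* Illustrative only: build the macro name the text describes for message
   MSG_NAME in set SET_NAME.  Pass NULL as MSG_NAME to get the name of the
   set itself.  Returns 0 on success, -1 if OUT (capacity CAP) is too small.  */
int
mangled_name (const char *set_name, const char *msg_name,
              char *out, size_t cap)
{
  int len = snprintf (out, cap, "%s%s", set_name,
                      msg_name != NULL ? msg_name : "Set");
  return (len >= 0 && (size_t) len < cap) ? 0 : -1;
}
```

The sketch also shows why a message identifier must not be @code{Set}:
its macro name would collide with the one generated for the set itself.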
 658
 659 If in the code the second string of the set @code{SetOne} is used the C
 660 code should look like this:
 661
 662 @smallexample
 663 catgets (catdesc, SetOneSet, SetOnetwo,
 664          "   Message with ID \"two\", which gets the value 2 assigned")
 665 @end smallexample
 666
 667 Writing the call this way allows changing the message number
 668 and even the set number without requiring any change in the C source
 669 code.  (The text of the string is normally not the same; this is only
 670 for this example.)
 671
 672
 673 @subsubsection How this supports development
 674
 675 To illustrate the usual way to work with the symbolic names,
 676 here is a little example.  Assume we want to write the very complex and
 677 famous greeting program.  We start by writing the code as usual:
 678
 679 @smallexample
 680 #include <stdio.h>
 681 int
 682 main (void)
 683 @{
 684   printf ("Hello, world!\n");
 685   return 0;
 686 @}
 687 @end smallexample
 688
 689 Now we want to internationalize the message and therefore replace the
 690 message with whatever the user wants.
 691
 692 @smallexample
 693 #include <nl_types.h>
 694 #include <stdio.h>
 695 #include "msgnrs.h"
 696 int
 697 main (void)
 698 @{
 699   nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
 700   printf ("%s", catgets (catdesc, SetMainSet, SetMainHello,
 701                          "Hello, world!\n"));
 702   catclose (catdesc);
 703   return 0;
 704 @}
 705 @end smallexample
 706
 707 We see how the catalog object is opened and the returned descriptor used
 708 in the other function calls.  It is not really necessary to check for
 709 failure of any of the functions since even in these situations the
 710 functions will behave reasonably.  They will simply return the
 711 untranslated message.
 712
 713 What remains unspecified here are the constants @code{SetMainSet} and
 714 @code{SetMainHello}.  These are the symbolic names describing the
 715 message.  To get the actual definitions which match the information in
 716 the catalog file we have to create the message catalog source file and
 717 process it using the @code{gencat} program.
 718
 719 @smallexample
 720 $ Messages for the famous greeting program.
 721 $quote "
 722
 723 $set Main
 724 Hello "Hallo, Welt!\n"
 725 @end smallexample
 726
 727 Now we can start building the program (assume the message catalog source
 728 file is named @file{hello.msg} and the program source file @file{hello.c}):
 729
 730 @smallexample
 731 % gencat -H msgnrs.h -o hello.cat hello.msg
 732 % cat msgnrs.h
 733 #define MainSet 0x1     /* hello.msg:4 */
 734 #define MainHello 0x1   /* hello.msg:5 */
 735 % gcc -o hello hello.c -I.
 736 % cp hello.cat /usr/share/locale/de/LC_MESSAGES
 737 % echo $LC_ALL
 738 de
 739 % ./hello
 740 Hallo, Welt!
 741 %
 742 @end smallexample
 743
The call of the @code{gencat} program creates the missing header file
@file{msgnrs.h} as well as the message catalog binary.  The former is
used in the compilation of @file{hello.c} while the latter is placed in a
directory in which the @code{catopen} function will try to locate it.
Please check the @code{LC_ALL} environment variable and the default path
for @code{catopen} presented in the description above.
 750
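The tolerance for failures mentioned in the discussion of the example
program can be made explicit.  The following is only a sketch: the
helper @code{print_message} is a hypothetical name which is not part of
the @code{catgets} interface, and it assumes the caller supplies a
sensible default string.

@smallexample
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: print message MSG of set SET from the catalog
   NAME, or DFLT if the catalog or the message is not available.
   Returns nonzero if the catalog could be opened.  */
int
print_message (const char *name, int set, int msg, const char *dflt)
@{
  nl_catd catdesc = catopen (name, NL_CAT_LOCALE);
  int opened = catdesc != (nl_catd) -1;

  /* catgets returns DFLT for an invalid descriptor or a missing
     message, so no special case is needed here.  */
  fputs (catgets (catdesc, set, msg, dflt), stdout);

  if (opened)
    catclose (catdesc);
  return opened;
@}
@end smallexample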
 751
 752 @node The Uniforum approach
 753 @section The Uniforum approach to Message Translation
 754
Sun Microsystems tried to standardize a different approach to message
translation in the Uniforum group.  No real standard was ever defined,
but the interface was nevertheless used in Sun's operating systems.
Since this approach fits better into the development process of free
software, it is also used throughout the GNU project, and the GNU
@file{gettext} package provides support for it outside @theglibc{}.
 761
The code of @file{libintl} from GNU @file{gettext} is the same as
the code in @theglibc{}, so the documentation in the GNU
@file{gettext} manual is also valid for the functionality here.  The
following text describes the library functions in detail.  The
numerous helper programs are not described in this manual; we only give
a short overview.  For details, read the GNU @file{gettext} manual
(@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
 770
Though the @code{catgets} functions are available by default on more
systems, the @code{gettext} interface is at least as portable.  The GNU
@file{gettext} package can be used wherever the functions are not
available.
 775
 776
 777 @menu
 778 * Message catalogs with gettext::  The @code{gettext} family of functions.
 779 * Helper programs for gettext::    Programs to handle message catalogs
 780                                     for @code{gettext}.
 781 @end menu
 782
 783
 784 @node Message catalogs with gettext
 785 @subsection The @code{gettext} family of functions
 786
The paradigm underlying the @code{gettext} approach to message
translation is different from that of the @code{catgets} functions, but
the basic functionality is equivalent.  There are functions of the
following categories:
 791
 792 @menu
 793 * Translation with gettext::       What has to be done to translate a message.
 794 * Locating gettext catalog::       How to determine which catalog to be used.
 795 * Advanced gettext functions::     Additional functions for more complicated
 796                                     situations.
 797 * Charset conversion in gettext::  How to specify the output character set
 798                                     @code{gettext} uses.
 799 * GUI program problems::           How to use @code{gettext} in GUI programs.
 800 * Using gettextized software::     The possibilities of the user to influence
 801                                     the way @code{gettext} works.
 802 @end menu
 803
 804 @node Translation with gettext
 805 @subsubsection What has to be done to translate a message?
 806
 807 The @code{gettext} functions have a very simple interface.  The most
 808 basic function just takes the string which shall be translated as the
 809 argument and it returns the translation.  This is fundamentally
 810 different from the @code{catgets} approach where an extra key is
 811 necessary and the original string is only used for the error case.
 812
 813 If the string which has to be translated is the only argument this of
 814 course means the string itself is the key.  I.e., the translation will
 815 be selected based on the original string.  The message catalogs must
 816 therefore contain the original strings plus one translation for any such
string.  The task of the @code{gettext} function is to compare the
argument string with the available strings in the catalog and return the
appropriate translation.  Of course this lookup is optimized so that
it is not more expensive than an access using an atomic key
as in @code{catgets}.
 822
 823 The @code{gettext} approach has some advantages but also some
 824 disadvantages.  Please see the GNU @file{gettext} manual for a detailed
 825 discussion of the pros and cons.
 826
 827 All the definitions and declarations for @code{gettext} can be found in
 828 the @file{libintl.h} header file.  On systems where these functions are
 829 not part of the C library they can be found in a separate library named
 830 @file{libintl.a} (or accordingly different for shared libraries).
 831
 832 @deftypefun {char *} gettext (const char *@var{msgid})
 833 @standards{GNU, libintl.h}
 834 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
 835 @c Wrapper for dcgettext.
 836 The @code{gettext} function searches the currently selected message
 837 catalogs for a string which is equal to @var{msgid}.  If there is such a
 838 string available it is returned.  Otherwise the argument string
 839 @var{msgid} is returned.
 840
 841 Please note that although the return value is @code{char *} the
 842 returned string must not be changed.  This broken type results from the
 843 history of the function and does not reflect the way the function should
 844 be used.
 845
 846 Please note that above we wrote ``message catalogs'' (plural).  This is
 847 a specialty of the GNU implementation of these functions and we will
 848 say more about this when we talk about the ways message catalogs are
 849 selected (@pxref{Locating gettext catalog}).
 850
 851 The @code{gettext} function does not modify the value of the global
 852 @var{errno} variable.  This is necessary to make it possible to write
 853 something like
 854
 855 @smallexample
 856   printf (gettext ("Operation failed: %m\n"));
 857 @end smallexample
 858
Here the @var{errno} value is used in the @code{printf} function while
processing the @code{%m} format element.  Since @code{gettext} is
called before @code{printf}, we would get a wrong message if
@code{gettext} changed this value.
 863
Because @var{errno} is left untouched, there is no easy way to detect a
missing message catalog besides comparing the argument string with the
result.  But it is normally the task of the user to react to missing
catalogs.  The program cannot guess when a message catalog is really
necessary since, for a user who speaks the language the program was
developed in, the message does not need any translation.
 869 @end deftypefun
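
The comparison of the argument with the result can be wrapped up as in
the following sketch.  The helper name @code{translate_checked} is
hypothetical; the code relies only on the behavior described above,
namely that @code{gettext} returns its argument when no translation is
found.

@smallexample
#include <libintl.h>

/* Hypothetical helper: translate MSGID and record in *FOUND whether
   a translation was actually found.  gettext returns MSGID itself
   when no catalog entry matches, so comparing pointers suffices.  */
const char *
translate_checked (const char *msgid, int *found)
@{
  const char *result = gettext (msgid);
  *found = result != msgid;
  return result;
@}
@end smallexample

A zero value in @code{*found} does not necessarily indicate a problem:
for a user who reads the original language no catalog is needed at all.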
 870
The remaining two functions to access the message catalog add some
functionality to select a message catalog which is not the default one.
This is important if parts of the program are developed independently.
Every part can have its own message catalog and all of them can be used
at the same time.  The C library itself is an example: internally it
uses the @code{gettext} functions, but since it must not depend on a
currently selected default message catalog it must explicitly specify
all information that would otherwise be ambiguous.
 879
 880 @deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid})
 881 @standards{GNU, libintl.h}
 882 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
 883 @c Wrapper for dcgettext.
 884 The @code{dgettext} function acts just like the @code{gettext}
 885 function.  It only takes an additional first argument @var{domainname}
 886 which guides the selection of the message catalogs which are searched
 887 for the translation.  If the @var{domainname} parameter is the null
 888 pointer the @code{dgettext} function is exactly equivalent to
 889 @code{gettext} since the default value for the domain name is used.
 890
 891 As for @code{gettext} the return value type is @code{char *} which is an
 892 anachronism.  The returned string must never be modified.
 893 @end deftypefun
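
A library which ships its own catalogs might wrap its lookups as in the
following sketch.  The domain name @code{libfoo} and the function
@code{libfoo_strerror} are made up for this illustration; the point is
only that every lookup names the domain explicitly and therefore does
not depend on the default domain chosen by the application.

@smallexample
#include <libintl.h>

/* Hypothetical domain name for a library called "libfoo".  */
#define LIBFOO_DOMAIN "libfoo"

/* Return a translated description of the (made-up) error CODE.
   The returned string must not be modified.  */
const char *
libfoo_strerror (int code)
@{
  switch (code)
    @{
    case 0:
      return dgettext (LIBFOO_DOMAIN, "no error");
    case 1:
      return dgettext (LIBFOO_DOMAIN, "value out of range");
    default:
      return dgettext (LIBFOO_DOMAIN, "unknown error");
    @}
@}
@end smallexample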
 894
 895 @deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category})
 896 @standards{GNU, libintl.h}
 897 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
 898 @c dcgettext @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 899 @c  dcigettext @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 900 @c   libc_rwlock_rdlock @asulock @aculock
 901 @c   current_locale_name ok [protected from @mtslocale]
 902 @c   tfind ok
 903 @c   libc_rwlock_unlock ok
 904 @c   plural_lookup ok
 905 @c    plural_eval ok
 906 @c    rawmemchr ok
 907 @c   DETERMINE_SECURE ok, nothing
 908 @c   strcmp ok
 909 @c   strlen ok
 910 @c   getcwd @ascuheap @acsmem @acsfd
 911 @c   strchr ok
 912 @c   stpcpy ok
 913 @c   category_to_name ok
 914 @c   guess_category_value @mtsenv
 915 @c    getenv @mtsenv
 916 @c    current_locale_name dup ok [protected from @mtslocale by dcigettext]
 917 @c    strcmp ok
 918 @c   ENABLE_SECURE ok
 919 @c   _nl_find_domain @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 920 @c    libc_rwlock_rdlock dup @asulock @aculock
 921 @c    _nl_make_l10nflist dup @ascuheap @acsmem
 922 @c    libc_rwlock_unlock dup ok
 923 @c    _nl_load_domain @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 924 @c     libc_lock_lock_recursive @aculock
 925 @c     libc_lock_unlock_recursive @aculock
 926 @c     open->open_not_cancel_2 @acsfd
 927 @c     fstat ok
 928 @c     mmap dup @acsmem
 929 @c     close->close_not_cancel_no_status @acsfd
 930 @c     malloc dup @ascuheap @acsmem
 931 @c     read->read_not_cancel ok
 932 @c     munmap dup @acsmem
 933 @c     W dup ok
 934 @c     strlen dup ok
 935 @c     get_sysdep_segment_value ok
 936 @c     memcpy dup ok
 937 @c     hash_string dup ok
 938 @c     free dup @ascuheap @acsmem
 939 @c     libc_rwlock_init ok
 940 @c     _nl_find_msg dup @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 941 @c     libc_rwlock_fini ok
 942 @c     EXTRACT_PLURAL_EXPRESSION @ascuheap @acsmem
 943 @c      strstr dup ok
 944 @c      isspace ok
 945 @c      strtoul ok
 946 @c      PLURAL_PARSE @ascuheap @acsmem
 947 @c       malloc dup @ascuheap @acsmem
 948 @c       free dup @ascuheap @acsmem
 949 @c      INIT_GERMANIC_PLURAL ok, nothing
 950 @c        the pre-C99 variant is @acucorrupt [protected from @mtuinit by dcigettext]
 951 @c    _nl_expand_alias dup @ascuheap @asulock @acsmem @acsfd @aculock
 952 @c    _nl_explode_name dup @ascuheap @acsmem
 953 @c    libc_rwlock_wrlock dup @asulock @aculock
 954 @c    free dup @asulock @aculock @acsfd @acsmem
 955 @c   _nl_find_msg @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 956 @c    _nl_load_domain dup @mtsenv @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsfd @acsmem
 957 @c    strlen ok
 958 @c    hash_string ok
 959 @c    W ok
 960 @c     SWAP ok
 961 @c      bswap_32 ok
 962 @c    strcmp ok
 963 @c    get_output_charset @mtsenv @ascuheap @acsmem
 964 @c     getenv dup @mtsenv
 965 @c     strlen dup ok
 966 @c     malloc dup @ascuheap @acsmem
 967 @c     memcpy dup ok
 968 @c    libc_rwlock_rdlock dup @asulock @aculock
 969 @c    libc_rwlock_unlock dup ok
 970 @c    libc_rwlock_wrlock dup @asulock @aculock
 971 @c    realloc @ascuheap @acsmem
 972 @c    strdup @ascuheap @acsmem
 973 @c    strstr ok
 974 @c    strcspn ok
 975 @c    mempcpy dup ok
 976 @c    norm_add_slashes dup ok
 977 @c    gconv_open @asucorrupt @ascuheap @asulock @ascudlopen @acucorrupt @aculock @acsmem @acsfd
 978 @c     [protected from @mtslocale by dcigettext locale lock]
 979 @c    free dup @ascuheap @acsmem
 980 @c    libc_lock_lock @asulock @aculock
 981 @c    calloc @ascuheap @acsmem
 982 @c    gconv dup @acucorrupt [protected from @mtsrace and @asucorrupt by lock]
 983 @c    libc_lock_unlock ok
 984 @c   malloc @ascuheap @acsmem
 985 @c   mempcpy ok
 986 @c   memcpy ok
 987 @c   strcpy ok
 988 @c   libc_rwlock_wrlock @asulock @aculock
 989 @c   tsearch @ascuheap @acucorrupt @acsmem [protected from @mtsrace and @asucorrupt]
 990 @c    transcmp ok
 991 @c     strmp dup ok
 992 @c   free @ascuheap @acsmem
The @code{dcgettext} function adds another argument to those which
@code{dgettext} takes.  This argument @var{category} specifies the last
piece of information needed to locate the message catalog.  I.e., the
domain name and the locale category exactly specify which message
catalog has to be used (relative to a given directory, see below).
 998
 999 The @code{dgettext} function can be expressed in terms of
1000 @code{dcgettext} by using
1001
1002 @smallexample
1003 dcgettext (domain, string, LC_MESSAGES)
1004 @end smallexample
1005
1006 @noindent
1007 instead of
1008
1009 @smallexample
1010 dgettext (domain, string)
1011 @end smallexample
1012
1013 This also shows which values are expected for the third parameter.  One
1014 has to use the available selectors for the categories available in
1015 @file{locale.h}.  Normally the available values are @code{LC_CTYPE},
1016 @code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
1017 @code{LC_NUMERIC}, and @code{LC_TIME}.  Please note that @code{LC_ALL}
1018 must not be used and even though the names might suggest this, there is
1019 no relation to the environment variable of this name.
1020
The @code{dcgettext} function is only implemented for compatibility with
other systems which have @code{gettext} functions.  There is not really
any situation where it is necessary (or useful) to use a different value
than @code{LC_MESSAGES} for the @var{category} parameter.  We are
dealing with messages here and any other choice can only be confusing.
1026
1027 As for @code{gettext} the return value type is @code{char *} which is an
1028 anachronism.  The returned string must never be modified.
1029 @end deftypefun
1030
When using the three functions above in a program, the @var{msgid}
argument is frequently a constant string, so it is worthwhile to
optimize this case.  As long as no new message catalog is loaded the
translation of a message will not change, so the result of an earlier
lookup can be reused.  This optimization is actually implemented by the
@code{gettext}, @code{dgettext} and @code{dcgettext} functions.
1037
1038
1039 @node Locating gettext catalog
1040 @subsubsection How to determine which catalog to be used
1041
The functions to retrieve the translations for a given message have a
remarkably simple interface.  But to give the user of the program
the opportunity to select exactly the translation s/he wants, and
to give the programmer the possibility to influence how the
catalog files are located, there is a quite complicated
underlying mechanism which controls all this.  The code is complicated,
but its use is easy.
1049
1050 Basically we have two different tasks to perform which can also be
1051 performed by the @code{catgets} functions:
1052
1053 @enumerate
1054 @item
1055 Locate the set of message catalogs.  There are a number of files for
1056 different languages which all belong to the package.  Usually they
1057 are all stored in the filesystem below a certain directory.
1058
1059 There can be arbitrarily many packages installed and they can follow
1060 different guidelines for the placement of their files.
1061
1062 @item
1063 Relative to the location specified by the package the actual translation
1064 files must be searched, based on the wishes of the user.  I.e., for each
1065 language the user selects the program should be able to locate the
1066 appropriate file.
1067 @end enumerate
1068
This is the functionality required by the specifications for
@code{gettext} and this is also what the @code{catgets} functions are
able to do.  But some problems remain unresolved:
1072
1073 @itemize @bullet
1074 @item
1075 The language to be used can be specified in several different ways.
1076 There is no generally accepted standard for this and the user always
1077 expects the program to understand what s/he means.  E.g., to select the
1078 German translation one could write @code{de}, @code{german}, or
1079 @code{deutsch} and the program should always react the same.
1080
1081 @item
Sometimes the specification of the user is too detailed.  If s/he, e.g.,
specifies @code{de_DE.ISO-8859-1}, which means German, spoken in Germany,
coded using the @w{ISO 8859-1} character set, there is the possibility
that a message catalog matching this exactly is not available.  But
there could be a catalog matching @code{de} and if the character set
used on the machine is always @w{ISO 8859-1} there is no reason why this
latter message catalog should not be used.  (We call this @dfn{message
inheritance}.)
1090
1091 @item
1092 If a catalog for a wanted language is not available it is not always the
1093 second best choice to fall back on the language of the developer and
1094 simply not translate any message.  Instead a user might be better able
1095 to read the messages in another language and so the user of the program
1096 should be able to define a precedence order of languages.
1097 @end itemize
1098
We can divide the configuration actions into two parts: one is
performed by the programmer, the other by the user.  We will start with
the functions the programmer can use since the user configuration will
be based on this.
1103
As already mentioned in the descriptions of the functions in the last
section, separate sets of messages can be selected by a @dfn{domain
name}.  This is a simple string which should be unique for each program
part that uses a separate domain.  It is possible to use arbitrarily
many domains in one program at the same time.  E.g., @theglibc{} itself
uses a domain named @code{libc} while the program using the C Library
could use a domain named @code{foo}.  The important point is that at any
time exactly one domain is active.  This is controlled with the
following function.
1113
1114 @deftypefun {char *} textdomain (const char *@var{domainname})
1115 @standards{GNU, libintl.h}
1116 @safety{@prelim{}@mtsafe{}@asunsafe{@asulock{} @ascuheap{}}@acunsafe{@aculock{} @acsmem{}}}
1117 @c textdomain @asulock @ascuheap @aculock @acsmem
1118 @c  libc_rwlock_wrlock @asulock @aculock
1119 @c  strcmp ok
1120 @c  strdup @ascuheap @acsmem
1121 @c  free @ascuheap @acsmem
1122 @c  libc_rwlock_unlock ok
1123 The @code{textdomain} function sets the default domain, which is used in
1124 all future @code{gettext} calls, to @var{domainname}.  Please note that
1125 @code{dgettext} and @code{dcgettext} calls are not influenced if the
1126 @var{domainname} parameter of these functions is not the null pointer.
1127
Before the first call to @code{textdomain} the default domain is
@code{messages}.  This is the name given in the specification of
the @code{gettext} API.  This name is as good as any other name.  No
program should ever really use a domain with this name since this can
only lead to problems.

The function returns the value which is from now on taken as the default
domain.  If the system ran out of memory the returned value is
@code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}.
Despite the return value type being @code{char *} the returned string
must not be changed.  It is allocated internally by the
@code{textdomain} function.
1140
1141 If the @var{domainname} parameter is the null pointer no new default
1142 domain is set.  Instead the currently selected default domain is
1143 returned.
1144
If the @var{domainname} parameter is the empty string the default domain
is reset to its initial value, the domain with the name @code{messages}.
Using this possibility is questionable since the domain @code{messages}
really should never be used.
1149 @end deftypefun
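
The different uses of the @var{domainname} argument can be sketched as
follows.  The helper @code{select_domain} is a hypothetical name
introduced only for this illustration.

@smallexample
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: make DOMAIN the default domain.  Returns 0
   on success and -1 if textdomain failed (errno is then ENOMEM).  */
int
select_domain (const char *domain)
@{
  if (textdomain (domain) == NULL)
    @{
      perror ("textdomain");
      return -1;
    @}
  return 0;
@}

/* Return the name of the current default domain without changing
   it.  The returned string must not be modified.  */
const char *
current_domain (void)
@{
  return textdomain (NULL);
@}
@end smallexample

Before any call to @code{select_domain}, @code{current_domain} returns
@code{messages}; calling @code{select_domain ("")} restores that state.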
1150
1151 @deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname})
1152 @standards{GNU, libintl.h}
1153 @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}}
1154 @c bindtextdomain @ascuheap @acsmem
1155 @c  set_binding_values @ascuheap @acsmem
1156 @c   libc_rwlock_wrlock dup @asulock @aculock
1157 @c   strcmp dup ok
1158 @c   strdup dup @ascuheap @acsmem
1159 @c   free dup @ascuheap @acsmem
1160 @c   malloc dup @ascuheap @acsmem
The @code{bindtextdomain} function can be used to specify the directory
which contains the message catalogs for domain @var{domainname} for the
different languages.  More precisely, this is the directory where the
hierarchy of directories is expected.  Details are explained below.

For the programmer it is important to note that the translations which
come with the program have to be placed in a directory hierarchy starting
at, say, @file{/foo/bar}.  Then the program should make a
@code{bindtextdomain} call to bind the domain for the current program to
this directory.  This ensures that the catalogs are found.  A correctly
running program does not depend on the user setting an environment
variable.
1173
1174 The @code{bindtextdomain} function can be used several times and if the
1175 @var{domainname} argument is different the previously bound domains
1176 will not be overwritten.
1177
If a program that uses @code{bindtextdomain} also uses the
@code{chdir} function at some point to change the current working
directory, it is important that the @var{dirname} string is an
absolute pathname.  Otherwise the addressed directory might vary over
time.
1183
1184 If the @var{dirname} parameter is the null pointer @code{bindtextdomain}
1185 returns the currently selected directory for the domain with the name
1186 @var{domainname}.
1187
The @code{bindtextdomain} function returns a pointer to a string
containing the name of the selected directory.  The string is
allocated internally in the function and must not be changed by the
user.  If the system ran out of memory during the execution of
@code{bindtextdomain} the return value is @code{NULL} and the global
variable @var{errno} is set accordingly.
1194 @end deftypefun
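
Taken together, a program typically performs the binding once at
startup.  The following is a minimal sketch under some assumptions: the
domain name @code{foo} and the directory @file{/foo/bar} are the
placeholder names used in the text, @code{init_messages} is a
hypothetical helper, and the @code{setlocale} call (from
@file{locale.h}) is included because a program otherwise starts in the
@code{C} locale and would not honor the user's language selection.

@smallexample
#include <libintl.h>
#include <locale.h>

#define PACKAGE   "foo"        /* hypothetical domain name */
#define LOCALEDIR "/foo/bar"   /* hypothetical catalog root */

/* Hypothetical helper: select the user's locale and bind the
   program's domain to an absolute directory.  Returns 0 on
   success, -1 if one of the calls ran out of memory.  */
int
init_messages (void)
@{
  setlocale (LC_ALL, "");
  if (bindtextdomain (PACKAGE, LOCALEDIR) == NULL)
    return -1;
  if (textdomain (PACKAGE) == NULL)
    return -1;
  return 0;
@}
@end smallexample

Because @code{LOCALEDIR} is an absolute pathname, a later @code{chdir}
call does not affect where the catalogs are searched.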
1195
1196
1197 @node Advanced gettext functions
1198 @subsubsection Additional functions for more complicated situations
1199
1200 The functions of the @code{gettext} family described so far (and all the
1201 @code{catgets} functions as well) have one problem in the real world
1202 which has been neglected completely in all existing approaches.  What
1203 is meant here is the handling of plural forms.
1204
1205 Looking through Unix source code before the time anybody thought about
1206 internationalization (and, sadly, even afterwards) one can often find
1207 code similar to the following:
1208
1209 @smallexample
1210    printf ("%d file%s deleted", n, n == 1 ? "" : "s");
1211 @end smallexample
1212
1213 @noindent
After the first complaints from people internationalizing the code,
programmers either completely avoided formulations like this or used
strings like @code{"file(s)"}.  Both look unnatural and should be
avoided.  The first attempts to solve the problem correctly looked like
this:
1218
1219 @smallexample
1220    if (n == 1)
1221      printf ("%d file deleted", n);
1222    else
1223      printf ("%d files deleted", n);
1224 @end smallexample
1225
But this does not solve the problem.  It helps languages where the
plural form of a noun is not simply constructed by adding an `s', but
that is all.  Once again people fell into the trap of believing the
rules their language uses are universal.  But the handling of plural
forms differs widely between the language families.  Two things can
differ between language families (and even within them):
1232
1233 @itemize @bullet
1234 @item
How plural forms are built differs.  This is a problem with
languages which have many irregularities.  German, for instance, is a
drastic case.  Though English and German are part of the same language
family (Germanic), the almost regular forming of plural noun forms
(appending an `s') is hardly found in German.
1240
1241 @item
The number of plural forms differs.  This is somewhat surprising for
those who only have experience with Romance and Germanic languages
since here the number is the same (there are two).

But other language families have only one form or many forms.  More
information on this follows in a separate section.
1248 @end itemize
1249
1250 The consequence of this is that application writers should not try to
1251 solve the problem in their code.  This would be localization since it is
1252 only usable for certain, hardcoded language environments.  Instead the
1253 extended @code{gettext} interface should be used.
1254
Instead of the one key string, these extra functions take two
strings and a numerical argument.  The idea behind this is that, using
the numerical argument and the first string as a key, the implementation
can select the right plural form using rules specified by the
translator.  The two string arguments are then used to provide a return
value in case no message catalog is found (similar to the normal
@code{gettext} behavior).  In this case the rules for Germanic languages
are used and it is assumed that the first string argument is the singular
form, the second the plural form.
1264
This has the consequence that programs without language catalogs can
display the correct strings only if the program itself is written using
a Germanic language.  This is a limitation, but since @theglibc{}
(as well as the GNU @code{gettext} package) is written as part of the
GNU project and the coding standards for the GNU project require programs
to be written in English, this solution nevertheless fulfills its
purpose.
1272
1273 @deftypefun {char *} ngettext (const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
1274 @standards{GNU, libintl.h}
1275 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
1276 @c Wrapper for dcngettext.
The @code{ngettext} function is similar to the @code{gettext} function
as it finds the message catalogs in the same way.  But it takes two
extra arguments.  The @var{msgid1} parameter must contain the singular
form of the string to be translated.  It is also used as the key for the
search in the catalog.  The @var{msgid2} parameter is the plural form.
The parameter @var{n} is used to determine the plural form.  If no
message catalog is found @var{msgid1} is returned if @code{n == 1},
otherwise @var{msgid2}.
1285
1286 An example for the use of this function is:
1287
1288 @smallexample
1289   printf (ngettext ("%d file removed", "%d files removed", n), n);
1290 @end smallexample
1291
1292 Please note that the numeric value @var{n} has to be passed to the
1293 @code{printf} function as well.  It is not sufficient to pass it only to
1294 @code{ngettext}.
1295 @end deftypefun
1296
1297 @deftypefun {char *} dngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
1298 @standards{GNU, libintl.h}
1299 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
1300 @c Wrapper for dcngettext.
The @code{dngettext} function is similar to the @code{dgettext} function
in the way the message catalog is selected.  The difference is that it
takes two extra parameters to provide the correct plural form.  These two
parameters are handled in the same way @code{ngettext} handles them.
1305 @end deftypefun
1306
1307 @deftypefun {char *} dcngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n}, int @var{category})
1308 @standards{GNU, libintl.h}
1309 @safety{@prelim{}@mtsafe{@mtsenv{}}@asunsafe{@asucorrupt{} @ascuheap{} @asulock{} @ascudlopen{}}@acunsafe{@acucorrupt{} @aculock{} @acsfd{} @acsmem{}}}
1310 @c Wrapper for dcigettext.
The @code{dcngettext} function is similar to the @code{dcgettext}
function in the way the message catalog is selected.  The difference is
that it takes two extra parameters to provide the correct plural form.
These two parameters are handled in the same way @code{ngettext} handles
them.
1315 @end deftypefun
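
The fallback behavior without a message catalog can be observed with a
small sketch.  The helper @code{format_removed} and the domain name
@code{foo} are hypothetical.  Note that the sketch uses the
@code{%lu} conversion so that the format matches the
@code{unsigned long int} value which is passed both to
@code{dngettext} and to @code{snprintf}.

@smallexample
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: write a message about N removed files into
   BUF (of SIZE bytes).  Returns the snprintf result.  */
int
format_removed (char *buf, size_t size, unsigned long int n)
@{
  /* N selects the plural form and is also the value printed.  */
  return snprintf (buf, size,
                   dngettext ("foo", "%lu file removed",
                              "%lu files removed", n),
                   n);
@}
@end smallexample

If no catalog for the domain is found, the Germanic rule applies: the
singular string is used only for @code{n == 1}, so a count of zero
yields the plural string.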
1316
1317 @subsubheading The problem of plural forms
1318
A description of the problem can be found at the beginning of the last
section.  Now there is the question how to solve it.  Without the input
of linguists (which was not available) it was not possible to determine
whether there are only a few different ways in which plural forms are
built or whether the number can increase with every new supported
language.
1325
1326 Therefore the solution implemented is to allow the translator to specify
1327 the rules of how to select the plural form.  Since the formula varies
1328 with every language this is the only viable solution except for
1329 hardcoding the information in the code (which still would require the
1330 possibility of extensions to not prevent the use of new languages).  The
1331 details are explained in the GNU @code{gettext} manual.  Here only a
1332 bit of information is provided.
1333
1334 The information about the plural form selection has to be stored in the
1335 header entry (the one with the empty @code{msgid} string).  It looks
1336 like this:
1337
1338 @smallexample
1339 Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1;
1340 @end smallexample
1341
1342 The @code{nplurals} value must be a decimal number which specifies how
1343 many different plural forms exist for this language.  The string
1344 following @code{plural} is an expression using the C language
1345 syntax.  Exceptions are that no negative numbers are allowed, numbers
1346 must be decimal, and the only variable allowed is @code{n}.  This
1347 expression will be evaluated whenever one of the functions
1348 @code{ngettext}, @code{dngettext}, or @code{dcngettext} is called.  The
1349 numeric value passed to these functions is then substituted for all uses
of the variable @code{n} in the expression.  The resulting value then
must be greater than or equal to zero and smaller than the value given as the
value of @code{nplurals}.
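
As an illustration, the following minimal sketch shows a typical call
from program code.  The helper name @code{format_removed} and the message
texts are made up for this example; only @code{ngettext} is the real
interface.

@smallexample
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: format a count-dependent message into BUF.
   ngettext evaluates the plural expression of the catalog with N
   and returns the matching form.  If no catalog is found it returns
   the first string for N == 1 and the second string otherwise.  */
int
format_removed (char *buf, size_t size, unsigned long int n)
@{
  const char *fmt = ngettext ("%lu file removed",
                              "%lu files removed", n);
  return snprintf (buf, size, fmt, n);
@}
@end smallexample

Note that the count is passed twice: once to @code{ngettext} to select
the form and once to @code{snprintf} to fill in the number.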
1353
1354 @noindent
The following rules are known at this point.  The languages are listed
with their families.  But this does not necessarily mean the information can be
1357 generalized for the whole family (as can be easily seen in the table
1358 below).@footnote{Additions are welcome.  Send appropriate information to
1359 @email{bug-glibc-manual@@gnu.org}.}
1360
1361 @table @asis
1362 @item Only one form:
1363 Some languages only require one single form.  There is no distinction
1364 between the singular and plural form.  An appropriate header entry
1365 would look like this:
1366
1367 @smallexample
1368 Plural-Forms: nplurals=1; plural=0;
1369 @end smallexample
1370
1371 @noindent
1372 Languages with this property include:
1373
1374 @table @asis
1375 @item Finno-Ugric family
1376 Hungarian
1377 @item Asian family
1378 Japanese, Korean
1379 @item Turkic/Altaic family
1380 Turkish
1381 @end table
1382
1383 @item Two forms, singular used for one only
1384 This is the form used in most existing programs since it is what English
1385 uses.  A header entry would look like this:
1386
1387 @smallexample
1388 Plural-Forms: nplurals=2; plural=n != 1;
1389 @end smallexample
1390
(Note: this uses the feature of C expressions that boolean expressions
have the value zero or one.)
1393
1394 @noindent
1395 Languages with this property include:
1396
1397 @table @asis
1398 @item Germanic family
1399 Danish, Dutch, English, German, Norwegian, Swedish
1400 @item Finno-Ugric family
1401 Estonian, Finnish
1402 @item Latin/Greek family
1403 Greek
1404 @item Semitic family
1405 Hebrew
1406 @item Romance family
1407 Italian, Portuguese, Spanish
1408 @item Artificial
1409 Esperanto
1410 @end table
1411
1412 @item Two forms, singular used for zero and one
1413 Exceptional case in the language family.  The header entry would be:
1414
1415 @smallexample
1416 Plural-Forms: nplurals=2; plural=n>1;
1417 @end smallexample
1418
1419 @noindent
1420 Languages with this property include:
1421
1422 @table @asis
@item Romance family
1424 French, Brazilian Portuguese
1425 @end table
1426
1427 @item Three forms, special case for zero
1428 The header entry would be:
1429
1430 @smallexample
1431 Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n != 0 ? 1 : 2;
1432 @end smallexample
1433
1434 @noindent
1435 Languages with this property include:
1436
1437 @table @asis
1438 @item Baltic family
1439 Latvian
1440 @end table
1441
1442 @item Three forms, special cases for one and two
1443 The header entry would be:
1444
1445 @smallexample
1446 Plural-Forms: nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2;
1447 @end smallexample
1448
1449 @noindent
1450 Languages with this property include:
1451
1452 @table @asis
1453 @item Celtic
1454 Gaeilge (Irish)
1455 @end table
1456
1457 @item Three forms, special case for numbers ending in 1[2-9]
1458 The header entry would look like this:
1459
1460 @smallexample
1461 Plural-Forms: nplurals=3; \
1462     plural=n%10==1 && n%100!=11 ? 0 : \
1463            n%10>=2 && (n%100<10 || n%100>=20) ? 1 : 2;
1464 @end smallexample
1465
1466 @noindent
1467 Languages with this property include:
1468
1469 @table @asis
1470 @item Baltic family
1471 Lithuanian
1472 @end table
1473
1474 @item Three forms, special cases for numbers ending in 1 and 2, 3, 4, except those ending in 1[1-4]
1475 The header entry would look like this:
1476
1477 @smallexample
1478 Plural-Forms: nplurals=3; \
1479     plural=n%100/10==1 ? 2 : n%10==1 ? 0 : (n+9)%10>3 ? 2 : 1;
1480 @end smallexample
1481
1482 @noindent
1483 Languages with this property include:
1484
1485 @table @asis
1486 @item Slavic family
1487 Croatian, Czech, Russian, Ukrainian
1488 @end table
1489
1490 @item Three forms, special cases for 1 and 2, 3, 4
1491 The header entry would look like this:
1492
1493 @smallexample
1494 Plural-Forms: nplurals=3; \
1495     plural=(n==1) ? 1 : (n>=2 && n<=4) ? 2 : 0;
1496 @end smallexample
1497
1498 @noindent
1499 Languages with this property include:
1500
1501 @table @asis
1502 @item Slavic family
1503 Slovak
1504 @end table
1505
1506 @item Three forms, special case for one and some numbers ending in 2, 3, or 4
1507 The header entry would look like this:
1508
1509 @smallexample
1510 Plural-Forms: nplurals=3; \
1511     plural=n==1 ? 0 : \
1512            n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
1513 @end smallexample
1514
1515 @noindent
1516 Languages with this property include:
1517
1518 @table @asis
1519 @item Slavic family
1520 Polish
1521 @end table
1522
1523 @item Four forms, special case for one and all numbers ending in 02, 03, or 04
1524 The header entry would look like this:
1525
1526 @smallexample
1527 Plural-Forms: nplurals=4; \
1528     plural=n%100==1 ? 0 : n%100==2 ? 1 : n%100==3 || n%100==4 ? 2 : 3;
1529 @end smallexample
1530
1531 @noindent
1532 Languages with this property include:
1533
1534 @table @asis
1535 @item Slavic family
1536 Slovenian
1537 @end table
1538 @end table
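
Since the expressions use C syntax, one way to check a rule before
putting it into a header entry is to copy it into a small C function and
try a few values.  The following sketch does this for the Slavic-family
expression from the table above; the function name @code{plural_slavic}
is hypothetical.

@smallexample
/* The expression listed for Croatian, Czech, Russian and
   Ukrainian, copied unchanged.  Returns the index of the
   plural form to use for N.  */
unsigned long int
plural_slavic (unsigned long int n)
@{
  return n%100/10==1 ? 2 : n%10==1 ? 0 : (n+9)%10>3 ? 2 : 1;
@}
@end smallexample

With this function the values 1, 21 and 101 select form 0; the values
2, 3, 4 and 22 select form 1; and the values 0, 5, 11, 12 and 112 select
form 2.  All results are smaller than the @code{nplurals} value of 3.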
1539
1540
1541 @node Charset conversion in gettext
1542 @subsubsection How to specify the output character set @code{gettext} uses
1543
1544 @code{gettext} not only looks up a translation in a message catalog, it
1545 also converts the translation on the fly to the desired output character
1546 set.  This is useful if the user is working in a different character set
1547 than the translator who created the message catalog, because it avoids
1548 distributing variants of message catalogs which differ only in the
1549 character set.
1550
1551 The output character set is, by default, the value of @code{nl_langinfo
1552 (CODESET)}, which depends on the @code{LC_CTYPE} part of the current
1553 locale.  But programs which store strings in a locale independent way
1554 (e.g. UTF-8) can request that @code{gettext} and related functions
1555 return the translations in that encoding, by use of the
1556 @code{bind_textdomain_codeset} function.
1557
1558 Note that the @var{msgid} argument to @code{gettext} is not subject to
1559 character set conversion.  Also, when @code{gettext} does not find a
1560 translation for @var{msgid}, it returns @var{msgid} unchanged --
1561 independently of the current output character set.  It is therefore
1562 recommended that all @var{msgid}s be US-ASCII strings.
1563
1564 @deftypefun {char *} bind_textdomain_codeset (const char *@var{domainname}, const char *@var{codeset})
1565 @standards{GNU, libintl.h}
1566 @safety{@prelim{}@mtsafe{}@asunsafe{@ascuheap{}}@acunsafe{@acsmem{}}}
1567 @c bind_textdomain_codeset @ascuheap @acsmem
1568 @c  set_binding_values dup @ascuheap @acsmem
1569 The @code{bind_textdomain_codeset} function can be used to specify the
1570 output character set for message catalogs for domain @var{domainname}.
1571 The @var{codeset} argument must be a valid codeset name which can be used
1572 for the @code{iconv_open} function, or a null pointer.
1573
1574 If the @var{codeset} parameter is the null pointer,
1575 @code{bind_textdomain_codeset} returns the currently selected codeset
1576 for the domain with the name @var{domainname}.  It returns @code{NULL} if
1577 no codeset has yet been selected.
1578
1579 The @code{bind_textdomain_codeset} function can be used several times.
1580 If used multiple times with the same @var{domainname} argument, the
1581 later call overrides the settings made by the earlier one.
1582
1583 The @code{bind_textdomain_codeset} function returns a pointer to a
1584 string containing the name of the selected codeset.  The string is
1585 allocated internally in the function and must not be changed by the
user.  If the system ran out of memory during the execution of
@code{bind_textdomain_codeset}, the return value is @code{NULL} and the
global variable @code{errno} is set accordingly.
1589 @end deftypefun
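
A minimal sketch of such a request might look as follows.  The wrapper
name @code{use_utf8_messages} is hypothetical, and @code{"UTF-8"} is
assumed to be a codeset name that @code{iconv_open} accepts on the
system.

@smallexample
#include <libintl.h>
#include <stddef.h>

/* Hypothetical wrapper: ask that translations for DOMAINNAME be
   returned in UTF-8.  Returns 0 on success and -1 if the call
   failed, in which case errno has been set.  */
int
use_utf8_messages (const char *domainname)
@{
  return bind_textdomain_codeset (domainname, "UTF-8") != NULL ? 0 : -1;
@}
@end smallexample

A later call with a null pointer as the @var{codeset} argument can be
used to query the codeset that was selected this way.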
1590
1591
1592 @node GUI program problems
1593 @subsubsection How to use @code{gettext} in GUI programs
1594
One place where the @code{gettext} functions, if used normally, have big
problems is within programs with graphical user interfaces (GUIs).  The
problem is that many of the strings which have to be translated are very
short.  They have to appear in pull-down menus, which restricts the
length.  But strings which do not contain entire sentences, or at
least large fragments of a sentence, may appear in more than one
situation in the program and might need a different translation in each.
This is especially true for the one-word strings which are frequently
used in GUI programs.
1604
1605 As a consequence many people say that the @code{gettext} approach is
1606 wrong and instead @code{catgets} should be used which indeed does not
have this problem.  But there is a very simple and powerful method to
handle this kind of problem with the @code{gettext} functions.
1609
1610 @noindent
1611 As an example consider the following fictional situation.  A GUI program
1612 has a menu bar with the following entries:
1613
1614 @smallexample
1615 +------------+------------+--------------------------------------+
1616 | File       | Printer    |                                      |
1617 +------------+------------+--------------------------------------+
1618 | Open     | | Select   |
1619 | New      | | Open     |
1620 +----------+ | Connect  |
1621              +----------+
1622 @end smallexample
1623
1624 To have the strings @code{File}, @code{Printer}, @code{Open},
1625 @code{New}, @code{Select}, and @code{Connect} translated there has to be
1626 at some point in the code a call to a function of the @code{gettext}
1627 family.  But in two places the string passed into the function would be
1628 @code{Open}.  The translations might not be the same and therefore we
1629 are in the dilemma described above.
1630
1631 One solution to this problem is to artificially extend the strings
1632 to make them unambiguous.  But what would the program do if no
1633 translation is available?  The extended string is not what should be
1634 printed.  So we should use a slightly modified version of the functions.
1635
1636 To extend the strings a uniform method should be used.  E.g., in the
1637 example above, the strings could be chosen as
1638
1639 @smallexample
1640 Menu|File
1641 Menu|Printer
1642 Menu|File|Open
1643 Menu|File|New
1644 Menu|Printer|Select
1645 Menu|Printer|Open
1646 Menu|Printer|Connect
1647 @end smallexample
1648
1649 Now all the strings are different and if now instead of @code{gettext}
1650 the following little wrapper function is used, everything works just
1651 fine:
1652
1653 @cindex sgettext
1654 @smallexample
1655   char *
1656   sgettext (const char *msgid)
1657   @{
1658     char *msgval = gettext (msgid);
1659     if (msgval == msgid)
1660       msgval = strrchr (msgid, '|') + 1;
1661     return msgval;
1662   @}
1663 @end smallexample
1664
1665 What this little function does is to recognize the case when no
1666 translation is available.  This can be done very efficiently by a
1667 pointer comparison since the return value is the input value.  If there
1668 is no translation we know that the input string is in the format we used
1669 for the Menu entries and therefore contains a @code{|} character.  We
1670 simply search for the last occurrence of this character and return a
1671 pointer to the character following it.  That's it!
1672
1673 If one now consistently uses the extended string form and replaces
1674 the @code{gettext} calls with calls to @code{sgettext} (this is normally
1675 limited to very few places in the GUI implementation) then it is
1676 possible to produce a program which can be internationalized.
1677
With advanced compilers (such as GNU C) one can write the
@code{sgettext} function as an inline function or as a macro like this:
1680
1681 @cindex sgettext
1682 @smallexample
#define sgettext(msgid) \
  (@{ const char *__msgid = (msgid);            \
     char *__msgval = gettext (__msgid);       \
     if (__msgval == __msgid)                  \
       __msgval = strrchr (__msgid, '|') + 1;  \
     __msgval; @})
1689 @end smallexample
1690
1691 The other @code{gettext} functions (@code{dgettext}, @code{dcgettext}
1692 and the @code{ngettext} equivalents) can and should have corresponding
1693 functions as well which look almost identical, except for the parameters
1694 and the call to the underlying function.
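
As a sketch of what such a corresponding function could look like, here
is a @code{dgettext} variant.  The name @code{sdgettext} is hypothetical
and not provided by the library.  In contrast to the wrapper shown
earlier, this version also tolerates keys that contain no @code{|}
character, and it returns a pointer to @code{const char} because the
result may point into @var{msgid}.

@smallexample
#include <libintl.h>
#include <string.h>

/* Hypothetical dgettext counterpart of sgettext.  */
const char *
sdgettext (const char *domainname, const char *msgid)
@{
  const char *msgval = dgettext (domainname, msgid);
  if (msgval == msgid)
    @{
      /* No translation found: strip the context prefix, if any.  */
      const char *sep = strrchr (msgid, '|');
      if (sep != NULL)
        msgval = sep + 1;
    @}
  return msgval;
@}
@end smallexample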
1695
Of course this raises the question of why such functions do not exist in
@theglibc{}.  The answer to this question has two parts.
1698
1699 @itemize @bullet
1700 @item
1701 They are easy to write and therefore can be provided by the project they
1702 are used in.  This is not an answer by itself and must be seen together
1703 with the second part which is:
1704
1705 @item
1706 There is no way the C library can contain a version which can work
1707 everywhere.  The problem is the selection of the character to separate
1708 the prefix from the actual string in the extended string.  The
1709 examples above used @code{|} which is a quite good choice because it
1710 resembles a notation frequently used in this context and it also is a
1711 character not often used in message strings.
1712
But what if the character is used in message strings?  Or what if the chosen
character is not available in the character set on the machine one
compiles on (e.g., @code{|} is not required to exist for @w{ISO C}; this is
why the @file{iso646.h} file exists in @w{ISO C} programming environments)?
1717 @end itemize
1718
There is only one more comment to make.  The wrapper function above
requires that the translation strings are not extended themselves.
1721 This is only logical.  There is no need to disambiguate the strings
1722 (since they are never used as keys for a search) and one also saves
1723 quite some memory and disk space by doing this.
1724
1725
1726 @node Using gettextized software
1727 @subsubsection User influence on @code{gettext}
1728
The last sections described what the programmer can do to
internationalize the messages of the program.  But it is ultimately up to
the user to select the language of the messages s/he wants to see, since
s/he must be able to understand them.
1733
1734 The POSIX locale model uses the environment variables @code{LC_COLLATE},
1735 @code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, @code{LC_NUMERIC},
1736 and @code{LC_TIME} to select the locale which is to be used.  This way
1737 the user can influence lots of functions.  As we mentioned above, the
1738 @code{gettext} functions also take advantage of this.
1739
1740 To understand how this happens it is necessary to take a look at the
1741 various components of the filename which gets computed to locate a
1742 message catalog.  It is composed as follows:
1743
1744 @smallexample
1745 @var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo
1746 @end smallexample
1747
1748 The default value for @var{dir_name} is system specific.  It is computed
1749 from the value given as the prefix while configuring the C library.
1750 This value normally is @file{/usr} or @file{/}.  For the former the
1751 complete @var{dir_name} is:
1752
1753 @smallexample
1754 /usr/share/locale
1755 @end smallexample
1756
1757 We can use @file{/usr/share} since the @file{.mo} files containing the
1758 message catalogs are system independent, so all systems can use the same
1759 files.  If the program executed the @code{bindtextdomain} function for
the message domain that is currently handled, the @var{dir_name}
component is exactly the value which was given to the function as
the second parameter.  I.e., @code{bindtextdomain} allows overriding
the only system-dependent and fixed value to make it possible to
address files anywhere in the filesystem.
1765
1766 The @var{category} is the name of the locale category which was selected
1767 in the program code.  For @code{gettext} and @code{dgettext} this is
1768 always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the
value of the third parameter.  As said above, using a category other
than @code{LC_MESSAGES} should be avoided.
1771
The @var{locale} component is computed based on the category used.  Just
as for the @code{setlocale} function, this is where the user's selection
comes into play.  Some environment variables are examined in a fixed order
and the first environment variable set determines the result of
the lookup process.  In detail, for the category @code{LC_xxx} the
following variables are examined in this order:
1778
1779 @table @code
1780 @item LANGUAGE
1781 @item LC_ALL
1782 @item LC_xxx
1783 @item LANG
1784 @end table
1785
1786 This looks very familiar.  With the exception of the @code{LANGUAGE}
1787 environment variable this is exactly the lookup order the
1788 @code{setlocale} function uses.  But why introduce the @code{LANGUAGE}
1789 variable?
1790
The reason is that the syntax of the values the @code{LANGUAGE} variable
can have differs from what the @code{setlocale} function expects.  If we
set @code{LC_ALL} to a value following the extended syntax, the
@code{setlocale} function would no longer be able to use the
value of this variable.  An additional variable removes this
problem, and it also lets us select the language independently of the
locale setting, which is sometimes useful.
1798
1799 While for the @code{LC_xxx} variables the value should consist of
1800 exactly one specification of a locale the @code{LANGUAGE} variable's
1801 value can consist of a colon separated list of locale names.  The
1802 attentive reader will realize that this is the way we manage to
1803 implement one of our additional demands above: we want to be able to
1804 specify an ordered list of languages.
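
The lookup order described above can be sketched as follows.  This is
only an illustration and not the code used by the library: the function
name @code{first_locale_setting} is hypothetical, and treating an empty
value like an unset variable is an assumption of this sketch.  The
library's actual lookup may apply further rules that are ignored here.

@smallexample
#include <stdlib.h>

/* Hypothetical helper: return the value of the first variable, in
   the documented order, that is set to a non-empty string, or NULL
   if there is none.  CATEGORY_VAR names the category variable,
   e.g. "LC_MESSAGES".  */
const char *
first_locale_setting (const char *category_var)
@{
  const char *names[] = @{ "LANGUAGE", "LC_ALL", category_var, "LANG" @};
  for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i)
    @{
      const char *value = getenv (names[i]);
      if (value != NULL && value[0] != '\0')
        return value;
    @}
  return NULL;
@}
@end smallexample

A value obtained from @code{LANGUAGE} may still be a colon-separated
list and would have to be split by the caller.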
1805
1806 Back to the constructed filename we have only one component missing.
1807 The @var{domain_name} part is the name which was either registered using
1808 the @code{textdomain} function or which was given to @code{dgettext} or
1809 @code{dcgettext} as the first parameter.  Now it becomes obvious that a
1810 good choice for the domain name in the program code is a string which is
1811 closely related to the program/package name.  E.g., for @theglibc{}
1812 the domain name is @code{libc}.
1813
1814 @noindent
1815 A limited piece of example code should show how the program is supposed
1816 to work:
1817
1818 @smallexample
1819 @{
1820   setlocale (LC_ALL, "");
1821   textdomain ("test-package");
1822   bindtextdomain ("test-package", "/usr/local/share/locale");
1823   puts (gettext ("Hello, world!"));
1824 @}
1825 @end smallexample
1826
1827 At the program start the default domain is @code{messages}, and the
1828 default locale is "C".  The @code{setlocale} call sets the locale
1829 according to the user's environment variables; remember that correct
1830 functioning of @code{gettext} relies on the correct setting of the
1831 @code{LC_MESSAGES} locale (for looking up the message catalog) and
1832 of the @code{LC_CTYPE} locale (for the character set conversion).
1833 The @code{textdomain} call changes the default domain to
1834 @code{test-package}.  The @code{bindtextdomain} call specifies that
1835 the message catalogs for the domain @code{test-package} can be found
1836 below the directory @file{/usr/local/share/locale}.
1837
1838 If the user sets in her/his environment the variable @code{LANGUAGE}
1839 to @code{de} the @code{gettext} function will try to use the
1840 translations from the file
1841
1842 @smallexample
1843 /usr/local/share/locale/de/LC_MESSAGES/test-package.mo
1844 @end smallexample
1845
1846 From the above descriptions it should be clear which component of this
1847 filename is determined by which source.
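
To make the mapping explicit, the following sketch assembles such a
filename from its four components.  The function name
@code{catalog_file_name} and its parameter names are hypothetical; the
library performs this step internally and does not export such a
function.

@smallexample
#include <stdio.h>

/* Hypothetical helper: build the catalog filename in BUF.
   CATEGORY is the full category name, e.g. "LC_MESSAGES".
   Returns the value returned by snprintf.  */
int
catalog_file_name (char *buf, size_t size, const char *dir_name,
                   const char *locale, const char *category,
                   const char *domain_name)
@{
  return snprintf (buf, size, "%s/%s/%s/%s.mo",
                   dir_name, locale, category, domain_name);
@}
@end smallexample

Called with @file{/usr/local/share/locale}, @code{de},
@code{LC_MESSAGES}, and @code{test-package}, it produces the filename
shown above.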
1848
1849 In the above example we assumed the @code{LANGUAGE} environment
1850 variable to be @code{de}.  This might be an appropriate selection but what
1851 happens if the user wants to use @code{LC_ALL} because of the wider
1852 usability and here the required value is @code{de_DE.ISO-8859-1}?  We
1853 already mentioned above that a situation like this is not infrequent.
1854 E.g., a person might prefer reading a dialect and if this is not
1855 available fall back on the standard language.
1856
The @code{gettext} functions know about situations like this and can
handle them gracefully.  The functions recognize the format of the value
of the environment variable.  They can split the value into different pieces
and by leaving out one or the other part they can construct new
values.  This happens of course in a predictable way.  To understand
1862 this one must know the format of the environment variable value.  There
1863 is one more or less standardized form, originally from the X/Open
1864 specification:
1865
1866 @code{language[_territory[.codeset]][@@modifier]}
1867
1868 Less specific locale names will be stripped in the order of the
1869 following list:
1870
1871 @enumerate
1872 @item
1873 @code{codeset}
1874 @item
1875 @code{normalized codeset}
1876 @item
1877 @code{territory}
1878 @item
1879 @code{modifier}
1880 @end enumerate
1881
1882 The @code{language} field will never be dropped for obvious reasons.
1883
1884 The only new thing is the @code{normalized codeset} entry.  This is
1885 another goodie which is introduced to help reduce the chaos which
1886 derives from the inability of people to standardize the names of
1887 character sets.  Instead of @w{ISO-8859-1} one can often see @w{8859-1},
1888 @w{88591}, @w{iso8859-1}, or @w{iso_8859-1}.  The @code{normalized
1889 codeset} value is generated from the user-provided character set name by
1890 applying the following rules:
1891
1892 @enumerate
1893 @item
1894 Remove all characters besides numbers and letters.
1895 @item
1896 Fold letters to lowercase.
1897 @item
If the result contains only digits, prepend the string @code{"iso"}.
1899 @end enumerate
1900
1901 @noindent
1902 So all of the above names will be normalized to @code{iso88591}.  This
1903 allows the program user much more freedom in choosing the locale name.
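
The three rules can be sketched in C as follows.  This is an
illustration only, not the code used by the library; the function name
@code{normalize_codeset} and its interface are hypothetical, and the
character classification assumes the @code{"C"} locale.

@smallexample
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write the normalized form of NAME into OUT,
   which has room for SIZE bytes.  Returns 0 on success and -1 if
   OUT is too small.  */
int
normalize_codeset (const char *name, char *out, size_t size)
@{
  size_t len = 0;
  int only_digits = 1;

  if (size == 0)
    return -1;

  /* Rules 1 and 2: keep only letters and digits, fold to lowercase.  */
  for (const char *p = name; *p != '\0'; ++p)
    @{
      unsigned char c = (unsigned char) *p;
      if (!isalnum (c))
        continue;
      if (!isdigit (c))
        only_digits = 0;
      if (len + 1 >= size)
        return -1;
      out[len++] = (char) tolower (c);
    @}
  out[len] = '\0';

  /* Rule 3: prepend "iso" if only digits are left.  */
  if (only_digits)
    @{
      if (len + 3 + 1 > size)
        return -1;
      memmove (out + 3, out, len + 1);
      memcpy (out, "iso", 3);
    @}
  return 0;
@}
@end smallexample

Applied to the names listed above, the function yields @code{iso88591}
in every case.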
1904
1905 Even this extended functionality still does not help to solve the
1906 problem that completely different names can be used to denote the same
1907 locale (e.g., @code{de} and @code{german}).  To be of help in this
1908 situation the locale implementation and also the @code{gettext}
1909 functions know about aliases.
1910
1911 The file @file{/usr/share/locale/locale.alias} (replace @file{/usr} with
1912 whatever prefix you used for configuring the C library) contains a
1913 mapping of alternative names to more regular names.  The system manager
1914 is free to add new entries to fill her/his own needs.  The selected
1915 locale from the environment is compared with the entries in the first
1916 column of this file ignoring the case.  If they match, the value of the
1917 second column is used instead for the further handling.
1918
1919 In the description of the format of the environment variables we already
1920 mentioned the character set as a factor in the selection of the message
1921 catalog.  In fact, only catalogs which contain text written using the
1922 character set of the system/program can be used (directly; there will
1923 come a solution for this some day).  This means for the user that s/he
1924 will always have to take care of this.  If in the collection of the
1925 message catalogs there are files for the same language but coded using
1926 different character sets the user has to be careful.
1927
1928
1929 @node Helper programs for gettext
1930 @subsection Programs to handle message catalogs for @code{gettext}
1931
1932 @Theglibc{} does not contain the source code for the programs to
1933 handle message catalogs for the @code{gettext} functions.  As part of
1934 the GNU project the GNU gettext package contains everything the
1935 developer needs.  The functionality provided by the tools in this
1936 package by far exceeds the abilities of the @code{gencat} program
1937 described above for the @code{catgets} functions.
1938
1939 There is a program @code{msgfmt} which is the equivalent program to the
1940 @code{gencat} program.  It generates from the human-readable and
1941 -editable form of the message catalog a binary file which can be used by
1942 the @code{gettext} functions.  But there are several more programs
1943 available.
1944
1945 The @code{xgettext} program can be used to automatically extract the
1946 translatable messages from a source file.  I.e., the programmer need not
1947 take care of the translations and the list of messages which have to be
translated.  S/He will simply wrap the translatable strings in calls to
@code{gettext} et al.@: and the rest will be done by @code{xgettext}.  This
1950 program has a lot of options which help to customize the output or
1951 help to understand the input better.
1952
1953 Other programs help to manage the development cycle when new messages appear
1954 in the source files or when a new translation of the messages appears.
1955 Here it should only be noted that using all the tools in GNU gettext it
1956 is possible to @emph{completely} automate the handling of message
1957 catalogs.  Besides marking the translatable strings in the source code and
1958 generating the translations the developers do not have anything to do
1959 themselves.