manual/message.texi

   1 @node Message Translation, Searching and Sorting, Locales, Top
   2 @c %MENU% How to make the program speak the user's language
   3 @chapter Message Translation
   4
   5 The program's interface with the human should be designed to make the
   6 human's task easier.  One of the possibilities is to use messages in
   7 whatever language the user prefers.
   8
   9 Printing messages in different languages can be implemented in different
  10 ways.  One could add all the different languages in the source code and
  11 choose among the variants every time a message has to be printed.  This is
  12 certainly not a good solution since extending the set of languages is
  13 difficult (the code must be changed) and the code itself can become
  14 really big with dozens of message sets.
  15
  16 A better solution is to keep the message sets for each language
  17 in separate files which are loaded at runtime depending on the language
  18 selection of the user.
  19
  20 @Theglibc{} provides two different sets of functions to support
  21 message translation.  The problem is that neither of the interfaces is
  22 officially defined by the POSIX standard.  The @code{catgets} family of
  23 functions is defined in the X/Open standard but this is derived from
  24 industry decisions and therefore not necessarily based on reasonable
  25 decisions.
  26
  27 As mentioned above the message catalog handling provides easy
  28 extendibility by using external data files which contain the message
  29 translations.  I.e., these files contain for each of the messages used
  30 in the program a translation for the appropriate language.  So the tasks
  31 of the message handling functions are
  32
  33 @itemize @bullet
  34 @item
  35 locate the external data file with the appropriate translations
  36 @item
  37 load the data and make it possible to address the messages
  38 @item
  39 map a given key to the translated message
  40 @end itemize
  41
  42 The two approaches mainly differ in the implementation of this last
  43 step.  The design decisions made for this step influence everything else.
  44
  45 @menu
  46 * Message catalogs a la X/Open::  The @code{catgets} family of functions.
  47 * The Uniforum approach::         The @code{gettext} family of functions.
  48 @end menu
  49
  50
  51 @node Message catalogs a la X/Open
  52 @section X/Open Message Catalog Handling
  53
  54 The @code{catgets} functions are based on the simple scheme:
  55
  56 @quotation
  57 Associate every message to translate in the source code with a unique
  58 identifier.  To retrieve a message from a catalog file solely the
  59 identifier is used.
  60 @end quotation
  61
  62 This means for the author of the program that s/he will have to make
  63 sure the meaning of the identifier in the program code and in the
  64 message catalogs is always the same.
  65
  66 Before a message can be translated the catalog file must be located.
  67 The user of the program must be able to guide the responsible function
  68 to find whatever catalog the user wants.  This is independent of what
  69 the programmer had in mind.
  70
  71 All the types, constants and functions for the @code{catgets} functions
  72 are defined/declared in the @file{nl_types.h} header file.
  73
  74 @menu
  75 * The catgets Functions::      The @code{catgets} function family.
  76 * The message catalog files::  Format of the message catalog files.
  77 * The gencat program::         How to generate message catalog files which
  78                                 can be used by the functions.
  79 * Common Usage::               How to use the @code{catgets} interface.
  80 @end menu
  81
  82
  83 @node The catgets Functions
  84 @subsection The @code{catgets} function family
  85
  86 @comment nl_types.h
  87 @comment X/Open
  88 @deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
  89 The @code{catopen} function tries to locate the message data file named
  90 @var{cat_name} and loads it when found.  The return value is of an
  91 opaque type and can be used in calls to the other functions to refer to
  92 this loaded catalog.
  93
  94 The return value is @code{(nl_catd) -1} in case the function failed and
  95 no catalog was loaded.  The global variable @var{errno} contains a code
  96 for the error causing the failure.  But even if the function call
  97 succeeded this does not mean that all messages can be translated.
  98
  99 Locating the catalog file must happen in a way which lets the user of
 100 the program influence the decision.  It is up to the user to decide
 101 about the language to use and sometimes it is useful to use alternate
 102 catalog files.  All this can be specified by the user by setting some
 103 environment variables.
 104
 105 The first problem is to find out where all the message catalogs are
 106 stored.  Every program could have its own place to keep all the
 107 different files but usually the catalog files are grouped by languages
 108 and the catalogs for all programs are kept in the same place.
 109
 110 @cindex NLSPATH environment variable
 111 To tell the @code{catopen} function where the catalog for the program
 112 can be found the user can set the environment variable @code{NLSPATH} to
 113 a value which describes her/his choice.  Since this value must be usable
 114 for different languages and locales it cannot be a simple string.
 115 Instead it is a format string (similar to @code{printf}'s).  An example
 116 is
 117
 118 @smallexample
 119 /usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
 120 @end smallexample
 121
 122 First one can see that more than one directory can be specified (with
 123 the usual syntax of separating them by colons).  The next things to
 124 observe are the format elements, @code{%L} and @code{%N} in this case.
 125 The @code{catopen} function knows about several of them and the
 126 replacement for all of them is of course different.
 127
 128 @table @code
 129 @item %N
 130 This format element is substituted with the name of the catalog file.
 131 This is the value of the @var{cat_name} argument given to
 132 @code{catopen}.
 133
 134 @item %L
 135 This format element is substituted with the name of the currently
 136 selected locale for translating messages.  How this is determined is
 137 explained below.
 138
 139 @item %l
 140 (This is the lowercase ell.) This format element is substituted with the
 141 language element of the locale name.  The string describing the selected
 142 locale is expected to have the form
 143 @code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
 144 first part @var{lang}.
 145
 146 @item %t
 147 This format element is substituted by the territory part @var{terr} of
 148 the name of the currently selected locale.  See the explanation of the
 149 format above.
 150
 151 @item %c
 152 This format element is substituted by the codeset part @var{codeset} of
 153 the name of the currently selected locale.  See the explanation of the
 154 format above.
 155
 156 @item %%
 157 Since @code{%} is used as a meta character there must be a way to
 158 express the @code{%} character in the result itself.  Using @code{%%}
 159 does this just like it works for @code{printf}.
 160 @end table
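The substitution rules in the table above can be sketched in a few lines of C.  This is only an illustration and not the code @code{catopen} actually uses: the function name @code{expand_nls_template} and its parameters are hypothetical, it handles a single path element (splitting @code{NLSPATH} at the colons is left out), and it assumes the locale name has the form @code{@var{lang}[_@var{terr}[.@var{codeset}]]}.

```c
#include <string.h>

/* Append N bytes of SRC to OUT at *POS; fail if the result would not fit.  */
static int
append (char *out, size_t outsize, size_t *pos, const char *src, size_t n)
{
  if (*pos + n >= outsize)
    return -1;
  memcpy (out + *pos, src, n);
  *pos += n;
  out[*pos] = '\0';
  return 0;
}

/* Hypothetical helper, not part of any library: expand one NLSPATH
   element TMPL for the locale name LOCALE and the catalog name NAME.
   Returns 0 on success and -1 if OUT is too small.  */
int
expand_nls_template (const char *tmpl, const char *locale, const char *name,
                     char *out, size_t outsize)
{
  /* Split LOCALE of the form lang[_terr[.codeset]] into its parts.  */
  size_t lang_len = strcspn (locale, "_.");
  const char *terr = "";
  const char *codeset = "";
  size_t terr_len = 0;
  size_t pos = 0;

  if (locale[lang_len] == '_')
    {
      terr = locale + lang_len + 1;
      terr_len = strcspn (terr, ".");
      if (terr[terr_len] == '.')
        codeset = terr + terr_len + 1;
    }

  if (outsize == 0)
    return -1;
  out[0] = '\0';

  for (const char *p = tmpl; *p != '\0'; ++p)
    {
      const char *src = p;      /* By default copy the character itself.  */
      size_t n = 1;

      if (*p == '%' && p[1] != '\0')
        {
          ++p;
          switch (*p)
            {
            case 'N': src = name; n = strlen (name); break;
            case 'L': src = locale; n = strlen (locale); break;
            case 'l': src = locale; n = lang_len; break;
            case 't': src = terr; n = terr_len; break;
            case 'c': src = codeset; n = strlen (codeset); break;
            case '%': src = "%"; n = 1; break;
            default: src = p - 1; n = 2; break;  /* Unknown: copy as is.  */
            }
        }
      if (append (out, outsize, &pos, src, n) != 0)
        return -1;
    }
  return 0;
}
```

Called with the template @code{/usr/share/locale/%L/LC_MESSAGES/%N}, the locale name @code{de_DE.UTF-8} and the catalog name @code{hello.cat}, the sketch produces @code{/usr/share/locale/de_DE.UTF-8/LC_MESSAGES/hello.cat}.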
 161
 162
 163 Using @code{NLSPATH} allows arbitrary directories to be searched for
 164 message catalogs while still allowing different languages to be used.
 165 If the @code{NLSPATH} environment variable is not set, the default value
 166 is
 167
 168 @smallexample
 169 @var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
 170 @end smallexample
 171
 172 @noindent
 173 where @var{prefix} is given to @code{configure} while installing @theglibc{}
 174 (this value is in many cases @code{/usr} or the empty string).
 175
 176 The remaining problem is to decide which locale name must be used.  Its
 177 value determines the substitution of the format elements mentioned above.
 178 First of all the user can specify a path in the message catalog name
 179 (i.e., the name contains a slash character).  In this situation the
 180 @code{NLSPATH} environment variable is not used.  The catalog must exist
 181 as specified in the program, perhaps relative to the current working
 182 directory.  This situation is not desirable and catalog names should
 183 never be written this way.  Besides this, this behavior is not portable
 184 to all other platforms providing the @code{catgets} interface.
 185
 186 @cindex LC_ALL environment variable
 187 @cindex LC_MESSAGES environment variable
 188 @cindex LANG environment variable
 189 Otherwise the values of environment variables from the standard
 190 environment are examined (@pxref{Standard Environment}).  Which
 191 variables are examined is decided by the @var{flag} parameter of
 192 @code{catopen}.  If the value is @code{NL_CAT_LOCALE} (which is defined
 193 in @file{nl_types.h}) then the @code{catopen} function uses the name of
 194 the locale currently selected for the @code{LC_MESSAGES} category.
 195
 196 If @var{flag} is zero the @code{LANG} environment variable is examined.
 197 This is a left-over from the early days when the concept of the locales
 198 had not even reached the level of POSIX locales.
 199
 200 The environment variable and the locale name should have a value of the
 201 form @code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above.
 202 If no environment variable is set the @code{"C"} locale is used which
 203 prevents any translation.
 204
 205 The return value of @code{catgets} is in any case a valid string.  Either
 206 it is a translation from a message catalog or it is the same as the
 207 @var{string} parameter.  So a piece of code to decide whether a
 208 translation actually happened must look like this:
 209
 210 @smallexample
 211 @{
 212   char *trans = catgets (desc, set, msg, input_string);
 213   if (trans == input_string)
 214     @{
 215       /* Something went wrong.  */
 216     @}
 217 @}
 218 @end smallexample
 219
 220 @noindent
 221 When an error occurred the global variable @var{errno} is set to
 222
 223 @table @var
 224 @item EBADF
 225 The catalog does not exist.
 226 @item ENOMSG
 227 The set/message tuple does not name an existing element in the
 228 message catalog.
 229 @end table
 230
 231 While it sometimes can be useful to test for errors programs normally
 232 will avoid any test.  If the translation is not available it is no big
 233 problem if the original, untranslated message is printed.  Either the
 234 user understands this as well or s/he will look for the reason why the
 235 messages are not translated.
 236 @end deftypefun
 237
 238 Please note that the currently selected locale does not depend on a call
 239 to the @code{setlocale} function.  It is not necessary that the locale
 240 data files for this locale exist and calling @code{setlocale} succeeds.
 241 The @code{catopen} function directly reads the values of the environment
 242 variables.
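
A minimal sketch of opening a catalog and reporting a failure might look as follows.  The wrapper name @code{open_catalog} is hypothetical; only @code{catopen}, @code{NL_CAT_LOCALE} and the @code{(nl_catd) -1} failure value come from the description above.

```c
#include <errno.h>
#include <nl_types.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper: open CAT_NAME and report why it failed, if it
   did.  The caller still has to compare the result with (nl_catd) -1.  */
nl_catd
open_catalog (const char *cat_name)
{
  nl_catd catd = catopen (cat_name, NL_CAT_LOCALE);

  if (catd == (nl_catd) -1)
    fprintf (stderr, "cannot open catalog %s: %s\n",
             cat_name, strerror (errno));
  return catd;
}
```

A program that is content with untranslated messages can ignore the failure and continue; the diagnostic is mainly useful while setting up the catalog directories.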
 243
 244
 245 @deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
 246 The function @code{catgets} has to be used to access the message catalog
 247 previously opened using the @code{catopen} function.  The
 248 @var{catalog_desc} parameter must be a value previously returned by
 249 @code{catopen}.
 250
 251 The next two parameters, @var{set} and @var{message}, reflect the
 252 internal organization of the message catalog files.  This will be
 253 explained in detail below.  For now it is interesting to know that a
 254 catalog can consist of several sets and the messages in each set are
 255 individually numbered.  Neither the set number nor the
 256 message number need be consecutive.  They can be arbitrarily chosen.
 257 But each message (unless equal to another one) must have its own unique
 258 pair of set and message number.
 259
 260 Since it is not guaranteed that the message catalog for the language
 261 selected by the user exists the last parameter @var{string} helps to
 262 handle this case gracefully.  If no matching string can be found
 263 @var{string} is returned.  This means for the programmer that
 264
 265 @itemize @bullet
 266 @item
 267 the @var{string} parameters should contain reasonable text (this also
 268 helps to understand the program since otherwise there would be no hint
 269 about the string which is expected to be returned).
 270 @item
 271 all @var{string} arguments should be written in the same language.
 272 @end itemize
 273 @end deftypefun
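The fallback behavior can be wrapped in a small helper.  The following is a sketch under assumptions: the name @code{translate} and the @var{translated} output parameter are hypothetical, and the explicit test for @code{(nl_catd) -1} is a defensive choice for implementations that may not accept an invalid descriptor.

```c
#include <nl_types.h>
#include <stddef.h>

/* Hypothetical helper: return the translation for SET/MSG or FALLBACK.
   If TRANSLATED is not NULL, it is set to 1 when the catalog supplied
   the text and to 0 when FALLBACK is returned.  */
const char *
translate (nl_catd catd, int set, int msg, const char *fallback,
           int *translated)
{
  const char *text = fallback;

  /* Do not hand an invalid descriptor to catgets.  */
  if (catd != (nl_catd) -1)
    text = catgets (catd, set, msg, fallback);

  if (translated != NULL)
    *translated = (text != fallback);
  return text;
}
```

The pointer comparison works because, as described above, @code{catgets} returns the @var{string} argument itself when no matching message is found.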
 274
 275 It is somewhat uncomfortable to write a program using the @code{catgets}
 276 functions if no supporting functionality is available.  Since each
 277 set/message number tuple must be unique the programmer must keep lists
 278 of the messages at the same time the code is written.  And the work
 279 between several people working on the same project must be coordinated.
 280 We will see how these problems can be relaxed a bit (@pxref{Common
 281 Usage}).
 282
 283 @deftypefun int catclose (nl_catd @var{catalog_desc})
 284 The @code{catclose} function can be used to free the resources
 285 associated with a message catalog which previously was opened by a call
 286 to @code{catopen}.  If the resources can be successfully freed the
 287 function returns @code{0}.  Otherwise it returns @code{@minus{}1} and the
 288 global variable @var{errno} is set.  Errors can occur if the catalog
 289 descriptor @var{catalog_desc} is not valid in which case @var{errno} is
 290 set to @code{EBADF}.
 291 @end deftypefun
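Putting the three functions together, a complete open, look up and close cycle could be sketched as below.  The helper name @code{fetch_message} is hypothetical.  The text is copied into a caller-supplied buffer before @code{catclose} is called, on the assumption that a translated string may point into data owned by the catalog and therefore should not be used after the catalog is closed.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: look up one message and copy it into BUF.
   Returns 1 if a translation was found and 0 if FALLBACK was used.  */
int
fetch_message (const char *cat_name, int set, int msg, const char *fallback,
               char *buf, size_t size)
{
  int translated = 0;
  const char *text = fallback;
  nl_catd catd = catopen (cat_name, NL_CAT_LOCALE);

  if (catd != (nl_catd) -1)
    {
      text = catgets (catd, set, msg, fallback);
      translated = (text != fallback);
    }

  /* Copy before closing; the result is truncated if BUF is too small.  */
  snprintf (buf, size, "%s", text);

  if (catd != (nl_catd) -1 && catclose (catd) != 0)
    perror ("catclose");
  return translated;
}
```

Opening and closing the catalog for every message is wasteful; a real program would normally open it once at startup, as the later examples in this chapter do.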
 292
 293
 294 @node The message catalog files
 295 @subsection  Format of the message catalog files
 296
 297 The only reasonable way to translate all the messages of a program and
 298 store the result in a message catalog file which can be read by the
 299 @code{catopen} function is to give all the message text to the
 300 translator and let her/him translate them all.  I.e., we must have a
 301 file with entries which associate the set/message tuple with a specific
 302 translation.  This file format is specified in the X/Open standard and
 303 is as follows:
 304
 305 @itemize @bullet
 306 @item
 307 Lines containing only whitespace characters or empty lines are ignored.
 308
 309 @item
 310 Lines which contain as the first non-whitespace character a @code{$}
 311 followed by a whitespace character are comments and are also ignored.
 312
 313 @item
 314 If a line contains as the first non-whitespace characters the sequence
 315 @code{$set} followed by a whitespace character an additional argument
 316 is required to follow.  This argument can either be:
 317
 318 @itemize @minus
 319 @item
 320 a number.  In this case the value of this number determines the set
 321 to which the following messages are added.
 322
 323 @item
 324 an identifier consisting of alphanumeric characters plus the underscore
 325 character.  In this case the set is automatically assigned a number.
 326 This value is one added to the largest set number which so far appeared.
 327
 328 How to use the symbolic names is explained in section @ref{Common Usage}.
 329
 330 It is an error if a symbol name appears more than once.  All following
 331 messages are placed in a set with this number.
 332 @end itemize
 333
 334 @item
 335 If a line contains as the first non-whitespace characters the sequence
 336 @code{$delset} followed by a whitespace character an additional argument
 337 is required to follow.  This argument can either be:
 338
 339 @itemize @minus
 340 @item
 341 a number.  In this case the value of this number determines the set
 342 which will be deleted.
 343
 344 @item
 345 an identifier consisting of alphanumeric characters plus the underscore
 346 character.  This symbolic identifier must match a name for a set which
 347 previously was defined.  It is an error if the name is unknown.
 348 @end itemize
 349
 350 In both cases all messages in the specified set will be removed.  They
 351 will not appear in the output.  But if this set is later selected again
 352 with a @code{$set} command, messages can be added again and these
 353 messages will appear in the output.
 354
 355 @item
 356 If a line contains after leading whitespace the sequence
 357 @code{$quote}, the quoting character used for this input file is
 358 changed to the first non-whitespace character following the
 359 @code{$quote}.  If no non-whitespace character is present before the
 360 line ends quoting is disabled.
 361
 362 By default no quoting character is used.  In this mode strings are
 363 terminated with the first unescaped line break.  If there is a
 364 @code{$quote} sequence present newline need not be escaped.  Instead a
 365 string is terminated with the first unescaped appearance of the quote
 366 character.
 367
 368 A common usage of this feature would be to set the quote character to
 369 @code{"}.  Then any appearance of the @code{"} in the strings must
 370 be escaped using the backslash (i.e., @code{\"} must be written).
 371
 372 @item
 373 Any other line must start with a number or an alphanumeric identifier
 374 (with the underscore character included).  The following characters
 375 (starting after the first whitespace character) will form the string
 376 which gets associated with the currently selected set and the message
 377 number represented by the number or identifier, respectively.
 378
 379 If the start of the line is a number the message number is obvious.  It
 380 is an error if the same message number already appeared for this set.
 381
 382 If the leading token was an identifier the message number gets
 383 automatically assigned.  The value is the current maximum message
 384 number for this set plus one.  It is an error if the identifier was
 385 already used for a message in this set.  It is OK to reuse the
 386 identifier for a message in another set.  How to use the symbolic
 387 identifiers will be explained below (@pxref{Common Usage}).  There is
 388 one limitation with the identifier: it must not be @code{Set}.  The
 389 reason will be explained below.
 390
 391 The text of the messages can contain escape characters.  The usual bunch
 392 of characters known from the @w{ISO C} language are recognized
 393 (@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
 394 @code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
 395 a character code).
 396 @end itemize
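The line-level rules above can be condensed into a small classifier.  This is a hedged sketch and not how @code{gencat} is implemented: the names @code{line_kind}, @code{is_directive} and @code{classify_line} are hypothetical, the argument of a directive is not checked, message lines are not validated, and a @code{$} line that matches none of the rules is merely flagged as unknown.

```c
#include <ctype.h>
#include <string.h>

enum line_kind
{
  LINE_BLANK, LINE_COMMENT, LINE_SET, LINE_DELSET, LINE_QUOTE,
  LINE_MESSAGE, LINE_UNKNOWN
};

/* True if P starts with WORD followed by whitespace or the end of the
   string.  */
static int
is_directive (const char *p, const char *word)
{
  size_t n = strlen (word);

  return strncmp (p, word, n) == 0
         && (p[n] == '\0' || isspace ((unsigned char) p[n]));
}

/* Hypothetical helper: decide what kind of line LINE is.  */
enum line_kind
classify_line (const char *line)
{
  /* Leading whitespace never matters for the classification.  */
  while (isspace ((unsigned char) *line))
    ++line;

  if (*line == '\0')
    return LINE_BLANK;
  if (*line != '$')
    return LINE_MESSAGE;        /* Number or identifier, then the text.  */
  if (is_directive (line, "$"))
    return LINE_COMMENT;        /* "$" followed by whitespace.  */
  if (is_directive (line, "$set"))
    return LINE_SET;
  if (is_directive (line, "$delset"))
    return LINE_DELSET;
  if (is_directive (line, "$quote"))
    return LINE_QUOTE;
  return LINE_UNKNOWN;
}
```

For instance, @code{classify_line ("$set SetOne\n")} yields @code{LINE_SET}, while a line such as @code{two "text"} is reported as @code{LINE_MESSAGE}.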
 397
 398 @strong{Important:} The handling of identifiers instead of numbers for
 399 the set and messages is a GNU extension.  Systems strictly following the
 400 X/Open specification do not have this feature.  An example for a message
 401 catalog file is this:
 402
 403 @smallexample
 404 $ This is a leading comment.
 405 $quote "
 406
 407 $set SetOne
 408 1 Message with ID 1.
 409 two "   Message with ID \"two\", which gets the value 2 assigned"
 410
 411 $set SetTwo
 412 $ Since the last set got the number 1 assigned this set has number 2.
 413 4000 "The numbers can be arbitrary, they need not start at one."
 414 @end smallexample
 415
 416 This small example shows various aspects:
 417 @itemize @bullet
 418 @item
 419 Lines 1 and 9 are comments since they start with @code{$} followed by
 420 a whitespace.
 421 @item
 422 The quoting character is set to @code{"}.  Otherwise the quotes in the
 423 message definition would have to be left out and in this case the
 424 message with the identifier @code{two} would lose its leading whitespace.
 425 @item
 426 Mixing numbered messages with messages having symbolic names is no
 427 problem and the numbering happens automatically.
 428 @end itemize
 429
 430
 431 While this file format is pretty easy it is not the best possible for
 432 use in a running program.  The @code{catopen} function would have to
 433 parse the file and handle syntactic errors gracefully.  This is not so
 434 easy and the whole process is pretty slow.  Therefore the @code{catgets}
 435 functions expect the data in another more compact and ready-to-use file
 436 format.  There is a special program @code{gencat} which is explained in
 437 detail in the next section.
 438
 439 Files in this other format are not human readable.  To be easy to use by
 440 programs it is a binary file.  But the format is byte order independent
 441 so translation files can be shared by systems of arbitrary architecture
 442 (as long as they use @theglibc{}).
 443
 444 Details about the binary file format are not important to know since
 445 these files are always created by the @code{gencat} program.  The
 446 sources of @theglibc{} also provide the sources for the
 447 @code{gencat} program and so the interested reader can look through
 448 these source files to learn about the file format.
 449
 450
 451 @node The gencat program
 452 @subsection Generate Message Catalog files
 453
 454 @cindex gencat
 455 The @code{gencat} program is specified in the X/Open standard and the
 456 GNU implementation follows this specification and so processes
 457 all correctly formed input files.  Additionally some extensions are
 458 implemented which help to work in a more reasonable way with the
 459 @code{catgets} functions.
 460
 461 The @code{gencat} program can be invoked in two ways:
 462
 463 @example
 464 gencat [@var{Option}]@dots{} [@var{Output-File} [@var{Input-File}]@dots{}]
 465 @end example
 466
 467 This is the interface defined in the X/Open standard.  If no
 468 @var{Input-File} parameter is given input will be read from standard
 469 input.  Multiple input files will be read as if they are concatenated.
 470 If @var{Output-File} is also missing, the output will be written to
 471 standard output.  To provide the interface one is used to from other
 472 programs a second interface is provided.
 473
 474 @smallexample
 475 gencat [@var{Option}]@dots{} -o @var{Output-File} [@var{Input-File}]@dots{}
 476 @end smallexample
 477
 478 The option @samp{-o} is used to specify the output file and all file
 479 arguments are used as input files.
 480
 481 Beside this one can use @file{-} or @file{/dev/stdin} for
 482 @var{Input-File} to denote the standard input.  Correspondingly one can
 483 use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
 484 standard output.  Using @file{-} as a file name is allowed in X/Open
 485 while using the device names is a GNU extension.
 486
 487 The @code{gencat} program works by concatenating all input files and
 488 then @strong{merging} the resulting collection of message sets with a
 489 possibly existing output file.  This is done by removing all messages
 490 with set/message number tuples matching any of the generated messages
 491 from the output file and then adding all the new messages.  To
 492 regenerate a catalog file while ignoring the old contents therefore
 493 requires removing the output file if it exists.  If the output is
 494 written to standard output no merging takes place.
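
The merge rule just described can be modeled on in-memory tables.  The sketch below only illustrates the rule; it is not taken from the @code{gencat} sources, and the type @code{struct entry} and the function @code{merge_entries} are hypothetical.

```c
#include <stddef.h>

/* One message, identified by its set/message number tuple.  */
struct entry
{
  int set;
  int msg;
  const char *text;
};

/* Hypothetical illustration of the merge rule: drop every old entry
   whose set/message tuple also appears among the new ones, then add
   all new entries.  OUT must have room for N_OLD + N_FRESH entries.
   Returns the number of entries written to OUT.  */
size_t
merge_entries (const struct entry *old, size_t n_old,
               const struct entry *fresh, size_t n_fresh,
               struct entry *out)
{
  size_t n_out = 0;

  for (size_t i = 0; i < n_old; ++i)
    {
      int replaced = 0;

      for (size_t j = 0; j < n_fresh; ++j)
        if (old[i].set == fresh[j].set && old[i].msg == fresh[j].msg)
          {
            replaced = 1;
            break;
          }
      if (!replaced)
        out[n_out++] = old[i];
    }

  for (size_t j = 0; j < n_fresh; ++j)
    out[n_out++] = fresh[j];
  return n_out;
}
```

Old messages whose tuple does not occur in the new input survive the merge, which is why stale messages stay in a catalog until the output file is removed.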
 495
 496 @noindent
 497 The following table shows the options understood by the @code{gencat}
 498 program.  The X/Open standard does not specify any option for the
 499 program so all of these are GNU extensions.
 500
 501 @table @samp
 502 @item -V
 503 @itemx --version
 504 Print the version information and exit.
 505 @item -h
 506 @itemx --help
 507 Print a usage message listing all available options, then exit successfully.
 508 @item --new
 509 Never merge the new messages from the input files with the old content
 510 of the output file.  The old content of the output file is discarded.
 511 @item -H
 512 @itemx --header=name
 513 This option is used to emit the symbolic names given to sets and
 514 messages in the input files for use in the program.  Details about how
 515 to use this are given in the next section.  The @var{name} parameter to
 516 this option specifies the name of the output file.  It will contain a
 517 number of C preprocessor @code{#define}s to associate a name with a
 518 number.
 519
 520 Please note that the generated file only contains the symbols from the
 521 input files.  If the output is merged with the previous content of the
 522 output file the possibly existing symbols from the file(s) which
 523 generated the old output files are not in the generated header file.
 524 @end table
 525
 526
 527 @node Common Usage
 528 @subsection How to use the @code{catgets} interface
 529
 530 The @code{catgets} functions can be used in two different ways: by
 531 strictly following the X/Open specification and not relying on the
 532 extensions, or by using the GNU extensions.  We will take a look at the
 533 former method first to understand the benefits of extensions.
 534
 535 @subsubsection Not using symbolic names
 536
 537 Since the X/Open format of the message catalog files does not allow
 538 symbol names we have to work with numbers all the time.  When we start
 539 writing a program we have to replace all appearances of translatable
 540 strings with something like
 541
 542 @smallexample
 543 catgets (catdesc, set, msg, "string")
 544 @end smallexample
 545
 546 @noindent
 547 @var{catdesc} is retrieved from a call to @code{catopen} which is
 548 normally done once at the program start.  The @code{"string"} is the
 549 string we want to translate.  The problems start with the set and
 550 message numbers.
 551
 552 In a bigger program several programmers usually work at the same time on
 553 the program and so coordinating the number allocation is crucial.
 554 Though no two different strings may be indexed by the same tuple of
 555 numbers it is highly desirable to reuse the numbers for equal strings
 556 with equal translations (please note that there might be strings which
 557 are equal in one language but have different translations due to
 558 different contexts).
 559
 560 The allocation process can be relaxed a bit by using different set numbers
 561 for different parts of the program.  So the number of developers who have to
 562 coordinate the allocation can be reduced.  But lists must still be kept to
 563 track the allocation and errors can easily happen.  These errors
 564 cannot be discovered by the compiler or the @code{catgets} functions.
 565 Only the user of the program might see wrong messages printed.  In the
 566 worst cases the messages are so confusing that they cannot be
 567 recognized as wrong.  Think about the translations for @code{"true"} and
 568 @code{"false"} being exchanged.  This could result in a disaster.
 569
 570
 571 @subsubsection Using symbolic names
 572
 573 The problems mentioned in the last section derive from the fact that:
 574
 575 @enumerate
 576 @item
 577 the numbers are allocated once and due to the possibly frequent use of
 578 them it is difficult to change a number later.
 579 @item
 580 the numbers do not allow one to guess anything about the string and
 581 therefore collisions can easily happen.
 582 @end enumerate
 583
 584 By consistently using symbolic names and by providing a method which maps
 585 the string content to a symbolic name (however this will happen) one can
 586 prevent both problems above.  The cost of this is that the programmer
 587 has to write a complete message catalog file while s/he is writing the
 588 program itself.
 589
 590 This is necessary since the symbolic names must be mapped to numbers
 591 before the program sources can be compiled.  In the last section it was
 592 described how to generate a header containing the mapping of the names.
 593 E.g., for the example message file given in the last section we could
 594 call the @code{gencat} program as follows (assume @file{ex.msg} contains
 595 the sources).
 596
 597 @smallexample
 598 gencat -H ex.h -o ex.cat ex.msg
 599 @end smallexample
 600
 601 @noindent
 602 This generates a header file with the following content:
 603
 604 @smallexample
 605 #define SetTwoSet 0x2   /* ex.msg:8 */
 606
 607 #define SetOneSet 0x1   /* ex.msg:4 */
 608 #define SetOnetwo 0x2   /* ex.msg:6 */
 609 @end smallexample
 610
 611 As can be seen the various symbols given in the source file are mangled
 612 to generate unique identifiers and these identifiers get numbers
 613 assigned.  Reading the source file and knowing about the rules
 614 allows one to predict the content of the header file (it is deterministic)
 615 but this is not necessary.  The @code{gencat} program can take care of
 616 everything.  All the programmer has to do is to put the generated header
 617 file in the dependency list of the source files of her/his project and
 618 to add a rule to regenerate the header if any of the input files
 619 changes.
 620
 621 One word about the symbol mangling.  Every symbol consists of two parts:
 622 the name of the message set plus the name of the message or the special
 623 string @code{Set}.  So @code{SetOnetwo} means this macro can be used to
 624 access the translation with identifier @code{two} in the message set
 625 @code{SetOne}.
 626
 627 The other names denote the names of the message sets.  The special
 628 string @code{Set} is used in the place of the message identifier.
 629
 630 If in the code the second string of the set @code{SetOne} is used the C
 631 code should look like this:
 632
 633 @smallexample
 634 catgets (catdesc, SetOneSet, SetOnetwo,
 635          "   Message with ID \"two\", which gets the value 2 assigned")
 636 @end smallexample
 637
 638 Writing the function call this way allows changing the message number
 639 and even the set number without requiring any change in the C source
 640 code.  (The text of the string is normally not the same; this is only
 641 for this example.)
 642
 643
 644 @subsubsection How does this allow one to develop
 645
 646 To illustrate the usual way to work with the symbolic names
 647 here is a little example.  Assume we want to write the very complex and
 648 famous greeting program.  We start by writing the code as usual:
 649
 650 @smallexample
 651 #include <stdio.h>
 652 int
 653 main (void)
 654 @{
 655   printf ("Hello, world!\n");
 656   return 0;
 657 @}
 658 @end smallexample
 659
 660 Now we want to internationalize the message and therefore replace the
 661 message with whatever the user wants.
 662
 663 @smallexample
 664 #include <nl_types.h>
 665 #include <stdio.h>
 666 #include "msgnrs.h"
 667 int
 668 main (void)
 669 @{
 670   nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
 671   printf (catgets (catdesc, MainSet, MainHello,
 672                    "Hello, world!\n"));
 673   catclose (catdesc);
 674   return 0;
 675 @}
 676 @end smallexample
 677
 678 We see how the catalog object is opened and the returned descriptor used
 679 in the other function calls.  It is not really necessary to check for
 680 failure of any of the functions since even in these situations the
 681 functions will behave reasonably.  They will simply return the
 682 untranslated string.
 683
 684 What remains unspecified here are the constants @code{MainSet} and
 685 @code{MainHello}.  These are the symbolic names describing the
 686 message.  To get the actual definitions which match the information in
 687 the catalog file we have to create the message catalog source file and
 688 process it using the @code{gencat} program.
 689
 690 @smallexample
 691 $ Messages for the famous greeting program.
 692 $quote "
 693
 694 $set Main
 695 Hello "Hallo, Welt!\n"
 696 @end smallexample
 697
 698 Now we can start building the program (assume the message catalog source
 699 file is named @file{hello.msg} and the program source file @file{hello.c}):
 700
 701 @smallexample
 702 @cartouche
 703 % gencat -H msgnrs.h -o hello.cat hello.msg
 704 % cat msgnrs.h
 705 #define MainSet 0x1     /* hello.msg:4 */
 706 #define MainHello 0x1   /* hello.msg:5 */
 707 % gcc -o hello hello.c -I.
 708 % cp hello.cat /usr/share/locale/de/LC_MESSAGES
 709 % echo $LC_ALL
 710 de
 711 % ./hello
 712 Hallo, Welt!
 713 %
 714 @end cartouche
 715 @end smallexample
 716
 717 The call of the @code{gencat} program creates the missing header file
 718 @file{msgnrs.h} as well as the message catalog binary.  The former is
used in the compilation of @file{hello.c} while the latter is placed in a
 720 directory in which the @code{catopen} function will try to locate it.
 721 Please check the @code{LC_ALL} environment variable and the default path
 722 for @code{catopen} presented in the description above.
 723
 724
 725 @node The Uniforum approach
 726 @section The Uniforum approach to Message Translation
 727
 728 Sun Microsystems tried to standardize a different approach to message
 729 translation in the Uniforum group.  There never was a real standard
 730 defined but still the interface was used in Sun's operating systems.
 731 Since this approach fits better in the development process of free
 732 software it is also used throughout the GNU project and the GNU
 733 @file{gettext} package provides support for this outside @theglibc{}.
 734
 735 The code of the @file{libintl} from GNU @file{gettext} is the same as
 736 the code in @theglibc{}.  So the documentation in the GNU
 737 @file{gettext} manual is also valid for the functionality here.  The
 738 following text will describe the library functions in detail.  But the
 739 numerous helper programs are not described in this manual.  Instead
 740 people should read the GNU @file{gettext} manual
 741 (@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
 742 We will only give a short overview.
 743
 744 Though the @code{catgets} functions are available by default on more
 745 systems the @code{gettext} interface is at least as portable as the
 746 former.  The GNU @file{gettext} package can be used wherever the
 747 functions are not available.
 748
 749
 750 @menu
 751 * Message catalogs with gettext::  The @code{gettext} family of functions.
 752 * Helper programs for gettext::    Programs to handle message catalogs
 753                                     for @code{gettext}.
 754 @end menu
 755
 756
 757 @node Message catalogs with gettext
 758 @subsection The @code{gettext} family of functions
 759
The paradigm underlying the @code{gettext} approach to message
translation is different from that of the @code{catgets} functions, but
the basic functionality is equivalent.  There are functions of the
following categories:
 764
 765 @menu
 766 * Translation with gettext::       What has to be done to translate a message.
 767 * Locating gettext catalog::       How to determine which catalog to be used.
 768 * Advanced gettext functions::     Additional functions for more complicated
 769                                     situations.
 770 * Charset conversion in gettext::  How to specify the output character set
 771                                     @code{gettext} uses.
 772 * GUI program problems::           How to use @code{gettext} in GUI programs.
 773 * Using gettextized software::     The possibilities of the user to influence
 774                                     the way @code{gettext} works.
 775 @end menu
 776
 777 @node Translation with gettext
 778 @subsubsection What has to be done to translate a message?
 779
 780 The @code{gettext} functions have a very simple interface.  The most
 781 basic function just takes the string which shall be translated as the
 782 argument and it returns the translation.  This is fundamentally
 783 different from the @code{catgets} approach where an extra key is
 784 necessary and the original string is only used for the error case.
 785
If the string which has to be translated is the only argument this of
course means the string itself is the key.  I.e., the translation will
be selected based on the original string.  The message catalogs must
therefore contain the original strings plus one translation for any such
string.  The task of the @code{gettext} function is to compare the
argument string with the available strings in the catalog and return the
appropriate translation.  Of course this process is optimized so that
it is not more expensive than an access using an atomic key
like in @code{catgets}.
 795
 796 The @code{gettext} approach has some advantages but also some
 797 disadvantages.  Please see the GNU @file{gettext} manual for a detailed
 798 discussion of the pros and cons.
 799
 800 All the definitions and declarations for @code{gettext} can be found in
 801 the @file{libintl.h} header file.  On systems where these functions are
 802 not part of the C library they can be found in a separate library named
 803 @file{libintl.a} (or accordingly different for shared libraries).
 804
 805 @comment libintl.h
 806 @comment GNU
 807 @deftypefun {char *} gettext (const char *@var{msgid})
 808 The @code{gettext} function searches the currently selected message
 809 catalogs for a string which is equal to @var{msgid}.  If there is such a
 810 string available it is returned.  Otherwise the argument string
 811 @var{msgid} is returned.
 812
 813 Please note that all though the return value is @code{char *} the
 814 returned string must not be changed.  This broken type results from the
 815 history of the function and does not reflect the way the function should
 816 be used.
 817
 818 Please note that above we wrote ``message catalogs'' (plural).  This is
 819 a specialty of the GNU implementation of these functions and we will
 820 say more about this when we talk about the ways message catalogs are
 821 selected (@pxref{Locating gettext catalog}).
 822
 823 The @code{gettext} function does not modify the value of the global
 824 @var{errno} variable.  This is necessary to make it possible to write
 825 something like
 826
 827 @smallexample
 828   printf (gettext ("Operation failed: %m\n"));
 829 @end smallexample
 830
Here the @var{errno} value is used in the @code{printf} function while
processing the @code{%m} format element and if the @code{gettext}
function changed this value (it is called before @code{printf} is
called) we would get a wrong message.
 835
So there is no easy way to detect a missing message catalog besides
comparing the argument string with the result.  But it is normally the
task of the user to react to missing catalogs.  The program cannot guess
when a message catalog is really necessary since a user who speaks
the language the program was developed in does not need any translation.
 841 @end deftypefun
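
The comparison mentioned in the description of @code{gettext} can be
sketched as follows.  The helper name @code{has_translation} is
hypothetical.  The sketch assumes the behavior documented above, namely
that @code{gettext} returns the @var{msgid} argument itself when no
translation is found, so a pointer comparison is enough.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>

/* Hypothetical helper: return nonzero if the currently selected
   message catalogs provide a translation for MSGID.  */
int
has_translation (const char *msgid)
{
  assert (msgid != NULL);

  /* When nothing is found gettext returns MSGID itself, so the
     pointers compare equal.  The result is only compared here and
     never modified.  */
  return gettext (msgid) != msgid;
}
```

Note that a negative answer does not tell whether the whole catalog is
missing or only this one message; it merely shows that the original
string will be displayed.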
 842
 843 The remaining two functions to access the message catalog add some
 844 functionality to select a message catalog which is not the default one.
 845 This is important if parts of the program are developed independently.
 846 Every part can have its own message catalog and all of them can be used
 847 at the same time.  The C library itself is an example: internally it
 848 uses the @code{gettext} functions but since it must not depend on a
 849 currently selected default message catalog it must specify all ambiguous
 850 information.
 851
 852 @comment libintl.h
 853 @comment GNU
 854 @deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid})
The @code{dgettext} function acts just like the @code{gettext}
 856 function.  It only takes an additional first argument @var{domainname}
 857 which guides the selection of the message catalogs which are searched
 858 for the translation.  If the @var{domainname} parameter is the null
 859 pointer the @code{dgettext} function is exactly equivalent to
 860 @code{gettext} since the default value for the domain name is used.
 861
 862 As for @code{gettext} the return value type is @code{char *} which is an
 863 anachronism.  The returned string must never be modified.
 864 @end deftypefun
 865
 866 @comment libintl.h
 867 @comment GNU
 868 @deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category})
The @code{dcgettext} function adds another argument to those which
@code{dgettext} takes.  This argument @var{category} specifies the last
piece of information needed to locate the message catalog.  I.e., the
domain name and the locale category exactly specify which message
catalog has to be used (relative to a given directory, see below).
 874
 875 The @code{dgettext} function can be expressed in terms of
 876 @code{dcgettext} by using
 877
 878 @smallexample
 879 dcgettext (domain, string, LC_MESSAGES)
 880 @end smallexample
 881
 882 @noindent
 883 instead of
 884
 885 @smallexample
 886 dgettext (domain, string)
 887 @end smallexample
 888
 889 This also shows which values are expected for the third parameter.  One
 890 has to use the available selectors for the categories available in
 891 @file{locale.h}.  Normally the available values are @code{LC_CTYPE},
 892 @code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY},
@code{LC_NUMERIC}, and @code{LC_TIME}.  Please note that @code{LC_ALL}
must not be used and even though the names might suggest this, there is
no relation to the environment variables of this name.

The @code{dcgettext} function is only implemented for compatibility with
other systems which have @code{gettext} functions.  There is not really
any situation where it is necessary (or useful) to use a value other
than @code{LC_MESSAGES} for the @var{category} parameter.  We are
dealing with messages here and any other choice can only be irritating.
 902
 903 As for @code{gettext} the return value type is @code{char *} which is an
 904 anachronism.  The returned string must never be modified.
 905 @end deftypefun
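
A library that ships its own catalog would typically route every lookup
through @code{dgettext} so that it does not depend on the application's
default domain.  The following sketch illustrates this under stated
assumptions: the domain name @code{mylib-example} and both function
names are invented for the example and do not refer to a real package.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Domain of a hypothetical library; not a real package name.  */
#define MYLIB_DOMAIN "mylib-example"

/* Translate MSGID with the library's own catalog, independent of
   the default domain the application may have selected.  */
const char *
mylib_translate (const char *msgid)
{
  assert (msgid != NULL);
  return dgettext (MYLIB_DOMAIN, msgid);
}

/* The same lookup spelled out with dcgettext and LC_MESSAGES.  */
const char *
mylib_translate_explicit (const char *msgid)
{
  assert (msgid != NULL);
  return dcgettext (MYLIB_DOMAIN, msgid, LC_MESSAGES);
}
```

Both functions consult the same catalog; the second form only makes
the category explicit.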
 906
When using the three functions above in a program it is a frequent case
that the @var{msgid} argument is a constant string.  So it is worth
optimizing this case.  A moment's thought shows that
as long as no new message catalog is loaded the translation of a message
will not change.  This optimization is actually implemented by the
@code{gettext}, @code{dgettext} and @code{dcgettext} functions.
 913
 914
 915 @node Locating gettext catalog
 916 @subsubsection How to determine which catalog to be used
 917
The functions to retrieve the translations for a given message have a
remarkably simple interface.  But to still give the user of the program
the opportunity to select exactly the translation s/he wants, and to
give the programmer the possibility to influence the way the catalog
files are located, there is a quite complicated underlying mechanism
which controls all this.  The code is complicated, but its use is easy.
 925
 926 Basically we have two different tasks to perform which can also be
 927 performed by the @code{catgets} functions:
 928
 929 @enumerate
 930 @item
 931 Locate the set of message catalogs.  There are a number of files for
 932 different languages and which all belong to the package.  Usually they
 933 are all stored in the filesystem below a certain directory.
 934
 935 There can be arbitrary many packages installed and they can follow
 936 different guidelines for the placement of their files.
 937
 938 @item
 939 Relative to the location specified by the package the actual translation
 940 files must be searched, based on the wishes of the user.  I.e., for each
 941 language the user selects the program should be able to locate the
 942 appropriate file.
 943 @end enumerate
 944
 945 This is the functionality required by the specifications for
 946 @code{gettext} and this is also what the @code{catgets} functions are
 947 able to do.  But there are some problems unresolved:
 948
 949 @itemize @bullet
 950 @item
The language to be used can be specified in several different ways.
There is no generally accepted standard for this and the user always
expects the program to understand what s/he means.  E.g., to select the
German translation one could write @code{de}, @code{german}, or
@code{deutsch} and the program should always react the same.
 956
 957 @item
 958 Sometimes the specification of the user is too detailed.  If s/he, e.g.,
 959 specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany,
 960 coded using the @w{ISO 8859-1} character set there is the possibility
 961 that a message catalog matching this exactly is not available.  But
 962 there could be a catalog matching @code{de} and if the character set
used on the machine is always @w{ISO 8859-1} there is no reason why this
latter message catalog should not be used.  (We call this @dfn{message
 965 inheritance}.)
 966
 967 @item
 968 If a catalog for a wanted language is not available it is not always the
 969 second best choice to fall back on the language of the developer and
 970 simply not translate any message.  Instead a user might be better able
 971 to read the messages in another language and so the user of the program
should be able to define a precedence order of languages.
 973 @end itemize
 974
 975 We can divide the configuration actions in two parts: the one is
 976 performed by the programmer, the other by the user.  We will start with
 977 the functions the programmer can use since the user configuration will
 978 be based on this.
 979
As the descriptions of the functions in the last section already
mention, separate sets of messages can be selected by a @dfn{domain
name}.  This is a simple string which should be unique for each program
part which uses a separate domain.  It is possible to use arbitrarily
many domains in one program at the same time.  E.g., @theglibc{} itself
uses a domain named @code{libc} while the program using the C Library
could use a domain named @code{foo}.  The important point is that at any
time exactly one domain is active.  This is controlled with the
following function.
 989
 990 @comment libintl.h
 991 @comment GNU
 992 @deftypefun {char *} textdomain (const char *@var{domainname})
 993 The @code{textdomain} function sets the default domain, which is used in
 994 all future @code{gettext} calls, to @var{domainname}.  Please note that
 995 @code{dgettext} and @code{dcgettext} calls are not influenced if the
 996 @var{domainname} parameter of these functions is not the null pointer.
 997
 998 Before the first call to @code{textdomain} the default domain is
 999 @code{messages}.  This is the name specified in the specification of
1000 the @code{gettext} API.  This name is as good as any other name.  No
1001 program should ever really use a domain with this name since this can
1002 only lead to problems.
1003
1004 The function returns the value which is from now on taken as the default
domain.  If the system runs out of memory the returned value is
1006 @code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}.
1007 Despite the return value type being @code{char *} the return string must
1008 not be changed.  It is allocated internally by the @code{textdomain}
1009 function.
1010
1011 If the @var{domainname} parameter is the null pointer no new default
1012 domain is set.  Instead the currently selected default domain is
1013 returned.
1014
1015 If the @var{domainname} parameter is the empty string the default domain
1016 is reset to its initial value, the domain with the name @code{messages}.
Making use of this possibility is questionable since the domain
@code{messages} really never should be used.
1019 @end deftypefun
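
The query form of @code{textdomain} can be used to inspect the current
setting.  The following is a small sketch; the helper name
@code{is_default_domain} is hypothetical and exists only for this
illustration.

```c
#include <assert.h>
#include <libintl.h>
#include <string.h>

/* Hypothetical helper: return nonzero if DOMAIN is the current
   default domain.  Passing a null pointer to textdomain only
   queries the setting and leaves it unchanged.  */
int
is_default_domain (const char *domain)
{
  assert (domain != NULL);

  /* The returned string is owned by the library; it is only read.  */
  const char *current = textdomain (NULL);
  return current != NULL && strcmp (current, domain) == 0;
}
```

In a fresh process @code{is_default_domain ("messages")} is expected to
hold; after @code{textdomain ("foo")} it holds for @code{"foo"}
instead, and @code{textdomain ("")} switches back to the initial value.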
1020
1021 @comment libintl.h
1022 @comment GNU
1023 @deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname})
1024 The @code{bindtextdomain} function can be used to specify the directory
1025 which contains the message catalogs for domain @var{domainname} for the
1026 different languages.  To be correct, this is the directory where the
1027 hierarchy of directories is expected.  Details are explained below.
1028
For the programmer it is important to note that the translations which
come with the program have to be placed in a directory hierarchy starting
at, say, @file{/foo/bar}.  Then the program should make a
@code{bindtextdomain} call to bind the domain for the current program to
this directory.  This ensures that the catalogs are found.  A correctly
running program does not depend on the user setting an environment
variable.
1036
1037 The @code{bindtextdomain} function can be used several times and if the
1038 @var{domainname} argument is different the previously bound domains
1039 will not be overwritten.
1040
If the program which wishes to use @code{bindtextdomain} at some point
uses the @code{chdir} function to change the current working
directory it is important that the @var{dirname} string is an
absolute pathname.  Otherwise the addressed directory might vary over
time.
1046
1047 If the @var{dirname} parameter is the null pointer @code{bindtextdomain}
1048 returns the currently selected directory for the domain with the name
1049 @var{domainname}.
1050
The @code{bindtextdomain} function returns a pointer to a string
containing the name of the selected directory.  The string is
allocated internally in the function and must not be changed by the
user.  If the system runs out of memory during the execution of
@code{bindtextdomain} the return value is @code{NULL} and the global
variable @var{errno} is set accordingly.
1057 @end deftypefun
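
Taken together, the programmer's part of the configuration usually
amounts to a short start-up sequence.  The following is a sketch under
stated assumptions: the function name @code{init_translation} is
invented here, and the domain name and directory are whatever the
caller passes in; nothing about them is prescribed by the library.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical start-up helper.  Binds DOMAIN to the catalog
   hierarchy below LOCALEDIR (which should be an absolute path) and
   makes DOMAIN the default domain.  Returns 0 on success and -1 if
   one of the calls reports failure (errno is then set).  */
int
init_translation (const char *domain, const char *localedir)
{
  assert (domain != NULL && localedir != NULL);

  /* Adopt the locale from the user's environment.  A failure is not
     treated as fatal: the program then stays in the "C" locale and
     shows the untranslated messages.  */
  setlocale (LC_ALL, "");

  if (bindtextdomain (domain, localedir) == NULL)
    return -1;
  if (textdomain (domain) == NULL)
    return -1;
  return 0;
}
```

A program would call this once early in @code{main}, for instance as
@code{init_translation ("hello-example", "/usr/share/locale")}, where
both strings are placeholders, and use plain @code{gettext} calls
afterwards.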
1058
1059
1060 @node Advanced gettext functions
1061 @subsubsection Additional functions for more complicated situations
1062
The functions of the @code{gettext} family described so far (and all the
@code{catgets} functions as well) have one problem in the real world
which has been neglected completely in all existing approaches.  What
is meant here is the handling of plural forms.
1067
1068 Looking through Unix source code before the time anybody thought about
1069 internationalization (and, sadly, even afterwards) one can often find
1070 code similar to the following:
1071
1072 @smallexample
1073    printf ("%d file%s deleted", n, n == 1 ? "" : "s");
1074 @end smallexample
1075
1076 @noindent
1077 After the first complaints from people internationalizing the code people
1078 either completely avoided formulations like this or used strings like
@code{"file(s)"}.  Both look unnatural and should be avoided.  First
attempts to solve the problem correctly looked like this:
1081
1082 @smallexample
1083    if (n == 1)
1084      printf ("%d file deleted", n);
1085    else
1086      printf ("%d files deleted", n);
1087 @end smallexample
1088
1089 But this does not solve the problem.  It helps languages where the
1090 plural form of a noun is not simply constructed by adding an `s' but
1091 that is all.  Once again people fell into the trap of believing the
1092 rules their language is using are universal.  But the handling of plural
forms differs widely between the language families.  There are two
things which can differ between (and even inside) language families:
1095
1096 @itemize @bullet
1097 @item
The way plural forms are built differs.  This is a problem with
languages which have many irregularities.  German, for instance, is a
drastic case.  Though English and German are part of the same language
family (Germanic), the almost regular forming of plural noun forms
(appending an `s') is hardly found in German.
1103
1104 @item
The number of plural forms differs.  This is somewhat surprising for
those who only have experience with Romance and Germanic languages
since here the number is the same (there are two).

But other language families have only one form or many forms.  More
information on this is given in an extra section.
1111 @end itemize
1112
1113 The consequence of this is that application writers should not try to
1114 solve the problem in their code.  This would be localization since it is
1115 only usable for certain, hardcoded language environments.  Instead the
1116 extended @code{gettext} interface should be used.
1117
These extra functions take, instead of the one key string, two
strings and a numerical argument.  The idea behind this is that using
the numerical argument and the first string as a key, the implementation
can select the right plural form using rules specified by the
translator.  The two string arguments are then used to provide a return
value in case no message catalog is found (similar to the normal
@code{gettext} behavior).  In this case the rules for Germanic languages
are used and it is assumed that the first string argument is the
singular form, the second the plural form.
1127
1128 This has the consequence that programs without language catalogs can
1129 display the correct strings only if the program itself is written using
1130 a Germanic language.  This is a limitation but since @theglibc{}
1131 (as well as the GNU @code{gettext} package) are written as part of the
GNU package and the coding standards for the GNU project require programs
to be written in English, this solution nevertheless fulfills its
1134 purpose.
1135
1136 @comment libintl.h
1137 @comment GNU
1138 @deftypefun {char *} ngettext (const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
1139 The @code{ngettext} function is similar to the @code{gettext} function
1140 as it finds the message catalogs in the same way.  But it takes two
1141 extra arguments.  The @var{msgid1} parameter must contain the singular
1142 form of the string to be converted.  It is also used as the key for the
1143 search in the catalog.  The @var{msgid2} parameter is the plural form.
1144 The parameter @var{n} is used to determine the plural form.  If no
1145 message catalog is found @var{msgid1} is returned if @code{n == 1},
otherwise @var{msgid2}.

An example for the use of this function is:
1149
1150 @smallexample
1151   printf (ngettext ("%d file removed", "%d files removed", n), n);
1152 @end smallexample
1153
1154 Please note that the numeric value @var{n} has to be passed to the
1155 @code{printf} function as well.  It is not sufficient to pass it only to
1156 @code{ngettext}.
1157 @end deftypefun
1158
1159 @comment libintl.h
1160 @comment GNU
1161 @deftypefun {char *} dngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n})
The @code{dngettext} function is similar to the @code{dgettext} function
in the way the message catalog is selected.  The difference is that it
takes two extra parameters to provide the correct plural form.  These two
parameters are handled in the same way @code{ngettext} handles them.
1166 @end deftypefun
1167
1168 @comment libintl.h
1169 @comment GNU
1170 @deftypefun {char *} dcngettext (const char *@var{domain}, const char *@var{msgid1}, const char *@var{msgid2}, unsigned long int @var{n}, int @var{category})
The @code{dcngettext} function is similar to the @code{dcgettext} function
in the way the message catalog is selected.  The difference is that it
takes two extra parameters to provide the correct plural form.  These two
parameters are handled in the same way @code{ngettext} handles them.
1175 @end deftypefun
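
The plural functions combine naturally with the @code{printf} family.
The following sketch wraps the pattern in a helper whose name,
@code{format_removed}, is made up for this illustration.  It formats
into a caller-supplied buffer rather than printing so that the result
can be inspected.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format a report about N removed files into
   BUF, which holds SIZE bytes.  Returns what snprintf returns, i.e.
   the length the complete text has (or would have had).  */
int
format_removed (char *buf, size_t size, unsigned long int n)
{
  assert (buf != NULL && size > 0);

  /* ngettext only selects the format string.  N still has to be
     passed to snprintf so that the %lu directive gets its value.  */
  return snprintf (buf, size,
                   ngettext ("%lu file removed", "%lu files removed", n),
                   n);
}
```

Without a message catalog the Germanic rule applies, so the singular
string is chosen only for @code{n == 1}; with a catalog the translator's
rules decide which form is returned.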
1176
1177 @subsubheading The problem of plural forms
1178
1179 A description of the problem can be found at the beginning of the last
1180 section.  Now there is the question how to solve it.  Without the input
1181 of linguists (which was not available) it was not possible to determine
1182 whether there are only a few different forms in which plural forms are
1183 formed or whether the number can increase with every new supported
1184 language.
1185
1186 Therefore the solution implemented is to allow the translator to specify
1187 the rules of how to select the plural form.  Since the formula varies
1188 with every language this is the only viable solution except for
1189 hardcoding the information in the code (which still would require the
1190 possibility of extensions to not prevent the use of new languages).  The
1191 details are explained in the GNU @code{gettext} manual.  Here only a
1192 bit of information is provided.
1193
The information about the plural form selection has to be stored in the
header entry (the one with the empty @code{msgid} string).  It looks
like this:
1197
1198 @smallexample
1199 Plural-Forms: nplurals=2; plural=n == 1 ? 0 : 1;
1200 @end smallexample
1201
The @code{nplurals} value must be a decimal number which specifies how
many different plural forms exist for this language.  The string
following @code{plural} is an expression which uses the C language
syntax.  Exceptions are that no negative numbers are allowed, numbers
must be decimal, and the only variable allowed is @code{n}.  This
expression will be evaluated whenever one of the functions
@code{ngettext}, @code{dngettext}, or @code{dcngettext} is called.  The
numeric value passed to these functions is then substituted for all uses
of the variable @code{n} in the expression.  The resulting value then
must be greater than or equal to zero and smaller than the value given
as the value of @code{nplurals}.
1213
1214 @noindent
The following rules are known at this point.  The languages are listed
with their families.  But this does not necessarily mean the information
can be generalized for the whole family (as can be easily seen in the
table below).@footnote{Additions are welcome.  Send appropriate
information to @email{bug-glibc-manual@@gnu.org}.}
1220
1221 @table @asis
1222 @item Only one form:
1223 Some languages only require one single form.  There is no distinction
1224 between the singular and plural form.  An appropriate header entry
1225 would look like this:
1226
1227 @smallexample
1228 Plural-Forms: nplurals=1; plural=0;
1229 @end smallexample
1230
1231 @noindent
1232 Languages with this property include:
1233
1234 @table @asis
1235 @item Finno-Ugric family
1236 Hungarian
1237 @item Asian family
1238 Japanese, Korean
1239 @item Turkic/Altaic family
1240 Turkish
1241 @end table
1242
1243 @item Two forms, singular used for one only
1244 This is the form used in most existing programs since it is what English
1245 is using.  A header entry would look like this:
1246
1247 @smallexample
1248 Plural-Forms: nplurals=2; plural=n != 1;
1249 @end smallexample
1250
(Note: this uses the feature of C expressions that boolean expressions
have the value zero or one.)
1253
1254 @noindent
1255 Languages with this property include:
1256
1257 @table @asis
1258 @item Germanic family
1259 Danish, Dutch, English, German, Norwegian, Swedish
1260 @item Finno-Ugric family
1261 Estonian, Finnish
1262 @item Latin/Greek family
1263 Greek
1264 @item Semitic family
1265 Hebrew
1266 @item Romance family
1267 Italian, Portuguese, Spanish
1268 @item Artificial
1269 Esperanto
1270 @end table
1271
1272 @item Two forms, singular used for zero and one
1273 Exceptional case in the language family.  The header entry would be:
1274
1275 @smallexample
1276 Plural-Forms: nplurals=2; plural=n>1;
1277 @end smallexample
1278
1279 @noindent
1280 Languages with this property include:
1281
1282 @table @asis
@item Romance family
1284 French, Brazilian Portuguese
1285 @end table
1286
1287 @item Three forms, special case for zero
1288 The header entry would be:
1289
1290 @smallexample
1291 Plural-Forms: nplurals=3; plural=n%10==1 && n%100!=11 ? 0 : n != 0 ? 1 : 2;
1292 @end smallexample
1293
1294 @noindent
1295 Languages with this property include:
1296
1297 @table @asis
1298 @item Baltic family
1299 Latvian
1300 @end table
1301
1302 @item Three forms, special cases for one and two
1303 The header entry would be:
1304
1305 @smallexample
1306 Plural-Forms: nplurals=3; plural=n==1 ? 0 : n==2 ? 1 : 2;
1307 @end smallexample
1308
1309 @noindent
1310 Languages with this property include:
1311
1312 @table @asis
1313 @item Celtic
1314 Gaeilge (Irish)
1315 @end table
1316
1317 @item Three forms, special case for numbers ending in 1[2-9]
1318 The header entry would look like this:
1319
1320 @smallexample
1321 Plural-Forms: nplurals=3; \
1322     plural=n%10==1 && n%100!=11 ? 0 : \
1323            n%10>=2 && (n%100<10 || n%100>=20) ? 1 : 2;
1324 @end smallexample
1325
1326 @noindent
1327 Languages with this property include:
1328
1329 @table @asis
1330 @item Baltic family
1331 Lithuanian
1332 @end table
1333
1334 @item Three forms, special cases for numbers ending in 1 and 2, 3, 4, except those ending in 1[1-4]
1335 The header entry would look like this:
1336
1337 @smallexample
1338 Plural-Forms: nplurals=3; \
1339     plural=n%100/10==1 ? 2 : n%10==1 ? 0 : (n+9)%10>3 ? 2 : 1;
1340 @end smallexample
1341
1342 @noindent
1343 Languages with this property include:
1344
1345 @table @asis
1346 @item Slavic family
1347 Croatian, Czech, Russian, Ukrainian
1348 @end table
1349
1350 @item Three forms, special cases for 1 and 2, 3, 4
1351 The header entry would look like this:
1352
1353 @smallexample
1354 Plural-Forms: nplurals=3; \
1355     plural=(n==1) ? 1 : (n>=2 && n<=4) ? 2 : 0;
1356 @end smallexample
1357
1358 @noindent
1359 Languages with this property include:
1360
1361 @table @asis
1362 @item Slavic family
1363 Slovak
1364 @end table
1365
1366 @item Three forms, special case for one and some numbers ending in 2, 3, or 4
1367 The header entry would look like this:
1368
1369 @smallexample
1370 Plural-Forms: nplurals=3; \
1371     plural=n==1 ? 0 : \
1372            n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
1373 @end smallexample
1374
1375 @noindent
1376 Languages with this property include:
1377
1378 @table @asis
1379 @item Slavic family
1380 Polish
1381 @end table
1382
1383 @item Four forms, special case for one and all numbers ending in 02, 03, or 04
1384 The header entry would look like this:
1385
1386 @smallexample
1387 Plural-Forms: nplurals=4; \
1388     plural=n%100==1 ? 0 : n%100==2 ? 1 : n%100==3 || n%100==4 ? 2 : 3;
1389 @end smallexample
1390
1391 @noindent
1392 Languages with this property include:
1393
1394 @table @asis
1395 @item Slavic family
1396 Slovenian
1397 @end table
1398 @end table
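
Since the @code{plural} expressions use C syntax, a rule can be checked
by transcribing it into a C function and trying a few values.  The
following sketch does this for the Polish rule from the table above.
The function name @code{plural_index_polish} is hypothetical; the
expression itself is copied from the header entry.

```c
#include <assert.h>

/* Transcription of the Polish header entry
     nplurals=3;
     plural=n==1 ? 0 :
            n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;
   Returns the index of the plural form to use for N.  */
unsigned long int
plural_index_polish (unsigned long int n)
{
  unsigned long int index
    = (n == 1 ? 0
       : (n % 10 >= 2 && n % 10 <= 4
          && (n % 100 < 10 || n % 100 >= 20)) ? 1
       : 2);

  /* The result must be smaller than nplurals, which is 3 here.  */
  assert (index < 3);
  return index;
}
```

Trying the values 1, 2, 5, 12, and 22 yields the indices 0, 1, 2, 2,
and 1 respectively, which shows how numbers ending in 2, 3, or 4 are
treated differently from those ending in 12, 13, or 14.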
1399
1400
1401 @node Charset conversion in gettext
1402 @subsubsection How to specify the output character set @code{gettext} uses
1403
1404 @code{gettext} not only looks up a translation in a message catalog.  It
1405 also converts the translation on the fly to the desired output character
1406 set.  This is useful if the user is working in a different character set
1407 than the translator who created the message catalog, because it avoids
1408 distributing variants of message catalogs which differ only in the
1409 character set.
1410
1411 The output character set is, by default, the value of @code{nl_langinfo
1412 (CODESET)}, which depends on the @code{LC_CTYPE} part of the current
1413 locale.  But programs which store strings in a locale independent way
1414 (e.g. UTF-8) can request that @code{gettext} and related functions
1415 return the translations in that encoding, by use of the
1416 @code{bind_textdomain_codeset} function.
1417
1418 Note that the @var{msgid} argument to @code{gettext} is not subject to
1419 character set conversion.  Also, when @code{gettext} does not find a
1420 translation for @var{msgid}, it returns @var{msgid} unchanged --
1421 independently of the current output character set.  It is therefore
1422 recommended that all @var{msgid}s be US-ASCII strings.
1423
1424 @comment libintl.h
1425 @comment GNU
1426 @deftypefun {char *} bind_textdomain_codeset (const char *@var{domainname}, const char *@var{codeset})
1427 The @code{bind_textdomain_codeset} function can be used to specify the
1428 output character set for message catalogs for domain @var{domainname}.
1429 The @var{codeset} argument must be a valid codeset name which can be used
1430 for the @code{iconv_open} function, or a null pointer.
1431
1432 If the @var{codeset} parameter is the null pointer,
1433 @code{bind_textdomain_codeset} returns the currently selected codeset
1434 for the domain with the name @var{domainname}. It returns @code{NULL} if
1435 no codeset has yet been selected.
1436
1437 The @code{bind_textdomain_codeset} function can be used several times.
1438 If used multiple times with the same @var{domainname} argument, the
1439 later call overrides the settings made by the earlier one.
1440
The @code{bind_textdomain_codeset} function returns a pointer to a
string containing the name of the selected codeset.  The string is
allocated internally in the function and must not be changed by the
user.  If the system ran out of memory during the execution of
@code{bind_textdomain_codeset}, the return value is @code{NULL} and the
global variable @code{errno} is set accordingly.
@end deftypefun
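
A program which keeps all its strings in UTF-8 might use the function
roughly as in the following sketch.  The helper name
@code{use_utf8_messages} is an assumption made for this example; only
@code{bind_textdomain_codeset} itself comes from @file{libintl.h}.

```c
#include <libintl.h>
#include <stddef.h>

/* Hypothetical helper: request that translations for DOMAINNAME are
   returned in UTF-8, whatever the LC_CTYPE locale says.  Returns 0 on
   success and -1 if bind_textdomain_codeset reported an error.  */
int
use_utf8_messages (const char *domainname)
{
  const char *selected = bind_textdomain_codeset (domainname, "UTF-8");

  return selected != NULL ? 0 : -1;
}
```

A later call of @code{bind_textdomain_codeset} with a null pointer as
the second argument can then be used to query the codeset selected for
the domain.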
1447
1448
1449 @node GUI program problems
1450 @subsubsection How to use @code{gettext} in GUI programs
1451
One place where the @code{gettext} functions, if used normally, have big
problems is within programs with graphical user interfaces (GUIs).  The
problem is that many of the strings which have to be translated are very
short.  They have to appear in pull-down menus, which restricts their
length.  But strings which do not contain entire sentences, or at least
large fragments of a sentence, may appear in more than one situation in
the program and might need a different translation in each.  This is
especially true for the one-word strings which are frequently used in
GUI programs.
1461
As a consequence many people say that the @code{gettext} approach is
wrong and that @code{catgets}, which indeed does not have this problem,
should be used instead.  But there is a very simple and powerful method
to handle this kind of problem with the @code{gettext} functions.
1466
1467 @noindent
1468 As an example consider the following fictional situation.  A GUI program
1469 has a menu bar with the following entries:
1470
1471 @smallexample
1472 +------------+------------+--------------------------------------+
1473 | File       | Printer    |                                      |
1474 +------------+------------+--------------------------------------+
1475 | Open     | | Select   |
1476 | New      | | Open     |
1477 +----------+ | Connect  |
1478              +----------+
1479 @end smallexample
1480
1481 To have the strings @code{File}, @code{Printer}, @code{Open},
1482 @code{New}, @code{Select}, and @code{Connect} translated there has to be
1483 at some point in the code a call to a function of the @code{gettext}
1484 family.  But in two places the string passed into the function would be
1485 @code{Open}.  The translations might not be the same and therefore we
1486 are in the dilemma described above.
1487
One solution to this problem is to artificially lengthen the strings
to make them unambiguous.  But what would the program do if no
translation is available?  The lengthened string is not what should be
printed.  So we should use a slightly modified version of the functions.

To lengthen the strings a uniform method should be used.  E.g., in the
example above the strings could be chosen as
1495
1496 @smallexample
1497 Menu|File
1498 Menu|Printer
1499 Menu|File|Open
1500 Menu|File|New
1501 Menu|Printer|Select
1502 Menu|Printer|Open
1503 Menu|Printer|Connect
1504 @end smallexample
1505
Now all the strings are different, and if the following little wrapper
function is used instead of @code{gettext}, everything works just fine:
1509
1510 @cindex sgettext
1511 @smallexample
1512   char *
1513   sgettext (const char *msgid)
1514   @{
1515     char *msgval = gettext (msgid);
1516     if (msgval == msgid)
1517       msgval = strrchr (msgid, '|') + 1;
1518     return msgval;
1519   @}
1520 @end smallexample
1521
1522 What this little function does is to recognize the case when no
1523 translation is available.  This can be done very efficiently by a
1524 pointer comparison since the return value is the input value.  If there
1525 is no translation we know that the input string is in the format we used
1526 for the Menu entries and therefore contains a @code{|} character.  We
1527 simply search for the last occurrence of this character and return a
1528 pointer to the character following it.  That's it!
1529
If one now consistently uses the lengthened string form and replaces
the @code{gettext} calls with calls to @code{sgettext} (this is normally
limited to very few places in the GUI implementation) then it is
possible to produce a program which can be internationalized.
1534
With advanced compilers (such as GNU C) one can write the
@code{sgettext} function as an inline function or as a macro like this:
1537
1538 @cindex sgettext
1539 @smallexample
#define sgettext(msgid) \
  (@{ const char *__msgid = (msgid);            \
     char *__msgval = gettext (__msgid);       \
     if (__msgval == __msgid)                  \
       __msgval = strrchr (__msgid, '|') + 1;  \
     __msgval; @})
1546 @end smallexample
1547
1548 The other @code{gettext} functions (@code{dgettext}, @code{dcgettext}
1549 and the @code{ngettext} equivalents) can and should have corresponding
1550 functions as well which look almost identical, except for the parameters
1551 and the call to the underlying function.
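
As a minimal sketch of what such corresponding functions might look
like, consider the following.  The names @code{my_dsgettext} and
@code{my_nsgettext} are hypothetical and chosen only for this example.
Unlike the wrapper above, the helper @code{strip_prefix} also tolerates
strings without a @code{|} character.

```c
#include <libintl.h>
#include <string.h>

/* Return the part of MSGID after the last '|', or MSGID itself if it
   contains no separator.  */
static const char *
strip_prefix (const char *msgid)
{
  const char *sep = strrchr (msgid, '|');

  return sep != NULL ? sep + 1 : msgid;
}

/* Hypothetical counterpart of sgettext for dgettext.  */
const char *
my_dsgettext (const char *domainname, const char *msgid)
{
  const char *msgval = dgettext (domainname, msgid);

  return msgval == msgid ? strip_prefix (msgid) : msgval;
}

/* Hypothetical counterpart of sgettext for ngettext.  Without a
   translation ngettext returns one of its two string arguments, so
   both have to be checked.  */
const char *
my_nsgettext (const char *msgid1, const char *msgid2, unsigned long int n)
{
  const char *msgval = ngettext (msgid1, msgid2, n);

  if (msgval == msgid1 || msgval == msgid2)
    msgval = strip_prefix (msgval);
  return msgval;
}
```

The only differences to the @code{sgettext} wrapper are the parameters
and the underlying function which is called.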
1552
Now there is of course the question of why such functions do not exist
in @theglibc{}.  There are two parts to the answer to this question.
1555
1556 @itemize @bullet
1557 @item
1558 They are easy to write and therefore can be provided by the project they
1559 are used in.  This is not an answer by itself and must be seen together
1560 with the second part which is:
1561
1562 @item
There is no way the C library can contain a version which can work
everywhere.  The problem is the selection of the character to separate
the prefix from the actual string in the lengthened string.  The
examples above used @code{|}, which is quite a good choice because it
resembles a notation frequently used in this context and it also is a
character not often used in message strings.

But what if the character is used in message strings?  Or what if the
chosen character is not available in the character set on the machine
one compiles on (e.g., @code{|} is not required to exist for @w{ISO C};
this is why the @file{iso646.h} file exists in @w{ISO C} programming
environments)?
1574 @end itemize
1575
There is only one more comment left to make.  The wrapper functions above
require that the translated strings are not lengthened themselves.
This is only logical.  There is no need to disambiguate the strings
(since they are never used as keys for a search) and one also saves
quite some memory and disk space by doing this.
1581
1582
1583 @node Using gettextized software
1584 @subsubsection User influence on @code{gettext}
1585
The last sections described what the programmer can do to
internationalize the messages of the program.  But it is finally up to
the user to select the messages s/he wants to see, since s/he must be
able to understand them.
1590
1591 The POSIX locale model uses the environment variables @code{LC_COLLATE},
1592 @code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, @code{LC_NUMERIC},
1593 and @code{LC_TIME} to select the locale which is to be used.  This way
1594 the user can influence lots of functions.  As we mentioned above the
1595 @code{gettext} functions also take advantage of this.
1596
1597 To understand how this happens it is necessary to take a look at the
1598 various components of the filename which gets computed to locate a
1599 message catalog.  It is composed as follows:
1600
1601 @smallexample
1602 @var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo
1603 @end smallexample
1604
1605 The default value for @var{dir_name} is system specific.  It is computed
1606 from the value given as the prefix while configuring the C library.
1607 This value normally is @file{/usr} or @file{/}.  For the former the
1608 complete @var{dir_name} is:
1609
1610 @smallexample
1611 /usr/share/locale
1612 @end smallexample
1613
We can use @file{/usr/share} since the @file{.mo} files containing the
message catalogs are system independent, so all systems can use the same
files.  If the program executed the @code{bindtextdomain} function for
the message domain that is currently handled, the @var{dir_name}
component is exactly the value which was given to the function as
the second parameter.  I.e., @code{bindtextdomain} allows overriding
the only system-dependent and fixed value to make it possible to
address files anywhere in the filesystem.
1622
The @var{category} is the name of the locale category which was selected
in the program code.  For @code{gettext} and @code{dgettext} this is
always @code{LC_MESSAGES}; for @code{dcgettext} this is selected by the
value of the third parameter.  As said above, using a category other
than @code{LC_MESSAGES} should be avoided.
1628
The @var{locale} component is computed based on the category used.  Just
like for the @code{setlocale} function, this is where the user's
selection comes into play.  Some environment variables are examined in a
fixed order and the first environment variable set determines the result
of the lookup process.  In detail, for the category @code{LC_xxx} the
following variables are examined in this order:
1635
1636 @table @code
1637 @item LANGUAGE
1638 @item LC_ALL
1639 @item LC_xxx
1640 @item LANG
1641 @end table
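
The lookup order listed above can be sketched with @code{getenv} as
follows.  This is only an illustration of the order described here, not
the code used inside @theglibc{}; the function name
@code{first_locale_setting} is hypothetical, and treating empty values
as unset is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return the value of the first variable that is set, following the
   order LANGUAGE, LC_ALL, LC_xxx, LANG, or NULL if none is set.
   CATEGORY_NAME is the name of the category variable, e.g.
   "LC_MESSAGES".  Empty values are treated as unset (an assumption
   of this sketch).  */
const char *
first_locale_setting (const char *category_name)
{
  const char *names[4];
  size_t i;

  names[0] = "LANGUAGE";
  names[1] = "LC_ALL";
  names[2] = category_name;
  names[3] = "LANG";

  for (i = 0; i < 4; i++)
    {
      const char *value = getenv (names[i]);

      if (value != NULL && value[0] != '\0')
        return value;
    }
  return NULL;
}
```

A call such as @code{first_locale_setting ("LC_MESSAGES")} then yields
the string from which the @var{locale} component would be derived.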
1642
This looks very familiar.  With the exception of the @code{LANGUAGE}
environment variable this is exactly the lookup order the
@code{setlocale} function uses.  But why introduce the @code{LANGUAGE}
variable?
1647
The reason is that the syntax of the values this variable can have is
different from what is expected by the @code{setlocale} function.  If we
set @code{LC_ALL} to a value following the extended syntax, the
@code{setlocale} function would no longer be able to use the value of
that variable.  An additional variable removes this problem, and it also
lets us select the language independently of the locale setting, which
is sometimes useful.
1655
While for the @code{LC_xxx} variables the value should consist of
exactly one specification of a locale, the @code{LANGUAGE} variable's
value can consist of a colon-separated list of locale names.  The
attentive reader will realize that this is the way we manage to
implement one of our additional demands above: we want to be able to
specify an ordered list of languages.
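
To make the list syntax concrete, the following sketch extracts a single
entry from a colon-separated value such as @code{de_AT:de:en}.  The
function name @code{language_entry} is hypothetical, and skipping empty
entries is an assumption of this example rather than documented
behavior.

```c
#include <stddef.h>
#include <string.h>

/* Copy entry number INDEX (counting from 0) of the colon-separated
   LIST into BUF.  Returns 1 on success and 0 if there is no such entry
   or it does not fit into SIZE bytes.  Empty entries are skipped.  */
int
language_entry (const char *list, size_t index, char *buf, size_t size)
{
  while (*list != '\0')
    {
      size_t len = strcspn (list, ":");

      if (len > 0)
        {
          if (index == 0)
            {
              if (len >= size)
                return 0;
              memcpy (buf, list, len);
              buf[len] = '\0';
              return 1;
            }
          index--;
        }
      list += len;
      if (*list == ':')
        list++;
    }
  return 0;
}
```

Calling the function with increasing values of @var{index} visits the
entries in the order of the user's preference.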
1662
1663 Back to the constructed filename we have only one component missing.
1664 The @var{domain_name} part is the name which was either registered using
1665 the @code{textdomain} function or which was given to @code{dgettext} or
1666 @code{dcgettext} as the first parameter.  Now it becomes obvious that a
1667 good choice for the domain name in the program code is a string which is
1668 closely related to the program/package name.  E.g., for @theglibc{}
1669 the domain name is @code{libc}.
1670
1671 @noindent
A little piece of example code should show how the programmer is supposed
to work:
1674
1675 @smallexample
1676 @{
1677   setlocale (LC_ALL, "");
1678   textdomain ("test-package");
1679   bindtextdomain ("test-package", "/usr/local/share/locale");
1680   puts (gettext ("Hello, world!"));
1681 @}
1682 @end smallexample
1683
1684 At the program start the default domain is @code{messages}, and the
1685 default locale is "C".  The @code{setlocale} call sets the locale
1686 according to the user's environment variables; remember that correct
1687 functioning of @code{gettext} relies on the correct setting of the
1688 @code{LC_MESSAGES} locale (for looking up the message catalog) and
1689 of the @code{LC_CTYPE} locale (for the character set conversion).
1690 The @code{textdomain} call changes the default domain to
1691 @code{test-package}.  The @code{bindtextdomain} call specifies that
1692 the message catalogs for the domain @code{test-package} can be found
1693 below the directory @file{/usr/local/share/locale}.
1694
If the user now sets the variable @code{LANGUAGE} to @code{de} in
her/his environment, the @code{gettext} function will try to use the
translations from the file
1698
1699 @smallexample
1700 /usr/local/share/locale/de/LC_MESSAGES/test-package.mo
1701 @end smallexample
1702
1703 From the above descriptions it should be clear which component of this
1704 filename is determined by which source.
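
The composition of the filename can be sketched as a single
@code{snprintf} call.  The function name @code{catalog_path} and its
parameter list are assumptions for illustration; the library computes
the name internally and offers no such function.

```c
#include <stddef.h>
#include <stdio.h>

/* Compose DIR_NAME/LOCALE/CATEGORY/DOMAIN_NAME.mo into BUF.  CATEGORY
   is the full category name such as "LC_MESSAGES".  Returns 0 on
   success and -1 if the result does not fit into SIZE bytes.  */
int
catalog_path (char *buf, size_t size, const char *dir_name,
              const char *locale, const char *category,
              const char *domain_name)
{
  int n = snprintf (buf, size, "%s/%s/%s/%s.mo",
                    dir_name, locale, category, domain_name);

  return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```

Called with @file{/usr/local/share/locale}, @code{de},
@code{LC_MESSAGES}, and @code{test-package}, the sketch produces the
filename shown above.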
1705
In the above example we assumed that the @code{LANGUAGE} environment
variable is set to @code{de}.  This might be an appropriate selection but
what happens if the user wants to use @code{LC_ALL} because of the wider
usability and here the required value is @code{de_DE.ISO-8859-1}?  We
already mentioned above that a situation like this is not infrequent.
E.g., a person might prefer reading a dialect and if this is not
available fall back on the standard language.
1713
The @code{gettext} functions know about situations like this and can
handle them gracefully.  The functions recognize the format of the value
of the environment variable.  They can split the value into different
pieces and by leaving out one or the other part they can construct new
values.  This happens of course in a predictable way.  To understand
this one must know the format of the environment variable value.  There
is one more or less standardized form, originally from the X/Open
specification:
1722
1723 @code{language[_territory[.codeset]][@@modifier]}
1724
To obtain less specific locale names, the parts are stripped off in the
order of the following list:
1727
1728 @enumerate
1729 @item
1730 @code{codeset}
1731 @item
1732 @code{normalized codeset}
1733 @item
1734 @code{territory}
1735 @item
1736 @code{modifier}
1737 @end enumerate
1738
1739 The @code{language} field will never be dropped for obvious reasons.
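
Splitting a value of the form
@code{language[_territory[.codeset]][@@modifier]} into its fields might
look like the following sketch.  The structure
@code{struct locale_parts}, the function @code{split_locale_name}, and
the fixed field size are all assumptions made for this example; the
sketch only separates the fields and does not generate the fallback
names.

```c
#include <stddef.h>
#include <string.h>

/* The four fields of language[_territory[.codeset]][@modifier].
   Missing fields are left as empty strings.  The fixed field size is
   an arbitrary choice for this sketch.  */
struct locale_parts
{
  char language[32];
  char territory[32];
  char codeset[32];
  char modifier[32];
};

/* Copy LEN bytes of SRC into DST, truncating to fit SIZE bytes.  */
static void
copy_field (char *dst, size_t size, const char *src, size_t len)
{
  if (len >= size)
    len = size - 1;
  memcpy (dst, src, len);
  dst[len] = '\0';
}

/* Split NAME at the separators '_', '.', and '@'.  */
void
split_locale_name (const char *name, struct locale_parts *parts)
{
  size_t len;

  memset (parts, 0, sizeof *parts);

  len = strcspn (name, "_.@");
  copy_field (parts->language, sizeof parts->language, name, len);
  name += len;

  if (*name == '_')
    {
      name++;
      len = strcspn (name, ".@");
      copy_field (parts->territory, sizeof parts->territory, name, len);
      name += len;
    }
  if (*name == '.')
    {
      name++;
      len = strcspn (name, "@");
      copy_field (parts->codeset, sizeof parts->codeset, name, len);
      name += len;
    }
  if (*name == '@')
    copy_field (parts->modifier, sizeof parts->modifier,
                name + 1, strlen (name + 1));
}
```

For @code{de_DE.ISO-8859-1} this yields the language @code{de}, the
territory @code{DE}, the codeset @code{ISO-8859-1}, and an empty
modifier.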
1740
The only new thing is the @code{normalized codeset} entry.  This is
another goodie which is introduced to help reduce the chaos which
derives from the inability of people to standardize the names of
character sets.  Instead of @w{ISO-8859-1} one can often see @w{8859-1},
@w{88591}, @w{iso8859-1}, or @w{iso_8859-1}.  The @code{normalized
codeset} value is generated from the user-provided character set name by
applying the following rules:
1748
1749 @enumerate
1750 @item
Remove all characters besides digits and letters.
@item
Fold letters to lowercase.
@item
If the name contains only digits, prepend the string @code{"iso"}.
1756 @end enumerate
1757
1758 @noindent
So all of the above names will be normalized to @code{iso88591}.  This
allows the program user to choose the locale name much more freely.
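
The three normalization rules above translate into a few lines of C.
The following is a sketch under the assumption that the
@file{ctype.h} functions classify characters as in the @code{"C"}
locale; the function name @code{normalize_codeset_name} is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Write the normalized form of CODESET into BUF following the three
   rules above.  Returns 0 on success and -1 if BUF is too small.  */
int
normalize_codeset_name (const char *codeset, char *buf, size_t size)
{
  size_t len = 0;
  int only_digits = 1;
  const char *p;

  if (size == 0)
    return -1;

  /* Rules 1 and 2: keep only letters and digits, fold to lowercase.  */
  for (p = codeset; *p != '\0'; p++)
    {
      unsigned char c = (unsigned char) *p;

      if (!isalnum (c))
        continue;
      if (!isdigit (c))
        only_digits = 0;
      if (len + 1 >= size)
        return -1;
      buf[len++] = (char) tolower (c);
    }
  buf[len] = '\0';

  /* Rule 3: a name made only of digits gets the prefix "iso".  */
  if (only_digits)
    {
      if (len + 3 + 1 > size)
        return -1;
      memmove (buf + 3, buf, len + 1);
      memcpy (buf, "iso", 3);
    }
  return 0;
}
```

Applied to each of the five spellings mentioned above, the sketch
produces @code{iso88591}.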
1761
1762 Even this extended functionality still does not help to solve the
1763 problem that completely different names can be used to denote the same
1764 locale (e.g., @code{de} and @code{german}).  To be of help in this
1765 situation the locale implementation and also the @code{gettext}
1766 functions know about aliases.
1767
The file @file{/usr/share/locale/locale.alias} (replace @file{/usr} with
whatever prefix you used for configuring the C library) contains a
mapping of alternative names to more regular names.  The system manager
is free to add new entries to meet her/his own needs.  The selected
locale from the environment is compared with the entries in the first
column of this file, ignoring case.  If they match, the value of the
second column is used instead for the further handling.
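
The case-insensitive comparison against the first column can be
sketched as follows.  The table @code{sample_aliases} and its two
entries are made up for illustration and stand in for the contents of
the alias file; the function name @code{resolve_alias} is hypothetical
as well.

```c
#include <stddef.h>
#include <strings.h>

struct locale_alias
{
  const char *alias;   /* First column: the alternative name.  */
  const char *value;   /* Second column: the more regular name.  */
};

/* Sample entries for illustration only; the real mapping comes from
   the locale.alias file.  */
static const struct locale_alias sample_aliases[] =
{
  { "german", "de_DE.ISO-8859-1" },
  { "french", "fr_FR.ISO-8859-1" }
};

/* Return the replacement for NAME if it matches an alias, ignoring
   case; otherwise return NAME unchanged.  */
const char *
resolve_alias (const char *name)
{
  size_t i;

  for (i = 0; i < sizeof sample_aliases / sizeof sample_aliases[0]; i++)
    if (strcasecmp (name, sample_aliases[i].alias) == 0)
      return sample_aliases[i].value;
  return name;
}
```

With this table both @code{german} and @code{German} are replaced,
while a name which is not listed, such as @code{de}, is passed through
unchanged.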
1775
In the description of the format of the environment variables we already
mentioned the character set as a factor in the selection of the message
catalog.  In fact, only catalogs which contain text written using the
character set of the system/program can be used (directly; there will
come a solution for this some day).  This means for the user that s/he
will always have to take care of this.  If in the collection of the
message catalogs there are files for the same language but coded using
different character sets, the user has to be careful.
1784
1785
1786 @node Helper programs for gettext
1787 @subsection Programs to handle message catalogs for @code{gettext}
1788
1789 @Theglibc{} does not contain the source code for the programs to
1790 handle message catalogs for the @code{gettext} functions.  As part of
1791 the GNU project the GNU gettext package contains everything the
1792 developer needs.  The functionality provided by the tools in this
1793 package by far exceeds the abilities of the @code{gencat} program
1794 described above for the @code{catgets} functions.
1795
There is a program @code{msgfmt} which is the equivalent of the
@code{gencat} program.  It generates from the human-readable and
-editable form of the message catalog a binary file which can be used by
the @code{gettext} functions.  But there are several more programs
available.
1801
The @code{xgettext} program can be used to automatically extract the
translatable messages from a source file.  I.e., the programmer need not
take care of the translations and the list of messages which have to be
translated.  S/He will simply wrap the translatable strings in calls to
@code{gettext} et al., and the rest will be done by @code{xgettext}.  This
program has a lot of options which help to customize the output or
help it to understand the input better.
1809
Other programs help to manage the development cycle when new messages
appear in the source files or when a new translation of the messages
appears.  Here it should only be noted that using all the tools in GNU
gettext it is possible to @emph{completely} automate the handling of
message catalogs.  Besides marking the translatable strings in the source
code and generating the translations the developers do not have anything
to do themselves.