manual/charset.texi

   1 @node Character Set Handling, Locales, String and Array Utilities, Top
   2 @c %MENU% Support for extended character sets
   3 @chapter Character Set Handling
   4
   5 @ifnottex
   6 @macro cal{text}
   7 \text\
   8 @end macro
   9 @end ifnottex
  10
Character sets used in the early days of computers had only six, seven,
or eight bits for each character.  In no case were more bits used than
would fit into one byte, which nowadays is almost exclusively
@w{8 bits} wide.  This leads to problems as soon as the characters
needed at one time cannot all be represented by the at most 256
available values.  This chapter describes the functionality which was
added to the C library to overcome this problem.
  18
  19 @menu
  20 * Extended Char Intro::              Introduction to Extended Characters.
  21 * Charset Function Overview::        Overview about Character Handling
  22                                       Functions.
  23 * Restartable multibyte conversion:: Restartable multibyte conversion
  24                                       Functions.
  25 * Non-reentrant Conversion::         Non-reentrant Conversion Function.
  26 * Generic Charset Conversion::       Generic Charset Conversion.
  27 @end menu
  28
  29
  30 @node Extended Char Intro
  31 @section Introduction to Extended Characters
  32
To overcome the limitations of character sets with a 1:1 relation
between bytes and characters people came up with a variety of solutions.
The remainder of this section gives a few examples to help understand
the design decisions made while developing the functionality of the
@w{C library} to support them.
  38
  39 @cindex internal representation
  40 A distinction we have to make right away is between internal and
  41 external representation.  @dfn{Internal representation} means the
  42 representation used by a program while keeping the text in memory.
  43 External representations are used when text is stored or transmitted
  44 through whatever communication channel.
  45
  46 Traditionally there was no difference between the two representations.
  47 It was equally comfortable and useful to use the same one-byte
  48 representation internally and externally.  This changes with more and
  49 larger character sets.
  50
One of the problems to overcome with the internal representation is
handling texts which were externally encoded using different character
sets.  Assume a program which reads two texts and compares them using
some metric.  The comparison can be usefully done only if the texts are
internally kept in a common format.
  56
  57 @cindex wide character
  58 For such a common format (@math{=} character set) eight bits are certainly
  59 not enough anymore.  So the smallest entity will have to grow: @dfn{wide
  60 characters} will be used.  Here instead of one byte one uses two or four
  61 (three are not good to address in memory and more than four bytes seem
  62 not to be necessary).
  63
  64 @cindex Unicode
  65 @cindex ISO 10646
As shown in some other part of this manual
@c !!! Ahem, wide char string functions are not yet covered -- drepper
there exists a completely new family of functions which can handle texts
of this kind in memory.  The most commonly used character sets for such
internal wide character representations are Unicode and @w{ISO 10646}.
The former is a subset of the latter and is used when wide characters are
chosen to be 2 bytes (@math{= 16} bits) wide.  The standard names of the
@cindex UCS2
@cindex UCS4
encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4
(@math{= 32} bits).
  77
  78 To represent wide characters the @code{char} type is certainly not
  79 suitable.  For this reason the @w{ISO C} standard introduces a new type
  80 which is designed to keep one character of a wide character string.  To
  81 maintain the similarity there is also a type corresponding to @code{int}
  82 for those functions which take a single wide character.
  83
  84 @comment stddef.h
  85 @comment ISO
  86 @deftp {Data type} wchar_t
  87 This data type is used as the base type for wide character strings.
  88 I.e., arrays of objects of this type are the equivalent of @code{char[]}
  89 for multibyte character strings.  The type is defined in @file{stddef.h}.
  90
The @w{ISO C89} standard, where this type was introduced, does not say
anything specific about the representation.  It only requires that this
type is capable of storing all elements of the basic character set.
Therefore it would be legitimate to define @code{wchar_t} as
@code{char}.  This might make sense for embedded systems.

But for GNU systems this type is always 32 bits wide.  It is therefore
capable of representing all UCS4 values, thereby covering all of
@w{ISO 10646}.  Some Unix systems define @code{wchar_t} as a 16 bit type
and thereby follow Unicode very strictly.  This is perfectly fine with
the standard but it also means that to represent all characters from
Unicode and @w{ISO 10646} one has to use surrogate characters, which is
in fact a multi-wide-character encoding.  But this contradicts the
purpose of the @code{wchar_t} type.
 105 @end deftp
 106
 107 @comment wchar.h
 108 @comment ISO
 109 @deftp {Data type} wint_t
@code{wint_t} is a data type used for parameters and variables which
contain a single wide character.  As the name already suggests it is the
equivalent of @code{int} when using the normal @code{char} strings.  The
types @code{wchar_t} and @code{wint_t} often have the same
representation if they are 32 bits wide, but if @code{wchar_t} is
defined as @code{char} the type @code{wint_t} must be defined as
@code{int} due to the parameter promotion.
 117
 118 @pindex wchar.h
 119 This type is defined in @file{wchar.h} and got introduced in the second
 120 amendment to @w{ISO C 89}.
 121 @end deftp
 122
As for the @code{char} data type, there also exist macros specifying
the minimum and maximum value representable in an object of type
@code{wchar_t}.
 126
 127 @comment wchar.h
 128 @comment ISO
 129 @deftypevr Macro wint_t WCHAR_MIN
The macro @code{WCHAR_MIN} evaluates to the minimum value representable
by an object of type @code{wchar_t}.
 132
 133 This macro got introduced in the second amendment to @w{ISO C89}.
 134 @end deftypevr
 135
 136 @comment wchar.h
 137 @comment ISO
 138 @deftypevr Macro wint_t WCHAR_MAX
The macro @code{WCHAR_MAX} evaluates to the maximum value representable
by an object of type @code{wchar_t}.
 141
 142 This macro got introduced in the second amendment to @w{ISO C89}.
 143 @end deftypevr
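
As a minimal sketch of how these limits can be used, the following
helper tests whether a single @code{wchar_t} object can hold every
Unicode code point.  The function name @code{wchar_covers_unicode} is
hypothetical and not part of the C library.

```c
#include <wchar.h>

/* Hypothetical helper: nonzero if wchar_t can represent every
   Unicode code point (U+0000 .. U+10FFFF) in a single object.  */
int
wchar_covers_unicode (void)
{
  return WCHAR_MAX >= 0x10FFFF;
}
```

On a system with a 16 bit @code{wchar_t} the function returns zero, which
signals that surrogate pairs would be needed for some characters.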
 144
 145 Another special wide character value is the equivalent to @code{EOF}.
 146
 147 @comment wchar.h
 148 @comment ISO
 149 @deftypevr Macro wint_t WEOF
 150 The macro @code{WEOF} evaluates to a constant expression of type
 151 @code{wint_t} whose value is different from any member of the extended
 152 character set.
 153
 154 @code{WEOF} need not be the same value as @code{EOF} and unlike
 155 @code{EOF} it also need @emph{not} be negative.  I.e., sloppy code like
 156
@smallexample
@{
  int c;
  ...
  while ((c = getc (fp)) >= 0)
    ...
@}
@end smallexample
 165
 166 @noindent
 167 has to be rewritten to explicitly use @code{WEOF} when wide characters
 168 are used.
 169
@smallexample
@{
  wint_t c;
  ...
  while ((c = getwc (fp)) != WEOF)
    ...
@}
@end smallexample
 178
 179 @pindex wchar.h
 180 This macro was introduced in the second amendment to @w{ISO C89} and is
 181 defined in @file{wchar.h}.
 182 @end deftypevr
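
The following self-contained sketch applies this rule.  It counts the
wide characters left on a stream, using the @w{ISO C} function
@code{getwc} and an explicit comparison against @code{WEOF}.  The helper
name @code{count_wide_chars} is an assumption made for this
illustration.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: count the wide characters remaining on FP.
   The result of getwc is kept in a wint_t and compared against
   WEOF explicitly instead of being tested for a negative value.  */
size_t
count_wide_chars (FILE *fp)
{
  size_t n = 0;
  wint_t c;

  while ((c = getwc (fp)) != WEOF)
    ++n;
  return n;
}
```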
 183
 184
These internal representations present problems when it comes to storing
and transmitting them.  Since a single wide character consists of more
than one byte they are affected by byte-ordering.  I.e., machines with
different endiannesses would see different values when accessing the
same data.  This also applies to communication protocols, which are all
byte-based, and therefore the sender has to decide how to split the wide
character into bytes.  A last (but not least important) point is that
wide characters often require more storage space than a customized
byte-oriented character set.
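
One way a sender can make this decision explicit is to fix the byte
order in the code instead of copying the object representation.  The
sketch below writes a 32 bit value most significant byte first; the
function names @code{put_ucs4_be} and @code{get_ucs4_be} are
hypothetical and only serve this illustration.

```c
#include <stdint.h>

/* Store the 32-bit value WC in OUT most significant byte first
   (big-endian), independent of the byte order of the host.  */
void
put_ucs4_be (uint32_t wc, unsigned char out[4])
{
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) (wc >> 16);
  out[2] = (unsigned char) (wc >> 8);
  out[3] = (unsigned char) wc;
}

/* Rebuild the value from four big-endian bytes.  */
uint32_t
get_ucs4_be (const unsigned char in[4])
{
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Because only shifts are used, both machines agree on the transmitted
bytes regardless of their own endianness.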
 194
 195 @cindex multibyte character
This is why most of the time an external encoding which is different
from the internal encoding is used if the latter is UCS2 or UCS4.  The
external encoding is byte-based and can be chosen appropriately for the
environment and for the texts to be handled.  There exists a variety of
different character sets which can be used for this purpose, too many to
be handled completely here.  We restrict ourselves here to a description
of the major groups.  All of the ASCII-based character sets fulfill one
requirement: they are ``filesystem safe''.  This means that the
character @code{'/'} is used in the encoding @emph{only} to represent
itself.  Things are a bit different for character sets like EBCDIC but
if the operating system does not understand EBCDIC directly the
parameters to system calls have to be converted first anyhow.
 208
 209 @itemize @bullet
 210 @item
 211 The simplest character sets are one-byte character sets.  There can be
 212 only up to 256 characters (for @w{8 bit} character sets) which is not
 213 sufficient to cover all languages but might be sufficient to handle a
 214 specific text.  Another reason to choose this is because of constraints
 215 from interaction with other programs.
 216
 217 @cindex ISO 2022
 218 @item
 219 The @w{ISO 2022} standard defines a mechanism for extended character
 220 sets where one character @emph{can} be represented by more than one
 221 byte.  This is achieved by associating a state with the text.  Embedded
 222 in the text can be characters which can be used to change the state.
 223 Each byte in the text might have a different interpretation in each
 224 state.  The state might even influence whether a given byte stands for a
 225 character on its own or whether it has to be combined with some more
 226 bytes.
 227
 228 @cindex EUC
 229 @cindex SJIS
In most uses of @w{ISO 2022} the defined character sets do not allow
state changes which cover more than the next character.  This has the
big advantage that whenever one can identify the beginning of the byte
sequence of a character one can interpret a text correctly.  Examples of
character sets using this policy are the various EUC character sets
(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
or SJIS (Shift JIS, a Japanese encoding).
 237
 238 But there are also character sets using a state which is valid for more
 239 than one character and has to be changed by another byte sequence.
 240 Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
 241
 242 @item
 243 @cindex ISO 6937
Early attempts to fix 8 bit character sets for other languages using the
Roman alphabet led to character sets like @w{ISO 6937}.  Here bytes
representing characters like the acute accent do not produce output on
their own.  One has to combine them with other characters.  E.g., the
byte sequence @code{0xc2 0x61} (non-spacing acute accent, followed by
lower-case `a') produces the ``small a with acute'' character.  To get
the acute accent character on its own one has to write @code{0xc2 0x20}
(the non-spacing acute followed by a space).

This type of character set is quite frequently used in embedded
systems such as video text.
 255
 256 @item
 257 @cindex UTF-8
Instead of converting the Unicode or @w{ISO 10646} text used internally
it is often also sufficient to simply use an encoding different than
UCS2/UCS4.  The Unicode and @w{ISO 10646} standards even specify such an
encoding: UTF-8.  This encoding is able to represent all 31 bits of
@w{ISO 10646} in a byte string of length one to six.

@cindex UTF-7
There were a few other attempts to encode @w{ISO 10646} such as UTF-7
but UTF-8 is today the only encoding which should be used.  In fact,
UTF-8 will hopefully soon be the only external encoding which has to be
supported.  It proves to be universally usable and the only disadvantage
is that it favors Latin languages very much by making the byte string
representation of other scripts (Cyrillic, Greek, Asian scripts) longer
than necessary if using a specific character set for these scripts.  But
with methods like the Unicode compression scheme one can overcome these
problems and the ever growing memory and storage capacities do the rest.
 274 @end itemize
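
To make the UTF-8 scheme from the last item concrete, here is a minimal
sketch of an encoder for the original 31 bit definition, which produces
between one and six bytes.  The function name @code{utf8_encode} is
hypothetical, and the C library does not provide such a function under
this name.  Note that later revisions of the standards restrict UTF-8 to
code points up to U+10FFFF and therefore to at most four bytes; the
sketch deliberately keeps the longer forms.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the 31-bit value C in the original UTF-8 scheme.  Writes up
   to six bytes to OUT and returns the number written, or 0 if C does
   not fit in 31 bits.  */
size_t
utf8_encode (uint32_t c, unsigned char out[6])
{
  /* Largest value representable with 1, 2, ..., 6 bytes.  */
  static const uint32_t limit[6] =
    { 0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF };
  /* Marker bits of the first byte for each length.  */
  static const unsigned char lead[6] =
    { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
  size_t len, i;

  for (len = 1; len <= 6; ++len)
    if (c <= limit[len - 1])
      break;
  if (len > 6)
    return 0;

  /* Fill the continuation bytes from the end, six bits at a time.  */
  for (i = len - 1; i > 0; --i)
    {
      out[i] = (unsigned char) (0x80 | (c & 0x3F));
      c >>= 6;
    }
  out[0] = (unsigned char) (lead[len - 1] | c);
  return len;
}
```

ASCII values come out unchanged as a single byte, which is what makes
the encoding ``filesystem safe'' in the sense described above.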
 275
The question remaining now is: how to select the character set or
encoding to use.  The answer is mostly: you cannot decide about it
yourself, it is decided by the developers of the system or the majority
of the users.  Since the goal is interoperability one has to use
whatever the other people one works with use.  If there are no
constraints the selection is based on the requirements the expected
circle of users will have.  I.e., if a project is expected to only be
used in, say, Russia it is fine to use KOI8-R or a similar character
set.  But if at the same time people from, say, Greece are participating
one should use a character set which allows all people to collaborate.

The most general advice here is: go with the most general character set,
namely @w{ISO 10646}.  Use UTF-8 as the external encoding and problems
about users not being able to use their own language adequately are a
thing of the past.
 291
One final comment about the choice of the wide character representation
is necessary at this point.  We have said above that the natural choice
is using Unicode or @w{ISO 10646}.  This is not specified in any
standard, though.  The @w{ISO C} standard does not specify anything
about the representation of the @code{wchar_t} type.  There might be
systems where the developers decided differently.  Therefore one should
as much as possible avoid making assumptions about the wide character
representation although GNU systems will always work as described above.
If the programmer uses only the functions provided by the C library to
handle wide character strings there should not be any compatibility
problems with other systems.
 303
 304 @node Charset Function Overview
 305 @section Overview about Character Handling Functions
 306
A Unix @w{C library} contains three different sets of functions in two
families to handle character set conversion.  One of the function
families is specified in the @w{ISO C} standard and therefore is
portable even beyond the Unix world.

The most commonly known set of functions, coming from the @w{ISO C89}
standard, is unfortunately the least useful one.  In fact, these
functions should be avoided whenever possible, especially when
developing libraries (as opposed to applications).

The second family of functions got introduced in the early Unix standards
(XPG2) and is still part of the latest and greatest Unix standard:
@w{Unix 98}.  It is also the most powerful and useful set of functions.
But we will start with the functions defined in the second amendment to
@w{ISO C89}.
 322
 323 @node Restartable multibyte conversion
 324 @section Restartable Multibyte Conversion Functions
 325
 326 The @w{ISO C} standard defines functions to convert strings from a
 327 multibyte representation to wide character strings.  There are a number
 328 of peculiarities:
 329
 330 @itemize @bullet
 331 @item
 332 The character set assumed for the multibyte encoding is not specified
 333 as an argument to the functions.  Instead the character set specified by
 334 the @code{LC_CTYPE} category of the current locale is used; see
 335 @ref{Locale Categories}.
 336
 337 @item
The functions handling more than one character at a time require NUL
terminated strings as the argument.  I.e., converting blocks of text
does not work unless one can add a NUL byte at an appropriate place.
The GNU C library contains some extensions to the standard which allow
specifying a size but basically they also expect terminated strings.
 343 @end itemize
 344
Despite these limitations the @w{ISO C} functions can very well be used
in many contexts.  In graphical user interfaces, for instance, it is not
uncommon to have functions which require text to be displayed in a wide
character string if it is not simple ASCII.  The text itself might come
from a file with translations and of course the user should decide about
the current locale which determines the translation and therefore also
the external encoding used.  In such a situation (and many others) the
functions described here are perfect.  If more freedom while performing
the conversion is necessary take a look at the @code{iconv} functions
(@pxref{Generic Charset Conversion}).
 355
 356 @menu
 357 * Selecting the Conversion::     Selecting the conversion and its properties.
 358 * Keeping the state::            Representing the state of the conversion.
 359 * Converting a Character::       Converting Single Characters.
 360 * Converting Strings::           Converting Multibyte and Wide Character
 361                                   Strings.
 362 * Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
 363 @end menu
 364
 365 @node Selecting the Conversion
 366 @subsection Selecting the conversion and its properties
 367
We already said above that the currently selected locale for the
@code{LC_CTYPE} category decides about the conversion which is performed
by the functions we are about to describe.  Each locale uses its own
character set (given as an argument to @code{localedef}) and this is the
one assumed as the external multibyte encoding.  The wide character
set always is UCS4.  So we can see here already where the
limitations of these conversion functions are.

A characteristic of each multibyte character set is the maximum number
of bytes which can be necessary to represent one character.  This
information is quite important when writing code which uses the
conversion functions.  The examples below show how it is used.
The @w{ISO C} standard defines two macros which provide this information.
 381
 382
 383 @comment limits.h
 384 @comment ISO
 385 @deftypevr Macro int MB_LEN_MAX
 386 This macro specifies the maximum number of bytes in the multibyte
 387 sequence for a single character in any of the supported locales.  It is
 388 a compile-time constant and it is defined in @file{limits.h}.
 389 @pindex limits.h
 390 @end deftypevr
 391
 392 @comment stdlib.h
 393 @comment ISO
 394 @deftypevr Macro int MB_CUR_MAX
 395 @code{MB_CUR_MAX} expands into a positive integer expression that is the
 396 maximum number of bytes in a multibyte character in the current locale.
 397 The value is never greater than @code{MB_LEN_MAX}.  Unlike
 398 @code{MB_LEN_MAX} this macro need not be a compile-time constant and in
 399 fact, in the GNU C library it is not.
 400
 401 @pindex stdlib.h
 402 @code{MB_CUR_MAX} is defined in @file{stdlib.h}.
 403 @end deftypevr
 404
Two different macros are necessary since strictly @w{ISO C89} compilers
do not allow variable length array definitions but still it is desirable
to avoid dynamic allocation.  This incomplete piece of code shows the
problem:
 409
 410 @smallexample
 411 @{
 412   char buf[MB_LEN_MAX];
 413   ssize_t len = 0;
 414
 415   while (! feof (fp))
 416     @{
 417       fread (&buf[len], 1, MB_CUR_MAX - len, fp);
 418       /* @r{... process} buf */
 419       len -= used;
 420     @}
 421 @}
 422 @end smallexample
 423
 424 The code in the inner loop is expected to have always enough bytes in
 425 the array @var{buf} to convert one multibyte character.  The array
 426 @var{buf} has to be sized statically since many compilers do not allow a
 427 variable size.  The @code{fread} call makes sure that always
 428 @code{MB_CUR_MAX} bytes are available in @var{buf}.  Note that it is no
 429 problem if @code{MB_CUR_MAX} is not a compile-time constant.
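
A more complete version of this loop could look like the following
sketch.  It is only an illustration under stated assumptions: the
function name @code{count_mb_chars} is hypothetical, it uses
@code{mbrlen} (described later in this chapter) to find out how many
bytes the next character occupies, and it assumes an encoding without
redundant shift sequences, so that @code{MB_CUR_MAX} bytes always
suffice for one character.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters on FP.
   Returns (size_t) -1 on an invalid or truncated sequence.  */
size_t
count_mb_chars (FILE *fp)
{
  char buf[MB_LEN_MAX];         /* static size, enough for any locale */
  size_t len = 0;               /* bytes currently held in buf */
  size_t count = 0;
  mbstate_t state;

  memset (&state, '\0', sizeof (state));
  for (;;)
    {
      size_t used;

      /* Top the buffer up to MB_CUR_MAX bytes, a run-time value.  */
      len += fread (&buf[len], 1, MB_CUR_MAX - len, fp);
      if (len == 0)
        break;                  /* end of input */

      used = mbrlen (buf, len, &state);
      if (used == (size_t) -1 || used == (size_t) -2)
        return (size_t) -1;
      if (used == 0)
        used = 1;               /* the null character is one zero byte */

      ++count;
      len -= used;
      memmove (buf, &buf[used], len);
    }
  return count;
}
```

The array is sized with the compile-time constant @code{MB_LEN_MAX}
while the amount of data requested from @code{fread} depends on the
run-time value @code{MB_CUR_MAX}.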
 430
 431
 432 @node Keeping the state
 433 @subsection Representing the state of the conversion
 434
 435 @cindex stateful
In the introduction of this chapter it was said that certain character
sets use a @dfn{stateful} encoding.  I.e., the encoded values depend in
some way on the previous bytes in the text.
 439
 440 Since the conversion functions allow converting a text in more than one
 441 step we must have a way to pass this information from one call of the
 442 functions to another.
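
The need for such a state object can be seen in a toy example.  The
encoding below is invented for this sketch and is not a real character
set: the byte @code{0x0E} switches to an alternate set and @code{0x0F}
switches back, loosely modeled on the shift mechanism of @w{ISO 2022}.
All names (@code{toy_state}, @code{toy_decode}) are hypothetical.

```c
#include <stddef.h>

/* Toy encoding, invented for this sketch: the byte 0x0E switches to
   an "alternate" set, 0x0F switches back; every other byte is one
   character.  The shift state must survive between calls.  */
struct toy_state
{
  int alternate;                /* 0 = initial state */
};

/* Decode N bytes from S, storing one value per character in OUT:
   the byte itself in the initial state, the byte plus 0x100 in the
   alternate state.  Returns the number of characters stored.  */
size_t
toy_decode (const unsigned char *s, size_t n, struct toy_state *st,
            unsigned int *out)
{
  size_t i, nout = 0;

  for (i = 0; i < n; ++i)
    {
      if (s[i] == 0x0E)
        st->alternate = 1;
      else if (s[i] == 0x0F)
        st->alternate = 0;
      else
        out[nout++] = st->alternate ? s[i] + 0x100u : s[i];
    }
  return nout;
}
```

If a text is decoded in two pieces and the split falls after a shift
byte, the second call produces the correct values only because the
caller hands the same state object to both calls.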
 443
 444 @comment wchar.h
 445 @comment ISO
 446 @deftp {Data type} mbstate_t
 447 @cindex shift state
 448 A variable of type @code{mbstate_t} can contain all the information
 449 about the @dfn{shift state} needed from one call to a conversion
 450 function to another.
 451
 452 @pindex wchar.h
 453 This type is defined in @file{wchar.h}.  It got introduced in the second
 454 amendment to @w{ISO C89}.
 455 @end deftp
 456
 457 To use objects of this type the programmer has to define such objects
 458 (normally as local variables on the stack) and pass a pointer to the
 459 object to the conversion functions.  This way the conversion function
 460 can update the object if the current multibyte character set is
 461 stateful.
 462
 463 There is no specific function or initializer to put the state object in
 464 any specific state.  The rules are that the object should always
 465 represent the initial state before the first use and this is achieved by
 466 clearing the whole variable with code such as follows:
 467
 468 @smallexample
 469 @{
 470   mbstate_t state;
 471   memset (&state, '\0', sizeof (state));
 472   /* @r{from now on @var{state} can be used.}  */
 473   ...
 474 @}
 475 @end smallexample
 476
 477 When using the conversion functions to generate output it is often
 478 necessary to test whether current state corresponds to the initial
 479 state.  This is necessary, for example, to decide whether or not to emit
 480 escape sequences to set the state to the initial state at certain
 481 sequence points.  Communication protocols often require this.
 482
 483 @comment wchar.h
 484 @comment ISO
 485 @deftypefun int mbsinit (const mbstate_t *@var{ps})
This function determines whether the state object pointed to by @var{ps}
is in the initial state or not.  If @var{ps} is a null pointer or the
object is in the initial state the return value is nonzero.  Otherwise
it is zero.
 490
 491 @pindex wchar.h
 492 This function was introduced in the second amendment to @w{ISO C89} and
 493 is declared in @file{wchar.h}.
 494 @end deftypefun
 495
 496 Code using this function often looks similar to this:
 497
 498 @smallexample
 499 @{
 500   mbstate_t state;
 501   memset (&state, '\0', sizeof (state));
 502   /* @r{Use @var{state}.}  */
 503   ...
 504   if (! mbsinit (&state))
 505     @{
 506       /* @r{Emit code to return to initial state.}  */
 507       fputs ("@r{whatever needed}", fp);
 508     @}
 509   ...
 510 @}
 511 @end smallexample
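
A runnable variant of this pattern is sketched below.  The helper name
@code{ends_in_initial_state} is hypothetical; it steps through a buffer
with @code{mbrlen} (described later in this chapter) and then asks
@code{mbsinit} whether the conversion state is back in the initial
state.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: step through the N bytes at S one character
   at a time and report whether the conversion ends in the initial
   state.  Returns 1 if so, 0 if not, -1 on a conversion error.  */
int
ends_in_initial_state (const char *s, size_t n)
{
  mbstate_t state;

  memset (&state, '\0', sizeof (state));
  while (n > 0)
    {
      size_t used = mbrlen (s, n, &state);

      if (used == (size_t) -1)
        return -1;
      if (used == (size_t) -2)
        break;                  /* input ends inside a character */
      if (used == 0)
        used = 1;               /* null character */
      s += used;
      n -= used;
    }
  return mbsinit (&state) != 0;
}
```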
 512
 513 @node Converting a Character
 514 @subsection Converting Single Characters
 515
The most fundamental of the conversion functions are those dealing with
single characters.  Please note that this does not always mean single
bytes.  But since there is very often a subset of the multibyte
character set which consists of single byte sequences there are
functions to help with converting bytes.  One very important and often
applicable scenario is where ASCII is a subset of the multibyte
character set.  I.e., all ASCII characters stand for themselves and all
other characters have at least a first byte which is beyond the range
@math{0} to @math{127}.
 525
 526 @comment wchar.h
 527 @comment ISO
 528 @deftypefun wint_t btowc (int @var{c})
 529 The @code{btowc} function (``byte to wide character'') converts a valid
 530 single byte character in the initial shift state into the wide character
 531 equivalent using the conversion rules from the currently selected locale
 532 of the @code{LC_CTYPE} category.
 533
If @code{(unsigned char) @var{c}} is not a valid single byte multibyte
character or if @var{c} is @code{EOF} the function returns @code{WEOF}.
 536
 537 Please note the restriction of @var{c} being tested for validity only in
 538 the initial shift state.  There is no @code{mbstate_t} object used from
 539 which the state information is taken and the function also does not use
 540 any static state.
 541
 542 @pindex wchar.h
 543 This function was introduced in the second amendment of @w{ISO C89} and
 544 is declared in @file{wchar.h}.
 545 @end deftypefun
 546
Despite the limitation that the single byte value always is interpreted
in the initial state this function is actually useful most of the time.
Most character sets are either entirely single-byte character sets or
they are extensions to ASCII.  It is then possible to write code like
this (not that this specific example is very useful):
 552
 553 @smallexample
 554 wchar_t *
 555 itow (unsigned long int val)
 556 @{
 557   static wchar_t buf[30];
 558   wchar_t *wcp = &buf[29];
 559   *wcp = L'\0';
 560   while (val != 0)
 561     @{
 562       *--wcp = btowc ('0' + val % 10);
 563       val /= 10;
 564     @}
 565   if (wcp == &buf[29])
 566     *--wcp = btowc ('0');
 567   return wcp;
 568 @}
 569 @end smallexample
 570
Why is it necessary to use such a complicated implementation and not
simply cast @code{'0' + val % 10} to a wide character?  The answer
is that there is no guarantee that the compiler knows about the wide
character set used at runtime.  Even if on a given system the wide
character equivalent of a single-byte character is simply the result of
casting that character to @code{wchar_t}, there is no guarantee that
this is the case everywhere.
 578
 579 There also is a function for the conversion in the other direction.
 580
 581 @comment wchar.h
 582 @comment ISO
 583 @deftypefun int wctob (wint_t @var{c})
 584 The @code{wctob} function (``wide character to byte'') takes as the
 585 parameter a valid wide character.  If the multibyte representation for
 586 this character in the initial state is exactly one byte long the return
 587 value of this function is this character.  Otherwise the return value is
 588 @code{EOF}.
 589
 590 @pindex wchar.h
 591 This function was introduced in the second amendment of @w{ISO C89} and
 592 is declared in @file{wchar.h}.
 593 @end deftypefun
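
The two functions can be combined as in the following sketch.  The
helper names @code{is_single_byte} and @code{byte_round_trips} are
hypothetical; both only illustrate how the @code{EOF} and @code{WEOF}
return values are meant to be tested.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: nonzero if the wide character WC has a
   one-byte multibyte representation in the initial shift state.  */
int
is_single_byte (wint_t wc)
{
  return wctob (wc) != EOF;
}

/* Hypothetical helper: nonzero if byte value C survives the round
   trip byte -> wide character -> byte unchanged.  */
int
byte_round_trips (unsigned char c)
{
  wint_t wc = btowc (c);

  return wc != WEOF && wctob (wc) == (int) c;
}
```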
 594
There are more general functions to convert single characters from
multibyte representation to wide characters and vice versa.  These
functions pose no limit on the length of the multibyte representation
and they also do not require it to be in the initial state.
 599
 600 @comment wchar.h
 601 @comment ISO
 602 @deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
 603 @cindex stateful
The @code{mbrtowc} function (``multibyte restartable to wide
character'') converts the next multibyte character in the string pointed
to by @var{s} into a wide character and stores it in the wide character
string pointed to by @var{pwc}.  The conversion is performed according
to the locale currently selected for the @code{LC_CTYPE} category.  If
the character set for the locale is stateful the multibyte string is
interpreted in the state represented by the object pointed to by
@var{ps}.  If @var{ps} is a null pointer a static, internal state
variable used only by the @code{mbrtowc} function is used.

If the next multibyte character corresponds to the NUL wide character
the return value of the function is @math{0} and the state object is
afterwards in the initial state.  If the next @var{n} or fewer bytes
form a correct multibyte character the return value is the number of
bytes starting from @var{s} which form the multibyte character.  The
conversion state is updated according to the bytes consumed in the
conversion.  In both cases the wide character (either the @code{L'\0'}
or the one found in the conversion) is stored in the string pointed to
by @var{pwc} if and only if @var{pwc} is not null.

If the first @var{n} bytes of the multibyte string possibly form a valid
multibyte character but there are more than @var{n} bytes needed to
complete it the return value of the function is @code{(size_t) -2} and
no value is stored.  Please note that this can happen even if @var{n}
has a value greater than or equal to @code{MB_CUR_MAX} since the input
might contain redundant shift sequences.

If the first @var{n} bytes of the multibyte string cannot possibly
form a valid multibyte character no value is stored either, the global
variable @code{errno} is set to the value @code{EILSEQ} and the function
returns @code{(size_t) -1}.  The conversion state is afterwards undefined.
 635
 636 @pindex wchar.h
 637 This function was introduced in the second amendment to @w{ISO C89} and
 638 is declared in @file{wchar.h}.
 639 @end deftypefun
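
The four kinds of return value can be told apart as in the following
sketch.  The enumeration @code{decode_status} and the wrapper
@code{decode_one} are hypothetical names introduced for this
illustration only.

```c
#include <wchar.h>

enum decode_status { DEC_END, DEC_CHAR, DEC_INCOMPLETE, DEC_INVALID };

/* Hypothetical wrapper: decode one character from the N bytes at S.
   On DEC_CHAR or DEC_END the wide character is stored in *WC; on
   DEC_CHAR *USED is the number of bytes consumed.  */
enum decode_status
decode_one (const char *s, size_t n, mbstate_t *ps,
            wchar_t *wc, size_t *used)
{
  size_t r = mbrtowc (wc, s, n, ps);

  *used = 0;
  if (r == (size_t) -1)
    return DEC_INVALID;         /* errno is EILSEQ, *ps is undefined */
  if (r == (size_t) -2)
    return DEC_INCOMPLETE;      /* more than N bytes are needed */
  if (r == 0)
    return DEC_END;             /* null wide character stored */
  *used = r;
  return DEC_CHAR;
}
```

The two special values must be checked before the result is used as a
byte count, since both are very large when interpreted as
@code{size_t}.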
 640
Using this function is straightforward.  A function which copies a
multibyte string into a wide character string while at the same time
converting all lowercase characters into uppercase could look like this
(this is not the final version, just an example; its error handling is
minimal):

@smallexample
wchar_t *
mbstouwcs (const char *s)
@{
  /* Include the terminating null byte in the conversion.  */
  size_t len = strlen (s) + 1;
  wchar_t *result = malloc (len * sizeof (wchar_t));
  wchar_t *wcp = result;
  wchar_t tmp[1];
  mbstate_t state;
  size_t nbytes;

  if (result == NULL)
    return NULL;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
    @{
      if (nbytes >= (size_t) -2)
        @{
          /* Invalid or truncated input string.  */
          free (result);
          return NULL;
        @}
      *wcp++ = towupper (tmp[0]);
      len -= nbytes;
      s += nbytes;
    @}
  /* mbrtowc returned 0: the null character was reached.  */
  *wcp = L'\0';
  return result;
@}
@end smallexample
 670
The use of @code{mbrtowc} should be clear.  A single wide character is
stored in @code{@var{tmp}[0]} and the number of consumed bytes is stored
in the variable @var{nbytes}.  In case the conversion was successful
the uppercase variant of the wide character is stored in the
@var{result} array and the pointer to the input string and the number of
available bytes are adjusted.

The only non-obvious thing about the function might be the way memory is
allocated for the result.  The above code uses the fact that there can
never be more wide characters in the converted results than there are
bytes in the multibyte input string.  This method yields a
pessimistic guess about the size of the result and if many wide
character strings have to be constructed this way or the strings are
long, the extra memory required to store the wide character strings
might be significant.  It would of course be possible to resize the
allocated memory block to the correct size before returning it.  A
better solution might be to allocate just the right amount of space for
the result right away.  Unfortunately there is no function to compute
the length of the wide character string directly from the multibyte
string.  But there is a function which does part of the work.
 691
 692 @comment wchar.h
 693 @comment ISO
@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
The @code{mbrlen} function (``multibyte restartable length'') examines
at most @var{n} bytes starting at @var{s} and computes the number of
bytes which form the next valid and complete multibyte character.
 698
 699 If the next multibyte character corresponds to the NUL wide character
 700 the return value is @math{0}.  If the next @var{n} bytes form a valid
 701 multibyte character the number of bytes belonging to this multibyte
 702 character byte sequence is returned.
 703
If the first @var{n} bytes possibly form a valid multibyte
character but it is incomplete the return value is @code{(size_t) -2}.
Otherwise the multibyte character sequence is invalid and the return
value is @code{(size_t) -1}.

The multibyte sequence is interpreted in the state represented by the
object pointed to by @var{ps}.  If @var{ps} is a null pointer, a state
object local to @code{mbrlen} is used.
 712
 713 @pindex wchar.h
This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
 716 @end deftypefun
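
The return values described above can be put to work in a small helper.
The following is only a sketch: @code{complete_prefix} is a hypothetical
name, not a function of the GNU C library.  It reports how many leading
bytes of a buffer form complete multibyte characters, which is useful
when a buffer was filled by a read that may have cut a character in half.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not part of the GNU C library).
   Return the number of bytes at the start of BUF[0..N) which form
   complete multibyte characters.  An incomplete character at the
   end of the buffer is not counted.  Return (size_t) -1 if an
   invalid byte sequence is found.  */
size_t
complete_prefix (const char *buf, size_t n)
{
  mbstate_t state;
  size_t done = 0;

  /* Start in the initial conversion state.  */
  memset (&state, '\0', sizeof (state));
  while (done < n)
    {
      size_t nbytes = mbrlen (buf + done, n - done, &state);
      if (nbytes == (size_t) -1)
        /* Invalid multibyte sequence.  */
        return (size_t) -1;
      if (nbytes == (size_t) -2)
        /* The buffer ends in the middle of a character.  */
        break;
      if (nbytes == 0)
        /* A NUL byte; it occupies one byte of the buffer.  */
        nbytes = 1;
      done += nbytes;
    }
  return done;
}
```

Note that the embedded NUL byte needs special treatment because
@code{mbrlen} returns @math{0} for it, not the number of bytes consumed.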
 717
The attentive reader will of course note that @code{mbrlen} can be
implemented as
 720
 721 @smallexample
 722 mbrtowc (NULL, s, n, ps != NULL ? ps : &internal)
 723 @end smallexample
 724
 725 This is true and in fact is mentioned in the official specification.
 726 Now, how can this function be used to determine the length of the wide
 727 character string created from a multibyte character string?  It is not
 728 directly usable but we can define a function @code{mbslen} using it:
 729
 730 @smallexample
 731 size_t
 732 mbslen (const char *s)
 733 @{
 734   mbstate_t state;
 735   size_t result = 0;
 736   size_t nbytes;
 737   memset (&state, '\0', sizeof (state));
 738   while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
 739     @{
 740       if (nbytes >= (size_t) -2)
 741         /* @r{Something is wrong.}  */
 742         return (size_t) -1;
 743       s += nbytes;
 744       ++result;
 745     @}
 746   return result;
 747 @}
 748 @end smallexample
 749
This function simply calls @code{mbrlen} for each multibyte character
in the string and counts the number of function calls.  Please note that
we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen}
call.  This is OK since a) this value is at least as large as the length
of the longest multibyte character sequence and b) we know that the
string @var{s} ends with a NUL byte which cannot be part of any other
multibyte character sequence but the one representing the NUL wide
character.  Therefore the @code{mbrlen} function will never read invalid
memory.
 759
Now that this function is available (just to make this clear, this
function is @emph{not} part of the GNU C library) we can compute the
number of bytes required to store the wide character string converted
from the multibyte character string @var{s} using
 764
 765 @smallexample
 766 wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t);
 767 @end smallexample
 768
Please note that the @code{mbslen} function is quite inefficient.  An
implementation of @code{mbstouwcs} based on @code{mbslen} would
have to perform the conversion of the multibyte character input string
twice and this conversion might be quite expensive.  So it is necessary
to think about the consequences of using the easier but imprecise method
before doing the work twice.
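
To make the trade-off concrete, here is a sketch of what the two-pass
variant might look like.  The name @code{mbstouwcs_exact} is hypothetical
and @code{mbslen} is repeated from above only to keep the sketch
self-contained; neither function is part of the GNU C library.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* The counting function from the text, repeated here so that
   the sketch is self-contained.  */
size_t
mbslen (const char *s)
{
  mbstate_t state;
  size_t result = 0;
  size_t nbytes;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        /* Something is wrong.  */
        return (size_t) -1;
      s += nbytes;
      ++result;
    }
  return result;
}

/* Hypothetical two-pass variant of mbstouwcs: the first pass
   counts the characters, the second pass converts them into a
   block of exactly the right size.  */
wchar_t *
mbstouwcs_exact (const char *s)
{
  size_t nchars = mbslen (s);
  if (nchars == (size_t) -1)
    return NULL;

  wchar_t *result = malloc ((nchars + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  mbstate_t state;
  memset (&state, '\0', sizeof (state));
  for (size_t i = 0; i < nchars; ++i)
    {
      wchar_t wc;
      size_t nbytes = mbrtowc (&wc, s, MB_LEN_MAX, &state);
      /* This should not happen after a successful first pass,
         but check anyway.  */
      if (nbytes == 0 || nbytes >= (size_t) -2)
        {
          free (result);
          return NULL;
        }
      result[i] = towupper (wc);
      s += nbytes;
    }
  result[nchars] = L'\0';
  return result;
}
```

Every input byte is decoded twice here, which is exactly the cost
discussed above.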
 775
 776 @comment wchar.h
 777 @comment ISO
 778 @deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps})
 779 The @code{wcrtomb} function (``wide character restartable to
 780 multibyte'') converts a single wide character into a multibyte string
 781 corresponding to that wide character.
 782
If @var{s} is a null pointer the function resets the state stored in the
object pointed to by @var{ps} to the initial state.  This can also be
achieved by a call like this:

@smallexample
wcrtomb (temp_buf, L'\0', ps)
@end smallexample

@noindent
since when @var{s} is a null pointer @code{wcrtomb} performs as if it
writes into an internal buffer which is guaranteed to be large enough.

If @var{wc} is the NUL wide character @code{wcrtomb} emits, if
necessary, a shift sequence to get the state @var{ps} into the initial
state, followed by a single NUL byte which is stored in the string @var{s}.
 798
Otherwise a byte sequence (possibly including shift sequences) is
written into the string @var{s}.  This of course only happens if
@var{wc} is a valid wide character, i.e., it has a multibyte
representation in the character set selected by the @code{LC_CTYPE}
category of the locale.  If @var{wc} is not a valid wide character
nothing is stored in the string @var{s}, @code{errno} is set to
@code{EILSEQ}, the conversion state in @var{ps} is undefined and the
return value is @code{(size_t) -1}.

If no error occurred the function returns the number of bytes stored in
the string @var{s}.  This includes all bytes representing shift
sequences.
 811
 812 One word about the interface of the function: there is no parameter
 813 specifying the length of the array @var{s}.  Instead the function
 814 assumes that there are at least @code{MB_CUR_MAX} bytes available since
 815 this is the maximum length of any byte sequence representing a single
 816 character.  So the caller has to make sure that there is enough space
 817 available, otherwise buffer overruns can occur.
 818
 819 @pindex wchar.h
This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
 822 @end deftypefun
 823
 824 Using this function is as easy as using @code{mbrtowc}.  The following
 825 example appends a wide character string to a multibyte character string.
 826 Again, the code is not really useful, it is simply here to demonstrate
 827 the use and some problems.
 828
 829 @smallexample
 830 char *
 831 mbscatwc (char *s, size_t len, const wchar_t *ws)
 832 @{
 833   mbstate_t state;
 834   char *wp = strchr (s, '\0');
 835   len -= wp - s;
 836   memset (&state, '\0', sizeof (state));
 837   do
 838     @{
 839       size_t nbytes;
      if (len < MB_CUR_MAX)
 841         @{
 842           /* @r{We cannot guarantee that the next}
 843              @r{character fits into the buffer, so}
 844              @r{return an error.}  */
 845           errno = E2BIG;
 846           return NULL;
 847         @}
 848       nbytes = wcrtomb (wp, *ws, &state);
 849       if (nbytes == (size_t) -1)
 850         /* @r{Error in the conversion.}  */
 851         return NULL;
 852       len -= nbytes;
 853       wp += nbytes;
 854     @}
 855   while (*ws++ != L'\0');
 856   return s;
 857 @}
 858 @end smallexample
 859
 860 First the function has to find the end of the string currently in the
 861 array @var{s}.  The @code{strchr} call does this very efficiently since a
 862 requirement for multibyte character representations is that the NUL byte
 863 never is used except to represent itself (and in this context, the end
 864 of the string).
 865
After initializing the state object the loop is entered where the first
task is to make sure there is enough room in the array @var{s}.  We
abort if there are not at least @code{MB_CUR_MAX} bytes available.  This
is not always optimal but we have no other choice.  We might have fewer
than @code{MB_CUR_MAX} bytes available but the next multibyte character
might also be only one byte long.  At the time the @code{wcrtomb} call
returns it is too late to decide whether the buffer was large enough or
not.  If this solution is really unsuitable there is a very slow but
more accurate solution.
 875
@smallexample
  ...
  if (len < MB_CUR_MAX)
    @{
      mbstate_t temp_state;
      char temp_buf[MB_LEN_MAX];
      memcpy (&temp_state, &state, sizeof (state));
      if (wcrtomb (temp_buf, *ws, &temp_state) > len)
        @{
          /* @r{The next character does not fit into}
             @r{the buffer, so return an error.}  */
          errno = E2BIG;
          return NULL;
        @}
    @}
  ...
@end smallexample

Here we first perform the conversion into a scratch buffer, which cannot
overflow, so that we are afterwards in the position to make an exact
decision about the buffer size.  Please note that the destination of the
new @code{wcrtomb} call must not be a null pointer: as described
above, @code{wcrtomb} would then ignore @var{wc} and behave as if the NUL
wide character were converted.  The most unusual thing about this piece
of code certainly is the duplication of the conversion state object.
But think about it: if a change of the state is necessary to emit the
next multibyte character we want to have the same shift state change
performed in the real conversion.  Therefore we have to preserve the
initial shift state information.
 905
 906 There are certainly many more and even better solutions to this problem.
 907 This example is only meant for educational purposes.
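
One such alternative, given here only as a sketch, is to always convert
into a scratch buffer of @code{MB_LEN_MAX} bytes and copy the bytes into
the destination once it is known that they fit.  The name
@code{mbscatwc_scratch} is hypothetical.  Unlike the version above, this
sketch also restores the original string when it has to give up.

```c
#include <errno.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical variant of mbscatwc.  S is an array of LEN bytes
   holding a NUL-terminated multibyte string; WS is appended to
   it.  Return S on success.  On failure return NULL with errno
   set and S truncated back to its original contents.  */
char *
mbscatwc_scratch (char *s, size_t len, const wchar_t *ws)
{
  mbstate_t state;
  char *end = strchr (s, '\0');
  char *wp = end;

  len -= wp - s;
  memset (&state, '\0', sizeof (state));
  do
    {
      /* wcrtomb stores at most MB_CUR_MAX bytes, which never
         exceeds MB_LEN_MAX.  */
      char scratch[MB_LEN_MAX];
      size_t nbytes = wcrtomb (scratch, *ws, &state);
      if (nbytes == (size_t) -1)
        {
          /* Error in the conversion; errno is EILSEQ.  */
          *end = '\0';
          return NULL;
        }
      if (nbytes > len)
        {
          /* Now we know for sure the character does not fit.  */
          *end = '\0';
          errno = E2BIG;
          return NULL;
        }
      memcpy (wp, scratch, nbytes);
      len -= nbytes;
      wp += nbytes;
    }
  while (*ws++ != L'\0');
  return s;
}
```

Because the real conversion state is advanced only once per character,
no copy of the state object is needed in this variant.  The price is one
extra @code{memcpy} per character.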
 908
 909 @node Converting Strings
 910 @subsection Converting Multibyte and Wide Character Strings
 911
The functions described in the previous section only convert a single
character at a time.  Most operations to be performed in real-world
programs involve strings and therefore the @w{ISO C} standard also
defines conversions on entire strings.  The defined set of functions is
quite limited, though.  Therefore the GNU C library contains a few
extensions which are necessary in some important situations.
 918
 919 @comment wchar.h
 920 @comment ISO
 921 @deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
The @code{mbsrtowcs} function (``multibyte string restartable to wide
character string'') converts a NUL-terminated multibyte character
string at @code{*@var{src}} into an equivalent wide character string,
including the NUL wide character at the end.  The conversion is started
using the state information from the object pointed to by @var{ps} or
from an internal object of @code{mbsrtowcs} if @var{ps} is a null
pointer.  Before returning, the state object is updated to match the
state after the last converted character.  The state is the initial
state if the terminating NUL byte is reached and converted.
 931
 932 If @var{dst} is not a null pointer the result is stored in the array
 933 pointed to by @var{dst}, otherwise the conversion result is not
 934 available since it is stored in an internal buffer.
 935
 936 If @var{len} wide characters are stored in the array @var{dst} before
 937 reaching the end of the input string the conversion stops and @var{len}
 938 is returned.  If @var{dst} is a null pointer @var{len} is never checked.
 939
 940 Another reason for a premature return from the function call is if the
 941 input string contains an invalid multibyte sequence.  In this case the
 942 global variable @code{errno} is set to @code{EILSEQ} and the function
 943 returns @code{(size_t) -1}.
 944
 945 @c XXX The ISO C9x draft seems to have a problem here.  It says that PS
 946 @c is not updated if DST is NULL.  This is not said straight forward and
 947 @c none of the other functions is described like this.  It would make sense
 948 @c to define the function this way but I don't think it is meant like this.
 949
 950 In all other cases the function returns the number of wide characters
 951 converted during this call.  If @var{dst} is not null @code{mbsrtowcs}
 952 stores in the pointer pointed to by @var{src} a null pointer (if the NUL
 953 byte in the input string was reached) or the address of the byte
 954 following the last converted multibyte character.
 955
 956 @pindex wchar.h
This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
 959 @end deftypefun
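
The behavior for a null @var{dst} makes it possible to determine the
size of the result before allocating it.  The following sketch uses this;
@code{mbs_to_wcs_dup} is a hypothetical name, and the function assumes
that the input starts in the initial shift state.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return a newly allocated wide character
   copy of the multibyte string S, or NULL on error.  */
wchar_t *
mbs_to_wcs_dup (const char *s)
{
  mbstate_t state;
  const char *src = s;

  /* First call: DST is a null pointer, so nothing is stored and
     the return value is the number of wide characters needed,
     not counting the terminating NUL wide character.  */
  memset (&state, '\0', sizeof (state));
  size_t n = mbsrtowcs (NULL, &src, 0, &state);
  if (n == (size_t) -1)
    /* Invalid multibyte sequence; errno is EILSEQ.  */
    return NULL;

  wchar_t *result = malloc ((n + 1) * sizeof (wchar_t));
  if (result == NULL)
    return NULL;

  /* Second call: convert for real, again from the initial state.
     The limit N + 1 leaves room for the terminator.  */
  src = s;
  memset (&state, '\0', sizeof (state));
  if (mbsrtowcs (result, &src, n + 1, &state) == (size_t) -1)
    {
      free (result);
      return NULL;
    }
  return result;
}
```

As with @code{mbslen}, the input is decoded twice, so this trades time
for an exactly sized result.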
 960
The definition of this function has one limitation which has to be
understood.  The requirement that the source string @code{*@var{src}}
has to be NUL-terminated causes problems if one wants to convert buffers
with text.  A buffer is normally not a collection of NUL-terminated
strings but instead a continuous collection of lines, separated by
newline characters.  Now assume a function to convert one line from a
buffer is needed.  Since the line is not NUL-terminated the source
pointer cannot directly point into the unmodified text buffer.  This
means, either one inserts the NUL byte at the appropriate place for the
time of the @code{mbsrtowcs} function call (which is not doable for a
read-only buffer or in a multi-threaded application) or one copies the
line into an extra buffer where it can be terminated by a NUL byte.
Note that it is not in general possible to limit the number of
characters to convert by setting the parameter @var{len} to any specific
value.  Since it is not known how many bytes each multibyte character
sequence is in length one could only guess.
 977
 978 @cindex stateful
 979 There is still a problem with the method of NUL-terminating a line right
 980 after the newline character which could lead to very strange results.
As said in the description of the @code{mbsrtowcs} function above the
 982 conversion state is guaranteed to be in the initial shift state after
 983 processing the NUL byte at the end of the input string.  But this NUL
 984 byte is not really part of the text.  I.e., the conversion state after
 985 the newline in the original text could be something different than the
 986 initial shift state and therefore the first character of the next line
 987 is encoded using this state.  But the state in question is never
 988 accessible to the user since the conversion stops after the NUL byte.
Fortunately most stateful character sets in use today require that the
shift state after a newline is the initial state but this is not
guaranteed.  Therefore simply NUL-terminating a piece of a running text
is not always an adequate solution.
 993
 994 The generic conversion
 995 @comment XXX reference to iconv
 996 interface does not have this limitation (it simply works on buffers, not
strings) but there is another way.  The GNU C library contains a set of
functions which take an additional parameter specifying the maximal
number of bytes which are consumed from the input string.  This way the
problem of the example above could be solved by determining the line
length and passing this length to the function.
1002
1003 @comment wchar.h
1004 @comment ISO
1005 @deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps})
1006 The @code{wcsrtombs} function (``wide character string restartable to
1007 multibyte string'') converts the NUL terminated wide character string at
1008 @code{*@var{src}} into an equivalent multibyte character string and
1009 stores the result in the array pointed to by @var{dst}.  The NUL wide
character is also converted.  The conversion starts in the state
described in the object pointed to by @var{ps} or by a state object
local to @code{wcsrtombs} in case @var{ps} is a null pointer.  If
1013 @var{dst} is a null pointer the conversion is performed as usual but the
1014 result is not available.  If all characters of the input string were
1015 successfully converted and if @var{dst} is not a null pointer the
1016 pointer pointed to by @var{src} gets assigned a null pointer.
1017
1018 If one of the wide characters in the input string has no valid multibyte
1019 character equivalent the conversion stops early, sets the global
1020 variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}.
1021
Another reason for a premature stop is if @var{dst} is not a null
pointer and the next converted character would require more than
@var{len} bytes in total to be stored in the array @var{dst}.  In this
case the pointer pointed to by @var{src} is
assigned a value pointing to the wide character right after the last one
successfully converted.
1028
1029 Except in the case of an encoding error the return value of the function
1030 is the number of bytes in all the multibyte character sequences stored
1031 in @var{dst}.  Before returning the state in the object pointed to by
1032 @var{ps} (or the internal object in case @var{ps} is a null pointer) is
1033 updated to reflect the state after the last conversion.  The state is
1034 the initial shift state in case the terminating NUL wide character was
1035 converted.
1036
1037 @pindex wchar.h
This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
1040 @end deftypefun
1041
The restriction mentioned above for the @code{mbsrtowcs} function also
applies here.  There is no possibility to directly control the number of
input characters.  One has to place the NUL wide character at the
correct place or control the consumed input indirectly via the available
output array size (the @var{len} parameter).
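
Controlling the consumed input via the output size can be sketched as
follows.  The function converts a wide character string in chunks of
@code{MB_LEN_MAX} bytes and relies on @code{wcsrtombs} to advance the
source pointer after each chunk.  The name @code{wcs_to_mbs_chunked} and
the chunk size are assumptions made for this illustration only.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert WS into a newly allocated
   multibyte string, one small chunk at a time.  Return NULL on
   error.  */
char *
wcs_to_mbs_chunked (const wchar_t *ws)
{
  /* Large enough for at least one multibyte character.  */
  char chunk[MB_LEN_MAX];
  char *result = NULL;
  size_t used = 0;
  mbstate_t state;

  memset (&state, '\0', sizeof (state));
  /* wcsrtombs sets WS to a null pointer once the terminating
     NUL wide character has been converted.  */
  while (ws != NULL)
    {
      size_t n = wcsrtombs (chunk, &ws, sizeof (chunk), &state);
      if (n == (size_t) -1)
        {
          /* A wide character has no multibyte equivalent.  */
          free (result);
          return NULL;
        }
      /* Make room for the new bytes and a terminating NUL.  */
      char *grown = realloc (result, used + n + 1);
      if (grown == NULL)
        {
          free (result);
          return NULL;
        }
      result = grown;
      memcpy (result + used, chunk, n);
      used += n;
    }
  result[used] = '\0';
  return result;
}
```

The same state object is used for all chunks, so a shift state that is
still active at the end of one chunk carries over into the next one.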
1047
1048 @comment wchar.h
1049 @comment GNU
1050 @deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps})
1051 The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs}
1052 function.  All the parameters are the same except for @var{nmc} which is
1053 new.  The return value is the same as for @code{mbsrtowcs}.
1054
This new parameter specifies how many bytes at most can be used from the
multibyte character string.  I.e., the multibyte character string
@code{*@var{src}} need not be NUL-terminated.  But if a NUL byte is
found within the first @var{nmc} bytes of the string the conversion
stops there.
1060
This function is a GNU extension.  It is meant to work around the
problems mentioned above.  Now it is possible to convert a buffer with
multibyte character text piece by piece without having to care about
inserting NUL bytes and the effect of NUL bytes on the conversion state.
1065 @end deftypefun
1066
A function to convert a multibyte string into a wide character string
and display it could be written like this (this is not a really useful
example):
1070
@smallexample
void
showmbs (const char *src, FILE *fp)
@{
  mbstate_t state;
  int cnt = 0;
  memset (&state, '\0', sizeof (state));
  while (1)
    @{
      wchar_t linebuf[100];
      const char *endp = strchr (src, '\n');
      size_t n;

      /* @r{Exit if there is no more line.}  */
      if (endp == NULL)
        break;

      n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state);
      /* @r{Exit on an invalid multibyte sequence.}  */
      if (n == (size_t) -1)
        break;
      linebuf[n] = L'\0';
      fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf);

      /* @r{Skip the newline once the whole line is converted.}  */
      if (src == endp)
        ++src;
    @}
@}
@end smallexample
1094
1095 There is no more problem with the state after a call to
1096 @code{mbsnrtowcs}.  Since we don't insert characters in the strings
1097 which were not in there right from the beginning and we use @var{state}
1098 only for the conversion of the given buffer there is no problem with
1099 mixing the state up.
1100
1101 @comment wchar.h
1102 @comment GNU
1103 @deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps})
1104 The @code{wcsnrtombs} function implements the conversion from wide
1105 character strings to multibyte character strings.  It is similar to
1106 @code{wcsrtombs} but it takes, just like @code{mbsnrtowcs}, an extra
1107 parameter which specifies the length of the input string.
1108
No more than @var{nwc} wide characters from the input string
@code{*@var{src}} are converted.  If the input string contains a NUL
wide character in the first @var{nwc} characters the conversion stops at
this place.

This function is a GNU extension and just like @code{mbsnrtowcs}
helps in situations where no NUL-terminated input strings are available.
1116 @end deftypefun
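
As an illustration, the following sketch converts only the first
@var{nwc} wide characters of an array which need not be NUL-terminated.
The name @code{wslice_to_mbs} is hypothetical.  The feature test macro is
defined because the function is described here as a GNU extension.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE 1   /* Make wcsnrtombs visible.  */
#endif
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert at most NWC wide characters from
   WS into the array DST of DSTLEN bytes and NUL-terminate the
   result.  Return the number of bytes stored, not counting the
   terminator, or (size_t) -1 on error.  */
size_t
wslice_to_mbs (char *dst, size_t dstlen, const wchar_t *ws, size_t nwc)
{
  mbstate_t state;
  const wchar_t *src = ws;

  if (dstlen == 0)
    return (size_t) -1;

  memset (&state, '\0', sizeof (state));
  /* Keep one byte in reserve for the terminating NUL byte.  */
  size_t n = wcsnrtombs (dst, &src, nwc, dstlen - 1, &state);
  if (n == (size_t) -1)
    return (size_t) -1;
  dst[n] = '\0';
  return n;
}
```

With a stateful character set the output of such a partial conversion is
not necessarily back in the initial shift state; the sketch ignores this.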
1117
1118
1119 @node Multibyte Conversion Example
1120 @subsection A Complete Multibyte Conversion Example
1121
1122 The example programs given in the last sections are only brief and do
1123 not contain all the error checking etc.  Therefore here comes a complete
1124 and documented example.  It features the @code{mbrtowc} function but it
1125 should be easy to derive versions using the other functions.
1126
@smallexample
int
file_mbsrtowcs (int input, int output)
@{
  char buffer[BUFSIZ];
  mbstate_t state;
  int eof = 0;

  /* @r{Initialize the state.}  */
  memset (&state, '\0', sizeof (state));

  while (!eof)
    @{
      ssize_t nread;
      ssize_t nwrite;
      size_t filled;
      int invalid = 0;
      char *inp = buffer;
      wchar_t outbuf[BUFSIZ];
      wchar_t *outp = outbuf;

      /* @r{Fill up the buffer from the input file.}  */
      nread = read (input, buffer, BUFSIZ);
      if (nread < 0)
        @{
          perror ("read");
          return 0;
        @}
      /* @r{If we reach end of file, make a note to read no more.} */
      if (nread == 0)
        eof = 1;

      /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
      filled = nread;

      /* @r{Convert those bytes to wide characters--as many as we can.} */
      while (filled > 0)
        @{
          size_t thislen = mbrtowc (outp, inp, filled, &state);
          /* @r{Stop converting at an invalid character.}  */
          if (thislen == (size_t) -1)
            @{
              invalid = 1;
              break;
            @}
          /* @r{The remaining bytes are the first part of a character.}
             @r{@code{mbrtowc} has recorded them in @code{state}, so}
             @r{the next block of input continues the character.}  */
          if (thislen == (size_t) -2)
            break;
          /* @r{We want to handle embedded NUL bytes}
             @r{but the return value is 0.  Correct this.}  */
          if (thislen == 0)
            thislen = 1;
          /* @r{Advance past this character.} */
          inp += thislen;
          filled -= thislen;
          ++outp;
        @}

      /* @r{Write the wide characters we just made.}  */
      nwrite = write (output, outbuf,
                      (outp - outbuf) * sizeof (wchar_t));
      if (nwrite < 0)
        @{
          perror ("write");
          return 0;
        @}

      /* @r{Report an invalid sequence, or a character}
         @r{cut off by the end of the file.} */
      if (invalid || (eof && !mbsinit (&state)))
        @{
          error (0, 0, "invalid multibyte character");
          return 0;
        @}
    @}

  return 1;
@}
@end smallexample
1207
1208
1209 @node Non-reentrant Conversion
1210 @section Non-reentrant Conversion Function
1211
The functions described in the previous section are defined in
@w{Amendment 1} to @w{ISO C90}.  But the original @w{ISO C90} standard also
contained functions for character set conversion.  The reason that they
are not described first is that they are almost entirely
useless.
1217
The problem is that all the functions for conversion defined in @w{ISO
C90} use a local state.  This does not only mean that multiple
conversions at the same time (not only when using threads) cannot be
done.  It also means that you cannot first convert single characters and
then strings since you cannot tell the conversion functions which state to
use.

These functions are therefore usable only in a very limited set of
situations.  One must complete converting the entire string before
starting a new one and each string/text must be converted with the same
function (there is no problem with the library itself; it is guaranteed
that no library function changes the state of any of these functions).
For these reasons it is @emph{highly} recommended to use the functions
from the previous section.
1232
1233 @menu
1234 * Non-reentrant Character Conversion::  Non-reentrant Conversion of Single
1235                                          Characters.
1236 * Non-reentrant String Conversion::     Non-reentrant Conversion of Strings.
1237 * Shift State::                         States in Non-reentrant Functions.
1238 @end menu
1239
1240 @node Non-reentrant Character Conversion
1241 @subsection Non-reentrant Conversion of Single Characters
1242
1243 @comment stdlib.h
1244 @comment ISO
1245 @deftypefun int mbtowc (wchar_t *@var{result}, const char *@var{string}, size_t @var{size})
1246 The @code{mbtowc} (``multibyte to wide character'') function when called
1247 with non-null @var{string} converts the first multibyte character
1248 beginning at @var{string} to its corresponding wide character code.  It
1249 stores the result in @code{*@var{result}}.
1250
1251 @code{mbtowc} never examines more than @var{size} bytes.  (The idea is
1252 to supply for @var{size} the number of bytes of data you have in hand.)
1253
1254 @code{mbtowc} with non-null @var{string} distinguishes three
possibilities: the first @var{size} bytes at @var{string} start with a
valid multibyte character, they start with an invalid byte sequence or
1257 just part of a character, or @var{string} points to an empty string (a
1258 null character).
1259
1260 For a valid multibyte character, @code{mbtowc} converts it to a wide
1261 character and stores that in @code{*@var{result}}, and returns the
1262 number of bytes in that character (always at least @code{1}, and never
1263 more than @var{size}).
1264
1265 For an invalid byte sequence, @code{mbtowc} returns @code{-1}.  For an
1266 empty string, it returns @code{0}, also storing @code{0} in
1267 @code{*@var{result}}.
1268
1269 If the multibyte character code uses shift characters, then
1270 @code{mbtowc} maintains and updates a shift state as it scans.  If you
1271 call @code{mbtowc} with a null pointer for @var{string}, that
1272 initializes the shift state to its standard initial value.  It also
1273 returns nonzero if the multibyte character code in use actually has a
1274 shift state.  @xref{Shift State}.
1275 @end deftypefun
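
A typical loop over a string with @code{mbtowc} might look like the
following sketch.  The name @code{decode_with_mbtowc} is hypothetical.
Since the shift state is hidden inside the library, the sketch resets it
before starting and must not be interleaved with other conversions using
@code{mbtowc}.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: decode the multibyte string S into at
   most OUTLEN wide characters stored in OUT (no terminator is
   stored).  Return the number of wide characters stored, or
   (size_t) -1 for an invalid or incomplete sequence.  */
size_t
decode_with_mbtowc (wchar_t *out, size_t outlen, const char *s)
{
  size_t remaining = strlen (s);
  size_t count = 0;

  /* Reset the internal shift state.  */
  (void) mbtowc (NULL, NULL, 0);

  while (remaining > 0 && count < outlen)
    {
      int n = mbtowc (&out[count], s, remaining);
      /* The NUL byte is excluded from REMAINING, so 0 is not
         expected here and is treated as an error as well.  */
      if (n <= 0)
        return (size_t) -1;
      s += n;
      remaining -= n;
      ++count;
    }
  return count;
}
```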
1276
1277 @comment stdlib.h
1278 @comment ISO
1279 @deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
1280 The @code{wctomb} (``wide character to multibyte'') function converts
1281 the wide character code @var{wchar} to its corresponding multibyte
1282 character sequence, and stores the result in bytes starting at
1283 @var{string}.  At most @code{MB_CUR_MAX} characters are stored.
1284
1285 @code{wctomb} with non-null @var{string} distinguishes three
1286 possibilities for @var{wchar}: a valid wide character code (one that can
1287 be translated to a multibyte character), an invalid code, and @code{0}.
1288
1289 Given a valid code, @code{wctomb} converts it to a multibyte character,
1290 storing the bytes starting at @var{string}.  Then it returns the number
1291 of bytes in that character (always at least @code{1}, and never more
1292 than @code{MB_CUR_MAX}).
1293
If @var{wchar} is an invalid wide character code, @code{wctomb} returns
@code{-1}.  If @var{wchar} is @code{0}, it stores a null byte (preceded
by any shift sequence needed to return to the initial shift state) and
returns the number of bytes stored, which is at least @code{1}.
1297
1298 If the multibyte character code uses shift characters, then
1299 @code{wctomb} maintains and updates a shift state as it scans.  If you
1300 call @code{wctomb} with a null pointer for @var{string}, that
1301 initializes the shift state to its standard initial value.  It also
1302 returns nonzero if the multibyte character code in use actually has a
1303 shift state.  @xref{Shift State}.
1304
Calling this function with a @var{wchar} argument of zero when
@var{string} is not null has the side-effect of reinitializing the
stored shift state @emph{as well as} storing the null byte and
returning the number of bytes stored.
1309 @end deftypefun
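
The following sketch shows one way @code{wctomb} might be called safely.
The name @code{wc_to_bytes} and the macro @code{WC_BYTES_MAX} are
hypothetical.  The caller provides a buffer of @code{MB_LEN_MAX} bytes
plus one for a terminating null byte, so the function cannot overrun it.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical buffer size: the longest multibyte character
   plus a terminating null byte.  */
#define WC_BYTES_MAX (MB_LEN_MAX + 1)

/* Hypothetical helper: store the multibyte representation of WC
   in OUT, which must have room for WC_BYTES_MAX bytes, and
   terminate it with a null byte.  Return the number of bytes in
   the character, or -1 if WC has no multibyte representation.  */
int
wc_to_bytes (char *out, wchar_t wc)
{
  /* Reset the internal shift state.  */
  (void) wctomb (NULL, L'\0');

  int n = wctomb (out, wc);
  if (n < 0)
    return -1;
  out[n] = '\0';
  return n;
}
```

Resetting the shift state for every character is only reasonable for
isolated characters; a whole text in a stateful character set should be
converted with one of the restartable functions instead.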
1310
1311 Similar to @code{mbrlen} there is also a non-reentrant function which
1312 computes the length of a multibyte character.  It can be defined in
1313 terms of @code{mbtowc}.
1314
1315 @comment stdlib.h
1316 @comment ISO
1317 @deftypefun int mblen (const char *@var{string}, size_t @var{size})
1318 The @code{mblen} function with a non-null @var{string} argument returns
1319 the number of bytes that make up the multibyte character beginning at
1320 @var{string}, never examining more than @var{size} bytes.  (The idea is
1321 to supply for @var{size} the number of bytes of data you have in hand.)
1322
1323 The return value of @code{mblen} distinguishes three possibilities: the
first @var{size} bytes at @var{string} start with a valid multibyte
character, they start with an invalid byte sequence or just part of a
1326 character, or @var{string} points to an empty string (a null character).
1327
1328 For a valid multibyte character, @code{mblen} returns the number of
1329 bytes in that character (always at least @code{1}, and never more than
1330 @var{size}).  For an invalid byte sequence, @code{mblen} returns
1331 @code{-1}.  For an empty string, it returns @code{0}.
1332
1333 If the multibyte character code uses shift characters, then @code{mblen}
1334 maintains and updates a shift state as it scans.  If you call
1335 @code{mblen} with a null pointer for @var{string}, that initializes the
1336 shift state to its standard initial value.  It also returns nonzero if
1337 the multibyte character code in use actually has a shift state.
1338 @xref{Shift State}.
1339
1340 @pindex stdlib.h
1341 The function @code{mblen} is declared in @file{stdlib.h}.
1342 @end deftypefun
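
For example, the number of characters in a multibyte string could be
counted with @code{mblen} as in the following sketch.  The name
@code{count_mbchars} is hypothetical.  The same caveat as for all
functions in this section applies: the hidden shift state is shared with
every other caller of @code{mblen}.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the number of multibyte characters
   in S, or (size_t) -1 if S contains an invalid or incomplete
   byte sequence.  */
size_t
count_mbchars (const char *s)
{
  size_t remaining = strlen (s);
  size_t count = 0;

  /* Reset the internal shift state.  */
  (void) mblen (NULL, 0);

  while (remaining > 0)
    {
      int n = mblen (s, remaining);
      /* The NUL byte is excluded from REMAINING, so 0 is not
         expected here and is treated as an error as well.  */
      if (n <= 0)
        return (size_t) -1;
      s += n;
      remaining -= n;
      ++count;
    }
  return count;
}
```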
1343
1344
1345 @node Non-reentrant String Conversion
1346 @subsection Non-reentrant Conversion of Strings
1347
For convenience the @w{ISO C90} standard also defines functions
to convert entire strings instead of single characters.  These functions
suffer from the same problems as their reentrant counterparts from
@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
1352
1353 @comment stdlib.h
1354 @comment ISO
1355 @deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
1356 The @code{mbstowcs} (``multibyte string to wide character string'')
1357 function converts the null-terminated string of multibyte characters
1358 @var{string} to an array of wide character codes, storing not more than
1359 @var{size} wide characters into the array beginning at @var{wstring}.
1360 The terminating null character counts towards the size, so if @var{size}
1361 is less than the actual number of wide characters resulting from
1362 @var{string}, no terminating null character is stored.
1363
1364 The conversion of characters from @var{string} begins in the initial
1365 shift state.
1366
1367 If an invalid multibyte character sequence is found, this function
1368 returns a value of @code{-1}.  Otherwise, it returns the number of wide
1369 characters stored in the array @var{wstring}.  This number does not
1370 include the terminating null character, which is present if the number
1371 is less than @var{size}.
1372
1373 Here is an example showing how to convert a string of multibyte
1374 characters, allocating enough space for the result.
1375
1376 @smallexample
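wchar_t *
mbstowcs_alloc (const char *string)
@{
  /* @r{A multibyte string never yields more wide characters than}
     @r{it has bytes, so this is always large enough.}  */
  size_t size = strlen (string) + 1;
  wchar_t *buf = xmalloc (size * sizeof (wchar_t));

  size = mbstowcs (buf, string, size);
  if (size == (size_t) -1)
    @{
      /* @r{Invalid multibyte sequence; do not leak the buffer.}  */
      free (buf);
      return NULL;
    @}
  /* @r{Shrink the buffer to the size actually needed.}  */
  buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
  return buf;
@}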
1389 @end smallexample
1390
1391 @end deftypefun
1392
1393 @comment stdlib.h
1394 @comment ISO
1395 @deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
1396 The @code{wcstombs} (``wide character string to multibyte string'')
1397 function converts the null-terminated wide character array @var{wstring}
1398 into a string containing multibyte characters, storing not more than
1399 @var{size} bytes starting at @var{string}, followed by a terminating
1400 null character if there is room.  The conversion of characters begins in
1401 the initial shift state.
1402
The terminating null character counts towards the size, so if @var{size}
is less than or equal to the number of bytes needed to represent
@var{wstring}, no terminating null character is stored.
1406
1407 If a code that does not correspond to a valid multibyte character is
1408 found, this function returns a value of @code{-1}.  Otherwise, the
1409 return value is the number of bytes stored in the array @var{string}.
1410 This number does not include the terminating null character, which is
1411 present if the number is less than @var{size}.
1412 @end deftypefun
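
The sizing rules for @code{wcstombs} can be illustrated with a small
sketch that is the counterpart of the @code{mbstowcs} example above.
The helper name @code{wcstombs_alloc} is hypothetical (it is not a
library function), and plain @code{malloc} is used instead of
@code{xmalloc}.  The buffer is sized under the assumption that no wide
character, including the terminating null, needs more than
@code{MB_CUR_MAX} bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper, not part of the library: convert WSTRING to a
   newly allocated multibyte string.  Returns NULL on failure.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  assert (wstring != NULL);

  /* Worst case: every wide character, including the terminating
     null, needs MB_CUR_MAX bytes.  */
  size_t size = (wcslen (wstring) + 1) * MB_CUR_MAX;
  char *buf = malloc (size);
  if (buf == NULL)
    return NULL;

  size_t written = wcstombs (buf, wstring, size);
  if (written == (size_t) -1)
    {
      /* Some wide character has no multibyte representation.  */
      free (buf);
      return NULL;
    }

  /* WRITTEN is less than SIZE here, so the result is terminated.
     Shrink the buffer to what is actually used.  */
  char *shrunk = realloc (buf, written + 1);
  return shrunk != NULL ? shrunk : buf;
}
```

The caller owns the returned string and releases it with @code{free}.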
1413
1414 @node Shift State
1415 @subsection States in Non-reentrant Functions
1416
1417 In some multibyte character codes, the @emph{meaning} of any particular
1418 byte sequence is not fixed; it depends on what other sequences have come
1419 earlier in the same string.  Typically there are just a few sequences
1420 that can change the meaning of other sequences; these few are called
1421 @dfn{shift sequences} and we say that they set the @dfn{shift state} for
1422 other sequences that follow.
1423
1424 To illustrate shift state and shift sequences, suppose we decide that
1425 the sequence @code{0200} (just one byte) enters Japanese mode, in which
1426 pairs of bytes in the range from @code{0240} to @code{0377} are single
1427 characters, while @code{0201} enters Latin-1 mode, in which single bytes
1428 in the range from @code{0240} to @code{0377} are characters, and
1429 interpreted according to the ISO Latin-1 character set.  This is a
1430 multibyte code which has two alternative shift states (``Japanese mode''
1431 and ``Latin-1 mode''), and two shift sequences that specify particular
1432 shift states.
1433
1434 When the multibyte character code in use has shift states, then
1435 @code{mblen}, @code{mbtowc} and @code{wctomb} must maintain and update
1436 the current shift state as they scan the string.  To make this work
1437 properly, you must follow these rules:
1438
1439 @itemize @bullet
1440 @item
1441 Before starting to scan a string, call the function with a null pointer
1442 for the multibyte character address---for example, @code{mblen (NULL,
1443 0)}.  This initializes the shift state to its standard initial value.
1444
1445 @item
1446 Scan the string one character at a time, in order.  Do not ``back up''
1447 and rescan characters already scanned, and do not intersperse the
1448 processing of different strings.
1449 @end itemize
1450
1451 Here is an example of using @code{mblen} following these rules:
1452
1453 @smallexample
void
scan_string (const char *s)
@{
  size_t length = strlen (s);

  /* @r{Initialize shift state.}  */
  mblen (NULL, 0);

  while (1)
    @{
      int thischar = mblen (s, length);
      /* @r{Deal with end of string and invalid characters.}  */
      if (thischar == 0)
        break;
      if (thischar == -1)
        @{
          error (0, 0, "invalid multibyte character");
          break;
        @}
      /* @r{Advance past this character.}  */
      s += thischar;
      length -= thischar;
    @}
@}
1478 @end smallexample
1479
1480 The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
1481 reentrant when using a multibyte code that uses a shift state.  However,
1482 no other library functions call these functions, so you don't have to
1483 worry that the shift state will be changed mysteriously.
1484
1485
1486 @node Generic Charset Conversion
1487 @section Generic Charset Conversion
1488
The conversion functions mentioned so far in this chapter all have in
common that they operate on character sets which are not directly
specified by the functions.  The multibyte encoding used is specified by
the currently selected locale for the @code{LC_CTYPE} category.  The
wide character set is fixed by the implementation (in the case of the
GNU C library it is always @w{ISO 10646}).
1495
1496 This has of course several problems when it comes to general character
1497 conversion:
1498
1499 @itemize @bullet
1500 @item
For every conversion where neither the source nor the destination
character set is the character set of the locale for the @code{LC_CTYPE}
category, one has to change the @code{LC_CTYPE} locale using
@code{setlocale}.

This introduces major problems for the rest of the program since
several more functions (e.g., the character classification functions,
@pxref{Classification of Characters}) use the @code{LC_CTYPE} category.
1508
1509 @item
1510 Parallel conversions to and from different character sets are not
1511 possible since the @code{LC_CTYPE} selection is global and shared by all
1512 threads.
1513
1514 @item
1515 If neither the source nor the destination character set is the character
1516 set used for @code{wchar_t} representation there is at least a two-step
1517 process necessary to convert a text using the functions above.  One
1518 would have to select the source character set as the multibyte encoding,
1519 convert the text into a @code{wchar_t} text, select the destination
1520 character set as the multibyte encoding and convert the wide character
1521 text to the multibyte (=destination) character set.
1522
Even if this is possible (which is not guaranteed) it is very tedious
work.  It also suffers even more from the other two problems because the
locale has to be changed repeatedly.
1526 @end itemize
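
To make the last point concrete, here is a rough sketch of the two-step
process using only the functions described so far.  The function name
@code{convert_via_locale} and its parameters are hypothetical, and the
sketch assumes that locales using the source and destination encodings
are installed, which is not guaranteed.  Note how the global
@code{LC_CTYPE} setting is changed twice and has to be restored
afterwards.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert TEXT from the multibyte encoding of
   SRC_LOCALE to that of DST_LOCALE by way of wchar_t.  Returns a
   malloc'ed string or NULL.  LC_CTYPE is changed while it runs.  */
char *
convert_via_locale (const char *text, const char *src_locale,
                    const char *dst_locale)
{
  assert (text != NULL && src_locale != NULL && dst_locale != NULL);

  char *result = NULL;
  wchar_t *wide = NULL;

  /* Remember the current locale.  The string returned by setlocale
     may be overwritten by later calls, so copy it.  */
  const char *current = setlocale (LC_CTYPE, NULL);
  if (current == NULL)
    return NULL;
  size_t len = strlen (current) + 1;
  char *saved = malloc (len);
  if (saved == NULL)
    return NULL;
  memcpy (saved, current, len);

  /* Step 1: source encoding -> wide characters.  */
  if (setlocale (LC_CTYPE, src_locale) != NULL)
    {
      size_t n = strlen (text) + 1;
      wide = malloc (n * sizeof (wchar_t));
      if (wide != NULL && mbstowcs (wide, text, n) == (size_t) -1)
        {
          free (wide);
          wide = NULL;
        }
    }

  /* Step 2: wide characters -> destination encoding.  */
  if (wide != NULL && setlocale (LC_CTYPE, dst_locale) != NULL)
    {
      size_t size = (wcslen (wide) + 1) * MB_CUR_MAX;
      result = malloc (size);
      if (result != NULL && wcstombs (result, wide, size) == (size_t) -1)
        {
          free (result);
          result = NULL;
        }
    }

  /* Restore the caller's locale no matter what happened.  */
  setlocale (LC_CTYPE, saved);
  free (saved);
  free (wide);
  return result;
}
```

Every call disturbs all other users of @code{LC_CTYPE} in the process,
which is exactly the weakness described in the list above.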
1527
1528
The XPG2 standard defines a completely new set of functions which has
none of these limitations.  They are not at all coupled to the selected
locales and they put no constraints on the character sets selected for
source and destination.  They are limited only by the set of available
conversions.  The standard does not specify that any conversion at all
must be available; this is a measure of the quality of the
implementation.
1535
In the following text the interface is described first.  For brevity it
is called the @code{iconv} interface here, after the name of the
conversion function.  Then the implementation is described as far as it
is of interest to the advanced user who wants to extend the conversion
capabilities.  Comparisons with other implementations show what
pitfalls lie in the way of portable applications.
1542
1543 @menu
1544 * Generic Conversion Interface::    Generic Character Set Conversion Interface.
1545 * iconv Examples::                  A complete @code{iconv} example.
1546 * Other iconv Implementations::     Some Details about other @code{iconv}
1547                                      Implementations.
1548 * glibc iconv Implementation::      The @code{iconv} Implementation in the GNU C
1549                                      library.
1550 @end menu
1551
1552 @node Generic Conversion Interface
1553 @subsection Generic Character Set Conversion Interface
1554
This set of functions follows the traditional cycle of using a resource:
open--use--close.  The interface consists of three functions, each of
which implements one step.

Before the interfaces are described it is necessary to introduce a
data type.  Just like other open--use--close interfaces the functions
introduced here work using handles, and the @file{iconv.h} header
defines a special type for the handles used.
1563
1564 @comment iconv.h
1565 @comment XPG2
1566 @deftp {Data Type} iconv_t
This data type is an abstract type defined in @file{iconv.h}.  The user
must not assume anything about the definition of this type; it must be
treated as completely opaque.

Objects of this type can be assigned handles for conversions using the
@code{iconv} functions.  The objects themselves need not be freed, but
the conversions for which the handles stand have to be.
1574 @end deftp
1575
1576 @noindent
1577 The first step is the function to create a handle.
1578
1579 @comment iconv.h
1580 @comment XPG2
1581 @deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode})
1582 The @code{iconv_open} function has to be used before starting a
1583 conversion.  The two parameters this function takes determine the
1584 source and destination character set for the conversion and if the
1585 implementation has the possibility to perform such a conversion the
1586 function returns a handle.
1587
1588 If the wanted conversion is not available the function returns
1589 @code{(iconv_t) -1}.  In this case the global variable @code{errno} can
1590 have the following values:
1591
1592 @table @code
1593 @item EMFILE
1594 The process already has @code{OPEN_MAX} file descriptors open.
1595 @item ENFILE
The system limit of open files is reached.
1597 @item ENOMEM
1598 Not enough memory to carry out the operation.
1599 @item EINVAL
1600 The conversion from @var{fromcode} to @var{tocode} is not supported.
1601 @end table
1602
1603 It is not possible to use the same descriptor in different threads to
1604 perform independent conversions.  Within the data structures associated
1605 with the descriptor there is information about the conversion state.
1606 This must of course not be messed up by using it in different
1607 conversions.
1608
An @code{iconv} descriptor is like a file descriptor in that a new
descriptor must be created for every use.  The descriptor does not stand
for all of the conversions from @var{fromcode} to @var{tocode}.
1612
The GNU C library implementation of @code{iconv_open} has one
significant extension compared to other implementations.  To make it
easier to extend the set of available conversions, the implementation
allows the necessary files with data and code to be stored in
arbitrarily many directories.  How these extensions have to be written
is explained below (@pxref{glibc iconv Implementation}).  Here it is
only important to say that all directories mentioned in the
@code{GCONV_PATH} environment variable are considered if they contain a
file @file{gconv-modules}.  These directories need not necessarily be
created by the system administrator.  In fact, this extension was
introduced to help users write and use their own, new conversions.  Of
course, for security reasons this does not work in SUID binaries; in
this case only the system directory is considered, which normally is
@file{@var{prefix}/lib/gconv}.  The @code{GCONV_PATH} environment
variable is examined exactly once, at the first call of the
@code{iconv_open} function.  Later modifications of the variable have no
effect.
1630
1631 @pindex iconv.h
This function was introduced early, in the X/Open Portability Guide,
@w{version 2}.  It is supported by all commercial Unices as it is
required for the Unix branding.  The quality and completeness of the
implementation vary widely, though.  The function is declared in
@file{iconv.h}.
1637 @end deftypefun
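
The error handling described for @code{iconv_open} could be wrapped up
along the following lines.  This is only a sketch: the wrapper name
@code{open_conversion} is hypothetical, and the character set names a
caller passes in are whatever the installation happens to support.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical wrapper: open a conversion from FROMCODE to TOCODE and
   report failures on stderr.  Returns (iconv_t) -1 on error, like
   iconv_open itself, and leaves errno as iconv_open set it.  */
iconv_t
open_conversion (const char *tocode, const char *fromcode)
{
  assert (tocode != NULL && fromcode != NULL);

  iconv_t cd = iconv_open (tocode, fromcode);
  if (cd == (iconv_t) -1)
    {
      /* Printing may change errno, so keep a copy for the caller.  */
      int saved_errno = errno;

      if (saved_errno == EINVAL)
        fprintf (stderr, "conversion from '%s' to '%s' is not available\n",
                 fromcode, tocode);
      else
        perror ("iconv_open");

      errno = saved_errno;
    }
  return cd;
}
```

A descriptor obtained this way must eventually be released with
@code{iconv_close}.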
1638
The @code{iconv} implementation can associate large data structures with
the handle returned by @code{iconv_open}.  Therefore it is crucial to
free all the resources once all conversions are carried out and the
conversion is not needed anymore.
1643
1644 @comment iconv.h
1645 @comment XPG2
1646 @deftypefun int iconv_close (iconv_t @var{cd})
1647 The @code{iconv_close} function frees all resources associated with the
1648 handle @var{cd} which must have been returned by a successful call to
1649 the @code{iconv_open} function.
1650
1651 If the function call was successful the return value is @math{0}.
1652 Otherwise it is @math{-1} and @code{errno} is set appropriately.
Defined errors are:
1654
1655 @table @code
1656 @item EBADF
1657 The conversion descriptor is invalid.
1658 @end table
1659
1660 @pindex iconv.h
1661 This function was introduced together with the rest of the @code{iconv}
1662 functions in XPG2 and it is declared in @file{iconv.h}.
1663 @end deftypefun
1664
1665 The standard defines only one actual conversion function.  This has
1666 therefore the most general interface: it allows conversion from one
1667 buffer to another.  Conversion from a file to a buffer, vice versa, or
1668 even file to file can be implemented on top of it.
1669
1670 @comment iconv.h
1671 @comment XPG2
@deftypefun size_t iconv (iconv_t @var{cd}, char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft})
1673 @cindex stateful
1674 The @code{iconv} function converts the text in the input buffer
1675 according to the rules associated with the descriptor @var{cd} and
1676 stores the result in the output buffer.  It is possible to call the
1677 function for the same text several times in a row since for stateful
1678 character sets the necessary state information is kept in the data
1679 structures associated with the descriptor.
1680
The input buffer is specified by @code{*@var{inbuf}} and it contains
@code{*@var{inbytesleft}} bytes.  The extra indirection is necessary for
communicating the used input back to the caller (see below).  It is
important to note that the buffer pointer is of type @code{char *} and
the length is measured in bytes even if the input text is encoded in
wide characters.
1687
The output buffer is specified in a similar way.  @code{*@var{outbuf}}
points to the beginning of the buffer with room for at least
@code{*@var{outbytesleft}} bytes of the result.  The buffer pointer
again is of type @code{char *} and the length is measured in bytes.  If
@var{outbuf} or @code{*@var{outbuf}} is a null pointer the conversion is
performed but no output is available.
1694
1695 If @var{inbuf} is a null pointer the @code{iconv} function performs the
1696 necessary action to put the state of the conversion into the initial
1697 state.  This is obviously a no-op for non-stateful encodings, but if the
1698 encoding has a state such a function call might put some byte sequences
1699 in the output buffer which perform the necessary state changes.  The
1700 next call with @var{inbuf} not being a null pointer then simply goes on
1701 from the initial state.  It is important that the programmer never makes
1702 any assumption on whether the conversion has to deal with states or not.
1703 Even if the input and output character sets are not stateful the
1704 implementation might still have to keep states.  This is due to the
1705 implementation chosen for the GNU C library as it is described below.
1706 Therefore an @code{iconv} call to reset the state should always be
1707 performed if some protocol requires this for the output text.
1708
1709 The conversion stops for three reasons.  The first is that all
1710 characters from the input buffer are converted.  This actually can mean
1711 two things: really all bytes from the input buffer are consumed or
1712 there are some bytes at the end of the buffer which possibly can form a
1713 complete character but the input is incomplete.  The second reason for a
1714 stop is when the output buffer is full.  And the third reason is that
1715 the input contains invalid characters.
1716
In all these cases the buffer pointers after the last successful
conversion, for input and output buffer, are stored in
@code{*@var{inbuf}} and @code{*@var{outbuf}} and the available room in
each buffer is stored in @code{*@var{inbytesleft}} and
@code{*@var{outbytesleft}}.
1721
1722 Since the character sets selected in the @code{iconv_open} call can be
1723 almost arbitrary there can be situations where the input buffer contains
1724 valid characters which have no identical representation in the output
1725 character set.  The behavior in this situation is undefined.  The
1726 @emph{current} behavior of the GNU C library in this situation is to
1727 return with an error immediately.  This certainly is not the most
1728 desirable solution.  Therefore future versions will provide better ones
1729 but they are not yet finished.
1730
1731 If all input from the input buffer is successfully converted and stored
1732 in the output buffer the function returns the number of conversions
1733 performed.  In all other cases the return value is @code{(size_t) -1}
1734 and @code{errno} is set appropriately.  In this case the value pointed
1735 to by @var{inbytesleft} is nonzero.
1736
1737 @table @code
1738 @item EILSEQ
1739 The conversion stopped because of an invalid byte sequence in the input.
1740 After the call @code{*@var{inbuf}} points at the first byte of the
1741 invalid byte sequence.
1742
1743 @item E2BIG
1744 The conversion stopped because it ran out of space in the output buffer.
1745
1746 @item EINVAL
1747 The conversion stopped because of an incomplete byte sequence at the end
1748 of the input buffer.
1749
1750 @item EBADF
1751 The @var{cd} argument is invalid.
1752 @end table
1753
1754 @pindex iconv.h
1755 This function was introduced in the XPG2 standard and is declared in the
1756 @file{iconv.h} header.
1757 @end deftypefun
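
The calling conventions of @code{iconv} are easier to follow in a small
sketch that converts one complete buffer.  The helper name
@code{convert_buffer} is hypothetical.  The sketch assumes that the
whole input is available at once, so an @code{EINVAL} result is simply
passed on to the caller instead of being retried with more input.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert the INLEN bytes at IN with descriptor
   CD into the buffer OUT of OUTSIZE bytes.  Returns the number of
   bytes written, or (size_t) -1 with errno as set by iconv.  */
size_t
convert_buffer (iconv_t cd, const char *in, size_t inlen,
                char *out, size_t outsize)
{
  assert (in != NULL && out != NULL);

  /* iconv advances the pointers and decrements the counts, so work
     on copies.  The cast drops const because the prototype in
     current iconv.h headers takes a char ** for the input.  */
  char *inptr = (char *) in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;

  /* Put the descriptor into the initial state before starting.  */
  iconv (cd, NULL, NULL, NULL, NULL);

  if (iconv (cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1)
    /* errno is EILSEQ, EINVAL or E2BIG; INPTR marks the position.  */
    return (size_t) -1;

  /* Emit any byte sequence needed to return to the initial state.  */
  if (iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
    return (size_t) -1;

  return outsize - outleft;
}
```

Because the function resets the descriptor on entry, the same
descriptor can be reused for several independent texts one after the
other, though never concurrently.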
1758
The definition of the @code{iconv} function is quite good overall.  It
provides quite flexible functionality.  The only problems lie in the
boundary cases, which are incomplete byte sequences at the end of the
input buffer and invalid input.  A third problem, which is not really a
design problem, is the way conversions are selected.  The standard does
not say anything about the legitimate names or about a minimal set of
available conversions.  We will see how this has negative impacts in the
discussion of other implementations further down.
1767
1768
1769 @node iconv Examples
1770 @subsection A complete @code{iconv} example
1771
The example below features a solution for a common problem.  Given that
one knows the internal encoding used by the system for @code{wchar_t}
strings one often needs to read text from a file and store it in wide
character buffers.  One can do this using @code{mbsrtowcs} but then we
run into the problems discussed above.
1777
1778 @smallexample
int
file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail)
@{
  /* @r{@code{avail} is the size of @code{outbuf} in bytes.}  */
  char inbuf[BUFSIZ];
  size_t insize = 0;
  char *wrptr = (char *) outbuf;
  int result = 0;
  iconv_t cd;

  /* @r{Reserve room for the terminating null wide character.}  */
  if (avail < sizeof (wchar_t))
    return -1;
  avail -= sizeof (wchar_t);

  /* @r{The name @code{WCHAR_T} selects the encoding of @code{wchar_t}.}  */
  cd = iconv_open ("WCHAR_T", charset);
  if (cd == (iconv_t) -1)
    @{
      /* @r{Something went wrong.}  */
      if (errno == EINVAL)
        error (0, 0, "conversion from `%s' to `WCHAR_T' not available",
               charset);
      else
        perror ("iconv_open");

      /* @r{Terminate the output string.}  */
      *outbuf = L'\0';

      return -1;
    @}

  while (avail > 0)
    @{
      ssize_t nread;
      size_t nconv;
      char *inptr = inbuf;

      /* @r{Read more input.}  */
      nread = read (fd, inbuf + insize, sizeof (inbuf) - insize);
      if (nread == -1)
        @{
          perror ("read");
          result = -1;
          break;
        @}
      if (nread == 0)
        @{
          /* @r{When we come here the file is completely read.}
             @r{This still could mean there are some unused}
             @r{characters in the @code{inbuf}.  Put them back.}  */
          if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1)
            result = -1;
          break;
        @}
      insize += nread;

      /* @r{Do the conversion.}  */
      nconv = iconv (cd, &inptr, &insize, &wrptr, &avail);
      if (nconv == (size_t) -1)
        @{
          /* @r{Not everything went right.  It might only be}
             @r{an unfinished byte sequence at the end of the}
             @r{buffer.  Or it is a real problem.}  */
          if (errno == EINVAL)
            /* @r{This is harmless.  Simply move the unused}
               @r{bytes to the beginning of the buffer so that}
               @r{they can be used in the next round.}  */
            memmove (inbuf, inptr, insize);
          else
            @{
              /* @r{It is a real problem.  Maybe we ran out of}
                 @r{space in the output buffer or we have invalid}
                 @r{input.  In any case back the file pointer to}
                 @r{the position of the last processed byte.}  */
              lseek (fd, -(off_t) insize, SEEK_CUR);
              result = -1;
              break;
            @}
        @}
    @}

  /* @r{Terminate the output string; room was reserved above.}  */
  *((wchar_t *) wrptr) = L'\0';

  if (iconv_close (cd) != 0)
    perror ("iconv_close");

  /* @r{Report errors, otherwise the number of wide characters.}  */
  return result < 0 ? result : (int) ((wchar_t *) wrptr - outbuf);
@}
1856 @end smallexample
1857
1858 @cindex stateful
1859 This example shows the most important aspects of using the @code{iconv}
1860 functions.  It shows how successive calls to @code{iconv} can be used to
1861 convert large amounts of text.  The user does not have to care about
1862 stateful encodings as the functions take care of everything.
1863
An interesting point is the case where @code{iconv} returns an error and
@code{errno} is set to @code{EINVAL}.  This is not really an error in
the transformation.  It can happen whenever the input character set
contains byte sequences of more than one byte for some character and
texts are not processed in one piece.  In this case there is a chance
that a multibyte sequence is cut.  The caller can then simply read the
remainder of the text and feed the offending bytes together with new
characters from the input to @code{iconv} and continue the work.  The
internal state kept in the descriptor is @emph{not} unspecified after
such an event, as is the case with the conversion functions from the
@w{ISO C} standard.
1875
1876 The example also shows the problem of using wide character strings with
1877 @code{iconv}.  As explained in the description of the @code{iconv}
1878 function above the function always takes a pointer to a @code{char}
1879 array and the available space is measured in bytes.  In the example the
1880 output buffer is a wide character buffer.  Therefore we use a local
1881 variable @var{wrptr} of type @code{char *} which is used in the
1882 @code{iconv} calls.
1883
This looks rather innocent but can lead to problems on platforms which
have tight restrictions on alignment.  Therefore the caller of
1886 @code{iconv} has to make sure that the pointers passed are suitable for
1887 access of characters from the appropriate character set.  Since in the
1888 above case the input parameter to the function is a @code{wchar_t}
1889 pointer this is the case (unless the user violates alignment when
1890 computing the parameter).  But in other situations, especially when
1891 writing generic functions where one does not know what type of character
1892 set one uses and therefore treats text as a sequence of bytes, it might
1893 become tricky.
1894
1895
1896 @node Other iconv Implementations
1897 @subsection Some Details about other @code{iconv} Implementations
1898
1899 This is not really the place to discuss the @code{iconv} implementation
1900 of other systems but it is necessary to know a bit about them to write
1901 portable programs.  The above mentioned problems with the specification
1902 of the @code{iconv} functions can lead to portability issues.
1903
1904 The first thing to notice is that due to the large number of character
1905 sets in use it is certainly not practical to encode the conversions
1906 directly in the C library.  Therefore the conversion information must
come from files outside the C library.  This is usually done in one or
both of the following ways:
1909
1910 @itemize @bullet
1911 @item
1912 The C library contains a set of generic conversion functions which can
1913 read the needed conversion tables and other information from data files.
1914 These files get loaded when necessary.
1915
This solution is problematic as it can be applied to all character sets
only with very great effort (maybe it is even impossible).  The
differences in the structure of the different character sets are so
large that many different variants of the table processing functions
must be developed.  On top of this the generic nature of these functions
makes them slower than specifically implemented functions.
1922
1923 @item
The C library only contains a framework which can dynamically load
object files and execute the conversion functions contained therein.
1926
1927 This solution provides much more flexibility.  The C library itself
1928 contains only very little code and therefore reduces the general memory
1929 footprint.  Also, with a documented interface between the C library and
1930 the loadable modules it is possible for third parties to extend the set
1931 of available conversion modules.  A drawback of this solution is that
1932 dynamic loading must be available.
1933 @end itemize
1934
Some implementations in commercial Unices implement a mixture of these
possibilities, the majority only the second solution.  This often leads
to problems, though.  Since the conversion modules must be dynamically
loaded, the system must be able to do this for all programs.  But this
is not the case.  At least some platforms (if not all) are not able to
dynamically load objects if the program is linked statically.  This is
often solved by outlawing static linking entirely, but that is surely a
weak solution.  The GNU C library does not have this restriction though
it also uses dynamic loading.  The danger is that one gets used to this
and forgets about the restriction on other systems.
1946
A second thing to know about other @code{iconv} implementations is that
the number of available conversions is often very limited.  Some
implementations provide in the standard release (not the special
international release, if such a thing exists) at most 100 to 200
conversion possibilities.  This does not mean 200 different character
sets are supported.  E.g., conversions from one character set to a set
of, say, 10 others count as 10 conversions.  Together with the other
direction this already makes 20.  One can imagine the thin coverage
these platforms provide.  Some Unix vendors even provide only a handful
of conversions which renders them useless for almost all uses.
1957
This directly leads to a third and probably the most problematic point.
With the way the @code{iconv} conversion functions are implemented on
all known Unix systems, the availability of the conversion from
character set @math{@cal{A}} to @math{@cal{B}} and of the conversion
from @math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the
conversion from @math{@cal{A}} to @math{@cal{C}} is available.
1964
This might not seem unreasonable or problematic at first, but it is
quite a big problem, as one will notice shortly after hitting it.  To
show the problem, assume we write a program which has to convert from
@math{@cal{A}} to @math{@cal{C}}.  A call like
1969
1970 @smallexample
1971 cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}");
1972 @end smallexample
1973
1974 @noindent
fails according to the assumption above.  But what does the program do
now?  The conversion is really necessary and therefore simply giving up
is not an option.
1978
First, this is of course a nuisance.  The @code{iconv} function should
take care of this.  But second, how should the program proceed from here
on?  If it tried to convert to character set @math{@cal{B}} first, the
two @code{iconv_open} calls
1983
1984 @smallexample
1985 cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}");
1986 @end smallexample
1987
1988 @noindent
1989 and
1990
1991 @smallexample
1992 cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}");
1993 @end smallexample
1994
1995 @noindent
will succeed, but how does one find @math{@cal{B}}?
1997
The answer is unfortunately: there is no general solution.  On some
systems guessing might help.  On those systems most character sets can
be converted to and from UTF8 encoded @w{ISO 10646} or Unicode text.
Besides this, only some very system-specific methods can help.  Since
the conversion functions come from loadable modules and these modules
must be stored somewhere in the filesystem, one @emph{could} try to find
them and determine from the available files which conversions are
available and whether there is an indirect route from @math{@cal{A}} to
@math{@cal{C}}.
2007
This shows one of the design errors of @code{iconv} mentioned above.  It
should at least be possible to determine the list of available
conversions programmatically so that if @code{iconv_open} says there is
no such conversion, one could make sure this also is true for indirect
routes.
2013
2014
2015 @node glibc iconv Implementation
2016 @subsection The @code{iconv} Implementation in the GNU C library
2017
After reading about the problems of @code{iconv} implementations in the
last section it is certainly good to read here that the implementation
in the GNU C library has none of the problems mentioned above.  We will
now address the points raised above step by step.  The evaluation is
based on the current state of the development (as of January 1999).
The development of the @code{iconv} functions is not entirely finished
yet, but things can only get better.
2025
2026 The GNU C library's @code{iconv} implementation uses shared loadable
2027 modules to implement the conversions.  A very small number of
2028 conversions are built into the library itself but these are only rather
2029 trivial conversions.
2030
All the benefits of loadable modules are available in the GNU C library
implementation.  This is especially interesting since the interface is
well documented (see below) and it therefore is easy to write new
conversion modules.  The drawback of using loadable objects is not a
problem in the GNU C library, at least on ELF systems.  Since the
library is able to load shared objects even in statically linked
binaries, static linking need not be forbidden in case one wants to use
@code{iconv}.
2039
The second problem mentioned is the number of supported conversions.
First, the GNU C library supports more than 150 character sets.  And
because of the way the implementation is designed, the number of
supported conversions is greater than 22350 (@math{150} times
@math{149}).  If any conversion from or to a character set is missing it
can easily be added.
2045
This high number is due to the fact that the GNU C library
implementation of @code{iconv} does not have the third problem mentioned
above.  I.e., whenever there is a conversion from a character set
@math{@cal{A}} to @math{@cal{B}} and from @math{@cal{B}} to
@math{@cal{C}} it is always possible to convert from @math{@cal{A}} to
@math{@cal{C}} directly.  If @code{iconv_open} returns an error and
sets @code{errno} to @code{EINVAL} this really means there is no known
way, directly or indirectly, to perform the wanted conversion.
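
As a small illustration of this last point, the following sketch uses
only the standard @code{iconv_open} and @code{iconv_close} calls to tell
an unavailable conversion apart from other failures.  It is not part of
the library, and the helper name @code{conversion_available} is made up
for this example.

@smallexample
#include <errno.h>
#include <iconv.h>

/* Return 1 if a conversion from FROM to TO can be opened, 0 if
   iconv_open reports EINVAL (no direct or indirect route is known),
   and -1 for any other error.  */
int
conversion_available (const char *from, const char *to)
@{
  /* Note the argument order: destination first, then source.  */
  iconv_t cd = iconv_open (to, from);

  if (cd == (iconv_t) -1)
    return errno == EINVAL ? 0 : -1;

  iconv_close (cd);
  return 1;
@}
@end smallexample

With the GNU C library one would expect
@code{conversion_available ("ASCII", "UTF-8")} to yield @math{1} and a
call with an unknown character set name to yield @math{0}.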
2054
2055 @cindex triangulation
2056 This is achieved by providing for each character set a conversion from
2057 and to UCS4 encoded @w{ISO 10646}.  Using @w{ISO 10646} as an
2058 intermediate representation it is possible to ``triangulate''.
2059
There is no inherent requirement to provide a conversion to @w{ISO
10646} for a new character set and it is also possible to provide other
conversions where neither source nor destination character set is @w{ISO
10646}.  The currently existing set of conversions is simply meant to
cover all conversions which might be of interest.  What could be done
in the future is to improve the speed of certain conversions.
2066
2067 @cindex ISO-2022-JP
2068 @cindex EUC-JP
Since all currently available conversions use the triangulation method,
frequently used conversions run unnecessarily slowly.  If, e.g., somebody
often needs the conversion from ISO-2022-JP to EUC-JP it is not the best
way to convert the input to @w{ISO 10646} first.  The two character sets
of interest are much more similar to each other than to @w{ISO 10646}.
2074
In such a situation one can easily write a new conversion and provide it
as a better alternative.  The GNU C library @code{iconv} implementation
will automatically use the module implementing the conversion if it is
specified to be more efficient.
2079
2080 @subsubsection Format of @file{gconv-modules} files
2081
2082 All information about the available conversions comes from a file named
2083 @file{gconv-modules} which can be found in any of the directories along
2084 the @code{GCONV_PATH}.  The @file{gconv-modules} files are line-oriented
2085 text files, where each of the lines has one of the following formats:
2086
2087 @itemize @bullet
2088 @item
2089 If the first non-whitespace character is a @kbd{#} the line contains
2090 only comments and is ignored.
2091
2092 @item
2093 Lines starting with @code{alias} define an alias name for a character
2094 set.  There are two more words expected on the line.  The first one
2095 defines the alias name and the second defines the original name of the
2096 character set.  The effect is that it is possible to use the alias name
2097 in the @var{fromset} or @var{toset} parameters of @code{iconv_open} and
2098 achieve the same result as when using the real character set name.
2099
This is quite important as a character set often has many different
names.  There is normally an official name but this need not
correspond to the most popular name.  Besides this, many character sets
have special names which are somehow constructed.  E.g., all character
2104 sets specified by the ISO have an alias of the form
2105 @code{ISO-IR-@var{nnn}} where @var{nnn} is the registration number.
2106 This allows programs which know about the registration number to
2107 construct character set names and use them in @code{iconv_open} calls.
2108 More on the available names and aliases follows below.
2109
2110 @item
2111 Lines starting with @code{module} introduce an available conversion
2112 module.  These lines must contain three or four more words.
2113
The first word specifies the source character set, the second word the
destination character set of the conversion implemented in this module.
The third word is the name of the loadable module.  The filename is
constructed by appending the usual shared object suffix (normally
@file{.so}) and this file is then supposed to be found in the same
directory the @file{gconv-modules} file is in.  The last word on the
2120 line, which is optional, is a numeric value representing the cost of the
2121 conversion.  If this word is missing a cost of @math{1} is assumed.  The
2122 numeric value itself does not matter that much; what counts are the
2123 relative values of the sums of costs for all possible conversion paths.
2124 Below is a more precise description of the use of the cost value.
2125 @end itemize
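
The three line formats can be recognized with very little code.  The
following is only a sketch and not the parser the library uses.  The
names @code{parse_gconv_line}, @code{struct parsed_line} and the
@code{LINE_*} constants are invented for this example, and treating
empty lines as ignorable is an assumption.

@smallexample
#include <stdio.h>
#include <string.h>

enum line_kind @{ LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID @};

struct parsed_line
@{
  enum line_kind kind;
  char first[64];    /* alias name, or source character set */
  char second[64];   /* original name, or destination character set */
  char module[64];   /* module name without suffix (module lines) */
  int cost;          /* defaults to 1 if not given */
@};

enum line_kind
parse_gconv_line (const char *line, struct parsed_line *out)
@{
  char keyword[16];
  int n;

  memset (out, 0, sizeof *out);
  out->cost = 1;
  out->kind = LINE_IGNORED;

  /* Skip leading whitespace; comments and empty lines are ignored.  */
  line += strspn (line, " \t");
  if (*line == '#' || *line == '\n' || *line == '\0')
    return out->kind;

  /* A missing cost leaves the default of 1 untouched.  */
  n = sscanf (line, "%15s %63s %63s %63s %d", keyword,
              out->first, out->second, out->module, &out->cost);

  if (n >= 3 && strcmp (keyword, "alias") == 0)
    out->kind = LINE_ALIAS;
  else if (n >= 4 && strcmp (keyword, "module") == 0)
    out->kind = LINE_MODULE;
  else
    out->kind = LINE_INVALID;

  return out->kind;
@}
@end smallexample

A @code{module} line with only three words after the keyword is
reported as @code{LINE_MODULE} with a cost of @math{1}, matching the
default described above.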
2126
Let us come back to the example where one has written a module to
directly convert from ISO-2022-JP to EUC-JP and back.  All that has to
be done is to put the new module, say @file{ISO2022JP-EUCJP.so}, in a
directory and add a file @file{gconv-modules} with the following content
in the same directory:
2132
@smallexample
module  ISO-2022-JP//   EUC-JP//        ISO2022JP-EUCJP    1
module  EUC-JP//        ISO-2022-JP//   ISO2022JP-EUCJP    1
@end smallexample
2137
2138 To see why this is enough it is necessary to understand how the
2139 conversion used by @code{iconv} and described in the descriptor is
2140 selected.  The approach to this problem is quite simple.
2141
2142 At the first call of the @code{iconv_open} function the program reads
2143 all available @file{gconv-modules} files and builds up two tables: one
2144 containing all the known aliases and another which contains the
2145 information about the conversions and which shared object implements
2146 them.
2147
2148 @subsubsection Finding the conversion path in @code{iconv}
2149
The set of available conversions forms a directed graph with weighted
edges.  The weights on the edges are of course the costs specified in
2152 the @file{gconv-modules} files.  The @code{iconv_open} function
2153 therefore uses an algorithm suitable to search for the best path in such
2154 a graph and so constructs a list of conversions which must be performed
2155 in succession to get the transformation from the source to the
2156 destination character set.
2157
Now it can be easily seen why the above @file{gconv-modules} file
allows the @code{iconv} implementation to pick up the specific
ISO-2022-JP to EUC-JP conversion module instead of the conversion coming
with the library itself.  Since the latter conversion takes two steps
(from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to
EUC-JP) the cost is @math{1+1 = 2}.  But the above @file{gconv-modules}
file specifies that the new conversion module can perform this
conversion with a cost of only @math{1}.
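
The cost comparison can be reproduced with a few lines of code.  The
text does not say which search algorithm @code{iconv_open} uses, so the
sketch below simply assumes a Bellman-Ford style relaxation over a list
of edges.  The names @code{struct edge} and @code{cheapest_path} and the
limit of 16 character sets are hypothetical.

@smallexample
#include <limits.h>
#include <stddef.h>

struct edge
@{
  int from;   /* index of the source character set */
  int to;     /* index of the destination character set */
  int cost;   /* cost taken from the module line */
@};

/* Return the cost of the cheapest path from SRC to DST, or -1 if
   there is none.  At most 16 character sets are supported.  */
int
cheapest_path (const struct edge *edges, size_t nedges, int nsets,
               int src, int dst)
@{
  int dist[16];
  int round;
  size_t e;

  if (nsets > 16)
    return -1;

  for (round = 0; round < nsets; ++round)
    dist[round] = INT_MAX;
  dist[src] = 0;

  /* Relax every edge nsets - 1 times.  */
  for (round = 1; round < nsets; ++round)
    for (e = 0; e < nedges; ++e)
      if (dist[edges[e].from] != INT_MAX
          && dist[edges[e].from] + edges[e].cost < dist[edges[e].to])
        dist[edges[e].to] = dist[edges[e].from] + edges[e].cost;

  return dist[dst] == INT_MAX ? -1 : dist[dst];
@}
@end smallexample

With only the two triangulation edges of cost @math{1} each the function
reports @math{2} for ISO-2022-JP to EUC-JP.  Once the direct edge of
cost @math{1} is added it reports @math{1}, so the new module wins.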
2166
2167 A bit mysterious about the @file{gconv-modules} file above (and also the
2168 file coming with the GNU C library) are the names of the character sets
2169 specified in the @code{module} lines.  Why do almost all the names end
2170 in @code{//}?  And this is not all: the names can actually be regular
expressions.  At this point in time this mystery cannot be revealed.
2172 Sorry!  @strong{The part of the implementation where this is used is not
2173 yet finished.  For now please simply follow the existing examples.
2174 It'll become clearer once it is. --drepper}
2175
A last remark about the @file{gconv-modules} file concerns the names not
ending with @code{//}.  A character set named @code{INTERNAL} is often
mentioned.  From the discussion above and the chosen name it should have
become clear that this is the name for the representation used in the
intermediate step of the triangulation.  We have said that this is UCS4
but actually this is not quite right.  The UCS4 specification also
includes the specification of the byte ordering used.  Since a UCS4
value consists of four bytes a stored value is affected by byte
ordering.  The internal representation is @emph{not} the same as UCS4 in
case the byte ordering of the processor (or at least the running
process) is not the same as the one required for UCS4.  This is done for
performance reasons as one does not want to perform unnecessary
byte-swapping operations if one is not interested in actually seeing the
result in UCS4.  To avoid trouble with endianness the internal
representation is consistently named @code{INTERNAL} even on big-endian
systems where the representations are identical.
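
The difference between the two representations can be made visible with
a short sketch.  It assumes that @code{INTERNAL} is simply a 32-bit
value kept in host byte order, as described above.  The function names
@code{store_ucs4} and @code{host_order_matches_ucs4} are invented for
this example.

@smallexample
#include <stdint.h>
#include <string.h>

/* Store C in the big-endian byte order required for UCS4.  */
void
store_ucs4 (uint32_t c, unsigned char out[4])
@{
  out[0] = (unsigned char) (c >> 24);
  out[1] = (unsigned char) (c >> 16);
  out[2] = (unsigned char) (c >> 8);
  out[3] = (unsigned char) c;
@}

/* Return 1 if C, kept in memory in host byte order, consists of the
   same bytes as its UCS4 form, and 0 otherwise.  */
int
host_order_matches_ucs4 (uint32_t c)
@{
  unsigned char ucs4[4];

  store_ucs4 (c, ucs4);
  return memcmp (&c, ucs4, sizeof ucs4) == 0;
@}
@end smallexample

On a big-endian machine the second function returns @math{1} for every
value.  On a little-endian machine it returns @math{0} for most values,
which is exactly the byte swapping the @code{INTERNAL} representation
avoids.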
2192
2193 @subsubsection @code{iconv} module data structures
2194
2195 So far this section described how modules are located and considered to
2196 be used.  What remains to be described is the interface of the modules
2197 so that one can write new ones.  This section describes the interface as
it is in use in January 1999.  The interface will change a bit in the
future but hopefully only in an upward compatible way.
2200
The definitions necessary to write new modules are publicly available
2202 in the non-standard header @file{gconv.h}.  The following text will
2203 therefore describe the definitions from this header file.  But first it
2204 is necessary to get an overview.
2205
From the perspective of the user of @code{iconv} the interface is quite
simple: the @code{iconv_open} function returns a handle which can be
used in calls to @code{iconv} and finally the handle is freed with a call
to @code{iconv_close}.  The problem is: the handle has to be able to
represent the possibly long sequences of conversion steps and also the
state of each conversion since the handle is all that is passed to the
@code{iconv} function.  Therefore the data structures are really the
key to understanding the implementation.
2214
2215 We need two different kinds of data structures.  The first describes the
2216 conversion and the second describes the state etc.  There are really two
2217 type definitions like this in @file{gconv.h}.
2218 @pindex gconv.h
2219
2220 @comment gconv.h
2221 @comment GNU
2222 @deftp {Data type} {struct gconv_step}
2223 This data structure describes one conversion a module can perform.  For
2224 each function in a loaded module with conversion functions there is
2225 exactly one object of this type.  This object is shared by all users of
2226 the conversion.  I.e., this object does not contain any information
2227 corresponding to an actual conversion.  It only describes the conversion
2228 itself.
2229
2230 @table @code
2231 @item struct gconv_loaded_object *shlib_handle
2232 @itemx const char *modname
2233 @itemx int counter
All these elements of the structure are used internally in the C library
to coordinate loading and unloading the shared object.  One must not
expect any of the other elements to be available or initialized.
2237
2238 @item const char *from_name
2239 @itemx const char *to_name
2240 @code{from_name} and @code{to_name} contain the names of the source and
2241 destination character sets.  They can be used to identify the actual
2242 conversion to be carried out since one module might implement
2243 conversions for more than one character set and/or direction.
2244
2245 @item gconv_fct fct
2246 @itemx gconv_init_fct init_fct
2247 @itemx gconv_end_fct end_fct
2248 These elements contain pointers to the functions in the loadable module.
2249 The interface will be explained below.
2250
2251 @item int min_needed_from
2252 @itemx int max_needed_from
2253 @itemx int min_needed_to
@itemx int max_needed_to
These values have to be set in the init function of the module.
2256 The @code{min_needed_from} value specifies how many bytes a character of
2257 the source character set at least needs.  The @code{max_needed_from}
2258 specifies the maximum value which also includes possible shift
2259 sequences.
2260
2261 The @code{min_needed_to} and @code{max_needed_to} values serve the same
2262 purpose but this time for the destination character set.
2263
2264 It is crucial that these values are accurate since otherwise the
2265 conversion functions will have problems or not work at all.
2266
2267 @item int stateful
2268 This element must also be initialized by the init function.  It is
2269 nonzero if the source character set is stateful.  Otherwise it is zero.
2270
2271 @item void *data
2272 This element can be used freely by the conversion functions in the
2273 module.  It can be used to communicate extra information from one call
2274 to another.  It need not be initialized if not needed at all.  If this
2275 element gets assigned a pointer to dynamically allocated memory
2276 (presumably in the init function) it has to be made sure that the end
2277 function deallocates the memory.  Otherwise the application will leak
2278 memory.
2279
It is important to be aware that this data structure is shared by all
users of this specific conversion and therefore the @code{data}
element must not contain data specific to one particular use of the
conversion function.
2284 @end table
2285 @end deftp
2286
2287 @comment gconv.h
2288 @comment GNU
2289 @deftp {Data type} {struct gconv_step_data}
2290 This is the data structure which contains the information specific to
2291 each use of the conversion functions.
2292
2293 @table @code
2294 @item char *outbuf
2295 @itemx char *outbufend
2296 These elements specify the output buffer for the conversion step.  The
2297 @code{outbuf} element points to the beginning of the buffer and
2298 @code{outbufend} points to the byte following the last byte in the
buffer.  The conversion function must not assume anything about the size
of the buffer but it can be safely assumed that there is room for at
least one complete character in the output buffer.

Once the conversion is finished and the conversion is the last step the
@code{outbuf} element must be modified to point after the last byte
written into the buffer to signal how much output is available.  If this
2306 conversion step is not the last one the element must not be modified.
2307 The @code{outbufend} element must not be modified.
2308
2309 @item int is_last
2310 This element is nonzero if this conversion step is the last one.  This
2311 information is necessary for the recursion.  See the description of the
2312 conversion function internals below.  This element must never be
2313 modified.
2314
2315 @item int invocation_counter
2316 The conversion function can use this element to see how many calls of
the conversion function already happened.  Some character sets require
a certain prolog when generating output and by comparing this value with
zero one can find out whether it is the first call and therefore whether
the prolog should be emitted.  This element must never be modified.
2321
2322 @item int internal_use
This element is another one rarely used but needed in certain
situations.  It is assigned a nonzero value in case the conversion
functions are used to implement @code{mbsrtowcs} et al.  I.e., the
function is not used directly through the @code{iconv} interface.
2327
2328 This sometimes makes a difference as it is expected that the
2329 @code{iconv} functions are used to translate entire texts while the
2330 @code{mbsrtowcs} functions are normally only used to convert single
2331 strings and might be used multiple times to convert entire texts.
2332
But in this situation we would have a problem complying with some rules
of the character set specification.  Some character sets require a prolog
which must appear exactly once for an entire text.  If a number of
@code{mbsrtowcs} calls are used to convert the text only the first call
must add the prolog.  But since there is no communication between the
different calls of @code{mbsrtowcs} the conversion functions have no
possibility to find this out.  The situation is different for sequences
of @code{iconv} calls since the handle allows access to the needed
information.
2342
2343 This element is mostly used together with @code{invocation_counter} in a
2344 way like this:
2345
@smallexample
if (!data->internal_use && data->invocation_counter == 0)
  /* @r{Emit prolog.}  */
  ...
@end smallexample
2351
2352 This element must never be modified.
2353
2354 @item mbstate_t *statep
2355 The @code{statep} element points to an object of type @code{mbstate_t}
(@pxref{Keeping the state}).  The conversion of a stateful character
2357 set must use the object pointed to by this element to store information
2358 about the conversion state.  The @code{statep} element itself must never
2359 be modified.
2360
2361 @item mbstate_t __state
This element must @emph{never} be used directly.  It is only part of
2363 this structure to have the needed space allocated.
2364 @end table
2365 @end deftp
2366
2367 @subsubsection @code{iconv} module interfaces
2368
With the knowledge about the data structures we now can describe the
conversion functions themselves.  To understand the interface a bit of
2371 knowledge about the functionality in the C library which loads the
2372 objects with the conversions is necessary.
2373
2374 It is often the case that one conversion is used more than once.  I.e.,
2375 there are several @code{iconv_open} calls for the same set of character
sets during one program run.  The @code{mbsrtowcs} et al.@: functions in
2377 the GNU C library also use the @code{iconv} functionality which
2378 increases the number of uses of the same functions even more.
2379
For this reason the modules do not get loaded exclusively for one
conversion.  Instead a module once loaded can be used by arbitrarily many
@code{iconv} or @code{mbsrtowcs} calls at the same time.  The splitting
of the information between conversion function specific information and
conversion data makes this possible.  The last section showed the two
data structures used to do this.
2386
This is of course also reflected in the interface and semantics of the
2388 functions the modules must provide.  There are three functions which
2389 must have the following names:
2390
2391 @table @code
2392 @item gconv_init
The @code{gconv_init} function initializes the conversion function
specific data structure.  This very same object is shared by all
users of this conversion and therefore no state information
about the conversion itself may be stored in here.  If a module
implements more than one conversion the @code{gconv_init} function will be
called multiple times.
2399
2400 @item gconv_end
The @code{gconv_end} function is responsible for freeing all resources
2402 allocated by the @code{gconv_init} function.  If there is nothing to do
2403 this function can be missing.  Special care must be taken if the module
2404 implements more than one conversion and the @code{gconv_init} function
2405 does not allocate the same resources for all conversions.
2406
2407 @item gconv
2408 This is the actual conversion function.  It is called to convert one
2409 block of text.  It gets passed the conversion step information
2410 initialized by @code{gconv_init} and the conversion data, specific to
2411 this use of the conversion functions.
2412 @end table
2413
There are three data types defined for the three module interface
functions and these define the interface.
2416
2417 @comment gconv.h
2418 @comment GNU
2419 @deftypevr {Data type} int (*gconv_init_fct) (struct gconv_step *)
2420 This specifies the interface of the initialization function of the
2421 module.  It is called exactly once for each conversion the module
2422 implements.
2423
As explained in the description of the @code{struct gconv_step} data
2425 structure above the initialization function has to initialize parts of
2426 it.
2427
2428 @table @code
2429 @item min_needed_from
2430 @itemx max_needed_from
2431 @itemx min_needed_to
2432 @itemx max_needed_to
2433 These elements must be initialized to the exact numbers of the minimum
2434 and maximum number of bytes used by one character in the source and
2435 destination character set respectively.  If the characters all have the
2436 same size the minimum and maximum values are the same.
2437
2438 @item stateful
This element must be initialized to a nonzero value if the source
2440 character set is stateful.  Otherwise it must be zero.
2441 @end table
2442
If the initialization function needs to communicate some information
to the conversion function this can happen using the @code{data} element
of the @code{gconv_step} structure.  But since this data is shared by
all users of the conversion it must not be modified by the conversion
function.  How this can be used is shown in the example below.
2448
@smallexample
#define MIN_NEEDED_FROM         1
#define MAX_NEEDED_FROM         4
#define MIN_NEEDED_TO           4
#define MAX_NEEDED_TO           4

int
gconv_init (struct gconv_step *step)
@{
  /* @r{Determine which direction.}  */
  struct iso2022jp_data *new_data;
  enum direction dir = illegal_dir;
  enum variant var = illegal_var;
  int result;

  if (__strcasecmp (step->from_name, "ISO-2022-JP//") == 0)
    @{
      dir = from_iso2022jp;
      var = iso2022jp;
    @}
  else if (__strcasecmp (step->to_name, "ISO-2022-JP//") == 0)
    @{
      dir = to_iso2022jp;
      var = iso2022jp;
    @}
  else if (__strcasecmp (step->from_name, "ISO-2022-JP-2//") == 0)
    @{
      dir = from_iso2022jp;
      var = iso2022jp2;
    @}
  else if (__strcasecmp (step->to_name, "ISO-2022-JP-2//") == 0)
    @{
      dir = to_iso2022jp;
      var = iso2022jp2;
    @}

  result = GCONV_NOCONV;
  if (dir != illegal_dir)
    @{
      new_data = (struct iso2022jp_data *)
        malloc (sizeof (struct iso2022jp_data));

      result = GCONV_NOMEM;
      if (new_data != NULL)
        @{
          new_data->dir = dir;
          new_data->var = var;
          step->data = new_data;

          if (dir == from_iso2022jp)
            @{
              step->min_needed_from = MIN_NEEDED_FROM;
              step->max_needed_from = MAX_NEEDED_FROM;
              step->min_needed_to = MIN_NEEDED_TO;
              step->max_needed_to = MAX_NEEDED_TO;
            @}
          else
            @{
              step->min_needed_from = MIN_NEEDED_TO;
              step->max_needed_from = MAX_NEEDED_TO;
              step->min_needed_to = MIN_NEEDED_FROM;
              step->max_needed_to = MAX_NEEDED_FROM + 2;
            @}

          /* @r{Yes, this is a stateful encoding.}  */
          step->stateful = 1;

          result = GCONV_OK;
        @}
    @}

  return result;
@}
@end smallexample
2523
The function first checks which conversion is wanted.  The module from
which this function is taken implements four different conversions and
2526 which one is selected can be determined by comparing the names.  The
2527 comparison should always be done without paying attention to the case.
2528
2529 Then a data structure is allocated which contains the necessary
2530 information about which conversion is selected.  The data structure
2531 @code{struct iso2022jp_data} is locally defined since outside the module
this data is not used at all.  Please note that if all four conversions
this module supports are requested there are four data blocks.
2534
2535 One interesting thing is the initialization of the @code{min_} and
2536 @code{max_} elements of the step data object.  A single ISO-2022-JP
2537 character can consist of one to four bytes.  Therefore the
2538 @code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
2539 this way.  The output is always the @code{INTERNAL} character set (aka
2540 UCS4) and therefore each character consists of exactly four bytes.  For
2541 the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
2542 account that escape sequences might be necessary to switch the character
2543 sets.  Therefore the @code{max_needed_to} element for this direction
gets assigned @code{MAX_NEEDED_FROM + 2}.  This takes into account the
two bytes needed for the escape sequences to signal the switching.  The
asymmetry in the maximum values for the two directions can be explained
easily: when reading ISO-2022-JP text escape sequences can be handled
alone.  I.e., it is not necessary to process a real character since the
effect of the escape sequence can be recorded in the state information.
The situation is different for the other direction.  Since it is in
general not known which character comes next one cannot emit escape
sequences to change the state in advance.  This means the escape
sequences have to be emitted together with the next character.
Therefore one needs more room than only for the character itself.
2555
2556 The possible return values of the initialization function are:
2557
2558 @table @code
2559 @item GCONV_OK
The initialization succeeded.
2561 @item GCONV_NOCONV
2562 The requested conversion is not supported in the module.  This can
2563 happen if the @file{gconv-modules} file has errors.
2564 @item GCONV_NOMEM
2565 Memory required to store additional information could not be allocated.
2566 @end table
2567 @end deftypevr
2568
The function called before the module is unloaded is significantly
simpler.  It often has nothing at all to do, in which case it can be left
out completely.
2572
2573 @comment gconv.h
2574 @comment GNU
2575 @deftypevr {Data type} void (*gconv_end_fct) (struct gconv_step *)
The task of this function is to free all resources allocated in the
2577 initialization function.  Therefore only the @code{data} element of the
2578 object pointed to by the argument is of interest.  Continuing the
2579 example from the initialization function, the finalization function
2580 looks like this:
2581
@smallexample
void
gconv_end (struct gconv_step *data)
@{
  free (data->data);
@}
@end smallexample
2589 @end deftypevr
2590
2591 The most important function of course is the conversion function itself.
2592 It can get quite complicated for complex character sets.  But since this
2593 is not of interest here we will only describe a possible skeleton for
2594 the conversion function.
2595
2596 @comment gconv.h
2597 @comment GNU
2598 @deftypevr {Data type} int (*gconv_fct) (struct gconv_step *, struct gconv_step_data *, const char **, const char *, size_t *, int)
The conversion function can be called for two basic reasons: to convert
text or to reset the state.  From the description of the @code{iconv}
function it can be seen why the flushing mode is necessary.  Which mode
is selected is determined by the sixth argument, an integer.  If it is
nonzero it means that flushing is selected.
2604
Common to both modes is where the output buffer can be found.  The
2606 information about this buffer is stored in the conversion step data.  A
2607 pointer to this is passed as the second argument to this function.  The
2608 description of the @code{struct gconv_step_data} structure has more
2609 information on this.
2610
2611 @cindex stateful
What has to be done for flushing depends on the source character set.
If it is not stateful nothing has to be done.  Otherwise the function
has to emit a byte sequence to bring the state object into the initial
state.  Once this all happened the other conversion modules in the chain
of conversions have to get the same chance.  Whether another step
follows can be determined from the @code{is_last} element of the step
data structure to which the second parameter points.
2619
The more interesting mode is when text actually has to be converted.
2621 The first step in this case is to convert as much text as possible from
2622 the input buffer and store the result in the output buffer.  The start
2623 of the input buffer is determined by the third argument which is a
2624 pointer to a pointer variable referencing the beginning of the buffer.
2625 The fourth argument is a pointer to the byte right after the last byte
2626 in the buffer.
2627
2628 The conversion has to be performed according to the current state if the
2629 character set is stateful.  The state is stored in an object pointed to
2630 by the @code{statep} element of the step data (second argument).  Once
2631 either the input buffer is empty or the output buffer is full the
2632 conversion stops.  At this point the pointer variable referenced by the
2633 third parameter must point to the byte following the last processed
2634 byte.  I.e., if all of the input is consumed this pointer and the fourth
2635 parameter have the same value.
2636
2637 What now happens depends on whether this step is the last one or not.
2638 If it is the last step the only thing which has to be done is to update
2639 the @code{outbuf} element of the step data structure to point after the
2640 last written byte.  This gives the caller the information on how much
text is available in the output buffer.  Besides this, the variable
2642 pointed to by the fifth parameter, which is of type @code{size_t}, must
2643 be incremented by the number of characters (@emph{not bytes}) which were
2644 written in the output buffer.  Then the function can return.
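
The bookkeeping for the last step can be sketched with a deliberately
simple conversion from ISO-8859-1 to four-byte values in host byte
order.  This is not code from the library.  The type
@code{struct step_data}, the constants @code{STEP_EMPTY_INPUT} and
@code{STEP_FULL_OUTPUT} and the function name are stand-ins invented for
this example so that it is self-contained, and the case where another
step follows is left out.

@smallexample
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the definitions in gconv.h.  */
enum @{ STEP_EMPTY_INPUT, STEP_FULL_OUTPUT @};

struct step_data
@{
  char *outbuf;
  char *outbufend;
  int is_last;
@};

int
latin1_to_internal (struct step_data *data, const char **inbuf,
                    const char *inbufend, size_t *written)
@{
  const char *in = *inbuf;
  char *out = data->outbuf;
  size_t count = 0;
  int status = STEP_EMPTY_INPUT;

  while (in < inbufend)
    @{
      uint32_t wc;

      /* Stop as soon as the next character no longer fits.  */
      if ((size_t) (data->outbufend - out) < sizeof wc)
        @{
          status = STEP_FULL_OUTPUT;
          break;
        @}

      wc = (unsigned char) *in++;
      memcpy (out, &wc, sizeof wc);
      out += sizeof wc;
      ++count;
    @}

  /* Point to the byte following the last processed input byte.  */
  *inbuf = in;

  if (data->is_last)
    @{
      /* Tell the caller how much output is available.  */
      data->outbuf = out;
      /* Count characters, not bytes.  */
      *written += count;
    @}

  return status;
@}
@end smallexample

Given three input bytes and an output buffer of eight bytes the function
converts two characters, leaves the input pointer on the third byte and
adds @math{2} to the variable referenced by the last argument.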
2645
2646 In case the step is not the last one the later conversion functions have
2647 to get a chance to do their work.  Therefore the appropriate conversion
2648 function has to be called.  The information about the functions is
2649 stored in the conversion data structures, passed as the first parameter.
2650 This information and the step data are stored in arrays so the next
2651 element in both cases can be found by simple pointer arithmetic:
2652
@smallexample
int
gconv (struct gconv_step *step, struct gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
@{
  struct gconv_step *next_step = step + 1;
  struct gconv_step_data *next_data = data + 1;
  ...
@end smallexample
2663
2664 The @code{next_step} pointer references the next step information and
2665 @code{next_data} the next data record.  The call of the next function
2666 therefore will look similar to this:
2667
@smallexample
  next_step->fct (next_step, next_data, &outerr, outbuf, written, 0)
@end smallexample
2671
2672 But this is not yet all.  Once the function call returns the conversion
2673 function might have some more to do.  If the return value of the
2674 function is @code{GCONV_EMPTY_INPUT} this means there is more room in
the output buffer.  Unless the input buffer is empty the conversion
function starts all over again and processes the rest of the input
2677 buffer.  If the return value is not @code{GCONV_EMPTY_INPUT} something
2678 went wrong and we have to recover from this.
2679
A requirement for the conversion function is that the input buffer
pointer (the third argument) always points to the last character which
was put in converted form into the output buffer.  This is trivially
true after the conversion performed in the current step.  But if the
conversion functions deeper down the stream stop prematurely, not all
characters from the output buffer are consumed, and therefore the input
buffer pointers must be backed off to the right position.
2687
This is easy to do if the input and output character sets have a fixed
width for all characters.  In this situation we can compute how many
characters are left in the output buffer and then correct the input
buffer pointer appropriately with a similar computation.  Things get
tricky if either character set has characters represented with
variable-length byte sequences, and it gets even more complicated if the
conversion has to take care of the state.  In these cases the conversion
has to be performed once again, from the known state before the initial
conversion.  That is, if necessary the state of the conversion has to be
reset and the conversion loop has to be executed again.  The difference
now is that it is known how much of the output the next step actually
consumed, so the conversion can stop before converting the first unused
character.  Once this is done the input buffer pointers must be updated
again and the function can return.
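
For the simple fixed-width case, the computation can be sketched as
follows.  The helper @code{demo_back_off} and its parameters
@code{in_width} and @code{out_width} (the fixed number of bytes per
character on each side) are hypothetical names introduced only for this
illustration.  Here @code{outbuf} is the end of the output this step
produced and @code{outerr} is the position up to which the next step
consumed it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the gconv interface.  Given the
   input pointer after a complete conversion round, return the input
   pointer that corresponds to the output position OUTERR.  Both
   character sets are assumed to have a fixed width.  */
static const char *
demo_back_off (const char *inbuf, const char *outbuf,
               const char *outerr, size_t in_width, size_t out_width)
{
  assert (outerr <= outbuf);
  assert (in_width > 0 && out_width > 0);

  /* Bytes this step produced which the next step did not consume.  */
  size_t unused_bytes = (size_t) (outbuf - outerr);

  /* With a fixed output width these must be whole characters.  */
  assert (unused_bytes % out_width == 0);
  size_t unused_chars = unused_bytes / out_width;

  /* Each unused output character came from IN_WIDTH input bytes.  */
  return inbuf - unused_chars * in_width;
}
```

For example, if eight one-byte characters were converted to four-byte
characters and the next step consumed only five of them, twelve output
bytes remain unused and the input pointer moves back by three bytes.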
2702
One final thing should be mentioned.  If the conversion needs to know
whether it is being invoked for the first time (in case a prolog has to
be emitted), the conversion function should, just before returning to
the caller, increment the @code{invocation_counter} element of the step
data structure.  See the description of the @code{struct
gconv_step_data} structure above for more information on how this can be
used.
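
One possible use of the counter is sketched below.  The structure
@code{demo_counted_data}, the function @code{demo_counted_step}, and the
two-byte prolog are assumptions made for this illustration; a real
module would emit whatever prolog its character set requires and would
report a full output buffer through its return value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct gconv_step_data.  */
struct demo_counted_data
{
  char *outbuf;            /* Next free byte in the output buffer.  */
  char *outbufend;         /* One past the end of the output buffer.  */
  int invocation_counter;  /* Number of completed calls so far.  */
};

/* Emit a two-byte prolog on the first invocation only and return the
   number of prolog bytes written.  */
static size_t
demo_counted_step (struct demo_counted_data *data)
{
  size_t prolog_len = 0;

  assert (data->outbuf <= data->outbufend);

  if (data->invocation_counter == 0)
    {
      /* No room: leave the counter alone so the prolog is retried.  */
      if (data->outbufend - data->outbuf < 2)
        return 0;

      /* Assumed prolog for the example: the bytes 0xFE 0xFF.  */
      data->outbuf[0] = (char) 0xfe;
      data->outbuf[1] = (char) 0xff;
      data->outbuf += 2;
      prolog_len = 2;
    }

  /* The actual conversion loop would run here.  */

  /* Just before returning, record that one use is finished.  */
  ++data->invocation_counter;
  return prolog_len;
}
```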
2710
2711 The return value must be one of the following values:
2712
2713 @table @code
2714 @item GCONV_EMPTY_INPUT
2715 All input was consumed and there is room left in the output buffer.
@item GCONV_FULL_OUTPUT
2717 No more room in the output buffer.  In case this is not the last step
2718 this value is propagated down from the call of the next conversion
2719 function in the chain.
2720 @item GCONV_INCOMPLETE_INPUT
2721 The input buffer is not entirely empty since it contains an incomplete
2722 character sequence.
2723 @end table
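
How a conversion loop might arrive at these three results is sketched
below for a toy conversion that swaps the two bytes of each two-byte
character.  The enumeration @code{demo_status} and the function
@code{demo_swap_loop} are hypothetical stand-ins for this illustration;
the real constants come from the library's own header.  For simplicity
the sketch reports empty input whenever the input is used up, without
checking how much room remains.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the GCONV_* return values.  */
enum demo_status
{
  DEMO_EMPTY_INPUT,
  DEMO_FULL_OUTPUT,
  DEMO_INCOMPLETE_INPUT
};

/* Swap the bytes of each two-byte input character.  */
static enum demo_status
demo_swap_loop (const char **inbuf, const char *inbufend,
                char **outbuf, char *outbufend)
{
  for (;;)
    {
      size_t in_left = (size_t) (inbufend - *inbuf);

      /* All input was consumed.  */
      if (in_left == 0)
        return DEMO_EMPTY_INPUT;

      /* Only part of a character is left in the input.  */
      if (in_left < 2)
        return DEMO_INCOMPLETE_INPUT;

      /* A whole character is waiting but does not fit.  */
      if (outbufend - *outbuf < 2)
        return DEMO_FULL_OUTPUT;

      (*outbuf)[0] = (*inbuf)[1];
      (*outbuf)[1] = (*inbuf)[0];
      *inbuf += 2;
      *outbuf += 2;
    }
}
```

In each case the input pointer is left at the first byte that was not
converted, which is what the caller needs in order to continue later.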
2724
The following example provides a framework for a conversion function.
To write a new conversion, only the holes in this implementation have to
be filled in.
2728
@smallexample
int
gconv (struct gconv_step *step, struct gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
@{
  struct gconv_step *next_step = step + 1;
  struct gconv_step_data *next_data = data + 1;
  gconv_fct fct = next_step->fct;
  int status;

  /* @r{If the function is called with no input this means we have}
     @r{to reset to the initial state.  The possibly partly}
     @r{converted input is dropped.}  */
  if (do_flush)
    @{
      status = GCONV_OK;

      /* @r{Possibly emit a byte sequence which puts the state object}
         @r{into the initial state.}  */

      /* @r{Call the steps down the chain if there are any but only}
         @r{if we successfully emitted the escape sequence.}  */
      if (status == GCONV_OK && ! data->is_last)
        status = fct (next_step, next_data, NULL, NULL,
                      written, 1);
    @}
  else
    @{
      /* @r{We preserve the initial values of the pointer variables.}  */
      const char *inptr = *inbuf;
      char *outbuf = data->outbuf;
      char *outend = data->outbufend;
      char *outptr;

      /* @r{This variable is used to count the number of characters}
         @r{we actually converted.}  */
      size_t converted = 0;

      do
        @{
          /* @r{Remember the start value for this round.}  */
          inptr = *inbuf;
          /* @r{The outbuf buffer is empty.}  */
          outptr = outbuf;

          /* @r{For stateful encodings the state must be saved here.}  */

          /* @r{Run the conversion loop.  @code{status} is set}
             @r{appropriately afterwards.}  */

          /* @r{If this is the last step leave the loop, there is}
             @r{nothing we can do.}  */
          if (data->is_last)
            @{
              /* @r{Store information about how many bytes are}
                 @r{available.}  */
              data->outbuf = outbuf;

              /* @r{Remember how many characters we converted.}  */
              *written += converted;

              break;
            @}

          /* @r{Write out all output which was produced.}  */
          if (outbuf > outptr)
            @{
              const char *outerr = data->outbuf;
              int result;

              result = fct (next_step, next_data, &outerr,
                            outbuf, written, 0);

              if (result != GCONV_EMPTY_INPUT)
                @{
                  if (outerr != outbuf)
                    @{
                      /* @r{Reset the input buffer pointer.  We}
                         @r{document here the complex case.}  */
                      size_t nstatus;

                      /* @r{Reload the pointers.}  */
                      *inbuf = inptr;
                      outbuf = outptr;

                      /* @r{Possibly reset the state.}  */

                      /* @r{Redo the conversion, but this time}
                         @r{the end of the output buffer is at}
                         @r{@code{outerr}.}  */
                    @}

                  /* @r{Change the status.}  */
                  status = result;
                @}
              else
                /* @r{All the output is consumed, we can make}
                   @r{another run if everything was ok.}  */
                if (status == GCONV_FULL_OUTPUT)
                  status = GCONV_OK;
            @}
        @}
      while (status == GCONV_OK);

      /* @r{We finished one use of this step.}  */
      ++data->invocation_counter;
    @}

  return status;
@}
@end smallexample
2841 @end deftypevr
2842
2843 This information should be sufficient to write new modules.  Anybody
2844 doing so should also take a look at the available source code in the GNU
2845 C library sources.  It contains many examples of working and optimized
2846 modules.