manual/mbyte.texi

   1 @node Extended Characters, Locales, String and Array Utilities, Top
   2 @chapter Extended Characters
   3
   4 A number of languages use character sets that are larger than the range
   5 of values of type @code{char}.  Japanese and Chinese are probably the
   6 most familiar examples.
   7
   8 The GNU C library includes support for two mechanisms for dealing with
   9 extended character sets: multibyte characters and wide characters.  This
  10 chapter describes how to use these mechanisms, and the functions for
  11 converting between them.
  12 @cindex extended character sets
  13
  14 The behavior of the functions in this chapter is affected by the current
  15 locale for character classification---the @code{LC_CTYPE} category; see
  16 @ref{Locale Categories}.  This choice of locale selects which multibyte
  17 code is used, and also controls the meanings and characteristics of wide
  18 character codes.
  19
  20 @menu
  21 * Extended Char Intro::         Multibyte codes versus wide characters.
  22 * Locales and Extended Chars::  The locale selects the character codes.
  23 * Multibyte Char Intro::        How multibyte codes are represented.
  24 * Wide Char Intro::             How wide characters are represented.
  25 * Wide String Conversion::      Converting wide strings to multibyte code
  26                                  and vice versa.
* Length of Char::              How many bytes make up one multibyte char.
  28 * Converting One Char::         Converting a string character by character.
  29 * Example of Conversion::       Example showing why converting
  30                                  one character at a time may be useful.
* Shift State::                 Multibyte codes with ``shift characters''.
  32 @end menu
  33
  34 @node Extended Char Intro, Locales and Extended Chars,  , Extended Characters
  35 @section Introduction to Extended Characters
  36
  37 You can represent extended characters in either of two ways:
  38
  39 @itemize @bullet
  40 @item
  41 As @dfn{multibyte characters} which can be embedded in an ordinary
  42 string, an array of @code{char} objects.  Their advantage is that many
  43 programs and operating systems can handle occasional multibyte
  44 characters scattered among ordinary ASCII characters, without any
  45 change.
  46
  47 @item
  48 @cindex wide characters
  49 As @dfn{wide characters}, which are like ordinary characters except that
  50 they occupy more bits.  The wide character data type, @code{wchar_t},
  51 has a range large enough to hold extended character codes as well as
  52 old-fashioned ASCII codes.
  53
  54 An advantage of wide characters is that each character is a single data
  55 object, just like ordinary ASCII characters.  There are a few
  56 disadvantages:
  57
  58 @itemize @bullet
  59 @item
  60 Each existing program must be modified and recompiled to make it use
  61 wide characters.
  62
  63 @item
  64 Files of wide characters cannot be read by programs that expect ordinary
  65 characters.
  66 @end itemize
  67 @end itemize
  68
Typically, you use the multibyte character representation as part of the
external program interface, such as reading or writing text to files.
Internally, however, it is usually more convenient to manipulate strings
containing extended characters as arrays of @code{wchar_t} objects,
since the uniform representation simplifies most editing operations.  If
you do use multibyte characters for files and wide characters for
internal operations, you need to convert between them when you read and
write data.
  77
  78 If your system supports extended characters, then it supports them both
  79 as multibyte characters and as wide characters.  The library includes
  80 functions you can use to convert between the two representations.
  81 These functions are described in this chapter.
  82
  83 @node Locales and Extended Chars, Multibyte Char Intro, Extended Char Intro, Extended Characters
  84 @section Locales and Extended Characters
  85
  86 A computer system can support more than one multibyte character code,
  87 and more than one wide character code.  The user controls the choice of
  88 codes through the current locale for character classification
  89 (@pxref{Locales}).  Each locale specifies a particular multibyte
  90 character code and a particular wide character code.  The choice of locale
  91 influences the behavior of the conversion functions in the library.
  92
  93 Some locales support neither wide characters nor nontrivial multibyte
  94 characters.  In these locales, the library conversion functions still
  95 work, even though what they do is basically trivial.
  96
  97 If you select a new locale for character classification, the internal
  98 shift state maintained by these functions can become confused, so it's
  99 not a good idea to change the locale while you are in the middle of
 100 processing a string.
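
One way to follow this advice is to select the locale once, at program
start, before any multibyte processing begins.  The following is a
minimal sketch; the helper name @code{init_ctype_locale} is hypothetical
and not part of the library.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: adopt the user's preferred LC_CTYPE locale
   once, before any multibyte string is processed.  Returns the name
   of the locale that is in effect afterwards.  */
const char *
init_ctype_locale (void)
{
  /* An empty name asks for the locale named by the environment.
     If that fails, setlocale leaves the current locale unchanged.  */
  (void) setlocale (LC_CTYPE, "");

  /* A null name queries the current setting without changing it.  */
  return setlocale (LC_CTYPE, NULL);
}
```

Calling such a helper first thing in @code{main}, and not touching
@code{LC_CTYPE} afterwards, keeps the shift state consistent for the
whole run.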
 101
 102 @node Multibyte Char Intro, Wide Char Intro, Locales and Extended Chars, Extended Characters
 103 @section Multibyte Characters
 104 @cindex multibyte characters
 105
In an ordinary single-byte code such as ASCII, a sequence of characters
is a sequence of bytes, and each character is one byte.  This is very
simple, but allows for at most 256 distinct characters.
 109
 110 In a @dfn{multibyte character code}, a sequence of characters is a
 111 sequence of bytes, but each character may occupy one or more consecutive
 112 bytes of the sequence.
 113
 114 @cindex basic byte sequence
 115 There are many different ways of designing a multibyte character code;
 116 different systems use different codes.  To specify a particular code
 117 means designating the @dfn{basic} byte sequences---those which represent
 118 a single character---and what characters they stand for.  A code that a
 119 computer can actually use must have a finite number of these basic
sequences, and typically none of them is more than a few bytes
long.
 122
 123 These sequences need not all have the same length.  In fact, many of
 124 them are just one byte long.  Because the basic ASCII characters in the
 125 range from @code{0} to @code{0177} are so important, they stand for
 126 themselves in all multibyte character codes.  That is to say, a byte
 127 whose value is @code{0} through @code{0177} is always a character in
 128 itself.  The characters which are more than one byte must always start
 129 with a byte in the range from @code{0200} through @code{0377}.
 130
 131 The byte value @code{0} can be used to terminate a string, just as it is
 132 often used in a string of ASCII characters.
 133
 134 Specifying the basic byte sequences that represent single characters
 135 automatically gives meanings to many longer byte sequences, as more than
one character.  For example, if the two-byte sequence @code{0205 061}
stands for the Greek letter alpha, then @code{0205 061 0101} must stand
for an alpha followed by an @samp{A} (ASCII code @code{0101}), and
@code{0205 061 0205 061} must stand for two alphas in a row.
 140
 141 If any byte sequence can have more than one meaning as a sequence of
 142 characters, then the multibyte code is ambiguous---and no good.  The
 143 codes that systems actually use are all unambiguous.
 144
 145 In most codes, there are certain sequences of bytes that have no meaning
 146 as a character or characters.  These are called @dfn{invalid}.
 147
 148 The simplest possible multibyte code is a trivial one:
 149
 150 @quotation
 151 The basic sequences consist of single bytes.
 152 @end quotation
 153
 154 This particular code is equivalent to not using multibyte characters at
 155 all.  It has no invalid sequences.  But it can handle only 256 different
 156 characters.
 157
 158 Here is another possible code which can handle 9376 different
 159 characters:
 160
 161 @quotation
 162 The basic sequences consist of
 163
 164 @itemize @bullet
 165 @item
 166 single bytes with values in the range @code{0} through @code{0237}.
 167
 168 @item
 169 two-byte sequences, in which both of the bytes have values in the range
 170 from @code{0240} through @code{0377}.
 171 @end itemize
 172 @end quotation
 173
 174 @noindent
 175 This code or a similar one is used on some systems to represent Japanese
 176 characters.  The invalid sequences are those which consist of an odd
 177 number of consecutive bytes in the range from @code{0240} through
 178 @code{0377}.
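
The validity rule for this example code can be sketched in a few lines
of C.  The function name @code{example_code_valid} is hypothetical; it
only illustrates the rule stated above and is not a library facility.

```c
#include <stddef.h>

/* Return 1 if the SIZE bytes at S are a valid string in the example
   code (single bytes 0..0237, or pairs of bytes 0240..0377), and 0
   otherwise.  */
int
example_code_valid (const unsigned char *s, size_t size)
{
  size_t i = 0;

  while (i < size)
    {
      if (s[i] <= 0237)
        i += 1;                 /* single-byte character */
      else if (i + 1 < size && s[i + 1] >= 0240)
        i += 2;                 /* two-byte character */
      else
        return 0;               /* high byte without a partner */
    }
  return 1;
}
```

A run of high bytes is consumed two at a time, so a run of odd length
always leaves one unpaired byte and is rejected.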
 179
 180 Here is another multibyte code which can handle more distinct extended
 181 characters---in fact, almost thirty million:
 182
 183 @quotation
 184 The basic sequences consist of
 185
 186 @itemize @bullet
 187 @item
 188 single bytes with values in the range @code{0} through @code{0177}.
 189
 190 @item
 191 sequences of up to four bytes in which the first byte is in the range
 192 from @code{0200} through @code{0237}, and the remaining bytes are in the
 193 range from @code{0240} through @code{0377}.
 194 @end itemize
 195 @end quotation
 196
 197 @noindent
 198 In this code, any sequence that starts with a byte in the range
 199 from @code{0240} through @code{0377} is invalid.
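
The character counts quoted for the last two codes can be checked with
a little arithmetic.  The sketch below assumes that ``sequences of up to
four bytes'' means lengths two through four; the function names are
hypothetical.

```c
/* Number of byte values in the inclusive range LO..HI.  */
static unsigned long
range_size (unsigned int lo, unsigned int hi)
{
  return (unsigned long) (hi - lo + 1);
}

/* Code with single bytes 0..0237 and pairs of bytes 0240..0377.  */
unsigned long
count_two_byte_code (void)
{
  unsigned long single = range_size (0, 0237);    /* 160 */
  unsigned long high = range_size (0240, 0377);   /* 96 */
  return single + high * high;                    /* 160 + 9216 */
}

/* Code with single bytes 0..0177, and sequences of two to four
   bytes: a lead byte 0200..0237 plus trail bytes 0240..0377.  */
unsigned long
count_four_byte_code (void)
{
  unsigned long total = range_size (0, 0177);     /* 128 */
  unsigned long lead = range_size (0200, 0237);   /* 32 */
  unsigned long trail = range_size (0240, 0377);  /* 96 */
  unsigned long tails = 1;
  int len;

  for (len = 2; len <= 4; len++)
    {
      tails *= trail;           /* 96, 96^2, 96^3 */
      total += lead * tails;
    }
  return total;
}
```

Under that assumption the first function yields 9376 and the second
28609664, which matches ``almost thirty million''.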
 200
 201 And here is another variant which has the advantage that removing the
 202 last byte or bytes from a valid character can never produce another
 203 valid character.  (This property is convenient when you want to search
 204 strings for particular characters.)
 205
 206 @quotation
 207 The basic sequences consist of
 208
 209 @itemize @bullet
 210 @item
 211 single bytes with values in the range @code{0} through @code{0177}.
 212
 213 @item
 214 two-byte sequences in which the first byte is in the range from
 215 @code{0200} through @code{0207}, and the second byte is in the range
 216 from @code{0240} through @code{0377}.
 217
 218 @item
 219 three-byte sequences in which the first byte is in the range from
 220 @code{0210} through @code{0217}, and the other bytes are in the range
 221 from @code{0240} through @code{0377}.
 222
 223 @item
 224 four-byte sequences in which the first byte is in the range from
 225 @code{0220} through @code{0227}, and the other bytes are in the range
 226 from @code{0240} through @code{0377}.
 227 @end itemize
 228 @end quotation
 229
 230 @noindent
 231 The list of invalid sequences for this code is long and not worth
 232 stating in full; examples of invalid sequences include @code{0240} and
 233 @code{0220 0300 065}.
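
Because the first byte alone determines the length in this variant, a
scanner for it is short.  The following is only a sketch of the rules
listed above, with a hypothetical function name.

```c
#include <stddef.h>

/* Return the length in bytes of the character that starts at S in
   the variant code, or -1 if the bytes are invalid or fewer than
   the required number are available within SIZE.  */
int
variant_char_length (const unsigned char *s, size_t size)
{
  size_t need, i;

  if (size == 0)
    return -1;
  if (s[0] <= 0177)
    return 1;                   /* ASCII stands for itself */
  else if (s[0] <= 0207)
    need = 2;
  else if (s[0] <= 0217)
    need = 3;
  else if (s[0] <= 0227)
    need = 4;
  else
    return -1;                  /* not a valid first byte */

  if (size < need)
    return -1;
  /* Every byte after the first must be in 0240..0377.  */
  for (i = 1; i < need; i++)
    if (s[i] < 0240)
      return -1;
  return (int) need;
}
```

Both invalid examples mentioned above are rejected: @code{0240} is not
a valid first byte, and @code{0220 0300 065} is cut short by a byte
outside the trail range.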
 234
 235 The number of @emph{possible} multibyte codes is astronomical.  But a
 236 given computer system will support at most a few different codes.  (One
 237 of these codes may allow for thousands of different characters.)
 238 Another computer system may support a completely different code.  The
 239 library facilities described in this chapter are helpful because they
 240 package up the knowledge of the details of a particular computer
 241 system's multibyte code, so your programs need not know them.
 242
Two standard macros report the relevant size limits: @code{MB_CUR_MAX}
gives the maximum number of bytes in a character in the currently
selected multibyte code, and @code{MB_LEN_MAX} gives the maximum for
@emph{any} multibyte code supported on your computer.
 247
 248 @comment limits.h
 249 @comment ANSI
 250 @deftypevr Macro int MB_LEN_MAX
 251 This is the maximum length of a multibyte character for any supported
 252 locale.  It is defined in @file{limits.h}.
 253 @pindex limits.h
 254 @end deftypevr
 255
 256 @comment stdlib.h
 257 @comment ANSI
 258 @deftypevr Macro int MB_CUR_MAX
 259 This macro expands into a (possibly non-constant) positive integer
 260 expression that is the maximum number of bytes in a multibyte character
 261 in the current locale.  The value is never greater than @code{MB_LEN_MAX}.
 262
 263 @pindex stdlib.h
 264 @code{MB_CUR_MAX} is defined in @file{stdlib.h}.
 265 @end deftypevr
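
The two macros suit different jobs: @code{MB_LEN_MAX} is a constant and
can size an array, while @code{MB_CUR_MAX} has to be evaluated at run
time.  A minimal sketch, with hypothetical helper names:

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* MB_CUR_MAX */

/* Hypothetical helper: bytes needed to hold COUNT characters of the
   current locale's multibyte code plus a terminating null byte.  */
size_t
mb_buffer_size (size_t count)
{
  /* MB_CUR_MAX may change when the locale changes, so it is
     evaluated on each call rather than cached.  */
  return count * MB_CUR_MAX + 1;
}

/* MB_LEN_MAX is a constant expression, so it may size a static
   array large enough for one character in any supported locale.  */
size_t
one_char_capacity (void)
{
  static char one_char[MB_LEN_MAX];
  return sizeof one_char;
}
```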
 266
 267 Normally, each basic sequence in a particular character code stands for
 268 one character, the same character regardless of context.  Some multibyte
 269 character codes have a concept of @dfn{shift state}; certain codes,
 270 called @dfn{shift sequences}, change to a different shift state, and the
 271 meaning of some or all basic sequences varies according to the current
 272 shift state.  In fact, the set of basic sequences might even be
 273 different depending on the current shift state.  @xref{Shift State}, for
 274 more information on handling this sort of code.
 275
 276 What happens if you try to pass a string containing multibyte characters
 277 to a function that doesn't know about them?  Normally, such a function
 278 treats a string as a sequence of bytes, and interprets certain byte
 279 values specially; all other byte values are ``ordinary''.  As long as a
 280 multibyte character doesn't contain any of the special byte values, the
 281 function should pass it through as if it were several ordinary
 282 characters.
 283
 284 For example, let's figure out what happens if you use multibyte
 285 characters in a file name.  The functions such as @code{open} and
 286 @code{unlink} that operate on file names treat the name as a sequence of
 287 byte values, with @samp{/} as the only special value.  Any other byte
 288 values are copied, or compared, in sequence, and all byte values are
 289 treated alike.  Thus, you may think of the file name as a sequence of
 290 bytes or as a string containing multibyte characters; the same behavior
 291 makes sense equally either way, provided no multibyte character contains
 292 a @samp{/}.
 293
 294 @node Wide Char Intro, Wide String Conversion, Multibyte Char Intro, Extended Characters
 295 @section Wide Character Introduction
 296
 297 @dfn{Wide characters} are much simpler than multibyte characters.  They
 298 are simply characters with more than eight bits, so that they have room
 299 for more than 256 distinct codes.  The wide character data type,
 300 @code{wchar_t}, has a range large enough to hold extended character
 301 codes as well as old-fashioned ASCII codes.
 302
 303 An advantage of wide characters is that each character is a single data
 304 object, just like ordinary ASCII characters.  Wide characters also have
 305 some disadvantages:
 306
 307 @itemize @bullet
 308 @item
 309 A program must be modified and recompiled in order to use wide
 310 characters at all.
 311
 312 @item
 313 Files of wide characters cannot be read by programs that expect ordinary
 314 characters.
 315 @end itemize
 316
 317 Wide character values @code{0} through @code{0177} are always identical
 318 in meaning to the ASCII character codes.  The wide character value zero
 319 is often used to terminate a string of wide characters, just as a single
 320 byte with value zero often terminates a string of ordinary characters.
 321
 322 @comment stddef.h
 323 @comment ANSI
 324 @deftp {Data Type} wchar_t
 325 This is the ``wide character'' type, an integer type whose range is
 326 large enough to represent all distinct values in any extended character
 327 set in the supported locales.  @xref{Locales}, for more information
 328 about locales.  This type is defined in the header file @file{stddef.h}.
 329 @pindex stddef.h
 330 @end deftp
 331
 332 If your system supports extended characters, then each extended
 333 character has both a wide character code and a corresponding multibyte
 334 basic sequence.
 335
 336 @cindex code, character
 337 @cindex character code
 338 In this chapter, the term @dfn{code} is used to refer to a single
 339 extended character object to emphasize the distinction from the
 340 @code{char} data type.
 341
 342 @node Wide String Conversion, Length of Char, Wide Char Intro, Extended Characters
 343 @section Conversion of Extended Strings
 344 @cindex extended strings, converting representations
 345 @cindex converting extended strings
 346
 347 @pindex stdlib.h
 348 The @code{mbstowcs} function converts a string of multibyte characters
 349 to a wide character array.  The @code{wcstombs} function does the
 350 reverse.  These functions are declared in the header file
 351 @file{stdlib.h}.
 352
 353 In most programs, these functions are the only ones you need for
 354 conversion between wide strings and multibyte character strings.  But
 355 they have limitations.  If your data is not null-terminated or is not
all in memory at once, you probably need to use the low-level conversion
 357 functions to convert one character at a time.  @xref{Converting One
 358 Char}.
 359
 360 @comment stdlib.h
 361 @comment ANSI
 362 @deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
 363 The @code{mbstowcs} (``multibyte string to wide character string'')
 364 function converts the null-terminated string of multibyte characters
 365 @var{string} to an array of wide character codes, storing not more than
 366 @var{size} wide characters into the array beginning at @var{wstring}.
The terminating null character counts towards the size, so if @var{size}
is less than or equal to the number of wide characters resulting from
@var{string}, no terminating null character is stored.
 370
 371 The conversion of characters from @var{string} begins in the initial
 372 shift state.
 373
If an invalid multibyte character sequence is found, this function
returns a value of @code{(size_t) -1}.  Otherwise, it returns the number of wide
 376 characters stored in the array @var{wstring}.  This number does not
 377 include the terminating null character, which is present if the number
 378 is less than @var{size}.
 379
 380 Here is an example showing how to convert a string of multibyte
 381 characters, allocating enough space for the result.
 382
 383 @smallexample
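wchar_t *
mbstowcs_alloc (const char *string)
@{
  /* @r{A string of N bytes converts to at most N wide characters,}
     @r{so this allocation is always large enough.}  */
  size_t size = strlen (string) + 1;
  wchar_t *buf = xmalloc (size * sizeof (wchar_t));

  size = mbstowcs (buf, string, size);
  if (size == (size_t) -1)
    @{
      /* @r{Invalid multibyte sequence: release the buffer.}  */
      free (buf);
      return NULL;
    @}
  /* @r{Shrink the buffer to the converted length plus the terminator.}  */
  buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
  return buf;
@}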
 396 @end smallexample
 397
 398 @end deftypefun
 399
 400 @comment stdlib.h
 401 @comment ANSI
@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
 403 The @code{wcstombs} (``wide character string to multibyte string'')
 404 function converts the null-terminated wide character array @var{wstring}
 405 into a string containing multibyte characters, storing not more than
 406 @var{size} bytes starting at @var{string}, followed by a terminating
 407 null character if there is room.  The conversion of characters begins in
 408 the initial shift state.
 409
 410 The terminating null character counts towards the size, so if @var{size}
is less than or equal to the number of bytes needed for @var{wstring}, no
 412 terminating null character is stored.
 413
 414 If a code that does not correspond to a valid multibyte character is
found, this function returns a value of @code{(size_t) -1}.  Otherwise, the
 416 return value is the number of bytes stored in the array @var{string}.
 417 This number does not include the terminating null character, which is
 418 present if the number is less than @var{size}.
 419 @end deftypefun
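
A counterpart to the @code{mbstowcs} example can be sketched for the
reverse direction.  The function name @code{wcstombs_alloc} is
hypothetical, and plain @code{malloc} is used so that the sketch stands
alone.

```c
#include <stdlib.h>

/* Hypothetical helper: convert the null-terminated wide string
   WSTRING to a newly allocated multibyte string.  Returns NULL if
   memory runs out or a wide character cannot be converted.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  size_t len = 0;
  size_t size, n;
  char *buf;

  /* Count the wide characters before the terminator.  */
  while (wstring[len] != 0)
    len++;

  /* Each wide character, and the terminator, needs at most
     MB_CUR_MAX bytes.  */
  size = (len + 1) * MB_CUR_MAX;
  buf = malloc (size);
  if (buf == NULL)
    return NULL;

  n = wcstombs (buf, wstring, size);
  if (n == (size_t) -1)
    {
      free (buf);
      return NULL;
    }
  return buf;
}
```

Because the buffer is sized for the worst case, the result is always
null-terminated; the caller releases it with @code{free}.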
 420
 421 @node Length of Char, Converting One Char, Wide String Conversion, Extended Characters
 422 @section Multibyte Character Length
 423 @cindex multibyte character, length of
 424 @cindex length of multibyte character
 425
 426 This section describes how to scan a string containing multibyte
 427 characters, one character at a time.  The difficulty in doing this
 428 is to know how many bytes each character contains.  Your program
 429 can use @code{mblen} to find this out.
 430
 431 @comment stdlib.h
 432 @comment ANSI
 433 @deftypefun int mblen (const char *@var{string}, size_t @var{size})
 434 The @code{mblen} function with a non-null @var{string} argument returns
 435 the number of bytes that make up the multibyte character beginning at
 436 @var{string}, never examining more than @var{size} bytes.  (The idea is
 437 to supply for @var{size} the number of bytes of data you have in hand.)
 438
The return value of @code{mblen} distinguishes three possibilities: the
first @var{size} bytes at @var{string} start with a valid multibyte
character, they start with an invalid byte sequence or just part of a
character, or @var{string} points to an empty string (a null character).
 443
 444 For a valid multibyte character, @code{mblen} returns the number of
 445 bytes in that character (always at least @code{1}, and never more than
 446 @var{size}).  For an invalid byte sequence, @code{mblen} returns
 447 @code{-1}.  For an empty string, it returns @code{0}.
 448
 449 If the multibyte character code uses shift characters, then @code{mblen}
maintains and updates a shift state as it scans.  If you call
@code{mblen} with a null pointer for @var{string}, that initializes the
shift state to its standard initial value.  Such a call returns nonzero
if the multibyte character code in use actually has a shift state, and
zero otherwise.  @xref{Shift State}.
 455
 456 @pindex stdlib.h
 457 The function @code{mblen} is declared in @file{stdlib.h}.
 458 @end deftypefun
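
As an illustration of these return values, here is a sketch that counts
the characters in a null-terminated multibyte string.  The function
name @code{count_mbchars} is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the number of multibyte characters
   in the null-terminated string S, or -1 if S contains an invalid
   or incomplete byte sequence.  */
long
count_mbchars (const char *s)
{
  size_t left = strlen (s);
  long count = 0;

  /* Start from the initial shift state.  */
  (void) mblen (NULL, 0);

  while (left > 0)
    {
      int len = mblen (s, left);
      /* No null byte lies within LEFT bytes, so a result that is
         not positive can only mean an invalid sequence.  */
      if (len <= 0)
        return -1;
      s += len;
      left -= (size_t) len;
      count++;
    }
  return count;
}
```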
 459
 460 @node Converting One Char, Example of Conversion, Length of Char, Extended Characters
 461 @section Conversion of Extended Characters One by One
 462 @cindex extended characters, converting
 463 @cindex converting extended characters
 464
 465 @pindex stdlib.h
 466 You can convert multibyte characters one at a time to wide characters
 467 with the @code{mbtowc} function.  The @code{wctomb} function does the
 468 reverse.  These functions are declared in @file{stdlib.h}.
 469
 470 @comment stdlib.h
 471 @comment ANSI
 472 @deftypefun int mbtowc (wchar_t *@var{result}, const char *@var{string}, size_t @var{size})
 473 The @code{mbtowc} (``multibyte to wide character'') function when called
 474 with non-null @var{string} converts the first multibyte character
 475 beginning at @var{string} to its corresponding wide character code.  It
 476 stores the result in @code{*@var{result}}.
 477
 478 @code{mbtowc} never examines more than @var{size} bytes.  (The idea is
 479 to supply for @var{size} the number of bytes of data you have in hand.)
 480
@code{mbtowc} with non-null @var{string} distinguishes three
possibilities: the first @var{size} bytes at @var{string} start with a
valid multibyte character, they start with an invalid byte sequence or
just part of a character, or @var{string} points to an empty string (a
null character).
 486
 487 For a valid multibyte character, @code{mbtowc} converts it to a wide
 488 character and stores that in @code{*@var{result}}, and returns the
 489 number of bytes in that character (always at least @code{1}, and never
 490 more than @var{size}).
 491
 492 For an invalid byte sequence, @code{mbtowc} returns @code{-1}.  For an
 493 empty string, it returns @code{0}, also storing @code{0} in
 494 @code{*@var{result}}.
 495
 496 If the multibyte character code uses shift characters, then
@code{mbtowc} maintains and updates a shift state as it scans.  If you
call @code{mbtowc} with a null pointer for @var{string}, that
initializes the shift state to its standard initial value.  Such a call
returns nonzero if the multibyte character code in use actually has a
shift state, and zero otherwise.  @xref{Shift State}.
 502 @end deftypefun
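
The three outcomes described above map directly onto a conversion loop.
The sketch below converts an in-memory string one character at a time;
the name @code{convert_by_char} and its argument order are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the null-terminated multibyte string
   SRC into DST, which has room for MAX wide characters including
   the terminator.  Returns the number of wide characters stored,
   not counting the terminator, or -1 on an invalid sequence or if
   DST is too small.  */
long
convert_by_char (wchar_t *dst, size_t max, const char *src)
{
  size_t left = strlen (src) + 1;   /* include the null byte */
  size_t n = 0;

  /* Start from the initial shift state.  */
  (void) mbtowc (NULL, NULL, 0);

  while (n < max)
    {
      int len = mbtowc (&dst[n], src, left);
      if (len == -1)
        return -1;              /* invalid byte sequence */
      if (len == 0)
        return (long) n;        /* terminator now stored in dst[n] */
      src += len;
      left -= (size_t) len;
      n++;
    }
  return -1;                    /* no room left for the terminator */
}
```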
 503
 504 @comment stdlib.h
 505 @comment ANSI
 506 @deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
 507 The @code{wctomb} (``wide character to multibyte'') function converts
 508 the wide character code @var{wchar} to its corresponding multibyte
 509 character sequence, and stores the result in bytes starting at
@var{string}.  At most @code{MB_CUR_MAX} bytes are stored.
 511
 512 @code{wctomb} with non-null @var{string} distinguishes three
 513 possibilities for @var{wchar}: a valid wide character code (one that can
 514 be translated to a multibyte character), an invalid code, and @code{0}.
 515
 516 Given a valid code, @code{wctomb} converts it to a multibyte character,
 517 storing the bytes starting at @var{string}.  Then it returns the number
 518 of bytes in that character (always at least @code{1}, and never more
 519 than @code{MB_CUR_MAX}).
 520
If @var{wchar} is an invalid wide character code, @code{wctomb} returns
@code{-1}.  If @var{wchar} is @code{0}, it stores a null byte (preceded
by any shift sequence needed to return to the initial shift state) and
returns the number of bytes stored, which is at least @code{1}.

If the multibyte character code uses shift characters, then
@code{wctomb} maintains and updates a shift state as it scans.  If you
call @code{wctomb} with a null pointer for @var{string}, that
initializes the shift state to its standard initial value.  Such a call
returns nonzero if the multibyte character code in use actually has a
shift state, and zero otherwise.  @xref{Shift State}.

Calling this function with a @var{wchar} argument of zero when
@var{string} is not null has the side effect of reinitializing the
stored shift state @emph{as well as} storing the null byte and
returning the number of bytes stored.
 536 @end deftypefun
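
A loop built on @code{wctomb} might look like the following sketch.  It
relies on the ISO C rule that converting the null wide character stores
a null byte and reports at least one byte.  The name
@code{encode_by_char} and its argument order are hypothetical.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the null-terminated wide string WS
   into DST, which holds SIZE bytes.  Returns the number of bytes
   stored before the terminating null byte, or -1 if a character
   cannot be converted or DST is too small.  */
long
encode_by_char (char *dst, size_t size, const wchar_t *ws)
{
  char tmp[MB_LEN_MAX];         /* room for one character */
  size_t used = 0;

  /* Start from the initial shift state.  */
  (void) wctomb (NULL, 0);

  for (;; ws++)
    {
      int len = wctomb (tmp, *ws);
      if (len == -1 || used + (size_t) len > size)
        return -1;
      memcpy (dst + used, tmp, (size_t) len);
      used += (size_t) len;
      /* The terminator's bytes end with the null byte itself.  */
      if (*ws == 0)
        return (long) used - 1;
    }
}
```

Converting into a temporary array of @code{MB_LEN_MAX} bytes first
means the destination is never written past its end.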
 537
 538 @node Example of Conversion, Shift State, Converting One Char, Extended Characters
 539 @section Character-by-Character Conversion Example
 540
 541 Here is an example that reads multibyte character text from descriptor
 542 @code{input} and writes the corresponding wide characters to descriptor
 543 @code{output}.  We need to convert characters one by one for this
 544 example because @code{mbstowcs} is unable to continue past a null
 545 character, and cannot cope with an apparently invalid partial character
 546 by reading more input.
 547
 548 @smallexample
int
file_mbstowcs (int input, int output)
@{
  /* @r{Leave room for a partial character carried over from the}
     @r{previous read.}  */
  char buffer[BUFSIZ + MB_LEN_MAX];
  int filled = 0;
  int eof = 0;

  while (!eof)
    @{
      int nread;
      int nwrite;
      char *inp = buffer;
      /* @r{Each input byte yields at most one wide character.}  */
      wchar_t outbuf[BUFSIZ + MB_LEN_MAX];
      wchar_t *outp = outbuf;

      /* @r{Fill up the buffer from the input file.}  */
      nread = read (input, buffer + filled, BUFSIZ);
      if (nread < 0)
        @{
          perror ("read");
          return 0;
        @}
      /* @r{If we reach end of file, make a note to read no more.} */
      if (nread == 0)
        eof = 1;

      /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
      filled += nread;

      /* @r{Convert those bytes to wide characters, as many as we can.} */
      while (filled > 0)
        @{
          int thislen = mbtowc (outp, inp, filled);
          /* @r{Stop converting at invalid character;}
             @r{this can mean we have read just the first part}
             @r{of a valid character.}  */
          if (thislen == -1)
            break;
          /* @r{Treat null character like any other,}
             @r{but also reset shift state.} */
          if (thislen == 0)
            @{
              thislen = 1;
              mbtowc (NULL, NULL, 0);
            @}
          /* @r{Advance past this character.} */
          inp += thislen;
          filled -= thislen;
          outp++;
        @}

      /* @r{Write the wide characters we just made.}  */
      nwrite = write (output, outbuf,
                      (outp - outbuf) * sizeof (wchar_t));
      if (nwrite < 0)
        @{
          perror ("write");
          return 0;
        @}

      /* @r{See if we have a @emph{real} invalid character.} */
      if ((eof && filled > 0) || filled >= (int) MB_CUR_MAX)
        @{
          error (0, 0, "invalid multibyte character");
          return 0;
        @}

      /* @r{If any bytes must be carried forward,}
         @r{move them to the beginning of @code{buffer}.} */
      if (filled > 0)
        memmove (buffer, inp, filled);
    @}

  return 1;
@}
 624 @end smallexample
 625
 626 @node Shift State,  , Example of Conversion, Extended Characters
 627 @section Multibyte Codes Using Shift Sequences
 628
 629 In some multibyte character codes, the @emph{meaning} of any particular
 630 byte sequence is not fixed; it depends on what other sequences have come
 631 earlier in the same string.  Typically there are just a few sequences
 632 that can change the meaning of other sequences; these few are called
 633 @dfn{shift sequences} and we say that they set the @dfn{shift state} for
 634 other sequences that follow.
 635
 636 To illustrate shift state and shift sequences, suppose we decide that
 637 the sequence @code{0200} (just one byte) enters Japanese mode, in which
 638 pairs of bytes in the range from @code{0240} to @code{0377} are single
 639 characters, while @code{0201} enters Latin-1 mode, in which single bytes
 640 in the range from @code{0240} to @code{0377} are characters, and
 641 interpreted according to the ISO Latin-1 character set.  This is a
 642 multibyte code which has two alternative shift states (``Japanese mode''
 643 and ``Latin-1 mode''), and two shift sequences that specify particular
 644 shift states.
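
To make the illustration concrete, here is a sketch of a scanner for
this made-up code.  It assumes that the initial shift state is Latin-1
mode and that bytes @code{0202} through @code{0237} are unassigned;
neither detail is fixed by the description above, and all names are
hypothetical.

```c
#include <stddef.h>

enum toy_mode { TOY_LATIN1, TOY_JAPANESE };

/* Count the characters in the SIZE bytes at S, interpreted in the
   toy code.  Returns -1 for an invalid sequence.  */
long
toy_count_chars (const unsigned char *s, size_t size)
{
  enum toy_mode mode = TOY_LATIN1;   /* assumed initial state */
  long count = 0;
  size_t i = 0;

  while (i < size)
    {
      unsigned char b = s[i];

      if (b == 0200)
        { mode = TOY_JAPANESE; i++; }    /* shift sequence */
      else if (b == 0201)
        { mode = TOY_LATIN1; i++; }      /* shift sequence */
      else if (b < 0200)
        { count++; i++; }                /* ASCII in either mode */
      else if (b < 0240)
        return -1;                       /* assumed unassigned */
      else if (mode == TOY_LATIN1)
        { count++; i++; }                /* one byte per character */
      else if (i + 1 < size && s[i + 1] >= 0240)
        { count++; i += 2; }             /* two bytes per character */
      else
        return -1;                       /* unpaired byte */
    }
  return count;
}
```

The same pair of bytes counts as two characters in Latin-1 mode and as
one character in Japanese mode, which is exactly why the scanner has to
carry the mode along as it moves through the string.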
 645
 646 When the multibyte character code in use has shift states, then
 647 @code{mblen}, @code{mbtowc} and @code{wctomb} must maintain and update
 648 the current shift state as they scan the string.  To make this work
 649 properly, you must follow these rules:
 650
 651 @itemize @bullet
 652 @item
 653 Before starting to scan a string, call the function with a null pointer
 654 for the multibyte character address---for example, @code{mblen (NULL,
 655 0)}.  This initializes the shift state to its standard initial value.
 656
 657 @item
 658 Scan the string one character at a time, in order.  Do not ``back up''
 659 and rescan characters already scanned, and do not intersperse the
 660 processing of different strings.
 661 @end itemize
 662
 663 Here is an example of using @code{mblen} following these rules:
 664
 665 @smallexample
void
scan_string (const char *s)
@{
  size_t length = strlen (s);

  /* @r{Initialize shift state.} */
  mblen (NULL, 0);

  while (1)
    @{
      int thischar = mblen (s, length);
      /* @r{Deal with end of string and invalid characters.} */
      if (thischar == 0)
        break;
      if (thischar == -1)
        @{
          error (0, 0, "invalid multibyte character");
          break;
        @}
      /* @r{Advance past this character.} */
      s += thischar;
      length -= thischar;
    @}
@}
 690 @end smallexample
 691
 692 The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
 693 reentrant when using a multibyte code that uses a shift state.  However,
 694 no other library functions call these functions, so you don't have to
 695 worry that the shift state will be changed mysteriously.