   1 @node Extended Characters, Locales, String and Array Utilities, Top
   2 @c %MENU% Support for extended character sets
   3 @chapter Extended Characters
   4
   5 A number of languages use character sets that are larger than the range
   6 of values of type @code{char}.  Japanese and Chinese are probably the
   7 most familiar examples.
   8
   9 The GNU C library includes support for two mechanisms for dealing with
  10 extended character sets: multibyte characters and wide characters.  This
  11 chapter describes how to use these mechanisms, and the functions for
  12 converting between them.
  13 @cindex extended character sets
  14
  15 The behavior of the functions in this chapter is affected by the current
  16 locale for character classification---the @code{LC_CTYPE} category; see
  17 @ref{Locale Categories}.  This choice of locale selects which multibyte
  18 code is used, and also controls the meanings and characteristics of wide
  19 character codes.
  20
@menu
* Extended Char Intro::         Multibyte codes versus wide characters.
* Locales and Extended Chars::  The locale selects the character codes.
* Multibyte Char Intro::        How multibyte codes are represented.
* Wide Char Intro::             How wide characters are represented.
* Wide String Conversion::      Converting wide strings to multibyte code
                                 and vice versa.
* Length of Char::              How many bytes make up one multibyte char.
* Converting One Char::         Converting a string character by character.
* Example of Conversion::       Example showing why converting
                                 one character at a time may be useful.
* Shift State::                 Multibyte codes with "shift characters".
@end menu
  34
  35 @node Extended Char Intro, Locales and Extended Chars,  , Extended Characters
  36 @section Introduction to Extended Characters
  37
  38 You can represent extended characters in either of two ways:
  39
  40 @itemize @bullet
  41 @item
  42 As @dfn{multibyte characters} which can be embedded in an ordinary
  43 string, an array of @code{char} objects.  Their advantage is that many
  44 programs and operating systems can handle occasional multibyte
  45 characters scattered among ordinary ASCII characters, without any
  46 change.
  47
  48 @item
  49 @cindex wide characters
  50 As @dfn{wide characters}, which are like ordinary characters except that
  51 they occupy more bits.  The wide character data type, @code{wchar_t},
  52 has a range large enough to hold extended character codes as well as
  53 old-fashioned ASCII codes.
  54
  55 An advantage of wide characters is that each character is a single data
  56 object, just like ordinary ASCII characters.  There are a few
  57 disadvantages:
  58
  59 @itemize @bullet
  60 @item
  61 Each existing program must be modified and recompiled to make it use
  62 wide characters.
  63
  64 @item
  65 Files of wide characters cannot be read by programs that expect ordinary
  66 characters.
  67 @end itemize
  68 @end itemize
  69
Typically, you use the multibyte character representation as part of the
external program interface, such as reading or writing text to files.
However, it's usually easier to perform internal manipulations of
strings containing extended characters using arrays of @code{wchar_t}
objects, since the uniform representation makes most editing operations
easier.  If you do use multibyte characters for files and wide
characters for internal operations, you need to convert between them
when you read and write data.
  78
  79 If your system supports extended characters, then it supports them both
  80 as multibyte characters and as wide characters.  The library includes
  81 functions you can use to convert between the two representations.
  82 These functions are described in this chapter.
  83
  84 @node Locales and Extended Chars, Multibyte Char Intro, Extended Char Intro, Extended Characters
  85 @section Locales and Extended Characters
  86
  87 A computer system can support more than one multibyte character code,
  88 and more than one wide character code.  The user controls the choice of
  89 codes through the current locale for character classification
  90 (@pxref{Locales}).  Each locale specifies a particular multibyte
  91 character code and a particular wide character code.  The choice of locale
  92 influences the behavior of the conversion functions in the library.
  93
  94 Some locales support neither wide characters nor nontrivial multibyte
  95 characters.  In these locales, the library conversion functions still
  96 work, even though what they do is basically trivial.
  97
  98 If you select a new locale for character classification, the internal
  99 shift state maintained by these functions can become confused, so it's
 100 not a good idea to change the locale while you are in the middle of
 101 processing a string.
 102
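As a minimal sketch of this advice, a program might select the locale
for character classification once, before it starts converting, and put
the conversion functions into their initial shift state at that point.
The helper name @code{select_ctype_locale} is hypothetical and not part
of the library.

@smallexample
#include <locale.h>
#include <stdlib.h>

/* @r{Hypothetical helper: select @var{name} as the @code{LC_CTYPE} locale.}
   @r{Return 1 on success, 0 if the locale is not available.} */
int
select_ctype_locale (const char *name)
@{
  if (setlocale (LC_CTYPE, name) == NULL)
    return 0;
  /* @r{Start each conversion function in the initial shift state.} */
  mblen (NULL, 0);
  mbtowc (NULL, NULL, 0);
  wctomb (NULL, 0);
  return 1;
@}
@end smallexample
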
 103 @node Multibyte Char Intro, Wide Char Intro, Locales and Extended Chars, Extended Characters
 104 @section Multibyte Characters
 105 @cindex multibyte characters
 106
In an ordinary single-byte code such as ASCII, a sequence of characters
is a sequence of bytes, and each character is one byte.  This is very
simple, but allows for at most 256 distinct characters (ASCII itself
defines only 128).
 110
 111 In a @dfn{multibyte character code}, a sequence of characters is a
 112 sequence of bytes, but each character may occupy one or more consecutive
 113 bytes of the sequence.
 114
 115 @cindex basic byte sequence
There are many different ways of designing a multibyte character code;
different systems use different codes.  To specify a particular code
means designating the @dfn{basic} byte sequences---those which represent
a single character---and what characters they stand for.  A code that a
computer can actually use must have a finite number of these basic
sequences, and typically none of them is more than a few bytes long.
 123
 124 These sequences need not all have the same length.  In fact, many of
 125 them are just one byte long.  Because the basic ASCII characters in the
 126 range from @code{0} to @code{0177} are so important, they stand for
 127 themselves in all multibyte character codes.  That is to say, a byte
 128 whose value is @code{0} through @code{0177} is always a character in
 129 itself.  The characters which are more than one byte must always start
 130 with a byte in the range from @code{0200} through @code{0377}.
 131
 132 The byte value @code{0} can be used to terminate a string, just as it is
 133 often used in a string of ASCII characters.
 134
Specifying the basic byte sequences that represent single characters
automatically gives meanings to many longer byte sequences, as more than
one character.  For example, if the two-byte sequence @code{0205 061}
stands for the Greek letter alpha, then @code{0205 061 0101} must stand
for an alpha followed by an @samp{A} (ASCII code @code{0101}), and
@code{0205 061 0205 061} must stand for two alphas in a row.
 141
 142 If any byte sequence can have more than one meaning as a sequence of
 143 characters, then the multibyte code is ambiguous---and no good.  The
 144 codes that systems actually use are all unambiguous.
 145
 146 In most codes, there are certain sequences of bytes that have no meaning
 147 as a character or characters.  These are called @dfn{invalid}.
 148
 149 The simplest possible multibyte code is a trivial one:
 150
 151 @quotation
 152 The basic sequences consist of single bytes.
 153 @end quotation
 154
 155 This particular code is equivalent to not using multibyte characters at
 156 all.  It has no invalid sequences.  But it can handle only 256 different
 157 characters.
 158
 159 Here is another possible code which can handle 9376 different
 160 characters:
 161
 162 @quotation
 163 The basic sequences consist of
 164
 165 @itemize @bullet
 166 @item
 167 single bytes with values in the range @code{0} through @code{0237}.
 168
 169 @item
 170 two-byte sequences, in which both of the bytes have values in the range
 171 from @code{0240} through @code{0377}.
 172 @end itemize
 173 @end quotation
 174
 175 @noindent
 176 This code or a similar one is used on some systems to represent Japanese
 177 characters.  The invalid sequences are those which consist of an odd
 178 number of consecutive bytes in the range from @code{0240} through
 179 @code{0377}.
 180
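The figure of 9376 follows from the definition: there are 160
single-byte values (@code{0} through @code{0237}) and 96 times 96 = 9216
two-byte sequences, and 160 + 9216 = 9376.  The rule for invalid
sequences can be sketched as a small checker.  This is only an
illustration of the example code above, not a library facility, and the
function name @code{example_code_valid} is hypothetical.

@smallexample
#include <stddef.h>

/* @r{Return 1 if the @var{n} bytes at @var{s} are valid in the example}
   @r{code described above, 0 otherwise.} */
int
example_code_valid (const unsigned char *s, size_t n)
@{
  size_t i = 0;

  while (i < n)
    @{
      if (s[i] <= 0237)
        i += 1;               /* @r{Single-byte character.} */
      else if (i + 1 < n && s[i + 1] >= 0240)
        i += 2;               /* @r{Two-byte character.} */
      else
        return 0;             /* @r{Unpaired byte from the upper range.} */
    @}
  return 1;
@}
@end smallexample
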
 181 Here is another multibyte code which can handle more distinct extended
 182 characters---in fact, almost thirty million:
 183
 184 @quotation
 185 The basic sequences consist of
 186
 187 @itemize @bullet
 188 @item
 189 single bytes with values in the range @code{0} through @code{0177}.
 190
 191 @item
 192 sequences of up to four bytes in which the first byte is in the range
 193 from @code{0200} through @code{0237}, and the remaining bytes are in the
 194 range from @code{0240} through @code{0377}.
 195 @end itemize
 196 @end quotation
 197
 198 @noindent
 199 In this code, any sequence that starts with a byte in the range
 200 from @code{0240} through @code{0377} is invalid.
 201
 202 And here is another variant which has the advantage that removing the
 203 last byte or bytes from a valid character can never produce another
 204 valid character.  (This property is convenient when you want to search
 205 strings for particular characters.)
 206
 207 @quotation
 208 The basic sequences consist of
 209
 210 @itemize @bullet
 211 @item
 212 single bytes with values in the range @code{0} through @code{0177}.
 213
 214 @item
 215 two-byte sequences in which the first byte is in the range from
 216 @code{0200} through @code{0207}, and the second byte is in the range
 217 from @code{0240} through @code{0377}.
 218
 219 @item
 220 three-byte sequences in which the first byte is in the range from
 221 @code{0210} through @code{0217}, and the other bytes are in the range
 222 from @code{0240} through @code{0377}.
 223
 224 @item
 225 four-byte sequences in which the first byte is in the range from
 226 @code{0220} through @code{0227}, and the other bytes are in the range
 227 from @code{0240} through @code{0377}.
 228 @end itemize
 229 @end quotation
 230
 231 @noindent
 232 The list of invalid sequences for this code is long and not worth
 233 stating in full; examples of invalid sequences include @code{0240} and
 234 @code{0220 0300 065}.
 235
 236 The number of @emph{possible} multibyte codes is astronomical.  But a
 237 given computer system will support at most a few different codes.  (One
 238 of these codes may allow for thousands of different characters.)
 239 Another computer system may support a completely different code.  The
 240 library facilities described in this chapter are helpful because they
 241 package up the knowledge of the details of a particular computer
 242 system's multibyte code, so your programs need not know them.
 243
Two standard macros tell you how long a multibyte character can be:
@code{MB_CUR_MAX} gives the maximum number of bytes in a character in
the currently selected multibyte code, and @code{MB_LEN_MAX} gives the
maximum for @emph{any} multibyte code supported on your computer.
 248
 249 @comment limits.h
 250 @comment ISO
 251 @deftypevr Macro int MB_LEN_MAX
 252 This is the maximum length of a multibyte character for any supported
 253 locale.  It is defined in @file{limits.h}.
 254 @pindex limits.h
 255 @end deftypevr
 256
 257 @comment stdlib.h
 258 @comment ISO
 259 @deftypevr Macro int MB_CUR_MAX
 260 This macro expands into a (possibly non-constant) positive integer
 261 expression that is the maximum number of bytes in a multibyte character
 262 in the current locale.  The value is never greater than @code{MB_LEN_MAX}.
 263
 264 @pindex stdlib.h
 265 @code{MB_CUR_MAX} is defined in @file{stdlib.h}.
 266 @end deftypevr
 267
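One common use of these macros is sizing buffers.  The following sketch
computes a conservative upper bound on the bytes needed to hold a wide
string converted in the current locale, allowing @code{MB_CUR_MAX} bytes
for each character and for the terminator.  The function name
@code{mb_worst_case_size} is hypothetical.

@smallexample
#include <limits.h>
#include <stdlib.h>

/* @r{Upper bound on the bytes needed for @var{nchars} wide characters}
   @r{plus a terminating null, in the current locale.} */
size_t
mb_worst_case_size (size_t nchars)
@{
  return (nchars + 1) * MB_CUR_MAX;
@}
@end smallexample
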
 268 Normally, each basic sequence in a particular character code stands for
 269 one character, the same character regardless of context.  Some multibyte
 270 character codes have a concept of @dfn{shift state}; certain codes,
 271 called @dfn{shift sequences}, change to a different shift state, and the
 272 meaning of some or all basic sequences varies according to the current
 273 shift state.  In fact, the set of basic sequences might even be
 274 different depending on the current shift state.  @xref{Shift State}, for
 275 more information on handling this sort of code.
 276
 277 What happens if you try to pass a string containing multibyte characters
 278 to a function that doesn't know about them?  Normally, such a function
 279 treats a string as a sequence of bytes, and interprets certain byte
 280 values specially; all other byte values are ``ordinary''.  As long as a
 281 multibyte character doesn't contain any of the special byte values, the
 282 function should pass it through as if it were several ordinary
 283 characters.
 284
 285 For example, let's figure out what happens if you use multibyte
 286 characters in a file name.  The functions such as @code{open} and
 287 @code{unlink} that operate on file names treat the name as a sequence of
 288 byte values, with @samp{/} as the only special value.  Any other byte
 289 values are copied, or compared, in sequence, and all byte values are
 290 treated alike.  Thus, you may think of the file name as a sequence of
 291 bytes or as a string containing multibyte characters; the same behavior
 292 makes sense equally either way, provided no multibyte character contains
 293 a @samp{/}.
 294
 295 @node Wide Char Intro, Wide String Conversion, Multibyte Char Intro, Extended Characters
 296 @section Wide Character Introduction
 297
 298 @dfn{Wide characters} are much simpler than multibyte characters.  They
 299 are simply characters with more than eight bits, so that they have room
 300 for more than 256 distinct codes.  The wide character data type,
 301 @code{wchar_t}, has a range large enough to hold extended character
 302 codes as well as old-fashioned ASCII codes.
 303
 304 An advantage of wide characters is that each character is a single data
 305 object, just like ordinary ASCII characters.  Wide characters also have
 306 some disadvantages:
 307
 308 @itemize @bullet
 309 @item
 310 A program must be modified and recompiled in order to use wide
 311 characters at all.
 312
 313 @item
 314 Files of wide characters cannot be read by programs that expect ordinary
 315 characters.
 316 @end itemize
 317
 318 Wide character values @code{0} through @code{0177} are always identical
 319 in meaning to the ASCII character codes.  The wide character value zero
 320 is often used to terminate a string of wide characters, just as a single
 321 byte with value zero often terminates a string of ordinary characters.
 322
 323 @comment stddef.h
 324 @comment ISO
 325 @deftp {Data Type} wchar_t
 326 This is the ``wide character'' type, an integer type whose range is
 327 large enough to represent all distinct values in any extended character
 328 set in the supported locales.  @xref{Locales}, for more information
 329 about locales.  This type is defined in the header file @file{stddef.h}.
 330 @pindex stddef.h
 331 @end deftp
 332
 333 If your system supports extended characters, then each extended
 334 character has both a wide character code and a corresponding multibyte
 335 basic sequence.
 336
 337 @cindex code, character
 338 @cindex character code
 339 In this chapter, the term @dfn{code} is used to refer to a single
 340 extended character object to emphasize the distinction from the
 341 @code{char} data type.
 342
 343 @node Wide String Conversion, Length of Char, Wide Char Intro, Extended Characters
 344 @section Conversion of Extended Strings
 345 @cindex extended strings, converting representations
 346 @cindex converting extended strings
 347
 348 @pindex stdlib.h
 349 The @code{mbstowcs} function converts a string of multibyte characters
 350 to a wide character array.  The @code{wcstombs} function does the
 351 reverse.  These functions are declared in the header file
 352 @file{stdlib.h}.
 353
In most programs, these functions are the only ones you need for
conversion between wide strings and multibyte character strings.  But
they have limitations.  If your data is not null-terminated or is not
all in memory at once, you probably need to use the low-level conversion
functions to convert one character at a time.  @xref{Converting One
Char}.
 360
 361 @comment stdlib.h
 362 @comment ISO
 363 @deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
The @code{mbstowcs} (``multibyte string to wide character string'')
function converts the null-terminated string of multibyte characters
@var{string} to an array of wide character codes, storing not more than
@var{size} wide characters into the array beginning at @var{wstring}.
The terminating null character counts towards the size, so if
converting @var{string} yields @var{size} or more wide characters (not
counting the terminator), no terminating null character is stored.

The conversion of characters from @var{string} begins in the initial
shift state.

If an invalid multibyte character sequence is found, this function
returns @code{(size_t) -1}.  Otherwise, it returns the number of wide
characters stored in the array @var{wstring}.  This number does not
include the terminating null character, which is present if the number
is less than @var{size}.
 380
 381 Here is an example showing how to convert a string of multibyte
 382 characters, allocating enough space for the result.
 383
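@smallexample
#include <stdlib.h>
#include <string.h>

/* @r{@code{xmalloc} and @code{xrealloc} are assumed to be wrappers}
   @r{that do not return on allocation failure.} */
wchar_t *
mbstowcs_alloc (const char *string)
@{
  /* @r{A string never yields more wide characters than it has bytes.} */
  size_t size = strlen (string) + 1;
  wchar_t *buf = xmalloc (size * sizeof (wchar_t));

  size = mbstowcs (buf, string, size);
  if (size == (size_t) -1)
    @{
      /* @r{Invalid multibyte sequence; do not leak the buffer.} */
      free (buf);
      return NULL;
    @}
  /* @r{Shrink to the characters stored plus the terminator.} */
  buf = xrealloc (buf, (size + 1) * sizeof (wchar_t));
  return buf;
@}
@end smallexample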
 398
 399 @end deftypefun
 400
 401 @comment stdlib.h
 402 @comment ISO
 403 @deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
 404 The @code{wcstombs} (``wide character string to multibyte string'')
 405 function converts the null-terminated wide character array @var{wstring}
 406 into a string containing multibyte characters, storing not more than
 407 @var{size} bytes starting at @var{string}, followed by a terminating
 408 null character if there is room.  The conversion of characters begins in
 409 the initial shift state.
 410
The terminating null character counts towards the size, so if @var{size}
is less than or equal to the number of bytes needed to convert
@var{wstring}, no terminating null character is stored.

If a code that does not correspond to a valid multibyte character is
found, this function returns @code{(size_t) -1}.  Otherwise, the
return value is the number of bytes stored in the array @var{string}.
This number does not include the terminating null character, which is
present if the number is less than @var{size}.
 420 @end deftypefun
 421
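The reverse conversion can be wrapped in the same way.  The following is
a sketch only: the name @code{wcstombs_alloc} is hypothetical, it uses
plain @code{malloc} and reports allocation failure by returning a null
pointer, and it assumes @code{wcslen} from @file{wchar.h} is available.
The buffer allows @code{MB_CUR_MAX} bytes per wide character, including
the terminator, so the result is always null-terminated.

@smallexample
#include <stdlib.h>
#include <wchar.h>

/* @r{Convert @var{wstring} to a newly allocated multibyte string.}
   @r{Return NULL on allocation failure or an unconvertible code.} */
char *
wcstombs_alloc (const wchar_t *wstring)
@{
  size_t size = (wcslen (wstring) + 1) * MB_CUR_MAX;
  char *buf = malloc (size);
  size_t stored;

  if (buf == NULL)
    return NULL;
  stored = wcstombs (buf, wstring, size);
  if (stored == (size_t) -1)
    @{
      free (buf);
      return NULL;
    @}
  return buf;
@}
@end smallexample
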
 422 @node Length of Char, Converting One Char, Wide String Conversion, Extended Characters
 423 @section Multibyte Character Length
 424 @cindex multibyte character, length of
 425 @cindex length of multibyte character
 426
 427 This section describes how to scan a string containing multibyte
 428 characters, one character at a time.  The difficulty in doing this
 429 is to know how many bytes each character contains.  Your program
 430 can use @code{mblen} to find this out.
 431
 432 @comment stdlib.h
 433 @comment ISO
 434 @deftypefun int mblen (const char *@var{string}, size_t @var{size})
 435 The @code{mblen} function with a non-null @var{string} argument returns
 436 the number of bytes that make up the multibyte character beginning at
 437 @var{string}, never examining more than @var{size} bytes.  (The idea is
 438 to supply for @var{size} the number of bytes of data you have in hand.)
 439
The return value of @code{mblen} distinguishes three possibilities: the
first @var{size} bytes at @var{string} start with a valid multibyte
character, they start with an invalid byte sequence or just part of a
character, or @var{string} points to an empty string (a null character).
 444
 445 For a valid multibyte character, @code{mblen} returns the number of
 446 bytes in that character (always at least @code{1}, and never more than
 447 @var{size}).  For an invalid byte sequence, @code{mblen} returns
 448 @code{-1}.  For an empty string, it returns @code{0}.
 449
 450 If the multibyte character code uses shift characters, then @code{mblen}
 451 maintains and updates a shift state as it scans.  If you call
 452 @code{mblen} with a null pointer for @var{string}, that initializes the
 453 shift state to its standard initial value.  It also returns nonzero if
 454 the multibyte character code in use actually has a shift state.
 455 @xref{Shift State}.
 456
 457 @pindex stdlib.h
 458 The function @code{mblen} is declared in @file{stdlib.h}.
 459 @end deftypefun
 460
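For instance, the three return values can be used to count the
characters in a null-terminated multibyte string.  This is a minimal
sketch; the function name @code{mb_count_chars} is hypothetical, and the
result depends on the current @code{LC_CTYPE} locale.

@smallexample
#include <stdlib.h>
#include <string.h>

/* @r{Return the number of multibyte characters in @var{s},}
   @r{or -1 if @var{s} contains an invalid sequence.} */
long
mb_count_chars (const char *s)
@{
  size_t left = strlen (s);
  long count = 0;

  /* @r{Begin in the initial shift state.} */
  mblen (NULL, 0);

  while (left > 0)
    @{
      int len = mblen (s, left);
      if (len < 0)
        return -1;
      if (len == 0)
        break;
      s += len;
      left -= (size_t) len;
      count++;
    @}
  return count;
@}
@end smallexample
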
 461 @node Converting One Char, Example of Conversion, Length of Char, Extended Characters
 462 @section Conversion of Extended Characters One by One
 463 @cindex extended characters, converting
 464 @cindex converting extended characters
 465
 466 @pindex stdlib.h
 467 You can convert multibyte characters one at a time to wide characters
 468 with the @code{mbtowc} function.  The @code{wctomb} function does the
 469 reverse.  These functions are declared in @file{stdlib.h}.
 470
 471 @comment stdlib.h
 472 @comment ISO
 473 @deftypefun int mbtowc (wchar_t *@var{result}, const char *@var{string}, size_t @var{size})
 474 The @code{mbtowc} (``multibyte to wide character'') function when called
 475 with non-null @var{string} converts the first multibyte character
 476 beginning at @var{string} to its corresponding wide character code.  It
 477 stores the result in @code{*@var{result}}.
 478
 479 @code{mbtowc} never examines more than @var{size} bytes.  (The idea is
 480 to supply for @var{size} the number of bytes of data you have in hand.)
 481
@code{mbtowc} with non-null @var{string} distinguishes three
possibilities: the first @var{size} bytes at @var{string} start with a
valid multibyte character, they start with an invalid byte sequence or
just part of a character, or @var{string} points to an empty string (a
null character).
 487
 488 For a valid multibyte character, @code{mbtowc} converts it to a wide
 489 character and stores that in @code{*@var{result}}, and returns the
 490 number of bytes in that character (always at least @code{1}, and never
 491 more than @var{size}).
 492
 493 For an invalid byte sequence, @code{mbtowc} returns @code{-1}.  For an
 494 empty string, it returns @code{0}, also storing @code{0} in
 495 @code{*@var{result}}.
 496
 497 If the multibyte character code uses shift characters, then
 498 @code{mbtowc} maintains and updates a shift state as it scans.  If you
 499 call @code{mbtowc} with a null pointer for @var{string}, that
 500 initializes the shift state to its standard initial value.  It also
 501 returns nonzero if the multibyte character code in use actually has a
 502 shift state.  @xref{Shift State}.
 503 @end deftypefun
 504
 505 @comment stdlib.h
 506 @comment ISO
 507 @deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
The @code{wctomb} (``wide character to multibyte'') function converts
the wide character code @var{wchar} to its corresponding multibyte
character sequence, and stores the result in bytes starting at
@var{string}.  At most @code{MB_CUR_MAX} bytes are stored.
 512
 513 @code{wctomb} with non-null @var{string} distinguishes three
 514 possibilities for @var{wchar}: a valid wide character code (one that can
 515 be translated to a multibyte character), an invalid code, and @code{0}.
 516
 517 Given a valid code, @code{wctomb} converts it to a multibyte character,
 518 storing the bytes starting at @var{string}.  Then it returns the number
 519 of bytes in that character (always at least @code{1}, and never more
 520 than @code{MB_CUR_MAX}).
 521
If @var{wchar} is an invalid wide character code, @code{wctomb} returns
@code{-1}.  If @var{wchar} is @code{0}, it stores a null byte in
@var{string} (preceded, in a code with shift states, by any shift
sequence needed to return to the initial shift state) and returns the
number of bytes stored, as ISO C specifies; this is @code{1} in a code
without shift states.

If the multibyte character code uses shift characters, then
@code{wctomb} maintains and updates a shift state as it scans.  If you
call @code{wctomb} with a null pointer for @var{string}, that
initializes the shift state to its standard initial value.  It also
returns nonzero if the multibyte character code in use actually has a
shift state.  @xref{Shift State}.

Calling this function with a @var{wchar} argument of zero when
@var{string} is not null has the side-effect of reinitializing the
stored shift state @emph{as well as} storing the multibyte character
@code{0}.
 537 @end deftypefun
 538
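As an illustration of @code{wctomb}, here is a sketch that converts a
wide string one character at a time into a caller-supplied buffer.  The
function name @code{wcs_to_mbs_by_char} is hypothetical.  The sketch
appends a plain null byte at the end, so in a code with shift states the
output is not guaranteed to end in the initial shift state.

@smallexample
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* @r{Convert @var{ws} into @var{out}, which has room for @var{cap} bytes.}
   @r{Return the number of bytes stored, not counting the terminator,}
   @r{or @code{(size_t) -1} on an invalid code or lack of room.} */
size_t
wcs_to_mbs_by_char (const wchar_t *ws, char *out, size_t cap)
@{
  char tmp[MB_LEN_MAX];
  size_t used = 0;

  if (cap == 0)
    return (size_t) -1;

  /* @r{Begin in the initial shift state.} */
  wctomb (NULL, 0);

  for (; *ws != 0; ws++)
    @{
      int len = wctomb (tmp, *ws);
      /* @r{Always keep one byte free for the terminator.} */
      if (len < 0 || used + (size_t) len >= cap)
        return (size_t) -1;
      memcpy (out + used, tmp, (size_t) len);
      used += (size_t) len;
    @}
  out[used] = '\0';
  return used;
@}
@end smallexample
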
 539 @node Example of Conversion, Shift State, Converting One Char, Extended Characters
 540 @section Character-by-Character Conversion Example
 541
 542 Here is an example that reads multibyte character text from descriptor
 543 @code{input} and writes the corresponding wide characters to descriptor
 544 @code{output}.  We need to convert characters one by one for this
 545 example because @code{mbstowcs} is unable to continue past a null
 546 character, and cannot cope with an apparently invalid partial character
 547 by reading more input.
 548
@smallexample
#include <error.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
file_mbstowcs (int input, int output)
@{
  char buffer[BUFSIZ + MB_LEN_MAX];
  int filled = 0;
  int eof = 0;

  while (!eof)
    @{
      int nread;
      int nwrite;
      char *inp = buffer;
      wchar_t outbuf[BUFSIZ];
      wchar_t *outp = outbuf;

      /* @r{Fill up the buffer from the input file.}  */
      nread = read (input, buffer + filled, BUFSIZ);
      if (nread < 0)
        @{
          perror ("read");
          return 0;
        @}
      /* @r{If we reach end of file, make a note to read no more.} */
      if (nread == 0)
        eof = 1;

      /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
      filled += nread;

      /* @r{Convert those bytes to wide characters--as many as we can.} */
      while (1)
        @{
          int thislen = mbtowc (outp, inp, filled);
          /* @r{Stop converting at invalid character;}
             @r{this can mean we have read just the first part}
             @r{of a valid character.}  */
          if (thislen == -1)
            break;
          /* @r{Treat null character like any other,}
             @r{but also reset shift state.} */
          if (thislen == 0)
            @{
              thislen = 1;
              mbtowc (NULL, NULL, 0);
            @}
          /* @r{Advance past this character.} */
          inp += thislen;
          filled -= thislen;
          outp++;
        @}

      /* @r{Write the wide characters we just made.}  */
      nwrite = write (output, outbuf,
                      (outp - outbuf) * sizeof (wchar_t));
      if (nwrite < 0)
        @{
          perror ("write");
          return 0;
        @}

      /* @r{See if we have a @emph{real} invalid character.} */
      if ((eof && filled > 0) || (size_t) filled >= MB_CUR_MAX)
        @{
          error (0, 0, "invalid multibyte character");
          return 0;
        @}

      /* @r{If any bytes must be carried forward,}
         @r{move them to the beginning of @code{buffer}.} */
      if (filled > 0)
        memmove (buffer, inp, filled);
    @}

  return 1;
@}
@end smallexample
 626
 627 @node Shift State,  , Example of Conversion, Extended Characters
 628 @section Multibyte Codes Using Shift Sequences
 629
 630 In some multibyte character codes, the @emph{meaning} of any particular
 631 byte sequence is not fixed; it depends on what other sequences have come
 632 earlier in the same string.  Typically there are just a few sequences
 633 that can change the meaning of other sequences; these few are called
 634 @dfn{shift sequences} and we say that they set the @dfn{shift state} for
 635 other sequences that follow.
 636
 637 To illustrate shift state and shift sequences, suppose we decide that
 638 the sequence @code{0200} (just one byte) enters Japanese mode, in which
 639 pairs of bytes in the range from @code{0240} to @code{0377} are single
 640 characters, while @code{0201} enters Latin-1 mode, in which single bytes
 641 in the range from @code{0240} to @code{0377} are characters, and
 642 interpreted according to the ISO Latin-1 character set.  This is a
 643 multibyte code which has two alternative shift states (``Japanese mode''
 644 and ``Latin-1 mode''), and two shift sequences that specify particular
 645 shift states.
 646
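To make the idea concrete, here is a sketch of a decoder for the
hypothetical code just described.  It is not a library facility, and it
fills in details the description leaves open, each of which is an
assumption: the initial shift state is Latin-1 mode, bytes below
@code{0200} are single characters in both modes, and bytes from
@code{0202} through @code{0237} are invalid.  The names
@code{toy_shift} and @code{toy_count_chars} are hypothetical.

@smallexample
#include <stddef.h>

enum toy_shift @{ TOY_LATIN1, TOY_JAPANESE @};

/* @r{Count the characters in the @var{n} bytes at @var{s} under the}
   @r{hypothetical code; return -1 for an invalid sequence.} */
long
toy_count_chars (const unsigned char *s, size_t n)
@{
  enum toy_shift state = TOY_LATIN1;    /* @r{Assumed initial state.} */
  long count = 0;
  size_t i = 0;

  while (i < n)
    @{
      unsigned char b = s[i];

      if (b == 0200)                    /* @r{Shift sequence.} */
        @{ state = TOY_JAPANESE; i++; @}
      else if (b == 0201)               /* @r{Shift sequence.} */
        @{ state = TOY_LATIN1; i++; @}
      else if (b < 0200)                /* @r{Same in both states.} */
        @{ count++; i++; @}
      else if (b < 0240)
        return -1;                      /* @r{Assumed invalid.} */
      else if (state == TOY_LATIN1)     /* @r{One byte, one character.} */
        @{ count++; i++; @}
      else if (i + 1 < n && s[i + 1] >= 0240)
        @{ count++; i += 2; @}            /* @r{Two bytes, one character.} */
      else
        return -1;                      /* @r{Unpaired byte.} */
    @}
  return count;
@}
@end smallexample
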
 647 When the multibyte character code in use has shift states, then
 648 @code{mblen}, @code{mbtowc} and @code{wctomb} must maintain and update
 649 the current shift state as they scan the string.  To make this work
 650 properly, you must follow these rules:
 651
 652 @itemize @bullet
 653 @item
 654 Before starting to scan a string, call the function with a null pointer
 655 for the multibyte character address---for example, @code{mblen (NULL,
 656 0)}.  This initializes the shift state to its standard initial value.
 657
 658 @item
 659 Scan the string one character at a time, in order.  Do not ``back up''
 660 and rescan characters already scanned, and do not intersperse the
 661 processing of different strings.
 662 @end itemize
 663
 664 Here is an example of using @code{mblen} following these rules:
 665
@smallexample
#include <error.h>
#include <stdlib.h>
#include <string.h>

void
scan_string (const char *s)
@{
  size_t length = strlen (s);

  /* @r{Initialize shift state.} */
  mblen (NULL, 0);

  while (1)
    @{
      int thischar = mblen (s, length);
      /* @r{Deal with end of string and invalid characters.} */
      if (thischar == 0)
        break;
      if (thischar == -1)
        @{
          error (0, 0, "invalid multibyte character");
          break;
        @}
      /* @r{Advance past this character.} */
      s += thischar;
      length -= thischar;
    @}
@}
@end smallexample
 692
 693 The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
 694 reentrant when using a multibyte code that uses a shift state.  However,
 695 no other library functions call these functions, so you don't have to
 696 worry that the shift state will be changed mysteriously.