1 @node Extended Characters, Locales, String and Array Utilities, Top
2 @chapter Extended Characters
4 A number of languages use character sets that are larger than the range
5 of values of type @code{char}. Japanese and Chinese are probably the
6 most familiar examples.
8 The GNU C library includes support for two mechanisms for dealing with
9 extended character sets: multibyte characters and wide characters. This
10 chapter describes how to use these mechanisms, and the functions for
11 converting between them.
12 @cindex extended character sets
14 The behavior of the functions in this chapter is affected by the current
15 locale for character classification---the @code{LC_CTYPE} category; see
16 @ref{Locale Categories}. This choice of locale selects which multibyte
code is used, and also controls the meanings and characteristics of wide
character codes.

@menu
* Extended Char Intro::         Multibyte codes versus wide characters.
* Locales and Extended Chars::  The locale selects the character codes.
* Multibyte Char Intro::        How multibyte codes are represented.
* Wide Char Intro::             How wide characters are represented.
* Wide String Conversion::      Converting wide strings to multibyte code
                                 and vice versa.
* Length of Char::              How many bytes make up one multibyte char.
* Converting One Char::         Converting a string character by character.
* Example of Conversion::       Example showing why converting
                                 one character at a time may be useful.
* Shift State::                 Multibyte codes with "shift characters".
@end menu

34 @node Extended Char Intro, Locales and Extended Chars, , Extended Characters
35 @section Introduction to Extended Characters
You can represent extended characters in either of two ways:

@itemize @bullet
@item
As @dfn{multibyte characters}, which can be embedded in an ordinary
string, an array of @code{char} objects.  Their advantage is that many
programs and operating systems can handle occasional multibyte
characters scattered among ordinary ASCII characters, without any
change.

@item
@cindex wide characters
As @dfn{wide characters}, which are like ordinary characters except that
they occupy more bits.  The wide character data type, @code{wchar_t},
has a range large enough to hold extended character codes as well as
old-fashioned ASCII codes.
@end itemize

An advantage of wide characters is that each character is a single data
object, just like ordinary ASCII characters.  There are a few
disadvantages:

@itemize @bullet
@item
Each existing program must be modified and recompiled to make it use
wide characters.

@item
Files of wide characters cannot be read by programs that expect ordinary
characters.
@end itemize

69 Typically, you use the multibyte character representation as part of the
70 external program interface, such as reading or writing text to files.
However, it's usually easier to perform internal manipulations on
strings containing extended characters when they are stored as arrays of
@code{wchar_t} objects, since the uniform representation makes most
editing operations
74 easier. If you do use multibyte characters for files and wide
75 characters for internal operations, you need to convert between them
76 when you read and write data.
78 If your system supports extended characters, then it supports them both
79 as multibyte characters and as wide characters. The library includes
80 functions you can use to convert between the two representations.
81 These functions are described in this chapter.
83 @node Locales and Extended Chars, Multibyte Char Intro, Extended Char Intro, Extended Characters
84 @section Locales and Extended Characters
86 A computer system can support more than one multibyte character code,
87 and more than one wide character code. The user controls the choice of
88 codes through the current locale for character classification
89 (@pxref{Locales}). Each locale specifies a particular multibyte
90 character code and a particular wide character code. The choice of locale
91 influences the behavior of the conversion functions in the library.
93 Some locales support neither wide characters nor nontrivial multibyte
94 characters. In these locales, the library conversion functions still
95 work, even though what they do is basically trivial.
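As a minimal sketch of how the locale choice feeds into the conversion
functions, the following helper selects a locale for the
@code{LC_CTYPE} category only and reports the longest multibyte
character the selected code can produce.  The name
@code{ctype_locale_max_bytes} is hypothetical and not part of the
library; the sketch assumes the change is made before any string
conversion is in progress.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not a library function.
   Select LOCALE_NAME for the LC_CTYPE category only and return the
   maximum number of bytes in one multibyte character of the code that
   locale selects.  Return 0 if the locale cannot be selected.  */
size_t
ctype_locale_max_bytes (const char *locale_name)
{
  assert (locale_name != NULL);

  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return 0;

  /* The multibyte code may have changed, so put the shift state kept
     by mblen back to its initial value before any further scanning.  */
  mblen (NULL, 0);

  return MB_CUR_MAX;
}
```

Calling @code{ctype_locale_max_bytes ("")} would select the locale named
by the user's environment.  In the standard @code{"C"} locale the
result is normally 1, which is the trivial case described above.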
If you select a new locale for character classification, the internal
shift state maintained by these functions can become confused, so it's
not a good idea to change the locale while you are in the middle of
converting a string.

102 @node Multibyte Char Intro, Wide Char Intro, Locales and Extended Chars, Extended Characters
103 @section Multibyte Characters
104 @cindex multibyte characters
In the ordinary ASCII code, a sequence of characters is a sequence of
bytes, and each character is one byte.  This is very simple, but a
one-byte code allows for at most 256 distinct characters (and ASCII
itself defines only 128).

110 In a @dfn{multibyte character code}, a sequence of characters is a
111 sequence of bytes, but each character may occupy one or more consecutive
112 bytes of the sequence.
@cindex basic byte sequence
There are many different ways of designing a multibyte character code;
different systems use different codes.  To specify a particular code
means designating the @dfn{basic} byte sequences (those which represent
a single character) and what characters they stand for.  A code that a
computer can actually use must have a finite number of these basic
sequences, and typically none of them is more than a few bytes long.

123 These sequences need not all have the same length. In fact, many of
124 them are just one byte long. Because the basic ASCII characters in the
125 range from @code{0} to @code{0177} are so important, they stand for
126 themselves in all multibyte character codes. That is to say, a byte
127 whose value is @code{0} through @code{0177} is always a character in
128 itself. The characters which are more than one byte must always start
129 with a byte in the range from @code{0200} through @code{0377}.
131 The byte value @code{0} can be used to terminate a string, just as it is
132 often used in a string of ASCII characters.
Specifying the basic byte sequences that represent single characters
automatically gives meanings to many longer byte sequences, as more than
one character.  For example, if the two-byte sequence @code{0205 061}
stands for the Greek letter alpha, then @code{0205 061 0101} must stand
for an alpha followed by an @samp{A} (ASCII code @code{0101}), and
@code{0205 061 0205 061} must stand for two alphas in a row.

141 If any byte sequence can have more than one meaning as a sequence of
142 characters, then the multibyte code is ambiguous---and no good. The
143 codes that systems actually use are all unambiguous.
145 In most codes, there are certain sequences of bytes that have no meaning
146 as a character or characters. These are called @dfn{invalid}.
The simplest possible multibyte code is a trivial one:

@quotation
The basic sequences consist of single bytes.
@end quotation

This particular code is equivalent to not using multibyte characters at
all.  It has no invalid sequences.  But it can handle only 256 different
characters.

Here is another possible code which can handle 9376 different
characters:

@quotation
The basic sequences consist of

@itemize @bullet
@item
single bytes with values in the range @code{0} through @code{0237}.

@item
two-byte sequences, in which both of the bytes have values in the range
from @code{0240} through @code{0377}.
@end itemize
@end quotation

This code or a similar one is used on some systems to represent Japanese
characters.  The invalid sequences are those which consist of an odd
number of consecutive bytes in the range from @code{0240} through
@code{0377}.

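The figure of 9376 and the rule for invalid sequences can be checked
with a short sketch.  The function names @code{two_byte_code_size} and
@code{two_byte_code_valid} are hypothetical illustrations for this
example code only; they are not library functions and do not describe
any real system's encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct characters in the example code: 160 single bytes
   (0 through 0237) plus 96 * 96 two-byte sequences.  */
unsigned long
two_byte_code_size (void)
{
  unsigned long singles = 0237 + 1;        /* 160 */
  unsigned long high = 0377 - 0240 + 1;    /* 96 */
  return singles + high * high;
}

/* Return 1 if the LEN bytes at P form a valid sequence of characters
   in the example code, and 0 otherwise.  A byte in the high range must
   always be followed by a second byte in the high range.  */
int
two_byte_code_valid (const unsigned char *p, size_t len)
{
  size_t i = 0;

  assert (p != NULL || len == 0);
  while (i < len)
    {
      if (p[i] <= 0237)
        i += 1;                 /* A single-byte character.  */
      else if (i + 1 < len && p[i + 1] >= 0240)
        i += 2;                 /* A complete two-byte character.  */
      else
        return 0;               /* An unpaired high byte.  */
    }
  return 1;
}
```

The arithmetic is 160 + 96 * 96 = 160 + 9216 = 9376.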
Here is another multibyte code which can handle more distinct extended
characters, in fact almost thirty million:

@quotation
The basic sequences consist of

@itemize @bullet
@item
single bytes with values in the range @code{0} through @code{0177}.

@item
sequences of up to four bytes in which the first byte is in the range
from @code{0200} through @code{0237}, and the remaining bytes are in the
range from @code{0240} through @code{0377}.
@end itemize
@end quotation

In this code, any sequence that starts with a byte in the range
from @code{0240} through @code{0377} is invalid.

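The count of ``almost thirty million'' can be reproduced with the
following sketch.  It assumes that the multibyte sequences of this code
are two, three or four bytes long (a lead byte alone is not counted);
@code{prefixed_code_size} is a hypothetical name used only here.

```c
#include <assert.h>

/* Count the characters of the example code: 128 single bytes, plus
   sequences of 2, 3 and 4 bytes made of one lead byte (32 choices)
   followed by trailing bytes (96 choices each).  */
unsigned long
prefixed_code_size (void)
{
  unsigned long singles = 0177 + 1;        /* 128 */
  unsigned long lead = 0237 - 0200 + 1;    /* 32 */
  unsigned long trail = 0377 - 0240 + 1;   /* 96 */
  unsigned long total = singles;
  unsigned long combos = lead;
  int len;

  for (len = 2; len <= 4; len++)
    {
      combos *= trail;          /* Sequences of exactly LEN bytes.  */
      total += combos;
    }

  /* Sanity check: the four-byte sequences alone number 32 * 96^3.  */
  assert (combos == 28311552UL);
  return total;
}
```

Under that assumption the total is 128 + 3072 + 294912 + 28311552 =
28609664.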
And here is another variant which has the advantage that removing the
last byte or bytes from a valid character can never produce another
valid character.  (This property is convenient when you want to search
strings for particular characters.)

@quotation
The basic sequences consist of

@itemize @bullet
@item
single bytes with values in the range @code{0} through @code{0177}.

@item
two-byte sequences in which the first byte is in the range from
@code{0200} through @code{0207}, and the second byte is in the range
from @code{0240} through @code{0377}.

@item
three-byte sequences in which the first byte is in the range from
@code{0210} through @code{0217}, and the other bytes are in the range
from @code{0240} through @code{0377}.

@item
four-byte sequences in which the first byte is in the range from
@code{0220} through @code{0227}, and the other bytes are in the range
from @code{0240} through @code{0377}.
@end itemize
@end quotation

The list of invalid sequences for this code is long and not worth
stating in full; examples of invalid sequences include @code{0240} and
@code{0220 0300 065}.

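Because the first byte of each character in this variant determines the
character's length, a decoder for it is short.  The sketch below is an
illustration of the rules just listed, with hypothetical names
@code{variant_lead_length} and @code{variant_char_length}; it is not
how the library itself decodes any real code.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length that lead byte B announces, or 0 if B cannot
   start a character in the example code.  */
static int
variant_lead_length (unsigned char b)
{
  if (b <= 0177)
    return 1;
  if (b >= 0200 && b <= 0207)
    return 2;
  if (b >= 0210 && b <= 0217)
    return 3;
  if (b >= 0220 && b <= 0227)
    return 4;
  return 0;
}

/* Return the length of the character that starts at P, examining at
   most LEN bytes, or -1 if the bytes are invalid or incomplete.  */
int
variant_char_length (const unsigned char *p, size_t len)
{
  int n, i;

  assert (p != NULL);
  if (len == 0)
    return -1;

  n = variant_lead_length (p[0]);
  if (n == 0 || (size_t) n > len)
    return -1;

  /* Every byte after the first must be in the range 0240 to 0377.  */
  for (i = 1; i < n; i++)
    if (p[i] < 0240)
      return -1;
  return n;
}
```

Both invalid examples from the text are rejected: @code{0240} cannot
start a character, and @code{0220} announces four bytes that do not
follow.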
235 The number of @emph{possible} multibyte codes is astronomical. But a
236 given computer system will support at most a few different codes. (One
237 of these codes may allow for thousands of different characters.)
238 Another computer system may support a completely different code. The
239 library facilities described in this chapter are helpful because they
240 package up the knowledge of the details of a particular computer
241 system's multibyte code, so your programs need not know them.
243 You can use special standard macros to find out the maximum possible
244 number of bytes in a character in the currently selected multibyte
245 code with @code{MB_CUR_MAX}, and the maximum for @emph{any} multibyte
246 code supported on your computer with @code{MB_LEN_MAX}.
@deftypevr Macro int MB_LEN_MAX
This is the maximum length of a multibyte character for any supported
locale.  It is defined in @file{limits.h}.
@end deftypevr

@deftypevr Macro int MB_CUR_MAX
This macro expands into a (possibly non-constant) positive integer
expression that is the maximum number of bytes in a multibyte character
in the current locale.  The value is never greater than @code{MB_LEN_MAX}.

@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
@end deftypevr

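A common use of these macros is sizing buffers.  Since
@code{MB_LEN_MAX} is a constant it can dimension an array, while
@code{MB_CUR_MAX} may have to be evaluated at run time.  The following
sketch uses a hypothetical helper, @code{mb_buffer_size}, to compute an
upper bound on the storage for a given number of characters; it assumes
a code without shift sequences, which would need extra room.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: return enough bytes to hold NCHARS multibyte
   characters of the current locale plus a terminating null byte.  */
size_t
mb_buffer_size (size_t nchars)
{
  size_t per_char = MB_CUR_MAX;

  /* The current maximum never exceeds the maximum over all locales.  */
  assert (per_char >= 1 && per_char <= (size_t) MB_LEN_MAX);
  return nchars * per_char + 1;
}
```

For a single character, a declaration such as
@code{char one[MB_LEN_MAX];} is always large enough, whatever locale is
selected later.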
267 Normally, each basic sequence in a particular character code stands for
268 one character, the same character regardless of context. Some multibyte
269 character codes have a concept of @dfn{shift state}; certain codes,
270 called @dfn{shift sequences}, change to a different shift state, and the
271 meaning of some or all basic sequences varies according to the current
272 shift state. In fact, the set of basic sequences might even be
273 different depending on the current shift state. @xref{Shift State}, for
274 more information on handling this sort of code.
What happens if you try to pass a string containing multibyte characters
to a function that doesn't know about them?  Normally, such a function
treats a string as a sequence of bytes, and interprets certain byte
values specially; all other byte values are ``ordinary''.  As long as a
multibyte character doesn't contain any of the special byte values, the
function should pass it through as if it were several ordinary
characters.

For example, let's figure out what happens if you use multibyte
characters in a file name.  The functions such as @code{open} and
@code{unlink} that operate on file names treat the name as a sequence of
byte values, with @samp{/} as the only special value.  Any other byte
values are copied, or compared, in sequence, and all byte values are
treated alike.  Thus, you may think of the file name as a sequence of
bytes or as a string containing multibyte characters; the same behavior
makes sense equally either way, provided no multibyte character contains
a @samp{/}.

294 @node Wide Char Intro, Wide String Conversion, Multibyte Char Intro, Extended Characters
295 @section Wide Character Introduction
297 @dfn{Wide characters} are much simpler than multibyte characters. They
298 are simply characters with more than eight bits, so that they have room
299 for more than 256 distinct codes. The wide character data type,
300 @code{wchar_t}, has a range large enough to hold extended character
301 codes as well as old-fashioned ASCII codes.
An advantage of wide characters is that each character is a single data
object, just like ordinary ASCII characters.  Wide characters also have
some disadvantages:

@itemize @bullet
@item
A program must be modified and recompiled in order to use wide
characters at all.

@item
Files of wide characters cannot be read by programs that expect ordinary
characters.
@end itemize

317 Wide character values @code{0} through @code{0177} are always identical
318 in meaning to the ASCII character codes. The wide character value zero
319 is often used to terminate a string of wide characters, just as a single
320 byte with value zero often terminates a string of ordinary characters.
@deftp {Data Type} wchar_t
This is the ``wide character'' type, an integer type whose range is
large enough to represent all distinct values in any extended character
set in the supported locales.  @xref{Locales}, for more information
about locales.  This type is defined in the header file @file{stddef.h}.
@end deftp

If your system supports extended characters, then each extended
character has both a wide character code and a corresponding multibyte
basic sequence.

336 @cindex code, character
337 @cindex character code
338 In this chapter, the term @dfn{code} is used to refer to a single
339 extended character object to emphasize the distinction from the
340 @code{char} data type.
342 @node Wide String Conversion, Length of Char, Wide Char Intro, Extended Characters
343 @section Conversion of Extended Strings
344 @cindex extended strings, converting representations
345 @cindex converting extended strings
The @code{mbstowcs} function converts a string of multibyte characters
to a wide character array.  The @code{wcstombs} function does the
reverse.  These functions are declared in the header file
@file{stdlib.h}.

In most programs, these functions are the only ones you need for
conversion between wide strings and multibyte character strings.  But
they have limitations.  If your data is not null-terminated or is not
all in memory at once, you probably need to use the low-level conversion
functions to convert one character at a time.  @xref{Converting One
Char}.

@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size})
The @code{mbstowcs} (``multibyte string to wide character string'')
function converts the null-terminated string of multibyte characters
@var{string} to an array of wide character codes, storing not more than
@var{size} wide characters into the array beginning at @var{wstring}.
The terminating null character counts towards the size, so if @var{size}
is less than the number of wide characters resulting from @var{string},
including the terminating null, no terminating null character is stored.

The conversion of characters from @var{string} begins in the initial
shift state.

If an invalid multibyte character sequence is found, this function
returns a value of @code{(size_t) -1}.  Otherwise, it returns the number
of wide characters stored in the array @var{wstring}.  This number does
not include the terminating null character, which is present if the
number is less than @var{size}.

Here is an example showing how to convert a string of multibyte
characters, allocating enough space for the result.

@smallexample
wchar_t *
mbstowcs_alloc (const char *string)
@{
  size_t size = strlen (string) + 1;
  wchar_t *buf = xmalloc (size * sizeof (wchar_t));
  size = mbstowcs (buf, string, size);
  if (size == (size_t) -1)
    @{
      free (buf);
      return NULL;
    @}
  return xrealloc (buf, (size + 1) * sizeof (wchar_t));
@}
@end smallexample
@end deftypefun

@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size})
The @code{wcstombs} (``wide character string to multibyte string'')
function converts the null-terminated wide character array @var{wstring}
into a string containing multibyte characters, storing not more than
@var{size} bytes starting at @var{string}, followed by a terminating
null character if there is room.  The conversion of characters begins in
the initial shift state.

The terminating null character counts towards the size, so if @var{size}
is less than or equal to the number of bytes needed for @var{wstring}, no
terminating null character is stored.

If a code that does not correspond to a valid multibyte character is
found, this function returns a value of @code{(size_t) -1}.  Otherwise,
the return value is the number of bytes stored in the array @var{string}.
This number does not include the terminating null character, which is
present if the number is less than @var{size}.
@end deftypefun

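The reverse of the @code{mbstowcs} example can be sketched the same way.
The helper name @code{wcstombs_alloc} is hypothetical, plain
@code{malloc} is used in place of any checking allocator, and the size
bound assumes a multibyte code without shift sequences (each wide
character then needs at most @code{MB_CUR_MAX} bytes).

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert the wide string WSTRING to a newly
   allocated multibyte string.  Return NULL if memory runs out or if
   WSTRING contains a code with no multibyte representation.  The
   caller frees the result.  */
char *
wcstombs_alloc (const wchar_t *wstring)
{
  size_t size, written;
  char *buf;

  assert (wstring != NULL);

  /* Upper bound: MB_CUR_MAX bytes per character, plus the null.  */
  size = wcslen (wstring) * MB_CUR_MAX + 1;
  buf = malloc (size);
  if (buf == NULL)
    return NULL;

  written = wcstombs (buf, wstring, size);
  if (written == (size_t) -1)
    {
      free (buf);
      return NULL;
    }

  /* WRITTEN is less than SIZE here, so the null was stored.  */
  return buf;
}
```

A caller would write something like
@code{char *s = wcstombs_alloc (L"text");} and release the result with
@code{free} when done.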
421 @node Length of Char, Converting One Char, Wide String Conversion, Extended Characters
422 @section Multibyte Character Length
423 @cindex multibyte character, length of
424 @cindex length of multibyte character
426 This section describes how to scan a string containing multibyte
427 characters, one character at a time. The difficulty in doing this
428 is to know how many bytes each character contains. Your program
429 can use @code{mblen} to find this out.
@deftypefun int mblen (const char *@var{string}, size_t @var{size})
The @code{mblen} function with a non-null @var{string} argument returns
the number of bytes that make up the multibyte character beginning at
@var{string}, never examining more than @var{size} bytes.  (The idea is
to supply for @var{size} the number of bytes of data you have in hand.)

The return value of @code{mblen} distinguishes three possibilities: the
first @var{size} bytes at @var{string} start with a valid multibyte
character, they start with an invalid byte sequence or just part of a
character, or @var{string} points to an empty string (a null character).

For a valid multibyte character, @code{mblen} returns the number of
bytes in that character (always at least @code{1}, and never more than
@var{size}).  For an invalid byte sequence, @code{mblen} returns
@code{-1}.  For an empty string, it returns @code{0}.

If the multibyte character code uses shift characters, then @code{mblen}
maintains and updates a shift state as it scans.  If you call
@code{mblen} with a null pointer for @var{string}, that initializes the
shift state to its standard initial value.  In that case it returns
nonzero if the multibyte character code in use actually has a shift
state, and zero otherwise.  @xref{Shift State}.
@end deftypefun

The function @code{mblen} is declared in @file{stdlib.h}.

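As a small illustration of scanning with @code{mblen}, the following
sketch counts the characters (not the bytes) in a null-terminated
multibyte string.  The name @code{mb_char_count} is hypothetical; the
function relies only on the behavior of @code{mblen} described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the number of multibyte characters in
   the null-terminated string S, or -1 if S contains an invalid or
   incomplete sequence.  */
long
mb_char_count (const char *s)
{
  size_t remaining;
  long count = 0;

  assert (s != NULL);
  remaining = strlen (s);

  /* Start from the initial shift state.  */
  mblen (NULL, 0);

  while (remaining > 0)
    {
      int n = mblen (s, remaining);
      if (n < 0)
        return -1;              /* Invalid or truncated character.  */
      if (n == 0)
        break;                  /* Null character; not expected here.  */
      s += n;
      remaining -= (size_t) n;
      count++;
    }
  return count;
}
```

In a locale whose multibyte code is plain single bytes, the result is
the same as @code{strlen}; in other locales it can be smaller.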
460 @node Converting One Char, Example of Conversion, Length of Char, Extended Characters
461 @section Conversion of Extended Characters One by One
462 @cindex extended characters, converting
463 @cindex converting extended characters
466 You can convert multibyte characters one at a time to wide characters
467 with the @code{mbtowc} function. The @code{wctomb} function does the
468 reverse. These functions are declared in @file{stdlib.h}.
@deftypefun int mbtowc (wchar_t *@var{result}, const char *@var{string}, size_t @var{size})
The @code{mbtowc} (``multibyte to wide character'') function when called
with non-null @var{string} converts the first multibyte character
beginning at @var{string} to its corresponding wide character code.  It
stores the result in @code{*@var{result}}.

@code{mbtowc} never examines more than @var{size} bytes.  (The idea is
to supply for @var{size} the number of bytes of data you have in hand.)

@code{mbtowc} with non-null @var{string} distinguishes three
possibilities: the first @var{size} bytes at @var{string} start with a
valid multibyte character, they start with an invalid byte sequence or
just part of a character, or @var{string} points to an empty string (a
null character).

For a valid multibyte character, @code{mbtowc} converts it to a wide
character and stores that in @code{*@var{result}}, and returns the
number of bytes in that character (always at least @code{1}, and never
more than @var{size}).

For an invalid byte sequence, @code{mbtowc} returns @code{-1}.  For an
empty string, it returns @code{0}, also storing @code{0} in
@code{*@var{result}}.

If the multibyte character code uses shift characters, then
@code{mbtowc} maintains and updates a shift state as it scans.  If you
call @code{mbtowc} with a null pointer for @var{string}, that
initializes the shift state to its standard initial value.  In that case
it returns nonzero if the multibyte character code in use actually has a
shift state, and zero otherwise.  @xref{Shift State}.
@end deftypefun

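The following sketch shows one way to drive @code{mbtowc} over a whole
string, converting it character by character into a caller-supplied
array.  The name @code{mb_to_wide} and its error convention are
assumptions made for this example, not part of the library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the multibyte string S into OUT, which
   has room for CAP wide characters including the terminating null.
   Return the number of wide characters stored, not counting the null,
   or (size_t) -1 if S is invalid or OUT is too small.  */
size_t
mb_to_wide (wchar_t *out, size_t cap, const char *s)
{
  size_t remaining, n_out = 0;

  assert (out != NULL && s != NULL && cap > 0);
  remaining = strlen (s);

  /* Start from the initial shift state.  */
  mbtowc (NULL, NULL, 0);

  while (remaining > 0)
    {
      int n;

      /* Keep one slot free for the terminating null.  */
      if (n_out + 1 >= cap)
        return (size_t) -1;

      n = mbtowc (&out[n_out], s, remaining);
      if (n <= 0)
        return (size_t) -1;     /* Invalid or incomplete character.  */

      s += n;
      remaining -= (size_t) n;
      n_out++;
    }

  out[n_out] = L'\0';
  return n_out;
}
```

Unlike @code{mbstowcs}, a loop of this kind lets the caller decide what
to do at each character, for example to stop early or to report the
position of an invalid sequence.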
@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar})
The @code{wctomb} (``wide character to multibyte'') function converts
the wide character code @var{wchar} to its corresponding multibyte
character sequence, and stores the result in bytes starting at
@var{string}.  At most @code{MB_CUR_MAX} bytes are stored.

@code{wctomb} with non-null @var{string} distinguishes three
possibilities for @var{wchar}: a valid wide character code (one that can
be translated to a multibyte character), an invalid code, and @code{0}.

Given a valid code, @code{wctomb} converts it to a multibyte character,
storing the bytes starting at @var{string}.  Then it returns the number
of bytes in that character (always at least @code{1}, and never more
than @code{MB_CUR_MAX}).

If @var{wchar} is an invalid wide character code, @code{wctomb} returns
@code{-1}.  If @var{wchar} is @code{0}, it stores a null byte, preceded
by any shift sequence needed to return to the initial shift state, and
returns the number of bytes stored (@code{1} when the code has no shift
states).

If the multibyte character code uses shift characters, then
@code{wctomb} maintains and updates a shift state as it converts.  If you
call @code{wctomb} with a null pointer for @var{string}, that
initializes the shift state to its standard initial value.  In that case
it returns nonzero if the multibyte character code in use actually has a
shift state, and zero otherwise.  @xref{Shift State}.

Calling this function with a @var{wchar} argument of zero when
@var{string} is not null has the side-effect of reinitializing the
stored shift state @emph{as well as} storing the null byte.
@end deftypefun

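Here is a sketch of the opposite loop, building a multibyte string from
wide characters one at a time with @code{wctomb}.  The name
@code{wide_to_mb} is hypothetical.  Each character is first converted
into a scratch buffer of @code{MB_LEN_MAX} bytes so that the output
array is never overrun; the sketch assumes a code without shift
sequences, since it does not emit a final sequence to return to the
initial state.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the wide string W into OUT, which has
   room for CAP bytes including the terminating null.  Return the
   number of bytes stored, not counting the null, or (size_t) -1 if a
   code cannot be converted or OUT is too small.  */
size_t
wide_to_mb (char *out, size_t cap, const wchar_t *w)
{
  char tmp[MB_LEN_MAX];
  size_t used = 0;

  assert (out != NULL && w != NULL && cap > 0);

  /* Start from the initial shift state.  */
  wctomb (NULL, L'\0');

  for (; *w != L'\0'; w++)
    {
      int n = wctomb (tmp, *w);

      /* Reject invalid codes, and keep one byte free for the null.  */
      if (n < 0 || used + (size_t) n >= cap)
        return (size_t) -1;

      memcpy (out + used, tmp, (size_t) n);
      used += (size_t) n;
    }

  out[used] = '\0';
  return used;
}
```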
538 @node Example of Conversion, Shift State, Converting One Char, Extended Characters
539 @section Character-by-Character Conversion Example
541 Here is an example that reads multibyte character text from descriptor
542 @code{input} and writes the corresponding wide characters to descriptor
543 @code{output}. We need to convert characters one by one for this
544 example because @code{mbstowcs} is unable to continue past a null
545 character, and cannot cope with an apparently invalid partial character
546 by reading more input.
550 file_mbstowcs (int input, int output)
552 char buffer[BUFSIZ + MB_LEN_MAX];
      /* @r{Each byte in @code{buffer} may become one wide character.} */
      wchar_t outbuf[BUFSIZ + MB_LEN_MAX];
562 wchar_t *outp = outbuf;
564 /* @r{Fill up the buffer from the input file.} */
565 nread = read (input, buffer + filled, BUFSIZ);
571 /* @r{If we reach end of file, make a note to read no more.} */
575 /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */
578 /* @r{Convert those bytes to wide characters--as many as we can.} */
581 int thislen = mbtowc (outp, inp, filled);
582 /* Stop converting at invalid character;
583 this can mean we have read just the first part
584 of a valid character. */
587 /* @r{Treat null character like any other,}
588 @r{but also reset shift state.} */
591 mbtowc (NULL, NULL, 0);
593 /* @r{Advance past this character.} */
599 /* @r{Write the wide characters we just made.} */
600 nwrite = write (output, outbuf,
601 (outp - outbuf) * sizeof (wchar_t));
608 /* @r{See if we have a @emph{real} invalid character.} */
609 if ((eof && filled > 0) || filled >= MB_CUR_MAX)
611 error ("invalid multibyte character");
      /* @r{If any bytes must be carried forward,}
         @r{move them to the beginning of @code{buffer}.} */
        memmove (buffer, inp, filled);
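The carry-forward technique used in this example can also be shown
without file descriptors.  The sketch below converts one chunk of bytes
at a time and keeps the bytes of an unfinished character for the next
call.  The structure @code{mb_carry} and the function @code{mb_feed} are
hypothetical names introduced for this illustration, and the treatment
of shift state after an error is simplified.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical carry-over state: the leading bytes of a character
   whose remaining bytes have not arrived yet.  */
struct mb_carry
{
  char bytes[MB_LEN_MAX];
  size_t count;
};

/* Convert the carried bytes followed by the LEN bytes at CHUNK into
   wide characters stored at OUT, which must have room for
   LEN + MB_LEN_MAX elements.  Return the number of wide characters
   stored, or -1 on a definitely invalid sequence or lack of memory.  */
long
mb_feed (struct mb_carry *carry, const char *chunk, size_t len,
         wchar_t *out)
{
  size_t filled;
  long produced = 0;
  char *work, *p;

  assert (carry != NULL && out != NULL && (chunk != NULL || len == 0));

  /* Join the carried bytes and the new chunk in one work buffer.  */
  filled = carry->count + len;
  work = malloc (filled + 1);
  if (work == NULL)
    return -1;
  memcpy (work, carry->bytes, carry->count);
  if (len > 0)
    memcpy (work + carry->count, chunk, len);

  /* Convert as many complete characters as possible.  */
  p = work;
  while (filled > 0)
    {
      int n = mbtowc (&out[produced], p, filled);
      if (n == -1)
        break;                  /* Invalid, or only part of a character.  */
      if (n == 0)
        {
          /* Treat a null like any other character, but also reset
             the shift state.  */
          n = 1;
          mbtowc (NULL, NULL, 0);
        }
      p += n;
      filled -= (size_t) n;
      produced++;
    }

  /* MB_CUR_MAX or more leftover bytes cannot be a partial character.  */
  if (filled >= MB_CUR_MAX)
    {
      free (work);
      return -1;
    }

  /* Carry the unfinished character forward to the next call.  */
  memcpy (carry->bytes, p, filled);
  carry->count = filled;
  free (work);
  return produced;
}
```

A caller would start with @code{carry.count} set to zero and, at end of
input, treat a nonzero @code{carry.count} as an invalid trailing
character, which corresponds to the @code{eof && filled > 0} test in the
example above.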
626 @node Shift State, , Example of Conversion, Extended Characters
627 @section Multibyte Codes Using Shift Sequences
629 In some multibyte character codes, the @emph{meaning} of any particular
630 byte sequence is not fixed; it depends on what other sequences have come
631 earlier in the same string. Typically there are just a few sequences
632 that can change the meaning of other sequences; these few are called
633 @dfn{shift sequences} and we say that they set the @dfn{shift state} for
634 other sequences that follow.
To illustrate shift state and shift sequences, suppose we decide that
the sequence @code{0200} (just one byte) enters Japanese mode, in which
pairs of bytes in the range from @code{0240} to @code{0377} are single
characters, while @code{0201} enters Latin-1 mode, in which single bytes
in the range from @code{0240} to @code{0377} are characters, and
interpreted according to the ISO Latin-1 character set.  This is a
multibyte code which has two alternative shift states (``Japanese mode''
and ``Latin-1 mode''), and two shift sequences that specify particular
shift states.

646 When the multibyte character code in use has shift states, then
647 @code{mblen}, @code{mbtowc} and @code{wctomb} must maintain and update
648 the current shift state as they scan the string. To make this work
649 properly, you must follow these rules:
@itemize @bullet
@item
Before starting to scan a string, call the function with a null pointer
for the multibyte character address; for example, @code{mblen (NULL,
0)}.  This initializes the shift state to its standard initial value.

@item
Scan the string one character at a time, in order.  Do not ``back up''
and rescan characters already scanned, and do not intersperse the
processing of different strings.
@end itemize

663 Here is an example of using @code{mblen} following these rules:
667 scan_string (char *s)
669 int length = strlen (s);
671 /* @r{Initialize shift state.} */
676 int thischar = mblen (s, length);
677 /* @r{Deal with end of string and invalid characters.} */
682 error ("invalid multibyte character");
685 /* @r{Advance past this character.} */
692 The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not
693 reentrant when using a multibyte code that uses a shift state. However,
694 no other library functions call these functions, so you don't have to
695 worry that the shift state will be changed mysteriously.
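Where the shared shift state is a problem, one possible approach is to
keep the state in an object of your own.  The sketch below assumes that
your system provides the restartable functions of ISO C Amendment 1 in
@file{wchar.h} (here @code{mbrlen} with an @code{mbstate_t} argument);
those functions are not described in this chapter, and
@code{mb_char_count_r} is a hypothetical name.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in S using a
   private conversion state, so that scans of different strings can be
   interleaved.  Return -1 on an invalid or incomplete sequence.  */
long
mb_char_count_r (const char *s)
{
  mbstate_t state;
  size_t remaining;
  long count = 0;

  assert (s != NULL);

  /* An all-zero mbstate_t describes the initial shift state.  */
  memset (&state, 0, sizeof state);
  remaining = strlen (s);

  while (remaining > 0)
    {
      size_t n = mbrlen (s, remaining, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        return -1;              /* Invalid or incomplete character.  */
      if (n == 0)
        break;                  /* Null character; not expected here.  */
      s += n;
      remaining -= n;
      count++;
    }
  return count;
}
```

Because the state lives in a local variable, two such scans do not
disturb each other, which the rules above forbid for @code{mblen}.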