1 @node Character Handling, String and Array Utilities, Memory, Top
2 @c %MENU% Character testing and conversion functions
3 @chapter Character Handling
Programs that work with characters and strings often need to classify a
character---is it alphabetic, is it a digit, is it whitespace, and so
on---and perform case conversion operations on characters.  The
functions in the header file @file{ctype.h} are provided for this
purpose.
12 Since the choice of locale and character set can alter the
13 classifications of particular character codes, all of these functions
14 are affected by the current locale. (More precisely, they are affected
15 by the locale currently selected for character classification---the
16 @code{LC_CTYPE} category; see @ref{Locale Categories}.)
The @w{ISO C} standard specifies two different sets of functions.  One
set works on @code{char} type characters, the other on
@code{wchar_t} wide characters (@pxref{Extended Char Intro}).
@menu
* Classification of Characters::       Testing whether characters are
                                         letters, digits, punctuation, etc.
* Case Conversion::                    Case mapping, and the like.
* Classification of Wide Characters::  Character class determination for
                                         wide characters.
* Using Wide Char Classes::            Notes on using the wide character
                                         classes.
* Wide Character Case Conversion::     Mapping of wide characters.
@end menu
34 @node Classification of Characters, Case Conversion, , Character Handling
35 @section Classification of Characters
36 @cindex character testing
37 @cindex classification of characters
38 @cindex predicates on characters
39 @cindex character predicates
41 This section explains the library functions for classifying characters.
42 For example, @code{isalpha} is the function to test for an alphabetic
43 character. It takes one argument, the character to test, and returns a
44 nonzero integer if the character is alphabetic, and zero otherwise. You
would use it like this:

@smallexample
if (isalpha (c))
  printf ("The character `%c' is alphabetic.\n", c);
@end smallexample
52 Each of the functions in this section tests for membership in a
53 particular class of characters; each has a name starting with @samp{is}.
54 Each of them takes one argument, which is a character to test, and
55 returns an @code{int} which is treated as a boolean value. The
56 character argument is passed as an @code{int}, and it may be the
57 constant value @code{EOF} instead of a real character.
59 The attributes of any given character can vary between locales.
60 @xref{Locales}, for more information on locales.@refill
62 These functions are declared in the header file @file{ctype.h}.
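
Because the argument is an @code{int} that must hold either @code{EOF} or a value representable as @code{unsigned char}, code that walks a @code{char} string usually converts each element before testing it.  The following is a minimal sketch of that idiom; the function name @code{count_alpha} is hypothetical and chosen only for this illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the alphabetic characters in the string S.  The name
   count_alpha is hypothetical, chosen for this sketch.  */
size_t
count_alpha (const char *s)
{
  size_t n = 0;

  for (; *s != '\0'; ++s)
    /* Convert to unsigned char first: plain char may be signed, and a
       negative value other than EOF is not a valid argument.  */
    if (isalpha ((unsigned char) *s))
      ++n;
  return n;
}
```

The result depends on the locale selected for @code{LC_CTYPE} at the time of the call.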
65 @cindex lower-case character
68 @deftypefun int islower (int @var{c})
69 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
70 @c The is* macros call __ctype_b_loc to get the ctype array from the
71 @c current locale, and then index it by c. __ctype_b_loc reads from
72 @c thread-local memory the (indirect) pointer to the ctype array, which
73 @c may involve one word access to the global locale object, if that's
74 @c the active locale for the thread, and the array, being part of the
75 @c locale data, is undeletable, so there's no thread-safety issue. We
76 @c might want to mark these with @mtslocale to flag to callers that
77 @c changing locales might affect them, even if not these simpler
Returns true if @var{c} is a lower-case letter.  The letter need not be
from the Latin alphabet; any alphabet representable is valid.
83 @cindex upper-case character
86 @deftypefun int isupper (int @var{c})
87 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
Returns true if @var{c} is an upper-case letter.  The letter need not be
from the Latin alphabet; any alphabet representable is valid.
92 @cindex alphabetic character
95 @deftypefun int isalpha (int @var{c})
96 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
97 Returns true if @var{c} is an alphabetic character (a letter). If
98 @code{islower} or @code{isupper} is true of a character, then
99 @code{isalpha} is also true.
101 In some locales, there may be additional characters for which
102 @code{isalpha} is true---letters which are neither upper case nor lower
103 case. But in the standard @code{"C"} locale, there are no such
104 additional characters.
107 @cindex digit character
108 @cindex decimal digit character
111 @deftypefun int isdigit (int @var{c})
112 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
113 Returns true if @var{c} is a decimal digit (@samp{0} through @samp{9}).
116 @cindex alphanumeric character
119 @deftypefun int isalnum (int @var{c})
120 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
121 Returns true if @var{c} is an alphanumeric character (a letter or
122 number); in other words, if either @code{isalpha} or @code{isdigit} is
123 true of a character, then @code{isalnum} is also true.
126 @cindex hexadecimal digit character
129 @deftypefun int isxdigit (int @var{c})
130 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
131 Returns true if @var{c} is a hexadecimal digit.
132 Hexadecimal digits include the normal decimal digits @samp{0} through
133 @samp{9} and the letters @samp{A} through @samp{F} and
134 @samp{a} through @samp{f}.
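
As an illustration, @code{isxdigit} can guard a conversion from a hexadecimal digit to its numeric value.  This is only a sketch: the name @code{hex_digit_value} is hypothetical, and the arithmetic on letters assumes a character set in which @samp{a} through @samp{f} are contiguous (true of ASCII, but not guaranteed by @w{ISO C}).

```c
#include <ctype.h>

/* Return the value of the hexadecimal digit C, or -1 if C is not a
   hexadecimal digit.  C must be EOF or representable as unsigned
   char.  The name hex_digit_value is hypothetical.  */
int
hex_digit_value (int c)
{
  if (!isxdigit (c))
    return -1;
  if (isdigit (c))
    return c - '0';
  /* Assumption: 'a' through 'f' are contiguous, as in ASCII.  */
  return tolower (c) - 'a' + 10;
}
```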
137 @cindex punctuation character
140 @deftypefun int ispunct (int @var{c})
141 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
142 Returns true if @var{c} is a punctuation character.
This means any printing character that is not alphanumeric or a space
character.
147 @cindex whitespace character
150 @deftypefun int isspace (int @var{c})
151 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
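Returns true if @var{c} is a @dfn{whitespace} character.  In the standard
@code{"C"} locale, @code{isspace} returns true for only the standard
whitespace characters: space (@code{' '}), formfeed (@code{'\f'}),
newline (@code{'\n'}), carriage return (@code{'\r'}), horizontal tab
(@code{'\t'}), and vertical tab (@code{'\v'}).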
177 @cindex blank character
180 @deftypefun int isblank (int @var{c})
181 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
182 Returns true if @var{c} is a blank character; that is, a space or a tab.
183 This function was originally a GNU extension, but was added in @w{ISO C99}.
186 @cindex graphic character
189 @deftypefun int isgraph (int @var{c})
190 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
191 Returns true if @var{c} is a graphic character; that is, a character
that has a glyph associated with it.  The whitespace characters are not
considered graphic.
196 @cindex printing character
199 @deftypefun int isprint (int @var{c})
200 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
201 Returns true if @var{c} is a printing character. Printing characters
202 include all the graphic characters, plus the space (@samp{ }) character.
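
One common use of @code{isprint} is to make arbitrary bytes safe to display.  The sketch below replaces every non-printing character of a string with @samp{?}; the function name @code{mask_nonprinting} and the choice of replacement character are assumptions made for this example.

```c
#include <ctype.h>

/* Replace each non-printing character of S with '?', in place.
   The name mask_nonprinting is hypothetical.  */
void
mask_nonprinting (char *s)
{
  for (; *s != '\0'; ++s)
    /* The cast keeps the argument in the range isprint accepts.  */
    if (!isprint ((unsigned char) *s))
      *s = '?';
}
```

Note that a tab is not a printing character, so it is replaced as well, while an ordinary space is kept.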
205 @cindex control character
208 @deftypefun int iscntrl (int @var{c})
209 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
210 Returns true if @var{c} is a control character (that is, a character that
211 is not a printing character).
214 @cindex ASCII character
217 @deftypefun int isascii (int @var{c})
218 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
219 Returns true if @var{c} is a 7-bit @code{unsigned char} value that fits
220 into the US/UK ASCII character set. This function is a BSD extension
221 and is also an SVID extension.
224 @node Case Conversion, Classification of Wide Characters, Classification of Characters, Character Handling
225 @section Case Conversion
226 @cindex character case conversion
227 @cindex case conversion of characters
228 @cindex converting case of characters
230 This section explains the library functions for performing conversions
231 such as case mappings on characters. For example, @code{toupper}
232 converts any character to upper case if possible. If the character
233 can't be converted, @code{toupper} returns it unchanged.
235 These functions take one argument of type @code{int}, which is the
236 character to convert, and return the converted character as an
237 @code{int}. If the conversion is not applicable to the argument given,
238 the argument is returned unchanged.
@strong{Compatibility Note:} In pre-@w{ISO C} dialects, instead of
returning the argument unchanged, these functions may fail when the
argument is not suitable for the conversion.  Thus for portability, you
may need to write @code{islower(c) ? toupper(c) : c} rather than just
@code{toupper(c)}.
246 These functions are declared in the header file @file{ctype.h}.
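
Since these functions return their argument unchanged when no conversion applies, a whole string can be converted without testing each character first.  The following sketch shows this; the name @code{upcase_string} is hypothetical.

```c
#include <ctype.h>

/* Convert the string S to upper case in place.  Characters that have
   no upper-case counterpart are left as they are.  The name
   upcase_string is hypothetical.  */
void
upcase_string (char *s)
{
  for (; *s != '\0'; ++s)
    /* Pass an unsigned char value, then narrow the int result.  */
    *s = (char) toupper ((unsigned char) *s);
}
```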
251 @deftypefun int tolower (int @var{c})
252 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
253 @c The to* macros/functions call different functions that use different
@c arrays than those of __ctype_b_loc, but the access patterns and
255 @c thus safety guarantees are the same.
256 If @var{c} is an upper-case letter, @code{tolower} returns the corresponding
257 lower-case letter. If @var{c} is not an upper-case letter,
258 @var{c} is returned unchanged.
263 @deftypefun int toupper (int @var{c})
264 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
265 If @var{c} is a lower-case letter, @code{toupper} returns the corresponding
266 upper-case letter. Otherwise @var{c} is returned unchanged.
271 @deftypefun int toascii (int @var{c})
272 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
273 This function converts @var{c} to a 7-bit @code{unsigned char} value
274 that fits into the US/UK ASCII character set, by clearing the high-order
275 bits. This function is a BSD extension and is also an SVID extension.
280 @deftypefun int _tolower (int @var{c})
281 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
282 This is identical to @code{tolower}, and is provided for compatibility
283 with the SVID. @xref{SVID}.@refill
288 @deftypefun int _toupper (int @var{c})
289 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
This is identical to @code{toupper}, and is provided for compatibility
with the SVID.  @xref{SVID}.
295 @node Classification of Wide Characters, Using Wide Char Classes, Case Conversion, Character Handling
296 @section Character class determination for wide characters
@w{Amendment 1} to @w{ISO C90} defines functions to classify wide
characters.  Although the original @w{ISO C90} standard already defined
the type @code{wchar_t}, it defined no functions operating on it.
The classification functions for wide characters follow a more general
design.  It allows extensions to the set of available
classifications, beyond those which are always available.  The POSIX
standard specifies how extensions can be made, and this is already
implemented in the @glibcadj{} implementation of the @code{localedef}
program.
309 The character class functions are normally implemented with bitsets,
310 with a bitset per character. For a given character, the appropriate
311 bitset is read from a table and a test is performed as to whether a
certain bit is set.  Which bit is tested for is determined by the
class.
For the wide character classification functions this mechanism is made
visible.  There is a classification type, a function to retrieve a
value of this type for a given class, and a function to test whether a
given character is in this class, using the classification value.  The
normal character classification functions, as used for @code{char}
objects, can be defined on top of this.
324 @deftp {Data type} wctype_t
The @code{wctype_t} type can hold a value which represents a character class.
326 The only defined way to generate such a value is by using the
327 @code{wctype} function.
330 This type is defined in @file{wctype.h}.
335 @deftypefun wctype_t wctype (const char *@var{property})
336 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
337 @c Although the source code of wctype contains multiple references to
338 @c the locale, that could each reference different locale_data objects
339 @c should the global locale object change while active, the compiler can
340 @c and does combine them all into a single dereference that resolves
341 @c once to the LCTYPE locale object used throughout the function, so it
342 @c is safe in (optimized) practice, if not in theory, even when the
343 @c locale changes. Ideally we'd explicitly save the resolved
344 @c locale_data object to make it visibly safe instead of safe only under
345 @c compiler optimizations, but given the decision that setlocale is
346 @c MT-Unsafe, all this would afford us would be the ability to not mark
347 @c this function with @mtslocale.
The @code{wctype} function returns a value representing a class of wide
characters which is identified by the string @var{property}.  Besides
some standard properties, each locale can define its own.  If
no property with the given name is known for the current locale
selected for the @code{LC_CTYPE} category, the function returns zero.
355 The properties known in every locale are:
@multitable @columnfractions .25 .25 .25 .25
@item
@code{"alnum"} @tab @code{"alpha"} @tab @code{"cntrl"} @tab @code{"digit"}
@item
@code{"graph"} @tab @code{"lower"} @tab @code{"print"} @tab @code{"punct"}
@item
@code{"space"} @tab @code{"upper"} @tab @code{"xdigit"}
@end multitable
367 This function is declared in @file{wctype.h}.
To test whether a character belongs to one of the non-standard classes,
the @w{ISO C} standard defines a completely new function.
375 @deftypefun int iswctype (wint_t @var{wc}, wctype_t @var{desc})
376 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
377 @c The compressed lookup table returned by wctype is read-only.
This function returns a nonzero value if @var{wc} is in the character
class specified by @var{desc}.  @var{desc} must have been returned
by a previous successful call to @code{wctype}.
383 This function is declared in @file{wctype.h}.
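
The two functions are meant to be used together: @code{wctype} looks up the class by name and @code{iswctype} tests a character against the result.  The sketch below combines them and treats an unknown class name as ``not a member''; the name @code{wide_char_in_class} is hypothetical.

```c
#include <wctype.h>

/* Return 1 if WC belongs to the class named NAME in the locale
   currently selected for LC_CTYPE, and 0 if it does not or if the
   locale defines no such class.  The name wide_char_in_class is
   hypothetical.  */
int
wide_char_in_class (wint_t wc, const char *name)
{
  wctype_t desc = wctype (name);

  /* Zero means the lookup failed; only the result of a successful
     call may be handed to iswctype.  */
  if (desc == (wctype_t) 0)
    return 0;
  return iswctype (wc, desc) != 0;
}
```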
386 To make it easier to use the commonly-used classification functions,
387 they are defined in the C library. There is no need to use
388 @code{wctype} if the property string is one of the known character
classes.  In some situations it is desirable to construct the property
strings, and then it is important that @code{wctype} can also handle the
standard classes.
393 @cindex alphanumeric character
396 @deftypefun int iswalnum (wint_t @var{wc})
397 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
398 @c The implicit wctype call in the isw* functions is actually an
399 @c optimized version because the category has a known offset, but the
400 @c wctype is equally safe when optimized, unsafe with changing locales
401 @c if not optimized (thus @mtslocale). Since it's not a macro, we
402 @c always optimize, and the locale can't change in any MT-Safe way, it's
403 @c fine. The test whether wc is ASCII to use the non-wide is*
404 @c macro/function doesn't bring any other safety issues: the test does
405 @c not depend on the locale, and each path after the decision resolves
406 @c the locale object only once.
This function returns a nonzero value if @var{wc} is an alphanumeric
character (a letter or number); in other words, if either @code{iswalpha}
or @code{iswdigit} is true of a character, then @code{iswalnum} is also
true.
413 This function can be implemented using
416 iswctype (wc, wctype ("alnum"))
420 It is declared in @file{wctype.h}.
423 @cindex alphabetic character
426 @deftypefun int iswalpha (wint_t @var{wc})
427 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
428 Returns true if @var{wc} is an alphabetic character (a letter). If
429 @code{iswlower} or @code{iswupper} is true of a character, then
430 @code{iswalpha} is also true.
432 In some locales, there may be additional characters for which
433 @code{iswalpha} is true---letters which are neither upper case nor lower
434 case. But in the standard @code{"C"} locale, there are no such
435 additional characters.
438 This function can be implemented using
441 iswctype (wc, wctype ("alpha"))
445 It is declared in @file{wctype.h}.
448 @cindex control character
451 @deftypefun int iswcntrl (wint_t @var{wc})
452 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
453 Returns true if @var{wc} is a control character (that is, a character that
454 is not a printing character).
457 This function can be implemented using
460 iswctype (wc, wctype ("cntrl"))
464 It is declared in @file{wctype.h}.
467 @cindex digit character
470 @deftypefun int iswdigit (wint_t @var{wc})
471 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
472 Returns true if @var{wc} is a digit (e.g., @samp{0} through @samp{9}).
Please note that this function does not only return a nonzero value for
@emph{decimal} digits, but for all kinds of digits.  A consequence is
that code like the following will @strong{not} work unconditionally for
wide characters:
480 while (iswdigit (*wc))
488 This function can be implemented using
491 iswctype (wc, wctype ("digit"))
495 It is declared in @file{wctype.h}.
498 @cindex graphic character
501 @deftypefun int iswgraph (wint_t @var{wc})
502 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
503 Returns true if @var{wc} is a graphic character; that is, a character
that has a glyph associated with it.  The whitespace characters are not
considered graphic.
508 This function can be implemented using
511 iswctype (wc, wctype ("graph"))
515 It is declared in @file{wctype.h}.
518 @cindex lower-case character
521 @deftypefun int iswlower (wint_t @var{wc})
522 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
Returns true if @var{wc} is a lower-case letter.  The letter need not be
from the Latin alphabet; any alphabet representable is valid.
527 This function can be implemented using
530 iswctype (wc, wctype ("lower"))
534 It is declared in @file{wctype.h}.
537 @cindex printing character
540 @deftypefun int iswprint (wint_t @var{wc})
541 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
542 Returns true if @var{wc} is a printing character. Printing characters
543 include all the graphic characters, plus the space (@samp{ }) character.
546 This function can be implemented using
549 iswctype (wc, wctype ("print"))
553 It is declared in @file{wctype.h}.
556 @cindex punctuation character
559 @deftypefun int iswpunct (wint_t @var{wc})
560 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
561 Returns true if @var{wc} is a punctuation character.
This means any printing character that is not alphanumeric or a space
character.
566 This function can be implemented using
569 iswctype (wc, wctype ("punct"))
573 It is declared in @file{wctype.h}.
576 @cindex whitespace character
579 @deftypefun int iswspace (wint_t @var{wc})
580 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
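Returns true if @var{wc} is a @dfn{whitespace} character.  In the standard
@code{"C"} locale, @code{iswspace} returns true for only the standard
whitespace characters: space (@code{L' '}), formfeed (@code{L'\f'}),
newline (@code{L'\n'}), carriage return (@code{L'\r'}), horizontal tab
(@code{L'\t'}), and vertical tab (@code{L'\v'}).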
606 This function can be implemented using
609 iswctype (wc, wctype ("space"))
613 It is declared in @file{wctype.h}.
616 @cindex upper-case character
619 @deftypefun int iswupper (wint_t @var{wc})
620 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
Returns true if @var{wc} is an upper-case letter.  The letter need not be
from the Latin alphabet; any alphabet representable is valid.
625 This function can be implemented using
628 iswctype (wc, wctype ("upper"))
632 It is declared in @file{wctype.h}.
635 @cindex hexadecimal digit character
638 @deftypefun int iswxdigit (wint_t @var{wc})
639 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
640 Returns true if @var{wc} is a hexadecimal digit.
641 Hexadecimal digits include the normal decimal digits @samp{0} through
642 @samp{9} and the letters @samp{A} through @samp{F} and
643 @samp{a} through @samp{f}.
646 This function can be implemented using
649 iswctype (wc, wctype ("xdigit"))
653 It is declared in @file{wctype.h}.
@Theglibc{} also provides a function which was not defined in the
@w{ISO C} standard before @w{ISO C99} but which is available as a
version for single-byte characters as well.
660 @cindex blank character
663 @deftypefun int iswblank (wint_t @var{wc})
664 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
665 Returns true if @var{wc} is a blank character; that is, a space or a tab.
666 This function was originally a GNU extension, but was added in @w{ISO C99}.
It is declared in @file{wctype.h}.
670 @node Using Wide Char Classes, Wide Character Case Conversion, Classification of Wide Characters, Character Handling
671 @section Notes on using the wide character classes
673 The first note is probably not astonishing but still occasionally a
674 cause of problems. The @code{isw@var{XXX}} functions can be implemented
675 using macros and in fact, @theglibc{} does this. They are still
676 available as real functions but when the @file{wctype.h} header is
677 included the macros will be used. This is the same as the
678 @code{char} type versions of these functions.
680 The second note covers something new. It can be best illustrated by a
681 (real-world) example. The first piece of code is an excerpt from the
682 original code. It is truncated a bit but the intention should be clear.
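@smallexample
int
is_in_class (int c, const char *class)
@{
  if (strcmp (class, "alnum") == 0)
    return isalnum (c);
  if (strcmp (class, "alpha") == 0)
    return isalpha (c);
  if (strcmp (class, "cntrl") == 0)
    return iscntrl (c);
  @dots{}
  return 0;
@}
@end smallexample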
699 Now, with the @code{wctype} and @code{iswctype} you can avoid the
700 @code{if} cascades, but rewriting the code as follows is wrong:
@smallexample
int
is_in_class (int c, const char *class)
@{
  wctype_t desc = wctype (class);
  return desc ? iswctype ((wint_t) c, desc) : 0;
@}
@end smallexample
711 The problem is that it is not guaranteed that the wide character
712 representation of a single-byte character can be found using casting.
713 In fact, usually this fails miserably. The correct solution to this
714 problem is to write the code as follows:
@smallexample
int
is_in_class (int c, const char *class)
@{
  wctype_t desc = wctype (class);
  return desc ? iswctype (btowc (c), desc) : 0;
@}
@end smallexample
725 @xref{Converting a Character}, for more information on @code{btowc}.
726 Note that this change probably does not improve the performance
727 of the program a lot since the @code{wctype} function still has to make
728 the string comparisons. It gets really interesting if the
729 @code{is_in_class} function is called more than once for the
730 same class name. In this case the variable @var{desc} could be computed
731 once and reused for all the calls. Therefore the above form of the
732 function is probably not the final one.
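
One possible shape for a version that looks the class up only once is sketched below.  It processes a whole string per call, so the descriptor is computed a single time and reused for every character.  The function name @code{count_in_class} is hypothetical, and treating an unknown class as matching nothing is a choice made for this example.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Count the single-byte characters of S that belong to the class
   named CLASS.  The name count_in_class is hypothetical.  */
size_t
count_in_class (const char *s, const char *class)
{
  /* Resolve the class name once, not once per character.  */
  wctype_t desc = wctype (class);
  size_t n = 0;

  if (desc == (wctype_t) 0)
    return 0;
  for (; *s != '\0'; ++s)
    {
      /* btowc returns WEOF for a byte that is not a complete
         character on its own; skip those.  */
      wint_t wc = btowc ((unsigned char) *s);
      if (wc != WEOF && iswctype (wc, desc))
        ++n;
    }
  return n;
}
```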
735 @node Wide Character Case Conversion, , Using Wide Char Classes, Character Handling
@section Mapping of wide characters

The mapping functions are also generalized by the @w{ISO C}
standard.  Instead of just allowing the two standard mappings, a
locale can contain others.  Again, the @code{localedef} program
already supports generating such locale data files.
745 @deftp {Data Type} wctrans_t
746 This data type is defined as a scalar type which can hold a value
747 representing the locale-dependent character mapping. There is no way to
748 construct such a value apart from using the return value of the
749 @code{wctrans} function.
753 This type is defined in @file{wctype.h}.
758 @deftypefun wctrans_t wctrans (const char *@var{property})
759 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
760 @c Similar implementation, same caveats as wctype.
761 The @code{wctrans} function has to be used to find out whether a named
762 mapping is defined in the current locale selected for the
763 @code{LC_CTYPE} category. If the returned value is non-zero, you can use
it afterwards in calls to @code{towctrans}.  If the return value is
zero, no such mapping is known in the current locale.
Besides locale-specific mappings, there are two mappings which are
guaranteed to be available in every locale:

@multitable @columnfractions .5 .5
@item
@code{"tolower"} @tab @code{"toupper"}
@end multitable

This function is declared in @file{wctype.h}.
782 @deftypefun wint_t towctrans (wint_t @var{wc}, wctrans_t @var{desc})
783 @safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
784 @c Same caveats as iswctype.
785 @code{towctrans} maps the input character @var{wc}
786 according to the rules of the mapping for which @var{desc} is a
787 descriptor, and returns the value it finds. @var{desc} must be
788 obtained by a successful call to @code{wctrans}.
792 This function is declared in @file{wctype.h}.
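
As a sketch of how @code{wctrans} and @code{towctrans} fit together, the following function applies a named mapping to every character of a wide string.  The name @code{map_wide_string} and the convention of returning @code{-1} for an unknown mapping are assumptions made for this example.

```c
#include <wchar.h>
#include <wctype.h>

/* Apply the mapping named NAME to each character of WS, in place.
   Return 0 on success, or -1 if the current locale defines no such
   mapping.  The name map_wide_string is hypothetical.  */
int
map_wide_string (wchar_t *ws, const char *name)
{
  wctrans_t desc = wctrans (name);

  /* Only the result of a successful wctrans call may be used.  */
  if (desc == (wctrans_t) 0)
    return -1;
  for (; *ws != L'\0'; ++ws)
    *ws = (wchar_t) towctrans ((wint_t) *ws, desc);
  return 0;
}
```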
For the generally available mappings, the @w{ISO C} standard defines
convenient shortcuts so that it is not necessary to call @code{wctrans}
for them.
801 @deftypefun wint_t towlower (wint_t @var{wc})
802 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
803 @c Same caveats as iswalnum, just using a wctrans rather than a wctype
805 If @var{wc} is an upper-case letter, @code{towlower} returns the corresponding
806 lower-case letter. If @var{wc} is not an upper-case letter,
807 @var{wc} is returned unchanged.
810 @code{towlower} can be implemented using
813 towctrans (wc, wctrans ("tolower"))
818 This function is declared in @file{wctype.h}.
823 @deftypefun wint_t towupper (wint_t @var{wc})
824 @safety{@prelim{}@mtsafe{@mtslocale{}}@assafe{}@acsafe{}}
825 If @var{wc} is a lower-case letter, @code{towupper} returns the corresponding
826 upper-case letter. Otherwise @var{wc} is returned unchanged.
829 @code{towupper} can be implemented using
832 towctrans (wc, wctrans ("toupper"))
837 This function is declared in @file{wctype.h}.
840 The same warnings given in the last section for the use of the wide
841 character classification functions apply here. It is not possible to
842 simply cast a @code{char} type value to a @code{wint_t} and use it as an
843 argument to @code{towctrans} calls.
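
A single-byte character can instead be converted with @code{btowc}, mapped, and converted back with @code{wctob}.  The following is a minimal sketch under that approach; the name @code{map_byte} is hypothetical, and returning the input unchanged when no single-byte result exists is a choice made for this example.

```c
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Map the single-byte character C (EOF or an unsigned char value)
   through the mapping DESC, which must come from a successful
   wctrans call.  Return C unchanged if it has no wide counterpart
   or if the result does not fit in a single byte.  The name
   map_byte is hypothetical.  */
int
map_byte (int c, wctrans_t desc)
{
  wint_t wc = btowc (c);
  int result;

  if (wc == WEOF)
    return c;
  /* wctob returns EOF when the mapped character is not a
     single-byte character.  */
  result = wctob (towctrans (wc, desc));
  return result == EOF ? c : result;
}
```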