manual/string.texi

   1 @node String and Array Utilities, Extended Characters, Character Handling, Top
   2 @c %MENU% Utilities for copying and comparing strings and arrays
   3 @chapter String and Array Utilities
   4
   5 Operations on strings (or arrays of characters) are an important part of
   6 many programs.  The GNU C library provides an extensive set of string
   7 utility functions, including functions for copying, concatenating,
   8 comparing, and searching strings.  Many of these functions can also
   9 operate on arbitrary regions of storage; for example, the @code{memcpy}
  10 function can be used to copy the contents of any kind of array.
  11
  12 It's fairly common for beginning C programmers to ``reinvent the wheel''
  13 by duplicating this functionality in their own code, but it pays to
  14 become familiar with the library functions and to make use of them,
  15 since this offers benefits in maintenance, efficiency, and portability.
  16
  17 For instance, you could easily compare one string to another in two
  18 lines of C code, but if you use the built-in @code{strcmp} function,
  19 you're less likely to make a mistake.  And, since these library
  20 functions are typically highly optimized, your program may run faster
  21 too.
  22
  23 @menu
  24 * Representation of Strings::   Introduction to basic concepts.
  25 * String/Array Conventions::    Whether to use a string function or an
  26                                  arbitrary array function.
  27 * String Length::               Determining the length of a string.
  28 * Copying and Concatenation::   Functions to copy the contents of strings
  29                                  and arrays.
  30 * String/Array Comparison::     Functions for byte-wise and character-wise
  31                                  comparison.
  32 * Collation Functions::         Functions for collating strings.
  33 * Search Functions::            Searching for a specific element or substring.
  34 * Finding Tokens in a String::  Splitting a string into tokens by looking
  35                                  for delimiters.
  36 * Encode Binary Data::          Encoding and Decoding of Binary Data.
  37 * Argz and Envz Vectors::       Null-separated string vectors.
  38 @end menu
  39
  40 @node Representation of Strings
  41 @section Representation of Strings
  42 @cindex string, representation of
  43
  44 This section is a quick summary of string concepts for beginning C
  45 programmers.  It describes how character strings are represented in C
  46 and some common pitfalls.  If you are already familiar with this
  47 material, you can skip this section.
  48
  49 @cindex string
  50 @cindex null character
  51 A @dfn{string} is an array of @code{char} objects.  But string-valued
  52 variables are usually declared to be pointers of type @code{char *}.
  53 Such variables do not include space for the text of a string; that has
  54 to be stored somewhere else---in an array variable, a string constant,
  55 or dynamically allocated memory (@pxref{Memory Allocation}).  It's up to
  56 you to store the address of the chosen memory space into the pointer
  57 variable.  Alternatively you can store a @dfn{null pointer} in the
  58 pointer variable.  The null pointer does not point anywhere, so
  59 attempting to reference the string it points to gets an error.
  60
  61 By convention, a @dfn{null character}, @code{'\0'}, marks the end of a
  62 string.  For example, in testing to see whether the @code{char *}
  63 variable @var{p} points to a null character marking the end of a string,
  64 you can write @code{!*@var{p}} or @code{*@var{p} == '\0'}.
  65
  66 A null character is quite different conceptually from a null pointer,
  67 although both are represented by the integer @code{0}.
  68
  69 @cindex string literal
  70 @dfn{String literals} appear in C program source as strings of
  71 characters between double-quote characters (@samp{"}).  In @w{ISO C},
  72 string literals can also be formed by @dfn{string concatenation}:
  73 @code{"a" "b"} is the same as @code{"ab"}.  Modification of string
  74 literals is not allowed by the GNU C compiler, because literals
  75 are placed in read-only storage.
  76
  77 Character arrays that are declared @code{const} cannot be modified
  78 either.  It's generally good style to declare non-modifiable string
  79 pointers to be of type @code{const char *}, since this often allows the
  80 C compiler to detect accidental modifications as well as providing some
  81 amount of documentation about what your program intends to do with the
  82 string.
  83
  84 The amount of memory allocated for the character array may extend past
  85 the null character that normally marks the end of the string.  In this
  86 document, the term @dfn{allocated size} is always used to refer to the
  87 total amount of memory allocated for the string, while the term
  88 @dfn{length} refers to the number of characters up to (but not
  89 including) the terminating null character.
  90 @cindex length of string
  91 @cindex allocation size of string
  92 @cindex size of string
  93 @cindex string length
  94 @cindex string allocation
  95
  96 A notorious source of program bugs is trying to put more characters in a
  97 string than fit in its allocated size.  When writing code that extends
  98 strings or moves characters into a pre-allocated array, you should be
  99 very careful to keep track of the length of the text and make explicit
 100 checks for overflowing the array.  Many of the library functions
 101 @emph{do not} do this for you!  Remember also that you need to allocate
 102 an extra byte to hold the null character that marks the end of the
 103 string.
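
The explicit overflow check described above can be sketched as follows.
This is only an illustration: @code{copy_if_fits} is a hypothetical helper,
not a library function, and it assumes the caller knows the allocated size
of the destination.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy SRC into DST only if it fits, including the
   terminating null character.  Returns 0 on success, or -1 if SRC is
   too long (DST is left untouched in that case).  */
int
copy_if_fits (char *dst, size_t dst_size, const char *src)
{
  size_t len = strlen (src);

  if (len + 1 > dst_size)   /* The extra byte holds the null character.  */
    return -1;
  memcpy (dst, src, len + 1);
  return 0;
}
```

With a six-byte destination, @code{"hello"} fits exactly, while
@code{"hello!"} is rejected.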
 104
 105 @node String/Array Conventions
 106 @section String and Array Conventions
 107
 108 This chapter describes both functions that work on arbitrary arrays or
 109 blocks of memory, and functions that are specific to null-terminated
 110 arrays of characters.
 111
 112 Functions that operate on arbitrary blocks of memory have names
 113 beginning with @samp{mem} (such as @code{memcpy}) and invariably take an
 114 argument which specifies the size (in bytes) of the block of memory to
 115 operate on.  The array arguments and return values for these functions
 116 have type @code{void *}, and as a matter of style, the elements of these
 117 arrays are referred to as ``bytes''.  You can pass any kind of pointer
 118 to these functions, and the @code{sizeof} operator is useful in
 119 computing the value for the size argument.
 120
 121 In contrast, functions that operate specifically on strings have names
 122 beginning with @samp{str} (such as @code{strcpy}) and look for a null
 123 character to terminate the string instead of requiring an explicit size
 124 argument to be passed.  (Some of these functions accept a specified
 125 maximum length, but they also check for premature termination with a
 126 null character.)  The array arguments and return values for these
 127 functions have type @code{char *}, and the array elements are referred
 128 to as ``characters''.
 129
 130 In many cases, there are both @samp{mem} and @samp{str} versions of a
 131 function.  The one that is more appropriate to use depends on the exact
 132 situation.  When your program is manipulating arbitrary arrays or blocks of
 133 storage, then you should always use the @samp{mem} functions.  On the
 134 other hand, when you are manipulating null-terminated strings it is
 135 usually more convenient to use the @samp{str} functions, unless you
 136 already know the length of the string in advance.
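
As a rough sketch of the distinction, the two hypothetical helpers below
(the names are made up for this example) use @code{memcpy} both for an
arbitrary array, where @code{sizeof} supplies the size in bytes, and for a
string whose length the caller already knows.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy COUNT ints.  The size argument of memcpy is
   in bytes, so it is computed with sizeof.  */
void
copy_ints (int *to, const int *from, size_t count)
{
  memcpy (to, from, count * sizeof *from);
}

/* Hypothetical helper: copy a string whose length LEN is already known.
   TO must have room for LEN + 1 bytes.  */
void
copy_known_length (char *to, const char *from, size_t len)
{
  memcpy (to, from, len);
  to[len] = '\0';           /* memcpy knows nothing about terminators.  */
}
```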
 137
 138 @node String Length
 139 @section String Length
 140
 141 You can get the length of a string using the @code{strlen} function.
 142 This function is declared in the header file @file{string.h}.
 143 @pindex string.h
 144
 145 @comment string.h
 146 @comment ISO
 147 @deftypefun size_t strlen (const char *@var{s})
 148 The @code{strlen} function returns the length of the null-terminated
 149 string @var{s}.  (In other words, it returns the offset of the terminating
 150 null character within the array.)
 151
 152 For example,
 153 @smallexample
 154 strlen ("hello, world")
 155     @result{} 12
 156 @end smallexample
 157
 158 When applied to a character array, the @code{strlen} function returns
 159 the length of the string stored there, not its allocated size.  You can
 160 get the allocated size of the character array that holds a string using
 161 the @code{sizeof} operator:
 162
 163 @smallexample
 164 char string[32] = "hello, world";
 165 sizeof (string)
 166     @result{} 32
 167 strlen (string)
 168     @result{} 12
 169 @end smallexample
 170
 171 But beware, this will not work unless @var{string} is the character
 172 array itself, not a pointer to it.  For example:
 173
 174 @smallexample
 175 char string[32] = "hello, world";
 176 char *ptr = string;
 177 sizeof (string)
 178     @result{} 32
 179 sizeof (ptr)
 180     @result{} 4  /* @r{(on a machine with 4 byte pointers)} */
 181 @end smallexample
 182
 183 This is an easy mistake to make when you are working with functions that
 184 take string arguments; those arguments are always pointers, not arrays.
 185
 186 @end deftypefun
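
One common way around this pitfall is to pass the allocated size as a
separate argument.  The following is a minimal sketch; @code{free_space} is
a hypothetical helper and assumes that @var{buf} holds a null-terminated
string that fits within @var{allocated_size} bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: BUF is a pointer here, so sizeof (buf) would only
   give the size of a pointer.  The caller passes the allocated size
   explicitly instead.  Returns the number of characters that can still
   be appended, keeping one byte for the null character.  */
size_t
free_space (const char *buf, size_t allocated_size)
{
  return allocated_size - strlen (buf) - 1;
}
```

For the 32-byte array holding @code{"hello, world"} shown above, a call
such as @code{free_space (string, sizeof (string))} yields 19.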
 187
 188 @comment string.h
 189 @comment GNU
 190 @deftypefun size_t strnlen (const char *@var{s}, size_t @var{maxlen})
 191 The @code{strnlen} function returns the length of the null-terminated
 192 string @var{s} if this length is smaller than @var{maxlen}.  Otherwise
 193 it returns @var{maxlen}.  Therefore this function is equivalent to
 194 @code{(strlen (@var{s}) < @var{maxlen} ? strlen (@var{s}) : @var{maxlen})}
 195 but it is more efficient, since it examines at most @var{maxlen} bytes.
 196
 197 @smallexample
 198 char string[32] = "hello, world";
 199 strnlen (string, 32)
 200     @result{} 12
 201 strnlen (string, 5)
 202     @result{} 5
 203 @end smallexample
 204
 205 This function is a GNU extension.
 206 @end deftypefun
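
Because @code{strnlen} stops after @var{maxlen} bytes, it can be used to
check whether a fixed-size buffer holds a terminated string at all.  The
sketch below assumes a hypothetical helper named @code{is_terminated}.

```c
#define _GNU_SOURCE         /* strnlen is described above as a GNU extension.  */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return nonzero if the SIZE-byte buffer BUF
   contains a null character, that is, if it holds a properly
   terminated string.  */
int
is_terminated (const char *buf, size_t size)
{
  return strnlen (buf, size) < size;
}
```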
 207
 208 @node Copying and Concatenation
 209 @section Copying and Concatenation
 210
 211 You can use the functions described in this section to copy the contents
 212 of strings and arrays, or to append the contents of one string to
 213 another.  These functions are declared in the header file
 214 @file{string.h}.
 215 @pindex string.h
 216 @cindex copying strings and arrays
 217 @cindex string copy functions
 218 @cindex array copy functions
 219 @cindex concatenating strings
 220 @cindex string concatenation functions
 221
 222 A helpful way to remember the ordering of the arguments to the functions
 223 in this section is that it corresponds to an assignment expression, with
 224 the destination array specified to the left of the source array.  Most
 225 of these functions return the address of the destination array.
 226
 227 Most of these functions do not work properly if the source and
 228 destination arrays overlap.  For example, if the beginning of the
 229 destination array overlaps the end of the source array, the original
 230 contents of that part of the source array may get overwritten before it
 231 is copied.  Even worse, in the case of the string functions, the null
 232 character marking the end of the string may be lost, and the copy
 233 function might get stuck in a loop trashing all the memory allocated to
 234 your program.
 235
 236 All functions that have problems copying between overlapping arrays are
 237 explicitly identified in this manual.  In addition to functions in this
 238 section, there are a few others like @code{sprintf} (@pxref{Formatted
 239 Output Functions}) and @code{scanf} (@pxref{Formatted Input
 240 Functions}).
 241
 242 @comment string.h
 243 @comment ISO
 244 @deftypefun {void *} memcpy (void *@var{to}, const void *@var{from}, size_t @var{size})
 245 The @code{memcpy} function copies @var{size} bytes from the object
 246 beginning at @var{from} into the object beginning at @var{to}.  The
 247 behavior of this function is undefined if the two arrays @var{to} and
 248 @var{from} overlap; use @code{memmove} instead if overlapping is possible.
 249
 250 The value returned by @code{memcpy} is the value of @var{to}.
 251
 252 Here is an example of how you might use @code{memcpy} to copy the
 253 contents of an array:
 254
 255 @smallexample
 256 struct foo *oldarray, *newarray;
 257 int arraysize;
 258 @dots{}
 259 memcpy (newarray, oldarray, arraysize * sizeof (struct foo));
 260 @end smallexample
 261 @end deftypefun
 262
 263 @comment string.h
 264 @comment GNU
 265 @deftypefun {void *} mempcpy (void *@var{to}, const void *@var{from}, size_t @var{size})
 266 The @code{mempcpy} function is nearly identical to the @code{memcpy}
 267 function.  It copies @var{size} bytes from the object beginning at
 268 @var{from} into the object pointed to by @var{to}.  But instead of
 269 returning the value of @var{to} it returns a pointer to the byte
 270 following the last written byte in the object beginning at @var{to}.
 271 I.e., the value is @code{((void *) ((char *) @var{to} + @var{size}))}.
 272
 273 This function is useful in situations where several objects must be
 274 copied to consecutive memory positions.
 275
 276 @smallexample
 277 void *
 278 combine (void *o1, size_t s1, void *o2, size_t s2)
 279 @{
 280   void *result = malloc (s1 + s2);
 281   if (result != NULL)
 282     mempcpy (mempcpy (result, o1, s1), o2, s2);
 283   return result;
 284 @}
 285 @end smallexample
 286
 287 This function is a GNU extension.
 288 @end deftypefun
 289
 290 @comment string.h
 291 @comment ISO
 292 @deftypefun {void *} memmove (void *@var{to}, const void *@var{from}, size_t @var{size})
 293 @code{memmove} copies the @var{size} bytes at @var{from} into the
 294 @var{size} bytes at @var{to}, even if those two blocks of space
 295 overlap.  In the case of overlap, @code{memmove} is careful to copy the
 296 original values of the bytes in the block at @var{from}, including those
 297 bytes which also belong to the block at @var{to}.
 298 @end deftypefun
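
A typical case of overlapping blocks is shifting the contents of a string
within its own array.  This is a minimal sketch; @code{drop_prefix} is a
hypothetical helper, not a library function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: delete the first N characters of the string S in
   place.  Source and destination overlap, so memmove is used rather
   than memcpy.  */
void
drop_prefix (char *s, size_t n)
{
  size_t len = strlen (s);

  if (n > len)
    n = len;
  memmove (s, s + n, len - n + 1);  /* + 1 moves the null character too.  */
}
```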
 299
 300 @comment string.h
 301 @comment SVID
 302 @deftypefun {void *} memccpy (void *@var{to}, const void *@var{from}, int @var{c}, size_t @var{size})
 303 This function copies no more than @var{size} bytes from @var{from} to
 304 @var{to}, stopping if a byte matching @var{c} is found.  The return
 305 value is a pointer into @var{to} one byte past where @var{c} was copied,
 306 or a null pointer if no byte matching @var{c} appeared in the first
 307 @var{size} bytes of @var{from}.
 308 @end deftypefun
 309
 310 @comment string.h
 311 @comment ISO
 312 @deftypefun {void *} memset (void *@var{block}, int @var{c}, size_t @var{size})
 313 This function copies the value of @var{c} (converted to an
 314 @code{unsigned char}) into each of the first @var{size} bytes of the
 315 object beginning at @var{block}.  It returns the value of @var{block}.
 316 @end deftypefun
 317
 318 @comment string.h
 319 @comment ISO
 320 @deftypefun {char *} strcpy (char *@var{to}, const char *@var{from})
 321 This copies characters from the string @var{from} (up to and including
 322 the terminating null character) into the string @var{to}.  Like
 323 @code{memcpy}, this function has undefined results if the strings
 324 overlap.  The return value is the value of @var{to}.
 325 @end deftypefun
 326
 327 @comment string.h
 328 @comment ISO
 329 @deftypefun {char *} strncpy (char *@var{to}, const char *@var{from}, size_t @var{size})
 330 This function is similar to @code{strcpy} but always copies exactly
 331 @var{size} characters into @var{to}.
 332
 333 If the length of @var{from} is more than @var{size}, then @code{strncpy}
 334 copies just the first @var{size} characters.  Note that in this case
 335 there is no null terminator written into @var{to}.
 336
 337 If the length of @var{from} is less than @var{size}, then @code{strncpy}
 338 copies all of @var{from}, followed by enough null characters to add up
 339 to @var{size} characters in all.  This behavior is rarely useful, but it
 340 is specified by the @w{ISO C} standard.
 341
 342 The behavior of @code{strncpy} is undefined if the strings overlap.
 343
 344 Using @code{strncpy} as opposed to @code{strcpy} is a way to avoid bugs
 345 relating to writing past the end of the allocated space for @var{to}.
 346 However, it can also make your program much slower in one common case:
 347 copying a string which is probably small into a potentially large buffer.
 348 In this case, @var{size} may be large, and when it is, @code{strncpy} will
 349 waste a considerable amount of time copying null characters.
 350 @end deftypefun
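
Since @code{strncpy} writes no null terminator when it truncates, callers
usually add one themselves.  A minimal sketch, using a hypothetical helper
named @code{copy_truncated} and assuming @var{size} is nonzero:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy FROM into the SIZE-byte buffer TO,
   truncating if necessary, and always leave TO null-terminated.
   SIZE must be nonzero.  */
char *
copy_truncated (char *to, const char *from, size_t size)
{
  strncpy (to, from, size - 1);
  to[size - 1] = '\0';      /* strncpy does not do this on truncation.  */
  return to;
}
```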
 351
 352 @comment string.h
 353 @comment SVID
 354 @deftypefun {char *} strdup (const char *@var{s})
 355 This function copies the null-terminated string @var{s} into a newly
 356 allocated string.  The string is allocated using @code{malloc}; see
 357 @ref{Unconstrained Allocation}.  If @code{malloc} cannot allocate space
 358 for the new string, @code{strdup} returns a null pointer.  Otherwise it
 359 returns a pointer to the new string.
 360 @end deftypefun
 361
 362 @comment string.h
 363 @comment GNU
 364 @deftypefun {char *} strndup (const char *@var{s}, size_t @var{size})
 365 This function is similar to @code{strdup} but always copies at most
 366 @var{size} characters into the newly allocated string.
 367
 368 If the length of @var{s} is more than @var{size}, then @code{strndup}
 369 copies just the first @var{size} characters and adds a closing null
 370 terminator.  Otherwise all characters are copied and the string is
 371 terminated.
 372
 373 This function differs from @code{strncpy} in that it always
 374 terminates the destination string.
 375 @end deftypefun
 376
 377 @comment string.h
 378 @comment Unknown origin
 379 @deftypefun {char *} stpcpy (char *@var{to}, const char *@var{from})
 380 This function is like @code{strcpy}, except that it returns a pointer to
 381 the end of the string @var{to} (that is, the address of the terminating
 382 null character) rather than the beginning.
 383
 384 For example, this program uses @code{stpcpy} to concatenate @samp{foo}
 385 and @samp{bar} to produce @samp{foobar}, which it then prints.
 386
 387 @smallexample
 388 @include stpcpy.c.texi
 389 @end smallexample
 390
 391 This function is not part of the ISO or POSIX standards, and is not
 392 customary on Unix systems, but we did not invent it either.  Perhaps it
 393 comes from MS-DOG.
 394
 395 Its behavior is undefined if the strings overlap.
 396 @end deftypefun
 397
 398 @comment string.h
 399 @comment GNU
 400 @deftypefun {char *} stpncpy (char *@var{to}, const char *@var{from}, size_t @var{size})
 401 This function is similar to @code{stpcpy} but always copies exactly
 402 @var{size} characters into @var{to}.
 403
 404 If the length of @var{from} is more than @var{size}, then @code{stpncpy}
 405 copies just the first @var{size} characters and returns a pointer to the
 406 character directly following the one which was copied last.  Note that in
 407 this case there is no null terminator written into @var{to}.
 408
 409 If the length of @var{from} is less than @var{size}, then @code{stpncpy}
 410 copies all of @var{from}, followed by enough null characters to add up
 411 to @var{size} characters in all.  This behavior is rarely useful, but it
 412 is provided for contexts that rely on the corresponding behavior of
 413 @code{strncpy}.  @code{stpncpy} returns a pointer to the
 414 @emph{first} written null character.
 415
 416 This function is not part of ISO or POSIX but was found useful while
 417 developing the GNU C Library itself.
 418
 419 Its behavior is undefined if the strings overlap.
 420 @end deftypefun
 421
 422 @comment string.h
 423 @comment GNU
 424 @deftypefn {Macro} {char *} strdupa (const char *@var{s})
 425 This function is similar to @code{strdup} but allocates the new string
 426 using @code{alloca} instead of @code{malloc} (@pxref{Variable Size
 427 Automatic}).  This means of course the returned string has the same
 428 limitations as any block of memory allocated using @code{alloca}.
 429
 430 For obvious reasons @code{strdupa} is implemented only as a macro;
 431 you cannot get the address of this function.  Despite this limitation
 432 it is a useful function.  The following code shows a situation where
 433 using @code{malloc} would be a lot more expensive.
 434
 435 @smallexample
 436 @include strdupa.c.texi
 437 @end smallexample
 438
 439 Please note that calling @code{strtok} using @var{path} directly is
 440 invalid.
 441
 442 This function is only available if GNU CC is used.
 443 @end deftypefn
 444
 445 @comment string.h
 446 @comment GNU
 447 @deftypefn {Macro} {char *} strndupa (const char *@var{s}, size_t @var{size})
 448 This function is similar to @code{strndup} but like @code{strdupa} it
 449 allocates the new string using @code{alloca}
 450 (@pxref{Variable Size Automatic}).  The same advantages and limitations
 451 of @code{strdupa} are valid for @code{strndupa}, too.
 452
 453 This function is implemented only as a macro, just like @code{strdupa}.
 454
 455 @code{strndupa} is only available if GNU CC is used.
 456 @end deftypefn
 457
 458 @comment string.h
 459 @comment ISO
 460 @deftypefun {char *} strcat (char *@var{to}, const char *@var{from})
 461 The @code{strcat} function is similar to @code{strcpy}, except that the
 462 characters from @var{from} are concatenated or appended to the end of
 463 @var{to}, instead of overwriting it.  That is, the first character from
 464 @var{from} overwrites the null character marking the end of @var{to}.
 465
 466 An equivalent definition for @code{strcat} would be:
 467
 468 @smallexample
 469 char *
 470 strcat (char *to, const char *from)
 471 @{
 472   strcpy (to + strlen (to), from);
 473   return to;
 474 @}
 475 @end smallexample
 476
 477 This function has undefined results if the strings overlap.
 478 @end deftypefun
 479
 480 @comment string.h
 481 @comment ISO
 482 @deftypefun {char *} strncat (char *@var{to}, const char *@var{from}, size_t @var{size})
 483 This function is like @code{strcat} except that not more than @var{size}
 484 characters from @var{from} are appended to the end of @var{to}.  A
 485 single null character is also always appended to @var{to}, so the total
 486 allocated size of @var{to} must be at least @code{@var{size} + 1} bytes
 487 longer than its initial length.
 488
 489 The @code{strncat} function could be implemented like this:
 490
 491 @smallexample
 492 @group
 493 char *
 494 strncat (char *to, const char *from, size_t size)
 495 @{
 496   *((char *) mempcpy (to + strlen (to), from, strnlen (from, size))) = '\0';
 497   return to;
 498 @}
 499 @end group
 500 @end smallexample
 501
 502 The behavior of @code{strncat} is undefined if the strings overlap.
 503 @end deftypefun
 504
 505 Here is an example showing the use of @code{strncpy} and @code{strncat}.
 506 Notice how, in the call to @code{strncat}, the @var{size} parameter
 507 is computed to avoid overflowing the character array @code{buffer}.
 508
 509 @smallexample
 510 @include strncat.c.texi
 511 @end smallexample
 512
 513 @noindent
 514 The output produced by this program looks like:
 515
 516 @smallexample
 517 hello
 518 hello, wo
 519 @end smallexample
 520
 521 @comment string.h
 522 @comment BSD
 523 @deftypefun void bcopy (const void *@var{from}, void *@var{to}, size_t @var{size})
 524 This is a partially obsolete alternative for @code{memmove}, derived from
 525 BSD.  Note that it is not quite equivalent to @code{memmove}, because the
 526 arguments are not in the same order and there is no return value.
 527 @end deftypefun
 528
 529 @comment string.h
 530 @comment BSD
 531 @deftypefun void bzero (void *@var{block}, size_t @var{size})
 532 This is a partially obsolete alternative for @code{memset}, derived from
 533 BSD.  Note that it is not as general as @code{memset}, because the only
 534 value it can store is zero.
 535 @end deftypefun
 536
 537 @node String/Array Comparison
 538 @section String/Array Comparison
 539 @cindex comparing strings and arrays
 540 @cindex string comparison functions
 541 @cindex array comparison functions
 542 @cindex predicates on strings
 543 @cindex predicates on arrays
 544
 545 You can use the functions in this section to perform comparisons on the
 546 contents of strings and arrays.  As well as checking for equality, these
 547 functions can also be used as the ordering functions for sorting
 548 operations.  @xref{Searching and Sorting}, for an example of this.
 549
 550 Unlike most comparison operations in C, the string comparison functions
 551 return a nonzero value if the strings are @emph{not} equivalent rather
 552 than if they are.  The sign of the value indicates the relative ordering
 553 of the first characters in the strings that are not equivalent:  a
 554 negative value indicates that the first string is ``less'' than the
 555 second, while a positive value indicates that the first string is
 556 ``greater''.
 557
 558 The most common use of these functions is to check only for equality.
 559 This is canonically done with an expression like @w{@samp{! strcmp (s1, s2)}}.
 560
 561 All of these functions are declared in the header file @file{string.h}.
 562 @pindex string.h
 563
 564 @comment string.h
 565 @comment ISO
 566 @deftypefun int memcmp (const void *@var{a1}, const void *@var{a2}, size_t @var{size})
 567 The function @code{memcmp} compares the @var{size} bytes of memory
 568 beginning at @var{a1} against the @var{size} bytes of memory beginning
 569 at @var{a2}.  The value returned has the same sign as the difference
 570 between the first differing pair of bytes (interpreted as @code{unsigned
 571 char} objects, then promoted to @code{int}).
 572
 573 If the contents of the two blocks are equal, @code{memcmp} returns
 574 @code{0}.
 575 @end deftypefun
 576
 577 On arbitrary arrays, the @code{memcmp} function is mostly useful for
 578 testing equality.  It usually isn't meaningful to do byte-wise ordering
 579 comparisons on arrays of things other than bytes.  For example, a
 580 byte-wise comparison on the bytes that make up floating-point numbers
 581 isn't likely to tell you anything about the relationship between the
 582 values of the floating-point numbers.
 583
 584 You should also be careful about using @code{memcmp} to compare objects
 585 that can contain ``holes'', such as the padding inserted into structure
 586 objects to enforce alignment requirements, extra space at the end of
 587 unions, and extra characters at the ends of strings whose length is less
 588 than their allocated size.  The contents of these ``holes'' are
 589 indeterminate and may cause strange behavior when performing byte-wise
 590 comparisons.  For more predictable results, perform an explicit
 591 component-wise comparison.
 592
 593 For example, given a structure type definition like:
 594
 595 @smallexample
 596 struct foo
 597   @{
 598     unsigned char tag;
 599     union
 600       @{
 601         double f;
 602         long i;
 603         char *p;
 604       @} value;
 605   @};
 606 @end smallexample
 607
 608 @noindent
 609 you are better off writing a specialized comparison function to compare
 610 @code{struct foo} objects instead of comparing them with @code{memcmp}.
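
Such a comparison function might look like the sketch below.  The text does
not say how @code{tag} selects a union member, so the @code{FOO_*} values
and the name @code{foo_equal} are assumptions made for this example only.

```c
#include <string.h>

struct foo
  {
    unsigned char tag;
    union
      {
        double f;
        long i;
        char *p;
      } value;
  };

/* Assumed tag values; they are not part of the original definition.  */
enum { FOO_DOUBLE, FOO_LONG, FOO_STRING };

/* Hypothetical helper: return nonzero if A and B are equal, looking
   only at the members that are actually in use and never at padding.  */
int
foo_equal (const struct foo *a, const struct foo *b)
{
  if (a->tag != b->tag)
    return 0;
  switch (a->tag)
    {
    case FOO_DOUBLE:
      return a->value.f == b->value.f;
    case FOO_LONG:
      return a->value.i == b->value.i;
    case FOO_STRING:
      return strcmp (a->value.p, b->value.p) == 0;
    default:
      return 0;
    }
}
```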
 611
 612 @comment string.h
 613 @comment ISO
 614 @deftypefun int strcmp (const char *@var{s1}, const char *@var{s2})
 615 The @code{strcmp} function compares the string @var{s1} against
 616 @var{s2}, returning a value that has the same sign as the difference
 617 between the first differing pair of characters (interpreted as
 618 @code{unsigned char} objects, then promoted to @code{int}).
 619
 620 If the two strings are equal, @code{strcmp} returns @code{0}.
 621
 622 A consequence of the ordering used by @code{strcmp} is that if @var{s1}
 623 is an initial substring of @var{s2}, then @var{s1} is considered to be
 624 ``less than'' @var{s2}.
 625 @end deftypefun
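
Because the sign of its result gives an ordering, @code{strcmp} can serve as
the basis of a sorting function.  The sketch below sorts an array of string
pointers with @code{qsort}; the names @code{compare_strings} and
@code{sort_strings} are made up for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison function for qsort.  The array elements are char *
   pointers, so each argument is really a pointer to a char *.  */
static int
compare_strings (const void *a, const void *b)
{
  const char *const *sa = a;
  const char *const *sb = b;

  return strcmp (*sa, *sb);
}

/* Hypothetical helper: sort an array of COUNT strings in strcmp order.  */
void
sort_strings (const char **strings, size_t count)
{
  qsort (strings, count, sizeof *strings, compare_strings);
}
```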
 626
 627 @comment string.h
 628 @comment BSD
 629 @deftypefun int strcasecmp (const char *@var{s1}, const char *@var{s2})
 630 This function is like @code{strcmp}, except that differences in case are
 631 ignored.  How uppercase and lowercase characters are related is
 632 determined by the currently selected locale.  In the standard @code{"C"}
 633 locale the characters @"A and @"a do not match but in a locale which
 634 regards these characters as parts of the alphabet they do match.
 635
 636 @noindent
 637 @code{strcasecmp} is derived from BSD.
 638 @end deftypefun
 639
 640 @comment string.h
 641 @comment BSD
 642 @deftypefun int strncasecmp (const char *@var{s1}, const char *@var{s2}, size_t @var{n})
 643 This function is like @code{strncmp}, except that differences in case
 644 are ignored.  Like @code{strcasecmp}, it is locale dependent how
 645 uppercase and lowercase characters are related.
 646
 647 @noindent
 648 @code{strncasecmp} is derived from BSD.
 649 @end deftypefun
 650
 651 @comment string.h
 652 @comment ISO
 653 @deftypefun int strncmp (const char *@var{s1}, const char *@var{s2}, size_t @var{size})
 654 This function is similar to @code{strcmp}, except that no more than
 655 @var{size} characters are compared.  In other words, if the two strings are
 656 the same in their first @var{size} characters, the return value is zero.
 657 @end deftypefun
 658
 659 Here are some examples showing the use of @code{strcmp} and @code{strncmp}.
 660 These examples assume the use of the ASCII character set.  (If some
 661 other character set---say, EBCDIC---is used instead, then the glyphs
 662 are associated with different numeric codes, and the return values
 663 and ordering may differ.)
 664
 665 @smallexample
 666 strcmp ("hello", "hello")
 667     @result{} 0    /* @r{These two strings are the same.} */
 668 strcmp ("hello", "Hello")
 669     @result{} 32   /* @r{Comparisons are case-sensitive.} */
 670 strcmp ("hello", "world")
 671     @result{} -15  /* @r{The character @code{'h'} comes before @code{'w'}.} */
 672 strcmp ("hello", "hello, world")
 673     @result{} -44  /* @r{Comparing a null character against a comma.} */
 674 strncmp ("hello", "hello, world", 5)
 675     @result{} 0    /* @r{The initial 5 characters are the same.} */
 676 strncmp ("hello, world", "hello, stupid world!!!", 5)
 677     @result{} 0    /* @r{The initial 5 characters are the same.} */
 678 @end smallexample
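
One frequent use of @code{strncmp} is testing whether a string begins with
a given prefix.  A minimal sketch, with @code{has_prefix} as a hypothetical
helper name:

```c
#include <string.h>

/* Hypothetical helper: return nonzero if the string S starts with
   PREFIX.  Only the first strlen (PREFIX) characters are compared.  */
int
has_prefix (const char *s, const char *prefix)
{
  return strncmp (s, prefix, strlen (prefix)) == 0;
}
```

If @var{s} is shorter than the prefix, the comparison stops at the null
character of @var{s} and reports a mismatch.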
 679
 680 @comment string.h
 681 @comment GNU
 682 @deftypefun int strverscmp (const char *@var{s1}, const char *@var{s2})
 683 The @code{strverscmp} function compares the string @var{s1} against
 684 @var{s2}, considering them as holding indices/version numbers.  The return
 685 value follows the same conventions as found in the @code{strcmp}
 686 function.  In fact, if @var{s1} and @var{s2} contain no digits,
 687 @code{strverscmp} behaves like @code{strcmp}.
 688
 689 Basically, we compare strings normally (character by character), until
 690 we find a digit in each string---then we enter a special comparison
 691 mode, where each sequence of digits is taken as a whole.  If we reach the
 692 end of these two parts without noticing a difference, we return to the
 693 standard comparison mode.  There are two types of numeric parts:
 694 ``integral'' and ``fractional'' (those that begin with a @samp{0}).  The types
 695 of the numeric parts affect the way we sort them:
 696
 697 @itemize @bullet
 698 @item
 699 integral/integral: we compare values as you would expect.
 700
 701 @item
 702 fractional/integral: the fractional part is less than the integral one.
 703 Again, no surprise.
 704
 705 @item
 706 fractional/fractional: things become a bit more complex.
 707 If the common prefix contains only leading zeroes, the longer part is less
 708 than the other one; else the comparison behaves normally.
 709 @end itemize
 710
 711 @smallexample
 712 strverscmp ("no digit", "no digit")
 713     @result{} 0    /* @r{same behavior as strcmp.} */
 714 strverscmp ("item#99", "item#100")
 715     @result{} <0   /* @r{same prefix, but 99 < 100.} */
 716 strverscmp ("alpha1", "alpha001")
 717     @result{} >0   /* @r{fractional part inferior to integral one.} */
 718 strverscmp ("part1_f012", "part1_f01")
 719     @result{} >0   /* @r{two fractional parts.} */
 720 strverscmp ("foo.009", "foo.0")
 721     @result{} <0   /* @r{idem, but with leading zeroes only.} */
 722 @end smallexample
 723
 724 This function is especially useful when dealing with filename sorting,
 725 because filenames frequently hold indices/version numbers.
 726
 727 @code{strverscmp} is a GNU extension.
 728 @end deftypefun
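
The version ordering described above is often combined with
@code{qsort} (@pxref{Array Sort Function}).  The following is a minimal
sketch rather than part of the library: the names
@code{compare_versions} and @code{sort_by_version} are hypothetical, and
the code assumes that @code{_GNU_SOURCE} is defined before any header is
included so that @code{strverscmp} is declared.

@verbatim
```c
#define _GNU_SOURCE 1           /* strverscmp is a GNU extension.  */
#include <stdlib.h>
#include <string.h>

/* Hypothetical qsort comparison function: order two strings the way
   strverscmp does.  Each argument points to an array element, that
   is, to a pointer to a string.  */
static int
compare_versions (const void *v1, const void *v2)
{
  const char *const *p1 = v1;
  const char *const *p2 = v2;

  return strverscmp (*p1, *p2);
}

/* Hypothetical entry point: sort the COUNT strings in NAMES into
   version order.  */
void
sort_by_version (const char **names, size_t count)
{
  qsort (names, count, sizeof (const char *), compare_versions);
}
```
@end verbatim

With this sketch, an array holding @code{"log.10"}, @code{"log.9"} and
@code{"log.1"} ends up ordered as @code{"log.1"}, @code{"log.9"},
@code{"log.10"}, whereas @code{strcmp} would place @code{"log.10"}
before @code{"log.9"}.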
 729
 730 @comment string.h
 731 @comment BSD
 732 @deftypefun int bcmp (const void *@var{a1}, const void *@var{a2}, size_t @var{size})
 733 This is an obsolete alias for @code{memcmp}, derived from BSD.
 734 @end deftypefun
 735
 736 @node Collation Functions
 737 @section Collation Functions
 738
 739 @cindex collating strings
 740 @cindex string collation functions
 741
 742 In some locales, the conventions for lexicographic ordering differ from
 743 the strict numeric ordering of character codes.  For example, in Spanish
 744 most glyphs with diacritical marks such as accents are not considered
 745 distinct letters for the purposes of collation.  On the other hand, the
 746 two-character sequence @samp{ll} is treated as a single letter that is
 747 collated immediately after @samp{l}.
 748
 749 You can use the functions @code{strcoll} and @code{strxfrm} (declared in
 750 the header file @file{string.h}) to compare strings using a collation
 751 ordering appropriate for the current locale.  The locale used by these
 752 functions in particular can be specified by setting the locale for the
 753 @code{LC_COLLATE} category; see @ref{Locales}.
 754 @pindex string.h
 755
 756 In the standard C locale, the collation sequence for @code{strcoll} is
 757 the same as that for @code{strcmp}.
 758
 759 Effectively, the way these functions work is by applying a mapping to
 760 transform the characters in a string to a byte sequence that represents
 761 the string's position in the collating sequence of the current locale.
 762 Comparing two such byte sequences in a simple fashion is equivalent to
 763 comparing the strings with the locale's collating sequence.
 764
 765 The function @code{strcoll} performs this translation implicitly, in
 766 order to do one comparison.  By contrast, @code{strxfrm} performs the
 767 mapping explicitly.  If you are making multiple comparisons using the
 768 same string or set of strings, it is likely to be more efficient to use
 769 @code{strxfrm} to transform all the strings just once, and subsequently
 770 compare the transformed strings with @code{strcmp}.
 771
 772 @comment string.h
 773 @comment ISO
 774 @deftypefun int strcoll (const char *@var{s1}, const char *@var{s2})
 775 The @code{strcoll} function is similar to @code{strcmp} but uses the
 776 collating sequence of the current locale for collation (the
 777 @code{LC_COLLATE} locale).
 778 @end deftypefun
 779
 780 Here is an example of sorting an array of strings, using @code{strcoll}
 781 to compare them.  The actual sort algorithm is not written here; it
 782 comes from @code{qsort} (@pxref{Array Sort Function}).  The job of the
 783 code shown here is to say how to compare the strings while sorting them.
 784 (Later on in this section, we will show a way to do this more
 785 efficiently using @code{strxfrm}.)
 786
 787 @smallexample
 788 /* @r{This is the comparison function used with @code{qsort}.} */
 789
int
compare_elements (const void *v1, const void *v2)
@{
  char * const *p1 = v1;
  char * const *p2 = v2;

  return strcoll (*p1, *p2);
@}
 795
 796 /* @r{This is the entry point---the function to sort}
 797    @r{strings using the locale's collating sequence.} */
 798
 799 void
 800 sort_strings (char **array, int nstrings)
 801 @{
  /* @r{Sort @code{array} by comparing the strings.} */
  qsort (array, nstrings,
         sizeof (char *), compare_elements);
 805 @}
 806 @end smallexample
 807
 808 @cindex converting string to collation order
 809 @comment string.h
 810 @comment ISO
 811 @deftypefun size_t strxfrm (char *@var{to}, const char *@var{from}, size_t @var{size})
The function @code{strxfrm} transforms the string @var{from} using the
collation transformation determined by the locale currently selected for
collation, and stores the transformed string in the array @var{to}.  Up
to @var{size} characters (including a terminating null character) are
stored.
 817
 818 The behavior is undefined if the strings @var{to} and @var{from}
 819 overlap; see @ref{Copying and Concatenation}.
 820
The return value is the length of the entire transformed string.  This
value is not affected by the value of @var{size}, but if it is greater
than or equal to @var{size}, it means that the transformed string did not
entirely fit in the array @var{to}.  In this case, only as much of the
 825 string as actually fits was stored.  To get the whole transformed
 826 string, call @code{strxfrm} again with a bigger output array.
 827
 828 The transformed string may be longer than the original string, and it
 829 may also be shorter.
 830
 831 If @var{size} is zero, no characters are stored in @var{to}.  In this
 832 case, @code{strxfrm} simply returns the number of characters that would
 833 be the length of the transformed string.  This is useful for determining
 834 what size string to allocate.  It does not matter what @var{to} is if
 835 @var{size} is zero; @var{to} may even be a null pointer.
 836 @end deftypefun
 837
 838 Here is an example of how you can use @code{strxfrm} when
 839 you plan to do many comparisons.  It does the same thing as the previous
 840 example, but much faster, because it has to transform each string only
 841 once, no matter how many times it is compared with other strings.  Even
 842 the time needed to allocate and free storage is much less than the time
 843 we save, when there are many strings.
 844
 845 @smallexample
 846 struct sorter @{ char *input; char *transformed; @};
 847
 848 /* @r{This is the comparison function used with @code{qsort}}
 849    @r{to sort an array of @code{struct sorter}.} */
 850
int
compare_elements (const void *v1, const void *v2)
@{
  const struct sorter *p1 = v1;
  const struct sorter *p2 = v2;

  return strcmp (p1->transformed, p2->transformed);
@}
 856
 857 /* @r{This is the entry point---the function to sort}
 858    @r{strings using the locale's collating sequence.} */
 859
 860 void
 861 sort_strings_fast (char **array, int nstrings)
 862 @{
 863   struct sorter temp_array[nstrings];
 864   int i;
 865
 866   /* @r{Set up @code{temp_array}.  Each element contains}
 867      @r{one input string and its transformed string.} */
 868   for (i = 0; i < nstrings; i++)
 869     @{
 870       size_t length = strlen (array[i]) * 2;
 871       char *transformed;
 872       size_t transformed_length;
 873
 874       temp_array[i].input = array[i];
 875
 876       /* @r{First try a buffer perhaps big enough.}  */
 877       transformed = (char *) xmalloc (length);
 878
 879       /* @r{Transform @code{array[i]}.}  */
 880       transformed_length = strxfrm (transformed, array[i], length);
 881
 882       /* @r{If the buffer was not large enough, resize it}
 883          @r{and try again.}  */
 884       if (transformed_length >= length)
 885         @{
 886           /* @r{Allocate the needed space. +1 for terminating}
 887              @r{@code{NUL} character.}  */
 888           transformed = (char *) xrealloc (transformed,
 889                                            transformed_length + 1);
 890
 891           /* @r{The return value is not interesting because we know}
 892              @r{how long the transformed string is.}  */
 893           (void) strxfrm (transformed, array[i],
 894                           transformed_length + 1);
 895         @}
 896
 897       temp_array[i].transformed = transformed;
 898     @}
 899
 900   /* @r{Sort @code{temp_array} by comparing transformed strings.} */
  qsort (temp_array, nstrings,
         sizeof (struct sorter), compare_elements);
 903
 904   /* @r{Put the elements back in the permanent array}
 905      @r{in their sorted order.} */
 906   for (i = 0; i < nstrings; i++)
 907     array[i] = temp_array[i].input;
 908
 909   /* @r{Free the strings we allocated.} */
 910   for (i = 0; i < nstrings; i++)
 911     free (temp_array[i].transformed);
 912 @}
 913 @end smallexample
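
When the length of the transformed string is wanted before any storage
is allocated, the zero-@var{size} call described for @code{strxfrm} can
be used to size the buffer exactly.  The following is only a sketch;
the helper name @code{make_collation_key} is hypothetical, and plain
@code{malloc} is used so that the fragment does not depend on any
wrapper function.

@verbatim
```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated collation key for
   STRING in the current locale, or a null pointer if memory is
   exhausted.  The caller frees the result.  */
char *
make_collation_key (const char *string)
{
  /* With a size of zero nothing is stored; the return value is the
     length the transformed string would have.  Add one for the
     terminating null character.  */
  size_t needed = strxfrm (NULL, string, 0) + 1;
  char *key = malloc (needed);

  if (key != NULL)
    strxfrm (key, string, needed);
  return key;
}
```
@end verbatim

Two keys built this way can be compared with @code{strcmp}, and the
result has the same sign that @code{strcoll} would give for the
original strings.  The cost is that each string is transformed twice.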
 914
 915 @strong{Compatibility Note:}  The string collation functions are a new
 916 feature of @w{ISO C 89}.  Older C dialects have no equivalent feature.
 917
 918 @node Search Functions
 919 @section Search Functions
 920
 921 This section describes library functions which perform various kinds
 922 of searching operations on strings and arrays.  These functions are
 923 declared in the header file @file{string.h}.
 924 @pindex string.h
 925 @cindex search functions (for strings)
 926 @cindex string search functions
 927
 928 @comment string.h
 929 @comment ISO
 930 @deftypefun {void *} memchr (const void *@var{block}, int @var{c}, size_t @var{size})
 931 This function finds the first occurrence of the byte @var{c} (converted
 932 to an @code{unsigned char}) in the initial @var{size} bytes of the
 933 object beginning at @var{block}.  The return value is a pointer to the
 934 located byte, or a null pointer if no match was found.
 935 @end deftypefun
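
Because @code{memchr} takes an explicit size, it can search data that
is not null-terminated.  As a small sketch (the function name
@code{first_line_length} is hypothetical), the length of the first line
in a buffer can be found like this:

@verbatim
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the length of the first line in the
   SIZE-byte buffer BLOCK, not counting the newline.  If the buffer
   holds no newline, the whole buffer counts as one line.  */
size_t
first_line_length (const char *block, size_t size)
{
  const char *newline = memchr (block, '\n', size);

  return newline != NULL ? (size_t) (newline - block) : size;
}
```
@end verbatim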
 936
 937 @comment string.h
 938 @comment ISO
 939 @deftypefun {char *} strchr (const char *@var{string}, int @var{c})
 940 The @code{strchr} function finds the first occurrence of the character
 941 @var{c} (converted to a @code{char}) in the null-terminated string
 942 beginning at @var{string}.  The return value is a pointer to the located
 943 character, or a null pointer if no match was found.
 944
 945 For example,
 946 @smallexample
 947 strchr ("hello, world", 'l')
 948     @result{} "llo, world"
 949 strchr ("hello, world", '?')
 950     @result{} NULL
 951 @end smallexample
 952
The terminating null character is considered to be part of the string,
so you can use this function to get a pointer to the end of a string by
specifying a null character as the value of the @var{c} argument.
 956 @end deftypefun
 957
 958 @comment string.h
 959 @comment BSD
 960 @deftypefun {char *} index (const char *@var{string}, int @var{c})
 961 @code{index} is another name for @code{strchr}; they are exactly the same.
 962 New code should always use @code{strchr} since this name is defined in
 963 @w{ISO C} while @code{index} is a BSD invention which never was available
 964 on @w{System V} derived systems.
 965 @end deftypefun
 966
 967 @comment string.h
 968 @comment ISO
 969 @deftypefun {char *} strrchr (const char *@var{string}, int @var{c})
 970 The function @code{strrchr} is like @code{strchr}, except that it searches
 971 backwards from the end of the string @var{string} (instead of forwards
 972 from the front).
 973
 974 For example,
 975 @smallexample
 976 strrchr ("hello, world", 'l')
 977     @result{} "ld"
 978 @end smallexample
 979 @end deftypefun
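
A typical use of @code{strrchr} is to pick out the last component of a
file name.  The sketch below uses a hypothetical function name,
@code{last_component}, and assumes @samp{/} is the directory separator.

@verbatim
```c
#include <string.h>

/* Hypothetical helper: return a pointer to the part of PATH that
   follows the last '/', or PATH itself if it contains no slash.  */
const char *
last_component (const char *path)
{
  const char *slash = strrchr (path, '/');

  return slash != NULL ? slash + 1 : path;
}
```
@end verbatim

The returned pointer refers into @var{path}; nothing is copied.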
 980
 981 @comment string.h
 982 @comment BSD
 983 @deftypefun {char *} rindex (const char *@var{string}, int @var{c})
 984 @code{rindex} is another name for @code{strrchr}; they are exactly the same.
 985 New code should always use @code{strrchr} since this name is defined in
 986 @w{ISO C} while @code{rindex} is a BSD invention which never was available
 987 on @w{System V} derived systems.
 988 @end deftypefun
 989
 990 @comment string.h
 991 @comment ISO
 992 @deftypefun {char *} strstr (const char *@var{haystack}, const char *@var{needle})
 993 This is like @code{strchr}, except that it searches @var{haystack} for a
 994 substring @var{needle} rather than just a single character.  It
 995 returns a pointer into the string @var{haystack} that is the first
 996 character of the substring, or a null pointer if no match was found.  If
 997 @var{needle} is an empty string, the function returns @var{haystack}.
 998
 999 For example,
1000 @smallexample
1001 strstr ("hello, world", "l")
1002     @result{} "llo, world"
1003 strstr ("hello, world", "wo")
1004     @result{} "world"
1005 @end smallexample
1006 @end deftypefun
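
Since @code{strstr} returns a pointer into @var{haystack}, the search
can be resumed just past a match.  The following sketch counts
non-overlapping occurrences; @code{count_occurrences} is a hypothetical
name, and treating an empty @var{needle} as matching zero times is a
choice made for this example only.

@verbatim
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the non-overlapping occurrences of
   NEEDLE in HAYSTACK.  */
size_t
count_occurrences (const char *haystack, const char *needle)
{
  size_t needle_len = strlen (needle);
  size_t count = 0;
  const char *p;

  /* strstr matches an empty needle at every position, so report
     zero here rather than loop forever.  */
  if (needle_len == 0)
    return 0;

  for (p = strstr (haystack, needle); p != NULL;
       p = strstr (p + needle_len, needle))
    count++;

  return count;
}
```
@end verbatim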
1007
1008
1009 @comment string.h
1010 @comment GNU
1011 @deftypefun {void *} memmem (const void *@var{haystack}, size_t @var{haystack-len},@*const void *@var{needle}, size_t @var{needle-len})
1012 This is like @code{strstr}, but @var{needle} and @var{haystack} are byte
1013 arrays rather than null-terminated strings.  @var{needle-len} is the
1014 length of @var{needle} and @var{haystack-len} is the length of
1015 @var{haystack}.@refill
1016
1017 This function is a GNU extension.
1018 @end deftypefun
1019
1020 @comment string.h
1021 @comment ISO
1022 @deftypefun size_t strspn (const char *@var{string}, const char *@var{skipset})
1023 The @code{strspn} (``string span'') function returns the length of the
1024 initial substring of @var{string} that consists entirely of characters that
1025 are members of the set specified by the string @var{skipset}.  The order
1026 of the characters in @var{skipset} is not important.
1027
1028 For example,
1029 @smallexample
1030 strspn ("hello, world", "abcdefghijklmnopqrstuvwxyz")
1031     @result{} 5
1032 @end smallexample
1033 @end deftypefun
1034
1035 @comment string.h
1036 @comment ISO
1037 @deftypefun size_t strcspn (const char *@var{string}, const char *@var{stopset})
1038 The @code{strcspn} (``string complement span'') function returns the length
1039 of the initial substring of @var{string} that consists entirely of characters
1040 that are @emph{not} members of the set specified by the string @var{stopset}.
1041 (In other words, it returns the offset of the first character in @var{string}
1042 that is a member of the set @var{stopset}.)
1043
1044 For example,
1045 @smallexample
1046 strcspn ("hello, world", " \t\n,.;!?")
1047     @result{} 5
1048 @end smallexample
1049 @end deftypefun
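
The two span functions complement each other: @code{strspn} can skip
leading separators and @code{strcspn} can then measure the word that
follows.  This is a sketch only; the name @code{first_word} and the
choice of blank characters are assumptions of the example.

@verbatim
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: locate the first blank-delimited word in
   STRING.  The return value points to the start of the word and
   *LENGTH receives its length, which is zero if there is no word.  */
const char *
first_word (const char *string, size_t *length)
{
  static const char blanks[] = " \t\n";
  const char *start = string + strspn (string, blanks);

  *length = strcspn (start, blanks);
  return start;
}
```
@end verbatim

Unlike @code{strtok}, this approach does not modify @var{string}.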
1050
1051 @comment string.h
1052 @comment ISO
1053 @deftypefun {char *} strpbrk (const char *@var{string}, const char *@var{stopset})
1054 The @code{strpbrk} (``string pointer break'') function is related to
1055 @code{strcspn}, except that it returns a pointer to the first character
1056 in @var{string} that is a member of the set @var{stopset} instead of the
1057 length of the initial substring.  It returns a null pointer if no such
1058 character from @var{stopset} is found.
1059
1060 @c @group  Invalid outside the example.
1061 For example,
1062
1063 @smallexample
1064 strpbrk ("hello, world", " \t\n,.;!?")
1065     @result{} ", world"
1066 @end smallexample
1067 @c @end group
1068 @end deftypefun
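
Because @code{strpbrk} returns a pointer rather than a length, it is
convenient when the located character is to be changed in place.  In
the following sketch the name @code{replace_chars} is hypothetical.

@verbatim
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: overwrite every character of STRING that is a
   member of STOPSET with REPLACEMENT, and return how many characters
   were overwritten.  */
size_t
replace_chars (char *string, const char *stopset, char replacement)
{
  size_t count = 0;
  char *p;

  /* strpbrk never returns a pointer to the terminating null
     character, so P + 1 is always inside the string.  */
  for (p = strpbrk (string, stopset); p != NULL;
       p = strpbrk (p + 1, stopset))
    {
      *p = replacement;
      count++;
    }

  return count;
}
```
@end verbatim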
1069
1070 @node Finding Tokens in a String
1071 @section Finding Tokens in a String
1072
1073 @cindex tokenizing strings
1074 @cindex breaking a string into tokens
1075 @cindex parsing tokens from a string
1076 It's fairly common for programs to have a need to do some simple kinds
1077 of lexical analysis and parsing, such as splitting a command string up
1078 into tokens.  You can do this with the @code{strtok} function, declared
1079 in the header file @file{string.h}.
1080 @pindex string.h
1081
1082 @comment string.h
1083 @comment ISO
1084 @deftypefun {char *} strtok (char *@var{newstring}, const char *@var{delimiters})
1085 A string can be split into tokens by making a series of calls to the
1086 function @code{strtok}.
1087
1088 The string to be split up is passed as the @var{newstring} argument on
1089 the first call only.  The @code{strtok} function uses this to set up
1090 some internal state information.  Subsequent calls to get additional
1091 tokens from the same string are indicated by passing a null pointer as
1092 the @var{newstring} argument.  Calling @code{strtok} with another
1093 non-null @var{newstring} argument reinitializes the state information.
1094 It is guaranteed that no other library function ever calls @code{strtok}
1095 behind your back (which would mess up this internal state information).
1096
1097 The @var{delimiters} argument is a string that specifies a set of delimiters
1098 that may surround the token being extracted.  All the initial characters
1099 that are members of this set are discarded.  The first character that is
1100 @emph{not} a member of this set of delimiters marks the beginning of the
1101 next token.  The end of the token is found by looking for the next
1102 character that is a member of the delimiter set.  This character in the
1103 original string @var{newstring} is overwritten by a null character, and the
1104 pointer to the beginning of the token in @var{newstring} is returned.
1105
1106 On the next call to @code{strtok}, the searching begins at the next
1107 character beyond the one that marked the end of the previous token.
Note that the set of delimiters @var{delimiters} does not have to be the
same on every call in a series of calls to @code{strtok}.

If the end of the string @var{newstring} is reached, or if the remainder of
the string consists only of delimiter characters, @code{strtok} returns
1113 a null pointer.
1114 @end deftypefun
1115
1116 @strong{Warning:} Since @code{strtok} alters the string it is parsing,
1117 you should always copy the string to a temporary buffer before parsing
1118 it with @code{strtok}.  If you allow @code{strtok} to modify a string
1119 that came from another part of your program, you are asking for trouble;
1120 that string might be used for other purposes after @code{strtok} has
1121 modified it, and it would not have the expected value.
1122
1123 The string that you are operating on might even be a constant.  Then
1124 when @code{strtok} tries to modify it, your program will get a fatal
1125 signal for writing in read-only memory.  @xref{Program Error Signals}.
1126
1127 This is a special case of a general principle: if a part of a program
1128 does not have as its purpose the modification of a certain data
1129 structure, then it is error-prone to modify the data structure
1130 temporarily.
1131
1132 The function @code{strtok} is not reentrant.  @xref{Nonreentrancy}, for
1133 a discussion of where and why reentrancy is important.
1134
1135 Here is a simple example showing the use of @code{strtok}.
1136
1137 @comment Yes, this example has been tested.
1138 @smallexample
1139 #include <string.h>
1140 #include <stddef.h>
1141
1142 @dots{}
1143
1144 const char string[] = "words separated by spaces -- and, punctuation!";
1145 const char delimiters[] = " .,;:!-";
1146 char *token, *cp;
1147
1148 @dots{}
1149
1150 cp = strdupa (string);                /* Make writable copy.  */
1151 token = strtok (cp, delimiters);      /* token => "words" */
1152 token = strtok (NULL, delimiters);    /* token => "separated" */
1153 token = strtok (NULL, delimiters);    /* token => "by" */
1154 token = strtok (NULL, delimiters);    /* token => "spaces" */
1155 token = strtok (NULL, delimiters);    /* token => "and" */
1156 token = strtok (NULL, delimiters);    /* token => "punctuation" */
1157 token = strtok (NULL, delimiters);    /* token => NULL */
1158 @end smallexample
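
In practice the calls are usually made in a loop that ends when
@code{strtok} returns a null pointer.  The following sketch counts the
tokens in a string; @code{count_tokens} is a hypothetical name, and the
copy is made with @code{malloc} and @code{memcpy} so that the fragment
relies only on @w{ISO C} functions.

@verbatim
```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the number of tokens in STRING that
   are separated by characters from DELIMITERS, or -1 if memory is
   exhausted.  STRING itself is left untouched.  */
long
count_tokens (const char *string, const char *delimiters)
{
  size_t size = strlen (string) + 1;
  char *copy = malloc (size);
  char *token;
  long count = 0;

  if (copy == NULL)
    return -1;

  /* strtok writes null characters into its argument, so work on a
     private copy.  */
  memcpy (copy, string, size);

  for (token = strtok (copy, delimiters); token != NULL;
       token = strtok (NULL, delimiters))
    count++;

  free (copy);
  return count;
}
```
@end verbatim

For the string and delimiters used in the example above, this function
returns 6.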
1159
1160 The GNU C library contains two more functions for tokenizing a string
1161 which overcome the limitation of non-reentrancy.
1162
1163 @comment string.h
1164 @comment POSIX
1165 @deftypefun {char *} strtok_r (char *@var{newstring}, const char *@var{delimiters}, char **@var{save_ptr})
1166 Just like @code{strtok}, this function splits the string into several
1167 tokens which can be accessed by successive calls to @code{strtok_r}.
1168 The difference is that the information about the next token is stored in
1169 the space pointed to by the third argument, @var{save_ptr}, which is a
1170 pointer to a string pointer.  Calling @code{strtok_r} with a null
1171 pointer for @var{newstring} and leaving @var{save_ptr} between the calls
1172 unchanged does the job without hindering reentrancy.
1173
This function is defined in POSIX.1 and can be found on many systems
which support multi-threading.
1176 @end deftypefun
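
Keeping the state in a caller-supplied pointer also makes it possible
to tokenize two strings at the same time, which @code{strtok} cannot
do.  The following sketch splits a text into records and each record
into fields; the name @code{count_fields} and the choice of @samp{;}
and @samp{,} as separators are assumptions of the example.

@verbatim
```c
#define _POSIX_C_SOURCE 200809L /* For strtok_r in strict modes.  */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: TEXT holds records separated by ';', each
   made of fields separated by ','.  Return the total number of
   fields.  TEXT is modified in the process.  */
size_t
count_fields (char *text)
{
  char *record_state, *field_state;
  char *record, *field;
  size_t count = 0;

  /* The outer and inner loops each keep their own state pointer, so
     the inner calls do not disturb the outer scan.  */
  for (record = strtok_r (text, ";", &record_state); record != NULL;
       record = strtok_r (NULL, ";", &record_state))
    for (field = strtok_r (record, ",", &field_state); field != NULL;
         field = strtok_r (NULL, ",", &field_state))
      count++;

  return count;
}
```
@end verbatim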
1177
1178 @comment string.h
1179 @comment BSD
1180 @deftypefun {char *} strsep (char **@var{string_ptr}, const char *@var{delimiter})
1181 This function is just @code{strtok_r} with the @var{newstring} argument
1182 replaced by the @var{save_ptr} argument.  The initialization of the
1183 moving pointer has to be done by the user.  Successive calls to
1184 @code{strsep} move the pointer along the tokens separated by
1185 @var{delimiter}, returning the address of the next token and updating
1186 @var{string_ptr} to point to the beginning of the next token.
1187
If the input string contains more than one character from
@var{delimiter} in a row, @code{strsep} returns an empty string for each
pair of adjacent characters from @var{delimiter}.  This means that a
program normally should test for @code{strsep} returning an empty string
before processing it.
1193
1194 This function was introduced in 4.3BSD and therefore is widely available.
1195 @end deftypefun
1196
Here is how the above example looks when @code{strsep} is used.
1198
1199 @comment Yes, this example has been tested.
1200 @smallexample
1201 #include <string.h>
1202 #include <stddef.h>
1203
1204 @dots{}
1205
1206 const char string[] = "words separated by spaces -- and, punctuation!";
1207 const char delimiters[] = " .,;:!-";
1208 char *running;
1209 char *token;
1210
1211 @dots{}
1212
1213 running = strdupa (string);
1214 token = strsep (&running, delimiters);    /* token => "words" */
1215 token = strsep (&running, delimiters);    /* token => "separated" */
1216 token = strsep (&running, delimiters);    /* token => "by" */
1217 token = strsep (&running, delimiters);    /* token => "spaces" */
1218 token = strsep (&running, delimiters);    /* token => "" */
1219 token = strsep (&running, delimiters);    /* token => "" */
1220 token = strsep (&running, delimiters);    /* token => "" */
1221 token = strsep (&running, delimiters);    /* token => "and" */
1222 token = strsep (&running, delimiters);    /* token => "" */
1223 token = strsep (&running, delimiters);    /* token => "punctuation" */
1224 token = strsep (&running, delimiters);    /* token => "" */
1225 token = strsep (&running, delimiters);    /* token => NULL */
1226 @end smallexample
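
A program that does not want the empty tokens can test for them in the
loop, as suggested in the description of @code{strsep}.  This is a
sketch under that assumption; the name @code{split_nonempty} and its
interface are hypothetical.

@verbatim
```c
#define _DEFAULT_SOURCE 1       /* strsep comes from BSD.  */
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: store pointers to at most MAX nonempty tokens
   of STRING in TOKENS and return how many were stored.  STRING is
   modified and the stored pointers refer into it.  */
size_t
split_nonempty (char *string, const char *delimiters,
                char **tokens, size_t max)
{
  char *running = string;
  char *token;
  size_t count = 0;

  while (count < max
         && (token = strsep (&running, delimiters)) != NULL)
    if (*token != '\0')     /* Skip tokens between adjacent delimiters.  */
      tokens[count++] = token;

  return count;
}
```
@end verbatim

Applied to the string from the example above, this yields the same six
tokens that @code{strtok} returns.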
1227
1228 @node Encode Binary Data
1229 @section Encode Binary Data
1230
To store or transfer binary data in environments which only support text,
one has to encode the binary data by mapping the input bytes to
characters in the range allowed for storing or transferring.  SVID
systems (and nowadays XPG compliant systems) provide minimal support for
this task.
1236
1237 @comment stdlib.h
1238 @comment XPG
1239 @deftypefun {char *} l64a (long int @var{n})
1240 This function encodes a 32-bit input value using characters from the
1241 basic character set.  It returns a pointer to a 6 character buffer which
1242 contains an encoded version of @var{n}.  To encode a series of bytes the
1243 user must copy the returned string to a destination buffer.  It returns
1244 the empty string if @var{n} is zero, which is somewhat bizarre but
1245 mandated by the standard.@*
1246 @strong{Warning:} Since a static buffer is used this function should not
1247 be used in multi-threaded programs.  There is no thread-safe alternative
1248 to this function in the C library.@*
1249 @strong{Compatibility Note:} The XPG standard states that the return
1250 value of @code{l64a} is undefined if @var{n} is negative.  In the GNU
1251 implementation, @code{l64a} treats its argument as unsigned, so it will
1252 return a sensible encoding for any nonzero @var{n}; however, portable
1253 programs should not rely on this.
1254
1255 To encode a large buffer @code{l64a} must be called in a loop, once for
1256 each 32-bit word of the buffer.  For example, one could do something
1257 like this:
1258
1259 @smallexample
1260 char *
1261 encode (const void *buf, size_t len)
1262 @{
1263   /* @r{We know in advance how long the buffer has to be.} */
1264   unsigned char *in = (unsigned char *) buf;
1265   char *out = malloc (6 + ((len + 3) / 4) * 6 + 1);
1266   char *cp = out;
1267
1268   /* @r{Encode the length.} */
1269   /* @r{Using `htonl' is necessary so that the data can be}
1270      @r{decoded even on machines with different byte order.} */
1271
1272   cp = mempcpy (cp, l64a (htonl (len)), 6);
1273
1274   while (len > 3)
1275     @{
1276       unsigned long int n = *in++;
1277       n = (n << 8) | *in++;
1278       n = (n << 8) | *in++;
1279       n = (n << 8) | *in++;
1280       len -= 4;
1281       if (n)
1282         cp = mempcpy (cp, l64a (htonl (n)), 6);
1283       else
        /* @r{`l64a' returns the empty string for n==0, so we }
           @r{must generate its encoding (}"......"@r{) by hand.} */
1286         cp = stpcpy (cp, "......");
1287     @}
1288   if (len > 0)
1289     @{
1290       unsigned long int n = *in++;
1291       if (--len > 0)
1292         @{
1293           n = (n << 8) | *in++;
1294           if (--len > 0)
1295             n = (n << 8) | *in;
1296         @}
1297       memcpy (cp, l64a (htonl (n)), 6);
1298       cp += 6;
1299     @}
1300   *cp = '\0';
1301   return out;
1302 @}
1303 @end smallexample
1304
1305 It is strange that the library does not provide the complete
1306 functionality needed but so be it.
1307
1308 @end deftypefun
1309
1310 To decode data produced with @code{l64a} the following function should be
1311 used.
1312
1313 @comment stdlib.h
1314 @comment XPG
1315 @deftypefun {long int} a64l (const char *@var{string})
1316 The parameter @var{string} should contain a string which was produced by
a call to @code{l64a}.  The function processes at most 6 characters of
1318 this string, and decodes the characters it finds according to the table
1319 below.  It stops decoding when it finds a character not in the table,
1320 rather like @code{atoi}; if you have a buffer which has been broken into
1321 lines, you must be careful to skip over the end-of-line characters.
1322
1323 The decoded number is returned as a @code{long int} value.
1324 @end deftypefun
1325
1326 The @code{l64a} and @code{a64l} functions use a base 64 encoding, in
1327 which each character of an encoded string represents six bits of an
1328 input word.  These symbols are used for the base 64 digits:
1329
1330 @multitable {xxxxx} {xxx} {xxx} {xxx} {xxx} {xxx} {xxx} {xxx} {xxx}
1331 @item              @tab 0 @tab 1 @tab 2 @tab 3 @tab 4 @tab 5 @tab 6 @tab 7
1332 @item       0      @tab @code{.} @tab @code{/} @tab @code{0} @tab @code{1}
1333                    @tab @code{2} @tab @code{3} @tab @code{4} @tab @code{5}
1334 @item       8      @tab @code{6} @tab @code{7} @tab @code{8} @tab @code{9}
1335                    @tab @code{A} @tab @code{B} @tab @code{C} @tab @code{D}
1336 @item       16     @tab @code{E} @tab @code{F} @tab @code{G} @tab @code{H}
1337                    @tab @code{I} @tab @code{J} @tab @code{K} @tab @code{L}
1338 @item       24     @tab @code{M} @tab @code{N} @tab @code{O} @tab @code{P}
1339                    @tab @code{Q} @tab @code{R} @tab @code{S} @tab @code{T}
1340 @item       32     @tab @code{U} @tab @code{V} @tab @code{W} @tab @code{X}
1341                    @tab @code{Y} @tab @code{Z} @tab @code{a} @tab @code{b}
1342 @item       40     @tab @code{c} @tab @code{d} @tab @code{e} @tab @code{f}
1343                    @tab @code{g} @tab @code{h} @tab @code{i} @tab @code{j}
1344 @item       48     @tab @code{k} @tab @code{l} @tab @code{m} @tab @code{n}
1345                    @tab @code{o} @tab @code{p} @tab @code{q} @tab @code{r}
1346 @item       56     @tab @code{s} @tab @code{t} @tab @code{u} @tab @code{v}
1347                    @tab @code{w} @tab @code{x} @tab @code{y} @tab @code{z}
1348 @end multitable
1349
1350 This encoding scheme is not standard.  There are some other encoding
1351 methods which are much more widely used (UU encoding, MIME encoding).
1352 Generally, it is better to use one of these encodings.
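
If @code{l64a} is used nevertheless, a fixed width of six characters
per word makes the output easier to concatenate and decode.  The sketch
below rests on two assumptions that should be checked against the
system at hand: that @code{a64l} reads the least significant digit
first, so trailing @samp{.} characters (the digit for zero) do not
change the decoded value, and that @var{n} is a nonnegative 32-bit
quantity.  The name @code{encode_word_padded} is hypothetical.

@verbatim
```c
#define _DEFAULT_SOURCE 1       /* For l64a and a64l in strict modes.  */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: encode N with l64a and pad the result with
   '.' to exactly six characters.  OUT must have room for seven
   bytes.  N is assumed to be a nonnegative 32-bit value.  */
void
encode_word_padded (long int n, char out[7])
{
  size_t len;

  /* l64a returns a pointer to a static buffer (or an empty string
     for zero), so copy the result at once.  */
  strcpy (out, l64a (n));
  len = strlen (out);
  memset (out + len, '.', 6 - len);
  out[6] = '\0';
}
```
@end verbatim

With this padding the value zero needs no special case: it is encoded
as six @samp{.} characters, which @code{a64l} decodes back to zero.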
1353
1354 @node Argz and Envz Vectors
1355 @section Argz and Envz Vectors
1356
1357 @cindex argz vectors (string vectors)
1358 @cindex string vectors, null-character separated
1359 @cindex argument vectors, null-character separated
@dfn{Argz vectors} are vectors of strings in a contiguous block of
memory, each element separated from its neighbors by null characters
(@code{'\0'}).
1363
1364 @cindex envz vectors (environment vectors)
1365 @cindex environment vectors, null-character separated
1366 @dfn{Envz vectors} are an extension of argz vectors where each element is a
1367 name-value pair, separated by a @code{'='} character (as in a Unix
1368 environment).
1369
1370 @menu
1371 * Argz Functions::              Operations on argz vectors.
1372 * Envz Functions::              Additional operations on environment vectors.
1373 @end menu
1374
1375 @node Argz Functions, Envz Functions, , Argz and Envz Vectors
1376 @subsection Argz Functions
1377
1378 Each argz vector is represented by a pointer to the first element, of
1379 type @code{char *}, and a size, of type @code{size_t}, both of which can
1380 be initialized to @code{0} to represent an empty argz vector.  All argz
1381 functions accept either a pointer and a size argument, or pointers to
1382 them, if they will be modified.
1383
1384 The argz functions use @code{malloc}/@code{realloc} to allocate/grow
argz vectors, and so any argz vector created using these functions may
1386 be freed by using @code{free}; conversely, any argz function that may
1387 grow a string expects that string to have been allocated using
1388 @code{malloc} (those argz functions that only examine their arguments or
1389 modify them in place will work on any sort of memory).
1390 @xref{Unconstrained Allocation}.
1391
1392 All argz functions that do memory allocation have a return type of
1393 @code{error_t}, and return @code{0} for success, and @code{ENOMEM} if an
1394 allocation error occurs.
1395
1396 @pindex argz.h
1397 These functions are declared in the standard include file @file{argz.h}.
1398
1399 @comment argz.h
1400 @comment GNU
1401 @deftypefun {error_t} argz_create (char *const @var{argv}[], char **@var{argz}, size_t *@var{argz_len})
1402 The @code{argz_create} function converts the Unix-style argument vector
1403 @var{argv} (a vector of pointers to normal C strings, terminated by
1404 @code{(char *)0}; @pxref{Program Arguments}) into an argz vector with
1405 the same elements, which is returned in @var{argz} and @var{argz_len}.
1406 @end deftypefun
1407
1408 @comment argz.h
1409 @comment GNU
1410 @deftypefun {error_t} argz_create_sep (const char *@var{string}, int @var{sep}, char **@var{argz}, size_t *@var{argz_len})
1411 The @code{argz_create_sep} function converts the null-terminated string
1412 @var{string} into an argz vector (returned in @var{argz} and
@var{argz_len}) by splitting it into elements at every occurrence of the
1414 character @var{sep}.
1415 @end deftypefun
1416
1417 @comment argz.h
1418 @comment GNU
@deftypefun {size_t} argz_count (const char *@var{argz}, size_t @var{argz_len})
Returns the number of elements in the argz vector described by
@var{argz} and @var{argz_len}.
1422 @end deftypefun
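
As a brief illustration of how @code{argz_create_sep} and
@code{argz_count} fit together, the following sketch counts the
components of a colon-separated search path.  The function name
@code{count_path_elements} and the use of @code{(size_t) -1} as an
error value are assumptions of the example.

@verbatim
```c
#include <argz.h>
#include <stdlib.h>

/* Hypothetical helper: return the number of elements in the
   colon-separated list PATH, or (size_t) -1 if memory is
   exhausted.  */
size_t
count_path_elements (const char *path)
{
  char *argz;
  size_t argz_len;
  size_t count;

  if (argz_create_sep (path, ':', &argz, &argz_len) != 0)
    return (size_t) -1;

  count = argz_count (argz, argz_len);

  /* The vector was allocated with malloc, so free releases it.  */
  free (argz);
  return count;
}
```
@end verbatim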
1423
1424 @comment argz.h
1425 @comment GNU
1426 @deftypefun {void} argz_extract (char *@var{argz}, size_t @var{argz_len}, char **@var{argv})
1427 The @code{argz_extract} function converts the argz vector @var{argz} and
1428 @var{argz_len} into a Unix-style argument vector stored in @var{argv},
1429 by putting pointers to every element in @var{argz} into successive
1430 positions in @var{argv}, followed by a terminator of @code{0}.
The array @var{argv} must be pre-allocated with enough space to hold all the
1432 elements in @var{argz} plus the terminating @code{(char *)0}
1433 (@code{(argz_count (@var{argz}, @var{argz_len}) + 1) * sizeof (char *)}
1434 bytes should be enough).  Note that the string pointers stored into
1435 @var{argv} point into @var{argz}---they are not copies---and so
1436 @var{argz} must be copied if it will be changed while @var{argv} is
1437 still active.  This function is useful for passing the elements in
1438 @var{argz} to an exec function (@pxref{Executing a File}).
1439 @end deftypefun
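
The allocation rule above can be sketched as follows.  The helper name
@code{make_argv} is hypothetical; it simply sizes the pointer array with
@code{argz_count} before calling @code{argz_extract}.

```c
#include <argz.h>
#include <stdlib.h>

/* Return a newly allocated, null-terminated array of pointers to the
   elements of ARGZ, or NULL if allocation fails.  The pointers refer
   to storage inside ARGZ, so the caller frees only the array itself
   and must leave ARGZ unchanged while the array is in use.
   make_argv is a hypothetical helper, not a library function.  */
char **
make_argv (char *argz, size_t argz_len)
{
  size_t n = argz_count (argz, argz_len);
  char **argv = malloc ((n + 1) * sizeof (char *));

  if (argv != NULL)
    argz_extract (argz, argz_len, argv);
  return argv;
}
```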
1440
1441 @comment argz.h
1442 @comment GNU
1443 @deftypefun {void} argz_stringify (char *@var{argz}, size_t @var{len}, int @var{sep})
1444 The @code{argz_stringify} function converts @var{argz} into a normal string with
1445 the elements separated by the character @var{sep}, by replacing each
1446 @code{'\0'} inside @var{argz} (except the last one, which terminates the
1447 string) with @var{sep}.  This is handy for printing @var{argz} in a
1448 readable manner.
1449 @end deftypefun
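
Because @code{argz_stringify} works in place, a program that still needs
the argz vector afterwards can apply it to a copy.  The following is one
possible sketch under that assumption; @code{join_elements} is a
hypothetical name chosen for illustration.

```c
#include <argz.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated string holding the elements of ARGZ joined
   by SEP, or NULL if allocation fails.  ARGZ itself is left untouched
   because argz_stringify is applied to a private copy.
   join_elements is a hypothetical helper, not a library function.  */
char *
join_elements (const char *argz, size_t argz_len, int sep)
{
  char *copy;

  if (argz_len == 0)
    return calloc (1, 1);       /* An empty vector yields "".  */

  copy = malloc (argz_len);
  if (copy == NULL)
    return NULL;

  memcpy (copy, argz, argz_len);
  argz_stringify (copy, argz_len, sep);
  return copy;
}
```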
1450
1451 @comment argz.h
1452 @comment GNU
1453 @deftypefun {error_t} argz_add (char **@var{argz}, size_t *@var{argz_len}, const char *@var{str})
1454 The @code{argz_add} function adds the string @var{str} to the end of the
1455 argz vector @code{*@var{argz}}, and updates @code{*@var{argz}} and
1456 @code{*@var{argz_len}} accordingly.
1457 @end deftypefun
1458
1459 @comment argz.h
1460 @comment GNU
1461 @deftypefun {error_t} argz_add_sep (char **@var{argz}, size_t *@var{argz_len}, const char *@var{str}, int @var{delim})
1462 The @code{argz_add_sep} function is similar to @code{argz_add}, but
1463 @var{str} is split into separate elements in the result at occurrences of
1464 the character @var{delim}.  This is useful, for instance, for
1465 adding the components of a Unix search path to an argz vector, by using
1466 a value of @code{':'} for @var{delim}.
1467 @end deftypefun
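
The search-path use mentioned above might look like the following
sketch.  The helper name @code{add_directories} and its @var{extra}
parameter are assumptions made for the example; it adds the components
of a colon-separated path and then one further element with
@code{argz_add}.

```c
#include <argz.h>
#include <stdlib.h>

/* Append each component of the colon-separated PATH to the argz
   vector, followed by EXTRA if it is not NULL.  Returns 0 on success
   or the error code from the failing call.
   add_directories is a hypothetical helper, not a library function.  */
error_t
add_directories (char **argz, size_t *argz_len,
                 const char *path, const char *extra)
{
  error_t err = argz_add_sep (argz, argz_len, path, ':');

  if (err == 0 && extra != NULL)
    err = argz_add (argz, argz_len, extra);
  return err;
}
```

Starting from an empty vector (@code{argz} set to @code{0} and
@code{argz_len} set to zero), a call with the path
@code{"/bin:/usr/bin"} and the extra directory @code{"/opt/bin"} should
leave three elements in the vector.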
1468
1469 @comment argz.h
1470 @comment GNU
1471 @deftypefun {error_t} argz_append (char **@var{argz}, size_t *@var{argz_len}, const char *@var{buf}, size_t @var{buf_len})
1472 The @code{argz_append} function appends @var{buf_len} bytes starting at
1473 @var{buf} to the argz vector @code{*@var{argz}}, reallocating
1474 @code{*@var{argz}} to accommodate it, and adding @var{buf_len} to
1475 @code{*@var{argz_len}}.
1476 @end deftypefun
1477
1478 @comment argz.h
1479 @comment GNU
1480 @deftypefun {void} argz_delete (char **@var{argz}, size_t *@var{argz_len}, char *@var{entry})
1481 If @var{entry} points to the beginning of one of the elements in the
1482 argz vector @code{*@var{argz}}, the @code{argz_delete} function will
1483 remove this entry and reallocate @code{*@var{argz}}, modifying
1484 @code{*@var{argz}} and @code{*@var{argz_len}} accordingly.  Note that as
1485 destructive argz functions usually reallocate their argz argument,
1486 pointers into argz vectors such as @var{entry} will then become invalid.
1487 @end deftypefun
1488
1489 @comment argz.h
1490 @comment GNU
1491 @deftypefun {error_t} argz_insert (char **@var{argz}, size_t *@var{argz_len}, char *@var{before}, const char *@var{entry})
1492 The @code{argz_insert} function inserts the string @var{entry} into the
1493 argz vector @code{*@var{argz}} at a point just before the existing
1494 element pointed to by @var{before}, reallocating @code{*@var{argz}} and
1495 updating @code{*@var{argz}} and @code{*@var{argz_len}}.  If @var{before}
1496 is @code{0}, @var{entry} is added to the end instead (as if by
1497 @code{argz_add}).  Since the first element is in fact the same as
1498 @code{*@var{argz}}, passing in @code{*@var{argz}} as the value of
1499 @var{before} will result in @var{entry} being inserted at the beginning.
1500 @end deftypefun
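
Inserting at the beginning, as described above, can be sketched like
this.  The name @code{prepend_entry} is hypothetical.  When the vector
is empty, @code{*@var{argz}} is @code{0}, so the call falls back to
adding at the end, which gives the same result.

```c
#include <argz.h>
#include <stdlib.h>

/* Insert ENTRY in front of the first element of the argz vector.
   Passing *ARGZ as the BEFORE argument selects the first element;
   for an empty vector *ARGZ is null and ENTRY is simply appended.
   prepend_entry is a hypothetical helper, not a library function.  */
error_t
prepend_entry (char **argz, size_t *argz_len, const char *entry)
{
  return argz_insert (argz, argz_len, *argz, entry);
}
```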
1501
1502 @comment argz.h
1503 @comment GNU
1504 @deftypefun {char *} argz_next (char *@var{argz}, size_t @var{argz_len}, const char *@var{entry})
1505 The @code{argz_next} function provides a convenient way of iterating
1506 over the elements in the argz vector @var{argz}.  It returns a pointer
1507 to the next element in @var{argz} after the element @var{entry}, or
1508 @code{0} if there are no elements following @var{entry}.  If @var{entry}
1509 is @code{0}, the first element of @var{argz} is returned.
1510
1511 This behavior suggests two styles of iteration:
1512
1513 @smallexample
1514     char *entry = 0;
1515     while ((entry = argz_next (@var{argz}, @var{argz_len}, entry)))
1516       @var{action};
1517 @end smallexample
1518
1519 (the double parentheses are necessary to suppress the warning that some C
1520 compilers give for an assignment used as a @code{while} test) and:
1521
1522 @smallexample
1523     char *entry;
1524     for (entry = @var{argz};
1525          entry;
1526          entry = argz_next (@var{argz}, @var{argz_len}, entry))
1527       @var{action};
1528 @end smallexample
1529
1530 Note that the latter depends on @var{argz} having a value of @code{0} if
1531 it is empty (rather than a pointer to an empty block of memory); this
1532 invariant is maintained for argz vectors created by the functions here.
1533 @end deftypefun
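
As a further sketch, the first iteration style can be combined with
@code{argz_delete} to remove every element equal to a given string.
The helper name @code{delete_matching} is hypothetical.  Since pointers
into the vector should be treated as invalid after a deletion, this
version takes the cautious approach of restarting the scan from the
beginning each time it deletes something.

```c
#include <argz.h>
#include <stdlib.h>
#include <string.h>

/* Delete every element of the argz vector that is equal to STR.
   delete_matching is a hypothetical helper, not a library function.  */
void
delete_matching (char **argz, size_t *argz_len, const char *str)
{
  char *entry = NULL;

  while ((entry = argz_next (*argz, *argz_len, entry)) != NULL)
    if (strcmp (entry, str) == 0)
      {
        /* The result of argz_delete is not used here.  */
        argz_delete (argz, argz_len, entry);
        /* Do not trust ENTRY any more; start over from the first
           element.  */
        entry = NULL;
      }
}
```

Restarting costs extra passes over the vector when many elements match;
it is chosen here for simplicity rather than speed.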
1534
1535 @comment argz.h
1536 @comment GNU
1537 @deftypefun error_t argz_replace (@w{char **@var{argz}, size_t *@var{argz_len}}, @w{const char *@var{str}, const char *@var{with}}, @w{unsigned *@var{replace_count}})
1538 Replace any occurrences of the string @var{str} in @var{argz} with
1539 @var{with}, reallocating @var{argz} as necessary.  If
1540 @var{replace_count} is non-zero, @code{*@var{replace_count}} will be
1541 incremented by the number of replacements performed.
1542 @end deftypefun
1543
1544 @node Envz Functions, , Argz Functions, Argz and Envz Vectors
1545 @subsection Envz Functions
1546
1547 Envz vectors are just argz vectors with additional constraints on the form
1548 of each element; as such, argz functions can also be used on them, where it
1549 makes sense.
1550
1551 Each element in an envz vector is a name-value pair, separated by a @code{'='}
1552 character; if multiple @code{'='} characters are present in an element, those
1553 after the first are considered part of the value, and treated like all other
1554 non-@code{'\0'} characters.
1555
1556 If @emph{no} @code{'='} characters are present in an element, that element is
1557 considered the name of a ``null'' entry, as distinct from an entry with an
1558 empty value: @code{envz_get} will return @code{0} if given the name of a null
1559 entry, whereas an entry with an empty value would result in a value of
1560 @code{""}; @code{envz_entry} will still find such entries, however.  Null
1561 entries can be removed with the @code{envz_strip} function.
1562
1563 As with argz functions, envz functions that may allocate memory (and thus
1564 fail) have a return type of @code{error_t}, and return either @code{0} or
1565 @code{ENOMEM}.
1566
1567 @pindex envz.h
1568 These functions are declared in the standard include file @file{envz.h}.
1569
1570 @comment envz.h
1571 @comment GNU
1572 @deftypefun {char *} envz_entry (const char *@var{envz}, size_t @var{envz_len}, const char *@var{name})
1573 The @code{envz_entry} function finds the entry in @var{envz} with the name
1574 @var{name}, and returns a pointer to the whole entry---that is, the argz
1575 element which begins with @var{name} followed by a @code{'='} character.  If
1576 there is no entry with that name, @code{0} is returned.
1577 @end deftypefun
1578
1579 @comment envz.h
1580 @comment GNU
1581 @deftypefun {char *} envz_get (const char *@var{envz}, size_t @var{envz_len}, const char *@var{name})
1582 The @code{envz_get} function finds the entry in @var{envz} with the name
1583 @var{name} (like @code{envz_entry}), and returns a pointer to the value
1584 portion of that entry (following the @code{'='}).  If there is no entry with
1585 that name (or only a null entry), @code{0} is returned.
1586 @end deftypefun
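
The difference between a missing entry, a null entry, and an entry with
a value can be told apart by combining @code{envz_entry} and
@code{envz_get}.  The following is only a sketch; the enumeration and
the name @code{classify_entry} are assumptions made for illustration.

```c
#include <envz.h>
#include <stdlib.h>

/* Hypothetical classification of a name looked up in an envz vector.  */
enum entry_kind { ENTRY_ABSENT, ENTRY_NULL, ENTRY_VALUE };

/* Report whether NAME is absent from ENVZ, present as a null entry
   (no '=' character), or present with a value, which may be empty.
   classify_entry is a hypothetical helper, not a library function.  */
enum entry_kind
classify_entry (const char *envz, size_t envz_len, const char *name)
{
  if (envz_entry (envz, envz_len, name) == NULL)
    return ENTRY_ABSENT;
  if (envz_get (envz, envz_len, name) == NULL)
    return ENTRY_NULL;
  return ENTRY_VALUE;
}
```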
1587
1588 @comment envz.h
1589 @comment GNU
1590 @deftypefun {error_t} envz_add (char **@var{envz}, size_t *@var{envz_len}, const char *@var{name}, const char *@var{value})
1591 The @code{envz_add} function adds an entry to @code{*@var{envz}}
1592 (updating @code{*@var{envz}} and @code{*@var{envz_len}}) with the name
1593 @var{name}, and value @var{value}.  If an entry with the same name
1594 already exists in @var{envz}, it is removed first.  If @var{value} is
1595 @code{0}, then the new entry will be the special null type of entry
1596 (mentioned above).
1597 @end deftypefun
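
A short sketch of setting a variable and reading it back follows.  The
helper name @code{set_variable} is hypothetical.  Calling it twice with
the same name should leave a single entry holding the later value, since
@code{envz_add} removes an existing entry of that name first.

```c
#include <envz.h>
#include <stdlib.h>

/* Set NAME to VALUE in the envz vector and return a pointer to the
   stored value.  The result is NULL if the entry could not be added,
   and also if VALUE is NULL, because a null entry has no value.
   set_variable is a hypothetical helper, not a library function.  */
const char *
set_variable (char **envz, size_t *envz_len,
              const char *name, const char *value)
{
  if (envz_add (envz, envz_len, name, value) != 0)
    return NULL;
  return envz_get (*envz, *envz_len, name);
}
```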
1598
1599 @comment envz.h
1600 @comment GNU
1601 @deftypefun {error_t} envz_merge (char **@var{envz}, size_t *@var{envz_len}, const char *@var{envz2}, size_t @var{envz2_len}, int @var{override})
1602 The @code{envz_merge} function adds each entry in @var{envz2} to @var{envz},
1603 as if with @code{envz_add}, updating @code{*@var{envz}} and
1604 @code{*@var{envz_len}}.  If @var{override} is true, then values in @var{envz2}
1605 will supersede those with the same name in @var{envz}, otherwise not.
1606
1607 Null entries are treated just like other entries in this respect, so a null
1608 entry in @var{envz} can prevent an entry of the same name in @var{envz2} from
1609 being added to @var{envz}, if @var{override} is false.
1610 @end deftypefun
1611
1612 @comment envz.h
1613 @comment GNU
1614 @deftypefun {void} envz_strip (char **@var{envz}, size_t *@var{envz_len})
1615 The @code{envz_strip} function removes any null entries from @var{envz},
1616 updating @code{*@var{envz}} and @code{*@var{envz_len}}.
1617 @end deftypefun
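
One way to put null entries to use, sketched here under the assumption
that a program keeps a vector of default settings, is to let a null
entry mask a default: merge the defaults without overriding, then strip
the null entries.  The helper name @code{apply_defaults} is
hypothetical.

```c
#include <envz.h>
#include <stdlib.h>

/* Add to the envz vector every entry of DEFAULTS whose name is not
   already present, then remove null entries.  A null entry in the
   original vector therefore suppresses the default of the same name.
   apply_defaults is a hypothetical helper, not a library function.  */
error_t
apply_defaults (char **envz, size_t *envz_len,
                const char *defaults, size_t defaults_len)
{
  /* An OVERRIDE argument of 0 keeps existing entries.  */
  error_t err = envz_merge (envz, envz_len, defaults, defaults_len, 0);

  if (err == 0)
    envz_strip (envz, envz_len);
  return err;
}
```

For instance, with the entries @code{EDITOR=vi} and a null @code{PAGER}
in the vector, and defaults of @code{EDITOR=ed}, @code{PAGER=more} and
@code{SHELL=/bin/sh}, the result should contain @code{EDITOR=vi} and
@code{SHELL=/bin/sh} only.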