@node Handling strings with NUL characters
@section Handling strings with NUL characters

@c Copyright (C) 2023--2024 Free Software Foundation, Inc.

@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3 or
@c any later version published by the Free Software Foundation; with no
@c Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
@c copy of the license is at <https://www.gnu.org/licenses/fdl-1.3.en.html>.

@c Written by Bruno Haible.

Strings in C are usually represented by a character sequence with a
terminating NUL character. A @samp{char *}, that is, a pointer to the
first byte of this character sequence, is what gets passed around as a
function argument or return value.

The major restriction of this string representation is that it cannot
handle strings that contain NUL characters: such strings appear shorter
than they were meant to be, because processing stops at the first NUL.
In most application areas, this is not a problem, and the @code{char *}
type is perfectly adequate.
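
The truncation can be observed directly. The following is a minimal
sketch in plain ISO C; the helper name @code{apparent_length} is made up
for this illustration and is not part of Gnulib.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Returns the length that NUL-terminated string functions would see for
   the NBYTES bytes at DATA: the number of bytes before the first NUL,
   or NBYTES if the region contains no NUL.  */
static size_t
apparent_length (const char *data, size_t nbytes)
{
  assert (data != NULL);
  const char *nul = memchr (data, '\0', nbytes);
  return nul != NULL ? (size_t) (nul - data) : nbytes;
}
```

For the 5 bytes @samp{a}, @samp{b}, NUL, @samp{c}, @samp{d}, both this
helper and @code{strlen} report a length of 2: the last three bytes are
invisible to code that relies on the terminating NUL.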

In areas where strings with embedded NUL characters need to be handled,
the common approach is to use a @code{char *ptr} pointer variable
together with a @code{size_t nbytes} variable (or an @code{idx_t nbytes}
variable, if you want to avoid problems due to integer overflow). This
works fine in code that constructs or manipulates strings with embedded
NUL characters. But when it comes to @emph{storing} them, for example
in an array or as a key or value in a hash table, one needs a type that
combines these two fields.
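
To illustrate why a combined type helps with storage, here is a sketch
that keeps such strings in an array and searches it. The type
@code{struct byte_string} and both functions are hypothetical stand-ins
written for this example; they are not the Gnulib type or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a combined pointer-plus-length type.
   This is NOT Gnulib's string_desc_t.  */
struct byte_string
{
  const char *ptr;    /* first byte of the string */
  size_t nbytes;      /* number of bytes, embedded NULs included */
};

/* Compares two byte strings for equality, looking at all bytes.  */
static bool
byte_string_equals (struct byte_string a, struct byte_string b)
{
  return a.nbytes == b.nbytes
         && (a.nbytes == 0 || memcmp (a.ptr, b.ptr, a.nbytes) == 0);
}

/* Returns the index of KEY in TABLE[0..N-1], or N if it is absent.  */
static size_t
find_key (const struct byte_string *table, size_t n,
          struct byte_string key)
{
  assert (table != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (byte_string_equals (table[i], key))
      return i;
  return n;
}
```

Because each array element carries its own byte count, two keys that
differ only after an embedded NUL remain distinct, which a table of
plain @code{char *} keys compared with @code{strcmp} could not
guarantee.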

The Gnulib modules @code{string-desc}, @code{xstring-desc}, and
@code{string-desc-quotearg} provide such a type. We call it a
``string descriptor'' and name it @code{string_desc_t}.

The type @code{string_desc_t} is a struct that contains a pointer to the
first byte and the number of bytes of the memory region that makes up
the string. A terminating NUL byte that may additionally be present in
memory is not included in this byte count. This type implements the
same concept as @code{std::string_view} in C++, or the @code{String}
type in Java.

A @code{string_desc_t} can be passed to a function as an argument, or
can be the return value of a function. This is type-safe: If, by
mistake, a programmer passes a @code{string_desc_t} to a function that
expects a @code{char *} argument, or vice versa, or assigns a
@code{string_desc_t} value to a variable of type @code{char *}, or
vice versa, the compiler will report an error.

Functions related to string descriptors are provided:
@itemize
@item
Side-effect-free operations in @code{"string-desc.h"},
@item
Memory-allocating operations in @code{"string-desc.h"},
@item
Memory-allocating operations with out-of-memory checking in
@code{"xstring-desc.h"},
@item
Operations with side effects in @code{"string-desc.h"}.
@end itemize

For outputting a string descriptor, the @code{*printf} family of
functions cannot be used directly. A format string directive such as
@code{"%.*s"} would not work:
@itemize
@item
it would stop the output at the first encountered NUL character,
@item
it would require casting the number of bytes to @code{int}, and thus
would not work for strings longer than @code{INT_MAX} bytes.
@end itemize
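
Both problems show up in a small experiment with the standard
@code{snprintf} function. The wrapper name @code{format_with_precision}
is invented for this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only.
   Formats the NBYTES bytes at DATA with "%.*s" into BUF and returns
   what snprintf returns: the number of bytes the directive produced
   (not counting the terminating NUL), or a negative value on error.  */
static int
format_with_precision (char *buf, size_t bufsize,
                       const char *data, size_t nbytes)
{
  /* Second problem: the precision argument must be an int, so byte
     counts above INT_MAX cannot be expressed at all.  */
  assert (nbytes <= INT_MAX);
  /* First problem: the precision is only an upper bound; the output
     still ends at the first NUL byte in DATA.  */
  return snprintf (buf, bufsize, "%.*s", (int) nbytes, data);
}
```

When this is called on the 5 bytes @samp{a}, @samp{b}, NUL, @samp{c},
@samp{d} with a precision of 5, only the 2 bytes before the NUL reach
the output buffer.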

@c @noindent Other format string directives don't work either, because
@c the only way to produce a NUL character in @code{*printf}'s output
@c is through a dedicated @code{%c} or @code{%lc} directive.

Therefore Gnulib offers
@itemize
@item
a function @code{string_desc_fwrite} that outputs a string descriptor to
a @code{FILE} stream,
@item
a function @code{string_desc_write} that outputs a string descriptor to
a file descriptor,
@item
and for those applications where the NUL characters should become
visible as @samp{\0}, a family of @code{quotearg} based functions that
allow you to specify the escaping rules in detail.
@end itemize
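
The two styles of output can be sketched with standard C alone. Both
functions below, @code{write_all_bytes} and @code{escape_nuls}, are
hypothetical helpers written for this illustration; they show the
general idea only and make no claim about how the Gnulib functions are
implemented or what their signatures are.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes all NBYTES bytes at DATA to FP, embedded
   NULs included.  Returns true if every byte was written.  */
static bool
write_all_bytes (FILE *fp, const char *data, size_t nbytes)
{
  assert (fp != NULL);
  return fwrite (data, 1, nbytes, fp) == nbytes;
}

/* Hypothetical helper: copies the NBYTES bytes at DATA into OUT,
   replacing each NUL byte with the two characters '\' and '0', and
   NUL-terminates OUT.  Backslashes in DATA are copied unchanged, so
   the result is meant for display, not for decoding.  Returns the
   length of the result, or (size_t) -1 if OUTSIZE is too small.  */
static size_t
escape_nuls (char *out, size_t outsize, const char *data, size_t nbytes)
{
  assert (out != NULL && (data != NULL || nbytes == 0));
  if (outsize == 0)
    return (size_t) -1;
  size_t o = 0;
  for (size_t i = 0; i < nbytes; i++)
    {
      size_t need = (data[i] == '\0' ? 2 : 1);
      if (outsize - o <= need)   /* keep room for the terminator */
        return (size_t) -1;
      if (data[i] == '\0')
        {
          out[o++] = '\\';
          out[o++] = '0';
        }
      else
        out[o++] = data[i];
    }
  out[o] = '\0';
  return o;
}
```

Unlike a @code{"%.*s"} directive, @code{fwrite} takes the byte count as
a @code{size_t} and does not interpret NUL bytes, which is why a
length-based interface suits this kind of output.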

The functionality is thus split across three modules as follows:
@itemize
@item
The module @code{string-desc}, under LGPL, defines the type and
elementary functions.
@item
The module @code{xstring-desc}, under GPL, defines the memory-allocating
functions with out-of-memory checking.
@item
The module @code{string-desc-quotearg}, under GPL, defines the
@code{quotearg} based functions.
@end itemize