.\" Copyright (c) 2015 Matthew Dillon
.\" All rights reserved.
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.Nd "8-bit-clean wchar conversion w/escaping or validation"
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "locale_t locale" "int flags"
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "locale_t locale" "int flags"
functions translate byte data into wide-char format and back again.
Under normal conditions (but not with all flags) these functions
guarantee that the round-trip will be 8-bit-clean.
Some care must be taken to properly specify the
flag to properly handle trailing incomplete sequences at stream EOF.
For the "C" locale these functions are 1:1 (do not convert UTF-8).
For UTF-8 locales these functions convert to/from UTF-8.
Most of the discussion below pertains to UTF-8 translations.
functions do exactly the same thing as the above functions but are locked
That is, these functions work regardless of which locale has been selected
and also do not require any initial
Applications working explicitly in UTF-8 should use these versions.
Any illegal sequences will be escaped using UTF-8B (U+DC80 - U+DCFF).
Illegal sequences include surrogate-space encodings, non-canonical encodings,
codings > 0x10FFFF, 5-byte and 6-byte codings (which are not legal any more),
and malformed codings.
Flags may be used to modify this behavior.
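The categories of illegal sequence listed above can be told apart with a small
classifier.
The following is a minimal sketch for illustration only, not the libc
implementation: the names classify_utf8 and enum seq_status are hypothetical,
and the sketch assumes default behavior with no flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification codes, for illustration only. */
enum seq_status {
	SEQ_OK,           /* well-formed, canonical, in range */
	SEQ_MALFORMED,    /* bad lead byte or bad continuation byte */
	SEQ_INCOMPLETE,   /* buffer ends before the sequence does */
	SEQ_TOO_LONG,     /* legacy 5-byte or 6-byte form */
	SEQ_OVERLONG,     /* non-canonical (longer than necessary) */
	SEQ_SURROGATE,    /* encodes U+D800 - U+DFFF */
	SEQ_OUT_OF_RANGE  /* encodes a value above 0x10FFFF */
};

/*
 * Classify the sequence starting at s[0].  On return *used holds the
 * full sequence length when all bytes were structurally valid, and 1
 * when only the first byte should be consumed (malformed/incomplete).
 */
static enum seq_status
classify_utf8(const unsigned char *s, size_t len, uint32_t *cp, size_t *used)
{
	/* Smallest code point that legitimately needs n bytes. */
	static const uint32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
	uint32_t c;
	size_t n, i;

	assert(cp != NULL && used != NULL);
	*cp = 0;
	*used = 0;
	if (len == 0)
		return SEQ_INCOMPLETE;
	*used = 1;
	c = s[0];
	if (c < 0x80) {
		*cp = c;
		return SEQ_OK;
	}
	if (c < 0xC0)
		return SEQ_MALFORMED;	/* stray continuation byte */
	else if (c < 0xE0) { n = 2; c &= 0x1F; }
	else if (c < 0xF0) { n = 3; c &= 0x0F; }
	else if (c < 0xF8) { n = 4; c &= 0x07; }
	else if (c < 0xFC) { n = 5; c &= 0x03; }
	else if (c < 0xFE) { n = 6; c &= 0x01; }
	else
		return SEQ_MALFORMED;	/* 0xFE and 0xFF never appear */

	for (i = 1; i < n; i++) {
		if (i >= len)
			return SEQ_INCOMPLETE;
		if ((s[i] & 0xC0) != 0x80)
			return SEQ_MALFORMED;
		c = (c << 6) | (s[i] & 0x3F);
	}
	*used = n;
	*cp = c;
	if (n >= 5)
		return SEQ_TOO_LONG;
	if (c < min_for_len[n])
		return SEQ_OVERLONG;
	if (c >= 0xD800 && c <= 0xDFFF)
		return SEQ_SURROGATE;
	if (c > 0x10FFFF)
		return SEQ_OUT_OF_RANGE;
	return SEQ_OK;
}
```

In this sketch every status other than SEQ_OK and SEQ_INCOMPLETE corresponds
to input that would be escaped by default.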
function takes generic 8-bit byte data as its input which the caller
expects to be loosely coded in UTF-8 and converts it to an array of
and returns the number of
to the number of bytes in the input buffer and the function will
on return to the number of bytes in the input buffer that were processed.
Fewer bytes than specified might be processed due to the output buffer
reaching its limit or due to an incomplete sequence at the end of the input
flag has not been specified.
If processing a stream, the caller
typically copies any unprocessed data at the end of the buffer back to
the beginning and then continues loading the buffer from there.
Be sure to check for an incomplete translation at stream EOF and do a
final translation of the remainder with the
This function will always generate escapes for illegal UTF-8 code sequences
and by default can produce a clean BYTE-WCHAR-BYTE conversion.
See the flags description later on.
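The copy-down loop described above for stream processing can be sketched as
follows.
This is an illustration under stated assumptions, not a definitive
implementation: toy_convert is a hypothetical stand-in that only mimics the
documented calling shape (it consumes whole sequences and leaves an incomplete
tail unprocessed unless told the buffer is the last one), and convert_stream
and the buffer sizes are likewise invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in converter, NOT the libc function.  It emits one
 * element per sequence (just the lead byte), updates *slen to the number
 * of input bytes consumed, and returns the number of elements produced.
 */
static size_t
toy_convert(unsigned *dst, const unsigned char *src, size_t dlen,
    size_t *slen, int eof)
{
	size_t in = 0, out = 0;

	while (in < *slen && out < dlen) {
		unsigned char c = src[in];
		size_t need = c < 0xC0 ? 1 : c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;

		if (in + need > *slen) {
			if (!eof)
				break;	/* incomplete tail: leave for next call */
			need = 1;	/* last buffer: emit the byte by itself */
		}
		dst[out++] = c;
		in += need;
	}
	*slen = in;
	return out;
}

/* Feed data through toy_convert in chunks; return total elements produced. */
static size_t
convert_stream(const unsigned char *data, size_t total, size_t chunk)
{
	unsigned char buf[16];
	unsigned out[16];
	size_t have = 0, off = 0, produced = 0;

	/* The buffer must be able to hold at least one whole sequence. */
	if (chunk < 4)
		chunk = 4;
	if (chunk > sizeof(buf))
		chunk = sizeof(buf);
	for (;;) {
		/* Load new input after any bytes carried over. */
		size_t n = chunk - have;
		if (n > total - off)
			n = total - off;
		memcpy(buf + have, data + off, n);
		have += n;
		off += n;

		int eof = (off == total);	/* this is the final buffer */
		size_t slen = have;
		produced += toy_convert(out, buf, 16, &slen, eof);
		assert(slen <= have);

		/* Copy the unprocessed remainder down to the beginning. */
		memmove(buf, buf + slen, have - slen);
		have -= slen;
		if (eof && have == 0)
			break;
	}
	return produced;
}
```

The point of the sketch is the ordering: load, convert, copy the remainder
down, and only mark the final buffer as the end of the stream.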
This function cannot return an error unless the
In case of error, any valid conversions are returned first and the caller
is expected to iterate.
The error is returned when it becomes the first element of the buffer.
destination buffer may be specified in which case this function operates
identically except for actually trying to fill the buffer.
This feature is typically used for validation with
and sometimes also used in combination with
(set if you want to allow surrogates).
function takes an array of
as its input which is usually expected to be well-formed and converts it
to an array of generic 8-bit byte data.
to the number of elements in the input buffer and the function will set
on return to the number of elements in the input buffer that were processed.
Be sure to properly set the
flag for the last buffer at stream EOF.
This function can return an error regardless of the flags if a supplied
wchar code is out of range.
Some flags change the range of allowed wchar codes.
In case of error, any valid conversions are returned first and the
caller is expected to iterate.
The error is returned when it becomes the first element of the buffer.
destination buffer may be specified in which case this function operates
identically except for actually trying to fill the buffer.
This feature is typically used for validation with or without
and sometimes also used in combination with
One final note on the use of
is not set surrogates in the escape range will be de-escaped (giving us our
8-bit-clean round-trip), and other surrogates will be passed through as UTF-8
mode this flag works slightly differently.
If not specified no surrogates are allowed at all (escaped or otherwise),
and if specified all surrogates are allowed and will never be de-escaped.
The _l-suffixed versions of
argument, whereas the
non-suffixed versions use the current global or per-thread locale.
.Sh UTF-8B ESCAPE SEQUENCES
Escaping is handled by converting one or more bytes in the byte sequence to
the UTF-8B escape wchar (U+DC80 - U+DCFF).
Most illegal sequences escape the first byte and then reprocess the remaining
sequence length (5 or 6 bytes), non-canonical encoding, or illegal wchar value
(beyond 0x10FFFF if not modified by flags) will escape all bytes in the
sequence as long as they were not malformed.
When converting back to a byte-sequence, if not modified by flags, UTF-8B
escape wchars are converted back to their original bytes.
Other surrogate codes (U+D800 - U+DFFF which are normally illegal) will be
passed through and encoded as UTF-8.
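The relationship between an escaped byte and its escape wchar can be sketched
as follows.
This is a minimal illustration that assumes the conventional UTF-8B mapping,
in which a byte in the range 0x80 - 0xFF maps to U+DC00 plus the byte value;
the helper names are hypothetical and are not part of libc.

```c
#include <assert.h>
#include <stdint.h>

/* Bounds of the escape range given in the text above. */
#define UTF8B_LO 0xDC80u
#define UTF8B_HI 0xDCFFu

/* Escape one undecodable byte (0x80 - 0xFF) into the escape range. */
static uint32_t
utf8b_escape(unsigned char b)
{
	assert(b >= 0x80);	/* ASCII bytes are never escaped */
	return 0xDC00u | b;
}

/* Return non-zero if wc is an escape wchar rather than a real character. */
static int
utf8b_is_escape(uint32_t wc)
{
	return wc >= UTF8B_LO && wc <= UTF8B_HI;
}

/* Recover the original byte from an escape wchar. */
static unsigned char
utf8b_unescape(uint32_t wc)
{
	assert(utf8b_is_escape(wc));
	return (unsigned char)(wc & 0xFF);
}
```

Because the mapping is one-to-one over 0x80 - 0xFF, escaping and then
de-escaping returns the original byte, which is what makes the default
round-trip 8-bit-clean.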
.Bl -tag -width ".Dv WCSBIN_LONGCODES"
Indicate that the input buffer represents the last of the input stream.
This causes any partial sequences at the end of the input buffer to be
This flag passes through any surrogate codes that are already UTF-8-encoded.
This is normally illegal but if you are processing a stream which has already
been UTF-8B escaped this flag will prevent the U+DC80 - U+DCFF codes from
being re-escaped bytes-to-wchars and will prevent decoding back to the
original bytes wchars-to-bytes.
This flag is sometimes used on input if the
caller expects the input stream to already be escaped, and not usually used
on output unless the caller explicitly wants to encode to an intermediate
illegal UTF-8 encoding that retains the escapes as escapes.
This flag does not prevent additional escapes from being translated on
prevents escaping on bytes-to-wchars), but
will prevent de-escaping on wchars-to-bytes.
This flag breaks round-trip 8-bit-clean operation since escape codes use
the surrogate space and will mix with surrogates that are passed through
on input by this flag in a way that cannot be distinguished.
.It Dv WCSBIN_LONGCODES
Specifying this flag in the bytes-to-wchars direction allows for decoding
of legacy 5-byte and 6-byte sequences as well as 4-byte sequences which
would normally be illegal.
These sequences are illegal and this flag should
not normally be used unless the caller explicitly wants to handle the legacy
Specifying this flag in the wchars-to-bytes direction allows normally illegal
wchars to be encoded.
Again, not recommended.
This flag does not allow decoding non-canonical sequences.
Such sequences will still be escaped.
This flag forces strict parsing in the bytes-to-wchars direction and will
to process short or return with an error once processing reaches the
illegal coding rather than escaping the illegal sequence.
This flag is usually specified only when the caller desires to validate
Remember that an error may also be present with return values greater than 0.
A partial sequence at the end of the buffer is not
considered to be an error unless
The caller is reminded that when using this feature for validation, a
short-return can happen rather than an error if the error is not at the
base of the source or if
If the caller is not chaining buffers then
should be specified and a simple check of whether
equals the original input buffer length on return is sufficient to determine
if an error occurred or not.
If the caller is chaining buffers
is not specified and the caller must proceed with the copy-down / continued
buffer loading loop to distinguish between an incomplete buffer and an error.
functions return the number of output elements generated and set
to the number of input elements converted.
If an error occurs but the output buffer has already been populated,
a short return will occur and the next iteration where the error is
the first element will return the error.
The caller is responsible for processing any error conditions before
functions can return a (size_t)-1 error if
is specified, and otherwise cannot.
functions can return a (size_t)-1 error if given an illegal wchar code,
Any wchar code >= 0x80000000U always causes an error to be returned.
If an error is returned, errno will be set to
functions are non-standard extensions to libc.