lib/libc/locale/mbintowcr.3

.\"-
.\" Copyright (c) 2015 Matthew Dillon
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.Dd August 24, 2015
.Dt MBINTOWCR 3
.Os
.Sh NAME
.Nm mbintowcr ,
.Nm mbintowcr_l ,
.Nm utf8towcr ,
.Nm wcrtombin ,
.Nm wcrtombin_l ,
.Nm wcrtoutf8
.Nd "8-bit-clean wchar conversion w/escaping or validation"
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In wchar.h
.Ft size_t
.Fo mbintowcr
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.Ft size_t
.Fo utf8towcr
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.Ft size_t
.Fo wcrtombin
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.Ft size_t
.Fo wcrtoutf8
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "int flags"
.Fc
.In xlocale.h
.Ft size_t
.Fo mbintowcr_l
.Fa "wchar_t * restrict dst" "const char * restrict src"
.Fa "size_t dlen" "size_t *slen" "locale_t locale" "int flags"
.Fc
.Ft size_t
.Fo wcrtombin_l
.Fa "char * restrict dst" "const wchar_t * restrict src"
.Fa "size_t dlen" "size_t *slen" "locale_t locale" "int flags"
.Fc
.Sh DESCRIPTION
The
.Fn mbintowcr
and
.Fn wcrtombin
functions translate byte data into wide-char format and back again.
Under normal conditions (but not with all flags) these functions
guarantee that the round-trip will be 8-bit-clean.
Care must be taken to specify the
.Dv WCSBIN_EOF
flag so that trailing incomplete sequences are handled properly at stream EOF.
.Pp
For the "C" locale these functions are 1:1 (do not convert UTF-8).
For UTF-8 locales these functions convert to/from UTF-8.
Most of the discussion below pertains to UTF-8 translations.
.Pp
The
.Fn utf8towcr
and
.Fn wcrtoutf8
functions do exactly the same thing as the above functions but are locked
to the UTF-8 locale.
That is, these functions work regardless of which locale has been selected
and also do not require any initial
.Fn setlocale
call to initialize.
Applications working explicitly in UTF-8 should use these versions.
.Pp
Any illegal sequences will be escaped using UTF-8B (U+DC80 - U+DCFF).
Illegal sequences include surrogate-space encodings, non-canonical encodings,
codings > 0x10FFFF, 5-byte and 6-byte codings (which are no longer legal),
and malformed codings.
Flags may be used to modify this behavior.
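.Pp
The legality rules listed above can be sketched as a small predicate.
This is an illustration only, not the libc implementation:
utf8_seq_is_legal() is a hypothetical helper that takes a code point which
has already been decoded from a well-formed sequence, together with the
length of that sequence, and it assumes no flags are in effect.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of libc: decide whether a code point that
 * was decoded from an nbytes-long UTF-8 sequence is legal when no flags
 * modify the rules.
 */
static bool
utf8_seq_is_legal(uint32_t cp, int nbytes)
{
	/* Smallest code point that genuinely needs 1, 2, 3 or 4 bytes. */
	static const uint32_t min_cp[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };

	if (nbytes < 1 || nbytes > 4)
		return false;	/* 5-byte and 6-byte forms are obsolete */
	if (cp < min_cp[nbytes])
		return false;	/* non-canonical (overlong) encoding */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return false;	/* surrogate space */
	if (cp > 0x10FFFF)
		return false;	/* beyond the Unicode range */
	return true;
}
```

Malformed sequences (for example a lead byte without its continuation bytes)
are not covered here because they never decode to a code point in the first
place.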
.Pp
The
.Fn mbintowcr
function takes generic 8-bit byte data as its input which the caller
expects to be loosely coded in UTF-8 and converts it to an array of
.Vt wchar_t ,
and returns the number of
.Vt wchar_t
that were converted.
The caller must set
.Fa *slen
to the number of bytes in the input buffer and the function will
set
.Fa *slen
on return to the number of bytes in the input buffer that were processed.
.Pp
Fewer bytes than specified might be processed due to the output buffer
reaching its limit or due to an incomplete sequence at the end of the input
buffer when the
.Dv WCSBIN_EOF
flag has not been specified.
.Pp
If processing a stream, the caller
typically copies any unprocessed data at the end of the buffer back to
the beginning and then continues loading the buffer from there.
Be sure to check for an incomplete translation at stream EOF and do a
final translation of the remainder with the
.Dv WCSBIN_EOF
flag set.
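.Pp
The copy-down loop described above might look roughly like the following.
To keep the sketch self-contained it does not call libc: toy_towcr() is a
hypothetical stand-in that only mimics the calling convention (it
understands ASCII and 2-byte sequences, and its eof argument plays the role
of the WCSBIN_EOF flag), and convert_stream() reads from an in-memory
buffer instead of a real stream.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical stand-in for a bytes-to-wchars converter.  A trailing lead
 * byte is left unprocessed unless eof is set; any other byte >= 0x80 that
 * it cannot decode is escaped as 0xDC00 | byte.
 */
static size_t
toy_towcr(wchar_t *dst, const char *src, size_t dlen, size_t *slen, int eof)
{
	size_t i = 0, n = 0;

	while (i < *slen && n < dlen) {
		unsigned char c = (unsigned char)src[i];

		if (c >= 0xC2 && c <= 0xDF) {
			if (i + 1 < *slen &&
			    ((unsigned char)src[i + 1] & 0xC0) == 0x80) {
				dst[n++] = (wchar_t)(((c & 0x1F) << 6) |
				    (src[i + 1] & 0x3F));
				i += 2;
				continue;
			}
			if (i + 1 == *slen && !eof)
				break;	/* incomplete: wait for more input */
		}
		dst[n++] = (c < 0x80) ? (wchar_t)c : (wchar_t)(0xDC00 | c);
		i++;
	}
	*slen = i;
	return n;
}

/* Convert "in" by reading at most "chunk" bytes per pass. */
static size_t
convert_stream(const char *in, size_t inlen, size_t chunk,
    wchar_t *out, size_t outmax)
{
	char buf[16];
	size_t have = 0, off = 0, total = 0;

	for (;;) {
		size_t room = sizeof(buf) - have;
		size_t want = chunk < room ? chunk : room;
		size_t got = inlen - off < want ? inlen - off : want;
		size_t used;
		int eof;

		memcpy(buf + have, in + off, got);	/* refill after tail */
		off += got;
		have += got;
		eof = (off == inlen);

		used = have;
		total += toy_towcr(out + total, buf, outmax - total, &used, eof);

		memmove(buf, buf + used, have - used);	/* copy-down */
		have -= used;

		if (eof && have == 0)
			break;		/* everything converted */
		if (used == 0 && got == 0)
			break;		/* no progress: output is full */
	}
	return total;
}
```

With a chunk size of 2, the input bytes 'a' 0xC3 0xA9 'b' are split in the
middle of the 2-byte sequence; the first pass consumes only 'a', and the
second pass sees the lead byte again at the start of the buffer.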
.Pp
This function will always generate escapes for illegal UTF-8 code sequences
and by default can produce a clean BYTE-WCHAR-BYTE conversion.
See the flags description later on.
.Pp
This function cannot return an error unless the
.Dv WCSBIN_STRICT
flag is set.
In case of error, any valid conversions are returned first and the caller
is expected to iterate.
The error is returned when it becomes the first element of the buffer.
.Pp
A
.Dv NULL
destination buffer may be specified, in which case this function operates
identically except that it does not attempt to fill the buffer.
This feature is typically used for validation with
.Dv WCSBIN_STRICT
and sometimes also used in combination with
.Dv WCSBIN_SURRO
(set if you want to allow surrogates).
.Pp
The
.Fn wcrtombin
function takes an array of
.Vt wchar_t
as its input which is usually expected to be well-formed and converts it
to an array of generic 8-bit byte data.
The caller must set
.Fa *slen
to the number of elements in the input buffer and the function will set
.Fa *slen
on return to the number of elements in the input buffer that were processed.
.Pp
Be sure to properly set the
.Dv WCSBIN_EOF
flag for the last buffer at stream EOF.
.Pp
This function can return an error regardless of the flags if a supplied
wchar code is out of range.
Some flags change the range of allowed wchar codes.
In case of error, any valid conversions are returned first and the
caller is expected to iterate.
The error is returned when it becomes the first element of the buffer.
.Pp
A
.Dv NULL
destination buffer may be specified, in which case this function operates
identically except that it does not attempt to fill the buffer.
This feature is typically used for validation with or without
.Dv WCSBIN_STRICT
and sometimes also used in combination with
.Dv WCSBIN_SURRO .
.Pp
One final note on the use of
.Dv WCSBIN_SURRO
for wchars-to-bytes.
If this flag
is not set, surrogates in the escape range will be de-escaped (giving us our
8-bit-clean round-trip), and other surrogates will be passed through as UTF-8
encodings.
In
.Dv WCSBIN_STRICT
mode this flag works slightly differently.
If not specified, no surrogates are allowed at all (escaped or otherwise),
and if specified, all surrogates are allowed and will never be de-escaped.
.Pp
The _l-suffixed versions of
.Fn mbintowcr
and
.Fn wcrtombin
take an explicit
.Fa locale
argument, whereas the
non-suffixed versions use the current global or per-thread locale.
.Sh UTF-8B ESCAPE SEQUENCES
Escaping is handled by converting one or more bytes in the byte sequence to
the UTF-8B escape wchar (U+DC80 - U+DCFF).
Most illegal sequences escape the first byte and then reprocess the remaining
bytes.
An illegal byte
sequence length (5 or 6 bytes), non-canonical encoding, or illegal wchar value
(beyond 0x10FFFF if not modified by flags) will escape all bytes in the
sequence as long as they were not malformed.
.Pp
When converting back to a byte-sequence, if not modified by flags, UTF-8B
escape wchars are converted back to their original bytes.
Other surrogate codes (U+D800 - U+DFFF which are normally illegal) will be
passed through and encoded as UTF-8.
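.Pp
As a rough illustration of the escape range, the per-byte mapping can be
sketched as follows.
The helper names are hypothetical and the code assumes the conventional
UTF-8B mapping in which byte 0x80 maps to U+DC80 and byte 0xFF maps to
U+DCFF; it is not taken from the libc sources.

```c
#include <stdbool.h>
#include <wchar.h>

/*
 * Hypothetical helpers, not part of libc.  Only bytes 0x80-0xFF are ever
 * escaped, since ASCII bytes are always valid on their own.
 */
static wchar_t
utf8b_escape(unsigned char byte)
{
	return (wchar_t)(0xDC00 | byte);	/* 0x80 -> U+DC80 */
}

static bool
utf8b_is_escape(wchar_t wc)
{
	return wc >= 0xDC80 && wc <= 0xDCFF;
}

static unsigned char
utf8b_unescape(wchar_t wc)
{
	return (unsigned char)(wc & 0xFF);	/* U+DC80 -> 0x80 */
}
```

Because every escaped byte maps to a distinct wchar and back, a stream that
contains arbitrary bytes survives the round trip unchanged as long as no
flag disables de-escaping.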
.Sh FLAGS
.Bl -tag -width ".Dv WCSBIN_LONGCODES"
.It Dv WCSBIN_EOF
Indicate that the input buffer represents the last of the input stream.
This causes any partial sequences at the end of the input buffer to be
processed.
.It Dv WCSBIN_SURRO
This flag passes through any surrogate codes that are already UTF-8-encoded.
This is normally illegal but if you are processing a stream which has already
been UTF-8B escaped this flag will prevent the U+DC80 - U+DCFF codes from
being re-escaped bytes-to-wchars and will prevent decoding back to the
original bytes wchars-to-bytes.
This flag is sometimes used on input if the
caller expects the input stream to already be escaped, and not usually used
on output unless the caller explicitly wants to encode to an intermediate
illegal UTF-8 encoding that retains the escapes as escapes.
.Pp
This flag does not prevent additional escapes from being translated on
bytes-to-wchars
.Dv ( WCSBIN_STRICT
prevents escaping on bytes-to-wchars), but
will prevent de-escaping on wchars-to-bytes.
.Pp
This flag breaks round-trip 8-bit-clean operation since escape codes use
the surrogate space and will mix with surrogates that are passed through
on input by this flag in a way that cannot be distinguished.
.It Dv WCSBIN_LONGCODES
Specifying this flag in the bytes-to-wchars direction allows for decoding
of legacy 5-byte and 6-byte sequences as well as 4-byte sequences which
would normally be illegal.
These sequences are illegal and this flag should
not normally be used unless the caller explicitly wants to handle the legacy
case.
.Pp
Specifying this flag in the wchars-to-bytes direction allows normally illegal
wchars to be encoded.
Again, not recommended.
.Pp
This flag does not allow decoding non-canonical sequences.
Such sequences will still be escaped.
.It Dv WCSBIN_STRICT
This flag forces strict parsing in the bytes-to-wchars direction and will
cause
.Fn mbintowcr
to process short or return with an error once processing reaches the
illegal coding rather than escaping the illegal sequence.
This flag is usually specified only when the caller desires to validate
a UTF-8 buffer.
Remember that an error may also be present with return values greater than 0.
A partial sequence at the end of the buffer is not
considered to be an error unless
.Dv WCSBIN_EOF
is also specified.
.Pp
The caller is reminded that when using this feature for validation, a
short-return can happen rather than an error if the error is not at the
base of the source or if
.Dv WCSBIN_EOF
is not specified.
If the caller is not chaining buffers then
.Dv WCSBIN_EOF
should be specified and a simple check of whether
.Fa *slen
equals the original input buffer length on return is sufficient to determine
if an error occurred or not.
If the caller is chaining buffers,
.Dv WCSBIN_EOF
is not specified and the caller must proceed with the copy-down / continued
buffer loading loop to distinguish between an incomplete buffer and an error.
.El
.Sh RETURN VALUES
The
.Fn mbintowcr ,
.Fn mbintowcr_l ,
.Fn utf8towcr ,
.Fn wcrtombin ,
.Fn wcrtombin_l
and
.Fn wcrtoutf8
functions return the number of output elements generated and set
.Fa *slen
to the number of input elements converted.
If an error occurs but the output buffer has already been populated,
a short return will occur and the next iteration where the error is
the first element will return the error.
The caller is responsible for processing any error conditions before
continuing.
.Pp
The
.Fn mbintowcr ,
.Fn mbintowcr_l
and
.Fn utf8towcr
functions can return a (size_t)-1 error if
.Dv WCSBIN_STRICT
is specified, and otherwise cannot.
.Pp
The
.Fn wcrtombin ,
.Fn wcrtombin_l
and
.Fn wcrtoutf8
functions can return a (size_t)-1 error if given an illegal wchar code,
as modified by
.Fa flags .
Any wchar code >= 0x80000000U always causes an error to be returned.
.Sh ERRORS
If an error is returned,
.Va errno
will be set to
.Er EILSEQ .
.Sh SEE ALSO
.Xr mbtowc 3 ,
.Xr multibyte 3 ,
.Xr setlocale 3 ,
.Xr wcrtomb 3 ,
.Xr xlocale 3
.Sh STANDARDS
The
.Fn mbintowcr ,
.Fn mbintowcr_l ,
.Fn utf8towcr ,
.Fn wcrtombin ,
.Fn wcrtombin_l
and
.Fn wcrtoutf8
functions are non-standard extensions to libc.