Doc/c-api/unicode.rst

   1 .. highlightlang:: c
   2
   3 .. _unicodeobjects:
   4
   5 Unicode Objects and Codecs
   6 --------------------------
   7
   8 .. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
   9
  10 Unicode Objects
  11 ^^^^^^^^^^^^^^^
  12
  13 These are the basic Unicode object types used for the Unicode implementation in
  14 Python:
  15
  16 .. % --- Unicode Type -------------------------------------------------------
  17
  18
  19 .. ctype:: Py_UNICODE
  20
  21    This type represents the storage type which is used by Python internally as the
  22    basis for holding Unicode ordinals.  Python's default builds use a 16-bit type
  23    for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
  24    possible to build a UCS4 version of Python (most recent Linux distributions come
  25    with UCS4 builds of Python). These builds then use a 32-bit type for
  26    :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
  27    where :ctype:`wchar_t` is available and compatible with the chosen Python
  28    Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
  29    :ctype:`wchar_t` to enhance native platform compatibility. On all other
  30    platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
  31    short` (UCS2) or :ctype:`unsigned long` (UCS4).
  32
  33 Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
  34 this in mind when writing extensions or interfaces.
  35
  36
  37 .. ctype:: PyUnicodeObject
  38
  39    This subtype of :ctype:`PyObject` represents a Python Unicode object.
  40
  41
  42 .. cvar:: PyTypeObject PyUnicode_Type
  43
  44    This instance of :ctype:`PyTypeObject` represents the Python Unicode type.  It
  45    is exposed to Python code as ``str``.
  46
  47 The following APIs are really C macros and can be used to do fast checks and to
  48 access internal read-only data of Unicode objects:
  49
  50
  51 .. cfunction:: int PyUnicode_Check(PyObject *o)
  52
  53    Return true if the object *o* is a Unicode object or an instance of a Unicode
  54    subtype.
  55
  56
  57 .. cfunction:: int PyUnicode_CheckExact(PyObject *o)
  58
  59    Return true if the object *o* is a Unicode object, but not an instance of a
  60    subtype.
  61
  62
  63 .. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
  64
  65    Return the size of the object.  *o* has to be a :ctype:`PyUnicodeObject` (not
  66    checked).
  67
  68
  69 .. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
  70
  71    Return the size of the object's internal buffer in bytes.  *o* has to be a
  72    :ctype:`PyUnicodeObject` (not checked).
  73
  74
  75 .. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
  76
  77    Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object.  *o*
  78    has to be a :ctype:`PyUnicodeObject` (not checked).
  79
  80
  81 .. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
  82
  83    Return a pointer to the internal buffer of the object. *o* has to be a
  84    :ctype:`PyUnicodeObject` (not checked).
  85
  86
  87 .. cfunction:: int PyUnicode_ClearFreeList()
  88
  89    Clear the free list. Return the total number of freed items.
  90
  91
  92 Unicode provides many different character properties. The most often needed ones
  93 are available through these macros which are mapped to C functions depending on
  94 the Python configuration.
  95
  96 .. % --- Unicode character properties ---------------------------------------
  97
  98
  99 .. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
 100
 101    Return 1 or 0 depending on whether *ch* is a whitespace character.
 102
 103
 104 .. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
 105
 106    Return 1 or 0 depending on whether *ch* is a lowercase character.
 107
 108
 109 .. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
 110
 111    Return 1 or 0 depending on whether *ch* is an uppercase character.
 112
 113
 114 .. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
 115
 116    Return 1 or 0 depending on whether *ch* is a titlecase character.
 117
 118
 119 .. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
 120
 121    Return 1 or 0 depending on whether *ch* is a linebreak character.
 122
 123
 124 .. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
 125
 126    Return 1 or 0 depending on whether *ch* is a decimal character.
 127
 128
 129 .. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
 130
 131    Return 1 or 0 depending on whether *ch* is a digit character.
 132
 133
 134 .. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
 135
 136    Return 1 or 0 depending on whether *ch* is a numeric character.
 137
 138
 139 .. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
 140
 141    Return 1 or 0 depending on whether *ch* is an alphabetic character.
 142
 143
 144 .. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
 145
 146    Return 1 or 0 depending on whether *ch* is an alphanumeric character.
 147
 148
 149 .. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
 150
 151    Return 1 or 0 depending on whether *ch* is a printable character.
 152    Nonprintable characters are those characters defined in the Unicode character
 153    database as "Other" or "Separator", excepting the ASCII space (0x20) which is
 154    considered printable.  (Note that printable characters in this context are
 155    those which should not be escaped when :func:`repr` is invoked on a string.
 156    It has no bearing on the handling of strings written to :data:`sys.stdout` or
 157    :data:`sys.stderr`.)
 158
 159
 160 These APIs can be used for fast direct character conversions:
 161
 162
 163 .. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
 164
 165    Return the character *ch* converted to lower case.
 166
 167
 168 .. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
 169
 170    Return the character *ch* converted to upper case.
 171
 172
 173 .. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
 174
 175    Return the character *ch* converted to title case.
 176
 177
 178 .. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
 179
 180    Return the character *ch* converted to a decimal positive integer.  Return
 181    ``-1`` if this is not possible.  This macro does not raise exceptions.
 182
 183
 184 .. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
 185
 186    Return the character *ch* converted to a single digit integer. Return ``-1`` if
 187    this is not possible.  This macro does not raise exceptions.
 188
 189
 190 .. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
 191
 192    Return the character *ch* converted to a double. Return ``-1.0`` if this is not
 193    possible.  This macro does not raise exceptions.
 194
 195 To create Unicode objects and access their basic sequence properties, use these
 196 APIs:
 197
 198 .. % --- Plain Py_UNICODE ---------------------------------------------------
 199
 200
 201 .. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
 202
 203    Create a Unicode Object from the Py_UNICODE buffer *u* of the given size. *u*
 204    may be *NULL* which causes the contents to be undefined. It is the user's
 205    responsibility to fill in the needed data.  The buffer is copied into the new
 206    object. If the buffer is not *NULL*, the return value might be a shared object.
 207    Therefore, modification of the resulting Unicode object is only allowed when *u*
 208    is *NULL*.
 209
 210
 211 .. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
 212
 213    Create a Unicode Object from the char buffer *u*.  The bytes will be interpreted
 214    as being UTF-8 encoded.  *u* may also be *NULL* which
 215    causes the contents to be undefined. It is the user's responsibility to fill in
 216    the needed data.  The buffer is copied into the new object. If the buffer is not
 217    *NULL*, the return value might be a shared object. Therefore, modification of
 218    the resulting Unicode object is only allowed when *u* is *NULL*.
 219
 220
 221 .. cfunction:: PyObject *PyUnicode_FromString(const char *u)
 222
 223    Create a Unicode object from a UTF-8 encoded null-terminated char buffer
 224    *u*.
 225
 226
 227 .. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
 228
 229    Take a C :cfunc:`printf`\ -style *format* string and a variable number of
 230    arguments, calculate the size of the resulting Python unicode string and return
 231    a string with the values formatted into it.  The variable arguments must be C
 232    types and must correspond exactly to the format characters in the *format*
 233    string.  The following format characters are allowed:
 234
 235    .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
 236    .. % because not all compilers support the %z width modifier -- we fake it
 237    .. % when necessary via interpolating PY_FORMAT_SIZE_T.
 238
 239    +-------------------+---------------------+--------------------------------+
 240    | Format Characters | Type                | Comment                        |
 241    +===================+=====================+================================+
 242    | :attr:`%%`        | *n/a*               | The literal % character.       |
 243    +-------------------+---------------------+--------------------------------+
 244    | :attr:`%c`        | int                 | A single character,            |
 245    |                   |                     | represented as a C int.        |
 246    +-------------------+---------------------+--------------------------------+
 247    | :attr:`%d`        | int                 | Exactly equivalent to          |
 248    |                   |                     | ``printf("%d")``.              |
 249    +-------------------+---------------------+--------------------------------+
 250    | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
 251    |                   |                     | ``printf("%u")``.              |
 252    +-------------------+---------------------+--------------------------------+
 253    | :attr:`%ld`       | long                | Exactly equivalent to          |
 254    |                   |                     | ``printf("%ld")``.             |
 255    +-------------------+---------------------+--------------------------------+
 256    | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
 257    |                   |                     | ``printf("%lu")``.             |
 258    +-------------------+---------------------+--------------------------------+
 259    | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
 260    |                   |                     | ``printf("%zd")``.             |
 261    +-------------------+---------------------+--------------------------------+
 262    | :attr:`%zu`       | size_t              | Exactly equivalent to          |
 263    |                   |                     | ``printf("%zu")``.             |
 264    +-------------------+---------------------+--------------------------------+
 265    | :attr:`%i`        | int                 | Exactly equivalent to          |
 266    |                   |                     | ``printf("%i")``.              |
 267    +-------------------+---------------------+--------------------------------+
 268    | :attr:`%x`        | int                 | Exactly equivalent to          |
 269    |                   |                     | ``printf("%x")``.              |
 270    +-------------------+---------------------+--------------------------------+
 271    | :attr:`%s`        | char\*              | A null-terminated C character  |
 272    |                   |                     | array.                         |
 273    +-------------------+---------------------+--------------------------------+
 274    | :attr:`%p`        | void\*              | The hex representation of a C  |
 275    |                   |                     | pointer. Mostly equivalent to  |
 276    |                   |                     | ``printf("%p")`` except that   |
 277    |                   |                     | it is guaranteed to start with |
 278    |                   |                     | the literal ``0x`` regardless  |
 279    |                   |                     | of what the platform's         |
 280    |                   |                     | ``printf`` yields.             |
 281    +-------------------+---------------------+--------------------------------+
 282    | :attr:`%A`        | PyObject\*          | The result of calling          |
 283    |                   |                     | :func:`ascii`.                 |
 284    +-------------------+---------------------+--------------------------------+
 285    | :attr:`%U`        | PyObject\*          | A unicode object.              |
 286    +-------------------+---------------------+--------------------------------+
 287    | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
 288    |                   |                     | *NULL*) and a null-terminated  |
 289    |                   |                     | C character array as a second  |
 290    |                   |                     | parameter (which will be used, |
 291    |                   |                     | if the first parameter is      |
 292    |                   |                     | *NULL*).                       |
 293    +-------------------+---------------------+--------------------------------+
 294    | :attr:`%S`        | PyObject\*          | The result of calling          |
 295    |                   |                     | :cfunc:`PyObject_Str`.         |
 296    +-------------------+---------------------+--------------------------------+
 297    | :attr:`%R`        | PyObject\*          | The result of calling          |
 298    |                   |                     | :cfunc:`PyObject_Repr`.        |
 299    +-------------------+---------------------+--------------------------------+
 300
 301    An unrecognized format character causes all the rest of the format string to be
 302    copied as-is to the result string, and any extra arguments discarded.
 303
 304
 305 .. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
 306
 307    Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
 308    arguments.
 309
 310
 311 .. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
 312
 313    Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
 314    buffer, or *NULL* if *unicode* is not a Unicode object.
 315
 316
 317 .. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
 318
 319    Return the length of the Unicode object.
 320
 321
 322 .. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
 323
 324    Coerce an encoded object *obj* to a Unicode object and return a reference with
 325    incremented refcount.
 326
 327    String and other char buffer compatible objects are decoded according to the
 328    given encoding and using the error handling defined by errors.  Both can be
 329    *NULL* to have the interface use the default values (see the next section for
 330    details).
 331
 332    All other objects, including Unicode objects, cause a :exc:`TypeError` to be
 333    set.
 334
 335    The API returns *NULL* if there was an error.  The caller is responsible for
 336    decref'ing the returned objects.
 337
 338
 339 .. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
 340
 341    Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
 342    throughout the interpreter whenever coercion to Unicode is needed.
 343
 344 If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
 345 Python can interface directly to this type using the following functions.
 346 Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
 347 the system's :ctype:`wchar_t`.
 348
 349 .. % --- wchar_t support for platforms which support it ---------------------
 350
 351
 352 .. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
 353
 354    Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
 355    Passing -1 as the size indicates that the function must itself compute the length,
 356    using wcslen.
 357    Return *NULL* on failure.
 358
 359
 360 .. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
 361
 362    Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*.  At most
 363    *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
 364    0-termination character).  Return the number of :ctype:`wchar_t` characters
 365    copied or -1 in case of an error.  Note that the resulting :ctype:`wchar_t`
 366    string may or may not be 0-terminated.  It is the responsibility of the caller
 367    to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
 368    required by the application.
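
One way to handle the termination caveat is to reserve the last slot of the
buffer and write the terminator explicitly.  The sketch below is plain C and
does not call into Python: ``copy_wide`` is a hypothetical stand-in that only
mimics the contract described above (copy at most *size* characters, return the
count, never write a terminator), and ``copy_wide_terminated`` shows the
caller-side pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the contract described above: copy at most
   `size` wchar_t characters from src (of length src_len) into w and return
   the number copied.  It never writes a terminating 0. */
ptrdiff_t copy_wide(const wchar_t *src, size_t src_len,
                    wchar_t *w, size_t size)
{
    size_t n = src_len < size ? src_len : size;
    for (size_t i = 0; i < n; i++)
        w[i] = src[i];
    return (ptrdiff_t)n;
}

/* Caller-side pattern: offer only buf_len - 1 slots to the copy routine,
   then terminate the result explicitly. */
ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t src_len,
                               wchar_t *buf, size_t buf_len)
{
    assert(buf_len > 0);                 /* need room for the terminator */
    ptrdiff_t n = copy_wide(src, src_len, buf, buf_len - 1);
    buf[n] = L'\0';
    return n;
}
```

With a real Unicode object the same pattern would apply: pass the buffer
length minus one as *size* and store the terminator at the returned index.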
 369
 370
 371 .. _builtincodecs:
 372
 373 Built-in Codecs
 374 ^^^^^^^^^^^^^^^
 375
 376 Python provides a set of built-in codecs which are written in C for speed. All of
 377 these codecs are directly usable via the following functions.
 378
 379 Many of the following APIs take two arguments, *encoding* and *errors*. These
 380 parameters have the same semantics as the ones of the built-in :func:`str`
 381 string object constructor.
 382
 383 Setting *encoding* to *NULL* causes the default encoding to be used,
 384 which is UTF-8.  The file system calls should use
 385 :cfunc:`PyUnicode_FSConverter` for encoding file names. This uses the
 386 variable :cdata:`Py_FileSystemDefaultEncoding` internally. This
 387 variable should be treated as read-only: On some systems, it will be a
 388 pointer to a static string, on others, it will change at run-time
 389 (such as when the application invokes setlocale).
 390
 391 Error handling is set by *errors*, which may also be *NULL*, meaning to use
 392 the default handling defined for the codec.  Default error handling for all
 393 built-in codecs is "strict" (:exc:`ValueError` is raised).
 394
 395 The codecs all use a similar interface.  For simplicity, only deviations from
 396 the following generic ones are documented.
 397
 398 These are the generic codec APIs:
 399
 400 .. % --- Generic Codecs -----------------------------------------------------
 401
 402
 403 .. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
 404
 405    Create a Unicode object by decoding *size* bytes of the encoded string *s*.
 406    *encoding* and *errors* have the same meaning as the parameters of the same name
 407    in the :func:`str` built-in function.  The codec to be used is looked up
 408    using the Python codec registry.  Return *NULL* if an exception was raised by
 409    the codec.
 410
 411
 412 .. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
 413
 414    Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
 415    bytes object.  *encoding* and *errors* have the same meaning as the
 416    parameters of the same name in the Unicode :meth:`encode` method.  The codec
 417    to be used is looked up using the Python codec registry.  Return *NULL* if an
 418    exception was raised by the codec.
 419
 420
 421 .. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
 422
 423    Encode a Unicode object and return the result as Python bytes object.
 424    *encoding* and *errors* have the same meaning as the parameters of the same
 425    name in the Unicode :meth:`encode` method. The codec to be used is looked up
 426    using the Python codec registry. Return *NULL* if an exception was raised by
 427    the codec.
 428
 429 These are the UTF-8 codec APIs:
 430
 431 .. % --- UTF-8 Codecs -------------------------------------------------------
 432
 433
 434 .. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
 435
 436    Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
 437    *s*. Return *NULL* if an exception was raised by the codec.
 438
 439
 440 .. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
 441
 442    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
 443    *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
 444    treated as an error. Those bytes will not be decoded and the number of bytes
 445    that have been decoded will be stored in *consumed*.
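
The idea behind *consumed* can be illustrated in plain C.  The following is a
minimal sketch, not CPython's decoder: the hypothetical helper
``utf8_complete_prefix`` only reports how many leading bytes of a buffer can be
decoded without cutting the last multi-byte sequence short, and leaves genuine
encoding errors to the real decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Return the number of leading bytes of s that do not end in a truncated
   UTF-8 sequence.  This mirrors the role of *consumed*; it does not
   validate the data. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    assert(s != NULL || size == 0);
    size_t i = size;
    size_t back = 0;                       /* bytes inspected from the end */
    while (i > 0 && back < 4) {
        unsigned char b = s[--i];
        back++;
        if ((b & 0xC0) != 0x80) {          /* found a non-continuation byte */
            size_t need;
            if (b < 0x80)                need = 1;
            else if ((b & 0xE0) == 0xC0) need = 2;
            else if ((b & 0xF0) == 0xE0) need = 3;
            else if ((b & 0xF8) == 0xF0) need = 4;
            else return size;              /* invalid lead byte: not truncation */
            /* Fewer bytes present than the lead byte announces: stop before it. */
            return (back < need) ? i : size;
        }
    }
    return size;    /* empty input, or only continuation bytes (a real error) */
}
```

A caller feeding data in chunks would keep the bytes after the returned offset
and prepend them to the next chunk.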
 446
 447
 448 .. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 449
 450    Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
 451    return a Python bytes object.  Return *NULL* if an exception was raised by
 452    the codec.
 453
 454
 455 .. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
 456
 457    Encode a Unicode object using UTF-8 and return the result as Python bytes
 458    object.  Error handling is "strict".  Return *NULL* if an exception was
 459    raised by the codec.
 460
 461 These are the UTF-32 codec APIs:
 462
 463 .. % --- UTF-32 Codecs ------------------------------------------------------
 464
 465
 466 .. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
 467
 468    Decode *size* bytes from a UTF-32 encoded buffer string and return the
 469    corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
 470    handling. It defaults to "strict".
 471
 472    If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
 473    order::
 474
 475       *byteorder == -1: little endian
 476       *byteorder == 0:  native order
 477       *byteorder == 1:  big endian
 478
 479    If ``*byteorder`` is zero, and the first four bytes of the input data are a
 480    byte order mark (BOM), the decoder switches to this byte order and the BOM is
 481    not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
 482    ``1``, any byte order mark is copied to the output.
 483
 484    After completion, ``*byteorder`` is set to the current byte order at the end
 485    of input data.
 486
 487    In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.
 488
 489    If *byteorder* is *NULL*, the codec starts in native order mode.
 490
 491    Return *NULL* if an exception was raised by the codec.
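
The byte order rules above can be sketched in plain C.  This is an illustration
under stated assumptions, not CPython's implementation: the helper name
``utf32_skip_bom`` is hypothetical, and it models only the decision of whether
a leading BOM is consumed and how ``*byteorder`` is updated.

```c
#include <assert.h>
#include <stddef.h>

/* If *byteorder is 0 and the input starts with a UTF-32 BOM, switch
   *byteorder to -1 (little endian) or 1 (big endian) and return 4, the
   number of bytes to skip.  Otherwise leave *byteorder alone and return 0,
   so that with an explicit byte order any BOM stays in the data. */
size_t utf32_skip_bom(const unsigned char *s, size_t size, int *byteorder)
{
    assert(byteorder != NULL);
    if (*byteorder != 0 || size < 4)
        return 0;                      /* explicit order given, or too short */
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
        *byteorder = -1;               /* little endian BOM */
        return 4;
    }
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
        *byteorder = 1;                /* big endian BOM */
        return 4;
    }
    return 0;                          /* no BOM: stay in native order */
}
```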
 492
 493
 494 .. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
 495
 496    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
 497    *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
 498    trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
 499    by four) as an error. Those bytes will not be decoded and the number of bytes
 500    that have been decoded will be stored in *consumed*.
 501
 502
 503 .. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
 504
 505    Return a Python bytes object holding the UTF-32 encoded value of the Unicode
 506    data in *s*.  Output is written according to the following byte order::
 507
 508       byteorder == -1: little endian
 509       byteorder == 0:  native byte order (writes a BOM mark)
 510       byteorder == 1:  big endian
 511
 512    If *byteorder* is ``0``, the output always starts with the Unicode byte order
 513    mark (BOM, U+FEFF). In the other two modes, no BOM is prepended.
 514
 515    If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
 516    as a single codepoint.
 517
 518    Return *NULL* if an exception was raised by the codec.
 519
 520
 521 .. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
 522
 523    Return a Python bytes object using the UTF-32 encoding in native byte
 524    order. The output always starts with a BOM.  Error handling is "strict".
 525    Return *NULL* if an exception was raised by the codec.
 526
 527
 528 These are the UTF-16 codec APIs:
 529
 530 .. % --- UTF-16 Codecs ------------------------------------------------------
 531
 532
 533 .. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
 534
 535    Decode *size* bytes from a UTF-16 encoded buffer string and return the
 536    corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
 537    handling. It defaults to "strict".
 538
 539    If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
 540    order::
 541
 542       *byteorder == -1: little endian
 543       *byteorder == 0:  native order
 544       *byteorder == 1:  big endian
 545
 546    If ``*byteorder`` is zero, and the first two bytes of the input data are a
 547    byte order mark (BOM), the decoder switches to this byte order and the BOM is
 548    not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
 549    ``1``, any byte order mark is copied to the output (where it will result in
 550    either a ``\ufeff`` or a ``\ufffe`` character).
 551
 552    After completion, ``*byteorder`` is set to the current byte order at the end
 553    of input data.
 554
 555    If *byteorder* is *NULL*, the codec starts in native order mode.
 556
 557    Return *NULL* if an exception was raised by the codec.
 558
 559
 560 .. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
 561
 562    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
 563    *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
 564    trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
 565    split surrogate pair) as an error. Those bytes will not be decoded and the
 566    number of bytes that have been decoded will be stored in *consumed*.
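
The two truncation cases named above (an odd number of bytes and a split
surrogate pair) can be sketched in plain C.  The helper
``utf16_complete_prefix`` and its ``little_endian`` flag are hypothetical names
used for illustration; this is not how CPython computes *consumed*
internally, only a model of the resulting value.

```c
#include <assert.h>
#include <stddef.h>

/* Return how many leading bytes of s can be decoded without ending in an
   incomplete UTF-16 sequence. */
size_t utf16_complete_prefix(const unsigned char *s, size_t size,
                             int little_endian)
{
    assert(s != NULL || size == 0);
    size_t n = size - (size % 2);          /* drop an odd trailing byte */
    if (n >= 2) {
        /* High-order byte of the last complete 16-bit unit. */
        unsigned hi = little_endian ? s[n - 1] : s[n - 2];
        if (hi >= 0xD8 && hi <= 0xDB)      /* unit in D800..DBFF: high surrogate */
            n -= 2;                        /* its low surrogate is still missing */
    }
    return n;
}
```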
 567
 568
 569 .. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
 570
 571    Return a Python bytes object holding the UTF-16 encoded value of the Unicode
 572    data in *s*.  Output is written according to the following byte order::
 573
 574       byteorder == -1: little endian
 575       byteorder == 0:  native byte order (writes a BOM mark)
 576       byteorder == 1:  big endian
 577
 578    If *byteorder* is ``0``, the output always starts with the Unicode byte order
 579    mark (BOM, U+FEFF). In the other two modes, no BOM is prepended.
 580
 581    If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
 582    represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
 583    value is interpreted as a UCS-2 character.
 584
 585    Return *NULL* if an exception was raised by the codec.
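
How a single 32-bit value turns into a surrogate pair can be shown with a short
plain C sketch.  The names ``put_unit`` and ``utf16_encode_cp`` are hypothetical,
no BOM is written, and the code illustrates the UTF-16 encoding form only; it
is not the CPython encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store one 16-bit code unit in the requested byte order. */
void put_unit(unsigned char *out, uint16_t unit, int big_endian)
{
    out[0] = (unsigned char)(big_endian ? unit >> 8 : unit & 0xFF);
    out[1] = (unsigned char)(big_endian ? unit & 0xFF : unit >> 8);
}

/* Encode one code point as UTF-16.  Returns the number of bytes written
   (2 or 4), or 0 if cp is outside the Unicode range. */
size_t utf16_encode_cp(uint32_t cp, int big_endian, unsigned char out[4])
{
    assert(out != NULL);
    if (cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {                    /* fits in one code unit */
        put_unit(out, (uint16_t)cp, big_endian);
        return 2;
    }
    cp -= 0x10000;                         /* 20 bits remain */
    put_unit(out,     (uint16_t)(0xD800 | (cp >> 10)),   big_endian);
    put_unit(out + 2, (uint16_t)(0xDC00 | (cp & 0x3FF)), big_endian);
    return 4;
}
```

For example, U+1F600 becomes the units ``0xD83D 0xDE00``, written as the bytes
``D8 3D DE 00`` in big endian order.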
 586
 587
 588 .. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
 589
 590    Return a Python bytes object using the UTF-16 encoding in native byte
 591    order. The output always starts with a BOM.  Error handling is "strict".
 592    Return *NULL* if an exception was raised by the codec.
 593
 594 These are the "Unicode Escape" codec APIs:
 595
 596 .. % --- Unicode-Escape Codecs ----------------------------------------------
 597
 598
 599 .. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
 600
 601    Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
 602    string *s*.  Return *NULL* if an exception was raised by the codec.
 603
 604
 605 .. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
 606
 607    Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
 608    return a Python bytes object.  Return *NULL* if an exception was raised by the
 609    codec.
 610
 611
 612 .. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
 613
 614    Encode a Unicode object using Unicode-Escape and return the result as Python
 615    bytes object.  Error handling is "strict". Return *NULL* if an exception was
 616    raised by the codec.
 617
 618 These are the "Raw Unicode Escape" codec APIs:
 619
 620 .. % --- Raw-Unicode-Escape Codecs ------------------------------------------
 621
 622
 623 .. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
 624
 625    Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
 626    encoded string *s*.  Return *NULL* if an exception was raised by the codec.
 627
 628
 629 .. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 630
 631    Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
 632    and return a Python bytes object.  Return *NULL* if an exception was raised by
 633    the codec.
 634
 635
 636 .. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
 637
 638    Encode a Unicode object using Raw-Unicode-Escape and return the result as
 639    Python bytes object. Error handling is "strict". Return *NULL* if an exception
 640    was raised by the codec.
 641
 642 These are the Latin-1 codec APIs.  Latin-1 corresponds to the first 256 Unicode
 643 ordinals and only these are accepted by the codecs during encoding.
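
The restriction to the first 256 ordinals can be illustrated with a minimal
plain C sketch.  The helper ``latin1_encode`` is hypothetical and models only
"strict" handling, reporting the position of the first ordinal that cannot be
encoded instead of raising an exception.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode `size` code points to Latin-1 bytes in out.  Returns 0 on success.
   On failure returns -1 and stores the index of the first code point above
   0xFF in *bad. */
int latin1_encode(const uint32_t *s, size_t size,
                  unsigned char *out, size_t *bad)
{
    assert(out != NULL && bad != NULL);
    for (size_t i = 0; i < size; i++) {
        if (s[i] > 0xFF) {
            *bad = i;                      /* outside the Latin-1 range */
            return -1;
        }
        out[i] = (unsigned char)s[i];      /* ordinal and byte value coincide */
    }
    return 0;
}
```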
 644
 645 .. % --- Latin-1 Codecs -----------------------------------------------------
 646
 647
 648 .. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
 649
 650    Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
 651    *s*.  Return *NULL* if an exception was raised by the codec.
 652
 653
 654 .. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 655
 656    Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
 657    return a Python bytes object.  Return *NULL* if an exception was raised by
 658    the codec.
 659
 660
 661 .. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
 662
 663    Encode a Unicode object using Latin-1 and return the result as Python bytes
 664    object.  Error handling is "strict".  Return *NULL* if an exception was
 665    raised by the codec.
 666
 667 These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
 668 codes generate errors.
 669
 670 .. % --- ASCII Codecs -------------------------------------------------------
 671
 672
 673 .. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
 674
 675    Create a Unicode object by decoding *size* bytes of the ASCII encoded string
 676    *s*.  Return *NULL* if an exception was raised by the codec.
 677
 678
 679 .. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 680
 681    Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
 682    return a Python bytes object.  Return *NULL* if an exception was raised by
 683    the codec.
 684
 685
 686 .. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
 687
 688    Encode a Unicode object using ASCII and return the result as a Python bytes
 689    object.  Error handling is "strict".  Return *NULL* if an exception was
 690    raised by the codec.
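
The effect of the *errors* argument is easiest to see from Python. The sketch
below assumes that the error handler names accepted by :meth:`bytes.decode`
correspond to the *errors* strings passed to the C functions; the input bytes
are hypothetical.

```python
# Python-level sketch of the error handlers; the input data is made up.
data = b"abc\xff"                     # 0xFF is not 7-bit ASCII

try:
    data.decode("ascii")              # "strict" is the default handler
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = data.decode("ascii", "replace")   # bad byte becomes U+FFFD
ignored = data.decode("ascii", "ignore")     # bad byte is dropped
```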
 691
 692 These are the mapping codec APIs:
 693
 694 .. % --- Character Map Codecs -----------------------------------------------
 695
 696 This codec is special in that it can be used to implement many different codecs
 697 (and this is in fact what was done to obtain most of the standard codecs
 698 included in the :mod:`encodings` package). The codec uses mappings to encode and
 699 decode characters.
 700
 701 Decoding mappings must map single byte values (integers in the range 0 to 255)
 702 to Unicode characters, integers (which are then interpreted as Unicode ordinals)
 703 or ``None`` (meaning "undefined mapping" and causing an error).
 704
 705 Encoding mappings must map Unicode ordinals (integers) to bytes objects,
 706 integers (which are then interpreted as Latin-1 ordinals) or ``None``
 707 (meaning "undefined mapping" and causing an error).
 708
 709 The mapping objects provided need only support the :meth:`__getitem__` mapping
 710 interface.
 711
 712 If a character lookup fails with a :exc:`LookupError`, the character is treated
 713 as having an undefined mapping, exactly as if the mapping had returned ``None``,
 714 and the error handler is invoked. Mappings must therefore contain an entry for
 715 every character the codec is expected to handle.
 716
 717
 718 .. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
 719
 720    Create a Unicode object by decoding *size* bytes of the encoded string *s* using
 721    the given *mapping* object.  Return *NULL* if an exception was raised by the
 722    codec. If *mapping* is *NULL*, Latin-1 decoding will be done. Otherwise it can
 723    be a dictionary mapping byte values to characters, or a Unicode string that is
 724    treated as a lookup table. Byte values greater than or equal to the length of
 725    the string and U+FFFE "characters" are treated as "undefined mapping".
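
A rough Python-level illustration of charmap decoding follows. It relies on
``codecs.charmap_decode``, the helper used by the modules in the
:mod:`encodings` package, and assumes it follows the same rules as the C
function. The tables and byte strings are hypothetical.

```python
import codecs

# String lookup table: byte 0 -> "A", byte 1 -> "B", byte 2 -> "C".
table = "ABC"
text, consumed = codecs.charmap_decode(b"\x02\x00\x01", "strict", table)

# Dictionary keyed by byte value; values are a character or a Unicode ordinal.
mapping = {0x00: "x", 0x01: 0x79}     # 0x79 is the ordinal of "y"
from_dict, _ = codecs.charmap_decode(b"\x00\x01", "strict", mapping)

# Byte 3 lies past the end of the 3-character table: undefined mapping.
try:
    codecs.charmap_decode(b"\x03", "strict", table)
    out_of_range_failed = False
except UnicodeDecodeError:
    out_of_range_failed = True

# With the "replace" handler the undefined byte becomes U+FFFD instead.
replaced, _ = codecs.charmap_decode(b"\x00\x03", "replace", table)
```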
 726
 727
 728 .. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
 729
 730    Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
 731    *mapping* object and return a Python bytes object. Return *NULL* if an
 732    exception was raised by the codec.
 733
 734
 735 .. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
 736
 737    Encode a Unicode object using the given *mapping* object and return the result
 738    as a Python bytes object.  Error handling is "strict".  Return *NULL* if an
 739    exception was raised by the codec.
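
Charmap encoding can be sketched in the same way with
``codecs.charmap_encode``, again on the assumption that this Python-level
helper mirrors the C functions. The mapping below is hypothetical; its keys are
Unicode ordinals and its values show the three kinds of entry.

```python
import codecs

mapping = {
    ord("a"): 0x01,      # integer: used as the output byte value
    ord("b"): b"\x02",   # bytes object: copied to the output
    ord("c"): None,      # None: undefined mapping
}

encoded, consumed = codecs.charmap_encode("ab", "strict", mapping)

try:
    codecs.charmap_encode("c", "strict", mapping)
    none_failed = False
except UnicodeEncodeError:
    none_failed = True
```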
 740
 741 The following codec API is special in that it maps Unicode to Unicode.
 742
 743
 744 .. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
 745
 746    Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
 747    character mapping *table* to it and return the resulting Unicode object.  Return
 748    *NULL* when an exception was raised by the codec.
 749
 750    The *table* must map Unicode ordinal integers to Unicode ordinal
 751    integers or ``None`` (causing deletion of the character).
 752
 753    Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
 754    and sequences work well.  Unmapped character ordinals (ones which cause a
 755    :exc:`LookupError`) are left untouched and are copied as-is.
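
The table semantics are the same as those of :meth:`str.translate`, so they can
be sketched from Python. The table and input string below are made up for
illustration.

```python
# Keys are Unicode ordinals; values are ordinals or None (delete the character).
table = {
    ord("l"): ord("L"),   # replace "l" with "L"
    ord("o"): None,       # delete "o"
}

# Characters without an entry raise LookupError internally and are copied as-is.
result = "hello world".translate(table)
```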
 756
 757
 758 These are the MBCS codec APIs. They are currently only available on Windows and
 759 use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
 760 DBCS) is a class of encodings, not just one.  The target encoding is defined by
 761 the user settings on the machine running the codec.
 762
 763 .. % --- MBCS codecs for Windows --------------------------------------------
 764
 765
 766 .. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
 767
 768    Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
 769    Return *NULL* if an exception was raised by the codec.
 770
 771
 772 .. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
 773
 774    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
 775    *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
 776    a trailing lead byte, and the number of bytes that have been decoded will be
 777    stored in *consumed*.
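
The idea of holding back a trailing lead byte can be sketched portably with an
incremental decoder. The helper ``decode_stateful`` below is hypothetical, and
UTF-8 merely stands in for a Windows MBCS code page; the sketch shows the
*consumed* concept, not the MBCS codec itself.

```python
import codecs

def decode_stateful(data, encoding="utf-8"):
    """Decode as much of data as possible; return (text, bytes_consumed)."""
    decoder = codecs.getincrementaldecoder(encoding)()
    # final=False: an incomplete trailing sequence is buffered, not an error.
    text = decoder.decode(data, final=False)
    pending, _ = decoder.getstate()    # bytes the decoder is still holding back
    return text, len(data) - len(pending)

# Drop the last byte so that the data ends with a dangling lead byte (0xC3).
chunk = "caf\u00e9".encode("utf-8")[:-1]
text, consumed = decode_stateful(chunk)
```

A caller would keep the unconsumed tail and prepend it to the next chunk of
input.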
 778
 779
 780 .. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 781
 782    Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
 783    a Python bytes object.  Return *NULL* if an exception was raised by the
 784    codec.
 785
 786
 787 .. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
 788
 789    Encode a Unicode object using MBCS and return the result as a Python bytes
 790    object.  Error handling is "strict".  Return *NULL* if an exception was
 791    raised by the codec.
 792
 793 For decoding file names and other environment strings, :cdata:`Py_FileSystemDefaultEncoding`
 794 should be used as the encoding, and ``"surrogateescape"`` should be used as the error
 795 handler. For encoding file names during argument parsing, the ``O&`` converter should
 796 be used, passing :cfunc:`PyUnicode_FSConverter` as the conversion function:
 797
 798 .. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)
 799
 800    Convert *obj* into *result*, using the file system encoding and the ``surrogateescape``
 801    error handler. *result* must be the address of a ``PyObject*`` variable; it receives a
 802    bytes or bytearray object, which must be released when it is no longer used.
 803
 804    .. versionadded:: 3.1
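
The reason for pairing the file system encoding with ``surrogateescape`` is that
undecodable bytes survive a round trip. The sketch below assumes a UTF-8 file
system encoding and uses a hypothetical file name.

```python
# Hypothetical file name containing a byte (0xFF) that is not valid UTF-8.
raw_name = b"report-\xff.txt"

# The bad byte is smuggled into the string as the lone surrogate U+DCFF ...
as_text = raw_name.decode("utf-8", "surrogateescape")

# ... and encoding with the same handler restores the original bytes.
round_trip = as_text.encode("utf-8", "surrogateescape")
```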
 805
 806 .. % --- Methods & Slots ----------------------------------------------------
 807
 808
 809 .. _unicodemethodsandslots:
 810
 811 Methods and Slot Functions
 812 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 813
 814 The following APIs are capable of handling Unicode objects and strings on input
 815 (we refer to them as strings in the descriptions) and return Unicode objects or
 816 integers as appropriate.
 817
 818 They all return *NULL* or ``-1`` if an exception occurs.
 819
 820
 821 .. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
 822
 823    Concatenate two strings, giving a new Unicode string.
 824
 825
 826 .. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
 827
 828    Split a string giving a list of Unicode strings.  If *sep* is *NULL*, splitting
 829    will be done at all whitespace substrings.  Otherwise, splits occur at the given
 830    separator.  At most *maxsplit* splits will be done.  If negative, no limit is
 831    set.  Separators are not included in the resulting list.
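
These rules match :meth:`str.split`, where passing ``None`` as the separator
plays the role of a *NULL* *sep*. A short sketch with made-up input:

```python
line = "alpha  beta\tgamma delta"

on_whitespace = line.split()       # no separator: split at runs of whitespace
limited = line.split(None, 1)      # at most one split; the rest stays intact
on_comma = "a,b,,c".split(",")     # explicit separator; empty fields are kept
```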
 832
 833
 834 .. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
 835
 836    Split a Unicode string at line breaks, returning a list of Unicode strings.
 837    CRLF is considered to be one line break.  If *keepend* is 0, the line break
 838    characters are not included in the resulting strings.
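
The effect of *keepend* corresponds to the argument of :meth:`str.splitlines`.
A brief sketch with a made-up string:

```python
text = "one\r\ntwo\nthree"

without_ends = text.splitlines()       # like keepend == 0
with_ends = text.splitlines(True)      # like a non-zero keepend
```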
 839
 840
 841 .. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
 842
 843    Translate a string by applying a character mapping table to it and return the
 844    resulting Unicode object.
 845
 846    The mapping table must map Unicode ordinal integers to Unicode ordinal integers
 847    or None (causing deletion of the character).
 848
 849    Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
 850    and sequences work well.  Unmapped character ordinals (ones which cause a
 851    :exc:`LookupError`) are left untouched and are copied as-is.
 852
 853    *errors* has the usual meaning for codecs. It may be *NULL* which indicates to
 854    use the default error handling.
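
Because only :meth:`__getitem__` is required, the table does not have to be a
dictionary. The sketch below uses :meth:`str.translate`, which is assumed to
follow the same rules; the class ``UpperVowels`` and the inputs are
hypothetical.

```python
class UpperVowels:
    """Table object that supports nothing but __getitem__."""

    def __getitem__(self, ordinal):
        char = chr(ordinal)
        if char in "aeiou":
            return ord(char.upper())
        raise LookupError(ordinal)     # unmapped: the character is copied as-is

from_object = "education".translate(UpperVowels())

# A sequence is indexed by ordinal; ordinal 2 is out of range and stays as-is.
from_sequence = "\x00\x01\x02".translate([ord("a"), ord("b")])
```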
 855
 856
 857 .. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
 858
 859    Join a sequence of strings using the given separator and return the resulting
 860    Unicode string.
 861
 862
 863 .. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
 864
 865    Return 1 if *substr* matches ``str[start:end]`` at the given tail end
 866    (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
 867    0 otherwise. Return ``-1`` if an error occurred.
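
The meaning of *direction* can be sketched with :meth:`str.startswith` and
:meth:`str.endswith`. The helper ``tailmatch`` below is hypothetical and only
approximates the C function; it does not model the ``-1`` error return.

```python
def tailmatch(s, substr, start, end, direction):
    """Return 1 if substr matches s[start:end] at the chosen end, else 0."""
    if direction < 0:
        return int(s.startswith(substr, start, end))   # prefix match
    return int(s.endswith(substr, start, end))         # suffix match

name = "filename.txt"
```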
 868
 869
 870 .. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
 871
 872    Return the first position of *substr* in ``str[start:end]`` using the given
 873    *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
 874    backward search).  The return value is the index of the first match; a value of
 875    ``-1`` indicates that no match was found, and ``-2`` indicates that an error
 876    occurred and an exception has been set.
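
Forward and backward searches correspond to :meth:`str.find` and
:meth:`str.rfind`. The helper ``find`` below is a hypothetical sketch; it covers
the ``-1`` "not found" result but not the ``-2`` error return.

```python
def find(s, substr, start, end, direction):
    """Return the index of the match in s, or -1 if there is none."""
    if direction > 0:
        return s.find(substr, start, end)    # forward search
    return s.rfind(substr, start, end)       # backward search

text = "abcabc"
```

Note that the returned index is relative to the start of the whole string, not
to *start*.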
 877
 878
 879 .. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
 880
 881    Return the number of non-overlapping occurrences of *substr* in
 882    ``str[start:end]``.  Return ``-1`` if an error occurred.
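
"Non-overlapping" means that a matched region is not reused for the next match.
A sketch using :meth:`str.count`, which is assumed to count the same way:

```python
overlapping_candidates = "aaaa".count("aa")   # matches at 0 and 2, not at 1
whole = "banana".count("an")                  # matches at 1 and 3
sliced = "banana".count("an", 2, 6)           # only "nana" is searched
```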
 883
 884
 885 .. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
 886
 887    Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
 888    return the resulting Unicode object. *maxcount* == -1 means replace all
 889    occurrences.
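
The role of *maxcount* matches the count argument of :meth:`str.replace`. A
sketch with a made-up string:

```python
text = "a-b-c-d"

first_two = text.replace("-", "+", 2)     # replace at most two occurrences
everything = text.replace("-", "+", -1)   # -1: replace all occurrences
```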
 890
 891
 892 .. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
 893
 894    Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
 895    respectively.
 896
 897
 898 .. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)
 899
 900    Compare a unicode object, *uni*, with *string* and return -1, 0, 1 for less
 901    than, equal, and greater than, respectively.
 902
 903
 904 .. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
 905
 906    Rich compare two Unicode strings and return one of the following:
 907
 908    * ``NULL`` in case an exception was raised
 909    * :const:`Py_True` or :const:`Py_False` for successful comparisons
 910    * :const:`Py_NotImplemented` in case the type combination is unknown
 911
 912    Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
 913    :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
 914    with a :exc:`UnicodeDecodeError`.
 915
 916    Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
 917    :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
 918
 919
 920 .. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
 921
 922    Return a new string object from *format* and *args*; this is analogous to
 923    ``format % args``.  The *args* argument must be a tuple.
 924
 925
 926 .. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
 927
 928    Check whether *element* is contained in *container* and return true or false
 929    accordingly.
 930
 931    *element* has to coerce to a one element Unicode string. ``-1`` is returned if
 932    there was an error.
 933
 934
 935 .. cfunction:: void PyUnicode_InternInPlace(PyObject **string)
 936
 937    Intern the argument *\*string* in place.  The argument must be the address of a
 938    pointer variable pointing to a Python unicode string object.  If there is an
 939    existing interned string that is the same as *\*string*, it sets *\*string* to
 940    it (decrementing the reference count of the old string object and incrementing
 941    the reference count of the interned string object), otherwise it leaves
 942    *\*string* alone and interns it (incrementing its reference count).
 943    (Clarification: even though there is a lot of talk about reference counts, think
 944    of this function as reference-count-neutral; you own the object after the call
 945    if and only if you owned it before the call.)
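
The observable effect of interning can be sketched with :func:`sys.intern`,
which is assumed here to use the same interning table. The strings are built at
run time so that they start out as distinct objects.

```python
import sys

# Two equal strings assembled separately at run time.
first = "".join(["inter", "ned"])
second = "".join(["in", "terned"])

# After interning, both names refer to the single interned object.
first = sys.intern(first)
second = sys.intern(second)
```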
 946
 947
 948 .. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)
 949
 950    A combination of :cfunc:`PyUnicode_FromString` and
 951    :cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
 952    that has been interned, or a new ("owned") reference to an earlier interned
 953    string object with the same value.
 954