Doc/library/codecs.rst

   1
   2 :mod:`codecs` --- Codec registry and base classes
   3 =================================================
   4
   5 .. module:: codecs
   6    :synopsis: Encode and decode data and streams.
   7 .. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
   8 .. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
   9 .. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>
  10
  11
  12 .. index::
  13    single: Unicode
  14    single: Codecs
  15    pair: Codecs; encode
  16    pair: Codecs; decode
  17    single: streams
  18    pair: stackable; streams
  19
  20 This module defines base classes for standard Python codecs (encoders and
  21 decoders) and provides access to the internal Python codec registry which
  22 manages the codec and error handling lookup process.
  23
  24 It defines the following functions:
  25
  26
  27 .. function:: register(search_function)
  28
  29    Register a codec search function. Search functions are expected to take one
  30    argument, the encoding name in all lower case letters, and return a
  31    :class:`CodecInfo` object having the following attributes:
  32
  33    * ``name`` The name of the encoding;
  34
  35    * ``encode`` The stateless encoding function;
  36
  37    * ``decode`` The stateless decoding function;
  38
  39    * ``incrementalencoder`` An incremental encoder class or factory function;
  40
  41    * ``incrementaldecoder`` An incremental decoder class or factory function;
  42
  43    * ``streamwriter`` A stream writer class or factory function;
  44
  45    * ``streamreader`` A stream reader class or factory function.
  46
  47    The various functions or classes take the following arguments:
  48
  49    *encode* and *decode*: These must be functions or methods which have the same
  50    interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
  51    Codec Interface). The functions/methods are expected to work in a stateless
  52    mode.
  53
  54    *incrementalencoder* and *incrementaldecoder*: These have to be factory
  55    functions providing the following interface:
  56
  57    ``factory(errors='strict')``
  58
  59    The factory functions must return objects providing the interfaces defined by
  60    the base classes :class:`IncrementalEncoder` and :class:`IncrementalDecoder`,
  61    respectively. Incremental codecs can maintain state.
  62
  63    *streamreader* and *streamwriter*: These have to be factory functions providing
  64    the following interface:
  65
  66    ``factory(stream, errors='strict')``
  67
  68    The factory functions must return objects providing the interfaces defined by
  69    the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
  70    Stream codecs can maintain state.
  71
  72    Possible values for errors are ``'strict'`` (raise an exception in case of an
  73    encoding error), ``'replace'`` (replace malformed data with a suitable
  74    replacement marker, such as ``'?'``), ``'ignore'`` (ignore malformed data and
  75    continue without further notice), ``'xmlcharrefreplace'`` (replace with the
  76    appropriate XML character reference (for encoding only)) and
  77    ``'backslashreplace'`` (replace with backslashed escape sequences (for encoding
  78    only)) as well as any other error handling name defined via
  79    :func:`register_error`.
  80
  81    In case a search function cannot find a given encoding, it should return
  82    ``None``.
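
The search-function protocol described above can be sketched as follows. The
encoding name ``mylatin`` and the function ``mylatin_search`` are made up for
this example; the sketch simply reuses the pieces of the standard ``latin-1``
codec and passes them to :class:`CodecInfo` using the constructor's keyword
names (``encode``, ``decode`` and so on). The ``u''`` and ``b''`` literals only
keep text and byte strings visibly distinct.

```python
import codecs


def mylatin_search(name):
    # The registry passes the requested encoding name in lower case.
    if name != 'mylatin':          # hypothetical encoding name
        return None                # let other search functions try
    base = codecs.lookup('latin-1')
    return codecs.CodecInfo(
        name='mylatin',
        encode=base.encode,
        decode=base.decode,
        incrementalencoder=base.incrementalencoder,
        incrementaldecoder=base.incrementaldecoder,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(mylatin_search)

encoded = u'caf\xe9'.encode('mylatin')
decoded = encoded.decode('mylatin')
```

Returning ``None`` for every other name matters: it is what allows the registry
to continue with the remaining search functions.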
  83
  84
  85 .. function:: lookup(encoding)
  86
  87    Looks up the codec info in the Python codec registry and returns a
  88    :class:`CodecInfo` object as defined above.
  89
  90    Encodings are first looked up in the registry's cache. If not found, the list of
  91    registered search functions is scanned. If no :class:`CodecInfo` object is
  92    found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
  93    is stored in the cache and returned to the caller.
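
As a small illustration of :func:`lookup`, the sketch below resolves a codec,
calls its stateless encoder and shows the :exc:`LookupError` raised for an
unknown name. The name ``no-such-encoding`` is an assumption chosen because no
codec is expected to be registered under it.

```python
import codecs

info = codecs.lookup('UTF8')      # spelling variants resolve to one codec

# The stateless encoder returns a tuple (output, length consumed).
data, consumed = info.encode(u'caf\xe9')

try:
    codecs.lookup('no-such-encoding')   # hypothetical, unregistered name
    found = True
except LookupError:
    found = False
```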
  94
  95 To simplify access to the various codecs, the module provides these additional
  96 functions which use :func:`lookup` for the codec lookup:
  97
  98
  99 .. function:: getencoder(encoding)
 100
 101    Look up the codec for the given encoding and return its encoder function.
 102
 103    Raises a :exc:`LookupError` in case the encoding cannot be found.
 104
 105
 106 .. function:: getdecoder(encoding)
 107
 108    Look up the codec for the given encoding and return its decoder function.
 109
 110    Raises a :exc:`LookupError` in case the encoding cannot be found.
 111
 112
 113 .. function:: getincrementalencoder(encoding)
 114
 115    Look up the codec for the given encoding and return its incremental encoder
 116    class or factory function.
 117
 118    Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
 119    doesn't support an incremental encoder.
 120
 121    .. versionadded:: 2.5
 122
 123
 124 .. function:: getincrementaldecoder(encoding)
 125
 126    Look up the codec for the given encoding and return its incremental decoder
 127    class or factory function.
 128
 129    Raises a :exc:`LookupError` in case the encoding cannot be found or the codec
 130    doesn't support an incremental decoder.
 131
 132    .. versionadded:: 2.5
 133
 134
 135 .. function:: getreader(encoding)
 136
 137    Look up the codec for the given encoding and return its StreamReader class or
 138    factory function.
 139
 140    Raises a :exc:`LookupError` in case the encoding cannot be found.
 141
 142
 143 .. function:: getwriter(encoding)
 144
 145    Look up the codec for the given encoding and return its StreamWriter class or
 146    factory function.
 147
 148    Raises a :exc:`LookupError` in case the encoding cannot be found.
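
The convenience functions above might be used as in the following sketch. It
assumes an in-memory :class:`io.BytesIO` object as the byte stream; any
file-like object opened in binary mode should work the same way.

```python
import codecs
import io

# Stateless encoder function: returns (output, length consumed).
encode = codecs.getencoder('utf-8')
raw, consumed = encode(u'caf\xe9')

# Stream writer factory: called as factory(stream, errors='strict').
backend = io.BytesIO()
writer = codecs.getwriter('utf-8')(backend)
writer.write(u'caf\xe9\n')

# Stream reader factory, reading back what was just written.
reader = codecs.getreader('utf-8')(io.BytesIO(backend.getvalue()))
text = reader.read()
```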
 149
 150
 151 .. function:: register_error(name, error_handler)
 152
 153    Register the error handling function *error_handler* under the name *name*.
 154    *error_handler* will be called during encoding and decoding in case of an error,
 155    when *name* is specified as the errors parameter.
 156
 157    For encoding *error_handler* will be called with a :exc:`UnicodeEncodeError`
 158    instance, which contains information about the location of the error. The error
 159    handler must either raise this or a different exception or return a tuple with a
 160    replacement for the unencodable part of the input and a position where encoding
 161    should continue. The encoder will encode the replacement and continue encoding
 162    the original input at the specified position. Negative position values will be
 163    treated as being relative to the end of the input string. If the resulting
 164    position is out of bounds, an :exc:`IndexError` will be raised.
 165
 166    Decoding and translating work similarly, except that :exc:`UnicodeDecodeError`
 167    or :exc:`UnicodeTranslateError` will be passed to the handler and that the
 168    replacement from the error handler will be put into the output directly.
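
A minimal sketch of a custom error handler follows. The handler name
``underscore`` and the function ``underscore_errors`` are invented for this
example; the handler replaces each unencodable character with ``_`` and tells
the encoder to continue right after the failing span.

```python
import codecs


def underscore_errors(exc):
    # This sketch only deals with encoding errors; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]     # the unencodable part of the input
    return (u'_' * len(bad), exc.end)       # (replacement, resume position)


codecs.register_error('underscore', underscore_errors)   # hypothetical name

result = u'na\xefve caf\xe9'.encode('ascii', 'underscore')
handler = codecs.lookup_error('underscore')
```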
 169
 170
 171 .. function:: lookup_error(name)
 172
 173    Return the error handler previously registered under the name *name*.
 174
 175    Raises a :exc:`LookupError` in case the handler cannot be found.
 176
 177
 178 .. function:: strict_errors(exception)
 179
 180    Implements the ``strict`` error handling.
 181
 182
 183 .. function:: replace_errors(exception)
 184
 185    Implements the ``replace`` error handling.
 186
 187
 188 .. function:: ignore_errors(exception)
 189
 190    Implements the ``ignore`` error handling.
 191
 192
 193 .. function:: xmlcharrefreplace_errors(exception)
 194
 195    Implements the ``xmlcharrefreplace`` error handling.
 196
 197
 198 .. function:: backslashreplace_errors(exception)
 199
 200    Implements the ``backslashreplace`` error handling.
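
The built-in handlers can also be called directly, which makes their return
values visible. The sketch below builds a :exc:`UnicodeEncodeError` by hand (the
reason text is just sample data) and passes it to each handler; every handler
except ``strict`` returns a ``(replacement, position)`` tuple.

```python
import codecs

# Describe the unencodable character at index 3 of u'caf\xe9'.
exc = UnicodeEncodeError('ascii', u'caf\xe9', 3, 4,
                         'ordinal not in range(128)')

replaced = codecs.replace_errors(exc)
ignored = codecs.ignore_errors(exc)
as_xml = codecs.xmlcharrefreplace_errors(exc)
escaped = codecs.backslashreplace_errors(exc)

try:
    codecs.strict_errors(exc)       # re-raises the exception it is given
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
```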
 201
 202 To simplify working with encoded files or streams, the module also defines these
 203 utility functions:
 204
 205
 206 .. function:: open(filename, mode[, encoding[, errors[, buffering]]])
 207
 208    Open an encoded file using the given *mode* and return a wrapped version
 209    providing transparent encoding/decoding.
 210
 211    .. note::
 212
 213       The wrapped version will only accept the object format defined by the codecs,
 214       i.e. Unicode objects for most built-in codecs.  Output is also codec-dependent
 215       and will usually be Unicode as well.
 216
 217    *encoding* specifies the encoding which is to be used for the file.
 218
 219    *errors* may be given to define the error handling. It defaults to ``'strict'``
 220    which causes a :exc:`ValueError` to be raised in case an encoding error occurs.
 221
 222    *buffering* has the same meaning as for the built-in :func:`open` function.  It
 223    defaults to line buffered.
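
A possible round trip through :func:`codecs.open` is sketched below. The
temporary file is only scaffolding for the example, and the file is re-read in
binary mode at the end to show the bytes that were actually stored.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Text written to the wrapper is encoded as UTF-8 on its way to disk.
    with codecs.open(path, 'w', encoding='utf-8') as out:
        out.write(u'caf\xe9\n')

    # Reading through the wrapper decodes the bytes again.
    with codecs.open(path, 'r', encoding='utf-8') as inp:
        text = inp.read()

    # The raw file content, without any decoding.
    with open(path, 'rb') as fh:
        raw = fh.read()
finally:
    os.remove(path)
```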
 224
 225
 226 .. function:: EncodedFile(file, input[, output[, errors]])
 227
 228    Return a wrapped version of file which provides transparent encoding
 229    translation.
 230
 231    Strings written to the wrapped file are interpreted according to the given
 232    *input* encoding and then written to the original file as strings using the
 233    *output* encoding. The intermediate encoding will usually be Unicode but depends
 234    on the specified codecs.
 235
 236    If *output* is not given, it defaults to *input*.
 237
 238    *errors* may be given to define the error handling. It defaults to ``'strict'``,
 239    which causes :exc:`ValueError` to be raised in case an encoding error occurs.
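
The translation performed by :func:`EncodedFile` can be sketched like this,
assuming Latin-1 as the *input* encoding seen by the caller and UTF-8 as the
*output* encoding of the underlying file. An in-memory :class:`io.BytesIO`
object stands in for a real file.

```python
import codecs
import io

# Writing: Latin-1 bytes go in, UTF-8 bytes reach the underlying file.
backend = io.BytesIO()
wrapped = codecs.EncodedFile(backend, 'latin-1', 'utf-8')
wrapped.write(b'caf\xe9')
stored = backend.getvalue()

# Reading: UTF-8 bytes in the file come back as Latin-1 bytes.
reading = codecs.EncodedFile(io.BytesIO(stored), 'latin-1', 'utf-8')
roundtrip = reading.read()
```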
 240
 241
 242 .. function:: iterencode(iterable, encoding[, errors])
 243
 244    Uses an incremental encoder to iteratively encode the input provided by
 245    *iterable*. This function is a :term:`generator`.  *errors* (as well as any
 246    other keyword argument) is passed through to the incremental encoder.
 247
 248    .. versionadded:: 2.5
 249
 250
 251 .. function:: iterdecode(iterable, encoding[, errors])
 252
 253    Uses an incremental decoder to iteratively decode the input provided by
 254    *iterable*. This function is a :term:`generator`.  *errors* (as well as any
 255    other keyword argument) is passed through to the incremental decoder.
 256
 257    .. versionadded:: 2.5
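
Both generators can be tried out with short lists, as in this sketch. The
second list deliberately splits the two-byte UTF-8 sequence for ``é`` across
chunks to show that the incremental decoder carries the pending byte over to
the next chunk.

```python
import codecs

# Encode a sequence of text chunks one at a time.
encoded = list(codecs.iterencode([u'caf', u'\xe9'], 'utf-8'))

# Decode byte chunks whose boundary falls inside a multi-byte sequence.
pieces = [b'caf\xc3', b'\xa9']
decoded = u''.join(codecs.iterdecode(pieces, 'utf-8'))
```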
 258
 259 The module also provides the following constants which are useful for reading
 260 and writing to platform dependent files:
 261
 262
 263 .. data:: BOM
 264           BOM_BE
 265           BOM_LE
 266           BOM_UTF8
 267           BOM_UTF16
 268           BOM_UTF16_BE
 269           BOM_UTF16_LE
 270           BOM_UTF32
 271           BOM_UTF32_BE
 272           BOM_UTF32_LE
 273
 274    These constants define various encodings of the Unicode byte order mark (BOM)
 275    used in UTF-16 and UTF-32 data streams to indicate the byte order used in the
 276    stream or file and in UTF-8 as a Unicode signature. :const:`BOM_UTF16` is either
 277    :const:`BOM_UTF16_BE` or :const:`BOM_UTF16_LE` depending on the platform's
 278    native byte order, :const:`BOM` is an alias for :const:`BOM_UTF16`,
 279    :const:`BOM_LE` for :const:`BOM_UTF16_LE` and :const:`BOM_BE` for
 280    :const:`BOM_UTF16_BE`. The others represent the BOM in UTF-8 and UTF-32
 281    encodings.
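
The relationships between these constants can be checked with a few lines. The
sketch assumes only :data:`sys.byteorder` to find the platform's native byte
order, and uses the ``utf-8-sig`` codec to produce a UTF-8 signature.

```python
import codecs
import sys

# BOM_UTF16 matches the platform's native byte order.
if sys.byteorder == 'little':
    native = codecs.BOM_UTF16_LE
else:
    native = codecs.BOM_UTF16_BE

# The plain 'utf-16' codec prepends a BOM in native byte order.
encoded = u'A'.encode('utf-16')

# The 'utf-8-sig' codec prepends the UTF-8 signature.
signed = u'A'.encode('utf-8-sig')
```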
 282
 283
 284 .. _codec-base-classes:
 285
 286 Codec Base Classes
 287 ------------------
 288
 289 The :mod:`codecs` module defines a set of base classes which define the
 290 interface and can also be used to easily write your own codecs for use in Python.
 291
 292 Each codec has to define four interfaces to make it usable as a codec in Python:
 293 stateless encoder, stateless decoder, stream reader and stream writer. The
 294 stream reader and writers typically reuse the stateless encoder/decoder to
 295 implement the file protocols.
 296
 297 The :class:`Codec` class defines the interface for stateless encoders/decoders.
 298
 299 To simplify and standardize error handling, the :meth:`encode` and
 300 :meth:`decode` methods may implement different error handling schemes by
 301 providing the *errors* string argument.  The following string values are defined
 302 and implemented by all standard Python codecs:
 303
 304 +-------------------------+-----------------------------------------------+
 305 | Value                   | Meaning                                       |
 306 +=========================+===============================================+
 307 | ``'strict'``            | Raise :exc:`UnicodeError` (or a subclass);    |
 308 |                         | this is the default.                          |
 309 +-------------------------+-----------------------------------------------+
 310 | ``'ignore'``            | Ignore the character and continue with the    |
 311 |                         | next.                                         |
 312 +-------------------------+-----------------------------------------------+
 313 | ``'replace'``           | Replace with a suitable replacement           |
 314 |                         | character; Python will use the official       |
 315 |                         | U+FFFD REPLACEMENT CHARACTER for the built-in |
 316 |                         | Unicode codecs on decoding and '?' on         |
 317 |                         | encoding.                                     |
 318 +-------------------------+-----------------------------------------------+
 319 | ``'xmlcharrefreplace'`` | Replace with the appropriate XML character    |
 320 |                         | reference (only for encoding).                |
 321 +-------------------------+-----------------------------------------------+
 322 | ``'backslashreplace'``  | Replace with backslashed escape sequences     |
 323 |                         | (only for encoding).                          |
 324 +-------------------------+-----------------------------------------------+
 325
 326 The set of allowed values can be extended via :func:`register_error`.
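
The effect of each value in the table can be seen by encoding one sample string
to ASCII with every scheme. The sample text is arbitrary; the last line shows
the ``'replace'`` scheme on the decoding side.

```python
import codecs

text = u'caf\xe9 \u20ac'      # two characters that ASCII cannot represent

results = {}
for errors in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    results[errors] = text.encode('ascii', errors)

try:
    text.encode('ascii', 'strict')
    strict_raised = False
except UnicodeError:
    strict_raised = True

# On decoding, 'replace' substitutes U+FFFD for the malformed byte.
decoded = b'caf\xff'.decode('utf-8', 'replace')
```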
 327
 328
 329 .. _codec-objects:
 330
 331 Codec Objects
 332 ^^^^^^^^^^^^^
 333
 334 The :class:`Codec` class defines these methods which also define the function
 335 interfaces of the stateless encoder and decoder:
 336
 337
 338 .. method:: Codec.encode(input[, errors])
 339
 340    Encodes the object *input* and returns a tuple (output object, length consumed).
 341    While codecs are not restricted to use with Unicode, in a Unicode context,
 342    encoding converts a Unicode object to a plain string using a particular
 343    character set encoding (e.g., ``cp1252`` or ``iso-8859-1``).
 344
 345    *errors* defines the error handling to apply. It defaults to ``'strict'``
 346    handling.
 347
 348    The method may not store state in the :class:`Codec` instance. Use
 349    :class:`StreamWriter` for codecs which have to keep state in order to make
 350    encoding efficient.
 351
 352    The encoder must be able to handle zero length input and return an empty object
 353    of the output object type in this situation.
 354
 355
 356 .. method:: Codec.decode(input[, errors])
 357
 358    Decodes the object *input* and returns a tuple (output object, length consumed).
 359    In a Unicode context, decoding converts a plain string encoded using a
 360    particular character set encoding to a Unicode object.
 361
 362    *input* must be an object which provides the ``bf_getreadbuf`` buffer slot.
 363    Python strings, buffer objects and memory mapped files are examples of objects
 364    providing this slot.
 365
 366    *errors* defines the error handling to apply. It defaults to ``'strict'``
 367    handling.
 368
 369    The method may not store state in the :class:`Codec` instance. Use
 370    :class:`StreamReader` for codecs which have to keep state in order to make
 371    decoding efficient.
 372
 373    The decoder must be able to handle zero length input and return an empty object
 374    of the output object type in this situation.
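
A stateless codec following this interface might look like the sketch below.
The class ``UpperAsciiCodec`` is a toy invented for illustration: it upper-cases
text before encoding it as ASCII and delegates the real work to the
``ascii_encode`` and ``ascii_decode`` functions exposed by the :mod:`codecs`
module. Both methods return a tuple and accept zero-length input.

```python
import codecs


class UpperAsciiCodec(codecs.Codec):     # hypothetical example codec
    def encode(self, input, errors='strict'):
        # Returns (encoded bytes, number of characters consumed).
        return codecs.ascii_encode(input.upper(), errors)

    def decode(self, input, errors='strict'):
        # Returns (decoded text, number of bytes consumed).
        return codecs.ascii_decode(input, errors)


codec = UpperAsciiCodec()
encoded = codec.encode(u'abc')
decoded = codec.decode(b'ABC')
empty_encoded = codec.encode(u'')
empty_decoded = codec.decode(b'')
```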
 375
 376 The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
 377 the basic interface for incremental encoding and decoding. Encoding/decoding the
 378 input isn't done with one call to the stateless encoder/decoder function, but
 379 with multiple calls to the :meth:`encode`/:meth:`decode` method of the
 380 incremental encoder/decoder. The incremental encoder/decoder keeps track of the
 381 encoding/decoding process during method calls.
 382
 383 The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
 384 same as if all the single inputs were joined into one, and this input was
 385 encoded/decoded with the stateless encoder/decoder.
 386
 387
 388 .. _incremental-encoder-objects:
 389
 390 IncrementalEncoder Objects
 391 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 392
 393 .. versionadded:: 2.5
 394
 395 The :class:`IncrementalEncoder` class is used for encoding an input in multiple
 396 steps. It defines the following methods which every incremental encoder must
 397 define in order to be compatible with the Python codec registry.
 398
 399
 400 .. class:: IncrementalEncoder([errors])
 401
 402    Constructor for an :class:`IncrementalEncoder` instance.
 403
 404    All incremental encoders must provide this constructor interface. They are free
 405    to add additional keyword arguments, but only the ones defined here are used by
 406    the Python codec registry.
 407
 408    The :class:`IncrementalEncoder` may implement different error handling schemes
 409    by providing the *errors* keyword argument. These parameters are predefined:
 410
 411    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
 412
 413    * ``'ignore'`` Ignore the character and continue with the next.
 414
 415    * ``'replace'`` Replace with a suitable replacement character.
 416
 417    * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.
 418
 419    * ``'backslashreplace'`` Replace with backslashed escape sequences.
 420
 421    The *errors* argument will be assigned to an attribute of the same name.
 422    Assigning to this attribute makes it possible to switch between different error
 423    handling strategies during the lifetime of the :class:`IncrementalEncoder`
 424    object.
 425
 426    The set of allowed values for the *errors* argument can be extended with
 427    :func:`register_error`.
 428
 429
 430 .. method:: IncrementalEncoder.encode(object[, final])
 431
 432    Encodes *object* (taking the current state of the encoder into account) and
 433    returns the resulting encoded object. If this is the last call to :meth:`encode`
 434    *final* must be true (the default is false).
 435
 436
 437 .. method:: IncrementalEncoder.reset()
 438
 439    Reset the encoder to the initial state.
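
The state kept by an incremental encoder is easy to observe with the ``utf-16``
codec, which has to remember whether the byte order mark was already written.
This sketch relies on that behaviour of the standard ``utf-16`` codec; other
codecs may keep no visible state at all.

```python
import codecs

encoder = codecs.getincrementalencoder('utf-16')()

first = encoder.encode(u'a')          # BOM followed by the character
second = encoder.encode(u'b')         # no BOM: the encoder remembers its state

encoder.reset()                       # back to the initial state
third = encoder.encode(u'c', True)    # BOM again; final=True ends the input
```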
 440
 441
 442 .. _incremental-decoder-objects:
 443
 444 IncrementalDecoder Objects
 445 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 446
 447 The :class:`IncrementalDecoder` class is used for decoding an input in multiple
 448 steps. It defines the following methods which every incremental decoder must
 449 define in order to be compatible with the Python codec registry.
 450
 451
 452 .. class:: IncrementalDecoder([errors])
 453
 454    Constructor for an :class:`IncrementalDecoder` instance.
 455
 456    All incremental decoders must provide this constructor interface. They are free
 457    to add additional keyword arguments, but only the ones defined here are used by
 458    the Python codec registry.
 459
 460    The :class:`IncrementalDecoder` may implement different error handling schemes
 461    by providing the *errors* keyword argument. These parameters are predefined:
 462
 463    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
 464
 465    * ``'ignore'`` Ignore the character and continue with the next.
 466
 467    * ``'replace'`` Replace with a suitable replacement character.
 468
 469    The *errors* argument will be assigned to an attribute of the same name.
 470    Assigning to this attribute makes it possible to switch between different error
 471    handling strategies during the lifetime of the :class:`IncrementalDecoder`
 472    object.
 473
 474    The set of allowed values for the *errors* argument can be extended with
 475    :func:`register_error`.
 476
 477
 478 .. method:: IncrementalDecoder.decode(object[, final])
 479
 480    Decodes *object* (taking the current state of the decoder into account) and
 481    returns the resulting decoded object. If this is the last call to :meth:`decode`
 482    *final* must be true (the default is false). If *final* is true the decoder must
 483    decode the input completely and must flush all buffers. If this isn't possible
 484    (e.g. because of incomplete byte sequences at the end of the input) it must
 485    initiate error handling just like in the stateless case (which might raise an
 486    exception).
 487
 488
 489 .. method:: IncrementalDecoder.reset()
 490
 491    Reset the decoder to the initial state.
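
The role of *final* can be sketched with a UTF-8 sequence that arrives in two
pieces. The three bytes used here encode ``U+20AC``; the second half of the
example shows that ending the input in the middle of a sequence triggers error
handling, which raises an exception under the default ``'strict'`` scheme.

```python
import codecs

decoder = codecs.getincrementaldecoder('utf-8')()

parts = [
    decoder.decode(b'\xe2\x82'),      # incomplete sequence: nothing returned yet
    decoder.decode(b'\xac'),          # sequence completed
    decoder.decode(b'', True),        # final call flushes the decoder
]
text = u''.join(parts)

decoder.reset()
decoder.decode(b'\xe2\x82')
try:
    decoder.decode(b'', True)         # input ends inside a sequence
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```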
 492
 493 The :class:`StreamWriter` and :class:`StreamReader` classes provide generic
 494 working interfaces which can be used to implement new encoding submodules very
 495 easily. See :mod:`encodings.utf_8` for an example of how this is done.
 496
 497
 498 .. _stream-writer-objects:
 499
 500 StreamWriter Objects
 501 ^^^^^^^^^^^^^^^^^^^^
 502
 503 The :class:`StreamWriter` class is a subclass of :class:`Codec` and defines the
 504 following methods which every stream writer must define in order to be
 505 compatible with the Python codec registry.
 506
 507
 508 .. class:: StreamWriter(stream[, errors])
 509
 510    Constructor for a :class:`StreamWriter` instance.
 511
 512    All stream writers must provide this constructor interface. They are free to add
 513    additional keyword arguments, but only the ones defined here are used by the
 514    Python codec registry.
 515
 516    *stream* must be a file-like object open for writing binary data.
 517
 518    The :class:`StreamWriter` may implement different error handling schemes by
 519    providing the *errors* keyword argument. These parameters are predefined:
 520
 521    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
 522
 523    * ``'ignore'`` Ignore the character and continue with the next.
 524
 525    * ``'replace'`` Replace with a suitable replacement character.
 526
 527    * ``'xmlcharrefreplace'`` Replace with the appropriate XML character reference.
 528
 529    * ``'backslashreplace'`` Replace with backslashed escape sequences.
 530
 531    The *errors* argument will be assigned to an attribute of the same name.
 532    Assigning to this attribute makes it possible to switch between different error
 533    handling strategies during the lifetime of the :class:`StreamWriter` object.
 534
 535    The set of allowed values for the *errors* argument can be extended with
 536    :func:`register_error`.
 537
 538
 539 .. method:: StreamWriter.write(object)
 540
 541    Writes the object's contents encoded to the stream.
 542
 543
 544 .. method:: StreamWriter.writelines(list)
 545
 546    Writes the concatenated list of strings to the stream (possibly by reusing the
 547    :meth:`write` method).
 548
 549
 550 .. method:: StreamWriter.reset()
 551
 552    Flushes and resets the codec buffers used for keeping state.
 553
 554    Calling this method should ensure that the data on the output is put into a
 555    clean state that allows appending of new fresh data without having to rescan the
 556    whole stream to recover state.
 557
 558 In addition to the above methods, the :class:`StreamWriter` must also inherit
 559 all other methods and attributes from the underlying stream.
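
A short sketch of a stream writer in use, assuming an :class:`io.BytesIO`
backend. It switches the error handling scheme by assigning to the ``errors``
attribute and then calls ``getvalue()``, a method of the underlying stream that
the writer passes through.

```python
import codecs
import io

backend = io.BytesIO()
writer = codecs.getwriter('ascii')(backend)    # errors defaults to 'strict'

writer.write(u'abc')
writer.errors = 'replace'                      # switch strategy mid-stream
writer.write(u'\xe9')                          # not ASCII: written as '?'
writer.writelines([u'x', u'y'])

contents = writer.getvalue()                   # delegated to the BytesIO object
```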
 560
 561
 562 .. _stream-reader-objects:
 563
 564 StreamReader Objects
 565 ^^^^^^^^^^^^^^^^^^^^
 566
 567 The :class:`StreamReader` class is a subclass of :class:`Codec` and defines the
 568 following methods which every stream reader must define in order to be
 569 compatible with the Python codec registry.
 570
 571
 572 .. class:: StreamReader(stream[, errors])
 573
 574    Constructor for a :class:`StreamReader` instance.
 575
 576    All stream readers must provide this constructor interface. They are free to add
 577    additional keyword arguments, but only the ones defined here are used by the
 578    Python codec registry.
 579
 580    *stream* must be a file-like object open for reading (binary) data.
 581
 582    The :class:`StreamReader` may implement different error handling schemes by
 583    providing the *errors* keyword argument. These parameters are predefined:
 584
 585    * ``'strict'`` Raise :exc:`ValueError` (or a subclass); this is the default.
 586
 587    * ``'ignore'`` Ignore the character and continue with the next.
 588
 589    * ``'replace'`` Replace with a suitable replacement character.
 590
 591    The *errors* argument will be assigned to an attribute of the same name.
 592    Assigning to this attribute makes it possible to switch between different error
 593    handling strategies during the lifetime of the :class:`StreamReader` object.
 594
 595    The set of allowed values for the *errors* argument can be extended with
 596    :func:`register_error`.
 597
 598
 599 .. method:: StreamReader.read([size[, chars[, firstline]]])
 600
 601    Decodes data from the stream and returns the resulting object.
 602
 603    *chars* indicates the number of characters to read from the stream. :meth:`read`
 604    will never return more than *chars* characters, but it might return fewer if
 605    there are not enough characters available.
 606
 607    *size* indicates the approximate maximum number of bytes to read from the stream
 608    for decoding purposes. The decoder can modify this setting as appropriate. The
 609    default value -1 indicates to read and decode as much as possible.  *size* is
 610    intended to prevent having to decode huge files in one step.
 611
 612    *firstline* indicates that it would be sufficient to only return the first line,
 613    if there are decoding errors on later lines.
 614
 615    The method should use a greedy read strategy meaning that it should read as much
 616    data as is allowed within the definition of the encoding and the given size,
 617    e.g.  if optional encoding endings or state markers are available on the stream,
 618    these should be read too.
 619
 620    .. versionchanged:: 2.4
 621       *chars* argument added.
 622
 623    .. versionchanged:: 2.4.2
 624       *firstline* argument added.
 625
 626
 627 .. method:: StreamReader.readline([size[, keepends]])
 628
 629    Read one line from the input stream and return the decoded data.
 630
 631    *size*, if given, is passed as size argument to the stream's :meth:`readline`
 632    method.
 633
 634    If *keepends* is false line-endings will be stripped from the lines returned.
 635
 636    .. versionchanged:: 2.4
 637       *keepends* argument added.
 638
 639
 640 .. method:: StreamReader.readlines([sizehint[, keepends]])
 641
 642    Read all lines available on the input stream and return them as a list of lines.
 643
 644    Line-endings are implemented using the codec's decoder method and are included
 645    in the list entries if *keepends* is true.
 646
 647    *sizehint*, if given, is passed as the *size* argument to the stream's
 648    :meth:`read` method.
 649
 650
 651 .. method:: StreamReader.reset()
 652
 653    Resets the codec buffers used for keeping state.
 654
 655    Note that no stream repositioning should take place.  This method is primarily
 656    intended to be able to recover from decoding errors.
 657
 658 In addition to the above methods, the :class:`StreamReader` must also inherit
 659 all other methods and attributes from the underlying stream.
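
The line-oriented methods can be exercised with a small in-memory stream, as in
the following sketch. The sample data is arbitrary; *keepends* controls whether
the line ending is kept on the returned string.

```python
import codecs
import io

data = u'first\nsecond\nthird\n'.encode('utf-8')
reader = codecs.getreader('utf-8')(io.BytesIO(data))

line1 = reader.readline()                  # line ending kept by default
line2 = reader.readline(keepends=False)    # line ending stripped
rest = reader.readlines()                  # remaining lines as a list
```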
 660
 661 The next two base classes are included for convenience. They are not needed by
 662 the codec registry, but may prove useful in practice.
 663
 664
 665 .. _stream-reader-writer:
 666
 667 StreamReaderWriter Objects
 668 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 669
 670 The :class:`StreamReaderWriter` allows wrapping streams which work in both read
 671 and write modes.
 672
 673 The design is such that one can use the factory functions returned by the
 674 :func:`lookup` function to construct the instance.
 675
 676
 677 .. class:: StreamReaderWriter(stream, Reader, Writer, errors)
 678
 679    Creates a :class:`StreamReaderWriter` instance. *stream* must be a file-like
 680    object. *Reader* and *Writer* must be factory functions or classes providing the
 681    :class:`StreamReader` and :class:`StreamWriter` interfaces, respectively. Error handling
 682    is done in the same way as defined for the stream readers and writers.
 683
 684 :class:`StreamReaderWriter` instances define the combined interfaces of
 685 :class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
 686 methods and attributes from the underlying stream.
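
Constructing a :class:`StreamReaderWriter` from the factories returned by
:func:`lookup` might look like this sketch, which again assumes an
:class:`io.BytesIO` object as the underlying stream.

```python
import codecs
import io

info = codecs.lookup('utf-8')
backend = io.BytesIO()

stream = codecs.StreamReaderWriter(
    backend, info.streamreader, info.streamwriter, 'strict')

stream.write(u'caf\xe9\n')     # encoded by the writer side
stream.seek(0)                 # rewind the underlying stream
text = stream.read()           # decoded by the reader side
```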
 687
 688
 689 .. _stream-recoder-objects:
 690
 691 StreamRecoder Objects
 692 ^^^^^^^^^^^^^^^^^^^^^
 693
 694 The :class:`StreamRecoder` provides a frontend-backend view of encoding data,
 695 which is sometimes useful when dealing with different encoding environments.
 696
 697 The design is such that one can use the factory functions returned by the
 698 :func:`lookup` function to construct the instance.
 699
 700
 701 .. class:: StreamRecoder(stream, encode, decode, Reader, Writer, errors)
 702
 703    Creates a :class:`StreamRecoder` instance which implements a two-way conversion:
 704    *encode* and *decode* work on the frontend (the data returned by :meth:`read`
 705    and passed to :meth:`write`) while *Reader* and *Writer* work on the backend (reading and
 706    writing to the stream).
 707
 708    You can use these objects to do transparent direct recodings from e.g. Latin-1
 709    to UTF-8 and back.
 710
 711    *stream* must be a file-like object.
 712
 713    *encode*, *decode* must adhere to the :class:`Codec` interface. *Reader*,
 714    *Writer* must be factory functions or classes providing objects of the
 715    :class:`StreamReader` and :class:`StreamWriter` interface respectively.
 716
 717    *encode* and *decode* are needed for the frontend translation, *Reader* and
 718    *Writer* for the backend translation.  The intermediate format used is
 719    determined by the two sets of codecs, e.g. the Unicode codecs will use Unicode
 720    as the intermediate encoding.
 721
 722    Error handling is done in the same way as defined for the stream readers and
 723    writers.
 724
 725 :class:`StreamRecoder` instances define the combined interfaces of
 726 :class:`StreamReader` and :class:`StreamWriter` classes. They inherit all other
 727 methods and attributes from the underlying stream.
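
The two-way conversion can be sketched by building a :class:`StreamRecoder` by
hand. In this example the frontend is assumed to speak Latin-1 and the backend
stream to hold UTF-8, so the stateless functions come from the ``latin-1`` codec
and the stream classes from the ``utf-8`` codec.

```python
import codecs
import io

latin1 = codecs.lookup('latin-1')     # frontend: what the caller sees
utf8 = codecs.lookup('utf-8')         # backend: what the stream contains

backend = io.BytesIO(u'caf\xe9'.encode('utf-8'))
recoder = codecs.StreamRecoder(
    backend,
    latin1.encode, latin1.decode,             # frontend translation
    utf8.streamreader, utf8.streamwriter,     # backend translation
    'strict')

frontend_bytes = recoder.read()       # UTF-8 in the stream, Latin-1 out
```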
 728
 729
 730 .. _encodings-overview:
 731
 732 Encodings and Unicode
 733 ---------------------
 734
 735 Unicode strings are stored internally as sequences of codepoints (to be precise
 736 as :ctype:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
 737 via :option:`--enable-unicode=ucs2` or :option:`--enable-unicode=ucs4`, with the
 738 former being the default) :ctype:`Py_UNICODE` is either a 16-bit or 32-bit data
 739 type. Once a Unicode object is used outside of CPU and memory, CPU endianness
 740 and how these arrays are stored as bytes become an issue.  Transforming a
 741 Unicode object into a sequence of bytes is called encoding and recreating the
 742 Unicode object from the sequence of bytes is known as decoding.  There are many
 743 different methods for how this transformation can be done (these methods are
 744 also called encodings). The simplest method is to map the codepoints 0-255 to
 745 the bytes ``0x0``-``0xff``. This means that a Unicode object that contains
 746 codepoints above ``U+00FF`` can't be encoded with this method (which is called
 747 ``'latin-1'`` or ``'iso-8859-1'``). :meth:`unicode.encode` will then raise a
 748 :exc:`UnicodeEncodeError` that looks like this: ``UnicodeEncodeError: 'latin-1'
 749 codec can't encode character u'\u1234' in position 3: ordinal not in
 750 range(256)``.
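
The behaviour described above can be sketched as follows. The sample strings
are assumptions chosen for illustration, and the ``u''`` and ``b''`` literals
are used so that the snippet runs on both Python 2.6+ and 3.3+.

```python
# Code points 0-255 map one-to-one onto the bytes 0x00-0xff.
encoded = u'caf\xe9'.encode('latin-1')

# A code point above U+00FF cannot be represented and raises an error.
try:
    u'abc\u1234'.encode('latin-1')
except UnicodeEncodeError as exc:
    failure = (exc.start, exc.reason)
else:
    failure = None

# An error handler such as 'replace' substitutes '?' instead of raising.
replaced = u'abc\u1234'.encode('latin-1', 'replace')
```

The ``start`` attribute of the exception corresponds to the position reported
in the error message.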
 751
 752 There's another group of encodings (the so-called charmap encodings) that choose
 753 a different subset of all Unicode code points and how these code points are
 754 mapped to the bytes ``0x0``-``0xff``. To see how this is done, simply open
 755 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 756 Windows). There's a string constant with 256 characters that shows you which
 757 character is mapped to which byte value.
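
The mapping can also be inspected from Python rather than by opening the file.
This sketch assumes the string constant is exposed as ``decoding_table``, which
is how the generated module names it in CPython; treat that name as an
implementation detail.

```python
import encodings.cp1252

# 256 characters, indexed by byte value.
table = encodings.cp1252.decoding_table

# cp1252 places the EURO SIGN at byte 0x80 ...
euro = table[0x80]
euro_bytes = u'\u20ac'.encode('cp1252')

# ... whereas latin-1 maps the same byte to the code point U+0080.
latin1_char = b'\x80'.decode('latin-1')
```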
 758
 759 All of these encodings can only encode 256 of the 65536 (or 1114112) codepoints
 760 defined in Unicode. A simple and straightforward way to store each Unicode
 761 code point is to store it as two consecutive bytes. There are two
 762 possibilities: Store the bytes in big endian or in little endian order. These
 763 two encodings are called UTF-16-BE and UTF-16-LE respectively. Their
 764 disadvantage is that if e.g. you use UTF-16-BE on a little endian machine you
 765 will always have to swap bytes on encoding and decoding. UTF-16 avoids this
 766 problem: Bytes will always be in natural endianness. When these bytes are read
 767 by a CPU with a different endianness, then bytes have to be swapped though. To
 768 be able to detect the endianness of a UTF-16 byte sequence, there's the
 769 so-called BOM (the "Byte Order Mark"). This is the Unicode character ``U+FEFF``.
 770 This character will be prepended to every UTF-16 byte sequence. The byte-swapped
 771 version of this character (``0xFFFE``) is an illegal character that may not
 772 appear in a Unicode text. So when the first character in a UTF-16 byte sequence
 773 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
 774 Unfortunately, up to Unicode 4.0 the character ``U+FEFF`` had a second purpose as
 775 a ``ZERO WIDTH NO-BREAK SPACE``: A character that has no width and doesn't allow
 776 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
 777 With Unicode 4.0 using ``U+FEFF`` as a ``ZERO WIDTH NO-BREAK SPACE`` has been
 778 deprecated (with ``U+2060`` (``WORD JOINER``) assuming this role). Nevertheless
 779 Unicode software still must be able to handle ``U+FEFF`` in both roles: As a BOM
 780 it's a device to determine the storage layout of the encoded bytes, and vanishes
 781 once the byte sequence has been decoded into a Unicode string; as a ``ZERO WIDTH
 782 NO-BREAK SPACE`` it's a normal character that will be decoded like any other.
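
A small sketch of the byte order handling described above, using the BOM
constants from the :mod:`codecs` module. The sample text is an assumption made
for illustration.

```python
import codecs

text = u'A\u20ac'

big = text.encode('utf-16-be')       # most significant byte first, no BOM
little = text.encode('utf-16-le')    # least significant byte first, no BOM

# Plain 'utf-16' prepends a BOM and uses the byte order of the running machine.
with_bom = text.encode('utf-16')
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# On decoding, the BOM selects the byte order and is then dropped.
from_be = (codecs.BOM_UTF16_BE + big).decode('utf-16')
from_le = (codecs.BOM_UTF16_LE + little).decode('utf-16')

# A U+FEFF that is not at the start is an ordinary character and is kept.
inner = (codecs.BOM_UTF16_LE + u'a\ufeffb'.encode('utf-16-le')).decode('utf-16')
```

Both byte orders decode to the same Unicode string once the BOM has been
consumed.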
 783
 784 There's another encoding that is able to encode the full range of Unicode
 785 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
 786 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
 787 parts: Marker bits (the most significant bits) and payload bits. The marker bits
 788 are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
 789 encoded like this (with x being payload bits, which when concatenated give the
 790 Unicode character):
 791
 792 +-----------------------------------+----------------------------------------------+
 793 | Range                             | Encoding                                     |
 794 +===================================+==============================================+
 795 | ``U-00000000`` ... ``U-0000007F`` | 0xxxxxxx                                     |
 796 +-----------------------------------+----------------------------------------------+
 797 | ``U-00000080`` ... ``U-000007FF`` | 110xxxxx 10xxxxxx                            |
 798 +-----------------------------------+----------------------------------------------+
 799 | ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
 800 +-----------------------------------+----------------------------------------------+
 801 | ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
 802 +-----------------------------------+----------------------------------------------+
 803 | ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
 804 +-----------------------------------+----------------------------------------------+
 805 | ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
 806 |                                   | 10xxxxxx                                     |
 807 +-----------------------------------+----------------------------------------------+
 808
 809 The least significant bit of the Unicode character is the rightmost x bit.
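
To make the table concrete, here is a minimal sketch that encodes a single
code point by hand. ``utf8_encode_codepoint`` is a hypothetical helper written
for this illustration, not part of the :mod:`codecs` module.

```python
def utf8_encode_codepoint(cp):
    """Encode one code point as in the table above (hypothetical helper)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:
        return bytes(bytearray([cp]))          # 0xxxxxxx
    # (exclusive upper bound, marker bits, number of 10xxxxxx bytes)
    layouts = ((0x800, 0xC0, 1), (0x10000, 0xE0, 2), (0x200000, 0xF0, 3),
               (0x4000000, 0xF8, 4), (0x80000000, 0xFC, 5))
    for limit, marker, ncont in layouts:
        if cp < limit:
            break
    out = []
    for _ in range(ncont):
        out.append(0x80 | (cp & 0x3F))         # six payload bits per byte
        cp >>= 6
    out.append(marker | cp)                    # remaining bits go in the first byte
    out.reverse()
    return bytes(bytearray(out))

samples = [utf8_encode_codepoint(cp) for cp in (0x41, 0xE9, 0x20AC)]
```

For these sample code points the built-in ``utf-8`` codec produces the same
bytes.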
 810
 811 As UTF-8 is an 8-bit encoding no BOM is required and any ``U+FEFF`` character in
 812 the decoded Unicode string (even if it's the first character) is treated as a
 813 ``ZERO WIDTH NO-BREAK SPACE``.
 814
 815 Without external information it's impossible to reliably determine which
 816 encoding was used for encoding a Unicode string. Each charmap encoding can
 817 decode any random byte sequence. However that's not possible with UTF-8, as
 818 UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
 819 sequences. To increase the reliability with which a UTF-8 encoding can be
 820 detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
 821 ``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
 822 is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
 823 sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
 824 that any charmap encoded file starts with these byte values (which would e.g.
 825 map to
 826
 827    | LATIN SMALL LETTER I WITH DIAERESIS
 828    | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
 829    | INVERTED QUESTION MARK
 830
 831 in iso-8859-1), this increases the probability that a utf-8-sig encoding can be
 832 correctly guessed from the byte sequence. So here the BOM is not used to be able
 833 to determine the byte order used for generating the byte sequence, but as a
 834 signature that helps in guessing the encoding. On encoding the utf-8-sig codec
 835 will write ``0xef``, ``0xbb``, ``0xbf`` as the first three bytes to the file. On
 836 decoding utf-8-sig will skip those three bytes if they appear as the first three
 837 bytes in the file.
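
The difference between ``utf-8`` and ``utf-8-sig``, and the point about
detecting invalid UTF-8, can be sketched like this. The byte strings are
assumptions chosen for illustration.

```python
import codecs

data = codecs.BOM_UTF8 + b'abc'          # b'\xef\xbb\xbfabc'

plain = data.decode('utf-8')             # U+FEFF is kept as a character
stripped = data.decode('utf-8-sig')      # the signature is skipped
no_sig = b'abc'.decode('utf-8-sig')      # decoding also works without a signature
written = u'abc'.encode('utf-8-sig')     # encoding always writes the signature

# Any byte sequence decodes as latin-1, but not every one is valid UTF-8.
as_latin1 = b'\xff\xfe'.decode('latin-1')
try:
    b'\xff\xfe'.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```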
 838
 839
 840 .. _standard-encodings:
 841
 842 Standard Encodings
 843 ------------------
 844
 845 Python comes with a number of codecs built-in, either implemented as C functions
 846 or with dictionaries as mapping tables. The following table lists the codecs by
 847 name, together with a few common aliases, and the languages for which the
 848 encoding is likely used. Neither the list of aliases nor the list of languages
 849 is meant to be exhaustive. Notice that spelling alternatives that only differ in
 850 case or use a hyphen instead of an underscore are also valid aliases.
 851
 852 Many of the character sets support the same languages. They vary in individual
 853 characters (e.g. whether the EURO SIGN is supported or not), and in the
 854 assignment of characters to code positions. For the European languages in
 855 particular, the following variants typically exist:
 856
 857 * an ISO 8859 codeset
 858
 859 * a Microsoft Windows code page, which is typically derived from an 8859 codeset,
 860   but replaces control characters with additional graphic characters
 861
 862 * an IBM EBCDIC code page
 863
 864 * an IBM PC code page, which is ASCII compatible
 865
 866 +-----------------+--------------------------------+--------------------------------+
 867 | Codec           | Aliases                        | Languages                      |
 868 +=================+================================+================================+
 869 | ascii           | 646, us-ascii                  | English                        |
 870 +-----------------+--------------------------------+--------------------------------+
 871 | big5            | big5-tw, csbig5                | Traditional Chinese            |
 872 +-----------------+--------------------------------+--------------------------------+
 873 | big5hkscs       | big5-hkscs, hkscs              | Traditional Chinese            |
 874 +-----------------+--------------------------------+--------------------------------+
 875 | cp037           | IBM037, IBM039                 | English                        |
 876 +-----------------+--------------------------------+--------------------------------+
 877 | cp424           | EBCDIC-CP-HE, IBM424           | Hebrew                         |
 878 +-----------------+--------------------------------+--------------------------------+
 879 | cp437           | 437, IBM437                    | English                        |
 880 +-----------------+--------------------------------+--------------------------------+
 881 | cp500           | EBCDIC-CP-BE, EBCDIC-CP-CH,    | Western Europe                 |
 882 |                 | IBM500                         |                                |
 883 +-----------------+--------------------------------+--------------------------------+
 884 | cp737           |                                | Greek                          |
 885 +-----------------+--------------------------------+--------------------------------+
 886 | cp775           | IBM775                         | Baltic languages               |
 887 +-----------------+--------------------------------+--------------------------------+
 888 | cp850           | 850, IBM850                    | Western Europe                 |
 889 +-----------------+--------------------------------+--------------------------------+
 890 | cp852           | 852, IBM852                    | Central and Eastern Europe     |
 891 +-----------------+--------------------------------+--------------------------------+
 892 | cp855           | 855, IBM855                    | Bulgarian, Byelorussian,       |
 893 |                 |                                | Macedonian, Russian, Serbian   |
 894 +-----------------+--------------------------------+--------------------------------+
 895 | cp856           |                                | Hebrew                         |
 896 +-----------------+--------------------------------+--------------------------------+
 897 | cp857           | 857, IBM857                    | Turkish                        |
 898 +-----------------+--------------------------------+--------------------------------+
 899 | cp860           | 860, IBM860                    | Portuguese                     |
 900 +-----------------+--------------------------------+--------------------------------+
 901 | cp861           | 861, CP-IS, IBM861             | Icelandic                      |
 902 +-----------------+--------------------------------+--------------------------------+
 903 | cp862           | 862, IBM862                    | Hebrew                         |
 904 +-----------------+--------------------------------+--------------------------------+
 905 | cp863           | 863, IBM863                    | Canadian                       |
 906 +-----------------+--------------------------------+--------------------------------+
 907 | cp864           | IBM864                         | Arabic                         |
 908 +-----------------+--------------------------------+--------------------------------+
 909 | cp865           | 865, IBM865                    | Danish, Norwegian              |
 910 +-----------------+--------------------------------+--------------------------------+
 911 | cp866           | 866, IBM866                    | Russian                        |
 912 +-----------------+--------------------------------+--------------------------------+
 913 | cp869           | 869, CP-GR, IBM869             | Greek                          |
 914 +-----------------+--------------------------------+--------------------------------+
 915 | cp874           |                                | Thai                           |
 916 +-----------------+--------------------------------+--------------------------------+
 917 | cp875           |                                | Greek                          |
 918 +-----------------+--------------------------------+--------------------------------+
 919 | cp932           | 932, ms932, mskanji, ms-kanji  | Japanese                       |
 920 +-----------------+--------------------------------+--------------------------------+
 921 | cp949           | 949, ms949, uhc                | Korean                         |
 922 +-----------------+--------------------------------+--------------------------------+
 923 | cp950           | 950, ms950                     | Traditional Chinese            |
 924 +-----------------+--------------------------------+--------------------------------+
 925 | cp1006          |                                | Urdu                           |
 926 +-----------------+--------------------------------+--------------------------------+
 927 | cp1026          | ibm1026                        | Turkish                        |
 928 +-----------------+--------------------------------+--------------------------------+
 929 | cp1140          | ibm1140                        | Western Europe                 |
 930 +-----------------+--------------------------------+--------------------------------+
 931 | cp1250          | windows-1250                   | Central and Eastern Europe     |
 932 +-----------------+--------------------------------+--------------------------------+
 933 | cp1251          | windows-1251                   | Bulgarian, Byelorussian,       |
 934 |                 |                                | Macedonian, Russian, Serbian   |
 935 +-----------------+--------------------------------+--------------------------------+
 936 | cp1252          | windows-1252                   | Western Europe                 |
 937 +-----------------+--------------------------------+--------------------------------+
 938 | cp1253          | windows-1253                   | Greek                          |
 939 +-----------------+--------------------------------+--------------------------------+
 940 | cp1254          | windows-1254                   | Turkish                        |
 941 +-----------------+--------------------------------+--------------------------------+
 942 | cp1255          | windows-1255                   | Hebrew                         |
 943 +-----------------+--------------------------------+--------------------------------+
 944 | cp1256          | windows-1256                   | Arabic                         |
 945 +-----------------+--------------------------------+--------------------------------+
 946 | cp1257          | windows-1257                   | Baltic languages               |
 947 +-----------------+--------------------------------+--------------------------------+
 948 | cp1258          | windows-1258                   | Vietnamese                     |
 949 +-----------------+--------------------------------+--------------------------------+
 950 | euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
 951 +-----------------+--------------------------------+--------------------------------+
 952 | euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
 953 +-----------------+--------------------------------+--------------------------------+
 954 | euc_jisx0213    | eucjisx0213                    | Japanese                       |
 955 +-----------------+--------------------------------+--------------------------------+
 956 | euc_kr          | euckr, korean, ksc5601,        | Korean                         |
 957 |                 | ks_c-5601, ks_c-5601-1987,     |                                |
 958 |                 | ksx1001, ks_x-1001             |                                |
 959 +-----------------+--------------------------------+--------------------------------+
 960 | gb2312          | chinese, csiso58gb231280, euc- | Simplified Chinese             |
 961 |                 | cn, euccn, eucgb2312-cn,       |                                |
 962 |                 | gb2312-1980, gb2312-80, iso-   |                                |
 963 |                 | ir-58                          |                                |
 964 +-----------------+--------------------------------+--------------------------------+
 965 | gbk             | 936, cp936, ms936              | Unified Chinese                |
 966 +-----------------+--------------------------------+--------------------------------+
 967 | gb18030         | gb18030-2000                   | Unified Chinese                |
 968 +-----------------+--------------------------------+--------------------------------+
 969 | hz              | hzgb, hz-gb, hz-gb-2312        | Simplified Chinese             |
 970 +-----------------+--------------------------------+--------------------------------+
 971 | iso2022_jp      | csiso2022jp, iso2022jp,        | Japanese                       |
 972 |                 | iso-2022-jp                    |                                |
 973 +-----------------+--------------------------------+--------------------------------+
 974 | iso2022_jp_1    | iso2022jp-1, iso-2022-jp-1     | Japanese                       |
 975 +-----------------+--------------------------------+--------------------------------+
 976 | iso2022_jp_2    | iso2022jp-2, iso-2022-jp-2     | Japanese, Korean, Simplified   |
 977 |                 |                                | Chinese, Western Europe, Greek |
 978 +-----------------+--------------------------------+--------------------------------+
 979 | iso2022_jp_2004 | iso2022jp-2004,                | Japanese                       |
 980 |                 | iso-2022-jp-2004               |                                |
 981 +-----------------+--------------------------------+--------------------------------+
 982 | iso2022_jp_3    | iso2022jp-3, iso-2022-jp-3     | Japanese                       |
 983 +-----------------+--------------------------------+--------------------------------+
 984 | iso2022_jp_ext  | iso2022jp-ext, iso-2022-jp-ext | Japanese                       |
 985 +-----------------+--------------------------------+--------------------------------+
 986 | iso2022_kr      | csiso2022kr, iso2022kr,        | Korean                         |
 987 |                 | iso-2022-kr                    |                                |
 988 +-----------------+--------------------------------+--------------------------------+
 989 | latin_1         | iso-8859-1, iso8859-1, 8859,   | Western Europe                 |
 990 |                 | cp819, latin, latin1, L1       |                                |
 991 +-----------------+--------------------------------+--------------------------------+
 992 | iso8859_2       | iso-8859-2, latin2, L2         | Central and Eastern Europe     |
 993 +-----------------+--------------------------------+--------------------------------+
 994 | iso8859_3       | iso-8859-3, latin3, L3         | Esperanto, Maltese             |
 995 +-----------------+--------------------------------+--------------------------------+
 996 | iso8859_4       | iso-8859-4, latin4, L4         | Baltic languages               |
 997 +-----------------+--------------------------------+--------------------------------+
 998 | iso8859_5       | iso-8859-5, cyrillic           | Bulgarian, Byelorussian,       |
 999 |                 |                                | Macedonian, Russian, Serbian   |
1000 +-----------------+--------------------------------+--------------------------------+
1001 | iso8859_6       | iso-8859-6, arabic             | Arabic                         |
1002 +-----------------+--------------------------------+--------------------------------+
1003 | iso8859_7       | iso-8859-7, greek, greek8      | Greek                          |
1004 +-----------------+--------------------------------+--------------------------------+
1005 | iso8859_8       | iso-8859-8, hebrew             | Hebrew                         |
1006 +-----------------+--------------------------------+--------------------------------+
1007 | iso8859_9       | iso-8859-9, latin5, L5         | Turkish                        |
1008 +-----------------+--------------------------------+--------------------------------+
1009 | iso8859_10      | iso-8859-10, latin6, L6        | Nordic languages               |
1010 +-----------------+--------------------------------+--------------------------------+
1011 | iso8859_13      | iso-8859-13                    | Baltic languages               |
1012 +-----------------+--------------------------------+--------------------------------+
1013 | iso8859_14      | iso-8859-14, latin8, L8        | Celtic languages               |
1014 +-----------------+--------------------------------+--------------------------------+
1015 | iso8859_15      | iso-8859-15                    | Western Europe                 |
1016 +-----------------+--------------------------------+--------------------------------+
1017 | johab           | cp1361, ms1361                 | Korean                         |
1018 +-----------------+--------------------------------+--------------------------------+
1019 | koi8_r          |                                | Russian                        |
1020 +-----------------+--------------------------------+--------------------------------+
1021 | koi8_u          |                                | Ukrainian                      |
1022 +-----------------+--------------------------------+--------------------------------+
1023 | mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
1024 |                 |                                | Macedonian, Russian, Serbian   |
1025 +-----------------+--------------------------------+--------------------------------+
1026 | mac_greek       | macgreek                       | Greek                          |
1027 +-----------------+--------------------------------+--------------------------------+
1028 | mac_iceland     | maciceland                     | Icelandic                      |
1029 +-----------------+--------------------------------+--------------------------------+
1030 | mac_latin2      | maclatin2, maccentraleurope    | Central and Eastern Europe     |
1031 +-----------------+--------------------------------+--------------------------------+
1032 | mac_roman       | macroman                       | Western Europe                 |
1033 +-----------------+--------------------------------+--------------------------------+
1034 | mac_turkish     | macturkish                     | Turkish                        |
1035 +-----------------+--------------------------------+--------------------------------+
1036 | ptcp154         | csptcp154, pt154, cp154,       | Kazakh                         |
1037 |                 | cyrillic-asian                 |                                |
1038 +-----------------+--------------------------------+--------------------------------+
1039 | shift_jis       | csshiftjis, shiftjis, sjis,    | Japanese                       |
1040 |                 | s_jis                          |                                |
1041 +-----------------+--------------------------------+--------------------------------+
1042 | shift_jis_2004  | shiftjis2004, sjis_2004,       | Japanese                       |
1043 |                 | sjis2004                       |                                |
1044 +-----------------+--------------------------------+--------------------------------+
1045 | shift_jisx0213  | shiftjisx0213, sjisx0213,      | Japanese                       |
1046 |                 | s_jisx0213                     |                                |
1047 +-----------------+--------------------------------+--------------------------------+
1048 | utf_32          | U32, utf32                     | all languages                  |
1049 +-----------------+--------------------------------+--------------------------------+
1050 | utf_32_be       | UTF-32BE                       | all languages                  |
1051 +-----------------+--------------------------------+--------------------------------+
1052 | utf_32_le       | UTF-32LE                       | all languages                  |
1053 +-----------------+--------------------------------+--------------------------------+
1054 | utf_16          | U16, utf16                     | all languages                  |
1055 +-----------------+--------------------------------+--------------------------------+
1056 | utf_16_be       | UTF-16BE                       | all languages (BMP only)       |
1057 +-----------------+--------------------------------+--------------------------------+
1058 | utf_16_le       | UTF-16LE                       | all languages (BMP only)       |
1059 +-----------------+--------------------------------+--------------------------------+
1060 | utf_7           | U7, unicode-1-1-utf-7          | all languages                  |
1061 +-----------------+--------------------------------+--------------------------------+
1062 | utf_8           | U8, UTF, utf8                  | all languages                  |
1063 +-----------------+--------------------------------+--------------------------------+
1064 | utf_8_sig       |                                | all languages                  |
1065 +-----------------+--------------------------------+--------------------------------+
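
As a short illustration of the alias rules and of how codecs for the same
languages differ in individual characters, consider the following sketch. The
aliases and the EURO SIGN test character are examples picked for illustration.

```python
import codecs

# Spelling variants and aliases all resolve to the same codec.
names = set(codecs.lookup(alias).name
            for alias in ('utf_8', 'UTF-8', 'utf8', 'U8'))

# Character sets for the same languages differ in individual characters:
# the EURO SIGN exists in iso8859_15 and cp1252, but not in latin_1.
euro = u'\u20ac'
in_iso8859_15 = euro.encode('iso8859_15')
in_cp1252 = euro.encode('cp1252')
try:
    euro.encode('latin_1')
    in_latin_1 = True
except UnicodeEncodeError:
    in_latin_1 = False
```

Note that the two codecs that do support the character assign it to different
byte values.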
1066
1067 A number of codecs are specific to Python, so their codec names have no meaning
1068 outside Python. Some of them don't convert from Unicode strings to byte strings,
1069 but instead use the property of the Python codecs machinery that any bijective
1070 function with one argument can be considered as an encoding.
1071
1072 For the codecs listed below, the result in the "encoding" direction is always a
1073 byte string. The result of the "decoding" direction is listed as operand type in
1074 the table.
1075
1076 +--------------------+---------------------------+----------------+---------------------------+
1077 | Codec              | Aliases                   | Operand type   | Purpose                   |
1078 +====================+===========================+================+===========================+
1079 | base64_codec       | base64, base-64           | byte string    | Convert operand to MIME   |
1080 |                    |                           |                | base64                    |
1081 +--------------------+---------------------------+----------------+---------------------------+
1082 | bz2_codec          | bz2                       | byte string    | Compress the operand      |
1083 |                    |                           |                | using bz2                 |
1084 +--------------------+---------------------------+----------------+---------------------------+
1085 | hex_codec          | hex                       | byte string    | Convert operand to        |
1086 |                    |                           |                | hexadecimal               |
1087 |                    |                           |                | representation, with two  |
1088 |                    |                           |                | digits per byte           |
1089 +--------------------+---------------------------+----------------+---------------------------+
1090 | idna               |                           | Unicode string | Implements :rfc:`3490`,   |
1091 |                    |                           |                | see also                  |
1092 |                    |                           |                | :mod:`encodings.idna`     |
1093 +--------------------+---------------------------+----------------+---------------------------+
1094 | mbcs               | dbcs                      | Unicode string | Windows only: Encode      |
1095 |                    |                           |                | operand according to the  |
1096 |                    |                           |                | ANSI codepage (CP_ACP)    |
1097 +--------------------+---------------------------+----------------+---------------------------+
1098 | palmos             |                           | Unicode string | Encoding of PalmOS 3.5    |
1099 +--------------------+---------------------------+----------------+---------------------------+
1100 | punycode           |                           | Unicode string | Implements :rfc:`3492`    |
1101 +--------------------+---------------------------+----------------+---------------------------+
1102 | quopri_codec       | quopri, quoted-printable, | byte string    | Convert operand to MIME   |
1103 |                    | quotedprintable           |                | quoted printable          |
1104 +--------------------+---------------------------+----------------+---------------------------+
1105 | raw_unicode_escape |                           | Unicode string | Produce a string that is  |
1106 |                    |                           |                | suitable as raw Unicode   |
1107 |                    |                           |                | literal in Python source  |
1108 |                    |                           |                | code                      |
1109 +--------------------+---------------------------+----------------+---------------------------+
1110 | rot_13             | rot13                     | Unicode string | Returns the Caesar-cypher |
1111 |                    |                           |                | encryption of the operand |
1112 +--------------------+---------------------------+----------------+---------------------------+
1113 | string_escape      |                           | byte string    | Produce a string that is  |
1114 |                    |                           |                | suitable as string        |
1115 |                    |                           |                | literal in Python source  |
1116 |                    |                           |                | code                      |
1117 +--------------------+---------------------------+----------------+---------------------------+
1118 | undefined          |                           | any            | Raise an exception for    |
1119 |                    |                           |                | all conversions. Can be   |
1120 |                    |                           |                | used as the system        |
1121 |                    |                           |                | encoding if no automatic  |
1122 |                    |                           |                | :term:`coercion` between  |
1123 |                    |                           |                | byte and Unicode strings  |
1124 |                    |                           |                | is desired.               |
1125 +--------------------+---------------------------+----------------+---------------------------+
1126 | unicode_escape     |                           | Unicode string | Produce a string that is  |
1127 |                    |                           |                | suitable as Unicode       |
1128 |                    |                           |                | literal in Python source  |
1129 |                    |                           |                | code                      |
1130 +--------------------+---------------------------+----------------+---------------------------+
1131 | unicode_internal   |                           | Unicode string | Return the internal       |
1132 |                    |                           |                | representation of the     |
1133 |                    |                           |                | operand                   |
1134 +--------------------+---------------------------+----------------+---------------------------+
1135 | uu_codec           | uu                        | byte string    | Convert the operand using |
1136 |                    |                           |                | uuencode                  |
1137 +--------------------+---------------------------+----------------+---------------------------+
1138 | zlib_codec         | zip, zlib                 | byte string    | Compress the operand      |
1139 |                    |                           |                | using zlib                |
1140 +--------------------+---------------------------+----------------+---------------------------+
1141
1142 .. versionadded:: 2.3
1143    The ``idna`` and ``punycode`` encodings.
1144
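As an illustration of two of the codecs in the table above, the following is a
minimal sketch that goes through :func:`codecs.encode`. The sample strings are
arbitrary, and the explicit ``u`` and ``b`` prefixes are only there to keep the
byte/Unicode distinction visible.

```python
import codecs
import zlib

# unicode_escape: turn a Unicode string into the body of a Python literal.
escaped = codecs.encode(u"caf\xe9\n", "unicode_escape")

# zlib_codec: a byte-string transform whose output is a zlib stream,
# so it can be read back with zlib.decompress().
data = b"spam and eggs " * 10
packed = codecs.encode(data, "zlib_codec")
unpacked = zlib.decompress(packed)
```

Here ``escaped`` contains only ASCII bytes, and ``unpacked`` equals ``data``.
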
1145
1146 :mod:`encodings.idna` --- Internationalized Domain Names in Applications
1147 ------------------------------------------------------------------------
1148
1149 .. module:: encodings.idna
1150    :synopsis: Internationalized Domain Names implementation
1151 .. moduleauthor:: Martin v. Löwis
1152
1153 .. versionadded:: 2.3
1154
1155 This module implements :rfc:`3490` (Internationalized Domain Names in
1156 Applications) and :rfc:`3491` (Nameprep: A Stringprep Profile for
1157 Internationalized Domain Names (IDN)). It builds upon the ``punycode`` encoding
1158 and :mod:`stringprep`.
1159
1160 These RFCs together define a protocol to support non-ASCII characters in domain
1161 names. A domain name containing non-ASCII characters (such as
1162 ``www.Alliancefrançaise.nu``) is converted into an ASCII-compatible encoding
1163 (ACE, such as ``www.xn--alliancefranaise-npb.nu``). The ACE form of the domain
1164 name is then used in all places where arbitrary characters are not allowed by
1165 the protocol, such as DNS queries, HTTP :mailheader:`Host` fields, and so
1166 on. This conversion is carried out in the application, if possible invisibly to
1167 the user: the application should transparently convert Unicode domain labels to
1168 ACE on the wire, and convert ACE labels back to Unicode before presenting them
1169 to the user.
1170
1171 Python supports this conversion in several ways: the ``idna`` codec converts
1172 between Unicode and ACE. Furthermore, the :mod:`socket` module
1173 transparently converts Unicode host names to ACE, so that applications need not
1174 be concerned about converting host names themselves when they pass them to the
1175 socket module. On top of that, modules that have host names as function
1176 parameters, such as :mod:`httplib` and :mod:`ftplib`, accept Unicode host names
1177 (:mod:`httplib` then also transparently sends an IDNA hostname in the
1178 :mailheader:`Host` field if it sends that field at all).
1179
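The codec-based conversion described above might look like the following
sketch. It reuses the domain name from the example earlier in this section;
``www.python.org`` is just an arbitrary all-ASCII name for comparison.

```python
# Encode a Unicode host name to its ASCII-compatible encoding (ACE).
name = u"www.Alliancefran\xe7aise.nu"
ace = name.encode("idna")  # b"www.xn--alliancefranaise-npb.nu"

# Labels that are already plain ASCII pass through unchanged.
plain = u"www.python.org".encode("idna")
```

Only the label containing the non-ASCII character gets the ``xn--`` prefix; the
codec works label by label.
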
1180 When receiving host names from the wire (such as in reverse name lookup), no
1181 automatic conversion to Unicode is performed: applications wishing to present
1182 such host names to the user should decode them to Unicode.
1183
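One way an application could do this decoding is sketched below. The helper
name ``display_name`` and its fallback behaviour are hypothetical, and the
sketch assumes the host name arrives as a byte string.

```python
def display_name(wire_name):
    """Return a Unicode host name for display, given an ASCII byte string."""
    try:
        return wire_name.decode("idna")
    except UnicodeError:
        # Hypothetical fallback: show the raw ASCII form rather than fail.
        return wire_name.decode("ascii", "replace")

shown = display_name(b"www.xn--alliancefranaise-npb.nu")
```

Names without ACE labels are returned unchanged, so the helper can be applied
to every host name before it is shown to the user.
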
1184 The module :mod:`encodings.idna` also implements the nameprep procedure, which
1185 performs certain normalizations on host names, to achieve case-insensitivity of
1186 international domain names, and to unify similar characters. The nameprep
1187 functions can be used directly if desired.
1188
1189
1190 .. function:: nameprep(label)
1191
1192    Return the nameprepped version of *label*. The implementation currently assumes
1193    query strings, so ``AllowUnassigned`` is true.
1194
1195
1196 .. function:: ToASCII(label)
1197
1198    Convert a label to ASCII, as specified in :rfc:`3490`. ``UseSTD3ASCIIRules`` is
1199    assumed to be false.
1200
1201
1202 .. function:: ToUnicode(label)
1203
1204    Convert a label to Unicode, as specified in :rfc:`3490`.
1205
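The three functions above operate on a single label, not on a full domain
name. A minimal sketch, using one label from the example earlier in this
section:

```python
from encodings.idna import nameprep, ToASCII, ToUnicode

label = u"Alliancefran\xe7aise"

# nameprep normalizes the label; here that amounts to case folding.
prepped = nameprep(label)

# ToASCII nameprepps the label, Punycode-encodes it and adds the ACE prefix.
ace_label = ToASCII(label)

# ToUnicode reverses the conversion for a label carrying the ACE prefix.
unicode_label = ToUnicode(b"xn--alliancefranaise-npb")
```

To convert a whole host name with these functions, an application would have
to split it on the dots and process each label separately, which is what the
``idna`` codec does.
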
1206
1207 :mod:`encodings.utf_8_sig` --- UTF-8 codec with BOM signature
1208 -------------------------------------------------------------
1209
1210 .. module:: encodings.utf_8_sig
1211    :synopsis: UTF-8 codec with BOM signature
1212 .. moduleauthor:: Walter Dörwald
1213
1214 .. versionadded:: 2.5
1215
1216 This module implements a variant of the UTF-8 codec: on encoding, a UTF-8 encoded
1217 BOM is prepended to the UTF-8 encoded bytes. For the stateful encoder this
1218 is only done once (on the first write to the byte stream). On decoding, an
1219 optional UTF-8 encoded BOM at the start of the data is skipped.
1220
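The behaviour described above can be observed with a short sketch. It uses the
codec name ``utf-8-sig`` and the incremental encoder returned by
:func:`codecs.getincrementalencoder` as a stand-in for a stateful encoder; the
sample strings are arbitrary.

```python
import codecs

# Stateless encoding prepends the BOM to the encoded bytes.
encoded = u"abc".encode("utf-8-sig")

# Decoding skips a leading BOM, and also accepts data without one.
with_bom = (codecs.BOM_UTF8 + b"abc").decode("utf-8-sig")
without_bom = b"abc".decode("utf-8-sig")

# A stateful encoder emits the BOM only on the first call.
encoder = codecs.getincrementalencoder("utf-8-sig")()
first = encoder.encode(u"a")
second = encoder.encode(u"b")
```

After this runs, ``encoded`` and ``first`` start with ``codecs.BOM_UTF8``,
``second`` does not, and both decoded strings equal ``u"abc"``.
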