Doc/howto/unicode.rst

   1 Unicode HOWTO
   2 ================
   3
   4 **Version 1.02**
   5
   6 This HOWTO discusses Python's support for Unicode, and explains various
   7 problems that people commonly encounter when trying to work with Unicode.
   8
   9 Introduction to Unicode
  10 ------------------------------
  11
  12 History of Character Codes
  13 ''''''''''''''''''''''''''''''
  14
  15 In 1968, the American Standard Code for Information Interchange,
  16 better known by its acronym ASCII, was standardized.  ASCII defined
  17 numeric codes for various characters, with the numeric values running from 0 to
  18 127.  For example, the lowercase letter 'a' is assigned 97 as its code
  19 value.
  20
  21 ASCII was an American-developed standard, so it only defined
  22 unaccented characters.  There was an 'e', but no 'é' or 'Í'.  This
  23 meant that languages which required accented characters couldn't be
  24 faithfully represented in ASCII.  (Actually the missing accents matter
  25 for English, too, which contains words such as 'naïve' and 'café', and some
  26 publications have house styles which require spellings such as
  27 'coöperate'.)
  28
  29 For a while people just wrote programs that didn't display accents.  I
  30 remember looking at Apple ][ BASIC programs, published in French-language
  31 publications in the mid-1980s, that had lines like these::
  32
        PRINT "FICHIER EST COMPLETE."
        PRINT "CARACTERE NON ACCEPTE."
  35
  36 Those messages should contain accents, and they just look wrong to
  37 someone who can read French.
  38
In the 1980s, almost all personal computers were 8-bit, meaning that
bytes could hold values ranging from 0 to 255.  ASCII codes only went
up to 127, so some machines assigned values between 128 and 255 to
accented characters.  Different machines had different codes, however,
which led to problems exchanging files.  Eventually various commonly
used sets of values for the 128-255 range emerged.  Some were true
standards, defined by the International Organization for Standardization
(ISO), and some were *de facto* conventions that were invented by one
company or another and managed to catch on.
  48
256 characters aren't very many.  For example, you can't fit
both the accented characters used in Western Europe and the Cyrillic
alphabet used for Russian into the 128-255 range because there are more than
128 such characters.
  53
  54 You could write files using different codes (all your Russian
  55 files in a coding system called KOI8, all your French files in
  56 a different coding system called Latin1), but what if you wanted
  57 to write a French document that quotes some Russian text?  In the
  58 1980s people began to want to solve this problem, and the Unicode
  59 standardization effort began.
  60
  61 Unicode started out using 16-bit characters instead of 8-bit characters.  16
  62 bits means you have 2^16 = 65,536 distinct values available, making it
  63 possible to represent many different characters from many different
  64 alphabets; an initial goal was to have Unicode contain the alphabets for
  65 every single human language.  It turns out that even 16 bits isn't enough to
  66 meet that goal, and the modern Unicode specification uses a wider range of
  67 codes, 0-1,114,111 (0x10ffff in base-16).
  68
  69 There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
  70 originally separate efforts, but the specifications were merged with
  71 the 1.1 revision of Unicode.
  72
(This discussion of Unicode's history is highly simplified.  I don't
think the average Python programmer needs to worry about the
historical details; consult the Unicode Consortium site listed in the
References for more information.)
  77
  78
  79 Definitions
  80 ''''''''''''''''''''''''
  81
  82 A **character** is the smallest possible component of a text.  'A',
  83 'B', 'C', etc., are all different characters.  So are 'È' and
  84 'Í'.  Characters are abstractions, and vary depending on the
  85 language or context you're talking about.  For example, the symbol for
  86 ohms (Ω) is usually drawn much like the capital letter
  87 omega (Ω) in the Greek alphabet (they may even be the same in
  88 some fonts), but these are two different characters that have
  89 different meanings.
  90
  91 The Unicode standard describes how characters are represented by
  92 **code points**.  A code point is an integer value, usually denoted in
  93 base 16.  In the standard, a code point is written using the notation
  94 U+12ca to mean the character with value 0x12ca (4810 decimal).  The
  95 Unicode standard contains a lot of tables listing characters and their
  96 corresponding code points::
  97
  98         0061    'a'; LATIN SMALL LETTER A
  99         0062    'b'; LATIN SMALL LETTER B
 100         0063    'c'; LATIN SMALL LETTER C
 101         ...
 102         007B    '{'; LEFT CURLY BRACKET
 103
 104 Strictly, these definitions imply that it's meaningless to say 'this is
 105 character U+12ca'.  U+12ca is a code point, which represents some particular
 106 character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
 107 In informal contexts, this distinction between code points and characters will
 108 sometimes be forgotten.
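
As a small illustration of the distinction, the following sketch starts from
the character mentioned above, obtains its code point with ``ord()``, formats
that integer in ``U+`` notation, and looks up the character's name with
``unicodedata.name()``.  The variable names are illustrative only, and the
hex digits are printed in upper case purely as a matter of taste.

```python
import unicodedata

ch = u'\u12ca'                      # the character, written with a \u escape
code_point = ord(ch)                # a code point is just an integer
label = 'U+%04X' % code_point       # conventional notation for a code point
name = unicodedata.name(ch)         # name recorded in the Unicode database

print('%s = %d = %s' % (label, code_point, name))
```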
 109
 110 A character is represented on a screen or on paper by a set of graphical
 111 elements that's called a **glyph**.  The glyph for an uppercase A, for
 112 example, is two diagonal strokes and a horizontal stroke, though the exact
 113 details will depend on the font being used.  Most Python code doesn't need
 114 to worry about glyphs; figuring out the correct glyph to display is
 115 generally the job of a GUI toolkit or a terminal's font renderer.
 116
 117
 118 Encodings
 119 '''''''''
 120
To summarize the previous section:
a Unicode string is a sequence of code points, which are
numbers from 0 to 0x10ffff.  This sequence needs to be represented as
a sequence of bytes (meaning, values from 0-255) in memory.  The rules for
translating a Unicode string into a sequence of bytes are called an
**encoding**.
 127
 128 The first encoding you might think of is an array of 32-bit integers.
 129 In this representation, the string "Python" would look like this::
 130
 131        P           y           t           h           o           n
 132     0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
 133        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 134
 135 This representation is straightforward but using
 136 it presents a number of problems.
 137
 138 1. It's not portable; different processors order the bytes
 139    differently.
 140
2. It's very wasteful of space.  In most texts, the majority of the code
   points are less than 128, or less than 256, so a lot of space is occupied
   by zero bytes.  The above string takes 24 bytes compared to the 6
   bytes needed for an ASCII representation.  Increased RAM usage doesn't
   matter too much (desktop computers have megabytes of RAM, and strings
   aren't usually that large), but expanding our usage of disk and
   network bandwidth by a factor of 4 is intolerable.
 148
 149 3. It's not compatible with existing C functions such as ``strlen()``,
 150    so a new family of wide string functions would need to be used.
 151
 152 4. Many Internet standards are defined in terms of textual data, and
 153    can't handle content with embedded zero bytes.
 154
 155 Generally people don't use this encoding, choosing other encodings
 156 that are more efficient and convenient.
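
The portability and size problems can be made concrete with a short sketch.
It uses the ``struct`` module to pack the code points of "Python" as 32-bit
integers in both byte orders; using ``struct`` for this is only a convenient
way to build the byte layout shown above, not how any real codec is
implemented.  The ``b'...'`` literals in the comparison at the end of the
sketch need Python 2.6 or newer.

```python
import struct

text = u'Python'
code_points = [ord(c) for c in text]

# Pack each code point as an unsigned 32-bit integer, in both byte orders.
little_endian = struct.pack('<%dI' % len(code_points), *code_points)
big_endian = struct.pack('>%dI' % len(code_points), *code_points)
ascii_bytes = text.encode('ascii')

print(len(little_endian))               # 24 bytes: four per character
print(len(ascii_bytes))                 # 6 bytes: one per character
print(little_endian == big_endian)      # False: byte order changes the data
```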
 157
 158 Encodings don't have to handle every possible Unicode character, and
 159 most encodings don't.  For example, Python's default encoding is the
 160 'ascii' encoding.  The rules for converting a Unicode string into the
 161 ASCII encoding are simple; for each code point:
 162
 163 1. If the code point is <128, each byte is the same as the value of the
 164    code point.
 165
 166 2. If the code point is 128 or greater, the Unicode string can't
 167    be represented in this encoding.  (Python raises  a
 168    ``UnicodeEncodeError`` exception in this case.)
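
The two rules can be written out directly.  The helper below,
``encode_ascii()``, is a hypothetical name used only for this sketch; it
returns a list of byte values and raises a plain ``ValueError`` where the
real codec would raise ``UnicodeEncodeError``.

```python
def encode_ascii(text):
    """Apply the two ASCII rules to a Unicode string, one code point at a time."""
    values = []
    for ch in text:
        code_point = ord(ch)
        if code_point >= 128:
            # Rule 2: the string cannot be represented in ASCII.
            raise ValueError('code point %d is out of range' % code_point)
        # Rule 1: the byte value is the same as the code point.
        values.append(code_point)
    return values

plain = encode_ascii(u'abc')
print(plain)                        # [97, 98, 99]

try:
    encode_ascii(u'caf\xe9')        # U+00E9 is 233, which is >= 128
    accepted = True
except ValueError:
    accepted = False
print(accepted)                     # False
```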
 169
 170 Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode
 171 code points 0-255 are identical to the Latin-1 values, so converting
 172 to this encoding simply requires converting code points to byte
 173 values; if a code point larger than 255 is encountered, the string
 174 can't be encoded into Latin-1.
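
A brief sketch of the Latin-1 behaviour, using only the built-in
``'latin-1'`` codec: the byte values of the encoded string equal the code
points, and a code point above 255 is rejected.  ``bytearray`` is used here
merely to view the bytes as integers and assumes Python 2.6 or newer.

```python
text = u'caf\xe9'                       # U+00E9 lies within 0-255
encoded = text.encode('latin-1')

# Each byte value equals the corresponding code point.
byte_values = list(bytearray(encoded))
code_points = [ord(c) for c in text]
print(byte_values == code_points)       # True

try:
    u'\u20ac'.encode('latin-1')         # U+20AC is larger than 255
    euro_encodable = True
except UnicodeEncodeError:
    euro_encodable = False
print(euro_encodable)                   # False
```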
 175
 176 Encodings don't have to be simple one-to-one mappings like Latin-1.
 177 Consider IBM's EBCDIC, which was used on IBM mainframes.  Letter
 178 values weren't in one block: 'a' through 'i' had values from 129 to
 179 137, but 'j' through 'r' were 145 through 153.  If you wanted to use
 180 EBCDIC as an encoding, you'd probably use some sort of lookup table to
 181 perform the conversion, but this is largely an internal detail.
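
Python ships codecs for several EBCDIC code pages, so the gap between 'i'
and 'j' can be observed directly.  The choice of the ``cp500`` code page is
an assumption made for this sketch, and ``ebcdic_value()`` is a hypothetical
helper name.

```python
def ebcdic_value(letter):
    """Return the byte value of one letter in the cp500 EBCDIC code page."""
    return ord(letter.encode('cp500'))

first_block = [ebcdic_value(c) for c in u'ai']
second_block = [ebcdic_value(c) for c in u'jr']
print(first_block)      # [129, 137]
print(second_block)     # [145, 153]
```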
 182
 183 UTF-8 is one of the most commonly used encodings.  UTF stands for
 184 "Unicode Transformation Format", and the '8' means that 8-bit numbers
 185 are used in the encoding.  (There's also a UTF-16 encoding, but it's
 186 less frequently used than UTF-8.)  UTF-8 uses the following rules:
 187
 188 1. If the code point is <128, it's represented by the corresponding byte value.
 189 2. If the code point is between 128 and 0x7ff, it's turned into two byte values
 190    between 128 and 255.
 191 3. Code points >0x7ff are turned into three- or four-byte sequences, where
 192    each byte of the sequence is between 128 and 255.
 193
 194 UTF-8 has several convenient properties:
 195
1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded
   zero bytes.  This avoids byte-ordering issues, and means UTF-8 strings can
   be processed by C functions such as ``strcpy()`` and sent through
   protocols that can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters are
   encoded in one or two bytes, and values less than 128 occupy only a
   single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of
   the next UTF-8-encoded code point and resynchronize.  It's also unlikely
   that random 8-bit data will look like valid UTF-8.
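
The encoding rules and a couple of these properties can be checked with the
built-in ``'utf-8'`` codec.  The sample characters are arbitrary picks, one
from each length class; ``bytearray`` and ``all()`` are used only to inspect
the bytes and assume Python 2.6 or newer.

```python
samples = [u'a', u'\xe9', u'\u20ac', u'\U00010000']
lengths = [len(ch.encode('utf-8')) for ch in samples]
print(lengths)                                  # [1, 2, 3, 4]

# Every byte of a multi-byte sequence lies between 128 and 255.
multi = bytearray(u'\u20ac'.encode('utf-8'))
high_bytes_only = all(128 <= b <= 255 for b in multi)
print(high_bytes_only)                          # True

# ASCII text is unchanged, so it is already valid UTF-8.
same_as_ascii = u'Python'.encode('utf-8') == u'Python'.encode('ascii')
print(same_as_ascii)                            # True
```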
 201
 202
 203
 204 References
 205 ''''''''''''''
 206
 207 The Unicode Consortium site at <http://www.unicode.org> has character
 208 charts, a glossary, and PDF versions of the Unicode specification.  Be
 209 prepared for some difficult reading.
 210 <http://www.unicode.org/history/> is a chronology of the origin and
 211 development of Unicode.
 212
 213 To help understand the standard, Jukka Korpela has written an
 214 introductory guide to reading the Unicode character tables,
 215 available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
 216
Roman Czyborra wrote another explanation of Unicode's basic principles;
it's at <http://czyborra.com/unicode/characters.html>.
Czyborra has written a number of other Unicode-related documents,
available from <http://www.czyborra.com>.
 221
 222 Two other good introductory articles were written by Joel Spolsky
 223 <http://www.joelonsoftware.com/articles/Unicode.html> and Jason
 224 Orendorff <http://www.jorendorff.com/articles/unicode/>.  If this
 225 introduction didn't make things clear to you, you should try reading
 226 one of these alternate articles before continuing.
 227
 228 Wikipedia entries are often helpful; see the entries for "character
 229 encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
 230 <http://en.wikipedia.org/wiki/UTF-8>, for example.
 231
 232
 233 Python's Unicode Support
 234 ------------------------
 235
 236 Now that you've learned the rudiments of Unicode, we can look at
 237 Python's Unicode features.
 238
 239
 240 The Unicode Type
 241 '''''''''''''''''''
 242
Unicode strings are expressed as instances of the ``unicode`` type,
one of Python's repertoire of built-in types.  It derives from an
abstract type called ``basestring``, which is also an ancestor of the
``str`` type; you can therefore check if a value is a string type with
``isinstance(value, basestring)``.  Under the hood, Python represents
Unicode strings as sequences of either 16- or 32-bit integers, depending
on how the Python interpreter was compiled.
 250
The ``unicode()`` constructor has the signature
``unicode(string[, encoding, errors])``.  All of its arguments should be
8-bit strings.  The first argument is converted to Unicode using the
specified encoding; if you leave off the ``encoding`` argument, the ASCII
encoding is used for the conversion, so bytes with values greater than 127
will be treated as errors::
 256
 257     >>> unicode('abcdef')
 258     u'abcdef'
 259     >>> s = unicode('abcdef')
 260     >>> type(s)
 261     <type 'unicode'>
 262     >>> unicode('abcdef' + chr(255))
 263     Traceback (most recent call last):
 264       File "<stdin>", line 1, in ?
 265     UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
 266                         ordinal not in range(128)
 267
 268 The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules.  Legal values for this argument
 269 are 'strict' (raise a ``UnicodeDecodeError`` exception),
 270 'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'),
 271 or 'ignore' (just leave the character out of the Unicode result).
 272 The following examples show the differences::
 273
 274     >>> unicode('\x80abc', errors='strict')
 275     Traceback (most recent call last):
 276       File "<stdin>", line 1, in ?
 277     UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
 278                         ordinal not in range(128)
 279     >>> unicode('\x80abc', errors='replace')
 280     u'\ufffdabc'
 281     >>> unicode('\x80abc', errors='ignore')
 282     u'abc'
 283
 284 Encodings are specified as strings containing the encoding's name.
 285 Python 2.4 comes with roughly 100 different encodings; see the Python
 286 Library Reference at
 287 <http://docs.python.org/lib/standard-encodings.html> for a list.  Some
 288 encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
 289 and '8859' are all synonyms for the same encoding.
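
One way to confirm that several names refer to the same encoding is to
compare what ``codecs.lookup()`` returns for each of them.  This sketch
assumes a Python version in which the lookup result has a ``name``
attribute (2.5 or newer); the exact canonical name is deliberately not
shown, since only the fact that the three results agree matters here.

```python
import codecs

aliases = ['latin-1', 'iso_8859_1', '8859']
# codecs.lookup() resolves each alias to a codec with a canonical name.
canonical = [codecs.lookup(alias).name for alias in aliases]
same_codec = len(set(canonical)) == 1
print(same_codec)       # True: all three names select the same codec
```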
 290
One-character Unicode strings can also be created with the
``unichr()`` built-in function, which takes integers and returns a
Unicode string of length 1 that contains the corresponding code point.
The reverse operation is the built-in ``ord()`` function that takes a
one-character Unicode string and returns the code point value::
 296
 297     >>> unichr(40960)
 298     u'\ua000'
 299     >>> ord(u'\ua000')
 300     40960
 301
 302 Instances of the ``unicode`` type have many of the same methods as
 303 the 8-bit string type for operations such as searching and formatting::
 304
 305     >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
 306     >>> s.count('e')
 307     5
 308     >>> s.find('feather')
 309     9
 310     >>> s.find('bird')
 311     -1
 312     >>> s.replace('feather', 'sand')
 313     u'Was ever sand so lightly blown to and fro as this multitude?'
 314     >>> s.upper()
 315     u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
 316
 317 Note that the arguments to these methods can be Unicode strings or 8-bit strings.
 318 8-bit strings will be converted to Unicode before carrying out the operation;
 319 Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::
 320
 321     >>> s.find('Was\x9f')
 322     Traceback (most recent call last):
 323       File "<stdin>", line 1, in ?
 324     UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
 325     >>> s.find(u'Was\x9f')
 326     -1
 327
 328 Much Python code that operates on strings will therefore work with
 329 Unicode strings without requiring any changes to the code.  (Input and
 330 output code needs more updating for Unicode; more on this later.)
 331
 332 Another important method is ``.encode([encoding], [errors='strict'])``,
 333 which returns an 8-bit string version of the
 334 Unicode string, encoded in the requested encoding.  The ``errors``
 335 parameter is the same as the parameter of the ``unicode()``
 336 constructor, with one additional possibility; as well as 'strict',
 337 'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
 338 uses XML's character references.  The following example shows the
 339 different results::
 340
    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    '&#40960;abcd&#1972;'
 354
 355 Python's 8-bit strings have a ``.decode([encoding], [errors])`` method
 356 that interprets the string using the given encoding::
 357
 358     >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
 359     >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
 360     >>> type(utf8_version), utf8_version
 361     (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
 362     >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
 363     >>> u == u2                                      # The two strings match
 364     True
 365
 366 The low-level routines for registering and accessing the available
 367 encodings are found in the ``codecs`` module.  However, the encoding
 368 and decoding functions returned by this module are usually more
 369 low-level than is comfortable, so I'm not going to describe the
 370 ``codecs`` module here.  If you need to implement a completely new
 371 encoding, you'll need to learn about the ``codecs`` module interfaces,
 372 but implementing encodings is a specialized task that also won't be
 373 covered here.  Consult the Python documentation to learn more about
 374 this module.
 375
 376 The most commonly used part of the ``codecs`` module is the
 377 ``codecs.open()`` function which will be discussed in the section
 378 on input and output.
 379
 380
 381 Unicode Literals in Python Source Code
 382 ''''''''''''''''''''''''''''''''''''''''''
 383
 384 In Python source code, Unicode literals are written as strings
 385 prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``.  Specific
 386 code points can be written using the ``\u`` escape sequence, which is
 387 followed by four hex digits giving the code point.  The ``\U`` escape
 388 sequence is similar, but expects 8 hex digits, not 4.
 389
 390 Unicode literals can also use the same escape sequences as 8-bit
 391 strings, including ``\x``, but ``\x`` only takes two hex digits so it
 392 can't express an arbitrary code point.  Octal escapes can go up to
 393 U+01ff, which is octal 777.
 394
 395 ::
 396
 397     >>> s = u"a\xac\u1234\u20ac\U00008000"
 398                ^^^^ two-digit hex escape
 399                    ^^^^^^ four-digit Unicode escape
 400                                ^^^^^^^^^^ eight-digit Unicode escape
 401     >>> for c in s:  print ord(c),
 402     ...
 403     97 172 4660 8364 32768
 404
 405 Using escape sequences for code points greater than 127 is fine in
 406 small doses, but becomes an annoyance if you're using many accented
 407 characters, as you would in a program with messages in French or some
 408 other accent-using language.  You can also assemble strings using the
 409 ``unichr()`` built-in function, but this is even more tedious.
 410
 411 Ideally, you'd want to be able to write literals in your language's
 412 natural encoding.  You could then edit Python source code with your
 413 favorite editor which would display the accented characters naturally,
 414 and have the right characters used at runtime.
 415
 416 Python supports writing Unicode literals in any encoding, but you have
 417 to declare the encoding being used.  This is done by including a
 418 special comment as either the first or second line of the source
 419 file::
 420
 421     #!/usr/bin/env python
 422     # -*- coding: latin-1 -*-
 423
 424     u = u'abcdé'
 425     print ord(u[-1])
 426
 427 The syntax is inspired by Emacs's notation for specifying variables local to a file.
 428 Emacs supports many different variables, but Python only supports 'coding'.
 429 The ``-*-`` symbols indicate that the comment is special; within them,
 430 you must supply the name ``coding`` and the name of your chosen encoding,
 431 separated by ``':'``.
 432
 433 If you don't include such a comment, the default encoding used will be
 434 ASCII.  Versions of Python before 2.4 were Euro-centric and assumed
 435 Latin-1 as a default encoding for string literals; in Python 2.4,
 436 characters greater than 127 still work but result in a warning.  For
 437 example, the following program has no encoding declaration::
 438
 439     #!/usr/bin/env python
 440     u = u'abcdé'
 441     print ord(u[-1])
 442
 443 When you run it with Python 2.4, it will output the following warning::
 444
 445     amk:~$ python p263.py
 446     sys:1: DeprecationWarning: Non-ASCII character '\xe9'
 447          in file p263.py on line 2, but no encoding declared;
 448          see http://www.python.org/peps/pep-0263.html for details
 449
 450
 451 Unicode Properties
 452 '''''''''''''''''''
 453
The Unicode specification includes a database of information about
code points.  For each code point that's defined, the information
includes the character's name, its category, and the numeric value if
applicable (Unicode has characters representing the Roman numerals and
fractions such as one-third and four-fifths).  There are also
properties related to the code point's use in bidirectional text and
other display-related properties.
 461
 462 The following program displays some information about several
 463 characters, and prints the numeric value of one particular character::
 464
 465     import unicodedata
 466
 467     u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
 468
 469     for i, c in enumerate(u):
 470         print i, '%04x' % ord(c), unicodedata.category(c),
 471         print unicodedata.name(c)
 472
 473     # Get numeric value of second character
 474     print unicodedata.numeric(u[1])
 475
 476 When run, this prints::
 477
 478     0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
 479     1 0bf2 No TAMIL NUMBER ONE THOUSAND
 480     2 0f84 Mn TIBETAN MARK HALANTA
 481     3 1770 Lo TAGBANWA LETTER SA
 482     4 33af So SQUARE RAD OVER S SQUARED
 483     1000.0
 484
The category codes are abbreviations describing the nature of the
character.  These are grouped into categories such as "Letter",
"Number", "Punctuation", or "Symbol", which in turn are broken up into
subcategories.  To take the codes from the above output, ``'Ll'``
means "Letter, lowercase", ``'Lo'`` means "Letter, other", ``'No'``
means "Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is
"Symbol, other".  See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
for a list of category codes.
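
Because the first letter of a category code identifies the broad category,
a program can group characters without listing every subcategory.  In the
sketch below, ``MAJOR_CLASSES`` and ``major_class()`` are hypothetical names,
and the table covers only the categories named above; anything else is
lumped under ``'Other'`` as a simplification.

```python
import unicodedata

# Partial table: only the broad categories mentioned in the text.
MAJOR_CLASSES = {'L': 'Letter', 'M': 'Mark', 'N': 'Number',
                 'P': 'Punctuation', 'S': 'Symbol'}

def major_class(ch):
    """Map a character to its broad category via the code's first letter."""
    code = unicodedata.category(ch)
    return MAJOR_CLASSES.get(code[0], 'Other')

chars = u'\xe9\u0bf2\u0f84\u1770\u33af'     # same characters as the program above
classes = [major_class(c) for c in chars]
print(classes)
# ['Letter', 'Number', 'Mark', 'Letter', 'Symbol']
```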
 493
 494 References
 495 ''''''''''''''
 496
 497 The Unicode and 8-bit string types are described in the Python library
 498 reference at <http://docs.python.org/lib/typesseq.html>.
 499
 500 The documentation for the ``unicodedata`` module is at
 501 <http://docs.python.org/lib/module-unicodedata.html>.
 502
 503 The documentation for the ``codecs`` module is at
 504 <http://docs.python.org/lib/module-codecs.html>.
 505
 506 Marc-André Lemburg gave a presentation at EuroPython 2002
 507 titled "Python and Unicode".  A PDF version of his slides
 508 is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
 509 and is an excellent overview of the design of Python's Unicode features.
 510
 511
 512 Reading and Writing Unicode Data
 513 ----------------------------------------
 514
 515 Once you've written some code that works with Unicode data, the next
 516 problem is input/output.  How do you get Unicode strings into your
 517 program, and how do you convert Unicode into a form suitable for
 518 storage or transmission?
 519
 520 It's possible that you may not need to do anything depending on your
 521 input sources and output destinations; you should check whether the
 522 libraries used in your application support Unicode natively.  XML
 523 parsers often return Unicode data, for example.  Many relational
 524 databases also support Unicode-valued columns and can return Unicode
 525 values from an SQL query.
 526
 527 Unicode data is usually converted to a particular encoding before it
 528 gets written to disk or sent over a socket.  It's possible to do all
 529 the work yourself: open a file, read an 8-bit string from it, and
 530 convert the string with ``unicode(str, encoding)``.  However, the
 531 manual approach is not recommended.
 532
One problem is the multi-byte nature of encodings; one Unicode
character can be represented by several bytes.  If you want to read
the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
error-handling code to catch the case where only some of the bytes
encoding a single Unicode character have been read at the end of a chunk.
One solution would be to read the entire file into memory and then
perform the decoding, but that prevents you from working with files
that are extremely large; if you need to read a 2 GB file, you need 2 GB
of RAM.  (More, really, since for at least a moment you'd need to have
both the encoded string and its Unicode version in memory.)
 543
 544 The solution would be to use the low-level decoding interface to catch
 545 the case of partial coding sequences.   The work of implementing this
 546 has already been done for you: the ``codecs`` module includes a
 547 version of the ``open()`` function that returns a file-like object
 548 that assumes the file's contents are in a specified encoding and
 549 accepts Unicode parameters for methods such as ``.read()`` and
 550 ``.write()``.
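
To make the partial-sequence problem concrete, the sketch below splits some
UTF-8 data in the middle of a two-byte character.  Decoding the first chunk
on its own fails, while an incremental decoder holds the incomplete bytes
back until the rest arrives.  ``codecs.getincrementaldecoder()`` is used as
one example of a low-level interface and assumes a Python version that
provides it (2.5 or newer; the ``b''`` literal needs 2.6); the chunk
boundary is chosen by hand for illustration.

```python
import codecs

data = u'caf\xe9 \u4500'.encode('utf-8')
# Split so that the two bytes encoding U+00E9 land in different chunks.
chunks = [data[:4], data[4:]]

# Decoding a chunk in isolation trips over the incomplete sequence.
try:
    chunks[0].decode('utf-8')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False
print(naive_ok)                             # False

decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
for chunk in chunks:
    # An incomplete trailing sequence is buffered until more bytes arrive.
    pieces.append(decoder.decode(chunk))
pieces.append(decoder.decode(b'', True))    # final=True: no input may remain

result = u''.join(pieces)
print(result == u'caf\xe9 \u4500')          # True
```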
 551
The function's parameters are
``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``.
``mode`` can be ``'r'``, ``'w'``, or ``'a'``, just like the corresponding
parameter to the regular built-in ``open()`` function; add a ``'+'`` to
update the file.  ``buffering`` likewise parallels the standard
function's parameter.  ``encoding`` is a string giving
the encoding to use; if it's left as ``None``, a regular Python file
object that accepts 8-bit strings is returned.  Otherwise, a wrapper
object is returned, and data written to or read from the wrapper
object will be converted as needed.  ``errors`` specifies the action
for encoding errors and can be one of the usual values of 'strict',
'ignore', and 'replace'.
 565
 566 Reading Unicode from a file is therefore simple::
 567
 568     import codecs
 569     f = codecs.open('unicode.rst', encoding='utf-8')
 570     for line in f:
 571         print repr(line)
 572
 573 It's also possible to open files in update mode,
 574 allowing both reading and writing::
 575
 576     f = codecs.open('test', encoding='utf-8', mode='w+')
 577     f.write(u'\u4500 blah blah blah\n')
 578     f.seek(0)
 579     print repr(f.readline()[:1])
 580     f.close()
 581
 582 Unicode character U+FEFF is used as a byte-order mark (BOM),
 583 and is often written as the first character of a file in order
 584 to assist with autodetection of the file's byte ordering.
 585 Some encodings, such as UTF-16, expect a BOM to be present at
 586 the start of a file; when such an encoding is used,
 587 the BOM will be automatically written as the first character
 588 and will be silently dropped when the file is read.  There are
 589 variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
 590 for little-endian and big-endian encodings, that specify
 591 one particular byte ordering and don't
 592 skip the BOM.
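
The BOM handling can be observed without touching a file by encoding and
decoding a short string.  The sketch relies on the ``codecs.BOM_UTF16_LE``
and ``codecs.BOM_UTF16_BE`` constants; which of the two the plain
``'utf-16'`` codec writes depends on the machine, so the check accepts
either.

```python
import codecs

with_bom = u'abc'.encode('utf-16')          # BOM is written automatically
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(starts_with_bom)                      # True

# Decoding with 'utf-16' silently drops the BOM again.
round_trip = with_bom.decode('utf-16')
print(round_trip == u'abc')                 # True

# The fixed-order variant writes no BOM and does not skip one when reading.
little = u'abc'.encode('utf-16-le')
print(little == b'a\x00b\x00c\x00')         # True
kept = (codecs.BOM_UTF16_LE + little).decode('utf-16-le')
print(kept == u'\ufeffabc')                 # True
```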
 593
 594
 595 Unicode filenames
 596 '''''''''''''''''''''''''
 597
Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters.  Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system.  For example, Mac OS X uses UTF-8 while
Windows uses a configurable encoding; on Windows, Python uses the name
"mbcs" to refer to whatever the currently configured encoding is.  On
Unix systems, there will only be a filesystem encoding if you've set
the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
the default encoding is ASCII.
 607
 608 The ``sys.getfilesystemencoding()`` function returns the encoding to
 609 use on your current system, in case you want to do the encoding
 610 manually, but there's not much reason to bother.  When opening a file
 611 for reading or writing, you can usually just provide the Unicode
 612 string as the filename, and it will be automatically converted to the
 613 right encoding for you::
 614
 615     filename = u'filename\u4500abc'
 616     f = open(filename, 'w')
 617     f.write('blah\n')
 618     f.close()
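
If you do want to see what the automatic conversion would produce, the
manual equivalent looks roughly like the sketch below.  The fallback to
``'ascii'`` when no filesystem encoding is reported, and the use of
``errors='replace'`` so the sketch cannot fail on an ASCII-only system, are
assumptions made for illustration; real code would normally let the
conversion raise an error instead of silently altering a filename.

```python
import sys

# Fall back to ASCII if the platform reports no filesystem encoding.
fs_encoding = sys.getfilesystemencoding() or 'ascii'
print(fs_encoding)

filename = u'filename\u4500abc'
# 'replace' substitutes '?' where the encoding cannot represent U+4500.
encoded_name = filename.encode(fs_encoding, 'replace')
print(repr(encoded_name))
```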
 619
 620 Functions in the ``os`` module such as ``os.stat()`` will also accept
 621 Unicode filenames.
 622
 623 ``os.listdir()``, which returns filenames, raises an issue: should it
 624 return the Unicode version of filenames, or should it return 8-bit
 625 strings containing the encoded versions?  ``os.listdir()`` will do
 626 both, depending on whether you provided the directory path as an 8-bit
 627 string or a Unicode string.  If you pass a Unicode string as the path,
 628 filenames will be decoded using the filesystem's encoding and a list
 629 of Unicode strings will be returned, while passing an 8-bit path will
 630 return the 8-bit versions of the filenames.  For example, assuming the
 631 default filesystem encoding is UTF-8, running the following program::
 632
 633         fn = u'filename\u4500abc'
 634         f = open(fn, 'w')
 635         f.close()
 636
 637         import os
 638         print os.listdir('.')
 639         print os.listdir(u'.')
 640
 641 will produce the following output::
 642
 643         amk:~$ python t.py
 644         ['.svn', 'filename\xe4\x94\x80abc', ...]
 645         [u'.svn', u'filename\u4500abc', ...]
 646
 647 The first list contains UTF-8-encoded filenames, and the second list
 648 contains the Unicode versions.
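
The same behaviour can be tried in a scratch directory so that nothing in
the current directory is touched.  This sketch deliberately uses an
ASCII-only filename so that it works whatever the filesystem encoding is;
the point is only that the type of the returned names follows the type of
the path.  The temporary directory handling and the ``'ascii'`` fallback are
assumptions of the sketch (``bytes`` and ``b'...'`` need Python 2.6 or newer).

```python
import os
import shutil
import sys
import tempfile

fs_encoding = sys.getfilesystemencoding() or 'ascii'

text_dir = tempfile.mkdtemp()
if isinstance(text_dir, bytes):             # some versions return an 8-bit string
    text_dir = text_dir.decode(fs_encoding)
byte_dir = text_dir.encode(fs_encoding)

try:
    open(os.path.join(text_dir, u'example.txt'), 'w').close()
    text_names = os.listdir(text_dir)       # Unicode path -> Unicode names
    byte_names = os.listdir(byte_dir)       # 8-bit path -> 8-bit names
finally:
    shutil.rmtree(text_dir)                 # clean up the scratch directory

print(text_names == [u'example.txt'])       # True
print(byte_names == [b'example.txt'])       # True
```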
 649
 650
 651
 652 Tips for Writing Unicode-aware Programs
 653 ''''''''''''''''''''''''''''''''''''''''''''
 654
 655 This section provides some suggestions on writing software that
 656 deals with Unicode.
 657
 658 The most important tip is:
 659
 660     Software should only work with Unicode strings internally,
 661     converting to a particular encoding on output.
 662
If you attempt to write processing functions that accept both
Unicode and 8-bit strings, you will find your program vulnerable to
bugs wherever you combine the two different kinds of strings.  Python's
default encoding is ASCII, so whenever an 8-bit string containing a byte
with a value >127 is combined with a Unicode string, you'll get a
``UnicodeDecodeError`` because that byte can't be handled by the ASCII
encoding.
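
As a rough illustration of this tip, the sketch below decodes 8-bit
input once, works on Unicode, and encodes once on the way out.  It also
performs explicitly the ASCII decode that Python 2 attempts implicitly
when the two string types are combined, to show where the
``UnicodeDecodeError`` comes from.  The ``shout()`` function and the
assumption that the input is UTF-8 are purely illustrative.

```python
raw = b'caf\xc3\xa9'    # UTF-8 bytes, as they might arrive from a file

# What the implicit conversion amounts to: an ASCII decode, which fails
# as soon as a byte with a value >127 shows up.
try:
    raw.decode('ascii')
    implicit_ok = True
except UnicodeDecodeError:
    implicit_ok = False

def shout(text):
    """Example processing step that only ever sees Unicode strings."""
    return text.upper() + u'!'

text = raw.decode('utf-8')         # decode once, at the input boundary
result = shout(text)               # work with Unicode internally
output = result.encode('utf-8')    # encode once, at the output boundary
```

Because ``shout()`` never receives an 8-bit string, it has no place
where the two kinds of strings can be mixed by accident.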
 669
It's easy to miss such problems if you only test your software with
data that contains nothing but unaccented ASCII characters; everything
will seem to work, but there's actually a bug in your program waiting
for the first user who attempts to use characters >127.  A second tip,
therefore, is:
 675
 676     Include characters >127 and, even better, characters >255 in your
 677     test data.
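
One way to see why characters >255 are worth including is to push a few
sample strings through several encodings.  This is a minimal sketch; the
``SAMPLES`` list and the ``survives()`` helper are made up for the
example and would be replaced by whatever your own test suite exercises.

```python
SAMPLES = [
    u'plain',               # ASCII only
    u'caf\xe9',             # contains a character >127
    u'filename\u4500abc',   # contains a character >255
]

def survives(text, encoding):
    """Return True if text can be encoded and decoded back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False

report = {}
for encoding in ('ascii', 'latin-1', 'utf-8'):
    report[encoding] = [survives(sample, encoding) for sample in SAMPLES]
```

Only the last sample distinguishes Latin-1 from UTF-8, so a test suite
that stops at accented Latin letters would not notice code that quietly
assumes Latin-1.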
 678
 679 When using data coming from a web browser or some other untrusted source,
 680 a common technique is to check for illegal characters in a string
 681 before using the string in a generated command line or storing it in a
 682 database.  If you're doing this, be careful to check
 683 the string once it's in the form that will be used or stored; it's
 684 possible for encodings to be used to disguise characters.  This is especially
 685 true if the input data also specifies the encoding;
 686 many encodings leave the commonly checked-for characters alone,
 687 but Python includes some encodings such as ``'base64'``
 688 that modify every single character.
 689
For example, let's say you have a content management system that takes
an encoded filename together with the name of its encoding, and you want
to disallow paths with a '/' character.  You might write this code::
 693
    def read_file(filename, encoding):
        # BUG: this check runs on the still-encoded 8-bit string.
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...
 700
However, if an attacker could specify the ``'base64'`` encoding,
they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base64-encoded
form of the string ``'/etc/passwd'``, to read a system file.  The
above code looks for ``'/'`` characters in the encoded form and misses
the dangerous characters in the resulting decoded form.
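
A safer ordering is to decode first and check the decoded result.  The
sketch below shows only that reordering; ``checked_name()`` is a
hypothetical helper, not an existing function.  To keep the snippet
behaving the same on Python 3, where ``'base64'`` is not accepted by
the ``decode()`` method of byte strings, the disguised input uses
``'utf-7'`` instead, in which ``+AC8-`` stands for ``'/'``.

```python
def checked_name(raw_name, encoding):
    """Decode raw_name first, then validate the decoded result."""
    unicode_name = raw_name.decode(encoding)
    if u'/' in unicode_name:
        raise ValueError("'/' not allowed in filenames")
    return unicode_name

# UTF-7 spelling of u'/etc/passwd'; it contains no literal '/' byte,
# so a check on the encoded form would let it through.
disguised = b'+AC8-etc+AC8-passwd'

try:
    checked_name(disguised, 'utf-7')
    rejected = False
except ValueError:
    rejected = True
```

Here the check sees the string in the form that would actually be
passed to ``open()``, so the disguised slashes are caught.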
 707
 708 References
 709 ''''''''''''''
 710
 711 The PDF slides for Marc-André Lemburg's presentation "Writing
 712 Unicode-aware Applications in Python" are available at
 713 <http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
 714 and discuss questions of character encodings as well as how to
 715 internationalize and localize an application.
 716
 717
 718 Revision History and Acknowledgements
 719 ------------------------------------------
 720
 721 Thanks to the following people who have noted errors or offered
 722 suggestions on this article: Nicholas Bastin,
 723 Marius Gedminas, Kent Johnson, Ken Krugler,
 724 Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
 725
 726 Version 1.0: posted August 5 2005.
 727
 728 Version 1.01: posted August 7 2005.  Corrects factual and markup
 729 errors; adds several links.
 730
 731 Version 1.02: posted August 16 2005.  Corrects factual errors.
 732
 733
 734 .. comment Additional topic: building Python w/ UCS2 or UCS4 support
 735 .. comment Describe obscure -U switch somewhere?
 736 .. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
 737
 738 .. comment
 739    Original outline:
 740
 741    - [ ] Unicode introduction
 742        - [ ] ASCII
 743        - [ ] Terms
 744            - [ ] Character
 745            - [ ] Code point
 746          - [ ] Encodings
 747             - [ ] Common encodings: ASCII, Latin-1, UTF-8
 748        - [ ] Unicode Python type
 749            - [ ] Writing unicode literals
 750                - [ ] Obscurity: -U switch
 751            - [ ] Built-ins
 752                - [ ] unichr()
 753                - [ ] ord()
 754                - [ ] unicode() constructor
 755            - [ ] Unicode type
 756                - [ ] encode(), decode() methods
 757        - [ ] Unicodedata module for character properties
 758        - [ ] I/O
 759            - [ ] Reading/writing Unicode data into files
 760                - [ ] Byte-order marks
 761            - [ ] Unicode filenames
 762        - [ ] Writing Unicode programs
 763            - [ ] Do everything in Unicode
 764            - [ ] Declaring source code encodings (PEP 263)
 765        - [ ] Other issues
 766            - [ ] Building Python (UCS2, UCS4)