Doc/lib/libcsv.tex

   1 \section{\module{csv} --- CSV File Reading and Writing}
   2
   3 \declaremodule{standard}{csv}
   4 \modulesynopsis{Write and read tabular data to and from delimited files.}
   5 \sectionauthor{Skip Montanaro}{skip@pobox.com}
   6
   7 \versionadded{2.3}
   8 \index{csv}
   9 \indexii{data}{tabular}
  10
  11 The so-called CSV (Comma Separated Values) format is the most common import
  12 and export format for spreadsheets and databases.  There is no ``CSV
  13 standard'', so the format is operationally defined by the many applications
  14 which read and write it.  The lack of a standard means that subtle
  15 differences often exist in the data produced and consumed by different
  16 applications.  These differences can make it annoying to process CSV files
  17 from multiple sources.  Still, while the delimiters and quoting characters
  18 vary, the overall format is similar enough that it is possible to write a
  19 single module which can efficiently manipulate such data, hiding the details
  20 of reading and writing the data from the programmer.
  21
  22 The \module{csv} module implements classes to read and write tabular data in
  23 CSV format.  It allows programmers to say, ``write this data in the format
  24 preferred by Excel,'' or ``read data from this file which was generated by
  25 Excel,'' without knowing the precise details of the CSV format used by
  26 Excel.  Programmers can also describe the CSV formats understood by other
  27 applications or define their own special-purpose CSV formats.
  28
  29 The \module{csv} module's \class{reader} and \class{writer} objects read and
  30 write sequences.  Programmers can also read and write data in dictionary
  31 form using the \class{DictReader} and \class{DictWriter} classes.
  32
  33 \begin{notice}
  34   This version of the \module{csv} module doesn't support Unicode
  35   input.  Also, there are currently some issues regarding \ASCII{} NUL
  36   characters.  Accordingly, all input should be UTF-8 or printable
  37   \ASCII{} to be safe; see the examples in section~\ref{csv-examples}.
  38   These restrictions will be removed in the future.
  39 \end{notice}
  40
  41 \begin{seealso}
  42 %  \seemodule{array}{Arrays of uniformly types numeric values.}
  43   \seepep{305}{CSV File API}
  44          {The Python Enhancement Proposal which proposed this addition
  45           to Python.}
  46 \end{seealso}
  47
  48
  49 \subsection{Module Contents \label{csv-contents}}
  50
  51 The \module{csv} module defines the following functions:
  52
  53 \begin{funcdesc}{reader}{csvfile\optional{,
  54                          dialect=\code{'excel'}}\optional{, fmtparam}}
  55 Return a reader object which will iterate over lines in the given
  56 {}\var{csvfile}.  \var{csvfile} can be any object which supports the
  57 iterator protocol and returns a string each time its \method{next}
  58 method is called --- file objects and list objects are both suitable.
  59 If \var{csvfile} is a file object, it must be opened with
  60 the 'b' flag on platforms where that makes a difference.  An optional
  61 {}\var{dialect} parameter can be given
  62 which is used to define a set of parameters specific to a particular CSV
  63 dialect.  It may be an instance of a subclass of the \class{Dialect}
  64 class or one of the strings returned by the \function{list_dialects}
  65 function.  The other optional {}\var{fmtparam} keyword arguments can be
  66 given to override individual formatting parameters in the current
  67 dialect.  For full details about the dialect and formatting
  68 parameters, see section~\ref{csv-fmt-params}, ``Dialects and Formatting
  69 Parameters''.
  70
  71 All data read are returned as strings.  No automatic data type
  72 conversion is performed.
  73
  74 \versionchanged[
  75 The parser is now stricter with respect to multi-line quoted
  76 fields. Previously, if a line ended within a quoted field without a
  77 terminating newline character, a newline would be inserted into the
  78 returned field. This behavior caused problems when reading files
  79 which contained carriage return characters within fields.  The
  80 behavior was changed to return the field without inserting newlines. As
  81 a consequence, if newlines embedded within fields are important, the
  82 input should be split into lines in a manner which preserves the newline
  83 characters]{2.5}
  84
  85 \end{funcdesc}
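
As a minimal sketch of the call described above, the following reads from a
list of strings rather than a file and overrides one formatting parameter by
keyword.  The sample data and the \code{'|'} delimiter are made up for
illustration.

```python
import csv

# Any iterable that yields one string per line works; a list stands in
# for a file object here.
lines = ['name|qty', 'widget|3', '"a|b"|7']

# delimiter is a fmtparam keyword that overrides the default 'excel' dialect.
rows = list(csv.reader(lines, delimiter='|'))

# Every field comes back as a string; '3' is not converted to an integer.
for row in rows:
    print(row)
```

The quoted field in the last line keeps its embedded delimiter because the
default \var{quotechar} is still in effect.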
  86
  87 \begin{funcdesc}{writer}{csvfile\optional{,
  88                          dialect=\code{'excel'}}\optional{, fmtparam}}
  89 Return a writer object responsible for converting the user's data into
  90 delimited strings on the given file-like object.  \var{csvfile} can be any
  91 object with a \function{write} method.  If \var{csvfile} is a file object,
  92 it must be opened with the 'b' flag on platforms where that makes a
  93 difference.  An optional
  94 {}\var{dialect} parameter can be given which is used to define a set of
  95 parameters specific to a particular CSV dialect.  It may be an instance
  96 of a subclass of the \class{Dialect} class or one of the strings
  97 returned by the \function{list_dialects} function.  The other optional
  98 {}\var{fmtparam} keyword arguments can be given to override individual
  99 formatting parameters in the current dialect.  For full details
 100 about the dialect and formatting parameters, see
 101 section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters''.
 102 To make it as easy as possible to
 103 interface with modules which implement the DB API, the value
 104 \constant{None} is written as the empty string.  While this isn't a
 105 reversible transformation, it makes it easier to dump SQL NULL data values
 106 to CSV files without preprocessing the data returned from a
 107 \code{cursor.fetch*()} call.  All other non-string data are stringified
 108 with \function{str()} before being written.
 109 \end{funcdesc}
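
The handling of \constant{None} and of non-string values can be sketched as
follows.  The \class{LineCollector} class is a hypothetical stand-in for a
file object, used only to show that a \method{write} method is all the writer
needs.

```python
import csv

class LineCollector(object):
    """Hypothetical minimal file-like object: only write() is provided."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

sink = LineCollector()
writer = csv.writer(sink)
writer.writerow(['id', 'price', 'note'])
# The integer and float are stringified; None becomes an empty field.
writer.writerow([1, 2.5, None])

output = ''.join(sink.chunks)
print(repr(output))
```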
 110
 111 \begin{funcdesc}{register_dialect}{name\optional{, dialect}\optional{, fmtparam}}
 112 Associate \var{dialect} with \var{name}.  \var{name} must be a string
 113 or Unicode object. The dialect can be specified either by passing a
  114 subclass of \class{Dialect}, or by \var{fmtparam} keyword arguments,
 115 or both, with keyword arguments overriding parameters of the dialect.
 116 For full details about the dialect and formatting parameters, see
 117 section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters''.
 118 \end{funcdesc}
 119
 120 \begin{funcdesc}{unregister_dialect}{name}
 121 Delete the dialect associated with \var{name} from the dialect registry.  An
 122 \exception{Error} is raised if \var{name} is not a registered dialect
 123 name.
 124 \end{funcdesc}
 125
 126 \begin{funcdesc}{get_dialect}{name}
 127 Return the dialect associated with \var{name}.  An \exception{Error} is
 128 raised if \var{name} is not a registered dialect name.
 129
 130 \versionchanged[
 131 This function now returns an immutable \class{Dialect}.  Previously an
 132 instance of the requested dialect was returned.  Users could modify the
 133 underlying class, changing the behavior of active readers and writers.]{2.5}
 134 \end{funcdesc}
 135
 136 \begin{funcdesc}{list_dialects}{}
 137 Return the names of all registered dialects.
 138 \end{funcdesc}
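
The registry functions above fit together roughly as in this sketch.  The
dialect name \code{'pipes'} and its parameters are invented for the example.

```python
import csv

# Register a dialect purely from keyword arguments.
csv.register_dialect('pipes', delimiter='|', quoting=csv.QUOTE_NONE)

registered = 'pipes' in csv.list_dialects()
delimiter = csv.get_dialect('pipes').delimiter

# A registered name can be passed wherever a dialect is expected.
rows = list(csv.reader(['a|b|c'], 'pipes'))

# After unregistering, looking the name up again raises csv.Error.
csv.unregister_dialect('pipes')
try:
    csv.get_dialect('pipes')
    lookup_failed = False
except csv.Error:
    lookup_failed = True
```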
 139
 140 \begin{funcdesc}{field_size_limit}{\optional{new_limit}}
  141   Return the current maximum field size allowed by the parser. If
 142   \var{new_limit} is given, this becomes the new limit.
 143   \versionadded{2.5}
 144 \end{funcdesc}
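
A small sketch of how the limit might be lowered temporarily and then
restored.  The limit of 64 and the 100-character field are arbitrary values
chosen for the example.

```python
import csv

# Remember the current limit so that it can be restored afterwards.
old_limit = csv.field_size_limit()

# Setting a new limit returns the limit that was in effect before the call.
previous = csv.field_size_limit(64)

try:
    # A single 100-character field exceeds the new limit.
    list(csv.reader(['x' * 100]))
    too_long = False
except csv.Error:
    too_long = True

# Restore the original limit; the setting is module-wide.
csv.field_size_limit(old_limit)
```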
 145
 146
 147 The \module{csv} module defines the following classes:
 148
 149 \begin{classdesc}{DictReader}{csvfile\optional{,
  150                               fieldnames=\constant{None}\optional{,
 151                               restkey=\constant{None}\optional{,
 152                               restval=\constant{None}\optional{,
 153                               dialect=\code{'excel'}\optional{,
 154                               *args, **kwds}}}}}}
  155 Create an object which operates like a regular reader but maps the
  156 information read into a dict whose keys are given by the optional
  157 {}\var{fieldnames} parameter.  If the \var{fieldnames} parameter is
  158 omitted, the values in the first row of the \var{csvfile} will be used
  159 as the fieldnames.  If the row read has more fields than the fieldnames
  160 sequence, the remaining data is added as a sequence keyed by the value
  161 of \var{restkey}.  If the row read has fewer fields than the fieldnames
  162 sequence, the remaining keys take the value of the optional
  163 {}\var{restval} parameter.  Any other optional or keyword arguments are
  164 passed to the underlying \class{reader} instance.
 167 \end{classdesc}
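
The effect of \var{restkey} and \var{restval} on rows that are too long or
too short can be sketched like this.  The sample rows and the values
\code{'overflow'} and \code{'n/a'} are made up.

```python
import csv

lines = [
    'name,qty',                 # first row supplies the fieldnames
    'widget,3,extra1,extra2',   # two more fields than fieldnames
    'gadget',                   # one field fewer than fieldnames
]

reader = csv.DictReader(lines, restkey='overflow', restval='n/a')
records = [dict(row) for row in reader]

for record in records:
    print(record)
```

The surplus fields of the second line are collected in a list under the
\code{'overflow'} key, and the missing \code{'qty'} field of the third line
is filled with \code{'n/a'}.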
 168
 169
 170 \begin{classdesc}{DictWriter}{csvfile, fieldnames\optional{,
 171                               restval=""\optional{,
 172                               extrasaction=\code{'raise'}\optional{,
 173                               dialect=\code{'excel'}\optional{,
 174                               *args, **kwds}}}}}
 175 Create an object which operates like a regular writer but maps dictionaries
 176 onto output rows.  The \var{fieldnames} parameter identifies the order in
 177 which values in the dictionary passed to the \method{writerow()} method are
 178 written to the \var{csvfile}.  The optional \var{restval} parameter
 179 specifies the value to be written if the dictionary is missing a key in
 180 \var{fieldnames}.  If the dictionary passed to the \method{writerow()}
 181 method contains a key not found in \var{fieldnames}, the optional
 182 \var{extrasaction} parameter indicates what action to take.  If it is set
 183 to \code{'raise'} a \exception{ValueError} is raised.  If it is set to
 184 \code{'ignore'}, extra values in the dictionary are ignored.  Any other
 185 optional or keyword arguments are passed to the underlying \class{writer}
 186 instance.
 187
 188 Note that unlike the \class{DictReader} class, the \var{fieldnames}
 189 parameter of the \class{DictWriter} is not optional.  Since Python's
 190 \class{dict} objects are not ordered, there is not enough information
 191 available to deduce the order in which the row should be written to the
 192 \var{csvfile}.
 193
 194 \end{classdesc}
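
The two \var{extrasaction} settings and the \var{restval} default might be
exercised as in the following sketch.  \class{LineCollector} is a
hypothetical stand-in for a file object, and the field names are invented.

```python
import csv

class LineCollector(object):
    """Hypothetical minimal file-like object: only write() is provided."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

sink = LineCollector()
lenient = csv.DictWriter(sink, fieldnames=['name', 'qty'],
                         restval='0', extrasaction='ignore')
# The 'color' key is not in fieldnames and is silently dropped.
lenient.writerow({'name': 'widget', 'qty': 3, 'color': 'red'})
# The missing 'qty' key is filled in with restval.
lenient.writerow({'name': 'gadget'})
output = ''.join(sink.chunks)

# With the default extrasaction='raise', an unknown key is an error.
strict = csv.DictWriter(LineCollector(), fieldnames=['name', 'qty'])
try:
    strict.writerow({'name': 'widget', 'color': 'red'})
    raised = False
except ValueError:
    raised = True
```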
 195
 196 \begin{classdesc*}{Dialect}{}
 197 The \class{Dialect} class is a container class relied on primarily for its
 198 attributes, which are used to define the parameters for a specific
 199 \class{reader} or \class{writer} instance.
 200 \end{classdesc*}
 201
 202 \begin{classdesc}{excel}{}
 203 The \class{excel} class defines the usual properties of an Excel-generated
 204 CSV file.  It is registered with the dialect name \code{'excel'}.
 205 \end{classdesc}
 206
 207 \begin{classdesc}{excel_tab}{}
 208 The \class{excel_tab} class defines the usual properties of an
 209 Excel-generated TAB-delimited file.  It is registered with the dialect name
 210 \code{'excel-tab'}.
 211 \end{classdesc}
 212
 213 \begin{classdesc}{Sniffer}{}
 214 The \class{Sniffer} class is used to deduce the format of a CSV file.
 215 \end{classdesc}
 216
 217 The \class{Sniffer} class provides two methods:
 218
 219 \begin{methoddesc}{sniff}{sample\optional{,delimiters=None}}
 220 Analyze the given \var{sample} and return a \class{Dialect} subclass
 221 reflecting the parameters found.  If the optional \var{delimiters} parameter
 222 is given, it is interpreted as a string containing possible valid delimiter
 223 characters.
 224 \end{methoddesc}
 225
 226 \begin{methoddesc}{has_header}{sample}
 227 Analyze the sample text (presumed to be in CSV format) and return
 228 \constant{True} if the first row appears to be a series of column
 229 headers.
 230 \end{methoddesc}
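
A minimal sketch of \method{sniff} on a small, regular sample.  The sample
text is invented, and since the sniffer works heuristically, less regular
input may be guessed differently.

```python
import csv

sample = 'a;b;c\n1;2;3\n4;5;6\n'

# Restrict the candidate delimiters to ';' and ','.
dialect = csv.Sniffer().sniff(sample, delimiters=';,')

# The returned Dialect subclass can be handed straight to a reader.
rows = list(csv.reader(sample.splitlines(), dialect))
print(dialect.delimiter)
```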
 231
 232
 233 The \module{csv} module defines the following constants:
 234
 235 \begin{datadesc}{QUOTE_ALL}
 236 Instructs \class{writer} objects to quote all fields.
 237 \end{datadesc}
 238
 239 \begin{datadesc}{QUOTE_MINIMAL}
 240 Instructs \class{writer} objects to only quote those fields which contain
 241 special characters such as \var{delimiter}, \var{quotechar} or any of the
 242 characters in \var{lineterminator}.
 243 \end{datadesc}
 244
 245 \begin{datadesc}{QUOTE_NONNUMERIC}
 246 Instructs \class{writer} objects to quote all non-numeric
 247 fields.
 248
 249 Instructs the reader to convert all non-quoted fields to type \var{float}.
 250 \end{datadesc}
 251
 252 \begin{datadesc}{QUOTE_NONE}
 253 Instructs \class{writer} objects to never quote fields.  When the current
 254 \var{delimiter} occurs in output data it is preceded by the current
 255 \var{escapechar} character.  If \var{escapechar} is not set, the writer
 256 will raise \exception{Error} if any characters that require escaping
 257 are encountered.
 258
 259 Instructs \class{reader} to perform no special processing of quote characters.
 260 \end{datadesc}
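
The four quoting constants can be compared by writing the same row under
each of them, as sketched below.  The helper \function{render} and the
\class{Sink} class are hypothetical names introduced only for this example.

```python
import csv

class Sink(object):
    """Hypothetical minimal file-like object: only write() is provided."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

def render(row, **fmtparams):
    """Write one row with the given formatting parameters; return the text."""
    sink = Sink()
    csv.writer(sink, **fmtparams).writerow(row)
    return ''.join(sink.chunks)

row = ['a,b', 7, 'plain']
minimal = render(row, quoting=csv.QUOTE_MINIMAL)
quote_all = render(row, quoting=csv.QUOTE_ALL)
nonnumeric = render(row, quoting=csv.QUOTE_NONNUMERIC)
# QUOTE_NONE needs an escapechar because the first field holds a delimiter.
unquoted = render(row, quoting=csv.QUOTE_NONE, escapechar='\\')

# On input, QUOTE_NONNUMERIC turns the unquoted field into a float.
parsed = list(csv.reader([nonnumeric], quoting=csv.QUOTE_NONNUMERIC))[0]
```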
 261
 262
 263 The \module{csv} module defines the following exception:
 264
 265 \begin{excdesc}{Error}
 266 Raised by any of the functions when an error is detected.
 267 \end{excdesc}
 268
 269
 270 \subsection{Dialects and Formatting Parameters\label{csv-fmt-params}}
 271
 272 To make it easier to specify the format of input and output records,
 273 specific formatting parameters are grouped together into dialects.  A
 274 dialect is a subclass of the \class{Dialect} class having a set of specific
  275 attributes and a single \method{validate()} method.  When creating \class{reader}
 276 or \class{writer} objects, the programmer can specify a string or a subclass
 277 of the \class{Dialect} class as the dialect parameter.  In addition to, or
 278 instead of, the \var{dialect} parameter, the programmer can also specify
 279 individual formatting parameters, which have the same names as the
 280 attributes defined below for the \class{Dialect} class.
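
As a sketch of both approaches, the following defines a dialect as a
\class{Dialect} subclass and then overrides one of its attributes with a
keyword argument.  The class name \class{SemicolonDialect} and the sample
line are hypothetical.

```python
import csv

class SemicolonDialect(csv.Dialect):
    """Hypothetical dialect: semicolon-delimited, newline-terminated."""
    delimiter = ';'
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\n'
    quoting = csv.QUOTE_MINIMAL

line = ['a;"b;c"; d']

# Use the dialect exactly as defined: the space before 'd' is kept.
as_defined = list(csv.reader(line, SemicolonDialect))[0]

# Same dialect, with a single formatting parameter overridden by keyword.
overridden = list(csv.reader(line, SemicolonDialect,
                             skipinitialspace=True))[0]
```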
 281
 282 Dialects support the following attributes:
 283
 284 \begin{memberdesc}[Dialect]{delimiter}
 285 A one-character string used to separate fields.  It defaults to \code{','}.
 286 \end{memberdesc}
 287
 288 \begin{memberdesc}[Dialect]{doublequote}
 289 Controls how instances of \var{quotechar} appearing inside a field should
  290 themselves be quoted.  When \constant{True}, the character is doubled.
 291 When \constant{False}, the \var{escapechar} is used as a prefix to the
 292 \var{quotechar}.  It defaults to \constant{True}.
 293
 294 On output, if \var{doublequote} is \constant{False} and no
 295 \var{escapechar} is set, \exception{Error} is raised if a \var{quotechar}
 296 is found in a field.
 297 \end{memberdesc}
 298
 299 \begin{memberdesc}[Dialect]{escapechar}
 300 A one-character string used by the writer to escape the \var{delimiter} if
 301 \var{quoting} is set to \constant{QUOTE_NONE} and the \var{quotechar}
 302 if \var{doublequote} is \constant{False}. On reading, the \var{escapechar}
 303 removes any special meaning from the following character. It defaults
 304 to \constant{None}, which disables escaping.
 305 \end{memberdesc}
 306
 307 \begin{memberdesc}[Dialect]{lineterminator}
 308 The string used to terminate lines produced by the \class{writer}.
 309 It defaults to \code{'\e r\e n'}.
 310
 311 \note{The \class{reader} is hard-coded to recognise either \code{'\e r'}
 312 or \code{'\e n'} as end-of-line, and ignores \var{lineterminator}. This
 313 behavior may change in the future.}
 314 \end{memberdesc}
 315
 316 \begin{memberdesc}[Dialect]{quotechar}
 317 A one-character string used to quote fields containing special characters,
 318 such as the \var{delimiter} or \var{quotechar}, or which contain new-line
 319 characters.  It defaults to \code{'"'}.
 320 \end{memberdesc}
 321
 322 \begin{memberdesc}[Dialect]{quoting}
 323 Controls when quotes should be generated by the writer and recognised
 324 by the reader.  It can take on any of the \constant{QUOTE_*} constants
 325 (see section~\ref{csv-contents}) and defaults to \constant{QUOTE_MINIMAL}.
 326 \end{memberdesc}
 327
 328 \begin{memberdesc}[Dialect]{skipinitialspace}
 329 When \constant{True}, whitespace immediately following the \var{delimiter}
 330 is ignored.  The default is \constant{False}.
 331 \end{memberdesc}
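
The interaction between \var{doublequote} and \var{escapechar} might be
checked as in this sketch.  The helper \function{render} and the
\class{Sink} class are hypothetical names, and the backslash is just one
possible choice of escape character.

```python
import csv

class Sink(object):
    """Hypothetical minimal file-like object: only write() is provided."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

def render(row, **fmtparams):
    """Write one row with the given formatting parameters; return the text."""
    sink = Sink()
    csv.writer(sink, **fmtparams).writerow(row)
    return ''.join(sink.chunks)

field = 'say "hi"'

# Default doublequote=True: embedded quote characters are doubled.
doubled = render([field])

# doublequote=False with an escapechar: the quotes are escaped instead,
# and a reader using the same parameters recovers the original field.
escaped = render([field], doublequote=False, escapechar='\\')
round_trip = list(csv.reader([escaped], doublequote=False,
                             escapechar='\\'))[0]

# doublequote=False without an escapechar cannot represent the field.
try:
    render([field], doublequote=False)
    raised = False
except csv.Error:
    raised = True
```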
 332
 333
 334 \subsection{Reader Objects}
 335
 336 Reader objects (\class{DictReader} instances and objects returned by
 337 the \function{reader()} function) have the following public methods:
 338
 339 \begin{methoddesc}[csv reader]{next}{}
 340 Return the next row of the reader's iterable object as a list, parsed
 341 according to the current dialect.
 342 \end{methoddesc}
 343
 344 Reader objects have the following public attributes:
 345
 346 \begin{memberdesc}[csv reader]{dialect}
 347 A read-only description of the dialect in use by the parser.
 348 \end{memberdesc}
 349
 350 \begin{memberdesc}[csv reader]{line_num}
 351  The number of lines read from the source iterator. This is not the same
 352  as the number of records returned, as records can span multiple lines.
 353  \versionadded{2.5}
 354 \end{memberdesc}
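
The difference between \member{line_num} and the number of records can be
seen with a quoted field that spans two lines, as in this sketch with
made-up input.

```python
import csv

# The second record contains a newline inside a quoted field,
# so it occupies two lines of the source.
lines = [
    'id,comment\n',
    '1,"first line\n',
    'second line"\n',
    '2,short\n',
]

reader = csv.reader(lines)
seen = []
for row in reader:
    # line_num counts source lines consumed so far, not records returned.
    seen.append((reader.line_num, row))
```

Three records are returned, but \member{line_num} has reached 4 by the time
the last one is produced.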
 355
 356
 357 \subsection{Writer Objects}
 358
 359 \class{Writer} objects (\class{DictWriter} instances and objects returned by
 360 the \function{writer()} function) have the following public methods.  A
  361 {}\var{row} must be a sequence of strings or numbers for \class{Writer}
  362 objects and a dictionary mapping fieldnames to strings or numbers for
  363 {}\class{DictWriter} objects; numbers are converted by passing them
  364 through \function{str()} first.  Note that complex numbers are written
  365 out surrounded by parentheses.  This may cause problems for other programs
  366 which read CSV files (assuming they support complex numbers at all).
 367
 368 \begin{methoddesc}[csv writer]{writerow}{row}
 369 Write the \var{row} parameter to the writer's file object, formatted
 370 according to the current dialect.
 371 \end{methoddesc}
 372
 373 \begin{methoddesc}[csv writer]{writerows}{rows}
  374 Write all the rows in the \var{rows} parameter (a list of \var{row}
  375 objects as described above) to the writer's file object, formatted
  376 according to the current dialect.
 377 \end{methoddesc}
 378
 379 Writer objects have the following public attribute:
 380
 381 \begin{memberdesc}[csv writer]{dialect}
 382 A read-only description of the dialect in use by the writer.
 383 \end{memberdesc}
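
A short sketch of \method{writerows} and the \member{dialect} attribute,
including the complex-number case mentioned above.  \class{Sink} is a
hypothetical stand-in for a file object and the rows are invented.

```python
import csv

class Sink(object):
    """Hypothetical minimal file-like object: only write() is provided."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

sink = Sink()
writer = csv.writer(sink, delimiter='\t')

# writerows() writes each row in turn, as repeated writerow() calls would.
writer.writerows([
    ['x', 'y'],
    [1, 2.5],
    [1 + 2j, None],   # the complex number is written with parentheses
])

output = ''.join(sink.chunks)
delimiter_in_use = writer.dialect.delimiter
```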
 384
 385
 386
 387 \subsection{Examples\label{csv-examples}}
 388
 389 The simplest example of reading a CSV file:
 390
 391 \begin{verbatim}
 392 import csv
 393 reader = csv.reader(open("some.csv", "rb"))
 394 for row in reader:
 395     print row
 396 \end{verbatim}
 397
 398 Reading a file with an alternate format:
 399
 400 \begin{verbatim}
 401 import csv
 402 reader = csv.reader(open("passwd", "rb"), delimiter=':', quoting=csv.QUOTE_NONE)
 403 for row in reader:
 404     print row
 405 \end{verbatim}
 406
 407 The corresponding simplest possible writing example is:
 408
 409 \begin{verbatim}
 410 import csv
 411 writer = csv.writer(open("some.csv", "wb"))
 412 writer.writerows(someiterable)
 413 \end{verbatim}
 414
 415 Registering a new dialect:
 416
 417 \begin{verbatim}
 418 import csv
 419
 420 csv.register_dialect('unixpwd', delimiter=':', quoting=csv.QUOTE_NONE)
 421
 422 reader = csv.reader(open("passwd", "rb"), 'unixpwd')
 423 \end{verbatim}
 424
 425 A slightly more advanced use of the reader --- catching and reporting errors:
 426
 427 \begin{verbatim}
 428 import csv, sys
 429 filename = "some.csv"
 430 reader = csv.reader(open(filename, "rb"))
 431 try:
 432     for row in reader:
 433         print row
 434 except csv.Error, e:
 435     sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
 436 \end{verbatim}
 437
 438 And while the module doesn't directly support parsing strings, it can
 439 easily be done:
 440
 441 \begin{verbatim}
 442 import csv
 443 for row in csv.reader(['one,two,three']):
 444     print row
 445 \end{verbatim}
 446
 447 The \module{csv} module doesn't directly support reading and writing
 448 Unicode, but it is 8-bit-clean save for some problems with \ASCII{} NUL
 449 characters.  So you can write functions or classes that handle the
 450 encoding and decoding for you as long as you avoid encodings like
 451 UTF-16 that use NULs.  UTF-8 is recommended.
 452
 453 \function{unicode_csv_reader} below is a generator that wraps
 454 \class{csv.reader} to handle Unicode CSV data (a list of Unicode
 455 strings).  \function{utf_8_encoder} is a generator that encodes the
 456 Unicode strings as UTF-8, one string (or row) at a time.  The encoded
 457 strings are parsed by the CSV reader, and
 458 \function{unicode_csv_reader} decodes the UTF-8-encoded cells back
 459 into Unicode:
 460
 461 \begin{verbatim}
 462 import csv
 463
 464 def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
 465     # csv.py doesn't do Unicode; encode temporarily as UTF-8:
 466     csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
 467                             dialect=dialect, **kwargs)
 468     for row in csv_reader:
 469         # decode UTF-8 back to Unicode, cell by cell:
 470         yield [unicode(cell, 'utf-8') for cell in row]
 471
 472 def utf_8_encoder(unicode_csv_data):
 473     for line in unicode_csv_data:
 474         yield line.encode('utf-8')
 475 \end{verbatim}
 476
 477 For all other encodings the following \class{UnicodeReader} and
 478 \class{UnicodeWriter} classes can be used. They take an additional
 479 \var{encoding} parameter in their constructor and make sure that the data
  480 is passed to the real reader or writer encoded as UTF-8:
 481
 482 \begin{verbatim}
 483 import csv, codecs, cStringIO
 484
 485 class UTF8Recoder:
 486     """
 487     Iterator that reads an encoded stream and reencodes the input to UTF-8
 488     """
 489     def __init__(self, f, encoding):
 490         self.reader = codecs.getreader(encoding)(f)
 491
 492     def __iter__(self):
 493         return self
 494
 495     def next(self):
 496         return self.reader.next().encode("utf-8")
 497
 498 class UnicodeReader:
 499     """
 500     A CSV reader which will iterate over lines in the CSV file "f",
 501     which is encoded in the given encoding.
 502     """
 503
 504     def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
 505         f = UTF8Recoder(f, encoding)
 506         self.reader = csv.reader(f, dialect=dialect, **kwds)
 507
 508     def next(self):
 509         row = self.reader.next()
 510         return [unicode(s, "utf-8") for s in row]
 511
 512     def __iter__(self):
 513         return self
 514
 515 class UnicodeWriter:
 516     """
 517     A CSV writer which will write rows to CSV file "f",
 518     which is encoded in the given encoding.
 519     """
 520
 521     def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
 522         # Redirect output to a queue
 523         self.queue = cStringIO.StringIO()
 524         self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
 525         self.stream = f
 526         self.encoder = codecs.getincrementalencoder(encoding)()
 527
 528     def writerow(self, row):
 529         self.writer.writerow([s.encode("utf-8") for s in row])
 530         # Fetch UTF-8 output from the queue ...
 531         data = self.queue.getvalue()
 532         data = data.decode("utf-8")
 533         # ... and reencode it into the target encoding
 534         data = self.encoder.encode(data)
 535         # write to the target stream
 536         self.stream.write(data)
 537         # empty queue
 538         self.queue.truncate(0)
 539
 540     def writerows(self, rows):
 541         for row in rows:
 542             self.writerow(row)
 543 \end{verbatim}