Doc/lib/libcsv.tex

   1 \section{\module{csv} --- CSV File Reading and Writing}
   2
   3 \declaremodule{standard}{csv}
   4 \modulesynopsis{Write and read tabular data to and from delimited files.}
   5 \sectionauthor{Skip Montanaro}{skip@pobox.com}
   6
   7 \versionadded{2.3}
   8 \index{csv}
   9 \indexii{data}{tabular}
  10
  11 The so-called CSV (Comma Separated Values) format is the most common import
  12 and export format for spreadsheets and databases.  There is no ``CSV
  13 standard'', so the format is operationally defined by the many applications
  14 which read and write it.  The lack of a standard means that subtle
  15 differences often exist in the data produced and consumed by different
  16 applications.  These differences can make it annoying to process CSV files
  17 from multiple sources.  Still, while the delimiters and quoting characters
  18 vary, the overall format is similar enough that it is possible to write a
  19 single module which can efficiently manipulate such data, hiding the details
  20 of reading and writing the data from the programmer.
  21
  22 The \module{csv} module implements classes to read and write tabular data in
  23 CSV format.  It allows programmers to say, ``write this data in the format
  24 preferred by Excel,'' or ``read data from this file which was generated by
  25 Excel,'' without knowing the precise details of the CSV format used by
  26 Excel.  Programmers can also describe the CSV formats understood by other
  27 applications or define their own special-purpose CSV formats.
  28
  29 The \module{csv} module's \class{reader} and \class{writer} objects read and
  30 write sequences.  Programmers can also read and write data in dictionary
  31 form using the \class{DictReader} and \class{DictWriter} classes.
  32
  33 \begin{notice}
  34   This version of the \module{csv} module doesn't support Unicode
  35   input.  Also, there are currently some issues regarding \ASCII{} NUL
  36   characters.  Accordingly, all input should be UTF-8 or printable
  37   \ASCII{} to be safe; see the examples in section~\ref{csv-examples}.
  38   These restrictions will be removed in the future.
  39 \end{notice}
  40
  41 \begin{seealso}
  42 %  \seemodule{array}{Arrays of uniformly types numeric values.}
  43   \seepep{305}{CSV File API}
  44          {The Python Enhancement Proposal which proposed this addition
  45           to Python.}
  46 \end{seealso}
  47
  48
  49 \subsection{Module Contents \label{csv-contents}}
  50
  51 The \module{csv} module defines the following functions:
  52
  53 \begin{funcdesc}{reader}{csvfile\optional{,
  54                          dialect=\code{'excel'}}\optional{, fmtparam}}
  55 Return a reader object which will iterate over lines in the given
  56 {}\var{csvfile}.  \var{csvfile} can be any object which supports the
  57 iterator protocol and returns a string each time its \method{next}
  58 method is called --- file objects and list objects are both suitable.
  59 If \var{csvfile} is a file object, it must be opened with
  60 the 'b' flag on platforms where that makes a difference.  An optional
  61 {}\var{dialect} parameter can be given
  62 which is used to define a set of parameters specific to a particular CSV
  63 dialect.  It may be an instance of a subclass of the \class{Dialect}
  64 class or one of the strings returned by the \function{list_dialects}
  65 function.  The other optional {}\var{fmtparam} keyword arguments can be
  66 given to override individual formatting parameters in the current
  67 dialect.  For more information about the dialect and formatting
  68 parameters, see section~\ref{csv-fmt-params}, ``Dialects and Formatting
  69 Parameters'' for details of these parameters.
  70
  71 All data read are returned as strings.  No automatic data type
  72 conversion is performed.
  73
  74 \versionchanged[
  75 If literal newlines are important within a field, users need to read their
  76 file in a way that preserves the newlines. The behavior before 2.5 would
  77 introduce spurious characters into quoted fields, with no way for the user
  78 to control that behavior. The previous behavior caused considerable
  79 problems, particularly on platforms that did not use the Unix line ending
  80 conventions, or with files that originated on those platforms --- users were
  81 finding mysterious newlines where they didn't expect them]{2.5}
  82
  83 \end{funcdesc}
  84
  85 \begin{funcdesc}{writer}{csvfile\optional{,
  86                          dialect=\code{'excel'}}\optional{, fmtparam}}
  87 Return a writer object responsible for converting the user's data into
  88 delimited strings on the given file-like object.  \var{csvfile} can be any
  89 object with a \function{write} method.  If \var{csvfile} is a file object,
  90 it must be opened with the 'b' flag on platforms where that makes a
  91 difference.  An optional
  92 {}\var{dialect} parameter can be given which is used to define a set of
  93 parameters specific to a particular CSV dialect.  It may be an instance
  94 of a subclass of the \class{Dialect} class or one of the strings
  95 returned by the \function{list_dialects} function.  The other optional
  96 {}\var{fmtparam} keyword arguments can be given to override individual
  97 formatting parameters in the current dialect.  For more information
  98 about the dialect and formatting parameters, see
  99 section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters'' for
 100 details of these parameters.  To make it as easy as possible to
 101 interface with modules which implement the DB API, the value
 102 \constant{None} is written as the empty string.  While this isn't a
 103 reversible transformation, it makes it easier to dump SQL NULL data values
 104 to CSV files without preprocessing the data returned from a
 105 \code{cursor.fetch*()} call.  All other non-string data are stringified
 106 with \function{str()} before being written.
 107 \end{funcdesc}
 108
 109 \begin{funcdesc}{register_dialect}{name\optional{, dialect}\optional{, fmtparam}}
 110 Associate \var{dialect} with \var{name}.  \var{name} must be a string
 111 or Unicode object. The dialect can be specified either by passing a
 112 sub-class of \class{Dialect}, or by \var{fmtparam} keyword arguments,
 113 or both, with keyword arguments overriding parameters of the dialect.
 114 For more information about the dialect and formatting parameters, see
 115 section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters''
 116 for details of these parameters.
 117 \end{funcdesc}
 118
 119 \begin{funcdesc}{unregister_dialect}{name}
 120 Delete the dialect associated with \var{name} from the dialect registry.  An
 121 \exception{Error} is raised if \var{name} is not a registered dialect
 122 name.
 123 \end{funcdesc}
 124
 125 \begin{funcdesc}{get_dialect}{name}
 126 Return the dialect associated with \var{name}.  An \exception{Error} is
 127 raised if \var{name} is not a registered dialect name.
 128 \end{funcdesc}
 129
 130 \begin{funcdesc}{list_dialects}{}
 131 Return the names of all registered dialects.
 132 \end{funcdesc}
 133
 134 \begin{funcdesc}{field_size_limit}{\optional{new_limit}}
 135   Returns the current maximum field size allowed by the parser. If
 136   \var{new_limit} is given, this becomes the new limit.
 137   \versionadded{2.5}
 138 \end{funcdesc}
 139
 140
 141 The \module{csv} module defines the following classes:
 142
 143 \begin{classdesc}{DictReader}{csvfile\optional{,
 144                               fieldnames=\constant{None},\optional{,
 145                               restkey=\constant{None}\optional{,
 146                               restval=\constant{None}\optional{,
 147                               dialect=\code{'excel'}\optional{,
 148                               *args, **kwds}}}}}}
 149 Create an object which operates like a regular reader but maps the
 150 information read into a dict whose keys are given by the optional
 151 {} \var{fieldnames}
 152 parameter.  If the \var{fieldnames} parameter is omitted, the values in
 153 the first row of the \var{csvfile} will be used as the fieldnames.
 154 If the row read has fewer fields than the fieldnames sequence,
 155 the value of \var{restval} will be used as the default value.  If the row
 156 read has more fields than the fieldnames sequence, the remaining data is
 157 added as a sequence keyed by the value of \var{restkey}.  If the row read
 158 has fewer fields than the fieldnames sequence, the remaining keys take the
 159 value of the optional \var{restval} parameter.  Any other optional or
 160 keyword arguments are passed to the underlying \class{reader} instance.
 161 \end{classdesc}
 162
 163
 164 \begin{classdesc}{DictWriter}{csvfile, fieldnames\optional{,
 165                               restval=""\optional{,
 166                               extrasaction=\code{'raise'}\optional{,
 167                               dialect=\code{'excel'}\optional{,
 168                               *args, **kwds}}}}}
 169 Create an object which operates like a regular writer but maps dictionaries
 170 onto output rows.  The \var{fieldnames} parameter identifies the order in
 171 which values in the dictionary passed to the \method{writerow()} method are
 172 written to the \var{csvfile}.  The optional \var{restval} parameter
 173 specifies the value to be written if the dictionary is missing a key in
 174 \var{fieldnames}.  If the dictionary passed to the \method{writerow()}
 175 method contains a key not found in \var{fieldnames}, the optional
 176 \var{extrasaction} parameter indicates what action to take.  If it is set
 177 to \code{'raise'} a \exception{ValueError} is raised.  If it is set to
 178 \code{'ignore'}, extra values in the dictionary are ignored.  Any other
 179 optional or keyword arguments are passed to the underlying \class{writer}
 180 instance.
 181
 182 Note that unlike the \class{DictReader} class, the \var{fieldnames}
 183 parameter of the \class{DictWriter} is not optional.  Since Python's
 184 \class{dict} objects are not ordered, there is not enough information
 185 available to deduce the order in which the row should be written to the
 186 \var{csvfile}.
 187
 188 \end{classdesc}
 189
 190 \begin{classdesc*}{Dialect}{}
 191 The \class{Dialect} class is a container class relied on primarily for its
 192 attributes, which are used to define the parameters for a specific
 193 \class{reader} or \class{writer} instance.
 194 \end{classdesc*}
 195
 196 \begin{classdesc}{excel}{}
 197 The \class{excel} class defines the usual properties of an Excel-generated
 198 CSV file.
 199 \end{classdesc}
 200
 201 \begin{classdesc}{excel_tab}{}
 202 The \class{excel_tab} class defines the usual properties of an
 203 Excel-generated TAB-delimited file.
 204 \end{classdesc}
 205
 206 \begin{classdesc}{Sniffer}{}
 207 The \class{Sniffer} class is used to deduce the format of a CSV file.
 208 \end{classdesc}
 209
 210 The \class{Sniffer} class provides two methods:
 211
 212 \begin{methoddesc}{sniff}{sample\optional{,delimiters=None}}
 213 Analyze the given \var{sample} and return a \class{Dialect} subclass
 214 reflecting the parameters found.  If the optional \var{delimiters} parameter
 215 is given, it is interpreted as a string containing possible valid delimiter
 216 characters.
 217 \end{methoddesc}
 218
 219 \begin{methoddesc}{has_header}{sample}
 220 Analyze the sample text (presumed to be in CSV format) and return
 221 \constant{True} if the first row appears to be a series of column
 222 headers.
 223 \end{methoddesc}
 224
 225
 226 The \module{csv} module defines the following constants:
 227
 228 \begin{datadesc}{QUOTE_ALL}
 229 Instructs \class{writer} objects to quote all fields.
 230 \end{datadesc}
 231
 232 \begin{datadesc}{QUOTE_MINIMAL}
 233 Instructs \class{writer} objects to only quote those fields which contain
 234 special characters such as \var{delimiter}, \var{quotechar} or any of the
 235 characters in \var{lineterminator}.
 236 \end{datadesc}
 237
 238 \begin{datadesc}{QUOTE_NONNUMERIC}
 239 Instructs \class{writer} objects to quote all non-numeric
 240 fields.
 241
 242 Instructs the reader to convert all non-quoted fields to type \var{float}.
 243 \end{datadesc}
 244
 245 \begin{datadesc}{QUOTE_NONE}
 246 Instructs \class{writer} objects to never quote fields.  When the current
 247 \var{delimiter} occurs in output data it is preceded by the current
 248 \var{escapechar} character.  If \var{escapechar} is not set, the writer
 249 will raise \exception{Error} if any characters that require escaping
 250 are encountered.
 251
 252 Instructs \class{reader} to perform no special processing of quote characters.
 253 \end{datadesc}
 254
 255
 256 The \module{csv} module defines the following exception:
 257
 258 \begin{excdesc}{Error}
 259 Raised by any of the functions when an error is detected.
 260 \end{excdesc}
 261
 262
 263 \subsection{Dialects and Formatting Parameters\label{csv-fmt-params}}
 264
 265 To make it easier to specify the format of input and output records,
 266 specific formatting parameters are grouped together into dialects.  A
 267 dialect is a subclass of the \class{Dialect} class having a set of specific
 268 methods and a single \method{validate()} method.  When creating \class{reader}
 269 or \class{writer} objects, the programmer can specify a string or a subclass
 270 of the \class{Dialect} class as the dialect parameter.  In addition to, or
 271 instead of, the \var{dialect} parameter, the programmer can also specify
 272 individual formatting parameters, which have the same names as the
 273 attributes defined below for the \class{Dialect} class.
 274
 275 Dialects support the following attributes:
 276
 277 \begin{memberdesc}[Dialect]{delimiter}
 278 A one-character string used to separate fields.  It defaults to \code{','}.
 279 \end{memberdesc}
 280
 281 \begin{memberdesc}[Dialect]{doublequote}
 282 Controls how instances of \var{quotechar} appearing inside a field should
 283 be themselves be quoted.  When \constant{True}, the character is doubled.
 284 When \constant{False}, the \var{escapechar} is used as a prefix to the
 285 \var{quotechar}.  It defaults to \constant{True}.
 286
 287 On output, if \var{doublequote} is \constant{False} and no
 288 \var{escapechar} is set, \exception{Error} is raised if a \var{quotechar}
 289 is found in a field.
 290 \end{memberdesc}
 291
 292 \begin{memberdesc}[Dialect]{escapechar}
 293 A one-character string used by the writer to escape the \var{delimiter} if
 294 \var{quoting} is set to \constant{QUOTE_NONE} and the \var{quotechar}
 295 if \var{doublequote} is \constant{False}. On reading, the \var{escapechar}
 296 removes any special meaning from the following character. It defaults
 297 to \constant{None}, which disables escaping.
 298 \end{memberdesc}
 299
 300 \begin{memberdesc}[Dialect]{lineterminator}
 301 The string used to terminate lines produced by the \class{writer}.
 302 It defaults to \code{'\e r\e n'}.
 303
 304 \note{The \class{reader} is hard-coded to recognise either \code{'\e r'}
 305 or \code{'\e n'} as end-of-line, and ignores \var{lineterminator}. This
 306 behavior may change in the future.}
 307 \end{memberdesc}
 308
 309 \begin{memberdesc}[Dialect]{quotechar}
 310 A one-character string used to quote fields containing special characters,
 311 such as the \var{delimiter} or \var{quotechar}, or which contain new-line
 312 characters.  It defaults to \code{'"'}.
 313 \end{memberdesc}
 314
 315 \begin{memberdesc}[Dialect]{quoting}
 316 Controls when quotes should be generated by the writer and recognised
 317 by the reader.  It can take on any of the \constant{QUOTE_*} constants
 318 (see section~\ref{csv-contents}) and defaults to \constant{QUOTE_MINIMAL}.
 319 \end{memberdesc}
 320
 321 \begin{memberdesc}[Dialect]{skipinitialspace}
 322 When \constant{True}, whitespace immediately following the \var{delimiter}
 323 is ignored.  The default is \constant{False}.
 324 \end{memberdesc}
 325
 326
 327 \subsection{Reader Objects}
 328
 329 Reader objects (\class{DictReader} instances and objects returned by
 330 the \function{reader()} function) have the following public methods:
 331
 332 \begin{methoddesc}[csv reader]{next}{}
 333 Return the next row of the reader's iterable object as a list, parsed
 334 according to the current dialect.
 335 \end{methoddesc}
 336
 337 Reader objects have the following public attributes:
 338
 339 \begin{memberdesc}[csv reader]{dialect}
 340 A read-only description of the dialect in use by the parser.
 341 \end{memberdesc}
 342
 343 \begin{memberdesc}[csv reader]{line_num}
 344  The number of lines read from the source iterator. This is not the same
 345  as the number of records returned, as records can span multiple lines.
 346 \end{memberdesc}
 347
 348
 349 \subsection{Writer Objects}
 350
 351 \class{Writer} objects (\class{DictWriter} instances and objects returned by
 352 the \function{writer()} function) have the following public methods.  A
 353 {}\var{row} must be a sequence of strings or numbers for \class{Writer}
 354 objects and a dictionary mapping fieldnames to strings or numbers (by
 355 passing them through \function{str()} first) for {}\class{DictWriter}
 356 objects.  Note that complex numbers are written out surrounded by parens.
 357 This may cause some problems for other programs which read CSV files
 358 (assuming they support complex numbers at all).
 359
 360 \begin{methoddesc}[csv writer]{writerow}{row}
 361 Write the \var{row} parameter to the writer's file object, formatted
 362 according to the current dialect.
 363 \end{methoddesc}
 364
 365 \begin{methoddesc}[csv writer]{writerows}{rows}
 366 Write all the \var{rows} parameters (a list of \var{row} objects as
 367 described above) to the writer's file object, formatted
 368 according to the current dialect.
 369 \end{methoddesc}
 370
 371 Writer objects have the following public attribute:
 372
 373 \begin{memberdesc}[csv writer]{dialect}
 374 A read-only description of the dialect in use by the writer.
 375 \end{memberdesc}
 376
 377
 378
 379 \subsection{Examples\label{csv-examples}}
 380
 381 The simplest example of reading a CSV file:
 382
 383 \begin{verbatim}
 384 import csv
 385 reader = csv.reader(open("some.csv", "rb"))
 386 for row in reader:
 387     print row
 388 \end{verbatim}
 389
 390 Reading a file with an alternate format:
 391
 392 \begin{verbatim}
 393 import csv
 394 reader = csv.reader(open("passwd", "rb"), delimiter=':', quoting=csv.QUOTE_NONE)
 395 for row in reader:
 396     print row
 397 \end{verbatim}
 398
 399 The corresponding simplest possible writing example is:
 400
 401 \begin{verbatim}
 402 import csv
 403 writer = csv.writer(open("some.csv", "wb"))
 404 writer.writerows(someiterable)
 405 \end{verbatim}
 406
 407 Registering a new dialect:
 408
 409 \begin{verbatim}
 410 import csv
 411
 412 csv.register_dialect('unixpwd', delimiter=':', quoting=csv.QUOTE_NONE)
 413
 414 reader = csv.reader(open("passwd", "rb"), 'unixpwd')
 415 \end{verbatim}
 416
 417 A slightly more advanced use of the reader --- catching and reporting errors:
 418
 419 \begin{verbatim}
 420 import csv, sys
 421 filename = "some.csv"
 422 reader = csv.reader(open(filename, "rb"))
 423 try:
 424     for row in reader:
 425         print row
 426 except csv.Error, e:
 427     sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
 428 \end{verbatim}
 429
 430 And while the module doesn't directly support parsing strings, it can
 431 easily be done:
 432
 433 \begin{verbatim}
 434 import csv
 435 for row in csv.reader(['one,two,three']):
 436     print row
 437 \end{verbatim}
 438
 439 The \module{csv} module doesn't directly support reading and writing
 440 Unicode, but it is 8-bit-clean save for some problems with \ASCII{} NUL
 441 characters.  So you can write functions or classes that handle the
 442 encoding and decoding for you as long as you avoid encodings like
 443 UTF-16 that use NULs.  UTF-8 is recommended.
 444
 445 \function{unicode_csv_reader} below is a generator that wraps
 446 \class{csv.reader} to handle Unicode CSV data (a list of Unicode
 447 strings).  \function{utf_8_encoder} is a generator that encodes the
 448 Unicode strings as UTF-8, one string (or row) at a time.  The encoded
 449 strings are parsed by the CSV reader, and
 450 \function{unicode_csv_reader} decodes the UTF-8-encoded cells back
 451 into Unicode:
 452
 453 \begin{verbatim}
 454 import csv
 455
 456 def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
 457     # csv.py doesn't do Unicode; encode temporarily as UTF-8:
 458     csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
 459                             dialect=dialect, **kwargs)
 460     for row in csv_reader:
 461         # decode UTF-8 back to Unicode, cell by cell:
 462         yield [unicode(cell, 'utf-8') for cell in row]
 463
 464 def utf_8_encoder(unicode_csv_data):
 465     for line in unicode_csv_data:
 466         yield line.encode('utf-8')
 467 \end{verbatim}
 468
 469 For all other encodings the following \class{UnicodeReader} and
 470 \class{UnicodeWriter} classes can be used. They take an additional
 471 \var{encoding} parameter in their constructor and make sure that the data
 472 passes the real reader or writer encoded as UTF-8:
 473
 474 \begin{verbatim}
 475 import csv, codecs, cStringIO
 476
 477 class UTF8Recoder:
 478     """
 479     Iterator that reads an encoded stream and reencodes the input to UTF-8
 480     """
 481     def __init__(self, f, encoding):
 482         self.reader = codecs.getreader(encoding)(f)
 483
 484     def __iter__(self):
 485         return self
 486
 487     def next(self):
 488         return self.reader.next().encode("utf-8")
 489
 490 class UnicodeReader:
 491     """
 492     A CSV reader which will iterate over lines in the CSV file "f",
 493     which is encoded in the given encoding.
 494     """
 495
 496     def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
 497         f = UTF8Recoder(f, encoding)
 498         self.reader = csv.reader(f, dialect=dialect, **kwds)
 499
 500     def next(self):
 501         row = self.reader.next()
 502         return [unicode(s, "utf-8") for s in row]
 503
 504     def __iter__(self):
 505         return self
 506
 507 class UnicodeWriter:
 508     """
 509     A CSV writer which will write rows to CSV file "f",
 510     which is encoded in the given encoding.
 511     """
 512
 513     def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
 514         # Redirect output to a queue
 515         self.queue = cStringIO.StringIO()
 516         self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
 517         self.stream = f
 518         self.encoder = codecs.getincrementalencoder(encoding)()
 519
 520     def writerow(self, row):
 521         self.writer.writerow([s.encode("utf-8") for s in row])
 522         # Fetch UTF-8 output from the queue ...
 523         data = self.queue.getvalue()
 524         data = data.decode("utf-8")
 525         # ... and reencode it into the target encoding
 526         data = self.encoder.encode(data)
 527         # write to the target stream
 528         self.stream.write(data)
 529         # empty queue
 530         self.queue.truncate(0)
 531
 532     def writerows(self, rows):
 533         for row in rows:
 534             self.writerow(row)
 535 \end{verbatim}