Fix case for 'Unix'
[pytest.git] / Doc / lib / libcsv.tex
bloba9f490dc8872699c64bff8cb728cae9257fc0633
1 \section{\module{csv} --- CSV File Reading and Writing}
3 \declaremodule{standard}{csv}
4 \modulesynopsis{Write and read tabular data to and from delimited files.}
5 \sectionauthor{Skip Montanaro}{skip@pobox.com}
7 \versionadded{2.3}
8 \index{csv}
9 \indexii{data}{tabular}
11 The so-called CSV (Comma Separated Values) format is the most common import
12 and export format for spreadsheets and databases. There is no ``CSV
13 standard'', so the format is operationally defined by the many applications
14 which read and write it. The lack of a standard means that subtle
15 differences often exist in the data produced and consumed by different
16 applications. These differences can make it annoying to process CSV files
17 from multiple sources. Still, while the delimiters and quoting characters
18 vary, the overall format is similar enough that it is possible to write a
19 single module which can efficiently manipulate such data, hiding the details
20 of reading and writing the data from the programmer.
22 The \module{csv} module implements classes to read and write tabular data in
23 CSV format. It allows programmers to say, ``write this data in the format
24 preferred by Excel,'' or ``read data from this file which was generated by
25 Excel,'' without knowing the precise details of the CSV format used by
26 Excel. Programmers can also describe the CSV formats understood by other
27 applications or define their own special-purpose CSV formats.
29 The \module{csv} module's \class{reader} and \class{writer} objects read and
30 write sequences. Programmers can also read and write data in dictionary
31 form using the \class{DictReader} and \class{DictWriter} classes.
33 \begin{notice}
34 This version of the \module{csv} module doesn't support Unicode
35 input. Also, there are currently some issues regarding \ASCII{} NUL
36 characters. Accordingly, all input should be UTF-8 or printable
37 \ASCII{} to be safe; see the examples in section~\ref{csv-examples}.
38 These restrictions will be removed in the future.
39 \end{notice}
41 \begin{seealso}
42 % \seemodule{array}{Arrays of uniformly types numeric values.}
43 \seepep{305}{CSV File API}
44 {The Python Enhancement Proposal which proposed this addition
45 to Python.}
46 \end{seealso}
49 \subsection{Module Contents \label{csv-contents}}
51 The \module{csv} module defines the following functions:
53 \begin{funcdesc}{reader}{csvfile\optional{,
54 dialect=\code{'excel'}}\optional{, fmtparam}}
55 Return a reader object which will iterate over lines in the given
56 {}\var{csvfile}. \var{csvfile} can be any object which supports the
57 iterator protocol and returns a string each time its \method{next}
58 method is called --- file objects and list objects are both suitable.
59 If \var{csvfile} is a file object, it must be opened with
60 the 'b' flag on platforms where that makes a difference. An optional
61 {}\var{dialect} parameter can be given
62 which is used to define a set of parameters specific to a particular CSV
63 dialect. It may be an instance of a subclass of the \class{Dialect}
64 class or one of the strings returned by the \function{list_dialects}
65 function. The other optional {}\var{fmtparam} keyword arguments can be
66 given to override individual formatting parameters in the current
67 dialect. For more information about the dialect and formatting
68 parameters, see section~\ref{csv-fmt-params}, ``Dialects and Formatting
69 Parameters'' for details of these parameters.
71 All data read are returned as strings. No automatic data type
72 conversion is performed.
74 \versionchanged[
75 If literal newlines are important within a field, users need to read their
76 file in a way that preserves the newlines. The behavior before 2.5 would
77 introduce spurious characters into quoted fields, with no way for the user
78 to control that behavior. The previous behavior caused considerable
79 problems, particularly on platforms that did not use the Unix line ending
80 conventions, or with files that originated on those platforms --- users were
81 finding mysterious newlines where they didn't expect them]{2.5}
83 \end{funcdesc}
85 \begin{funcdesc}{writer}{csvfile\optional{,
86 dialect=\code{'excel'}}\optional{, fmtparam}}
87 Return a writer object responsible for converting the user's data into
88 delimited strings on the given file-like object. \var{csvfile} can be any
89 object with a \function{write} method. If \var{csvfile} is a file object,
90 it must be opened with the 'b' flag on platforms where that makes a
91 difference. An optional
92 {}\var{dialect} parameter can be given which is used to define a set of
93 parameters specific to a particular CSV dialect. It may be an instance
94 of a subclass of the \class{Dialect} class or one of the strings
95 returned by the \function{list_dialects} function. The other optional
96 {}\var{fmtparam} keyword arguments can be given to override individual
97 formatting parameters in the current dialect. For more information
98 about the dialect and formatting parameters, see
99 section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters'' for
100 details of these parameters. To make it as easy as possible to
101 interface with modules which implement the DB API, the value
102 \constant{None} is written as the empty string. While this isn't a
103 reversible transformation, it makes it easier to dump SQL NULL data values
104 to CSV files without preprocessing the data returned from a
105 \code{cursor.fetch*()} call. All other non-string data are stringified
106 with \function{str()} before being written.
107 \end{funcdesc}
109 \begin{funcdesc}{register_dialect}{name\optional{, dialect}\optional{, fmtparam}}
110 Associate \var{dialect} with \var{name}. \var{name} must be a string
111 or Unicode object. The dialect can be specified either by passing a
112 sub-class of \class{Dialect}, or by \var{fmtparam} keyword arguments,
113 or both, with keyword arguments overriding parameters of the dialect.
114 For more information about the dialect and formatting parameters, see
115 section~\ref{csv-fmt-params}, ``Dialects and Formatting Parameters''
116 for details of these parameters.
117 \end{funcdesc}
119 \begin{funcdesc}{unregister_dialect}{name}
120 Delete the dialect associated with \var{name} from the dialect registry. An
121 \exception{Error} is raised if \var{name} is not a registered dialect
122 name.
123 \end{funcdesc}
125 \begin{funcdesc}{get_dialect}{name}
126 Return the dialect associated with \var{name}. An \exception{Error} is
127 raised if \var{name} is not a registered dialect name.
128 \end{funcdesc}
130 \begin{funcdesc}{list_dialects}{}
131 Return the names of all registered dialects.
132 \end{funcdesc}
134 \begin{funcdesc}{field_size_limit}{\optional{new_limit}}
135 Returns the current maximum field size allowed by the parser. If
136 \var{new_limit} is given, this becomes the new limit.
137 \versionadded{2.5}
138 \end{funcdesc}
141 The \module{csv} module defines the following classes:
143 \begin{classdesc}{DictReader}{csvfile\optional{,
144 fieldnames=\constant{None},\optional{,
145 restkey=\constant{None}\optional{,
146 restval=\constant{None}\optional{,
147 dialect=\code{'excel'}\optional{,
148 *args, **kwds}}}}}}
149 Create an object which operates like a regular reader but maps the
150 information read into a dict whose keys are given by the optional
151 {} \var{fieldnames}
152 parameter. If the \var{fieldnames} parameter is omitted, the values in
153 the first row of the \var{csvfile} will be used as the fieldnames.
154 If the row read has fewer fields than the fieldnames sequence,
155 the value of \var{restval} will be used as the default value. If the row
156 read has more fields than the fieldnames sequence, the remaining data is
157 added as a sequence keyed by the value of \var{restkey}. If the row read
158 has fewer fields than the fieldnames sequence, the remaining keys take the
159 value of the optional \var{restval} parameter. Any other optional or
160 keyword arguments are passed to the underlying \class{reader} instance.
161 \end{classdesc}
164 \begin{classdesc}{DictWriter}{csvfile, fieldnames\optional{,
165 restval=""\optional{,
166 extrasaction=\code{'raise'}\optional{,
167 dialect=\code{'excel'}\optional{,
168 *args, **kwds}}}}}
169 Create an object which operates like a regular writer but maps dictionaries
170 onto output rows. The \var{fieldnames} parameter identifies the order in
171 which values in the dictionary passed to the \method{writerow()} method are
172 written to the \var{csvfile}. The optional \var{restval} parameter
173 specifies the value to be written if the dictionary is missing a key in
174 \var{fieldnames}. If the dictionary passed to the \method{writerow()}
175 method contains a key not found in \var{fieldnames}, the optional
176 \var{extrasaction} parameter indicates what action to take. If it is set
177 to \code{'raise'} a \exception{ValueError} is raised. If it is set to
178 \code{'ignore'}, extra values in the dictionary are ignored. Any other
179 optional or keyword arguments are passed to the underlying \class{writer}
180 instance.
182 Note that unlike the \class{DictReader} class, the \var{fieldnames}
183 parameter of the \class{DictWriter} is not optional. Since Python's
184 \class{dict} objects are not ordered, there is not enough information
185 available to deduce the order in which the row should be written to the
186 \var{csvfile}.
188 \end{classdesc}
190 \begin{classdesc*}{Dialect}{}
191 The \class{Dialect} class is a container class relied on primarily for its
192 attributes, which are used to define the parameters for a specific
193 \class{reader} or \class{writer} instance.
194 \end{classdesc*}
196 \begin{classdesc}{excel}{}
197 The \class{excel} class defines the usual properties of an Excel-generated
198 CSV file.
199 \end{classdesc}
201 \begin{classdesc}{excel_tab}{}
202 The \class{excel_tab} class defines the usual properties of an
203 Excel-generated TAB-delimited file.
204 \end{classdesc}
206 \begin{classdesc}{Sniffer}{}
207 The \class{Sniffer} class is used to deduce the format of a CSV file.
208 \end{classdesc}
210 The \class{Sniffer} class provides two methods:
212 \begin{methoddesc}{sniff}{sample\optional{,delimiters=None}}
213 Analyze the given \var{sample} and return a \class{Dialect} subclass
214 reflecting the parameters found. If the optional \var{delimiters} parameter
215 is given, it is interpreted as a string containing possible valid delimiter
216 characters.
217 \end{methoddesc}
219 \begin{methoddesc}{has_header}{sample}
220 Analyze the sample text (presumed to be in CSV format) and return
221 \constant{True} if the first row appears to be a series of column
222 headers.
223 \end{methoddesc}
226 The \module{csv} module defines the following constants:
228 \begin{datadesc}{QUOTE_ALL}
229 Instructs \class{writer} objects to quote all fields.
230 \end{datadesc}
232 \begin{datadesc}{QUOTE_MINIMAL}
233 Instructs \class{writer} objects to only quote those fields which contain
234 special characters such as \var{delimiter}, \var{quotechar} or any of the
235 characters in \var{lineterminator}.
236 \end{datadesc}
238 \begin{datadesc}{QUOTE_NONNUMERIC}
239 Instructs \class{writer} objects to quote all non-numeric
240 fields.
242 Instructs the reader to convert all non-quoted fields to type \var{float}.
243 \end{datadesc}
245 \begin{datadesc}{QUOTE_NONE}
246 Instructs \class{writer} objects to never quote fields. When the current
247 \var{delimiter} occurs in output data it is preceded by the current
248 \var{escapechar} character. If \var{escapechar} is not set, the writer
249 will raise \exception{Error} if any characters that require escaping
250 are encountered.
252 Instructs \class{reader} to perform no special processing of quote characters.
253 \end{datadesc}
256 The \module{csv} module defines the following exception:
258 \begin{excdesc}{Error}
259 Raised by any of the functions when an error is detected.
260 \end{excdesc}
263 \subsection{Dialects and Formatting Parameters\label{csv-fmt-params}}
265 To make it easier to specify the format of input and output records,
266 specific formatting parameters are grouped together into dialects. A
267 dialect is a subclass of the \class{Dialect} class having a set of specific
268 methods and a single \method{validate()} method. When creating \class{reader}
269 or \class{writer} objects, the programmer can specify a string or a subclass
270 of the \class{Dialect} class as the dialect parameter. In addition to, or
271 instead of, the \var{dialect} parameter, the programmer can also specify
272 individual formatting parameters, which have the same names as the
273 attributes defined below for the \class{Dialect} class.
275 Dialects support the following attributes:
277 \begin{memberdesc}[Dialect]{delimiter}
278 A one-character string used to separate fields. It defaults to \code{','}.
279 \end{memberdesc}
281 \begin{memberdesc}[Dialect]{doublequote}
282 Controls how instances of \var{quotechar} appearing inside a field should
283 be themselves be quoted. When \constant{True}, the character is doubled.
284 When \constant{False}, the \var{escapechar} is used as a prefix to the
285 \var{quotechar}. It defaults to \constant{True}.
287 On output, if \var{doublequote} is \constant{False} and no
288 \var{escapechar} is set, \exception{Error} is raised if a \var{quotechar}
289 is found in a field.
290 \end{memberdesc}
292 \begin{memberdesc}[Dialect]{escapechar}
293 A one-character string used by the writer to escape the \var{delimiter} if
294 \var{quoting} is set to \constant{QUOTE_NONE} and the \var{quotechar}
295 if \var{doublequote} is \constant{False}. On reading, the \var{escapechar}
296 removes any special meaning from the following character. It defaults
297 to \constant{None}, which disables escaping.
298 \end{memberdesc}
300 \begin{memberdesc}[Dialect]{lineterminator}
301 The string used to terminate lines produced by the \class{writer}.
302 It defaults to \code{'\e r\e n'}.
304 \note{The \class{reader} is hard-coded to recognise either \code{'\e r'}
305 or \code{'\e n'} as end-of-line, and ignores \var{lineterminator}. This
306 behavior may change in the future.}
307 \end{memberdesc}
309 \begin{memberdesc}[Dialect]{quotechar}
310 A one-character string used to quote fields containing special characters,
311 such as the \var{delimiter} or \var{quotechar}, or which contain new-line
312 characters. It defaults to \code{'"'}.
313 \end{memberdesc}
315 \begin{memberdesc}[Dialect]{quoting}
316 Controls when quotes should be generated by the writer and recognised
317 by the reader. It can take on any of the \constant{QUOTE_*} constants
318 (see section~\ref{csv-contents}) and defaults to \constant{QUOTE_MINIMAL}.
319 \end{memberdesc}
321 \begin{memberdesc}[Dialect]{skipinitialspace}
322 When \constant{True}, whitespace immediately following the \var{delimiter}
323 is ignored. The default is \constant{False}.
324 \end{memberdesc}
327 \subsection{Reader Objects}
329 Reader objects (\class{DictReader} instances and objects returned by
330 the \function{reader()} function) have the following public methods:
332 \begin{methoddesc}[csv reader]{next}{}
333 Return the next row of the reader's iterable object as a list, parsed
334 according to the current dialect.
335 \end{methoddesc}
337 Reader objects have the following public attributes:
339 \begin{memberdesc}[csv reader]{dialect}
340 A read-only description of the dialect in use by the parser.
341 \end{memberdesc}
343 \begin{memberdesc}[csv reader]{line_num}
344 The number of lines read from the source iterator. This is not the same
345 as the number of records returned, as records can span multiple lines.
346 \end{memberdesc}
349 \subsection{Writer Objects}
351 \class{Writer} objects (\class{DictWriter} instances and objects returned by
352 the \function{writer()} function) have the following public methods. A
353 {}\var{row} must be a sequence of strings or numbers for \class{Writer}
354 objects and a dictionary mapping fieldnames to strings or numbers (by
355 passing them through \function{str()} first) for {}\class{DictWriter}
356 objects. Note that complex numbers are written out surrounded by parens.
357 This may cause some problems for other programs which read CSV files
358 (assuming they support complex numbers at all).
360 \begin{methoddesc}[csv writer]{writerow}{row}
361 Write the \var{row} parameter to the writer's file object, formatted
362 according to the current dialect.
363 \end{methoddesc}
365 \begin{methoddesc}[csv writer]{writerows}{rows}
366 Write all the \var{rows} parameters (a list of \var{row} objects as
367 described above) to the writer's file object, formatted
368 according to the current dialect.
369 \end{methoddesc}
371 Writer objects have the following public attribute:
373 \begin{memberdesc}[csv writer]{dialect}
374 A read-only description of the dialect in use by the writer.
375 \end{memberdesc}
379 \subsection{Examples\label{csv-examples}}
381 The simplest example of reading a CSV file:
383 \begin{verbatim}
384 import csv
385 reader = csv.reader(open("some.csv", "rb"))
386 for row in reader:
387 print row
388 \end{verbatim}
390 Reading a file with an alternate format:
392 \begin{verbatim}
393 import csv
394 reader = csv.reader(open("passwd", "rb"), delimiter=':', quoting=csv.QUOTE_NONE)
395 for row in reader:
396 print row
397 \end{verbatim}
399 The corresponding simplest possible writing example is:
401 \begin{verbatim}
402 import csv
403 writer = csv.writer(open("some.csv", "wb"))
404 writer.writerows(someiterable)
405 \end{verbatim}
407 Registering a new dialect:
409 \begin{verbatim}
410 import csv
412 csv.register_dialect('unixpwd', delimiter=':', quoting=csv.QUOTE_NONE)
414 reader = csv.reader(open("passwd", "rb"), 'unixpwd')
415 \end{verbatim}
417 A slightly more advanced use of the reader --- catching and reporting errors:
419 \begin{verbatim}
420 import csv, sys
421 filename = "some.csv"
422 reader = csv.reader(open(filename, "rb"))
423 try:
424 for row in reader:
425 print row
426 except csv.Error, e:
427 sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
428 \end{verbatim}
430 And while the module doesn't directly support parsing strings, it can
431 easily be done:
433 \begin{verbatim}
434 import csv
435 for row in csv.reader(['one,two,three']):
436 print row
437 \end{verbatim}
439 The \module{csv} module doesn't directly support reading and writing
440 Unicode, but it is 8-bit-clean save for some problems with \ASCII{} NUL
441 characters. So you can write functions or classes that handle the
442 encoding and decoding for you as long as you avoid encodings like
443 UTF-16 that use NULs. UTF-8 is recommended.
445 \function{unicode_csv_reader} below is a generator that wraps
446 \class{csv.reader} to handle Unicode CSV data (a list of Unicode
447 strings). \function{utf_8_encoder} is a generator that encodes the
448 Unicode strings as UTF-8, one string (or row) at a time. The encoded
449 strings are parsed by the CSV reader, and
450 \function{unicode_csv_reader} decodes the UTF-8-encoded cells back
451 into Unicode:
453 \begin{verbatim}
454 import csv
456 def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
457 # csv.py doesn't do Unicode; encode temporarily as UTF-8:
458 csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
459 dialect=dialect, **kwargs)
460 for row in csv_reader:
461 # decode UTF-8 back to Unicode, cell by cell:
462 yield [unicode(cell, 'utf-8') for cell in row]
464 def utf_8_encoder(unicode_csv_data):
465 for line in unicode_csv_data:
466 yield line.encode('utf-8')
467 \end{verbatim}
469 For all other encodings the following \class{UnicodeReader} and
470 \class{UnicodeWriter} classes can be used. They take an additional
471 \var{encoding} parameter in their constructor and make sure that the data
472 passes the real reader or writer encoded as UTF-8:
474 \begin{verbatim}
475 import csv, codecs, cStringIO
477 class UTF8Recoder:
479 Iterator that reads an encoded stream and reencodes the input to UTF-8
481 def __init__(self, f, encoding):
482 self.reader = codecs.getreader(encoding)(f)
484 def __iter__(self):
485 return self
487 def next(self):
488 return self.reader.next().encode("utf-8")
490 class UnicodeReader:
492 A CSV reader which will iterate over lines in the CSV file "f",
493 which is encoded in the given encoding.
496 def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
497 f = UTF8Recoder(f, encoding)
498 self.reader = csv.reader(f, dialect=dialect, **kwds)
500 def next(self):
501 row = self.reader.next()
502 return [unicode(s, "utf-8") for s in row]
504 def __iter__(self):
505 return self
507 class UnicodeWriter:
509 A CSV writer which will write rows to CSV file "f",
510 which is encoded in the given encoding.
513 def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
514 # Redirect output to a queue
515 self.queue = cStringIO.StringIO()
516 self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
517 self.stream = f
518 self.encoder = codecs.getincrementalencoder(encoding)()
520 def writerow(self, row):
521 self.writer.writerow([s.encode("utf-8") for s in row])
522 # Fetch UTF-8 output from the queue ...
523 data = self.queue.getvalue()
524 data = data.decode("utf-8")
525 # ... and reencode it into the target encoding
526 data = self.encoder.encode(data)
527 # write to the target stream
528 self.stream.write(data)
529 # empty queue
530 self.queue.truncate(0)
532 def writerows(self, rows):
533 for row in rows:
534 self.writerow(row)
535 \end{verbatim}