\section{\module{shlex} ---
         Simple lexical analysis}

\declaremodule{standard}{shlex}
\modulesynopsis{Simple lexical analysis for \UNIX\ shell-like languages.}
\moduleauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\moduleauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}
\sectionauthor{Eric S. Raymond}{esr@snark.thyrsus.com}
\sectionauthor{Gustavo Niemeyer}{niemeyer@conectiva.com}

The \class{shlex} class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the \UNIX{} shell. This will often
be useful for writing minilanguages (for example, in run control
files for Python applications) or for parsing quoted strings.

\note{The \module{shlex} module currently does not support Unicode input.}

The \module{shlex} module defines the following functions:

\begin{funcdesc}{split}{s\optional{, comments}}
Split the string \var{s} using shell-like syntax. If \var{comments} is
\constant{False} (the default), the parsing of comments in the given
string will be disabled (setting the \member{commenters} member of the
\class{shlex} instance to the empty string). This function operates
\end{funcdesc}

The \module{shlex} module defines the following class:

\begin{classdesc}{shlex}{\optional{instream\optional{,
                         infile\optional{, posix}}}}
A \class{shlex} instance or subclass instance is a lexical analyzer
object. The initialization argument, if present, specifies where to
read characters from. It must be a file-/stream-like object with
\method{read()} and \method{readline()} methods, or a string (strings
are accepted since Python 2.3). If no argument is given, input will be
taken from \code{sys.stdin}. The second optional argument is a filename
string, which sets the initial value of the \member{infile} member. If
the \var{instream} argument is omitted or equal to \code{sys.stdin},
this second argument defaults to ``stdin''. The \var{posix} argument
was introduced in Python 2.3, and defines the operational mode. When
\var{posix} is not true (the default), the \class{shlex} instance will
operate in compatibility mode. When operating in \POSIX{} mode,
\class{shlex} will try to be as close as possible to the \POSIX{} shell
parsing rules. See section~\ref{shlex-objects}.
\end{classdesc}
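
As a rough illustration of the two operational modes, the sketch below
tokenizes the same made-up input line in compatibility mode and in
\POSIX{} mode. The input string is only an example, and the lexer is
consumed by iterating over it, which is assumed to be available (recent
Python versions support this).

```python
import shlex

# Made-up input line; the trailing "# ..." part is treated as a comment.
line = 'copy "my file.txt" backup # keep a spare'

# Default (compatibility) mode: the quotes stay part of the token.
compat_tokens = list(shlex.shlex(line))

# POSIX mode: the quotes are stripped from the token.
posix_tokens = list(shlex.shlex(line, posix=True))

print(compat_tokens)
print(posix_tokens)
```

In both cases the comment is skipped; only the treatment of the quoted
file name differs.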

\begin{seealso}
  \seemodule{ConfigParser}{Parser for configuration files similar to the
                           Windows \file{.ini} files.}
\end{seealso}

\subsection{shlex Objects \label{shlex-objects}}

A \class{shlex} instance has the following methods:

\begin{methoddesc}{get_token}{}
Return a token. If tokens have been stacked using
\method{push_token()}, pop a token off the stack. Otherwise, read one
from the input stream. If reading encounters an immediate
end-of-file, \member{self.eof} is returned (the empty string (\code{''})
in non-\POSIX{} mode, and \code{None} in \POSIX{} mode).
\end{methoddesc}

\begin{methoddesc}{push_token}{str}
Push the argument onto the token stack.
\end{methoddesc}
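
The interplay of \method{get_token()} and \method{push_token()} can be
sketched as a one-token lookahead. The input string here is an
arbitrary example, and the lexer runs in the default non-\POSIX{} mode.

```python
import shlex

lexer = shlex.shlex("alpha beta")

first = lexer.get_token()    # read from the input stream
lexer.push_token(first)      # put the token back on the stack
again = lexer.get_token()    # popped from the stack, not re-read
second = lexer.get_token()   # next token from the input stream
at_end = lexer.get_token()   # input exhausted: lexer.eof is returned

print(first, again, second, repr(at_end))
```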

\begin{methoddesc}{read_token}{}
Read a raw token. Ignore the pushback stack, and do not interpret source
requests. (This is not ordinarily a useful entry point, and is
documented here only for the sake of completeness.)
\end{methoddesc}

\begin{methoddesc}{sourcehook}{filename}
When \class{shlex} detects a source request (see
\member{source} below) this method is given the following token as
argument, and expected to return a tuple consisting of a filename and
an open file-like object.

Normally, this method first strips any quotes off the argument. If
the result is an absolute pathname, or there was no previous source
request in effect, or the previous source was a stream
(such as \code{sys.stdin}), the result is left alone. Otherwise, if the
result is a relative pathname, the directory part of the name of the
file immediately before it on the source inclusion stack is prepended
(this behavior is like the way the C preprocessor handles
\code{\#include "file.h"}).

The result of the manipulations is treated as a filename, and returned
as the first component of the tuple, with
\function{open()} called on it to yield the second component. (Note:
this is the reverse of the order of arguments in instance initialization!)

This hook is exposed so that you can use it to implement directory
search paths, addition of file extensions, and other namespace hacks.
There is no corresponding `close' hook, but a shlex instance will call
the \method{close()} method of the sourced input stream when it

For more explicit control of source stacking, use the
\method{push_source()} and \method{pop_source()} methods.
\end{methoddesc}
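
One possible use of the hook is to resolve source requests against
something other than the file system. The sketch below is only an
illustration: the \code{MemoryLexer} subclass, the \code{FAKE_FILES}
mapping and the \samp{source} keyword are all made up for this example.

```python
import io
import shlex

# Hypothetical in-memory "files"; names and contents are invented.
FAKE_FILES = {"extra.rc": "inner1 inner2"}


class MemoryLexer(shlex.shlex):
    """Illustrative subclass that serves source requests from a dict."""

    def sourcehook(self, newfile):
        # Strip the quotes, as the default hook does, then return the
        # (filename, open file-like object) tuple the lexer expects.
        name = newfile.strip('"')
        return name, io.StringIO(FAKE_FILES[name])


lexer = MemoryLexer('first source "extra.rc" last')
lexer.source = "source"   # enable source requests with this keyword
tokens = list(lexer)
print(tokens)
```

The tokens of the included text appear in place of the request, after
which lexing continues with the original input.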

\begin{methoddesc}{push_source}{stream\optional{, filename}}
Push an input source stream onto the input stack. If the filename
argument is specified it will later be available for use in error
messages. This is the same method used internally by the
\method{sourcehook} method.
\end{methoddesc}
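
Explicit source stacking might look like the following sketch. The
stream contents and the name \samp{included.cfg} are invented for the
example, and an \code{io.StringIO} object stands in for a real file.

```python
import io
import shlex

lexer = shlex.shlex("outer1 outer2")
tokens = [lexer.get_token()]          # read 'outer1' from the outer input

# Stack a second input source; the filename is only a label here.
inner = io.StringIO("inner1 inner2")
lexer.push_source(inner, "included.cfg")
name_while_included = lexer.infile    # name of the stacked source

# The lexer drains the stacked source, pops it, then resumes the outer one.
for token in lexer:
    tokens.append(token)

print(tokens)
print(inner.closed)
```

When the stacked stream is exhausted, the lexer pops it and calls its
\method{close()} method before returning to the outer input.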

\begin{methoddesc}{pop_source}{}
Pop the last-pushed input source from the input stack.
This is the same method used internally when the lexer reaches
\EOF{} on a stacked input stream.
\end{methoddesc}

\begin{methoddesc}{error_leader}{\optional{file\optional{, line}}}
This method generates an error message leader in the format of a
\UNIX{} C compiler error label; the format is \code{'"\%s", line \%d: '},
where the \samp{\%s} is replaced with the name of the current source
file and the \samp{\%d} with the current input line number (the
optional arguments can be used to override these).

This convenience is provided to encourage \module{shlex} users to
generate error messages in the standard, parseable format understood
by Emacs and other \UNIX{} tools.
\end{methoddesc}
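
A short sketch of the leader format follows. The file names
\samp{settings.rc} and \samp{other.rc} and the input text are
hypothetical.

```python
import shlex

# The second constructor argument sets the initial value of infile.
lexer = shlex.shlex("x = 1\ny = 2\n", "settings.rc")

# Uses the current source name and line number.
default_leader = lexer.error_leader()

# Both values can be overridden explicitly.
custom_leader = lexer.error_leader("other.rc", 7)

print(default_leader + "unexpected token")
print(custom_leader + "unexpected token")
```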

Instances of \class{shlex} subclasses have some public instance
variables which either control lexical analysis or can be used for

\begin{memberdesc}{commenters}
The string of characters that are recognized as comment beginners.
All characters from the comment beginner to end of line are ignored.
Includes just \character{\#} by default.
\end{memberdesc}

\begin{memberdesc}{wordchars}
The string of characters that will accumulate into multi-character
tokens. By default, includes all \ASCII{} alphanumerics and
\end{memberdesc}

\begin{memberdesc}{whitespace}
Characters that will be considered whitespace and skipped. Whitespace
bounds tokens. By default, includes space, tab, linefeed and
\end{memberdesc}

\begin{memberdesc}{escape}
Characters that will be treated as escape characters. This is used only
in \POSIX{} mode, and includes just \character{\textbackslash} by default.
\end{memberdesc}

\begin{memberdesc}{quotes}
Characters that will be considered string quotes. The token
accumulates until the same quote is encountered again (thus, different
quote types protect each other as in the shell). By default, includes
\ASCII{} single and double quotes.
\end{memberdesc}

\begin{memberdesc}{escapedquotes}
Characters in \member{quotes} that will interpret escape characters
defined in \member{escape}. This is only used in \POSIX{} mode, and
includes just \character{"} by default.
\end{memberdesc}

\begin{memberdesc}{whitespace_split}
If \code{True}, tokens will only be split on whitespace. This is useful, for
example, for parsing command lines with \class{shlex}, getting tokens
in a similar way to shell arguments.
\end{memberdesc}
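
The effect of \member{whitespace_split} can be sketched on a made-up
command line, here in the default non-\POSIX{} mode. The last line uses
\function{split()} on a quoted argument for comparison; in current
Python versions it also splits on whitespace only.

```python
import shlex

command = "ls -l /tmp/*.txt"   # arbitrary example input

# Default: characters outside wordchars become one-character tokens.
default_tokens = list(shlex.shlex(command))

# With whitespace_split set, tokens are only split on whitespace.
lexer = shlex.shlex(command)
lexer.whitespace_split = True
split_tokens = list(lexer)

# split() gives shell-argument-like tokens and strips the quotes.
args = shlex.split('ls -l "/tmp/my dir"')

print(default_tokens)
print(split_tokens)
print(args)
```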

\begin{memberdesc}{infile}
The name of the current input file, as initially set at class
instantiation time or stacked by later source requests. It may
be useful to examine this when constructing error messages.
\end{memberdesc}

\begin{memberdesc}{instream}
The input stream from which this \class{shlex} instance is reading
\end{memberdesc}

\begin{memberdesc}{source}
This member is \code{None} by default. If you assign a string to it,
that string will be recognized as a lexical-level inclusion request
similar to the \samp{source} keyword in various shells. That is, the
immediately following token will be opened as a filename and input taken
from that stream until \EOF, at which point the \method{close()}
method of that stream will be called and the input source will again
become the original input stream. Source requests may be stacked any
number of levels deep.
\end{memberdesc}

\begin{memberdesc}{debug}
If this member is numeric and \code{1} or more, a \class{shlex}
instance will print verbose progress output on its behavior. If you
need to use this, you can read the module source code to learn the
\end{memberdesc}

\begin{memberdesc}{lineno}
Source line number (count of newlines seen so far plus one).
\end{memberdesc}

\begin{memberdesc}{token}
The token buffer. It may be useful to examine this when catching
\end{memberdesc}

\begin{memberdesc}{eof}
Token used to determine end of file. This will be set to the empty
string (\code{''}), in non-\POSIX{} mode, and to \code{None} in
\end{memberdesc}

\subsection{Parsing Rules\label{shlex-parsing-rules}}

When operating in non-\POSIX{} mode, \class{shlex} will try to obey the
following rules.

\begin{itemize}
\item Quote characters are not recognized within words
      (\code{Do"Not"Separate} is parsed as the single word
      \code{Do"Not"Separate});
\item Escape characters are not recognized;
\item Enclosing characters in quotes preserve the literal value of
      all characters within the quotes;
\item Closing quotes separate words (\code{"Do"Separate} is parsed
      as \code{"Do"} and \code{Separate});
\item If \member{whitespace_split} is \code{False}, any character not
      declared to be a word character, whitespace, or a quote will be
      returned as a single-character token. If it is \code{True},
      \class{shlex} will only split words on whitespace;
\item EOF is signaled with an empty string (\code{''});
\item It's not possible to parse empty strings, even if quoted.
\end{itemize}
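
Some of the non-\POSIX{} rules above can be checked with a small
sketch. The helper \code{compat_tokens()} is a name invented for this
example, not part of the module.

```python
import shlex


def compat_tokens(text, whitespace_split=False):
    """Hypothetical helper: tokenize text in non-POSIX mode."""
    lexer = shlex.shlex(text)          # posix defaults to false
    lexer.whitespace_split = whitespace_split
    return list(lexer)


# Quote characters are not recognized within words.
inside_word = compat_tokens('Do"Not"Separate')

# Closing quotes separate words; the quotes stay in the token.
after_quote = compat_tokens('"Do"Separate')

# Other characters become single-character tokens, unless
# whitespace_split is true.
punctuation = compat_tokens('a+b')
unsplit = compat_tokens('a+b', whitespace_split=True)

print(inside_word, after_quote, punctuation, unsplit)
```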

When operating in \POSIX{} mode, \class{shlex} will try to obey the
following parsing rules.

\begin{itemize}
\item Quotes are stripped out, and do not separate words
      (\code{"Do"Not"Separate"} is parsed as the single word
      \code{DoNotSeparate});
\item Non-quoted escape characters (e.g. \character{\textbackslash})
      preserve the literal value of the next character that follows;
\item Enclosing characters in quotes which are not part of
      \member{escapedquotes} (e.g. \character{'}) preserve the literal
      value of all characters within the quotes;
\item Enclosing characters in quotes which are part of
      \member{escapedquotes} (e.g. \character{"}) preserve the literal
      value of all characters within the quotes, with the exception of
      the characters mentioned in \member{escape}. The escape characters
      retain their special meaning only when followed by the quote in use,
      or the escape character itself. Otherwise the escape character
      will be considered a normal character;
\item EOF is signaled with a \constant{None} value;
\item Quoted empty strings (\code{''}) are allowed.
\end{itemize}
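
The \POSIX{} rules can be illustrated in the same way. Again,
\code{posix_tokens()} is a helper invented for this sketch, and the
inputs are arbitrary examples written as raw strings so that the
backslashes reach the lexer unchanged.

```python
import shlex


def posix_tokens(text):
    """Hypothetical helper: tokenize text in POSIX mode."""
    return list(shlex.shlex(text, posix=True))


# Quotes are stripped out and do not separate words.
joined = posix_tokens('"Do"Not"Separate"')

# A non-quoted escape preserves the next character (here a space).
escaped_space = posix_tokens(r'a\ b')

# Single quotes are not in escapedquotes: the backslash is literal.
single_quoted = posix_tokens(r"'a\b'")

# Inside double quotes the escape only acts before the quote in use
# or the escape character itself.
escaped_quote = posix_tokens(r'"a\"b"')
plain_escape = posix_tokens(r'"a\nb"')

# A quoted empty string is a valid (empty) token.
empty = posix_tokens("''")

print(joined, escaped_space, single_quoted)
print(escaped_quote, plain_escape, empty)
```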