.TH REGEX 3 "17 May 1993"
.\" one other place knows this name: the SEE ALSO section
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int\ regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see \fIregex\fR(7).
\fIRegcomp\fR
compiles an RE written as a string into an internal form,
\fIregexec\fR
matches that internal form against a string and reports results,
\fIregerror\fR
transforms error codes from either into human-readable messages,
and \fIregfree\fR
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header \fI<regex.h>\fR
declares two structure types,
\fIregex_t\fR and \fIregmatch_t\fR,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type \fIregoff_t\fR,
and a number of constants with names starting with ``REG_''.
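.PP
As a rough illustration of how the four functions and the two structure
types fit together, the following sketch compiles an RE, runs it once,
and frees it.
The helper name first_match and its long-typed output parameters are
hypothetical, chosen only for this example; it uses nothing beyond the
POSIX interface.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): find the first match of
 * an extended RE in text.  Returns 0 and sets *start and *end on a match,
 * REG_NOMATCH when nothing matches, or another non-zero code on error.
 */
int first_match(const char *pattern, const char *text,
                long *start, long *end)
{
    regex_t re;          /* compiled internal form */
    regmatch_t m[1];     /* match report for the whole RE */
    int rc;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;       /* compilation failed; nothing to free */

    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        *start = (long)m[0].rm_so;
        *end = (long)m[0].rm_eo;
    }
    regfree(&re);        /* release the compiled form */
    return rc;
}
```
.fi
.PP
The compiled form is freed on every path that follows a successful
compile, whether or not the match succeeds.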
.PP
\fIRegcomp\fR
compiles the regular expression contained in the
\fIpattern\fR string,
subject to the flags in \fIcflags\fR,
and places the results in the \fIregex_t\fR
structure pointed to by \fIpreg\fR.
\fICflags\fR
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to \fIregcomp\fR.
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See \fIregex\fR(7).
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
\fIre_endp\fR
member of the structure pointed to by \fIpreg\fR.
The \fIre_endp\fR member is of type \fIconst\ char\ *\fR.
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
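.PP
The effect of the portable compile-time flags can be checked with a small
wrapper such as the one sketched below.
The name match_with_flags is hypothetical; the sketch always adds
REG_EXTENDED and REG_NOSUB and deliberately avoids the extensions
(REG_NOSPEC, REG_PEND), which other implementations may not provide.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern as an extended RE with the extra
 * compile-time flags in cflags and test it against text.
 * Returns 1 on a match, 0 on no match, -1 on any error.
 */
int match_with_flags(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB is always added: only success or failure is wanted. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```
.fi
.PP
For instance, passing REG_ICASE lets the RE `hello' match `HELLO',
and passing REG_NEWLINE lets `^two$' match the middle line of a
three-line string while stopping `.' from matching the newlines in it.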
.PP
When successful,
\fIregcomp\fR
returns 0 and fills in the structure pointed to by \fIpreg\fR.
One member of that structure
(other than \fIre_endp\fR)
is publicized:
\fIre_nsub\fR, of type \fIsize_t\fR,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If \fIregcomp\fR
fails, it returns a non-zero error code;
see DIAGNOSTICS.
.PP
\fIRegexec\fR
matches the compiled RE pointed to by \fIpreg\fR
against the \fIstring\fR,
subject to the flags in \fIeflags\fR,
and reports results using
\fInmatch\fR, \fIpmatch\fR,
and the returned value.
The RE must have been compiled by a previous invocation of
\fIregcomp\fR.
The compiled form is not altered during execution of
\fIregexec\fR,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by \fIstring\fR
is considered to be the text of an entire line, minus any terminating
newline.
The \fIeflags\fR
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of \fInmatch\fR.
See below for the definition of
\fIpmatch\fR and \fInmatch\fR.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
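.PP
A common use of REG_NOTBOL is scanning one string for successive matches:
every search after the first starts in the middle of the line, so `^'
must not match there.
The sketch below assumes this usage; count_matches is a hypothetical
helper, and its handling of empty matches (step forward one character)
is one possible choice among several.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count non-overlapping matches of an extended RE
 * in text.  Returns the count, or -1 if the RE does not compile.
 */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    int eflags = 0;
    int count = 0;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, m, eflags) == 0) {
        count++;
        p += m[0].rm_eo;             /* resume after this match */
        if (m[0].rm_so == m[0].rm_eo) {
            if (*p == 0)             /* empty match at end of string */
                break;
            p++;                     /* step past an empty match */
        }
        eflags = REG_NOTBOL;         /* p no longer starts a line */
    }
    regfree(&re);
    return count;
}
```
.fi
.PP
With this loop the RE `^a' is counted once in `aaa'; without REG_NOTBOL
on the later calls, each restart would look like a fresh beginning of line.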
.PP
See \fIregex\fR(7)
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of \fIstring\fR.
.PP
Normally,
\fIregexec\fR
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if \fInmatch\fR is 0,
\fIregexec\fR ignores the \fIpmatch\fR
argument (but see below for the case where REG_STARTEND is specified).
Otherwise, \fIpmatch\fR
points to an array of
\fInmatch\fR structures of type \fIregmatch_t\fR.
Such a structure has at least the members
\fIrm_so\fR and \fIrm_eo\fR,
both of type \fIregoff_t\fR
(a signed arithmetic type at least as large as an
\fIoff_t\fR and a \fIssize_t\fR),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
\fIstring\fR argument given to \fIregexec\fR.
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the \fIpmatch\fR
array is filled in to indicate what substring of \fIstring\fR
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member \fIi\fR
reports subexpression \fIi\fR,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
\fIrm_so\fR and \fIrm_eo\fR set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
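.PP
The offsets in \fIpmatch\fR can be turned into substrings as in the
following sketch.
The helper submatch, the fixed array size MAX_GROUPS, and the truncating
copy are all assumptions made for the example, not part of the interface.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10   /* hypothetical limit: whole match plus 9 groups */

/*
 * Hypothetical helper: copy what subexpression idx matched (0 is the
 * whole RE) into out, truncating to outsz - 1 characters.
 * Returns the copied length, or -1 if there is no match, the entry is
 * unused, idx is out of range, or the RE does not compile.
 */
int submatch(const char *pattern, const char *text, size_t idx,
             char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    int len = -1;

    assert(pattern != NULL && text != NULL && out != NULL);

    if (idx >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (idx <= re.re_nsub &&
        regexec(&re, text, MAX_GROUPS, m, 0) == 0 &&
        m[idx].rm_so != -1) {           /* -1 marks an unused entry */
        size_t n = (size_t)(m[idx].rm_eo - m[idx].rm_so);

        if (n >= outsz)
            n = outsz - 1;
        memcpy(out, text + m[idx].rm_so, n);
        out[n] = 0;
        len = (int)n;
    }
    regfree(&re);
    return len;
}
```
.fi
.PP
Checking \fIidx\fR against \fIre_nsub\fR before looking at the array
keeps the helper from reporting a subexpression that the RE does not contain.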
.PP
If REG_STARTEND is specified,
\fIpmatch\fR
must point to at least one
\fIregmatch_t\fR
(even if \fInmatch\fR
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by \fInmatch\fR;
if \fInmatch\fR
is 0 or REG_NOSUB was specified,
the value of \fIpmatch\fR[0]
will not be changed by a successful
\fIregexec\fR.
.PP
\fIRegerror\fR
maps a non-zero \fIerrcode\fR
from either \fIregcomp\fR or \fIregexec\fR
to a human-readable, printable message.
If \fIpreg\fR is non-NULL,
the error code should have arisen from use of
the \fIregex_t\fR pointed to by \fIpreg\fR,
and if the error code came from
\fIregcomp\fR,
it should have been the result from the most recent
\fIregcomp\fR using that \fIregex_t\fR.
(\fIRegerror\fR
may be able to supply a more detailed message using information
from the \fIregex_t\fR.)
\fIRegerror\fR
places the NUL-terminated message into the buffer pointed to by
\fIerrbuf\fR,
limiting the length (including the NUL) to at most
\fIerrbuf_size\fR bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If \fIerrbuf_size\fR is 0,
\fIerrbuf\fR
is ignored but the return value is still correct.
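.PP
Because the return value is the full size required, a caller can ask for
the size first and allocate exactly that much.
The sketch below does so; re_error_message is a hypothetical helper, and
passing NULL as the buffer on the sizing call relies on the statement
above that the buffer is ignored when its size is 0.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a freshly allocated copy of the message
 * for a non-zero errcode, or NULL if memory runs out.
 * The caller releases the result with free().
 */
char *re_error_message(int errcode, const regex_t *preg)
{
    size_t need;
    char *buf;

    assert(errcode != 0);

    /* First call: size 0, so only the required size is reported. */
    need = regerror(errcode, preg, NULL, 0);

    buf = malloc(need);
    if (buf != NULL)
        (void)regerror(errcode, preg, buf, need);   /* fits exactly */
    return buf;
}
```
.fi
.PP
A fixed-size buffer works as well; the message is then truncated to fit,
and comparing the return value with the buffer size shows whether
truncation took place.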
.PP
If the \fIerrcode\fR given to \fIregerror\fR
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If \fIerrcode\fR is REG_ATOI,
then \fIpreg\fR
shall be non-NULL and the
\fIre_endp\fR
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
\fIerrbuf\fR
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
\fIRegfree\fR
frees any dynamically-allocated storage associated with the compiled RE
pointed to by \fIpreg\fR.
The remaining \fIregex_t\fR
is no longer a valid compiled RE
and the effect of supplying it to
\fIregexec\fR or \fIregerror\fR
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See \fIregex\fR(7)
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
regex(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
\fIregcomp\fR and \fIregexec\fR
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
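.PP
Where the REG_ITOA extension is unavailable, the names in the table can
be recovered with an ordinary switch, as sketched below.
The helpers re_code_name and compile_status are hypothetical, as are the
placeholder strings returned for 0 and for unrecognized codes; the three
codes that POSIX does not define are included only where the header
defines them as macros.
.nf
```c
#include <sys/types.h>
#include <regex.h>
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: printable name of an error code. */
const char *re_code_name(int code)
{
    switch (code) {
    case 0:            return "(success)";
    case REG_NOMATCH:  return "REG_NOMATCH";
    case REG_BADPAT:   return "REG_BADPAT";
    case REG_ECOLLATE: return "REG_ECOLLATE";
    case REG_ECTYPE:   return "REG_ECTYPE";
    case REG_EESCAPE:  return "REG_EESCAPE";
    case REG_ESUBREG:  return "REG_ESUBREG";
    case REG_EBRACK:   return "REG_EBRACK";
    case REG_EPAREN:   return "REG_EPAREN";
    case REG_EBRACE:   return "REG_EBRACE";
    case REG_BADBR:    return "REG_BADBR";
    case REG_ERANGE:   return "REG_ERANGE";
    case REG_ESPACE:   return "REG_ESPACE";
    case REG_BADRPT:   return "REG_BADRPT";
#ifdef REG_EMPTY       /* the remaining codes are not in POSIX */
    case REG_EMPTY:    return "REG_EMPTY";
#endif
#ifdef REG_ASSERT
    case REG_ASSERT:   return "REG_ASSERT";
#endif
#ifdef REG_INVARG
    case REG_INVARG:   return "REG_INVARG";
#endif
    default:           return "(unknown)";
    }
}

/* Hypothetical helper: name of the code produced by compiling pattern. */
const char *compile_status(const char *pattern)
{
    regex_t re;
    int rc;

    assert(pattern != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0)
        regfree(&re);    /* only a successful compile needs freeing */
    return re_code_name(rc);
}
```
.fi
.PP
Which code a given malformed RE produces can differ between
implementations, so portable callers should test only for zero,
REG_NOMATCH, or ``anything else''.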
.SH HISTORY
Written by Henry Spencer at University of Toronto,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
\fIRegexec\fR
performance is poor.
This will improve with later releases.
\fINmatch\fR
exceeding 0 is expensive;
\fInmatch\fR
exceeding 1 is worse.
\fIRegexec\fR
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
\fIRegcomp\fR
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
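.PP
The scale of the problem can be estimated with simple arithmetic.
The sketch below assumes, as a simplification, that expanding a bound
{m,n} produces up to n copies of its operand, so nested bounds multiply;
the function name expanded_copies is hypothetical and the real expansion
may differ in detail.
.nf
```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical estimate: upper bound on the number of copies of the
 * innermost operand after naive expansion of nested bounds.
 * max_counts[i] is the upper bound n of the i-th {m,n}, at any nesting
 * order.  Overflow is not checked; keep the product below 2^64.
 */
unsigned long long expanded_copies(const unsigned *max_counts, size_t depth)
{
    unsigned long long copies = 1;
    size_t i;

    assert(max_counts != NULL || depth == 0);

    for (i = 0; i < depth; i++)
        copies *= max_counts[i];     /* each level multiplies the total */
    return copies;
}
```
.fi
.PP
For the RE above, five nested bounds of 100 give 100 to the fifth power,
that is, ten billion copies of `a' under this assumption.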
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.