.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.4.2.4 2001/12/14 18:33:56 ru Exp $
.\" $DragonFly: src/lib/libc/regex/regex.3,v 1.5 2008/06/02 06:50:08 hasso Exp $
.\"
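.Dt REGEX 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags"
.Ft int
.Fo regexec
.Fa "const regex_t *preg" "const char *string"
.Fa "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
.Fc
.Ft size_t
.Fo regerror
.Fa "int errcode" "const regex_t *preg"
.Fa "char *errbuf" "size_t errbuf_size"
.Fc
.Ft void
.Fn regfree "regex_t *preg"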
.Sh DESCRIPTION
These routines implement
.St -p1003.2
regular expressions
.Pq Do RE Dc Ns s ;
see
.Xr re_format 7 .
The
.Fn regcomp
function
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Ft regex_t
and
.Ft regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Ft regoff_t ,
and a number of constants with names starting with
.Dq Dv REG_ .
.Pp
The
.Fn regcomp
function
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Ft regex_t
structure pointed to by
.Fa preg .
The
.Fa cflags
argument
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
.It Dv REG_EXTENDED
Compile modern
.Pq Dq extended
REs,
rather than the obsolete
.Pq Dq basic
REs that
are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to
.Dv REG_EXTENDED
to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the
.Dq RE
is a literal string.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used
in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
.Ql [^
bracket expressions and
.Ql .\&
never match newline,
a
.Ql ^\&
anchor matches the null string after any newline in the string
in addition to its normal function,
and the
.Ql $\&
anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.Va re_endp
member of the structure pointed to by
.Fa preg .
The
.Va re_endp
member is of type
.Ft const char * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
.El
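.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure
(other than
.Va re_endp )
is publicized:
.Va re_nsub ,
of type
.Ft size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .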
.Pp
The
.Fn regexec
function
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
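.Pp
As a minimal sketch of the compile, match, and free cycle described above,
the following helper reports only whether a string matches.
The name
.Fn re_matches
and its return convention are assumptions made for illustration;
they are not part of the library.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <regex.h>

/*
 * Hypothetical helper (not part of the library): returns 1 if text
 * matches the extended RE in pattern, 0 if it does not, and -1 if
 * compilation or matching fails for any other reason.
 */
int
re_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	/* REG_NOSUB: only success or failure is needed, not offsets. */
	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);	/* release storage held by the compiled form */
	if (rc == 0)
		return (1);
	return (rc == REG_NOMATCH ? 0 : -1);
}
```
.Ed
.Pp
A caller that matches many strings against one RE would normally
compile once and reuse the
.Ft regex_t
rather than recompiling on every call as this sketch does.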
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
.It Dv REG_NOTBOL
The first character of
the string
is not the beginning of a line, so the
.Ql ^\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating
the string
does not end a line, so the
.Ql $\&
anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa string
+
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero
.Va rm_so
does not imply
.Dv REG_NOTBOL ;
.Dv REG_STARTEND
affects only the location of the string,
not how it is matched.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE,
or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Ft regmatch_t .
Such a structure has at least the members
.Va rm_so
and
.Va rm_eo ,
both of type
.Ft regoff_t
(a signed arithmetic type at least as large as an
.Ft off_t
and a
.Ft ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Va i
reports subexpression
.Va i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Va i
>
.Fa preg Ns -> Ns Va re_nsub ) )
have both
.Va rm_so
and
.Va rm_eo
set to -1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, in particular, that when the RE
.Ql "(b*)+"
matches
.Ql bbb ,
the parenthesized subexpression matches each of the three
.So Li b Sc Ns s
and then
an infinite number of empty strings following the last
.Ql b ,
so the reported substring is one of the empties.)
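.Pp
The offset reporting described above can be sketched as follows.
The helper name
.Fn re_group ,
the fixed array size
.Dv NGROUPS ,
and the return convention are assumptions made for illustration only.
The sketch copies the substring reported for one subexpression and treats
offsets of -1 as
.Dq did not participate .
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <string.h>
#include <regex.h>

#define NGROUPS 3	/* hypothetical: pmatch[0] plus two subexpressions */

/*
 * Hypothetical helper: match the extended RE in pattern against text
 * and copy the substring reported for entry idx of the match array
 * into out.  Returns the substring length, or -1 if there is no match,
 * the entry is unused (offsets of -1), or out is too small.
 */
long
re_group(const char *pattern, const char *text, size_t idx,
    char *out, size_t outsize)
{
	regex_t re;
	regmatch_t pm[NGROUPS];
	long len = -1;

	if (idx >= NGROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	if (regexec(&re, text, NGROUPS, pm, 0) == 0 && pm[idx].rm_so != -1) {
		/* rm_eo is the offset just past the end of the substring. */
		size_t n = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);

		if (n < outsize) {
			memcpy(out, text + pm[idx].rm_so, n);
			out[n] = 0;
			len = (long)n;
		}
	}
	regfree(&re);
	return (len);
}
```
.Ed
.Pp
Entry 0 yields the text matched by the entire RE;
entries 1 and 2 yield the text matched by the first and second
parenthesized subexpressions.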
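.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Ft regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch Ns [0]
will not be changed by a successful
.Fn regexec .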
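.Pp
A sketch of searching only part of a buffer with
.Dv REG_STARTEND
follows.
The helper name
.Fn re_search_range
and its return convention are hypothetical.
Because the flag is an extension, the sketch assumes that its presence can
be tested with the preprocessor and simply reports failure where the
header does not define it.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <regex.h>

/*
 * Hypothetical helper: search only the bytes of text from offset start
 * up to (not including) offset end for the extended RE in pattern.
 * Returns 0 and stores the match offsets on success, 1 if nothing
 * matched, and -1 on any other failure.
 */
int
re_search_range(const char *pattern, const char *text,
    size_t start, size_t end, long *so, long *eo)
{
#ifdef REG_STARTEND
	regex_t re;
	regmatch_t pm[1];
	int rc;

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return (-1);
	pm[0].rm_so = (regoff_t)start;	/* input: where the string starts */
	pm[0].rm_eo = (regoff_t)end;	/* input: where the virtual NUL sits */
	rc = regexec(&re, text, 1, pm, REG_STARTEND);
	regfree(&re);
	if (rc != 0)
		return (rc == REG_NOMATCH ? 1 : -1);
	/* Output offsets are still measured from text itself. */
	*so = (long)pm[0].rm_so;
	*eo = (long)pm[0].rm_eo;
	return (0);
#else
	/* Extension not provided by this regex.h. */
	(void)pattern; (void)text; (void)start; (void)end;
	(void)so; (void)eo;
	return (-1);
#endif
}
```
.Ed
.Pp
Since there need not be a NUL at the end offset,
this form is convenient for matching inside a larger buffer without
copying or temporarily modifying it.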
.Pp
The
.Fn regerror
function
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
the
.Ft regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Ft regex_t .
(The
.Fn regerror
function
may be able to supply a more detailed message using information
from the
.Ft regex_t . )
The
.Fn regerror
function
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
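.Pp
Because the return value is the buffer size needed even when
.Fa errbuf_size
is 0, a message of any length can be retrieved with two calls.
The following is a minimal sketch of that idiom;
the helper name
.Fn re_errmsg
is hypothetical and not part of the library.
.Bd -literal -offset indent
```c
#include <stdlib.h>
#include <regex.h>

/*
 * Hypothetical helper: return a malloc'd copy of the message for
 * errcode, or NULL if memory is unavailable.  The caller frees it.
 */
char *
re_errmsg(int errcode, const regex_t *preg)
{
	size_t need;
	char *msg;

	/* With a size of 0 the buffer is ignored; only the size returns. */
	need = regerror(errcode, preg, NULL, 0);
	if ((msg = malloc(need)) == NULL)
		return (NULL);
	/* Second call: the buffer is now exactly large enough. */
	(void)regerror(errcode, preg, msg, need);
	return (msg);
}
```
.Ed
.Pp
A fixed-size buffer also works, at the cost of possible truncation;
comparing the return value with the buffer size reveals whether the
message was cut short.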
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the
.Dq message
that results is the printable name of the error code,
e.g.\&
.Dq Dv REG_NOMATCH ,
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be
.No non\- Ns Dv NULL
and the
.Va re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by
.St -p1003.2 ,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
The
.Fn regfree
function
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Ft regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2
leaves up to the implementor,
either by explicitly saying
.Dq undefined
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See
.Sx BUGS
for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2
(such magic meanings occur only in obsolete
.Bq Dq basic
REs)
is taken as an ordinary character.
.Pp
Any unmatched
.Ql [\&
is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator
.Ql ( ?\& ,
.Ql *\& ,
.Ql +\& ,
or bounds)
cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow
.Ql ^\&
or
.Ql |\& .
.Pp
.Ql |\&
cannot appear first or last in a (sub)expression or after another
.Ql |\& ,
i.e. an operand of
.Ql |\&
cannot be an empty subexpression.
An empty parenthesized subexpression,
.Ql "()" ,
is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A
.Ql {\&
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A
.Ql {\&
.Em not
followed by a digit is considered an ordinary character.
.Pp
.Ql ^\&
and
.Ql $\&
beginning and ending subexpressions in obsolete
.Pq Dq basic
REs are anchors, not ordinary characters.
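.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width REG_ECOLLATE -compact
.It Dv REG_NOMATCH
.Fn regexec
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
.Ql \e
applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets
.Ql "[ ]"
not balanced
.It Dv REG_EPAREN
parentheses
.Ql "( )"
not balanced
.It Dv REG_EBRACE
braces
.Ql "{ }"
not balanced
.It Dv REG_BADBR
invalid repetition count(s) in
.Ql "{ }"
.It Dv REG_ERANGE
invalid character range in
.Ql "[ ]"
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
.Ql ?\& ,
.Ql *\& ,
or
.Ql +\&
operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
can't happen - you found a bug
.It Dv REG_INVARG
invalid argument, e.g. negative-length string
.El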
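.Sh SEE ALSO
.Xr grep 1 ,
.Xr re_format 7
.Pp
.St -p1003.2 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by
.An Henry Spencer .
Altered for inclusion in the
.Bx 4.4
distribution.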
.Sh BUGS
This is an alpha release with known defects.
Please report problems.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
The
.Fn regexec
function
performance is poor.
This will improve with later releases.
An
.Fa nmatch
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
The
.Fn regexec
function
is largely insensitive to RE complexity
.Em except
that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
The
.Fn regcomp
function
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2 ,
things like
.Ql "a)b"
are legal REs because
.Ql )\&
is
a special character only in the presence of a previous unmatched
.Ql (\& .
This can't be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
.Ql "a\e(\e(b\e)*\e2\e)*d"
match
.Ql "abbbd" ?
Until the standard is clarified,
behavior in such cases should not be relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.