.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\" This code is derived from software contributed to Berkeley by
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.21 2007/01/09 00:28:04 imp Exp $
.Nd regular-expression library
.Sy (Standards-compliant APIs)
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "const regex_t *restrict preg"
.Fa "char *restrict errbuf"
.Fa "size_t errbuf_size"
.Fa "const regex_t *restrict preg"
.Fa "const char *restrict string"
.Fa "regmatch_t pmatch[restrict]"
.Sy (Non-portable extensions)
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "const regex_t *restrict preg"
.Fa "const char *restrict string"
.Fa "regmatch_t pmatch[restrict]"
.Fa "regex_t *restrict preg"
.Fa "const wchar_t *restrict widepat"
.Fa "const regex_t *restrict preg"
.Fa "const wchar_t *restrict widestr"
.Fa "regmatch_t pmatch[restrict]"
.Fa "regex_t *restrict preg"
.Fa "const wchar_t *restrict widepat"
.Fa "const regex_t *restrict preg"
.Fa "const wchar_t *restrict widestr"
.Fa "regmatch_t pmatch[restrict]"
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "locale_t restrict"
.Fa "regex_t *restrict preg"
.Fa "const char *restrict pattern"
.Fa "locale_t restrict"
.Fa "regex_t *restrict preg"
.Fa "const wchar_t *restrict widepat"
.Fa "locale_t restrict"
.Fa "regex_t *restrict preg"
.Fa "const wchar_t *restrict widepat"
.Fa "locale_t restrict"
These routines implement
compiles an RE, written as a string, into an internal form.
matches that internal form against a string and reports results.
transforms error codes from either into human-readable messages.
frees any dynamically-allocated storage used by the internal form
declares two structure types,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
and a number of constants with names starting with
compiles the regular expression contained in the
subject to the flags in
and places the results in the
structure pointed to by
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
rather than the obsolete
This is a synonym for 0,
provided as a counterpart to
to improve readability.
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Compile for matching that ignores upper/lower case distinctions.
Compile for matching that need only report success or failure,
not what was matched.
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
bracket expressions and
anchor matches the null string after any newline in the string
in addition to its normal function,
anchor matches the null string before any newline in the
string in addition to its normal function.
is not recognized by any of the wide character or
variants can be used instead of
see EXTENDED APIS below.)
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
member of the structure pointed to by
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Recognize enhanced regular expression features; see
This is an extension not specified by
and should be used with
caution in software intended to be portable to other systems.
Use minimal (non-greedy) repetitions instead of the normal greedy ones; see
(This only applies when both
This is an extension not specified by
and should be used with
caution in software intended to be portable to other systems.
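.Pp
As an illustration of how the compilation flags combine, the following
minimal sketch wraps the compile, match and free steps in one helper.
The helper name re_matches and its return convention are assumptions made
for this example only; they are not part of the library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the regex API): compile `pattern` with
 * `cflags`, test it against `text`, then release the compiled form.
 * Returns 1 on a match, 0 on no match, -1 if compiling or matching fails.
 */
static int
re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: only success or failure is wanted, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

With this helper, the pattern "^b+$" applied to "a\enbb\enc" should match
only when REG_NEWLINE is added to REG_EXTENDED, and "hello" should match
"Say HELLO" only when REG_ICASE is added.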
returns 0 and fills in the structure pointed to by
One member of that structure
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
fails, it returns a non-zero error code;
matches the compiled RE pointed to by
subject to the flags in
and reports results using
and the returned value.
The RE must have been compiled by a previous invocation of
The compiled form is not altered during execution of
so a single compiled RE can be used simultaneously by multiple threads.
the NUL-terminated string pointed to by
is considered to be the text of an entire line, minus any terminating
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
The first character of
is not the beginning of a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
does not end a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
The string is considered to start at
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
See below for the definition of
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
affects only the location of the string,
not how it is matched.
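.Pp
The REG_STARTEND convention described above can be sketched as follows.
The helper name search_range and its return codes are assumptions for
illustration; because the flag is an extension, the sketch is guarded with
a preprocessor check and simply reports failure where the flag is absent.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search only buf[start..end) for an extended RE.
 * Returns the match offset measured from the start of buf, -1 if nothing
 * matches, or -2 on error or when REG_STARTEND is not available.
 */
static long
search_range(const char *pattern, const char *buf, size_t start, size_t end)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t pm[1];
    int rc;

    assert(start <= end);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;

    /* Input: pmatch[0] delimits the region to be searched. */
    pm[0].rm_so = (regoff_t)start;
    pm[0].rm_eo = (regoff_t)end;
    rc = regexec(&re, buf, 1, pm, REG_STARTEND);
    regfree(&re);

    /* Output: pmatch[0] now describes the match itself. */
    if (rc == 0)
        return (long)pm[0].rm_so;
    return rc == REG_NOMATCH ? -1 : -2;
#else
    (void)pattern; (void)buf; (void)start; (void)end;
    return -2;
#endif
}
```

Since no NUL is needed at the end offset, the same approach can be used to
search a slice of a larger buffer without copying it.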
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
returns 0 for success and the non-zero code
Other non-zero error codes may be returned in exceptional situations;
was specified in the compilation of the RE,
argument (but see below for the case where
points to an array of
Such a structure has at least the members
(a signed arithmetic type at least as large as an
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
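.Pp
Because offsets are relative to the string handed to the matcher, a caller
can walk through successive matches by advancing the string pointer.
The sketch below is one possible way to do this; the helper name
count_matches is hypothetical, and stepping by one byte after an empty
match assumes a single-byte locale.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the non-overlapping matches of an extended RE
 * in `text`. Returns the count, or -1 if the pattern fails to compile.
 */
static int
count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t pm[1];
    const char *p = text;
    int count = 0;
    int eflags = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, pm, eflags) == 0) {
        count++;
        if (pm[0].rm_eo > pm[0].rm_so) {
            /* Non-empty match: resume right after it. */
            p += pm[0].rm_eo;
        } else {
            /* Empty match (equal offsets): step one byte to make progress. */
            if (p[pm[0].rm_so] == '\0')
                break;
            p += pm[0].rm_so + 1;
        }
        /* Later calls start mid-line, so "^" must not match at p. */
        eflags = REG_NOTBOL;
    }

    regfree(&re);
    return count;
}
```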
The 0th member of the
array is filled in to indicate what substring of
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
reports subexpression
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Fa preg Ns -> Ns Va re_nsub ) )
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE
the parenthesized subexpression matches each of the three
an infinite number of empty strings following the last
so the reported substring is one of the empties.)
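.Pp
The reporting of subexpressions can be illustrated with a small sketch that
copies one reported substring out of the input.
The helper name capture, the fixed array of 10 entries, and the return
convention are assumptions for this example; it relies on unused entries
having an rm_so of -1.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: match an extended RE against `text` and copy
 * subexpression `group` (0 means the whole match) into `out`.
 * Returns the length copied, or -1 if there is no match, the group did not
 * participate in the match, or the substring does not fit in `out`.
 */
static int
capture(const char *pattern, const char *text, size_t group,
    char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[10];
    int len = -1;

    assert(out != NULL && outsz > 0);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* re_nsub is the number of parenthesized subexpressions in the RE. */
    if (group <= re.re_nsub && group < 10 &&
        regexec(&re, text, 10, pm, 0) == 0 &&
        pm[group].rm_so != -1) {
        size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);

        if (n < outsz) {
            memcpy(out, text + pm[group].rm_so, n);
            out[n] = '\0';
            len = (int)n;
        }
    }

    regfree(&re);
    return len;
}
```

For the RE "([a-z]+)=([0-9]+)" and the string "x key=42;", group 1 should
yield "key" and group 2 should yield "42".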
must point to at least one
to hold the input offsets for
Use for output is still entirely controlled by
will not be changed by a successful
to a human-readable, printable message.
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
and if the error code came from
it should have been the result from the most recent
may be able to supply a more detailed message using information
places the NUL-terminated message into the buffer pointed to by
limiting the length (including the NUL) to at most
If the whole message will not fit,
as much of it as will fit before the terminating NUL is supplied.
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
is ignored but the return value is still correct.
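.Pp
The size-query behavior described above lends itself to a two-call idiom:
ask for the required size first, then fetch the message.
The following is a minimal sketch; the helper name regex_strerror is
hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc'ed copy of the message for
 * `errcode`, or NULL if allocation fails. The caller frees the result.
 */
static char *
regex_strerror(int errcode, const regex_t *preg)
{
    size_t need;
    char *msg;

    /* With a buffer size of 0 the buffer is ignored; only the size is used. */
    need = regerror(errcode, preg, NULL, 0);
    assert(need > 0);   /* the size includes the terminating NUL */

    msg = malloc(need);
    if (msg != NULL)
        (void)regerror(errcode, preg, msg, need);
    return msg;
}
```

A caller would typically pass the non-zero value returned by a failed
compilation together with the same regex_t, print the message, and free it.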
that results is the printable name of the error code,
rather than an explanation thereof.
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
are intended primarily as debugging facilities;
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
frees any dynamically-allocated storage associated with the compiled RE
is no longer a valid compiled RE
and the effect of supplying it to
None of these functions references global variables except for tables
all are safe for use from multiple threads if the arguments are safe.
These extended APIs are available in Mac OS X 10.8 and beyond, when the
deployment target is 10.8 or later.
It should also be noted that any of the
variants may be used to initialize a
structure, which can then be passed to any of the
So it is quite legal to compile a wide character RE and use it to match a
multibyte character string, or vice versa.
routine compiles regular expressions like
but the length of the regular expression string is specified, allowing a string
that is not NUL terminated and/or contains NUL characters.
This is a modern replacement for using
but the length of the string to match is specified, allowing a string
that is not NUL terminated and/or contains NUL characters.
variants take a wide-character
string for the regular expression and string to match.
are variants that allow specifying the wide character string length, and
so allow wide character strings that are not NUL terminated and/or
contain NUL characters.
.Sh INTERACTION WITH THE LOCALE
or one of its variants is run, the regular expression is compiled into an
internal form, which may include specific information about the locale currently
in effect, such as equivalence classes or multi-character collation symbols.
So a reference to the current locale is also stored with the internal form,
is run, it can use the same locale (even if the locale is changed in-between
To provide more direct control over which locale is used,
appended to their names are provided that work just like the variants
except that a locale (via a
variable type) is specified directly.
Note that only variants of
variants just use the reference to the locale stored in the internal form.
.Sh IMPLEMENTATION CHOICES
implementation in Mac OS X 10.8 and later is based on a heavily modified subset
of TRE (http://laurikari.net/tre/).
This provides improved performance, better conformance and additional features.
However, both API and binary compatibility have been maintained with previous
releases, so binaries
built on previous releases should work on 10.8 and later, and binaries built on
10.8 and later should be able to run on previous releases (as long as none of
the new variants or new features are used).
There are a number of decisions that
leaves up to the implementor,
either by explicitly saying
or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
for a discussion of the definition of case-independent matching.
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
for one short RE using them
that will run almost any system out of memory.
A backslashed character other than one specifically given a magic meaning
(such magic meanings occur only in obsolete
is taken as an ordinary character.
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
the limit on repetition counts in bounded repetitions, is 255.
A repetition operator
cannot follow another
repetition operator, except for the use of
for minimal repetition (for enhanced extended REs; see
A repetition operator cannot begin an expression or subexpression
cannot appear first or last in a (sub)expression or after another
cannot be an empty subexpression.
An empty parenthesized subexpression,
is legal and matches an
An empty string is not a legal RE.
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
followed by a digit is considered an ordinary character.
beginning and ending subexpressions in obsolete
REs are anchors, not ordinary characters.
Non-zero error codes from
include the following:
.Bl -tag -width REG_ECOLLATE -compact
invalid regular expression
invalid collating element
invalid character class
applied to unescapable character
invalid backreference number
invalid repetition count(s) in
invalid character range in
empty (sub)expression
cannot happen - you found a bug
invalid argument, e.g.\& negative-length string
illegal byte sequence (bad multibyte character)
sections 2.8 (Regular Expression Notation)
B.5 (C Binding for Regular Expression Matching).
implementation is based on a heavily modified subset of TRE
(http://laurikari.net/tre/), originally written by Ville Laurikari.
Previous releases used an implementation originally written by
and altered for inclusion in the
The beginning-of-line and end-of-line anchors (
are currently implemented so that repetitions cannot be applied to them.
The standards are unclear about whether this is legal, but other
packages do support this case.
It is best to avoid this non-portable (and not really very useful) case.
The back-reference code is subtle and doubts linger about its correctness
variants use one of two internal matching engines.
The normal one is linear worst-case time in the length of the text being
searched, and quadratic worst-case time in the length of the used regular
When back-references are used, a slower, backtracking engine is used.
While all backtracking matching engines suffer from extreme slowness for certain
pathological cases, the normal engine does not suffer from these cases.
It is advised to avoid back-references whenever possible.
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
are legal REs because
a special character only in the presence of a previous unmatched
This cannot be fixed until the spec is fixed.
The standard's definition of back references is vague.
.Ql "a\e(\e(b\e)*\e2\e)*d"
Until the standard is clarified,
behavior in such cases should not be relied on.