.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\" This code is derived from software contributed to Berkeley by
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" @(#)re_format.7 8.3 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/re_format.7,v 1.12 2008/09/05 17:41:20 keramida Exp $
.Nd POSIX 1003.2 regular expressions
modern REs (roughly those of
and obsolete REs (roughly those of
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
leaves some aspects of RE syntax and semantics open;
`\(dd' marks decisions on these aspects that
may not be fully portable to other
A (modern) RE is one\(dd or more non-empty\(dd
It matches anything that matches one of the branches.
A branch is one\(dd or more
It matches a match for the first, followed by a match for the second, etc.
matches a sequence of 0 or more matches of the atom.
matches a sequence of 1 or more matches of the atom.
matches a sequence of 0 or 1 matches of the atom.
followed by an unsigned decimal integer,
possibly followed by another unsigned decimal integer,
The integers must lie between 0 and
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer
a sequence of exactly
An atom followed by a bound
containing one integer
or more matches of the atom.
An atom followed by a bound
containing two integers
(inclusive) matches of the atom.
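.Pp
The bound rules above can be checked with a short C sketch.
It assumes the POSIX
.Xr regcomp 3
and
.Xr regexec 3
interface with
.Dv REG_EXTENDED ;
the helper name
.Fn ere_matches
is hypothetical and not part of any library.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the modern (extended) RE `pattern`
 * matches somewhere in `text`, 0 if it does not, and -1 if regcomp()
 * rejects the RE. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    /* REG_NOSUB: only the yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```
.Ed
Under these assumptions,
.Ql ^a{2,3}$
accepts
.Ql aaa
but not
.Ql aaaa ,
and a bound whose first integer exceeds the second, such as
.Ql a{3,2} ,
is expected to be rejected at compile time.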
An atom is a regular expression enclosed in
(matching a match for the
(matching the null string)\(dd,
.Em bracket expression
(matching any single character),
(matching the null string at the beginning of a line),
(matching the null string at the end of a line), a
followed by one of the characters
(matching that character taken as an ordinary character),
followed by any other character\(dd
(matching that character taken as an ordinary character,
had not been present\(dd),
or a single character with no other significance (matching that character).
followed by a character other than a digit is an ordinary
character, not the beginning of a bound\(dd.
It is illegal to end an RE with
.Em bracket expression
is a list of characters enclosed in
It normally matches any single character from the list (but see below).
If the list begins with
it matches any single character
from the rest of the list.
If two characters in the list are separated by
of characters between those two (inclusive) in the
in ASCII matches any decimal digit.
It is illegal\(dd for two ranges to share an
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
in the list, make it the first character
(following a possible
make it the first or last character,
or the second endpoint of a range.
as the first endpoint of a range,
to make it a collating element (see below).
With the exception of these and some combinations using
(see next paragraphs), all other special characters, including
lose their special significance within a bracket expression.
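.Pp
As an illustration of the list, negation, range, and placement rules
for bracket expressions, the following C sketch may help.
It assumes the POSIX
.Xr regcomp 3
interface, the C locale, and an ASCII collating sequence;
.Fn ere_matches
is a hypothetical helper name chosen for this example.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the modern (extended) RE `pattern`
 * matches somewhere in `text`, 0 if it does not, and -1 if regcomp()
 * rejects the RE. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```
.Ed
With this helper,
.Ql ^[^abc]$
rejects
.Ql b
and accepts
.Ql d ,
.Ql ^[]a]$
accepts a literal right bracket because it comes first in the list, and
.Ql ^[.]$
accepts only a literal period because the period loses its special
significance inside the brackets.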
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g.\& if the collating sequence includes a
matches the first five characters
Within a bracket expression, a collating element enclosed in
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were
are the members of an equivalence class,
An equivalence class may not\(dd be an endpoint
Within a bracket expression, the name of a
stands for the list of all characters belonging to that
Standard character class names are:
.Bl -column "alnum" "digit" "xdigit" -offset indent
.It Em "alnum digit punct"
.It Em "alpha graph space"
.It Em "blank lower upper"
.It Em "cntrl print xdigit"
These stand for the character classes defined in
A locale may provide others.
A character class may not be used as an endpoint of a range.
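.Pp
The standard character class names can be exercised with a small C sketch.
It assumes the POSIX
.Xr regcomp 3
interface and the C locale;
.Fn ere_matches
is a hypothetical helper, not a library function.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the modern (extended) RE `pattern`
 * matches somewhere in `text`, 0 if it does not, and -1 if regcomp()
 * rejects the RE. */
static int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```
.Ed
For instance,
.Ql ^[[:digit:]]+$
accepts
.Ql 2024
but not
.Ql 20a4 ,
and several class names may share one bracket expression, as in
.Ql ^[[:alpha:][:digit:]]+$ .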
A bracketed expression like
can be used to match a single character that belongs to a character
To do the reverse, matching any character that does not belong to a specific
class, use the negation operator of bracket expressions:
There are two special cases\(dd of bracket expressions:
the bracket expressions
match the null string at the beginning and end of a word respectively.
A word is defined as a sequence of word characters
which is neither preceded nor followed by
A word character is an
character (as defined by
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
In the event that an RE could match more than one substring of a given
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
matches the three middle characters of
.Ql (wee|week)(knights|nights)
matches all ten characters of
the parenthesized subexpression
matches all three characters, and
both the whole RE and the parenthesized
subexpression match the null string.
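.Pp
The earliest-start, longest-match rule for the whole RE can be observed
through the offsets reported by
.Xr regexec 3 .
The following is only a sketch under the assumption of a POSIX-conforming
.Xr regcomp 3 ;
the helper
.Fn ere_span
and the sample inputs are chosen here for illustration.
.Bd -literal -offset indent
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: finds the match of the extended RE `pattern` in
 * `text` and stores its start offset and length.  Returns 0 on a match,
 * nonzero if there is no match or the RE does not compile. */
static int ere_span(const char *pattern, const char *text,
                    size_t *start, size_t *length)
{
    regex_t re;
    regmatch_t whole[1];   /* element 0 describes the whole match */
    int rc;

    assert(start != NULL && length != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 1;
    rc = regexec(&re, text, 1, whole, 0);
    regfree(&re);
    if (rc != 0)
        return 1;
    *start = (size_t)whole[0].rm_so;
    *length = (size_t)(whole[0].rm_eo - whole[0].rm_so);
    return 0;
}
```
.Ed
Called with
.Ql (wee|week)(knights|nights)
and
.Ql weeknights ,
the helper should report a match starting at offset 0 with length 10;
with
.Ql (a*)*
and
.Ql bc ,
it should report a null match of length 0 at offset 0.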
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.)
No particular limit is imposed on the length of REs\(dd.
Programs intended to be portable should not employ REs longer
as an implementation can refuse to accept such REs and remain
regular expressions differ in several respects.
is an ordinary character and there is no equivalent
for its functionality.
are ordinary characters, and their functionality
can be expressed using bounds
in modern REs is equivalent to
The delimiters for bounds are
by themselves ordinary characters.
The parentheses for nested subexpressions are
by themselves ordinary characters.
is an ordinary character except at the beginning of the
RE or\(dd the beginning of a parenthesized subexpression,
is an ordinary character except at the end of the
RE or\(dd the end of a parenthesized subexpression,
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading
Finally, there is one new type of atom, a
followed by a non-zero decimal digit
matches the same sequence of characters
parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
.Sh ENHANCED FEATURES
flag is passed to one of the
variants, additional features are activated.
implementations in scripting languages such as
these additional features may conflict with the
standards in some ways.
Use this with care in situations which require portability
(including to past versions of Mac OS X using the previous
For enhanced basic REs,
remain regular characters, but
have the same special meaning as the unescaped characters do for
extended REs, i.e., one or more matches, zero or one matches and alternation,
For enhanced extended REs,
back references are available.
Additional enhanced features are listed below.
Within a bracket expression, most characters lose their magic.
This also applies to the additional enhanced features, which don't operate
inside a bracket expression.
.Ss Assertions (available for both enhanced basic and enhanced extended REs)
(the assertions that match the null string at the beginning and end of line,
respectively), the following assertions become available:
.Bl -tag -width ".Sy \eB" -offset indent
Matches the null string at the beginning of a word.
This is equivalent to
Matches the null string at the end of a word.
This is equivalent to
Matches the null string at a word boundary (either the beginning or end of
Matches the null string where there is no word boundary.
This is the opposite of
.Ss Shortcuts (available for both enhanced basic and enhanced extended REs)
The following shortcuts can be used to replace more complicated
.Bl -tag -width ".Sy \eD" -offset indent
Matches a digit character.
This is equivalent to
Matches a non-digit character.
This is equivalent to
Matches a space character.
This is equivalent to
Matches a non-space character.
This is equivalent to
Matches a word character.
This is equivalent to
Matches a non-word character.
This is equivalent to
.Ss Literal Sequences (available for both enhanced basic and enhanced extended REs)
Literals are normally just ordinary characters that are matched directly.
Under enhanced mode, certain character sequences are
converted to specific literals.
.Bl -tag -width ".Sy \ea" -offset indent
character (ASCII code 7).
character (ASCII code 27).
character (ASCII code 12).
.Dq new-line/line-feed
character (ASCII code 10).
character (ASCII code 13).
character (ASCII code 9).
Literals can also be specified directly, using their wide character values.
Note that when matching a multibyte character string, the string's bytes
are converted to wide characters before comparing.
This means that a single literal wide character value may match more than
one string byte, depending on the locale's wide character encoding.
.Bl -tag -width ".Sy \ex{ Ns Em x.. Ns Sy \&}" -offset indent
An arbitrary eight-bit value.
sequence represents zero, one or two hexadecimal digits.
is fewer than two hexadecimal digits, and the character following this sequence
happens to be a hexadecimal digit, use the (following) brace form to avoid
.It Sy \ex{ Ns Em x.. Ns Sy \&}
An arbitrary, up to 32-bit value.
sequence is an arbitrary sequence of hexadecimal digits that is long enough
to represent the necessary value.
.Ss Inline Literal Mode (available for both enhanced basic and enhanced extended REs)
sequence causes literal
ends literal mode, and returns to normal regular expression processing.
This is similar to specifying the
except that rather than applying to the whole RE string, it only applies to
Note that it is not possible to have a
in the middle of an inline literal range, as that would terminate literal mode
.Ss Minimal Repetitions (available for enhanced extended REs only)
By default, the repetition operators,
they try to match as many times as possible.
In enhanced mode, appending a
to a repetition operator makes it minimal (or
it tries to match the fewest number of times (including zero times, as
For example, against the string
would match the entire string,
would match the null string at the beginning of the line
(matches zero times).
Likewise, against the string
would also match the entire string,
would only match the first two characters.
will make the regular
repetition operators ungreedy by default.
makes them greedy again.
Note that minimal repetitions are not specified by an official
standard, so there may be differences between different implementations.
In the current implementation, minimal repetitions have a high precedence,
and can cause other standards requirements to be violated.
For instance, on the string
will only match the first four characters, violating the rules that the longest
possible match is made and the longest subexpressions are matched.
forces the entire string to be matched.
.Ss Non-capturing Parenthesized Subexpressions (available for enhanced extended REs only)
Normally, the match offsets to parenthesized subexpressions are
is not specified, and
is large enough to encompass the parenthesized subexpression in question).
In enhanced mode, if the first two characters following the left parenthesis
grouping of the remaining contents is done, but the corresponding offsets are
For example, against the string
would have two subexpression matches in
there would only be one subexpression match, that of
would again match the entire string, but only
.Ss Inline Options (available for enhanced extended REs only)
Like the inline literal mode mentioned above, other options can be switched
on and off for part of an RE.
.Ql (? Ns Em o.. Ns \&)
will turn on the options specified in
(one or more option characters; see below), while
.Ql (?- Ns Em o.. Ns \&)
will turn off the specified options, and
.Ql (? Ns Em o1.. Ns \&- Ns Em o2.. Ns \&)
will turn on the first set of options, and turn off the second set.
The available options are:
.Bl -tag -width ".Sy \&U" -offset indent
Turning on this option will ignore case during matching, while turning it off
will restore case-sensitive matching.
this option can be used to turn that off.
Turn on or off special handling of the newline character.
this option can be used to turn that off.
Turning on this option will make ungreedy repetitions the default, while
turning it off will make greedy repetitions the default.
this option can be used to turn that off.
The scope of the option change begins immediately following the right
but up to the end of the enclosing subexpression (if any).
Thus, for example, given the RE
portion matches case sensitively,
matches case insensitively, and
matches case sensitively again (since it is outside the scope of the
subexpression in which the inline option was specified).
The inline options syntax can be combined with the non-capturing parenthesized
subexpression to limit the option scope to just that of the subexpression.
is similar to the previous example, except for the parenthesized subexpression
in the previous example.
.Ss Inline Comments (available for enhanced extended REs only)
.Ql (?# Ns Em comment Ns \&)
can be used to embed comments within an RE.
cannot contain a right parenthesis.
Also note that while syntactically, option characters can be added before
character, they will be ignored.
.%T Regular Expression Notation
Having two kinds of REs is a botch.
is an ordinary character in
the absence of an unmatched
this was an unintentional result of a wording error,
and change is likely.
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
.Ql a\e(\e(b\e)*\e2\e)*d
specification of case-independent matching is vague.
.Dq one case implies all cases
definition given above
is the current consensus among implementors as to the right interpretation.
The bracket syntax for word boundaries is incredibly ugly.