RE_FORMAT(7) Device and Network Interfaces RE_FORMAT(7)

NAME
re_format - POSIX 1003.2 regular expressions

DESCRIPTION
Regular expressions (``RE''s), as defined in POSIX 1003.2,
come in two forms: modern REs (roughly those of egrep;
1003.2 calls these ``extended'' REs) and obsolete REs
(roughly those of ed; 1003.2 ``basic'' REs). Obsolete REs
mostly exist for backward compatibility in some old
programs; they will be discussed at the end. 1003.2 leaves
some aspects of RE syntax and semantics open; `†' marks
decisions on these aspects that may not be fully portable to
other 1003.2 implementations.

A (modern) RE is one† or more non-empty† branches, separated
by `|'. It matches anything that matches one of the
branches.

A branch is one† or more pieces, concatenated. It matches a
match for the first, followed by a match for the second,
etc.

A piece is an atom possibly followed by a single† `*', `+',
`?', or bound. An atom followed by `*' matches a sequence
of 0 or more matches of the atom. An atom followed by `+'
matches a sequence of 1 or more matches of the atom. An
atom followed by `?' matches a sequence of 0 or 1 matches of
the atom.

A bound is `{' followed by an unsigned decimal integer,
possibly followed by `,' possibly followed by another
unsigned decimal integer, always followed by `}'. The
integers must lie between 0 and RE_DUP_MAX (255†) inclusive,
and if there are two of them, the first may not exceed the
second. An atom followed by a bound containing one integer i
and no comma matches a sequence of exactly i matches of the
atom. An atom followed by a bound containing one integer i
and a comma matches a sequence of i or more matches of the
atom. An atom followed by a bound containing two integers i
and j matches a sequence of i through j (inclusive) matches
of the atom.
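
The rules for branches, pieces, and bounds can be tried out
with the POSIX regcomp(3) and regexec(3) interface. The
following is only a minimal sketch: the helper name
ere_matches is hypothetical (it is not part of any library),
and the `^' and `$' anchors used in the sample patterns are
the atoms described in the next paragraph.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a library function.
 * Returns 1 if the extended ("modern") RE `pattern` matches somewhere
 * in `text`, 0 if it does not, and -1 if the pattern fails to compile.
 */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_EXTENDED selects modern REs; REG_NOSUB skips span reporting. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Under these assumptions, ere_matches("^a{2,}$", "aaaa") should
report a match, ere_matches("^a{1,3}$", "aaaa") should not, and
a bound whose first integer exceeds the second, such as
`a{3,1}', should be rejected at compile time.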

An atom is a regular expression enclosed in `()' (matching a
match for the regular expression), an empty set of `()'
(matching the null string)†, a bracket expression (see
below), `.' (matching any single character), `^' (matching
the null string at the beginning of a line), `$' (matching
the null string at the end of a line), a `\' followed by one
of the characters `^.[$()|*+?{\' (matching that character
taken as an ordinary character), a `\' followed by any other
character† (matching that character taken as an ordinary
character, as if the `\' had not been present†), or a single
character with no other significance (matching that
character). A `{' followed by a character other than a digit
is an ordinary character, not the beginning of a bound†. It
is illegal to end an RE with `\'.
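
Because a `\' before any of `^.[$()|*+?{\' makes that
character ordinary, arbitrary text can be searched for
literally by escaping it first. The sketch below assumes two
hypothetical helpers, ere_escape and ere_literal_found, and a
fixed 256-byte pattern buffer chosen only for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy `src` into `dst`, putting a backslash in
 * front of every character that is special in an extended RE.
 * Returns 0 on success, -1 if `dst` (of `dstsize` bytes) is too small.
 */
int ere_escape(const char *src, char *dst, size_t dstsize)
{
    static const char special[] = "^.[$()|*+?{\\";
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++) {
        size_t need = strchr(special, *src) != NULL ? 2 : 1;

        if (n + need + 1 > dstsize)     /* +1 keeps room for the NUL */
            return -1;
        if (need == 2)
            dst[n++] = '\\';
        dst[n++] = *src;
    }
    if (n + 1 > dstsize)
        return -1;
    dst[n] = '\0';
    return 0;
}

/*
 * Hypothetical helper: returns 1 if `needle` occurs literally in
 * `haystack`, 0 if not, -1 on error (needle too long, compile failure).
 */
int ere_literal_found(const char *needle, const char *haystack)
{
    char pattern[256];                  /* illustrative size limit */
    regex_t re;
    int rc;

    if (ere_escape(needle, pattern, sizeof pattern) != 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, haystack, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this sketch, the needle `a.b' should be found in `xa.by'
but not in `xaXby', since the escaped `.' no longer matches
any single character.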

A bracket expression is a list of characters enclosed in
`[]'. It normally matches any single character from the
list (but see below). If the list begins with `^', it
matches any single character (but see below) not from the
rest of the list. If two characters in the list are
separated by `-', this is shorthand for the full range of
characters between those two (inclusive) in the collating
sequence, e.g. `[0-9]' in ASCII matches any decimal digit.
It is illegal† for two ranges to share an endpoint, e.g.
`a-c-e'. Ranges are very collating-sequence-dependent, and
portable programs should avoid relying on them.

To include a literal `]' in the list, make it the first
character (following a possible `^'). To include a literal
`-', make it the first or last character, or the second
endpoint of a range. To use a literal `-' as the first
endpoint of a range, enclose it in `[.' and `.]' to make it a
collating element (see below). With the exception of these
and some combinations using `[' (see next paragraphs), all
other special characters, including `\', lose their special
significance within a bracket expression.
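
The placement rules for `]', `-', and `\' inside a bracket
expression are easy to get wrong, so a small check can help.
This is a sketch only: ere_matches_whole is a hypothetical
helper that wraps the pattern as `^(pattern)$' so that the
whole text must match, and the 256-byte buffer is an
arbitrary illustrative limit.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if the extended RE `pattern` matches
 * all of `text`, 0 if it does not, and -1 if the pattern is too long
 * for the local buffer or fails to compile.
 */
int ere_matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);

    /* Anchor both ends so partial matches do not count. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, `[]x]' should match a lone `]' because the `]'
comes first in the list, `[a-]' should match a lone `-', and
`[\.]' should match a lone backslash because `\' is not
special inside the brackets.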

Within a bracket expression, a collating element (a
character, a multi-character sequence that collates as if it
were a single character, or a collating-sequence name for
either) enclosed in `[.' and `.]' stands for the sequence of
characters of that collating element. The sequence is a
single element of the bracket expression's list. A bracket
expression containing a multi-character collating element
can thus match more than one character, e.g. if the
collating sequence includes a `ch' collating element, then
the RE `[[.ch.]]*c' matches the first five characters of
`chchcc'.

Within a bracket expression, a collating element enclosed in
`[=' and `=]' is an equivalence class, standing for the
sequences of characters of all collating elements equivalent
to that one, including itself. (If there are no other
equivalent collating elements, the treatment is as if the
enclosing delimiters were `[.' and `.]'.) For example, if o
and ô are the members of an equivalence class, then
`[[=o=]]', `[[=ô=]]', and `[oô]' are all synonymous. An
equivalence class may not† be an endpoint of a range.

Within a bracket expression, the name of a character class
enclosed in `[:' and `:]' stands for the list of all
characters belonging to that class. Standard character class
names are:

     alnum    digit    punct
     alpha    graph    space
     blank    lower    upper
     cntrl    print    xdigit

These stand for the character classes defined in ctype(3).
A locale may provide others. A character class may not be
used as an endpoint of a range.
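
Character classes avoid the collating-sequence dependence of
ranges. As a rough illustration, the sketch below checks
identifier-like and hexadecimal strings. The function names
is_identifier and is_hex_number are hypothetical, and the
results assume the default "C" locale, since class membership
is locale-dependent.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Returns 1 if extended RE `pattern` matches `text`, 0 if not,
 * -1 on a compile error. The patterns below carry their own anchors. */
static int class_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Hypothetical: a letter or underscore, then letters, digits, or
 * underscores. */
int is_identifier(const char *s)
{
    return class_match("^[[:alpha:]_][[:alnum:]_]*$", s);
}

/* Hypothetical: one or more hexadecimal digits. */
int is_hex_number(const char *s)
{
    return class_match("^[[:xdigit:]]+$", s);
}
```

Note that the class name sits inside its own `[:' and `:]'
within the enclosing bracket expression, so `[[:alpha:]_]'
is a list of two elements: the alpha class and an underscore.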

There are two special cases† of bracket expressions: the
bracket expressions `[[:<:]]' and `[[:>:]]' match the null
string at the beginning and end of a word respectively. A
word is defined as a sequence of word characters which is
neither preceded nor followed by word characters. A word
character is an alnum character (as defined by ctype(3)) or
an underscore. This is an extension, compatible with but
not specified by POSIX 1003.2, and should be used with
caution in software intended to be portable to other systems.

In the event that an RE could match more than one substring
of a given string, the RE matches the one starting earliest
in the string. If the RE could match more than one
substring starting at that point, it matches the longest.
Subexpressions also match the longest possible substrings,
subject to the constraint that the whole match be as long as
possible, with subexpressions starting earlier in the RE
taking priority over ones starting later. Note that
higher-level subexpressions thus take priority over their
lower-level component subexpressions.

Match lengths are measured in characters, not collating
elements. A null string is considered longer than no match
at all. For example, `bb*' matches the three middle
characters of `abbbc', `(wee|week)(knights|nights)' matches
all ten characters of `weeknights', when `(.*).*' is matched
against `abc' the parenthesized subexpression matches all
three characters, and when `(a*)*' is matched against `bc'
both the whole RE and the parenthesized subexpression match
the null string.
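
The examples in the preceding paragraph can be checked by
asking regexec(3) for the span of the overall match. The
helper below, ere_span, is a hypothetical name used only for
this sketch. It reports the whole-match span and does not
look at subexpression spans.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: find the overall match of extended RE `pattern`
 * in `text`. On a match, stores the start offset and the offset just
 * past the end in *start and *end and returns 1. Returns 0 if there is
 * no match and -1 if the pattern fails to compile.
 */
int ere_span(const char *pattern, const char *text, int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);   /* m[0] is the whole match */
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (int)m[0].rm_so;
    *end = (int)m[0].rm_eo;
    return 1;
}
```

For `bb*' against `abbbc' the span should run from offset 1 to
offset 4, and for `(a*)*' against `bc' it should be the empty
span at offset 0, which still counts as a match.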

If case-independent matching is specified, the effect is
much as if all case distinctions had vanished from the
alphabet. When an alphabetic that exists in multiple cases
appears as an ordinary character outside a bracket
expression, it is effectively transformed into a bracket
expression containing both cases, e.g. `x' becomes `[xX]'.
When it appears inside a bracket expression, all case
counterparts of it are added to the bracket expression, so
that (e.g.) `[x]' becomes `[xX]' and `[^x]' becomes `[^xX]'.
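
With regcomp(3), case-independent matching is requested with
the REG_ICASE flag. The sketch below uses a hypothetical
helper, ere_matches_icase, to show the `[^x]' case in
particular, where the negated list should exclude both `x'
and `X'.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: match extended RE `pattern` against `text`,
 * ignoring case when `icase` is non-zero. Returns 1 on a match,
 * 0 on no match, -1 if the pattern fails to compile.
 */
int ere_matches_icase(const char *pattern, const char *text, int icase)
{
    regex_t re;
    int cflags = REG_EXTENDED | REG_NOSUB;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (icase)
        cflags |= REG_ICASE;            /* "one case implies all cases" */
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Under REG_ICASE, `^[^x]$' should reject `X' as well as `x',
while still accepting an unrelated character such as `y'.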

No particular limit is imposed on the length of REs†.
Programs intended to be portable should not employ REs longer
than 256 bytes, as an implementation can refuse to accept
such REs and remain POSIX-compliant.

Obsolete (``basic'') regular expressions differ in several
respects. `|', `+', and `?' are ordinary characters and
there is no equivalent for their functionality. The
delimiters for bounds are `\{' and `\}', with `{' and `}' by
themselves ordinary characters. The parentheses for nested
subexpressions are `\(' and `\)', with `(' and `)' by
themselves ordinary characters. `^' is an ordinary character
except at the beginning of the RE or† the beginning of a
parenthesized subexpression, `$' is an ordinary character
except at the end of the RE or† the end of a parenthesized
subexpression, and `*' is an ordinary character if it
appears at the beginning of the RE or the beginning of a
parenthesized subexpression (after a possible leading `^').
Finally, there is one new type of atom, a back reference:
`\' followed by a non-zero decimal digit d matches the same
sequence of characters matched by the dth parenthesized
subexpression (numbering subexpressions by the positions of
their opening parentheses, left to right), so that (e.g.)
`\([bc]\)\1' matches `bb' or `cc' but not `bc'.
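
With regcomp(3), obsolete REs are what you get when
REG_EXTENDED is left out of the compile flags. The following
sketch, built around a hypothetical helper named bre_matches,
exercises the back reference example above together with the
characters that are ordinary in this form.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the obsolete ("basic") RE
 * `pattern` matches somewhere in `text`, 0 if it does not, and -1 if
 * the pattern fails to compile.
 */
int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* No REG_EXTENDED: the pattern is read as an obsolete RE. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

In C source each `\' of the RE must itself be doubled, so the
back reference example is written "^\\([bc]\\)\\1$" when the
whole text is to be matched.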

SEE ALSO
POSIX 1003.2, section 2.8 (Regular Expression Notation).

BUGS
Having two kinds of REs is a botch.

The current 1003.2 spec says that `)' is an ordinary
character in the absence of an unmatched `('; this was an
unintentional result of a wording error, and change is
likely.

Back references are a dreadful botch, posing major problems
for efficient implementations. They are also somewhat
vaguely defined (does `a\(\(b\)*\2\)*d' match `abbbd'?).

1003.2's specification of case-independent matching is
vague. The ``one case implies all cases'' definition given
above is current consensus among implementors as to the
right interpretation.

The syntax for word boundaries is incredibly ugly.

SunOS 5.5 Last change: March 20, 1994