.\" Copyright (c) 1991, 1993
.\"     The Regents of the University of California.  All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"     This product includes software developed by the University of
.\"     California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     @(#)regexp.3    8.1 (Berkeley) 6/4/93
.\" $FreeBSD: src/lib/libcompat/regexp/regexp.3,v 1.6.2.2 2001/12/17 10:08:29 ru Exp $
.\" $DragonFly: src/lib/libcompat/regexp/regexp.3,v 1.3 2006/04/08 08:17:06 swildner Exp $
.\"
.Dd June 4, 1993
.Dt REGEXP 3
.Os
.Sh NAME
.Nm regcomp ,
.Nm regexec ,
.Nm regsub ,
.Nm regerror
.Nd regular expression handlers
.Sh LIBRARY
.Lb libcompat
.Sh SYNOPSIS
.In regexp.h
.Ft regexp *
.Fn regcomp "const char *exp"
.Ft int
.Fn regexec "const regexp *prog" "const char *string"
.Ft void
.Fn regsub "const regexp *prog" "const char *source" "char *dest"
.Sh DESCRIPTION
.Bf Sy
This interface is made obsolete by
.Xr regex 3 .
.Ef
.Pp
The
.Fn regcomp ,
.Fn regexec ,
.Fn regsub ,
and
.Fn regerror
functions
implement
.Xr egrep 1 Ns -style
regular expressions and supporting facilities.
.Pp
The
.Fn regcomp
function
compiles a regular expression into a structure of type
.Vt regexp ,
and returns a pointer to it.
The space has been allocated using
.Xr malloc 3
and may be released by
.Xr free 3 .
.Pp
The
.Fn regexec
function
matches a
.Dv NUL Ns -terminated
.Fa string
against the compiled regular expression
in
.Fa prog .
It returns 1 for success and 0 for failure, and adjusts the contents of
.Fa prog Ns 's
.Em startp
and
.Em endp
(see below) accordingly.
.Pp
The members of a
.Vt regexp
structure include at least the following (not necessarily in order):
.Bd -literal -offset indent
char *startp[NSUBEXP];
char *endp[NSUBEXP];
.Ed
.Pp
where
.Dv NSUBEXP
is defined (as 10) in the header file.
Once a successful
.Fn regexec
has been done using the
.Vt regexp ,
each
.Em startp Ns - Em endp
pair describes one substring
within the
.Fa string ,
with the
.Em startp
pointing to the first character of the substring and
the
.Em endp
pointing to the first character following the substring.
The 0th substring is the substring of
.Fa string
that matched the whole
regular expression.
The others are those substrings that matched parenthesized expressions
within the regular expression, with parenthesized expressions numbered
in left-to-right order of their opening parentheses.
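.Pp
The pointer arithmetic this implies can be sketched as follows.
This is a minimal illustration rather than library code:
.Vt submatches
is a hypothetical stand-in for the two documented members of
.Vt regexp ,
.Fn copy_submatch
is an invented helper name,
and the check for
.Dv NULL
slots assumes that unused pairs are left as null pointers,
which the text above does not state.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in for the documented members of regexp. */
typedef struct {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
} submatches;

/*
 * Copy substring n into buf, which holds size bytes, and terminate it.
 * Returns the substring length, or -1 if n is out of range, the slot
 * is unset (assumed to mean NULL), or buf is too small.
 */
static long
copy_submatch(const submatches *m, int n, char *buf, size_t size)
{
    size_t len;

    if (n < 0 || n >= NSUBEXP)
        return (-1);
    if (m->startp[n] == NULL || m->endp[n] == NULL)
        return (-1);
    /* endp points one past the last character of the substring. */
    len = (size_t)(m->endp[n] - m->startp[n]);
    if (len >= size)
        return (-1);
    memcpy(buf, m->startp[n], len);
    buf[len] = 0;
    return ((long)len);
}
```
.Ed
.Pp
With the slots filled as a successful
.Fn regexec
would fill them for
.Sq Li ([a-z]+)=([a-z]+)
matched against
.Sq Li key=value ,
.Li copy_submatch(&m, 2, buf, sizeof buf)
would store
.Sq Li value
in
.Fa buf
and return 5.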
.Pp
The
.Fn regsub
function
copies
.Fa source
to
.Fa dest ,
making substitutions according to the
most recent
.Fn regexec
performed using
.Fa prog .
Each instance of `&' in
.Fa source
is replaced by the substring
indicated by
.Em startp Ns Bq 0
and
.Em endp Ns Bq 0 .
Each instance of
.Sq \e Ns Em n ,
where
.Em n
is a digit, is replaced by
the substring indicated by
.Em startp Ns Bq Em n
and
.Em endp Ns Bq Em n .
To get a literal `&' or
.Sq \e Ns Em n
into
.Fa dest ,
prefix it with `\e';
to get a literal `\e' preceding `&' or
.Sq \e Ns Em n ,
prefix it with
another `\e'.
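.Pp
The substitution rules above can be sketched in a few lines of C.
This is an illustration of the documented behaviour, not the library source:
.Fn expand_template
and
.Vt submatches
are hypothetical names, and the explicit size argument is an addition that
.Fn regsub
itself does not have.
The sketch also assumes
.Tn ASCII
(the escape character is written as its code, 92)
and assumes that an unset substring expands to nothing.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <string.h>

#define NSUBEXP   10
#define BACKSLASH 92    /* ASCII code of the escape character */

/* Hypothetical stand-in for the documented members of regexp. */
typedef struct {
    const char *startp[NSUBEXP];
    const char *endp[NSUBEXP];
} submatches;

/*
 * Expand source into dest, which holds size bytes including the
 * terminator.  Returns 0 on success, or -1 if dest is too small
 * (its contents are then unspecified).
 */
static int
expand_template(const submatches *m, const char *source,
    char *dest, size_t size)
{
    size_t used = 0;

    if (size == 0)
        return (-1);
    while (*source != 0) {
        int c = (unsigned char)*source++;
        int no = -1;

        if (c == '&')
            no = 0;
        else if (c == BACKSLASH && *source >= '0' && *source <= '9')
            no = *source++ - '0';

        if (no < 0) {
            /* An escaped backslash or ampersand stands for itself. */
            if (c == BACKSLASH &&
                (*source == BACKSLASH || *source == '&'))
                c = (unsigned char)*source++;
            if (used + 1 >= size)
                return (-1);
            dest[used++] = (char)c;
        } else if (m->startp[no] != NULL && m->endp[no] != NULL) {
            size_t len = (size_t)(m->endp[no] - m->startp[no]);

            if (used + len >= size)
                return (-1);
            memcpy(dest + used, m->startp[no], len);
            used += len;
        }
    }
    dest[used] = 0;
    return (0);
}
```
.Ed
.Pp
Given the substrings of
.Sq Li key=value
matched by
.Sq Li ([a-z]+)=([a-z]+) ,
the source
.Sq Li \e2=\e1
would expand to
.Sq Li value=key ,
and
.Sq Li \e&
to a literal `&'.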
.Pp
The
.Fn regerror
function
is called whenever an error is detected in
.Fn regcomp ,
.Fn regexec ,
or
.Fn regsub .
The default
.Fn regerror
writes the string
.Fa msg ,
with a suitable indicator of origin,
on the standard
error output
and invokes
.Xr exit 3 .
The
.Fn regerror
function
can be replaced by the user if other actions are desirable.
.Sh REGULAR EXPRESSION SYNTAX
A regular expression is zero or more
.Em branches ,
separated by `|'.
It matches anything that matches one of the branches.
.Pp
A branch is zero or more
.Em pieces ,
concatenated.
It matches a match for the first, followed by a match for the second, etc.
.Pp
A piece is an
.Em atom
possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a match of the atom, or the null string.
.Pp
An atom is a regular expression in parentheses (matching a match for the
regular expression), a
.Em range
(see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of the input string), `$' (matching the null string at the
end of the input string), a `\e' followed by a single character (matching
that character), or a single character with no other significance
(matching that character).
.Pp
A
.Em range
is a sequence of characters enclosed in `[]'.
It normally matches any single character from the sequence.
If the sequence begins with `^',
it matches any single character
.Em not
from the rest of the sequence.
If two characters in the sequence are separated by `\-', this is shorthand
for the full list of
.Tn ASCII
characters between them
(e.g. `[0-9]' matches any decimal digit).
To include a literal `]' in the sequence, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character.
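.Pp
As an illustration of these rules, the sketch below tests one character
against the text found between the brackets.
It is not the library's code:
.Fn range_matches
is a hypothetical name, and the caller is assumed to have already stripped
the enclosing `[]', so a leading `]' is simply an ordinary member here.
A reversed span such as `z-a' is treated as matching nothing,
which is a choice of this sketch rather than documented behaviour.
.Bd -literal -offset indent
```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if c is matched by the range whose text (without the
 * enclosing brackets) is body, and 0 otherwise.
 */
static int
range_matches(const char *body, int c)
{
    size_t i = 0, n = strlen(body);
    int negate = 0, found = 0;

    /* A leading '^' inverts the sense of the whole range. */
    if (n > 0 && body[0] == '^') {
        negate = 1;
        i = 1;
    }
    while (i < n) {
        int lo = (unsigned char)body[i];

        /* "x-y" is a span unless the '-' is the last character. */
        if (i + 2 < n && body[i + 1] == '-') {
            int hi = (unsigned char)body[i + 2];

            if (lo <= c && c <= hi)
                found = 1;
            i += 3;
        } else {
            /* Ordinary member, including a first or last '-'. */
            if (c == lo)
                found = 1;
            i++;
        }
    }
    return (negate ? !found : found);
}
```
.Ed
.Pp
For instance,
.Li range_matches("0-9", '5')
would return 1,
.Li range_matches("^0-9", '5')
would return 0, and both
.Li range_matches("-a", '-')
and
.Li range_matches("a-", '-')
would return 1.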
.Sh AMBIGUITY
If a regular expression could match two different parts of the input string,
it will match the one which begins earliest.
If both begin in the same place but match different lengths, or match
the same length in different ways, life gets messier, as follows.
.Pp
In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for `*', `+', and `?' are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered leftmost-first.
The match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made.
If there is more than one choice, the next will be made in the same manner
(earliest possibility) subject to the decision on the first choice.
And so forth.
.Pp
For example,
.Sq Li (ab|a)b*c
could match
`abc' in one of two ways.
The first choice is between `ab' and `a'; since `ab' is earlier, and does
lead to a successful overall match, it is chosen.
Since the `b' is already spoken for,
the `b*' must match its last possibility, the empty string, since
it must respect the earlier choice.
.Pp
In the particular case where no `|'s are present and there is only one
`*', `+', or `?', the net effect is that the longest possible
match will be chosen.
So
.Sq Li ab* ,
presented with `xabbbby', will match `abbbb'.
Note that if
.Sq Li ab*
is tried against `xabyabbbz', it
will match `ab' just after `x', due to the begins-earliest rule.
(In effect, the decision on where to start the match is the first choice
to be made, hence subsequent choices must respect it even if this leads them
to less-preferred alternatives.)
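.Pp
The two
.Sq Li ab*
cases can be reproduced with a hand-written scan, sketched below.
It hard-codes that one pattern and does not use the library;
.Fn match_a_bstar
is a hypothetical name.
The outer loop makes the begins-earliest decision first, and only then
does the inner loop take the longest run of
.Sq b
available at that position.
.Bd -literal -offset indent
```c
#include <stddef.h>

/*
 * Hand-coded equivalent of the pattern ab*.
 * On a match, stores the offset and length of the matched text
 * and returns 1; returns 0 if s contains no 'a'.
 */
static int
match_a_bstar(const char *s, size_t *start, size_t *len)
{
    size_t i, j;

    for (i = 0; s[i] != 0; i++) {
        if (s[i] != 'a')
            continue;
        /* Start is now fixed; extend over as many 'b's as follow. */
        for (j = i + 1; s[j] == 'b'; j++)
            continue;
        *start = i;
        *len = j - i;
        return (1);
    }
    return (0);
}
```
.Ed
.Pp
For `xabbbby' the scan reports offset 1 and length 5 (`abbbb');
for `xabyabbbz' it reports offset 1 and length 2 (`ab'),
never reaching the longer `abbb' later in the string.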
.Sh RETURN VALUES
The
.Fn regcomp
function
returns
.Dv NULL
for a failure
.Pf ( Fn regerror
permitting),
where failures are syntax errors, exceeding implementation limits,
or applying `+' or `*' to a possibly-null operand.
.Sh SEE ALSO
.Xr ed 1 ,
.Xr egrep 1 ,
.Xr ex 1 ,
.Xr expr 1 ,
.Xr fgrep 1 ,
.Xr grep 1 ,
.Xr regex 3
.Sh HISTORY
Both code and manual page for
.Fn regcomp ,
.Fn regexec ,
.Fn regsub ,
and
.Fn regerror
were written at the University of Toronto
and appeared in
.Bx 4.3 tahoe .
They are intended to be compatible with the Bell V8
.Xr regexp 3 ,
but are not derived from Bell code.
.Sh BUGS
Empty branches and empty regular expressions are not portable to V8.
.Pp
The restriction against
applying `*' or `+' to a possibly-null operand is an artifact of the
simplistic implementation.
.Pp
Does not support
.Xr egrep 1 Ns 's
newline-separated branches;
neither does the V8
.Xr regexp 3 ,
though.
.Pp
Due to emphasis on
compactness and simplicity,
it's not strikingly fast.
It does give special attention to handling simple cases quickly.