.TH REGEX 3 "17 May 1993"
.BY "Henry Spencer"
.de ZR
.\" one other place knows this name:  the SEE ALSO section
.IR regex (7) \\$1
..
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.\".ad
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line, minus any terminating
newline.
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.I pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note in particular that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of their being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), regex(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH     regexec() failed to match
REG_BADPAT      invalid regular expression
REG_ECOLLATE    invalid collating element
REG_ECTYPE      invalid character class
REG_EESCAPE     \e applied to unescapable character
REG_ESUBREG     invalid backreference number
REG_EBRACK      brackets [ ] not balanced
REG_EPAREN      parentheses ( ) not balanced
REG_EBRACE      braces { } not balanced
REG_BADBR       invalid repetition count(s) in { }
REG_ERANGE      invalid character range in [ ]
REG_ESPACE      ran out of memory
REG_BADRPT      ?, *, or + operand invalid
REG_EMPTY       empty (sub)expression
REG_ASSERT      ``can't happen''\(emyou found a bug
REG_INVARG      invalid argument, e.g. negative-length string
.fi
.SH HISTORY
Written by Henry Spencer at University of Toronto,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.