doc/regex.texi

   1 @node Overview
   2 @chapter Overview
   3
   4 A @dfn{regular expression} (or @dfn{regexp}, or @dfn{pattern}) is a text
   5 string that describes some (mathematical) set of strings.  A regexp
   6 @var{r} @dfn{matches} a string @var{s} if @var{s} is in the set of
   7 strings described by @var{r}.
   8
   9 Using the Regex library, you can:
  10
  11 @itemize @bullet
  12
  13 @item
  14 see if a string matches a specified pattern as a whole, and
  15
  16 @item
  17 search within a string for a substring matching a specified pattern.
  18
  19 @end itemize
  20
  21 Some regular expressions match only one string, i.e., the set they
  22 describe has only one member.  For example, the regular expression
  23 @samp{foo} matches the string @samp{foo} and no others.  Other regular
  24 expressions match more than one string, i.e., the set they describe has
  25 more than one member.  For example, the regular expression @samp{f*}
  26 matches the set of strings made up of any number (including zero) of
  27 @samp{f}s.  As you can see, some characters in regular expressions match
  28 themselves (such as @samp{f}) and some don't (such as @samp{*}); the
  29 ones that don't match themselves instead let you specify patterns that
  30 describe many different strings.
  31
  32 To either match or search for a regular expression with the Regex
  33 library functions, you must first compile it with a Regex pattern
  34 compiling function.  A @dfn{compiled pattern} is a regular expression
  35 converted to the internal format used by the library functions.  Once
  36 you've compiled a pattern, you can use it for matching or searching any
  37 number of times.
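
As a rough sketch of this compile-once, use-many-times pattern, the
following C fragment assumes the GNU entry points
@code{re_compile_pattern}, @code{re_match}, @code{re_search} and
@code{regfree} as declared by the GNU C Library's @file{regex.h} (which
requires @code{_GNU_SOURCE}), and it assumes the default syntax is still
in effect.  The helper @code{count_matches} is hypothetical and exists
only for illustration.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU functions */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN once, then reuse the compiled pattern on each of the
   N strings in TEXTS.  If WHOLE is nonzero, count the strings that match
   as a whole; otherwise count the strings that contain a match.
   Returns -1 if PATTERN is invalid.  */
int
count_matches (const char *pattern, const char *const *texts, size_t n,
               int whole)
{
  struct re_pattern_buffer buf;
  int count = 0;

  assert (pattern != NULL && texts != NULL);
  memset (&buf, 0, sizeof buf);   /* let the library allocate the buffer */
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -1;                    /* a non-null result is an error message */

  for (size_t i = 0; i < n; i++)
    {
      regoff_t len = (regoff_t) strlen (texts[i]);
      if (whole)
        /* re_match returns the length matched at the start position.  */
        count += re_match (&buf, texts[i], len, 0, NULL) == len;
      else
        /* re_search returns the index where a match starts, or -1.  */
        count += re_search (&buf, texts[i], len, 0, len, NULL) >= 0;
    }
  regfree (&buf);
  return count;
}
```

With the pattern @samp{fo*} and the strings @samp{f}, @samp{foo},
@samp{xfoo} and @samp{bar}, the first two match as a whole, while
@samp{xfoo} merely contains a match.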
  38
   39 To use the Regex library, include @file{regex.h}.
  40 @pindex regex.h
  41 Regex provides three groups of functions with which you can operate on
  42 regular expressions.  One group---the GNU group---is more
  43 powerful but not completely compatible with the other two, namely the
  44 POSIX and Berkeley Unix groups; its interface was designed
  45 specifically for GNU.
  46
  47 We wrote this chapter with programmers in mind, not users of
  48 programs---such as Emacs---that use Regex.  We describe the Regex
  49 library in its entirety, not how to write regular expressions that a
  50 particular program understands.
  51
  52
  53 @node Regular Expression Syntax
  54 @chapter Regular Expression Syntax
  55
  56 @cindex regular expressions, syntax of
  57 @cindex syntax of regular expressions
  58
  59 @dfn{Characters} are things you can type.  @dfn{Operators} are things in
  60 a regular expression that match one or more characters.  You compose
  61 regular expressions from operators, which in turn you specify using one
  62 or more characters.
  63
  64 Most characters represent what we call the match-self operator, i.e.,
  65 they match themselves; we call these characters @dfn{ordinary}.  Other
  66 characters represent either all or parts of fancier operators; e.g.,
  67 @samp{.} represents what we call the match-any-character operator
  68 (which, no surprise, matches (almost) any character); we call these
  69 characters @dfn{special}.  Two different things determine what
  70 characters represent what operators:
  71
  72 @enumerate
  73 @item
  74 the regular expression syntax your program has told the Regex library to
  75 recognize, and
  76
  77 @item
  78 the context of the character in the regular expression.
  79 @end enumerate
  80
  81 In the following sections, we describe these things in more detail.
  82
  83 @menu
  84 * Syntax Bits::
  85 * Predefined Syntaxes::
  86 * Collating Elements vs. Characters::
  87 * The Backslash Character::
  88 @end menu
  89
  90
  91 @node Syntax Bits
  92 @section Syntax Bits
  93
  94 @cindex syntax bits
  95
  96 In any particular syntax for regular expressions, some characters are
  97 always special, others are sometimes special, and others are never
  98 special.  The particular syntax that Regex recognizes for a given
  99 regular expression depends on the current syntax (as set by
 100 @code{re_set_syntax}) when the pattern buffer of that regular expression
 101 was compiled.
 102
 103 You get a pattern buffer by compiling a regular expression.  @xref{GNU
 104 Pattern Buffers}, for more information on pattern buffers.  @xref{GNU
 105 Regular Expression Compiling}, and @ref{BSD Regular Expression
 106 Compiling}, for more information on compiling.
 107
 108 Regex considers the current syntax to be a collection of bits; we refer
 109 to these bits as @dfn{syntax bits}.  In most cases, they affect what
 110 characters represent what operators.  We describe the meanings of the
 111 operators to which we refer in @ref{Common Operators} and @ref{GNU
 112 Operators}.
 113
 114 For reference, here is the complete list of syntax bits, in alphabetical
 115 order:
 116
 117 @table @code
 118
  119 @cnindex RE_BACKSLASH_ESCAPE_IN_LISTS
 120 @item RE_BACKSLASH_ESCAPE_IN_LISTS
 121 If this bit is set, then @samp{\} inside a list (@pxref{List Operators})
 122 quotes (makes ordinary, if it's special) the following character; if
 123 this bit isn't set, then @samp{\} is an ordinary character inside lists.
 124 (@xref{The Backslash Character}, for what @samp{\} does outside of lists.)
 125
 126 @cnindex RE_BK_PLUS_QM
 127 @item RE_BK_PLUS_QM
 128 If this bit is set, then @samp{\+} represents the match-one-or-more
  129 operator and @samp{\?} represents the match-zero-or-one operator; if
 130 this bit isn't set, then @samp{+} represents the match-one-or-more
 131 operator and @samp{?} represents the match-zero-or-one operator.  This
 132 bit is irrelevant if @code{RE_LIMITED_OPS} is set.
 133
 134 @cnindex RE_CHAR_CLASSES
 135 @item RE_CHAR_CLASSES
 136 If this bit is set, then you can use character classes in lists; if this
 137 bit isn't set, then you can't.
 138
 139 @cnindex RE_CONTEXT_INDEP_ANCHORS
 140 @item RE_CONTEXT_INDEP_ANCHORS
 141 If this bit is set, then @samp{^} and @samp{$} are special anywhere outside
 142 a list; if this bit isn't set, then these characters are special only in
 143 certain contexts.  @xref{Match-beginning-of-line Operator}, and
 144 @ref{Match-end-of-line Operator}.
 145
 146 @cnindex RE_CONTEXT_INDEP_OPS
 147 @item RE_CONTEXT_INDEP_OPS
 148 If this bit is set, then certain characters are special anywhere outside
 149 a list; if this bit isn't set, then those characters are special only in
 150 some contexts and are ordinary elsewhere.  Specifically, if this bit
 151 isn't set then @samp{*}, and (if the syntax bit @code{RE_LIMITED_OPS}
 152 isn't set) @samp{+} and @samp{?} (or @samp{\+} and @samp{\?}, depending
 153 on the syntax bit @code{RE_BK_PLUS_QM}) represent repetition operators
 154 only if they're not first in a regular expression or just after an
 155 open-group or alternation operator.  The same holds for @samp{@{} (or
 156 @samp{\@{}, depending on the syntax bit @code{RE_NO_BK_BRACES}) if
 157 it is the beginning of a valid interval and the syntax bit
 158 @code{RE_INTERVALS} is set.
 159
 160 @cnindex RE_CONTEXT_INVALID_DUP
 161 @item RE_CONTEXT_INVALID_DUP
 162 If this bit is set, then an open-interval operator cannot occur at the
 163 start of a regular expression, or immediately after an alternation,
 164 open-group or close-interval operator.
 165
 166 @cnindex RE_CONTEXT_INVALID_OPS
 167 @item RE_CONTEXT_INVALID_OPS
 168 If this bit is set, then repetition and alternation operators can't be
 169 in certain positions within a regular expression.  Specifically, the
 170 regular expression is invalid if it has:
 171
 172 @itemize @bullet
 173
 174 @item
 175 a repetition operator first in the regular expression or just after a
 176 match-beginning-of-line, open-group, or alternation operator; or
 177
 178 @item
 179 an alternation operator first or last in the regular expression, just
 180 before a match-end-of-line operator, or just after an alternation or
 181 open-group operator.
 182
 183 @end itemize
 184
 185 If this bit isn't set, then you can put the characters representing the
  186 repetition and alternation operators anywhere in a regular expression.
 187 Whether or not they will in fact be operators in certain positions
 188 depends on other syntax bits.
 189
 190 @cnindex RE_DEBUG
 191 @item RE_DEBUG
 192 If this bit is set, and the regex library was compiled with
 193 @code{-DDEBUG}, then internal debugging is turned on; if unset, then
 194 it is turned off.
 195
 196 @cnindex RE_DOT_NEWLINE
 197 @item RE_DOT_NEWLINE
 198 If this bit is set, then the match-any-character operator matches
 199 a newline; if this bit isn't set, then it doesn't.
 200
 201 @cnindex RE_DOT_NOT_NULL
 202 @item RE_DOT_NOT_NULL
 203 If this bit is set, then the match-any-character operator doesn't match
 204 a null character; if this bit isn't set, then it does.
 205
 206 @cnindex RE_HAT_LISTS_NOT_NEWLINE
 207 @item RE_HAT_LISTS_NOT_NEWLINE
 208 If this bit is set, nonmatching lists @samp{[^...]} do not match
 209 newline; if not set, they do.
 210
 211 @cnindex RE_ICASE
 212 @item RE_ICASE
 213 If this bit is set, then ignore case when matching; otherwise, case is
 214 significant.
 215
 216 @cnindex RE_INTERVALS
 217 @item RE_INTERVALS
 218 If this bit is set, then Regex recognizes interval operators; if this bit
 219 isn't set, then it doesn't.
 220
 221 @cnindex RE_INVALID_INTERVAL_ORD
 222 @item RE_INVALID_INTERVAL_ORD
 223 If this bit is set, a syntactically invalid interval is treated as a
 224 string of ordinary characters.  For example, the extended regular
 225 expression @samp{a@{1} is treated as @samp{a\@{1}.
 226
 227 @cnindex RE_LIMITED_OPS
 228 @item RE_LIMITED_OPS
 229 If this bit is set, then Regex doesn't recognize the match-one-or-more,
 230 match-zero-or-one or alternation operators; if this bit isn't set, then
 231 it does.
 232
 233 @cnindex RE_NEWLINE_ALT
 234 @item RE_NEWLINE_ALT
 235 If this bit is set, then newline represents the alternation operator; if
 236 this bit isn't set, then newline is ordinary.
 237
 238 @cnindex RE_NO_BK_BRACES
 239 @item RE_NO_BK_BRACES
 240 If this bit is set, then @samp{@{} represents the open-interval operator
 241 and @samp{@}} represents the close-interval operator; if this bit isn't
 242 set, then @samp{\@{} represents the open-interval operator and
 243 @samp{\@}} represents the close-interval operator.  This bit is relevant
 244 only if @code{RE_INTERVALS} is set.
 245
 246 @cnindex RE_NO_BK_PARENS
 247 @item RE_NO_BK_PARENS
 248 If this bit is set, then @samp{(} represents the open-group operator and
 249 @samp{)} represents the close-group operator; if this bit isn't set, then
 250 @samp{\(} represents the open-group operator and @samp{\)} represents
 251 the close-group operator.
 252
 253 @cnindex RE_NO_BK_REFS
 254 @item RE_NO_BK_REFS
 255 If this bit is set, then Regex doesn't recognize @samp{\}@var{digit} as
 256 the back-reference operator; if this bit isn't set, then it does.
 257
 258 @cnindex RE_NO_BK_VBAR
 259 @item RE_NO_BK_VBAR
 260 If this bit is set, then @samp{|} represents the alternation operator;
 261 if this bit isn't set, then @samp{\|} represents the alternation
 262 operator.  This bit is irrelevant if @code{RE_LIMITED_OPS} is set.
 263
 264 @cnindex RE_NO_EMPTY_RANGES
 265 @item RE_NO_EMPTY_RANGES
 266 If this bit is set, then a regular expression with a range whose ending
 267 point collates lower than its starting point is invalid; if this bit
 268 isn't set, then Regex considers such a range to be empty.
 269
 270 @cnindex RE_NO_GNU_OPS
 271 @item RE_NO_GNU_OPS
 272 If this bit is set, GNU regex operators are not recognized; otherwise,
 273 they are.
 274
 275 @cnindex RE_NO_POSIX_BACKTRACKING
 276 @item RE_NO_POSIX_BACKTRACKING
 277 If this bit is set, succeed as soon as we match the whole pattern,
 278 without further backtracking.  This means that a match may not be
  279 the leftmost longest; @pxref{What Gets Matched?}, for what this means.
 280
 281 @cnindex RE_NO_SUB
 282 @item RE_NO_SUB
 283 If this bit is set, then @code{no_sub} will be set to one during
 284 @code{re_compile_pattern}.  This causes matching and searching routines
 285 not to record substring match information.
 286
 287 @cnindex RE_UNMATCHED_RIGHT_PAREN_ORD
 288 @item RE_UNMATCHED_RIGHT_PAREN_ORD
 289 If this bit is set and the regular expression has no matching open-group
 290 operator, then Regex considers what would otherwise be a close-group
 291 operator (based on how @code{RE_NO_BK_PARENS} is set) to match @samp{)}.
 292
 293 @end table
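
To make the effect of a single syntax bit concrete, here is a minimal
sketch that compiles one pattern under a caller-supplied combination of
syntax bits.  It assumes the GNU C Library's @file{regex.h} with
@code{_GNU_SOURCE} defined; the helper @code{whole_match} is
hypothetical.  Note that @code{re_set_syntax} changes a global setting,
so the sketch restores the previous value right after compiling.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU functions */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN under the syntax bits SYNTAX and report whether it
   matches all of TEXT: 1 if so, 0 if not, -1 if PATTERN is invalid
   under that syntax.  */
int
whole_match (reg_syntax_t syntax, const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  reg_syntax_t saved;
  regoff_t len, matched;
  const char *err;

  assert (pattern != NULL && text != NULL);
  memset (&buf, 0, sizeof buf);

  saved = re_set_syntax (syntax);   /* returns the previous syntax */
  err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);            /* the syntax only matters at compile time */
  if (err != NULL)
    return -1;

  len = (regoff_t) strlen (text);
  matched = re_match (&buf, text, len, 0, NULL);
  regfree (&buf);
  return matched == len;
}
```

For instance, with no bits set @samp{ca+r} matches @samp{caar}, whereas
with @code{RE_BK_PLUS_QM} set the same pattern matches only the literal
string @samp{ca+r} and the repetition has to be written @samp{ca\+r}.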
 294
 295
 296 @node Predefined Syntaxes
 297 @section Predefined Syntaxes
 298
 299 If you're programming with Regex, you can set a pattern buffer's
 300 (@pxref{GNU Pattern Buffers})
 301 syntax either to an arbitrary combination of syntax bits
 302 (@pxref{Syntax Bits}) or else to the configurations defined by Regex.
 303 These configurations define the syntaxes used by certain
 304 programs---GNU Emacs,
 305 @cindex Emacs
 306 POSIX Awk,
 307 @cindex POSIX Awk
 308 traditional Awk,
 309 @cindex Awk
 310 Grep,
 311 @cindex Grep
 312 @cindex Egrep
 313 Egrep---in addition to syntaxes for POSIX basic and extended
 314 regular expressions.
 315
 316 The predefined syntaxes---taken directly from @file{regex.h}---are:
 317
 318 @smallexample
 319 #define RE_SYNTAX_EMACS 0
 320
 321 #define RE_SYNTAX_AWK                                                   \
 322   (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL                       \
 323    | RE_NO_BK_PARENS            | RE_NO_BK_REFS                         \
 324    | RE_NO_BK_VBAR               | RE_NO_EMPTY_RANGES                   \
 325    | RE_UNMATCHED_RIGHT_PAREN_ORD)
 326
 327 #define RE_SYNTAX_POSIX_AWK                                             \
 328   (RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS)
 329
 330 #define RE_SYNTAX_GREP                                                  \
 331   (RE_BK_PLUS_QM              | RE_CHAR_CLASSES                         \
 332    | RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS                            \
 333    | RE_NEWLINE_ALT)
 334
 335 #define RE_SYNTAX_EGREP                                                 \
 336   (RE_CHAR_CLASSES        | RE_CONTEXT_INDEP_ANCHORS                    \
 337    | RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE                    \
 338    | RE_NEWLINE_ALT       | RE_NO_BK_PARENS                             \
 339    | RE_NO_BK_VBAR)
 340
 341 #define RE_SYNTAX_POSIX_EGREP                                           \
 342   (RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES)
 343
 344 /* P1003.2/D11.2, section 4.20.7.1, lines 5078ff.  */
 345 #define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC
 346
 347 #define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC
 348
 349 /* Syntax bits common to both basic and extended POSIX regex syntax.  */
 350 #define _RE_SYNTAX_POSIX_COMMON                                         \
 351   (RE_CHAR_CLASSES | RE_DOT_NEWLINE      | RE_DOT_NOT_NULL              \
 352    | RE_INTERVALS  | RE_NO_EMPTY_RANGES)
 353
 354 #define RE_SYNTAX_POSIX_BASIC                                           \
 355   (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM)
 356
 357 /* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes
 358    RE_LIMITED_OPS, i.e., \? \+ \| are not recognized.  Actually, this
 359    isn't minimal, since other operators, such as \`, aren't disabled.  */
 360 #define RE_SYNTAX_POSIX_MINIMAL_BASIC                                   \
 361   (_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS)
 362
 363 #define RE_SYNTAX_POSIX_EXTENDED                                        \
 364   (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS                   \
 365    | RE_CONTEXT_INDEP_OPS  | RE_NO_BK_BRACES                            \
 366    | RE_NO_BK_PARENS       | RE_NO_BK_VBAR                              \
 367    | RE_UNMATCHED_RIGHT_PAREN_ORD)
 368
 369 /* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS
 370    replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added.  */
 371 #define RE_SYNTAX_POSIX_MINIMAL_EXTENDED                                \
 372   (_RE_SYNTAX_POSIX_COMMON  | RE_CONTEXT_INDEP_ANCHORS                  \
 373    | RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES                           \
 374    | RE_NO_BK_PARENS        | RE_NO_BK_REFS                             \
 375    | RE_NO_BK_VBAR          | RE_UNMATCHED_RIGHT_PAREN_ORD)
 376 @end smallexample
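
The following sketch shows how the same pattern can mean different
things under different predefined syntaxes.  It assumes the GNU C
Library's @file{regex.h} with @code{_GNU_SOURCE} defined; the table
@code{syntaxes} and the function @code{matching_syntaxes} are
hypothetical names chosen for this illustration.

```c
#define _GNU_SOURCE             /* assumption: glibc; exposes the GNU functions */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* A few of the predefined syntaxes; bit I of the result of
   matching_syntaxes corresponds to entry I of this table.  */
static const reg_syntax_t syntaxes[] = {
  RE_SYNTAX_EMACS,              /* bit 0 */
  RE_SYNTAX_POSIX_BASIC,        /* bit 1 */
  RE_SYNTAX_POSIX_EXTENDED      /* bit 2 */
};

/* Return a bit mask telling under which of the syntaxes above PATTERN
   is valid and matches all of TEXT.  */
unsigned
matching_syntaxes (const char *pattern, const char *text)
{
  unsigned mask = 0;
  regoff_t len;

  assert (pattern != NULL && text != NULL);
  len = (regoff_t) strlen (text);
  for (size_t i = 0; i < sizeof syntaxes / sizeof syntaxes[0]; i++)
    {
      struct re_pattern_buffer buf;
      reg_syntax_t saved = re_set_syntax (syntaxes[i]);
      const char *err;

      memset (&buf, 0, sizeof buf);
      err = re_compile_pattern (pattern, strlen (pattern), &buf);
      re_set_syntax (saved);    /* restore the global setting */
      if (err != NULL)
        continue;               /* invalid under this syntax */
      if (re_match (&buf, text, len, 0, NULL) == len)
        mask |= 1u << i;
      regfree (&buf);
    }
  return mask;
}
```

Because only @code{RE_SYNTAX_POSIX_EXTENDED} sets
@code{RE_NO_BK_VBAR}, the pattern @samp{a|b} matches @samp{a} under that
syntax alone; under the other two it is a literal three-character
string, and alternation is spelled @samp{a\|b}.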
 377
 378 @node Collating Elements vs. Characters
 379 @section Collating Elements vs.@: Characters
 380
 381 POSIX generalizes the notion of a character to that of a
 382 collating element.  It defines a @dfn{collating element} to be ``a
 383 sequence of one or more bytes defined in the current collating sequence
 384 as a unit of collation.''
 385
 386 This generalizes the notion of a character in
 387 two ways.  First, a single character can map into two or more collating
 388 elements.  For example, the German ``ß''
 389 collates as the collating element @samp{s} followed by another collating
 390 element @samp{s}.  Second, two or more characters can map into one
 391 collating element.  For example, the Czech @samp{ch} collates after
 392 @samp{h} and before @samp{i}.
 393
 394 Since POSIX's ``collating element'' preserves the essential idea of
 395 a ``character,'' we use the latter, more familiar, term in this document.
 396
 397 @node The Backslash Character
 398 @section The Backslash Character
 399
 400 @cindex \
 401 The @samp{\} character has one of four different meanings, depending on
 402 the context in which you use it and what syntax bits are set
 403 (@pxref{Syntax Bits}).  It can: 1) stand for itself, 2) quote the next
 404 character, 3) introduce an operator, or 4) do nothing.
 405
 406 @enumerate
 407 @item
 408 It stands for itself inside a list
 409 (@pxref{List Operators}) if the syntax bit
 410 @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is not set.  For example, @samp{[\]}
 411 would match @samp{\}.
 412
 413 @item
 414 It quotes (makes ordinary, if it's special) the next character when you
 415 use it either:
 416
 417 @itemize @bullet
 418 @item
 419 outside a list,@footnote{Sometimes
 420 you don't have to explicitly quote special characters to make
 421 them ordinary.  For instance, most characters lose any special meaning
 422 inside a list (@pxref{List Operators}).  In addition, if the syntax bits
 423 @code{RE_CONTEXT_INVALID_OPS} and @code{RE_CONTEXT_INDEP_OPS}
 424 aren't set, then (for historical reasons) the matcher considers special
 425 characters ordinary if they are in contexts where the operations they
  426 represent make no sense; for example, the match-zero-or-more
  427 operator (represented by @samp{*}) then matches itself in the regular
 428 expression @samp{*foo} because there is no preceding expression on which
 429 it can operate.  It is poor practice, however, to depend on this
 430 behavior; if you want a special character to be ordinary outside a list,
 431 it's better to always quote it, regardless.} or
 432
 433 @item
 434 inside a list and the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is set.
 435
 436 @end itemize
 437
 438 @item
 439 It introduces an operator when followed by certain ordinary
 440 characters---sometimes only when certain syntax bits are set.  See the
  441 cases @code{RE_BK_PLUS_QM}, @code{RE_NO_BK_BRACES}, @code{RE_NO_BK_VBAR},
  442 @code{RE_NO_BK_PARENS}, @code{RE_NO_BK_REFS} in @ref{Syntax Bits}.  Also:
 443
 444 @itemize @bullet
 445 @item
 446 @samp{\b} represents the match-word-boundary operator
 447 (@pxref{Match-word-boundary Operator}).
 448
 449 @item
 450 @samp{\B} represents the match-within-word operator
 451 (@pxref{Match-within-word Operator}).
 452
 453 @item
 454 @samp{\<} represents the match-beginning-of-word operator @*
 455 (@pxref{Match-beginning-of-word Operator}).
 456
 457 @item
 458 @samp{\>} represents the match-end-of-word operator
 459 (@pxref{Match-end-of-word Operator}).
 460
 461 @item
 462 @samp{\w} represents the match-word-constituent operator
 463 (@pxref{Match-word-constituent Operator}).
 464
 465 @item
 466 @samp{\W} represents the match-non-word-constituent operator
 467 (@pxref{Match-non-word-constituent Operator}).
 468
 469 @item
 470 @samp{\s@var{class}} is equivalent to @code{[[:space:]]}
 471 (@pxref{Match-space Operator}).
 472
 473 @item
  474 @samp{\S@var{class}} is equivalent to @code{[^[:space:]]}
 475 (@pxref{Match-non-space Operator}).
 476
 477 @item
 478 @samp{\`} represents the match-beginning-of-string
 479 operator and @samp{\'} represents the match-end-of-string operator
 480 (@pxref{Whole-string Operators}).
 481
 482 @end itemize
 483
 484 @item
 485 In all other cases, Regex ignores @samp{\}.  For example,
 486 @samp{\n} matches @samp{n}.
 487
 488 @end enumerate
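
Two of the cases above can be tried with the POSIX functions.  This is
only a sketch: it assumes @code{regcomp} with @code{REG_EXTENDED}, which
in the GNU implementation corresponds to a syntax that leaves
@code{RE_BACKSLASH_ESCAPE_IN_LISTS} unset, and the helper
@code{ere_search} is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the POSIX extended regular expression PATTERN matches
   somewhere in TEXT, 0 if it does not, and -1 if PATTERN is invalid.  */
int
ere_search (const char *pattern, const char *text)
{
  regex_t re;
  int status;

  assert (pattern != NULL && text != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  status = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return status == 0;
}
```

Outside a list, @samp{a\.b} finds @samp{a.b} but not @samp{axb},
because the backslash quotes the period.  Inside a list the backslash
stands for itself here, so @samp{^[\]$} matches a string consisting of a
single backslash.  (In C source each backslash of the pattern must
itself be doubled.)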
 489
 490 @node Common Operators
 491 @chapter Common Operators
 492
 493 You compose regular expressions from operators.  In the following
 494 sections, we describe the regular expression operators specified by
 495 POSIX; GNU also uses these.  Most operators have more than one
 496 representation as characters.  @xref{Regular Expression Syntax}, for
 497 what characters represent what operators under what circumstances.
 498
 499 For most operators that can be represented in two ways, one
 500 representation is a single character and the other is that character
 501 preceded by @samp{\}.  For example, either @samp{(} or @samp{\(}
 502 represents the open-group operator.  Which one does depends on the
 503 setting of a syntax bit, in this case @code{RE_NO_BK_PARENS}.  Why is
 504 this so?  Historical reasons dictate some of the varying
 505 representations, while POSIX dictates others.
 506
 507 Finally, almost all characters lose any special meaning inside a list
 508 (@pxref{List Operators}).
 509
 510 @menu
 511 * Match-self Operator::                 Ordinary characters.
 512 * Match-any-character Operator::        .
 513 * Concatenation Operator::              Juxtaposition.
 514 * Repetition Operators::                *  +  ? @{@}
 515 * Alternation Operator::                |
 516 * List Operators::                      [...]  [^...]
 517 * Grouping Operators::                  (...)
 518 * Back-reference Operator::             \digit
 519 * Anchoring Operators::                 ^  $
 520 @end menu
 521
 522 @node Match-self Operator
 523 @section The Match-self Operator (@var{ordinary character})
 524
 525 This operator matches the character itself.  All ordinary characters
 526 (@pxref{Regular Expression Syntax}) represent this operator.  For
 527 example, @samp{f} is always an ordinary character, so the regular
 528 expression @samp{f} matches only the string @samp{f}.  In
 529 particular, it does @emph{not} match the string @samp{ff}.
 530
 531 @node Match-any-character Operator
 532 @section The Match-any-character Operator (@code{.})
 533
 534 @cindex @samp{.}
 535
 536 This operator matches any single printing or nonprinting character
 537 except it won't match a:
 538
 539 @table @asis
 540 @item newline
 541 if the syntax bit @code{RE_DOT_NEWLINE} isn't set.
 542
 543 @item null
 544 if the syntax bit @code{RE_DOT_NOT_NULL} is set.
 545
 546 @end table
 547
 548 The @samp{.} (period) character represents this operator.  For example,
 549 @samp{a.b} matches any three-character string beginning with @samp{a}
 550 and ending with @samp{b}.
 551
 552 @node Concatenation Operator
 553 @section The Concatenation Operator
 554
 555 This operator concatenates two regular expressions @var{a} and @var{b}.
 556 No character represents this operator; you simply put @var{b} after
 557 @var{a}.  The result is a regular expression that will match a string if
 558 @var{a} matches its first part and @var{b} matches the rest.  For
 559 example, @samp{xy} (two match-self operators) matches @samp{xy}.
 560
 561 @node Repetition Operators
 562 @section Repetition Operators
 563
 564 Repetition operators repeat the preceding regular expression a specified
 565 number of times.
 566
 567 @menu
 568 * Match-zero-or-more Operator::  *
 569 * Match-one-or-more Operator::   +
 570 * Match-zero-or-one Operator::   ?
 571 * Interval Operators::           @{@}
 572 @end menu
 573
 574 @node Match-zero-or-more Operator
 575 @subsection The Match-zero-or-more Operator (@code{*})
 576
 577 @cindex @samp{*}
 578
 579 This operator repeats the smallest possible preceding regular expression
 580 as many times as necessary (including zero) to match the pattern.
 581 @samp{*} represents this operator.  For example, @samp{o*}
 582 matches any string made up of zero or more @samp{o}s.  Since this
 583 operator operates on the smallest preceding regular expression,
 584 @samp{fo*} has a repeating @samp{o}, not a repeating @samp{fo}.  So,
 585 @samp{fo*} matches @samp{f}, @samp{fo}, @samp{foo}, and so on.
 586
 587 Since the match-zero-or-more operator is a suffix operator, it may be
 588 useless as such when no regular expression precedes it.  This is the
 589 case when it:
 590
 591 @itemize @bullet
 592 @item
 593 is first in a regular expression, or
 594
 595 @item
 596 follows a match-beginning-of-line, open-group, or alternation
 597 operator.
 598
 599 @end itemize
 600
 601 @noindent
 602 Three different things can happen in these cases:
 603
 604 @enumerate
 605 @item
 606 If the syntax bit @code{RE_CONTEXT_INVALID_OPS} is set, then the
 607 regular expression is invalid.
 608
 609 @item
 610 If @code{RE_CONTEXT_INVALID_OPS} isn't set, but
 611 @code{RE_CONTEXT_INDEP_OPS} is, then @samp{*} represents the
 612 match-zero-or-more operator (which then operates on the empty string).
 613
 614 @item
 615 Otherwise, @samp{*} is ordinary.
 616
 617 @end enumerate
 618
 619 @cindex backtracking
 620 The matcher processes a match-zero-or-more operator by first matching as
 621 many repetitions of the smallest preceding regular expression as it can.
 622 Then it continues to match the rest of the pattern.
 623
 624 If it can't match the rest of the pattern, it backtracks (as many times
 625 as necessary), each time discarding one of the matches until it can
 626 either match the entire pattern or be certain that it cannot get a
 627 match.  For example, when matching @samp{ca*ar} against @samp{caaar},
 628 the matcher first matches all three @samp{a}s of the string with the
 629 @samp{a*} of the regular expression.  However, it cannot then match the
 630 final @samp{ar} of the regular expression against the final @samp{r} of
 631 the string.  So it backtracks, discarding the match of the last @samp{a}
 632 in the string.  It can then match the remaining @samp{ar}.
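
One way to observe the outcome of this backtracking is to put the
repeated expression in a group and ask how much text the group ended up
matching.  The sketch below assumes the POSIX functions @code{regcomp}
and @code{regexec} with @code{REG_EXTENDED}, under which @samp{(} and
@samp{)} are the grouping operators; the helper @code{group1_length} is
hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return the length of the text matched by the first group of the
   POSIX extended regular expression PATTERN when searched for in TEXT,
   or -1 if PATTERN is invalid, has no group, or does not match.  */
int
group1_length (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[2];              /* m[0]: whole match, m[1]: first group */
  int result = -1;

  assert (pattern != NULL && text != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  if (re.re_nsub >= 1
      && regexec (&re, text, 2, m, 0) == 0
      && m[1].rm_so != -1)
    result = (int) (m[1].rm_eo - m[1].rm_so);
  regfree (&re);
  return result;
}
```

Matching @samp{c(a*)ar} against @samp{caaar} leaves only two @samp{a}s
in the group, since the third had to be given back to the trailing
@samp{ar}; with @samp{c(a*)r} the group keeps all three.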
 633
 634
 635 @node Match-one-or-more Operator
 636 @subsection The Match-one-or-more Operator (@code{+} or @code{\+})
 637
 638 @cindex @samp{+}
 639
 640 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't recognize
 641 this operator.  Otherwise, if the syntax bit @code{RE_BK_PLUS_QM} isn't
 642 set, then @samp{+} represents this operator; if it is, then @samp{\+}
 643 does.
 644
 645 This operator is similar to the match-zero-or-more operator except that
 646 it repeats the preceding regular expression at least once;
 647 @pxref{Match-zero-or-more Operator}, for what it operates on, how some
 648 syntax bits affect it, and how Regex backtracks to match it.
 649
  650 For example, supposing that @samp{+} represents the match-one-or-more
  651 operator, @samp{ca+r} matches, e.g., @samp{car} and
  652 @samp{caaaar}, but not @samp{cr}.
 653
 654 @node Match-zero-or-one Operator
 655 @subsection The Match-zero-or-one Operator (@code{?} or @code{\?})
 656 @cindex @samp{?}
 657
 658 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't
 659 recognize this operator.  Otherwise, if the syntax bit
 660 @code{RE_BK_PLUS_QM} isn't set, then @samp{?} represents this operator;
 661 if it is, then @samp{\?} does.
 662
 663 This operator is similar to the match-zero-or-more operator except that
 664 it repeats the preceding regular expression once or not at all;
 665 @pxref{Match-zero-or-more Operator}, to see what it operates on, how
 666 some syntax bits affect it, and how Regex backtracks to match it.
 667
  668 For example, supposing that @samp{?} represents the match-zero-or-one
  669 operator, @samp{ca?r} matches both @samp{car} and @samp{cr}, but
  670 nothing else.
 671
 672 @node Interval Operators
 673 @subsection Interval Operators (@code{@{} @dots{} @code{@}} or @code{\@{} @dots{} @code{\@}})
 674
 675 @cindex interval expression
 676 @cindex @samp{@{}
 677 @cindex @samp{@}}
 678 @cindex @samp{\@{}
 679 @cindex @samp{\@}}
 680
 681 If the syntax bit @code{RE_INTERVALS} is set, then Regex recognizes
 682 @dfn{interval expressions}.  They repeat the smallest possible preceding
 683 regular expression a specified number of times.
 684
 685 If the syntax bit @code{RE_NO_BK_BRACES} is set, @samp{@{} represents
 686 the @dfn{open-interval operator} and @samp{@}} represents the
  687 @dfn{close-interval operator}; otherwise, @samp{\@{} and @samp{\@}} do.
 688
  689 Specifically, supposing that @samp{@{} and @samp{@}} represent the
  690 open-interval and close-interval operators, then:
 691
 692 @table @code
  693 @item @{@var{count}@}
 694 matches exactly @var{count} occurrences of the preceding regular
 695 expression.
 696
 697 @item @{@var{min},@}
 698 matches @var{min} or more occurrences of the preceding regular
 699 expression.
 700
  701 @item @{@var{min},@var{max}@}
 702 matches at least @var{min} but no more than @var{max} occurrences of
 703 the preceding regular expression.
 704
 705 @end table
 706
 707 The interval expression (but not necessarily the regular expression that
 708 contains it) is invalid if:
 709
 710 @itemize @bullet
 711 @item
 712 @var{min} is greater than @var{max}, or
 713
 714 @item
  715 any of @var{count}, @var{min}, or @var{max} is outside the range
 716 zero to @code{RE_DUP_MAX} (which symbol @file{regex.h}
 717 defines).
 718
 719 @end itemize
 720
 721 If the interval expression is invalid and the syntax bit
 722 @code{RE_NO_BK_BRACES} is set, then Regex considers all the
 723 characters in the would-be interval to be ordinary.  If that bit
 724 isn't set, then the regular expression is invalid.
 725
 726 If the interval expression is valid but there is no preceding regular
 727 expression on which to operate, then if the syntax bit
 728 @code{RE_CONTEXT_INVALID_OPS} is set, the regular expression is invalid.
 729 If that bit isn't set, then Regex considers all the characters---other
 730 than backslashes, which it ignores---in the would-be interval to be
 731 ordinary.
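
The repetition operators of this section can be exercised with a small
whole-string matcher.  This is a sketch under stated assumptions: it
uses the POSIX functions @code{regcomp} and @code{regexec} with
@code{REG_EXTENDED}, a syntax in which @samp{+}, @samp{?} and unescaped
braces represent the operators, and the helper @code{ere_whole_match}
is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Return 1 if the POSIX extended regular expression PATTERN matches all
   of TEXT, 0 if it does not, and -1 if PATTERN is invalid.  A match of
   the whole string, when one exists, is necessarily the leftmost
   longest match, so checking the reported offsets is sufficient.  */
int
ere_whole_match (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m;
  int result = 0;

  assert (pattern != NULL && text != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  if (regexec (&re, text, 1, &m, 0) == 0)
    result = m.rm_so == 0 && (size_t) m.rm_eo == strlen (text);
  regfree (&re);
  return result;
}
```

For example, the interval in the pattern @code{a@{2,3@}} accepts
@samp{aa} and @samp{aaa} but neither @samp{a} nor @samp{aaaa}, while
@samp{ca?r} accepts @samp{cr} and @samp{car} but not @samp{caar}.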
 732
 733
 734 @node Alternation Operator
 735 @section The Alternation Operator (@code{|} or @code{\|})
 736
 737 @kindex |
 738 @kindex \|
 739 @cindex alternation operator
 740 @cindex or operator
 741
 742 If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't
 743 recognize this operator.  Otherwise, if the syntax bit
 744 @code{RE_NO_BK_VBAR} is set, then @samp{|} represents this operator;
 745 otherwise, @samp{\|} does.
 746
 747 Alternatives match one of a choice of regular expressions:
 748 if you put the character(s) representing the alternation operator between
 749 any two regular expressions @var{a} and @var{b}, the result matches
 750 the union of the strings that @var{a} and @var{b} match.  For
 751 example, supposing that @samp{|} is the alternation operator, then
 752 @samp{foo|bar|quux} would match any of @samp{foo}, @samp{bar} or
 753 @samp{quux}.
 754
 755 The alternation operator operates on the @emph{largest} possible
 756 surrounding regular expressions.  (Put another way, it has the lowest
 757 precedence of any regular expression operator.)
 758 Thus, the only way you can
 759 delimit its arguments is to use grouping.  For example, if @samp{(} and
 760 @samp{)} are the open and close-group operators, then @samp{fo(o|b)ar}
 761 would match either @samp{fooar} or @samp{fobar}.  (@samp{foo|bar} would
 762 match @samp{foo} or @samp{bar}.)
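
The precedence rule can be checked experimentally.  The following C
sketch is only an illustration: it uses the POSIX interface
(@code{regcomp} and @code{regexec} with @code{REG_EXTENDED}, where
@samp{|}, @samp{(} and @samp{)} are operators) rather than the GNU
functions, and the helper name @code{matches_whole} is hypothetical,
not part of Regex.

```c
#include <regex.h>
#include <string.h>

/* Return 1 if the POSIX extended PATTERN matches all of TEXT, 0 if it
   does not, and -1 if PATTERN fails to compile.  */
int
matches_whole (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m;
  int ok;

  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  /* The reported match is leftmost-longest, so a whole-string match
     shows up as offsets 0 .. strlen (text).  */
  ok = regexec (&re, text, 1, &m, 0) == 0
       && m.rm_so == 0
       && (size_t) m.rm_eo == strlen (text);
  regfree (&re);
  return ok;
}
```

Under these assumptions, @code{matches_whole ("fo(o|b)ar", "fobar")}
returns 1, while @code{matches_whole ("foo|bar", "fooar")} returns 0,
because without grouping the alternation splits the whole pattern into
@samp{foo} and @samp{bar}.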
 763
 764 @cindex backtracking
 765 The matcher usually tries all combinations of alternatives so as to
 766 match the longest possible string.  For example, when matching
 767 @samp{(fooq|foo)*(qbarquux|bar)} against @samp{fooqbarquux}, it cannot
 768 take, say, the first (``depth-first'') combination it could match, since
 769 then it would be content to match just @samp{fooqbar}.
 770
 771 Note that since the default behavior is to return the leftmost longest
 772 match, when more than one of a series of alternatives matches, the actual
 773 match will be the longest matching alternative, not necessarily the
 774 first in the list.
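
As a rough illustration of the leftmost-longest behavior, the sketch
below measures the length of the first match.  It assumes a
POSIX-conforming @code{regexec} (such as the one in the GNU C library),
which follows the same leftmost-longest rule.  The function name
@code{match_length} is made up for this example.

```c
#include <regex.h>

/* Return the length of the first (leftmost) match of the POSIX extended
   PATTERN in TEXT, or -1 if there is no match or PATTERN is invalid.  */
long
match_length (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m;
  long len = -1;

  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  if (regexec (&re, text, 1, &m, 0) == 0)
    len = (long) (m.rm_eo - m.rm_so);
  regfree (&re);
  return len;
}
```

With this helper, @code{match_length ("foo|foobar", "foobar")} is 6
even though @samp{foo} is listed first, and
@code{match_length ("(fooq|foo)*(qbarquux|bar)", "fooqbarquux")} is 11
rather than the 7 characters of @samp{fooqbar}.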
 775
 776
 777 @node List Operators
 778 @section List Operators (@code{[} @dots{} @code{]} and @code{[^} @dots{} @code{]})
 779
 780 @cindex matching list
 781 @cindex @samp{[}
 782 @cindex @samp{]}
 783 @cindex @samp{^}
 784 @cindex @samp{-}
 785 @cindex @samp{\}
 786 @cindex @samp{[^}
 787 @cindex nonmatching list
 788 @cindex matching newline
 789 @cindex bracket expression
 790
 791 @dfn{Lists}, also called @dfn{bracket expressions}, are a set of one or
 792 more items.  An @dfn{item} is a character,
 793 a collating symbol, an equivalence class expression,
 794 a character class expression, or a range expression.  The syntax bits
 795 affect which kinds of items you can put in a list.  We explain the last
 796 four items in subsections below.  Empty lists are invalid.
 797
 798 A @dfn{matching list} matches a single character represented by one of
 799 the list items.  You form a matching list by enclosing one or more items
 800 within an @dfn{open-matching-list operator} (represented by @samp{[})
 801 and a @dfn{close-list operator} (represented by @samp{]}).
 802
 803 For example, @samp{[ab]} matches either @samp{a} or @samp{b}.
 804 @samp{[ad]*} matches the empty string and any string composed of just
 805 @samp{a}s and @samp{d}s in any order.  Regex considers invalid a regular
 806 expression with a @samp{[} but no matching
 807 @samp{]}.
 808
 809 @dfn{Nonmatching lists} are similar to matching lists except that they
 810 match a single character @emph{not} represented by one of the list
 811 items.  You use an @dfn{open-nonmatching-list operator} (represented by
 812 @samp{[^}@footnote{Regex therefore doesn't consider the @samp{^} to be
 813 the first character in the list.  If you put a @samp{^} character first
 814 in (what you think is) a matching list, you'll turn it into a
 815 nonmatching list.}) instead of an open-matching-list operator to start a
 816 nonmatching list.
 817
 818 For example, @samp{[^ab]} matches any character except @samp{a} or
 819 @samp{b}.
 820
 821 If the syntax bit @code{RE_HAT_LISTS_NOT_NEWLINE} is set, then
 822 nonmatching lists do not match a newline.
 823
 824 Most characters lose any special meaning inside a list.  The special
 825 characters inside a list follow.
 826
 827 @table @samp
 828 @item ]
 829 ends the list if it's not the first list item.  So, if you want to make
 830 the @samp{]} character a list item, you must put it first.
 831
 832 @item \
 833 quotes the next character if the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is
 834 set.
 835
 836 @item [.
 837 represents the open-collating-symbol operator (@pxref{Collating Symbol
 838 Operators}).
 839
 840 @item .]
 841 represents the close-collating-symbol operator.
 842
 843 @item [=
 844 represents the open-equivalence-class operator (@pxref{Equivalence Class
 845 Operators}).
 846
 847 @item =]
 848 represents the close-equivalence-class operator.
 849
 850 @item [:
 851 represents the open-character-class operator (@pxref{Character Class
 852 Operators}) if the syntax bit @code{RE_CHAR_CLASSES} is set and what
 853 follows is a valid character class expression.
 854
 855 @item :]
 856 represents the close-character-class operator if the syntax bit
 857 @code{RE_CHAR_CLASSES} is set and what precedes it is an
 858 open-character-class operator followed by a valid character class name.
 859
 860 @item -
 861 represents the range operator (@pxref{Range Operator}) if it's
 862 not first or last in a list or the ending point of a range.
 863
 864 @end table
 865
 866 @noindent
 867 All other characters are ordinary.  For example, @samp{[.*]} matches
 868 @samp{.} and @samp{*}.
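
The rules for special characters inside a list can be tried out with a
small helper.  This is a sketch under stated assumptions: it goes
through the POSIX @code{regcomp} interface rather than the GNU
functions, and @code{list_matches_char} is a hypothetical name.

```c
#include <regex.h>

/* Return 1 if the bracket expression LIST matches the single character
   C, 0 if it does not, and -1 if LIST is not a valid pattern.  */
int
list_matches_char (const char *list, char c)
{
  regex_t re;
  regmatch_t m;
  char text[2] = { c, '\0' };
  int ok;

  if (regcomp (&re, list, REG_EXTENDED) != 0)
    return -1;
  ok = regexec (&re, text, 1, &m, 0) == 0 && m.rm_so == 0 && m.rm_eo == 1;
  regfree (&re);
  return ok;
}
```

For instance, @code{list_matches_char ("[.*]", '*')} returns 1 because
@samp{*} is ordinary inside a list, and
@code{list_matches_char ("[]a]", ']')} returns 1 because a @samp{]}
placed first is a list item rather than the close-list operator.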
 869
 870 @menu
 871 * Collating Symbol Operators::  [.elem.]
 872 * Equivalence Class Operators:: [=class=]
 873 * Character Class Operators::   [:class:]
 874 * Range Operator::          start-end
 875 @end menu
 876
 877
 878 @node Collating Symbol Operators
 879 @subsection Collating Symbol Operators (@code{[.} @dots{} @code{.]})
 880
 881 Collating symbols can be represented inside lists.
 882 You form a @dfn{collating symbol} by
 883 putting a collating element between an @dfn{open-collating-symbol
 884 operator} and a @dfn{close-collating-symbol operator}.  @samp{[.}
 885 represents the open-collating-symbol operator and @samp{.]} represents
 886 the close-collating-symbol operator.  For example, if @samp{ll} is a
 887 collating element, then @samp{[[.ll.]]} would match @samp{ll}.
 888
 889 @node Equivalence Class Operators
 890 @subsection Equivalence Class Operators (@code{[=} @dots{} @code{=]})
 891 @cindex equivalence class expression in regex
 892 @cindex @samp{[=} in regex
 893 @cindex @samp{=]} in regex
 894
 895 Regex recognizes equivalence class
 896 expressions inside lists.  An @dfn{equivalence class expression} is a set
 897 of collating elements which all belong to the same equivalence class.
 898 You form an equivalence class expression by putting a collating
 899 element between an @dfn{open-equivalence-class operator} and a
 900 @dfn{close-equivalence-class operator}.  @samp{[=} represents the
 901 open-equivalence-class operator and @samp{=]} represents the
 902 close-equivalence-class operator.  For example, if @samp{a} and @samp{A}
 903 were an equivalence class, then both @samp{[[=a=]]} and @samp{[[=A=]]}
 904 would match both @samp{a} and @samp{A}.  If the collating element in an
 905 equivalence class expression isn't part of an equivalence class, then
 906 the matcher considers the equivalence class expression to be a collating
 907 symbol.
 908
 909 @node Character Class Operators
 910 @subsection Character Class Operators (@code{[:} @dots{} @code{:]})
 911
 912 @cindex character classes
 913 @cindex @samp{[colon} in regex
 914 @cindex @samp{colon]} in regex
 915
 916 If the syntax bit @code{RE_CHAR_CLASSES} is set, then Regex recognizes
 917 character class expressions inside lists.  A @dfn{character class
 918 expression} matches one character from a given class.  You form a
 919 character class expression by putting a character class name between
 920 an @dfn{open-character-class operator} (represented by @samp{[:}) and
 921 a @dfn{close-character-class operator} (represented by @samp{:]}).
 922 The character class names and their meanings are:
 923
 924 @table @code
 925
 926 @item alnum
 927 letters and digits
 928
 929 @item alpha
 930 letters
 931
 932 @item blank
 933 system-dependent; for GNU, a space or tab
 934
 935 @item cntrl
 936 control characters (in the ASCII encoding, code 0177 and codes
 937 less than 040)
 938
 939 @item digit
 940 digits
 941
 942 @item graph
 943 same as @code{print} except omits space
 944
 945 @item lower
 946 lowercase letters
 947
 948 @item print
 949 printable characters (in the ASCII encoding, space
 950 through tilde, that is, codes 040 through 0176)
 951
 952 @item punct
 953 neither control nor alphanumeric characters
 954
 955 @item space
 956 space, horizontal tab, carriage return, newline, vertical tab, and form feed
 957
 958 @item upper
 959 uppercase letters
 960
 961 @item xdigit
 962 hexadecimal digits: @code{0}--@code{9}, @code{a}--@code{f}, @code{A}--@code{F}
 963
 964 @end table
 965
 966 @noindent
 967 These correspond to the definitions in the C library's @file{<ctype.h>}
 968 facility.  For example, @samp{[:alpha:]} corresponds to the standard
 969 facility @code{isalpha}.  Regex recognizes character class expressions
 970 only inside of lists; so @samp{[[:alpha:]]} matches any letter, but
 971 @samp{[:alpha:]} outside of a bracket expression and not followed by a
 972 repetition operator matches just itself.
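
To see which characters a class covers in the current locale, one can
wrap the class name in a list and test characters one at a time.  The
sketch below does this through the POSIX @code{regcomp} interface; the
name @code{count_class} and the 64-byte pattern buffer are arbitrary
choices for the example.

```c
#include <regex.h>
#include <stdio.h>

/* Count the characters of TEXT that belong to the character class
   CLASS_NAME (for example "digit").  Return -1 if the class name is
   too long or is rejected by regcomp.  */
int
count_class (const char *class_name, const char *text)
{
  char pattern[64];
  regex_t re;
  int count = 0;
  int n = snprintf (pattern, sizeof pattern, "[[:%s:]]", class_name);

  if (n < 0 || (size_t) n >= sizeof pattern)
    return -1;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  for (const char *p = text; *p != '\0'; p++)
    {
      /* Match each character on its own against the one-item list.  */
      char one[2] = { *p, '\0' };
      if (regexec (&re, one, 0, NULL, 0) == 0)
        count++;
    }
  regfree (&re);
  return count;
}
```

In the default C locale, @code{count_class ("digit", "a1b22")} returns
3 and @code{count_class ("alpha", "a1b22")} returns 2.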
 973
 974 @node Range Operator
 975 @subsection The Range Operator (@code{-})
 976
 977 Regex recognizes @dfn{range expressions} inside a list. They represent
 978 those characters
 979 that fall between two elements in the current collating sequence.  You
 980 form a range expression by putting a @dfn{range operator} between two
 981 of any of the following: characters, collating elements, collating symbols,
 982 and equivalence class expressions.  The starting point of the range and
 983 the ending point of the range don't have to be the same kind of item,
 984 e.g., the starting point could be a collating element and the ending
 985 point could be an equivalence class expression.  If a range's ending
 986 point is an equivalence class, then all the collating elements in that
 987 class will be in the range.@footnote{You can't use a character class for the starting
 988 or ending point of a range, since a character class is not a single
 989 character.} @samp{-} represents the range operator.  For example,
 990 @samp{a-f} within a list represents all the characters from @samp{a}
 991 through @samp{f}
 992 inclusively.
 993
 994 If the syntax bit @code{RE_NO_EMPTY_RANGES} is set, then if the range's
 995 ending point collates less than its starting point, the range (and the
 996 regular expression containing it) is invalid.  For example, the regular
 997 expression @samp{[z-a]} would be invalid.  If this bit isn't set, then
 998 Regex considers such a range to be empty.
 999
1000 Since @samp{-} represents the range operator, if you want to make a
1001 @samp{-} character itself
1002 a list item, you must do one of the following:
1003
1004 @itemize @bullet
1005 @item
1006 Put the @samp{-} either first or last in the list.
1007
1008 @item
1009 Include a range whose starting point collates strictly lower than
1010 @samp{-} and whose ending point collates equal or higher.  Unless a
1011 range is the first item in a list, a @samp{-} can't be its starting
1012 point, but @emph{can} be its ending point.  That is because Regex
1013 considers @samp{-} to be the range operator unless it is preceded by
1014 another @samp{-}.  For example, in the ASCII encoding, @samp{)},
1015 @samp{*}, @samp{+}, @samp{,}, @samp{-}, @samp{.}, and @samp{/} are
1016 contiguous characters in the collating sequence.  You might think that
1017 @samp{[)-+--/]} has two ranges: @samp{)-+} and @samp{--/}.  Rather, it
1018 has the ranges @samp{)-+} and @samp{+--}, plus the character @samp{/}, so
1019 it matches, e.g., @samp{,}, not @samp{.}.
1020
1021 @item
1022 Put a range whose starting point is @samp{-} first in the list.
1023
1024 @end itemize
1025
1026 For example, @samp{[-a-z]} matches a lowercase letter or a hyphen (in
1027 English, in ASCII).
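
The placement rules for @samp{-} and the treatment of reversed ranges
can be explored with the following sketch.  It assumes the POSIX
@code{regcomp} interface, whose syntax rejects a range such as
@samp{z-a}, and the C locale; @code{range_list_check} is a hypothetical
helper name.

```c
#include <regex.h>

/* Return 1 if the bracket expression LIST matches the character C,
   0 if it does not, and -1 if LIST is rejected as invalid.  */
int
range_list_check (const char *list, char c)
{
  regex_t re;
  regmatch_t m;
  char text[2] = { c, '\0' };
  int ok;

  if (regcomp (&re, list, REG_EXTENDED) != 0)
    return -1;   /* e.g. a range whose end collates before its start */
  ok = regexec (&re, text, 1, &m, 0) == 0 && m.rm_so == 0 && m.rm_eo == 1;
  regfree (&re);
  return ok;
}
```

Here @code{range_list_check ("[-a-z]", '-')} and
@code{range_list_check ("[a-z-]", '-')} both return 1, whereas
@code{range_list_check ("[a-z]", '-')} returns 0 and
@code{range_list_check ("[z-a]", 'm')} returns @minus{}1.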
1028
1029
1030 @node Grouping Operators
1031 @section Grouping Operators (@code{(} @dots{} @code{)} or @code{\(} @dots{} @code{\)})
1032
1033 @kindex (
1034 @kindex )
1035 @kindex \(
1036 @kindex \)
1037 @cindex grouping
1038 @cindex subexpressions
1039 @cindex parenthesizing
1040
1041 A @dfn{group}, also known as a @dfn{subexpression}, consists of an
1042 @dfn{open-group operator}, any number of other operators, and a
1043 @dfn{close-group operator}.  Regex treats this sequence as a unit, just
1044 as mathematics and programming languages treat a parenthesized
1045 expression as a unit.
1046
1047 Therefore, using @dfn{groups}, you can:
1048
1049 @itemize @bullet
1050 @item
1051 delimit the argument(s) to an alternation operator (@pxref{Alternation
1052 Operator}) or a repetition operator (@pxref{Repetition
1053 Operators}).
1054
1055 @item
1056 keep track of the indices of the substring that matched a given group.
1057 @xref{Using Registers}, for a precise explanation.
1058 This lets you:
1059
1060 @itemize @bullet
1061 @item
1062 use the back-reference operator (@pxref{Back-reference Operator}).
1063
1064 @item
1065 use registers (@pxref{Using Registers}).
1066
1067 @end itemize
1068
1069 @end itemize
1070
1071 If the syntax bit @code{RE_NO_BK_PARENS} is set, then @samp{(} represents
1072 the open-group operator and @samp{)} represents the
1073 close-group operator; otherwise, @samp{\(} and @samp{\)} do.
1074
1075 If the syntax bit @code{RE_UNMATCHED_RIGHT_PAREN_ORD} is set and a
1076 close-group operator has no matching open-group operator, then Regex
1077 considers it to match @samp{)}.
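
The idea of tracking the indices matched by each group can be sketched
with the POSIX @code{regmatch_t} array, which plays a role similar to
the GNU registers.  Everything named here (@code{group_offset},
@code{MAX_GROUPS}) is illustrative only.

```c
#include <regex.h>

enum { MAX_GROUPS = 10 };

/* Return the start offset (or the end offset, if WANT_END is nonzero)
   of group number GROUP in the first match of PATTERN in TEXT.  Group 0
   is the whole match.  Return -1 if the pattern is invalid, nothing
   matches, or the group did not participate in the match.  */
long
group_offset (const char *pattern, const char *text, size_t group,
              int want_end)
{
  regex_t re;
  regmatch_t m[MAX_GROUPS];
  long result = -1;

  if (group >= MAX_GROUPS || regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  if (group <= re.re_nsub
      && regexec (&re, text, MAX_GROUPS, m, 0) == 0
      && m[group].rm_so != -1)
    result = (long) (want_end ? m[group].rm_eo : m[group].rm_so);
  regfree (&re);
  return result;
}
```

Matching @samp{(a+)(b+)} against @samp{xaabbb} this way gives offsets
1 to 3 for group 1 and 3 to 6 for group 2.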
1078
1079
1080 @node Back-reference Operator
1081 @section The Back-reference Operator (@code{\}@var{digit})
1082
1083 @cindex back-references
1084
1085 If the syntax bit @code{RE_NO_BK_REF} isn't set, then Regex recognizes
1086 back-references.  A back-reference matches a specified preceding group.
1087 The back-reference operator is represented by @samp{\@var{digit}}
1088 anywhere after the end of a regular expression's @w{@var{digit}-th}
1089 group (@pxref{Grouping Operators}).
1090
1091 @var{digit} must be between @samp{1} and @samp{9}.  The matcher assigns
1092 numbers 1 through 9 to the first nine groups it encounters.  By using
1093 one of @samp{\1} through @samp{\9} after the corresponding group's
1094 close-group operator, you can match a substring identical to the
1095 one that the group does.
1096
1097 Back-references match according to the following (in all examples below,
1098 @samp{(} represents the open-group, @samp{)} the close-group, @samp{@{}
1099 the open-interval and @samp{@}} the close-interval operator):
1100
1101 @itemize @bullet
1102 @item
1103 If the group matches a substring, the back-reference matches an
1104 identical substring.  For example, @samp{(a)\1} matches @samp{aa} and
1105 @samp{(bana)na\1bo\1} matches @samp{bananabanabobana}.  Likewise,
1106 @samp{(.*)\1} matches any (newline-free if the syntax bit
1107 @code{RE_DOT_NEWLINE} isn't set) string that is composed of two
1108 identical halves; the @samp{(.*)} matches the first half and the
1109 @samp{\1} matches the second half.
1110
1111 @item
1112 If the group matches more than once (as it might if followed
1113 by, e.g., a repetition operator), then the back-reference matches the
1114 substring the group @emph{last} matched.  For example,
1115 @samp{((a*)b)*\1\2} matches @samp{aabababa}; first @w{group 1} (the
1116 outer one) matches @samp{aab} and @w{group 2} (the inner one) matches
1117 @samp{aa}.  Then @w{group 1} matches @samp{ab} and @w{group 2} matches
1118 @samp{a}.  So, @samp{\1} matches @samp{ab} and @samp{\2} matches
1119 @samp{a}.
1120
1121 @item
1122 If the group doesn't participate in a match, i.e., it is part of an
1123 alternative not taken or a repetition operator allows zero repetitions
1124 of it, then the back-reference makes the whole match fail.  For example,
1125 @samp{(one()|two())-and-(three\2|four\3)} matches @samp{one-and-three}
1126 and @samp{two-and-four}, but not @samp{one-and-four} or
1127 @samp{two-and-three}.  To see why, suppose the pattern has matched
1128 @samp{one-and-}; then its @w{group 2} matches the empty string and its
1129 @w{group 3} doesn't participate in the match.  If it then matches
1130 @samp{four}, it must next back-reference @w{group 3}, because
1131 @samp{\3} follows the @samp{four}, and the match
1132 fails because @w{group 3} didn't participate in the match.
1133
1134 @end itemize
1135
1136 You can use a back-reference as an argument to a repetition operator.  For
1137 example, @samp{(a(b))\2*} matches @samp{a} followed by one or more
1138 @samp{b}s.  Similarly, @samp{(a(b))\2@{3@}} matches @samp{abbbb}.
1139
1140 If there is no preceding @w{@var{digit}-th} subexpression, the regular
1141 expression is invalid.
1142
1143 Back-references can greatly slow down matching, as they can generate
1144 exponentially many matching possibilities that can consume both time
1145 and memory to explore.  Also, the POSIX specification for
1146 back-references is at times unclear.  Furthermore, many regular
1147 expression implementations have back-reference bugs that can cause
1148 programs to return incorrect answers or even crash, and fixing these
1149 bugs has often been low-priority: for example, as of 2020 the
1150 @url{https://sourceware.org/bugzilla/,GNU C library bug database}
1151 contained back-reference bugs
1152 @url{https://sourceware.org/bugzilla/show_bug.cgi?id=52,,52},
1153 @url{https://sourceware.org/bugzilla/show_bug.cgi?id=10844,,10844},
1154 @url{https://sourceware.org/bugzilla/show_bug.cgi?id=11053,,11053},
1155 @url{https://sourceware.org/bugzilla/show_bug.cgi?id=24269,,24269}
1156 and @url{https://sourceware.org/bugzilla/show_bug.cgi?id=25322,,25322},
1157 with little sign of forthcoming fixes.  Luckily,
1158 back-references are rarely useful and it should be little trouble to
1159 avoid them in practical applications.
1160
1161
1162 @node Anchoring Operators
1163 @section Anchoring Operators
1164
1165 @cindex anchoring
1166 @cindex regexp anchoring
1167
1168 These operators can constrain a pattern to match only at the beginning or
1169 end of the entire string or at the beginning or end of a line.
1170
1171 @menu
1172 * Match-beginning-of-line Operator::  ^
1173 * Match-end-of-line Operator::        $
1174 @end menu
1175
1176
1177 @node Match-beginning-of-line Operator
1178 @subsection The Match-beginning-of-line Operator (@code{^})
1179
1180 @kindex ^
1181 @cindex beginning-of-line operator
1182 @cindex anchors
1183
1184 This operator can match the empty string either at the beginning of the
1185 string or after a newline character.  Thus, it is said to @dfn{anchor}
1186 the pattern to the beginning of a line.
1187
1188 In the cases following, @samp{^} represents this operator.  (Otherwise,
1189 @samp{^} is ordinary.)
1190
1191 @itemize @bullet
1192
1193 @item
1194 It (the @samp{^}) is first in the pattern, as in @samp{^foo}.
1195
1196 @cnindex RE_CONTEXT_INDEP_ANCHORS @r{(and @samp{^})}
1197 @item
1198 The syntax bit @code{RE_CONTEXT_INDEP_ANCHORS} is set, and it is outside
1199 a bracket expression.
1200
1201 @cindex open-group operator and @samp{^}
1202 @cindex alternation operator and @samp{^}
1203 @item
1204 It follows an open-group or alternation operator, as in @samp{a\(^b\)}
1205 and @samp{a\|^b}.  @xref{Grouping Operators}, and @ref{Alternation
1206 Operator}.
1207
1208 @end itemize
1209
1210 These rules imply that some valid patterns containing @samp{^} cannot be
1211 matched; for example, @samp{foo^bar} if @code{RE_CONTEXT_INDEP_ANCHORS}
1212 is set.
1213
1214 @vindex not_bol @r{field in pattern buffer}
1215 If the @code{not_bol} field is set in the pattern buffer (@pxref{GNU
1216 Pattern Buffers}), then @samp{^} fails to match at the beginning of the
1217 string.  This lets you match against pieces of a line, as you would need to if,
1218 say, searching for repeated instances of a given pattern in a line; it
1219 would work correctly for patterns both with and without
1220 match-beginning-of-line operators.
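
The following sketch shows the repeated-search scenario just
described.  It uses the POSIX interface, where the @code{REG_NOTBOL}
flag to @code{regexec} serves the same purpose as the @code{not_bol}
field; the function name @code{count_matches} is invented for the
example.

```c
#include <regex.h>

/* Count the non-overlapping matches of the POSIX extended PATTERN in
   LINE.  Return -1 if PATTERN is invalid.  */
int
count_matches (const char *pattern, const char *line)
{
  regex_t re;
  regmatch_t m;
  const char *p = line;
  int count = 0;

  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  /* Only the first call sees the true beginning of the line; later
     calls pass REG_NOTBOL so '^' cannot match at the resume point.  */
  while (regexec (&re, p, 1, &m, p == line ? 0 : REG_NOTBOL) == 0)
    {
      count++;
      if (m.rm_eo > m.rm_so)
        p += m.rm_eo;            /* resume after the match */
      else if (p[m.rm_eo] != '\0')
        p += m.rm_eo + 1;        /* step past an empty match */
      else
        break;                   /* empty match at end of line */
    }
  regfree (&re);
  return count;
}
```

With this loop, @code{count_matches ("a", "aaa")} returns 3 but
@code{count_matches ("^a", "aaa")} returns 1; without the flag, the
anchored pattern would wrongly match again at each resume point.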
1221
1222
1223 @node Match-end-of-line Operator
1224 @subsection The Match-end-of-line Operator (@code{$})
1225
1226 @kindex $
1227 @cindex end-of-line operator
1228 @cindex anchors
1229
1230 This operator can match the empty string either at the end of
1231 the string or before a newline character in the string.  Thus, it is
1232 said to @dfn{anchor} the pattern to the end of a line.
1233
1234 It is always represented by @samp{$}.  For example, @samp{foo$} usually
1235 matches @samp{foo} and also the first three characters of
1236 @samp{foo\nbar}.
1237
1238 Its interaction with the syntax bits and pattern buffer fields is
1239 exactly the dual of @samp{^}'s; see the previous section.  (That is,
1240 ``@samp{^}'' becomes ``@samp{$}'', ``beginning'' becomes ``end'',
1241 ``next'' becomes ``previous'', ``after'' becomes ``before'', and
1242 ``@code{not_bol}'' becomes ``@code{not_eol}''.)
1243
1244
1245 @node GNU Operators
1246 @chapter GNU Operators
1247
1248 The following are operators that GNU defines (and POSIX doesn't) that
1249 you can use unless the syntax bit @code{RE_NO_GNU_OPS} is set.
1250
1251 @menu
1252 * Word Operators::
1253 * Space Operators::
1254 * Whole-string Operators::
1255 @end menu
1256
1257 @node Word Operators
1258 @section Word Operators
1259
1260 The operators in this section require Regex to recognize parts of words.
1261 Characters that are part of words, which are called
1262 @dfn{word-constituent}, are letters, digits, and the underscore
1263 (@samp{_}); more precisely, any character in the POSIX class
1264 @code{alnum} in the current locale, or underscore.
1265
1266 @menu
1267 * Match-word-boundary Operator::        \b
1268 * Match-within-word Operator::          \B
1269 * Match-beginning-of-word Operator::    \<
1270 * Match-end-of-word Operator::          \>
1271 * Match-word-constituent Operator::     \w
1272 * Match-non-word-constituent Operator:: \W
1273 @end menu
1274
1275 @node Match-word-boundary Operator
1276 @subsection The Match-word-boundary Operator (@code{\b})
1277
1278 @cindex @samp{\b}
1279 @cindex word boundaries, matching
1280
1281 This operator (represented by @samp{\b}) matches the empty string at
1282 either the beginning or the end of a word.  For example, @samp{\brat\b}
1283 matches the separate word @samp{rat}.
1284
1285 @node Match-within-word Operator
1286 @subsection The Match-within-word Operator (@code{\B})
1287
1288 @cindex @samp{\B}
1289
1290 This operator (represented by @samp{\B}) matches the empty string within
1291 a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but
1292 @samp{dirty \Brat} doesn't match @samp{dirty rat}.
1293
1294 @node Match-beginning-of-word Operator
1295 @subsection The Match-beginning-of-word Operator (@code{\<})
1296
1297 @cindex @samp{\<}
1298
1299 This operator (represented by @samp{\<}) matches the empty string at the
1300 beginning of a word.
1301
1302 @node Match-end-of-word Operator
1303 @subsection The Match-end-of-word Operator (@code{\>})
1304
1305 @cindex @samp{\>}
1306
1307 This operator (represented by @samp{\>}) matches the empty string at the
1308 end of a word.
1309
1310 @node Match-word-constituent Operator
1311 @subsection The Match-word-constituent Operator (@code{\w})
1312
1313 @cindex @samp{\w}
1314
1315 This operator (represented by @samp{\w}) matches any word-constituent
1316 character.
1317
1318 @node Match-non-word-constituent Operator
1319 @subsection The Match-non-word-constituent Operator (@code{\W})
1320
1321 @cindex @samp{\W}
1322
1323 This operator (represented by @samp{\W}) matches any character that is
1324 not word-constituent.
1325
1326
1327 @node Space Operators
1328 @section Space Operators
1329
1330 @node Match-space Operator
1331 @subsection The Match-space Operator (@code{\s})
1332
1333 @cindex @samp{\s}
1334
1335 This operator (represented by @samp{\s}) matches any space
1336 character (that is, in the POSIX class @code{[:space:]}).
1337
1338 @node Match-non-space Operator
1339 @subsection The Match-non-space Operator (@code{\S})
1340
1341 @cindex @samp{\S}
1342
1343 This operator (represented by @samp{\S}) matches any character
1344 that is not a space (that is, not in the POSIX class @code{[:space:]}).
1345
1346
1347 @node Whole-string Operators
1348 @section Whole-string Operators
1349
1350 Following are operators which work on the whole string.
1351
1352 @menu
1353 * Match-beginning-of-string Operator::  \`
1354 * Match-end-of-string Operator::        \'
1355 @end menu
1356
1357
1358 @node Match-beginning-of-string Operator
1359 @subsection The Match-beginning-of-string Operator (@code{\`})
1360
1361 @cindex @samp{\`}
1362
1363 This operator (represented by @samp{\`}) matches the empty string at the
1364 beginning of the string.
1365
1366 @node Match-end-of-string Operator
1367 @subsection The Match-end-of-string Operator (@code{\'})
1368
1369 @cindex @samp{\'}
1370
1371 This operator (represented by @samp{\'}) matches the empty string at the
1372 end of the string.
1373
1374
1375 @node What Gets Matched?
1376 @chapter What Gets Matched?
1377
1378 Regex usually matches strings according to the ``leftmost longest''
1379 rule; that is, it chooses the longest of the leftmost matches.  This
1380 does not mean that, for a regular expression containing subexpressions,
1381 it simply chooses the longest match for each subexpression, left to
1382 right; the overall match must also be the longest possible one.
1383
1384 For example, @samp{(ac*)(c*d[ac]*)\1} matches @samp{acdacaaa}, not
1385 @samp{acdac}, as it would if it were to choose the longest match for the
1386 first subexpression.
1387
1388
1389 @node Programming with Regex
1390 @chapter Programming with Regex
1391
1392 Here we describe how you use the Regex data structures and functions in
1393 C programs.  Regex has three interfaces: one designed for GNU, one
1394 compatible with POSIX (as specified by POSIX, draft
1395 1003.2/D11.2), and one compatible with Berkeley Unix.  The
1396 POSIX interface is not documented here; see the documentation of
1397 GNU libc, or the POSIX man pages.  The Berkeley Unix interface is
1398 documented here for convenience, since its documentation is not
1399 otherwise readily available on GNU systems.
1400
1401 @menu
1402 * GNU Regex Functions::
1403 * BSD Regex Functions::
1404 @end menu
1405
1406
1407 @node GNU Regex Functions
1408 @section GNU Regex Functions
1409
1410 If you're writing code that doesn't need to be compatible with either
1411 POSIX or Berkeley Unix, you can use these functions.  They
1412 provide more options than the other interfaces.
1413
1414 @menu
1415 * GNU Pattern Buffers::         The re_pattern_buffer type.
1416 * GNU Regular Expression Compiling::  re_compile_pattern ()
1417 * GNU Matching::                re_match ()
1418 * GNU Searching::               re_search ()
1419 * Matching/Searching with Split Data::  re_match_2 (), re_search_2 ()
1420 * Searching with Fastmaps::     re_compile_fastmap ()
1421 * GNU Translate Tables::        The @code{translate} field.
1422 * Using Registers::             The re_registers type and related fns.
1423 * Freeing GNU Pattern Buffers::  regfree ()
1424 @end menu
1425
1426
1427 @node GNU Pattern Buffers
1428 @subsection GNU Pattern Buffers
1429
1430 @cindex pattern buffer, definition of
1431 @tindex re_pattern_buffer @r{definition}
1432 @tindex struct re_pattern_buffer @r{definition}
1433
1434 To compile, match, or search for a given regular expression, you must
1435 supply a pattern buffer.  A @dfn{pattern buffer} holds one compiled
1436 regular expression.@footnote{Regular expressions are also referred to as
1437 ``patterns,'' hence the name ``pattern buffer.''}
1438
1439 You can have several different pattern buffers simultaneously, each
1440 holding a compiled pattern for a different regular expression.
1441
1442 @file{regex.h} defines the pattern buffer @code{struct} with the
1443 following public fields:
1444
1445 @example
1446   unsigned char *buffer;
1447   unsigned long allocated;
1448   char *fastmap;
1449   char *translate;
1450   size_t re_nsub;
1451   unsigned no_sub : 1;
1452   unsigned not_bol : 1;
1453   unsigned not_eol : 1;
1454 @end example
1455
1456
1457 @node GNU Regular Expression Compiling
1458 @subsection GNU Regular Expression Compiling
1459
1460 In GNU, you can both match and search for a given regular
1461 expression.  To do either, you must first compile it in a pattern buffer
1462 (@pxref{GNU Pattern Buffers}).
1463
1464 @cindex syntax initialization
1465 @vindex re_syntax_options @r{initialization}
1466 Regular expressions match according to the syntax with which they were
1467 compiled; with GNU, you indicate what syntax you want by setting
1468 the variable @code{re_syntax_options} (declared in @file{regex.h})
1469 before calling the compiling function, @code{re_compile_pattern} (see
1470 below).  @xref{Syntax Bits}, and @ref{Predefined Syntaxes}.
1471
1472 You can change the value of @code{re_syntax_options} at any time.
1473 Usually, however, you set its value once and then never change it.
1474
1475 @cindex pattern buffer initialization
1476 @code{re_compile_pattern} takes a pattern buffer as an argument.  You
1477 must initialize the following fields:
1478
1479 @table @code
1480
1483 @item translate
1484 @vindex translate @r{initialization}
1485 Initialize this to point to a translate table if you want one, or to
1486 zero if you don't.  We explain translate tables in @ref{GNU Translate
1487 Tables}.
1488
1489 @item fastmap
1490 @vindex fastmap @r{initialization}
1491 Initialize this to nonzero if you want a fastmap, or to zero if you
1492 don't.
1493
1494 @item buffer
1495 @itemx allocated
1496 @vindex buffer @r{initialization}
1497 @vindex allocated @r{initialization}
1498 @findex malloc
1499 If you want @code{re_compile_pattern} to allocate memory for the
1500 compiled pattern, set both of these to zero.  If you have an existing
1501 block of memory (allocated with @code{malloc}) you want Regex to use,
1502 set @code{buffer} to its address and @code{allocated} to its size (in
1503 bytes).
1504
1505 @code{re_compile_pattern} uses @code{realloc} to extend the space for
1506 the compiled pattern as necessary.
1507
1508 @end table
1509
1510 To compile a pattern buffer, use:
1511
1512 @findex re_compile_pattern
1513 @example
1514 char *
1515 re_compile_pattern (const char *@var{regex}, const int @var{regex_size},
1516                     struct re_pattern_buffer *@var{pattern_buffer})
1517 @end example
1518
1519 @noindent
1520 @var{regex} is the regular expression's address, @var{regex_size} is its
1521 length, and @var{pattern_buffer} is the pattern buffer's address.
1522
1523 If @code{re_compile_pattern} successfully compiles the regular
1524 expression, it returns zero and sets @code{*@var{pattern_buffer}} to the
1525 compiled pattern.  It sets the pattern buffer's fields as follows:
1526
1527 @table @code
1528 @item buffer
1529 @vindex buffer @r{field, set by @code{re_compile_pattern}}
1530 to the compiled pattern.
1531
1532 @item syntax
1533 @vindex syntax @r{field, set by @code{re_compile_pattern}}
1534 to the current value of @code{re_syntax_options}.
1535
1536 @item re_nsub
1537 @vindex re_nsub @r{field, set by @code{re_compile_pattern}}
1538 to the number of subexpressions in @var{regex}.
1539
1540 @end table
1541
1542 If @code{re_compile_pattern} can't compile @var{regex}, it returns an
1543 error string corresponding to a POSIX error code.
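
Putting the initialization and compilation steps together might look
like the sketch below.  It assumes the GNU C library, where
@code{_GNU_SOURCE} must be defined before including @file{regex.h} to
expose the GNU functions, and it selects the syntax with
@code{re_set_syntax} and @code{RE_SYNTAX_POSIX_EXTENDED}.  The helper
names @code{compile_gnu} and @code{count_groups} are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* assumption: glibc hides re_* without this */
#endif
#include <regex.h>
#include <string.h>

/* Compile PATTERN into *BUF using POSIX extended syntax.  Return NULL
   on success or an error message on failure.  Note that this changes
   the global syntax setting.  */
const char *
compile_gnu (const char *pattern, struct re_pattern_buffer *buf)
{
  /* Zeroing the buffer asks Regex to allocate the compiled pattern and
     requests neither a translate table nor a fastmap.  */
  memset (buf, 0, sizeof *buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  return re_compile_pattern (pattern, strlen (pattern), buf);
}

/* Return the number of groups in PATTERN, or -1 if it is invalid.  */
long
count_groups (const char *pattern)
{
  struct re_pattern_buffer buf;
  long n = -1;

  if (compile_gnu (pattern, &buf) == NULL)
    n = (long) buf.re_nsub;
  regfree (&buf);
  return n;
}
```

For example, @code{count_groups ("(a)(b(c))")} returns 3, and
@code{count_groups ("a(")} returns @minus{}1 because the unmatched
open-group operator makes compilation fail.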
1544
1545
1546 @node GNU Matching
1547 @subsection GNU Matching
1548
1549 @cindex matching with GNU functions
1550
1551 Matching the GNU way means trying to match as much of a string as
1552 possible starting at a position within it you specify.  Once you've compiled
1553 a pattern into a pattern buffer (@pxref{GNU Regular Expression
1554 Compiling}), you can ask the matcher to match that pattern against a
1555 string using:
1556
1557 @findex re_match
1558 @example
1559 int
1560 re_match (struct re_pattern_buffer *@var{pattern_buffer},
1561           const char *@var{string}, const int @var{size},
1562           const int @var{start}, struct re_registers *@var{regs})
1563 @end example
1564
1565 @noindent
1566 @var{pattern_buffer} is the address of a pattern buffer containing a
1567 compiled pattern.  @var{string} is the string you want to match; it can
1568 contain newline and null characters.  @var{size} is the length of that
1569 string.  @var{start} is the string index at which you want to
1570 begin matching; the first character of @var{string} is at index zero.
1571 @xref{Using Registers}, for an explanation of @var{regs}; you can safely
1572 pass zero.
1573
1574 @code{re_match} matches the regular expression in @var{pattern_buffer}
1575 against the string @var{string} according to the syntax of
1576 @var{pattern_buffer}.  (@xref{GNU Regular Expression Compiling}, for how
to set it.)  The function returns @math{-1} if the compiled pattern does
not match at @var{start} and @math{-2} if an internal error
happens; otherwise, it returns how many (possibly zero) characters of
@var{string} the pattern matched.
1581
An example: suppose @var{pattern_buffer} points to a pattern buffer
containing the compiled pattern for @samp{a*}, and @var{string} points
to @samp{aaaaab} (so @var{size} should be 6).  Then if @var{start}
is 2, @code{re_match} returns 3, i.e., @samp{a*} matches the
last three @samp{a}s in @var{string}.  If @var{start} is 0,
@code{re_match} returns 5, i.e., @samp{a*} matches all the
@samp{a}s in @var{string}.  If @var{start} is either 5 or 6, it returns
zero, because @samp{a*} matches the empty string there.
1590
If @var{start} is not between zero and @var{size} (inclusive), then
@code{re_match} returns @math{-1}.
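
The example above can be tried with a short sketch such as the
following, which assumes the glibc implementation and the POSIX
extended syntax.  The helper @code{match_at} and its @math{-3} return
value for a compile failure are hypothetical conveniences, chosen so
that they cannot be confused with the values @code{re_match} returns.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <string.h>

/* Return what re_match reports for REGEX against STRING at index
   START, or -3 if REGEX does not compile.  Hypothetical helper.  */
int
match_at (const char *regex, const char *string, int start)
{
  struct re_pattern_buffer buf;
  int result;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);   /* assumed syntax */
  if (re_compile_pattern (regex, strlen (regex), &buf) != NULL)
    return -3;

  /* Pass zero for REGS: only the match length is wanted here.  */
  result = (int) re_match (&buf, string, (int) strlen (string), start, 0);

  regfree (&buf);
  return result;
}
```

With these definitions, @code{match_at ("a*", "aaaaab", 2)} should give
3 and @code{match_at ("a*", "aaaaab", 7)} should give @math{-1}, as
described above.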
1593
1594
1595 @node GNU Searching
1596 @subsection GNU Searching
1597
1598 @cindex searching with GNU functions
1599
1600 @dfn{Searching} means trying to match starting at successive positions
1601 within a string.  The function @code{re_search} does this.
1602
1603 Before calling @code{re_search}, you must compile your regular
1604 expression.  @xref{GNU Regular Expression Compiling}.
1605
1606 Here is the function declaration:
1607
1608 @findex re_search
1609 @example
1610 int
1611 re_search (struct re_pattern_buffer *@var{pattern_buffer},
1612            const char *@var{string}, const int @var{size},
1613            const int @var{start}, const int @var{range},
1614            struct re_registers *@var{regs})
1615 @end example
1616
1617 @noindent
1618 @vindex start @r{argument to @code{re_search}}
1619 @vindex range @r{argument to @code{re_search}}
1620 whose arguments are the same as those to @code{re_match} (@pxref{GNU
1621 Matching}) except that the two arguments @var{start} and @var{range}
1622 replace @code{re_match}'s argument @var{start}.
1623
1624 If @var{range} is positive, then @code{re_search} attempts a match
1625 starting first at index @var{start}, then at @math{@var{start} + 1} if
1626 that fails, and so on, up to @math{@var{start} + @var{range}}; if
1627 @var{range} is negative, then it attempts a match starting first at
index @var{start}, then at @math{@var{start} - 1} if that fails, and so
on, down to @math{@var{start} + @var{range}}.
1630
1631 If @var{start} is not between zero and @var{size}, then @code{re_search}
1632 returns @math{-1}.  When @var{range} is positive, @code{re_search}
1633 adjusts @var{range} so that @math{@var{start} + @var{range} - 1} is
1634 between zero and @var{size}, if necessary; that way it won't search
1635 outside of @var{string}.  Similarly, when @var{range} is negative,
1636 @code{re_search} adjusts @var{range} so that @math{@var{start} +
1637 @var{range} + 1} is between zero and @var{size}, if necessary.
1638
1639 If the @code{fastmap} field of @var{pattern_buffer} is zero,
1640 @code{re_search} matches starting at consecutive positions; otherwise,
1641 it uses @code{fastmap} to make the search more efficient.
1642 @xref{Searching with Fastmaps}.
1643
1644 If no match is found, @code{re_search} returns @math{-1}.  If
1645 a match is found, it returns the index where the match began.  If an
1646 internal error happens, it returns @math{-2}.
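
The following sketch shows a forward and a backward search with
@code{re_search}.  It assumes the glibc implementation and the POSIX
extended syntax; the helper name @code{search_from} and its @math{-3}
return value for a compile failure are hypothetical.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <string.h>

/* Search STRING for REGEX, trying start positions from START through
   START + RANGE.  Return the index where the match began, -1 if there
   is none, or -3 if REGEX does not compile.  Hypothetical helper.  */
int
search_from (const char *regex, const char *string, int start, int range)
{
  struct re_pattern_buffer buf;
  int result;

  memset (&buf, 0, sizeof buf);               /* no fastmap, no translate */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);   /* assumed syntax */
  if (re_compile_pattern (regex, strlen (regex), &buf) != NULL)
    return -3;

  result = (int) re_search (&buf, string, (int) strlen (string),
                            start, range, 0);
  regfree (&buf);
  return result;
}
```

For the 11-character string @samp{foo bar foo}, a forward search
(@var{start} 0, @var{range} 11) should report the first @samp{foo} at
index 0, while a backward search (@var{start} 11, @var{range}
@math{-11}) should report the last one at index 8.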
1647
1648
1649 @node Matching/Searching with Split Data
1650 @subsection Matching and Searching with Split Data
1651
1652 Using the functions @code{re_match_2} and @code{re_search_2}, you can
1653 match or search in data that is divided into two strings.
1654
1655 The function:
1656
1657 @findex re_match_2
1658 @example
1659 int
1660 re_match_2 (struct re_pattern_buffer *@var{buffer},
1661             const char *@var{string1}, const int @var{size1},
1662             const char *@var{string2}, const int @var{size2},
1663             const int @var{start},
1664             struct re_registers *@var{regs},
1665             const int @var{stop})
1666 @end example
1667
1668 @noindent
1669 is similar to @code{re_match} (@pxref{GNU Matching}) except that you
1670 pass @emph{two} data strings and sizes, and an index @var{stop} beyond
1671 which you don't want the matcher to try matching.  As with
@code{re_match}, if it succeeds, @code{re_match_2} returns how many
characters of the combined data it matched.  Regard @var{string1} and
1674 @var{string2} as concatenated when you set the arguments @var{start} and
1675 @var{stop} and use the contents of @var{regs}; @code{re_match_2} never
1676 returns a value larger than @math{@var{size1} + @var{size2}}.
1677
1678 The function:
1679
1680 @findex re_search_2
1681 @example
1682 int
1683 re_search_2 (struct re_pattern_buffer *@var{buffer},
1684              const char *@var{string1}, const int @var{size1},
1685              const char *@var{string2}, const int @var{size2},
1686              const int @var{start}, const int @var{range},
1687              struct re_registers *@var{regs},
1688              const int @var{stop})
1689 @end example
1690
1691 @noindent
1692 is similarly related to @code{re_search}.
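
As a sketch of searching split data, assuming the glibc implementation
and the POSIX extended syntax, the helper below searches the
concatenation of two strings without joining them first.  The name
@code{search_split} and its @math{-3} return value for a compile
failure are hypothetical.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <string.h>

/* Search for REGEX in the data formed by S1 followed by S2.  Return
   the index (counted across both strings) where the match began, -1
   if there is none, or -3 if REGEX does not compile.  */
int
search_split (const char *regex, const char *s1, const char *s2)
{
  struct re_pattern_buffer buf;
  int size1 = (int) strlen (s1);
  int size2 = (int) strlen (s2);
  int total = size1 + size2;
  int result;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);   /* assumed syntax */
  if (re_compile_pattern (regex, strlen (regex), &buf) != NULL)
    return -3;

  /* START, RANGE and STOP all refer to the concatenated data.  */
  result = (int) re_search_2 (&buf, s1, size1, s2, size2,
                              0, total, 0, total);
  regfree (&buf);
  return result;
}
```

A match may straddle the boundary: searching for @samp{foo} in
@samp{xxfo} followed by @samp{oxx} should report index 2.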
1693
1694
1695 @node Searching with Fastmaps
1696 @subsection Searching with Fastmaps
1697
1698 @cindex fastmaps
1699 If you're searching through a long string, you should use a fastmap.
1700 Without one, the searcher tries to match at consecutive positions in the
1701 string.  Generally, most of the characters in the string could not start
1702 a match.  It takes much longer to try matching at a given position in the
1703 string than it does to check in a table whether or not the character at
1704 that position could start a match.  A @dfn{fastmap} is such a table.
1705
1706 More specifically, a fastmap is an array indexed by the characters in
1707 your character set.  Under the ASCII encoding, therefore, a fastmap
1708 has 256 elements.  If you want the searcher to use a fastmap with a
1709 given pattern buffer, you must allocate the array and assign the array's
address to the pattern buffer's @code{fastmap} field.  You can either
compile the fastmap yourself or have @code{re_search} do it for you;
when @code{fastmap} is nonzero, @code{re_search} automatically compiles
a fastmap the first time you search using a particular compiled pattern.
1714
1715 By setting the buffer's @code{fastmap} field before calling
1716 @code{re_compile_pattern}, you can reuse a buffer data structure across
1717 multiple searches with different patterns, and allocate the fastmap only
1718 once.  Nonetheless, the fastmap must be recompiled each time the buffer
1719 has a new pattern compiled into it.
1720
1721 To compile a fastmap yourself, use:
1722
1723 @findex re_compile_fastmap
1724 @example
1725 int
1726 re_compile_fastmap (struct re_pattern_buffer *@var{pattern_buffer})
1727 @end example
1728
1729 @noindent
1730 @var{pattern_buffer} is the address of a pattern buffer.  If the
1731 character @var{c} could start a match for the pattern,
1732 @code{re_compile_fastmap} makes
1733 @code{@var{pattern_buffer}->fastmap[@var{c}]} nonzero.  It returns
1734 @math{0} if it can compile a fastmap and @math{-2} if there is an
1735 internal error.  For example, if @samp{|} is the alternation operator
1736 and @var{pattern_buffer} holds the compiled pattern for @samp{a|b}, then
1737 @code{re_compile_fastmap} sets @code{fastmap['a']} and
1738 @code{fastmap['b']} (and no others).
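
A minimal sketch of compiling a fastmap yourself follows.  It assumes
the glibc implementation (where @code{regfree} also releases the
fastmap, so the array comes from @code{malloc}) and the POSIX extended
syntax, in which @samp{|} is the alternation operator.  The helper
@code{could_start} is hypothetical.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if character C could start a match for REGEX according to
   the compiled fastmap, 0 if it could not, and -1 on any failure.  */
int
could_start (const char *regex, unsigned char c)
{
  struct re_pattern_buffer buf;
  int result;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);   /* assumed syntax */

  /* One entry per possible character value.  */
  buf.fastmap = malloc (256);
  if (buf.fastmap == NULL)
    return -1;

  if (re_compile_pattern (regex, strlen (regex), &buf) != NULL
      || re_compile_fastmap (&buf) != 0)
    {
      regfree (&buf);       /* in glibc this also frees the fastmap */
      return -1;
    }

  result = buf.fastmap[c] != 0;
  regfree (&buf);
  return result;
}
```

For the pattern @samp{a|b}, this should report that @samp{a} and
@samp{b} can start a match and that @samp{c} cannot.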
1739
1740 @code{re_search} uses a fastmap as it moves along in the string: it
1741 checks the string's characters until it finds one that's in the fastmap.
1742 Then it tries matching at that character.  If the match fails, it
1743 repeats the process.  So, by using a fastmap, @code{re_search} doesn't
1744 waste time trying to match at positions in the string that couldn't
1745 start a match.
1746
1747 If you don't want @code{re_search} to use a fastmap,
1748 store zero in the @code{fastmap} field of the pattern buffer before
1749 calling @code{re_search}.
1750
1751 Once you've initialized a pattern buffer's @code{fastmap} field, you
1752 need never do so again---even if you compile a new pattern in
1753 it---provided the way the field is set still reflects whether or not you
1754 want a fastmap.  @code{re_search} will still either do nothing if
1755 @code{fastmap} is null or, if it isn't, compile a new fastmap for the
1756 new pattern.
1757
1758 @node GNU Translate Tables
1759 @subsection GNU Translate Tables
1760
1761 If you set the @code{translate} field of a pattern buffer to a translate
1762 table, then the GNU Regex functions to which you've passed that
1763 pattern buffer use it to apply a simple transformation
1764 to all the regular expression and string characters at which they look.
1765
1766 A @dfn{translate table} is an array indexed by the characters in your
1767 character set.  Under the ASCII encoding, therefore, a translate
1768 table has 256 elements.  The array's elements are also characters in
1769 your character set.  When the Regex functions see a character @var{c},
1770 they use @code{translate[@var{c}]} in its place, with one exception: the
character after a @samp{\} is not translated.  (This ensures that the
operators, e.g., @samp{\B} and @samp{\b}, are always distinguishable.)
1773
1774 For example, a table that maps all lowercase letters to the
1775 corresponding uppercase ones would cause the matcher to ignore
1776 differences in case.@footnote{A table that maps all uppercase letters to
1777 the corresponding lowercase ones would work just as well for this
1778 purpose.}  Such a table would map all characters except lowercase letters
1779 to themselves, and lowercase letters to the corresponding uppercase
1780 ones.  Under the ASCII encoding, here's how you could initialize
1781 such a table (we'll call it @code{case_fold}):
1782
1783 @example
1784 for (i = 0; i < 256; i++)
1785   case_fold[i] = i;
1786 for (i = 'a'; i <= 'z'; i++)
1787   case_fold[i] = i - ('a' - 'A');
1788 @end example
1789
1790 You tell Regex to use a translate table on a given pattern buffer by
1791 assigning that table's address to the @code{translate} field of that
1792 buffer.  If you don't want Regex to do any translation, put zero into
this field.  The results are unpredictable if you change the table's
contents at any time between compiling the pattern buffer, compiling its
fastmap, and matching or searching with the pattern buffer.
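
Putting the pieces together, here is a sketch of a case-insensitive
match that uses the @code{case_fold} table from above.  It assumes the
glibc implementation, where @code{regfree} also releases the translate
table (hence the @code{malloc}), and the POSIX extended syntax.  The
helper @code{match_ignoring_case} and its @math{-3} failure value are
hypothetical.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Match REGEX against the start of STRING, ignoring case.  Return
   the re_match result, or -3 on allocation or compile failure.  */
int
match_ignoring_case (const char *regex, const char *string)
{
  struct re_pattern_buffer buf;
  unsigned char *case_fold;
  int i, result;

  case_fold = malloc (256);
  if (case_fold == NULL)
    return -3;

  /* Map every character to itself, then lowercase to uppercase
     (this relies on the ASCII encoding).  */
  for (i = 0; i < 256; i++)
    case_fold[i] = (unsigned char) i;
  for (i = 'a'; i <= 'z'; i++)
    case_fold[i] = (unsigned char) (i - ('a' - 'A'));

  memset (&buf, 0, sizeof buf);
  buf.translate = case_fold;     /* must be set before compiling */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);   /* assumed syntax */

  if (re_compile_pattern (regex, strlen (regex), &buf) != NULL)
    {
      regfree (&buf);            /* in glibc this frees case_fold too */
      return -3;
    }

  result = (int) re_match (&buf, string, (int) strlen (string), 0, 0);
  regfree (&buf);
  return result;
}
```

Here the table is filled in once and never changed afterward, in
keeping with the warning above.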
1796
1797 @node Using Registers
1798 @subsection Using Registers
1799
1800 A group in a regular expression can match a (possibly empty) substring
1801 of the string that regular expression as a whole matched.  The matcher
1802 remembers the beginning and end of the substring matched by
1803 each group.
1804
1805 To find out what they matched, pass a nonzero @var{regs} argument to a
1806 GNU matching or searching function (@pxref{GNU Matching} and
1807 @ref{GNU Searching}), i.e., the address of a structure of this type, as
1808 defined in @file{regex.h}:
1809
1810 @c We don't bother to include this directly from regex.h,
1811 @c since it changes so rarely.
1812 @example
1813 @tindex re_registers
1814 @vindex num_regs @r{in @code{struct re_registers}}
1815 @vindex start @r{in @code{struct re_registers}}
1816 @vindex end @r{in @code{struct re_registers}}
1817 struct re_registers
1818 @{
1819   unsigned num_regs;
1820   regoff_t *start;
1821   regoff_t *end;
1822 @};
1823 @end example
1824
1825 Except for (possibly) the @var{num_regs}'th element (see below), the
1826 @var{i}th element of the @code{start} and @code{end} arrays records
1827 information about the @var{i}th group in the pattern.  (They're declared
1828 as C pointers, but this is only because not all C compilers accept
1829 zero-length arrays; conceptually, it is simplest to think of them as
1830 arrays.)
1831
1832 The @code{start} and @code{end} arrays are allocated in one of two ways.
1833 The simplest and perhaps most useful is to let the matcher (re)allocate
1834 enough space to record information for all the groups in the regular
1835 expression.  If @code{re_set_registers} is not called before searching
1836 or matching, then the matcher allocates two arrays each of @math{1 +
1837 @var{re_nsub}} elements (@var{re_nsub} is another field in the pattern
1838 buffer; @pxref{GNU Pattern Buffers}).  The extra element is set to
1839 @math{-1}.  Then on subsequent calls with the same pattern buffer and
1840 @var{regs} arguments, the matcher reallocates more space if necessary.
1841
1842 The function:
1843
1844 @findex re_set_registers
1845 @example
1846 void
1847 re_set_registers (struct re_pattern_buffer *@var{buffer},
1848                   struct re_registers *@var{regs},
1849                   size_t @var{num_regs},
1850                   regoff_t *@var{starts}, regoff_t *@var{ends})
1851 @end example
1852
1853 @noindent sets @var{regs} to hold @var{num_regs} registers, storing
1854 them in @var{starts} and @var{ends}.  Subsequent matches using
1855 @var{buffer} and @var{regs} will use this memory for recording
register information.  @var{starts} and @var{ends} must be allocated
with @code{malloc}, and must each be at least @math{@var{num_regs} *
@code{sizeof (regoff_t)}} bytes long.
1859
If @var{num_regs} is zero, then subsequent matches allocate
their own register data.
1862
1863 Unless this function is called, the first search or match using
1864 @var{buffer} will allocate its own register data, without freeing the
1865 old data.
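
As a sketch of supplying your own register arrays, assuming the glibc
implementation and the POSIX extended syntax, the helper below hands
two @code{malloc}ed arrays to @code{re_set_registers} before searching.
The name @code{whole_match_end} and the choice of two elements beyond
@code{re_nsub} are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Search STRING for REGEX using caller-supplied register arrays.
   Return the index just past the end of the whole match, or -1 if
   nothing matched or something failed.  Hypothetical helper.  */
int
whole_match_end (const char *regex, const char *string)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  regoff_t *starts, *ends;
  size_t n;
  int len = (int) strlen (string);
  int result = -1;

  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);   /* assumed syntax */
  if (re_compile_pattern (regex, strlen (regex), &buf) != NULL)
    return -1;

  /* Register 0, one register per group, and one spare element.  */
  n = buf.re_nsub + 2;
  starts = malloc (n * sizeof *starts);
  ends = malloc (n * sizeof *ends);
  if (starts != NULL && ends != NULL)
    {
      re_set_registers (&buf, &regs, n, starts, ends);
      if (re_search (&buf, string, len, 0, len, &regs) >= 0)
        result = (int) regs.end[0];
      /* The matcher may have reallocated the arrays, so free them
         through REGS rather than through the original pointers.  */
      free (regs.start);
      free (regs.end);
    }
  else
    {
      free (starts);
      free (ends);
    }

  regfree (&buf);
  return result;
}
```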
1866
1867 The following examples illustrate the information recorded in the
1868 @code{re_registers} structure.  (In all of them, @samp{(} represents the
1869 open-group and @samp{)} the close-group operator.  The first character
1870 in the string @var{string} is at index 0.)
1871
1872 @itemize @bullet
1873
1874 @item
1875 If the regular expression has an @w{@var{i}-th}
1876 group that matches a
1877 substring of @var{string}, then the function sets
1878 @code{@w{@var{regs}->}start[@var{i}]} to the index in @var{string} where
1879 the substring matched by the @w{@var{i}-th} group begins, and
1880 @code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that
1881 substring's end.  The function sets @code{@w{@var{regs}->}start[0]} and
1882 @code{@w{@var{regs}->}end[0]} to analogous information about the entire
1883 pattern.
1884
1885 For example, when you match @samp{((a)(b))} against @samp{ab}, you get:
1886
1887 @itemize
1888 @item
1889 0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]}
1890
1891 @item
1892 0 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]}
1893
1894 @item
1895 0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]}
1896
1897 @item
1898 1 in @code{@w{@var{regs}->}start[3]} and 2 in @code{@w{@var{regs}->}end[3]}
1899 @end itemize
1900
1901 @item
1902 If a group matches more than once (as it might if followed by,
1903 e.g., a repetition operator), then the function reports the information
1904 about what the group @emph{last} matched.
1905
1906 For example, when you match the pattern @samp{(a)*} against the string
1907 @samp{aa}, you get:
1908
1909 @itemize
1910 @item
1911 0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]}
1912
1913 @item
1914 1 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]}
1915 @end itemize
1916
1917 @item
1918 If the @w{@var{i}-th} group does not participate in a
1919 successful match, e.g., it is an alternative not taken or a
1920 repetition operator allows zero repetitions of it, then the function
1921 sets @code{@w{@var{regs}->}start[@var{i}]} and
1922 @code{@w{@var{regs}->}end[@var{i}]} to @math{-1}.
1923
1924 For example, when you match the pattern @samp{(a)*b} against
1925 the string @samp{b}, you get:
1926
1927 @itemize
1928 @item
1929 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]}
1930
1931 @item
1932 @math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]}
1933 @end itemize
1934
1935 @item
1936 If the @w{@var{i}-th} group matches a zero-length string, then the
1937 function sets @code{@w{@var{regs}->}start[@var{i}]} and
1938 @code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that
1939 zero-length string.
1940
1941 For example, when you match the pattern @samp{(a*)b} against the string
1942 @samp{b}, you get:
1943
1944 @itemize
1945 @item
1946 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]}
1947
1948 @item
1949 0 in @code{@w{@var{regs}->}start[1]} and 0 in @code{@w{@var{regs}->}end[1]}
1950 @end itemize
1951
1952 @item
1953 If an @w{@var{i}-th} group contains a @w{@var{j}-th} group
1954 in turn not contained within any other group within group @var{i} and
1955 the function reports a match of the @w{@var{i}-th} group, then it
1956 records in @code{@w{@var{regs}->}start[@var{j}]} and
1957 @code{@w{@var{regs}->}end[@var{j}]} the last match (if it matched) of
1958 the @w{@var{j}-th} group.
1959
1960 For example, when you match the pattern @samp{((a*)b)*} against the
1961 string @samp{abb}, @w{group 2} last matches the empty string, so you
1962 get what it previously matched:
1963
1964 @itemize
1965 @item
1966 0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]}
1967
1968 @item
1969 2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]}
1970
1971 @item
1972 2 in @code{@w{@var{regs}->}start[2]} and 2 in @code{@w{@var{regs}->}end[2]}
1973 @end itemize
1974
1975 When you match the pattern @samp{((a)*b)*} against the string
1976 @samp{abb}, @w{group 2} doesn't participate in the last match, so you
1977 get:
1978
1979 @itemize
1980 @item
1981 0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]}
1982
1983 @item
1984 2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]}
1985
1986 @item
1987 0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]}
1988 @end itemize
1989
1990 @item
1991 If an @w{@var{i}-th} group contains a @w{@var{j}-th} group
1992 in turn not contained within any other group within group @var{i}
1993 and the function sets
1994 @code{@w{@var{regs}->}start[@var{i}]} and
1995 @code{@w{@var{regs}->}end[@var{i}]} to @math{-1}, then it also sets
1996 @code{@w{@var{regs}->}start[@var{j}]} and
1997 @code{@w{@var{regs}->}end[@var{j}]} to @math{-1}.
1998
1999 For example, when you match the pattern @samp{((a)*b)*c} against the
2000 string @samp{c}, you get:
2001
2002 @itemize
2003 @item
2004 0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]}
2005
2006 @item
2007 @math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]}
2008
2009 @item
2010 @math{-1} in @code{@w{@var{regs}->}start[2]} and @math{-1} in @code{@w{@var{regs}->}end[2]}
2011 @end itemize
2012
2013 @end itemize
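
Some of the simpler cases above can be reproduced with a sketch like
the following, which lets the matcher allocate the register arrays.  It
assumes the glibc implementation and the POSIX extended syntax, in
which @samp{(} and @samp{)} are the group operators.  The type
@code{struct span} and the helper @code{group_span} are hypothetical.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* OK is 1 if the pattern matched; START and END then hold the
   register contents for the requested group.  Hypothetical type.  */
struct span
{
  int ok;
  int start;
  int end;
};

/* Match REGEX against STRING at index 0 and report the registers
   recorded for group number GROUP (0 is the entire pattern).  */
struct span
group_span (const char *regex, const char *string, unsigned group)
{
  struct span result = { 0, -1, -1 };
  struct re_pattern_buffer buf;
  struct re_registers regs;

  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs);   /* the matcher allocates the arrays */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);   /* assumed syntax */
  if (re_compile_pattern (regex, strlen (regex), &buf) != NULL)
    return result;

  if (group <= buf.re_nsub
      && re_match (&buf, string, (int) strlen (string), 0, &regs) >= 0)
    {
      result.ok = 1;
      result.start = (int) regs.start[group];
      result.end = (int) regs.end[group];
    }

  /* The arrays were allocated by the matcher with malloc; they are
     still null if no match was attempted or found.  */
  free (regs.start);
  free (regs.end);
  regfree (&buf);
  return result;
}
```

For instance, @code{group_span ("((a)(b))", "ab", 3)} should report a
start of 1 and an end of 2, and @code{group_span ("(a)*b", "b", 1)}
should report @math{-1} for both.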
2014
2015 @node Freeing GNU Pattern Buffers
2016 @subsection Freeing GNU Pattern Buffers
2017
2018 To free any allocated fields of a pattern buffer, use the POSIX
2019 function @code{regfree}:
2020
2021 @findex regfree
2022 @example
2023 void
2024 regfree (regex_t *@var{preg})
2025 @end example
2026
2027 @noindent
@var{preg} is the pattern buffer whose allocated fields you want freed;
this works because the type @code{regex_t} (the type for POSIX pattern
buffers) is equivalent to the type @code{re_pattern_buffer}.
2032
2033 @code{regfree} also sets @var{preg}'s @code{allocated} field to zero.
2034 After a buffer has been freed, it must have a regular expression
2035 compiled in it before passing it to a matching or searching function.
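
The following sketch reuses one pattern buffer for several patterns,
freeing it with @code{regfree} after each use and compiling a new
pattern into it before the next search.  It assumes the glibc
implementation and the POSIX extended syntax; the helper
@code{count_matching} is hypothetical.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <string.h>

/* Return how many of the N regular expressions in PATTERNS match
   somewhere in STRING, using a single pattern buffer throughout.
   Patterns that do not compile are skipped.  */
int
count_matching (const char *const *patterns, int n, const char *string)
{
  struct re_pattern_buffer buf;
  int len = (int) strlen (string);
  int i, count = 0;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);   /* assumed syntax */

  for (i = 0; i < n; i++)
    {
      if (re_compile_pattern (patterns[i], strlen (patterns[i]), &buf)
          != NULL)
        continue;
      if (re_search (&buf, string, len, 0, len, 0) >= 0)
        count++;
      /* Release the compiled pattern; the buffer needs a new pattern
         compiled into it before it is searched with again.  */
      regfree (&buf);
    }
  return count;
}
```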
2036
2037
2038 @node BSD Regex Functions
2039 @section BSD Regex Functions
2040
If you're writing code that has to be Berkeley Unix compatible,
you'll need to use these functions, whose interfaces are the same as
those in Berkeley Unix.
2044
2045 @menu
2046 * BSD Regular Expression Compiling::    re_comp ()
2047 * BSD Searching::                       re_exec ()
2048 @end menu
2049
2050 @node BSD Regular Expression Compiling
@subsection BSD Regular Expression Compiling
2052
2053 With Berkeley Unix, you can only search for a given regular
2054 expression; you can't match one.  To search for it, you must first
2055 compile it.  Before you compile it, you must indicate the regular
2056 expression syntax you want it compiled according to by setting the
2057 variable @code{re_syntax_options} (declared in @file{regex.h}) to some
2058 syntax (@pxref{Regular Expression Syntax}).
2059
2060 To compile a regular expression use:
2061
2062 @findex re_comp
2063 @example
2064 char *
2065 re_comp (char *@var{regex})
2066 @end example
2067
2068 @noindent
2069 @var{regex} is the address of a null-terminated regular expression.
2070 @code{re_comp} uses an internal pattern buffer, so you can use only the
2071 most recently compiled pattern buffer.  This means that if you want to
2072 use a given regular expression that you've already compiled---but it
2073 isn't the latest one you've compiled---you'll have to recompile it.  If
you call @code{re_comp} with a null pointer (@emph{not} the empty
string) as the argument, it doesn't change the contents of the pattern
buffer.
2077
2078 If @code{re_comp} successfully compiles the regular expression, it
2079 returns zero.  If it can't compile the regular expression, it returns
2080 an error string.  @code{re_comp}'s error messages are identical to those
2081 of @code{re_compile_pattern} (@pxref{GNU Regular Expression
2082 Compiling}).
2083
2084 @node BSD Searching
2085 @subsection BSD Searching
2086
2087 Searching the Berkeley Unix way means searching in a string
2088 starting at its first character and trying successive positions within
2089 it to find a match.  Once you've compiled a pattern using @code{re_comp}
2090 (@pxref{BSD Regular Expression Compiling}), you can ask Regex
2091 to search for that pattern in a string using:
2092
2093 @findex re_exec
2094 @example
2095 int
2096 re_exec (char *@var{string})
2097 @end example
2098
2099 @noindent
2100 @var{string} is the address of the null-terminated string in which you
2101 want to search.
2102
2103 @code{re_exec} returns either 1 for success or 0 for failure.  It
2104 automatically uses a GNU fastmap (@pxref{Searching with Fastmaps}).
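
A minimal sketch of the Berkeley-style interface follows.  It assumes a
C library that still declares @code{re_comp} and @code{re_exec} in
@file{regex.h} (glibc does so when @code{_GNU_SOURCE} is defined) and
the POSIX extended syntax.  The helper @code{bsd_contains} is
hypothetical.

```c
#define _GNU_SOURCE
#include <regex.h>
#include <stddef.h>

/* Return 1 if REGEX matches somewhere in STRING, 0 if it does not,
   and -1 if REGEX does not compile.  Hypothetical helper.  */
int
bsd_contains (const char *regex, const char *string)
{
  /* re_comp compiles according to the global re_syntax_options.  */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);   /* assumed syntax */

  /* re_comp returns zero on success and an error string otherwise.
     The compiled pattern lives in an internal buffer, so there is
     nothing to pass to re_exec except the string.  */
  if (re_comp (regex) != NULL)
    return -1;

  return re_exec (string);
}
```

Since only the most recently compiled pattern is available, this helper
recompiles @var{regex} on every call.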