sample/java.util.regex.Pattern.txt

   1 *java.util.regex.Pattern* *Pattern* A compiled representation of a regular expre
   2
   3 public final class Pattern
   4   extends    |java.lang.Object|
   5   implements |java.io.Serializable|
   6
   7 |java.util.regex.Pattern_Description|
   8 |java.util.regex.Pattern_Fields|
   9 |java.util.regex.Pattern_Constructors|
  10 |java.util.regex.Pattern_Methods|
  11
  12 ================================================================================
  13
  14 *java.util.regex.Pattern_Fields*
  15 |int_java.util.regex.Pattern.CANON_EQ|
  16 |int_java.util.regex.Pattern.CASE_INSENSITIVE|
  17 |int_java.util.regex.Pattern.COMMENTS|
  18 |int_java.util.regex.Pattern.DOTALL|
  19 |int_java.util.regex.Pattern.LITERAL|
  20 |int_java.util.regex.Pattern.MULTILINE|
  21 |int_java.util.regex.Pattern.UNICODE_CASE|
  22 |int_java.util.regex.Pattern.UNIX_LINES|
  23
  24 *java.util.regex.Pattern_Methods*
  25 |java.util.regex.Pattern.compile(String)|Compiles the given regular expression
  26 |java.util.regex.Pattern.compile(String,int)|Compiles the given regular express
  27 |java.util.regex.Pattern.flags()|Returns this pattern's match flags.
  28 |java.util.regex.Pattern.matcher(CharSequence)|Creates a matcher that will matc
  29 |java.util.regex.Pattern.matches(String,CharSequence)|Compiles the given regula
  30 |java.util.regex.Pattern.pattern()|Returns the regular expression from which th
  31 |java.util.regex.Pattern.quote(String)|Returns a literal pattern String for the
  32 |java.util.regex.Pattern.split(CharSequence)|Splits the given input sequence ar
  33 |java.util.regex.Pattern.split(CharSequence,int)|Splits the given input sequenc
  34 |java.util.regex.Pattern.toString()|Returns the string representation of this p
  35
  36 *java.util.regex.Pattern_Description*
  37
  38 A compiled representation of a regular expression.
  39
A regular expression, specified as a string, must first be compiled into an
instance of this class. The resulting pattern can then be used to create a
Matcher (|java.util.regex.Matcher|) object that can match arbitrary character
sequences (|java.lang.CharSequence|) against the regular expression. All
of the state involved in performing a match resides in the matcher, so many
matchers can share the same pattern.
  46
  47 A typical invocation sequence is thus
  48
  49
  50
    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");
    boolean b = m.matches();
  54
  55 A matches(|java.util.regex.Pattern|) method is defined by this class as a
  56 convenience for when a regular expression is used just once. This method
  57 compiles an expression and matches an input sequence against it in a single
  58 invocation. The statement
  59
  60
  61
  62 boolean b = Pattern.matches("a*b", "aaaaab");
  63
  64 is equivalent to the three statements above, though for repeated matches it is
  65 less efficient since it does not allow the compiled pattern to be reused.
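
The trade-off described above can be sketched as follows. This is a minimal
illustration, not part of the original documentation: the class name
ReusePatternDemo and the helper countMatches are hypothetical, and only
Pattern.compile, Pattern.matcher, Matcher.matches and Pattern.matches come from
java.util.regex.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ReusePatternDemo {
    // Compiled once and reused for every input (hypothetical example pattern).
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Counts the inputs that match the whole pattern, reusing the compiled Pattern.
    static long countMatches(List<String> inputs) {
        return inputs.stream()
                     .filter(s -> A_STAR_B.matcher(s).matches())
                     .count();
    }

    public static void main(String[] args) {
        // One-shot convenience form: the expression is compiled on every call.
        System.out.println(Pattern.matches("a*b", "aaaaab"));

        // Repeated matching: the expression is compiled only once.
        System.out.println(countMatches(Arrays.asList("b", "aab", "abc")));
    }
}
```

Holding the compiled pattern in a static final field is one common way to avoid
recompiling; any other way of keeping the Pattern instance around works equally
well.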
  66
Instances of this class are immutable and are safe for use by multiple
concurrent threads. Instances of the Matcher (|java.util.regex.Matcher|) class
are not safe for such use.
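
One way to respect this rule is to share the Pattern and create a fresh Matcher
inside each call, so that no match state is visible to more than one thread.
The sketch below assumes a hypothetical class SharedPatternDemo and helper
isNumber; it is an illustration of the idea rather than a prescribed idiom.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SharedPatternDemo {
    // Immutable, so it may be shared freely between threads.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Every call creates its own Matcher; match state is never shared.
    static boolean isNumber(String s) {
        Matcher m = DIGITS.matcher(s);
        return m.matches();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> System.out.println(isNumber("12345")));
        Thread t2 = new Thread(() -> System.out.println(isNumber("12a45")));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```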
  70
  71 Summary of regular-expression constructs
  72
  73
  74
  75 Construct Matches
  76
  77 Characters
  78
x        The character x
\\       The backslash character
\0n      The character with octal value 0n (0 <= n <= 7)
\0nn     The character with octal value 0nn (0 <= n <= 7)
\0mnn    The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh     The character with hexadecimal value 0xhh
\uhhhh   The character with hexadecimal value 0xhhhh
\t       The tab character ('\u0009')
\n       The newline (line feed) character ('\u000A')
\r       The carriage-return character ('\u000D')
\f       The form-feed character ('\u000C')
\a       The alert (bell) character ('\u0007')
\e       The escape character ('\u001B')
\cx      The control character corresponding to x
  87
  88 Character classes
  89
[abc]           a, b, or c (simple class)
[^abc]          Any character except a, b, or c (negation)
[a-zA-Z]        a through z or A through Z, inclusive (range)
[a-d[m-p]]      a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z] (subtraction)
  95
  96 Predefined character classes
  97
.    Any character (may or may not match line terminators)
\d   A digit: [0-9]
\D   A non-digit: [^0-9]
\s   A whitespace character: [ \t\n\x0B\f\r]
\S   A non-whitespace character: [^\s]
\w   A word character: [a-zA-Z_0-9]
\W   A non-word character: [^\w]
 102
 103 POSIX character classes (US-ASCII only)
 104
\p{Lower}    A lower-case alphabetic character: [a-z]
\p{Upper}    An upper-case alphabetic character: [A-Z]
\p{ASCII}    All ASCII: [\x00-\x7F]
\p{Alpha}    An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}    A decimal digit: [0-9]
\p{Alnum}    An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}    Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    A visible character: [\p{Alnum}\p{Punct}]
\p{Print}    A printable character: [\p{Graph}\x20]
\p{Blank}    A space or a tab: [ \t]
\p{Cntrl}    A control character: [\x00-\x1F\x7F]
\p{XDigit}   A hexadecimal digit: [0-9a-fA-F]
\p{Space}    A whitespace character: [ \t\n\x0B\f\r]
 115
 116 java.lang.Character classes (simple java character type)
 117
 118 \p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
 119 \p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
 120 \p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
 121 \p{javaMirrored} Equivalent to java.lang.Character.isMirrored()
 122
 123 Classes for Unicode blocks and categories
 124
\p{InGreek}          A character in the Greek block (simple block)
\p{Lu}               An uppercase letter (simple category)
\p{Sc}               A currency symbol
\P{InGreek}          Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]   Any letter except an uppercase letter (subtraction)
 129
 130 Boundary matchers
 131
^    The beginning of a line
$    The end of a line
\b   A word boundary
\B   A non-word boundary
\A   The beginning of the input
\G   The end of the previous match
\Z   The end of the input but for the final terminator, if any
\z   The end of the input
 135
 136 Greedy quantifiers
 137
X?       X, once or not at all
X*       X, zero or more times
X+       X, one or more times
X{n}     X, exactly n times
X{n,}    X, at least n times
X{n,m}   X, at least n but not more than m times
 141
 142 Reluctant quantifiers
 143
X??       X, once or not at all
X*?       X, zero or more times
X+?       X, one or more times
X{n}?     X, exactly n times
X{n,}?    X, at least n times
X{n,m}?   X, at least n but not more than m times
 147
 148 Possessive quantifiers
 149
X?+       X, once or not at all
X*+       X, zero or more times
X++       X, one or more times
X{n}+     X, exactly n times
X{n,}+    X, at least n times
X{n,m}+   X, at least n but not more than m times
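
The three quantifier families accept the same repetition counts but differ in
how they give characters back. The following sketch shows the difference on
one input; the class QuantifierDemo, the helper firstMatch and the sample
string are hypothetical choices made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first substring found by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";

        // Greedy: takes everything, then backs off just enough to match "foo".
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo

        // Reluctant: takes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", input));  // xfoo

        // Possessive: takes everything and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```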
 153
 154 Logical operators
 155
XY    X followed by Y
X|Y   Either X or Y
(X)   X, as a capturing group
 157
 158 Back references
 159
 160 \n Whatever the nth capturing group matched
 161
 162 Quotation
 163
\    Nothing, but quotes the following character
\Q   Nothing, but quotes all characters until \E
\E   Nothing, but ends quoting started by \Q
 166
 167 Special constructs (non-capturing)
 168
(?:X)                X, as a non-capturing group
(?idmsux-idmsux)     Nothing, but turns match flags on - off
(?idmsux-idmsux:X)   X, as a non-capturing group with the given flags on - off
(?=X)                X, via zero-width positive lookahead
(?!X)                X, via zero-width negative lookahead
(?<=X)               X, via zero-width positive lookbehind
(?<!X)               X, via zero-width negative lookbehind
(?>X)                X, as an independent, non-capturing group
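
A few of the special constructs listed above can be tried out as follows. The
class SpecialConstructDemo, the helper firstMatch and the sample inputs are
hypothetical and serve only as an illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialConstructDemo {
    // Returns the first substring found by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive lookahead: digits only when followed by " USD"; the suffix is not consumed.
        System.out.println(firstMatch("\\d+(?= USD)", "cost 42 USD"));

        // Positive lookbehind: digits only when preceded by a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", "price $15"));

        // Embedded flag (?i) turns on case-insensitive matching from that point on.
        System.out.println(Pattern.matches("(?i)hello", "HeLLo"));

        // (?:...) groups without capturing, so only (b) counts as a group.
        System.out.println(Pattern.compile("(?:a)(b)").matcher("ab").groupCount());
    }
}
```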
 175
 176
 177
 178
 179
 180 Backslashes, escapes, and quoting
 181
 182 The backslash character ('\') serves to introduce escaped constructs, as
 183 defined in the table above, as well as to quote characters that otherwise would
 184 be interpreted as unescaped constructs. Thus the expression \\ matches a single
 185 backslash and \{ matches a left brace.
 186
 187 It is an error to use a backslash prior to any alphabetic character that does
 188 not denote an escaped construct; these are reserved for future extensions to
 189 the regular-expression language. A backslash may be used prior to a
 190 non-alphabetic character regardless of whether that character is part of an
 191 unescaped construct.
 192
Backslashes within string literals in Java source code are interpreted as
required by the Java Language Specification as either Unicode escapes or other
character escapes. It is therefore necessary to double backslashes in string
literals that represent regular expressions to protect them from interpretation
by the Java bytecode compiler. The string literal "\b", for example, matches a
single backspace character when interpreted as a regular expression, while
"\\b" matches a word boundary. The string literal "\(hello\)" is illegal and
leads to a compile-time error; in order to match the string (hello) the string
literal "\\(hello\\)" must be used.
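
The doubling rule can be checked with a short program. This is a sketch under
the assumption that a hypothetical class EscapeDemo with a helper found is
acceptable for illustration; Pattern.quote is the quote(String) method listed
in the method summary.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // True if the regex is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\b" reaches the regex parser as \b, a word boundary.
        System.out.println(found("\\bcat\\b", "a cat sat"));
        System.out.println(found("\\bcat\\b", "concatenate"));

        // "\b" is a Java character escape: the pattern is one backspace character.
        System.out.println(Pattern.matches("\b", "\b"));

        // Doubled backslashes quote the parentheses so they match literally.
        System.out.println(Pattern.matches("\\(hello\\)", "(hello)"));

        // Pattern.quote produces a literal pattern, so '.' only matches a dot.
        String literal = Pattern.quote("a.b");
        System.out.println(Pattern.matches(literal, "a.b"));
        System.out.println(Pattern.matches(literal, "axb"));
    }
}
```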
 202
 203 Character Classes
 204
Character classes may appear within other character classes, and may be
composed by the union operator (implicit) and the intersection operator (&&).
The union operator denotes a class that contains every character that is
in at least one of its operand classes. The intersection operator denotes a
class that contains every character that is in both of its operand classes.
 210
 211 The precedence of character-class operators is as follows, from highest to
 212 lowest:
 213
1   Literal escape   \x
2   Grouping         [...]
3   Range            a-z
4   Union            [a-e][i-u]
5   Intersection     [a-z&&[aeiou]]
 216
Note that a different set of metacharacters is in effect inside a character
class than outside a character class. For instance, the regular expression .
loses its special meaning inside a character class, while the expression -
becomes a range-forming metacharacter.
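
Union, intersection and subtraction can be exercised with single-character
inputs. The class CharClassDemo and the helper matches below are hypothetical
names used only for this sketch.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Union: a through d, or m through p.
        System.out.println(matches("[a-d[m-p]]", "n"));    // true
        System.out.println(matches("[a-d[m-p]]", "e"));    // false

        // Intersection: only d, e, or f.
        System.out.println(matches("[a-z&&[def]]", "e"));  // true
        System.out.println(matches("[a-z&&[def]]", "a"));  // false

        // Subtraction: a through z, except b and c.
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // false
        System.out.println(matches("[a-z&&[^bc]]", "d"));  // true

        // Inside a class the dot is an ordinary character.
        System.out.println(matches("[.]", "a"));           // false
        System.out.println(matches("[.]", "."));           // true
    }
}
```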
 221
 222 Line terminators
 223
 224 A line terminator is a one- or two-character sequence that marks the end of a
 225 line of the input character sequence. The following are recognized as line
 226 terminators:
 227
 228
 229
A newline (line feed) character ('\n'),

A carriage-return character followed immediately by a newline
character ("\r\n"),

A standalone carriage-return character ('\r'),

A next-line character ('\u0085'),

A line-separator character ('\u2028'), or

A paragraph-separator character ('\u2029').
 242
If UNIX_LINES (|java.util.regex.Pattern|) mode is activated, then the only line
terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless
the DOTALL (|java.util.regex.Pattern|) flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and only
match at the beginning and the end, respectively, of the entire input sequence.
If MULTILINE (|java.util.regex.Pattern|) mode is activated then ^ matches at
the beginning of input and after any line terminator except at the end of
input. When in MULTILINE (|java.util.regex.Pattern|) mode $ matches just before
a line terminator or the end of the input sequence.
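
The effect of the three line-related flags can be observed with the
compile(String,int) method from the method summary. The class LineModeDemo and
the helper found are hypothetical names for this sketch.

```java
import java.util.regex.Pattern;

public class LineModeDemo {
    // True if the regex, compiled with the given flags, is found in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String lines = "a\nb\nc";

        // Default: ^ and $ only match at the ends of the whole input.
        System.out.println(found("^b$", 0, lines));                  // false
        // MULTILINE: ^ and $ also match around line terminators.
        System.out.println(found("^b$", Pattern.MULTILINE, lines));  // true

        // Default: the dot does not match a line terminator.
        System.out.println(found("a.b", 0, "a\nb"));                 // false
        // DOTALL: the dot matches line terminators as well.
        System.out.println(found("a.b", Pattern.DOTALL, "a\nb"));    // true

        // UNIX_LINES: only the newline is a line terminator, so the dot
        // now matches a carriage return.
        System.out.println(found("a.b", 0, "a\rb"));                   // false
        System.out.println(found("a.b", Pattern.UNIX_LINES, "a\rb"));  // true
    }
}
```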
 255
 256 Groups and capturing
 257
 258 Capturing groups are numbered by counting their opening parentheses from left
 259 to right. In the expression ((A)(B(C))), for example, there are four such
 260 groups:
 261
1   ((A)(B(C)))
2   (A)
3   (B(C))
4   (C)
 263
 264 Group zero always stands for the entire expression.
 265
 266 Capturing groups are so named because, during a match, each subsequence of the
 267 input sequence that matches such a group is saved. The captured subsequence may
 268 be used later in the expression, via a back reference, and may also be
 269 retrieved from the matcher once the match operation is complete.
 270
 271 The captured input associated with a group is always the subsequence that the
 272 group most recently matched. If a group is evaluated a second time because of
 273 quantification then its previously-captured value, if any, will be retained if
 274 the second evaluation fails. Matching the string "aba" against the expression
 275 (a(b)?)+, for example, leaves group two set to "b". All captured input is
 276 discarded at the beginning of each match.
 277
 278 Groups beginning with (? are pure, non-capturing groups that do not capture
 279 text and do not count towards the group total.
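
The numbering and retention rules described in this section can be inspected
through the matcher. The sketch below assumes a hypothetical class GroupDemo
with a helper groups; Matcher.groupCount and Matcher.group(int) are the
standard java.util.regex.Matcher accessors.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Returns groups 0..groupCount() for a full match, or null if there is no match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(String.join(" | ", groups("((A)(B(C)))", "ABC")));

        // Group two keeps "b" from the first iteration of the outer group.
        System.out.println(groups("(a(b)?)+", "aba")[2]);

        // The (?: ...) group is not counted, so there is one capturing group.
        System.out.println(groups("(?:a)(b)", "ab").length - 1);
    }
}
```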
 280
 281 Unicode support
 282
 283 This class is in conformance with Level 1 of Unicode Technical Standard #18:
 284 Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.
 285
Unicode escape sequences such as \u2014 in Java source code are processed as
described in §3.3 of the Java Language Specification. Such escape sequences are
also implemented directly by the regular-expression parser so that Unicode
escapes can be used in expressions that are read from files or from the
keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile
into the same pattern, which matches the character with hexadecimal value
0x2014.
 292
 293 Unicode blocks and categories are written with the \p and \P constructs as in
 294 Perl. \p{prop} matches if the input has the property prop, while \P{prop} does
 295 not match if the input has that property. Blocks are specified with the prefix
 296 In, as in InMongolian. Categories may be specified with the optional prefix Is:
 297 Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and
 298 categories can be used both inside and outside of a character class.
 299
The supported categories are those of The Unicode Standard in the version
specified by the Character(|java.lang.Character|) class. The category names are
those defined in the Standard, both normative and informative. The block names
supported by Pattern are the valid block names accepted and defined by
UnicodeBlock.forName(|java.lang.Character.UnicodeBlock|).
 307
Categories that behave like the java.lang.Character boolean ismethodname
methods (except for the deprecated ones) are available through the same
\p{prop} syntax where the specified property has the name javamethodname.
For example, \p{javaLowerCase} corresponds to
java.lang.Character.isLowerCase().
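
The property syntax described in this section can be tried on single
characters. In the sketch below the class UnicodeDemo and the helper is are
hypothetical; the sample characters are built from their code points so that
the source file stays plain ASCII.

```java
import java.util.regex.Pattern;

public class UnicodeDemo {
    // True if the whole input matches the regex.
    static boolean is(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = String.valueOf((char) 0x03B1);  // Greek small letter alpha
        String dash = String.valueOf((char) 0x2014);   // character with value 0x2014

        // Block membership uses the In prefix; the capital P negates it.
        System.out.println(is("\\p{InGreek}", alpha));
        System.out.println(is("\\P{InGreek}", "a"));

        // The Is prefix for categories is optional.
        System.out.println(is("\\p{L}", "x") && is("\\p{IsL}", "x"));

        // java.lang.Character predicates are available under the java prefix.
        System.out.println(is("\\p{javaLowerCase}", "a"));

        // Subtraction: a letter that is not an uppercase letter.
        System.out.println(is("[\\p{L}&&[^\\p{Lu}]]", "a"));

        // The regex parser itself understands the four-digit hexadecimal escape.
        System.out.println(is("\\u2014", dash));
    }
}
```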
 311
 312 Comparison to Perl 5
 313
 314 The Pattern engine performs traditional NFA-based matching with ordered
 315 alternation as occurs in Perl 5.
 316
 317 Perl constructs not supported by this class:
 318
 319
 320
 321 The conditional constructs (?{X}) and (?(condition)X|Y),
 322
 323 The embedded code constructs (?{code}) and (??{code}),
 324
 325 The embedded comment syntax (?#comment), and
 326
The preprocessing operations \l \u, \L, and \U.
 328
 329
 330
 331 Constructs supported by this class but not by Perl:
 332
 333
 334
 335 Possessive quantifiers, which greedily match as much as they can and do not
 336 back off, even when doing so would allow the overall match to succeed.
 337
 338 Character-class union and intersection as described above.
 339
 340
 341
 342 Notable differences from Perl:
 343
 344
 345
 346 In Perl, \1 through \9 are always interpreted as back references; a
 347 backslash-escaped number greater than 9 is treated as a back reference if at
 348 least that many subexpressions exist, otherwise it is interpreted, if possible,
 349 as an octal escape. In this class octal escapes must always begin with a zero.
 350 In this class, \1 through \9 are always interpreted as back references, and a
 351 larger number is accepted as a back reference if at least that many
 352 subexpressions exist at that point in the regular expression, otherwise the
 353 parser will drop digits until the number is smaller or equal to the existing
 354 number of groups or it is one digit.
 355
Perl uses the g flag to request a match that resumes where the last match left
off. This functionality is provided implicitly by the Matcher
(|java.util.regex.Matcher|) class: repeated invocations of the
find(|java.util.regex.Matcher|) method will resume where the last match left
off, unless the matcher is reset.
 361
 362 In Perl, embedded flags at the top level of an expression affect the whole
 363 expression. In this class, embedded flags always take effect at the point at
 364 which they appear, whether they are at the top level or within a group; in the
 365 latter case, flags are restored at the end of the group just as in Perl.
 366
Perl is forgiving about malformed matching constructs, as in the expression *a,
as well as dangling brackets, as in the expression abc], and treats them as
literals. This class also accepts dangling brackets but is strict about
dangling metacharacters like +, ? and *, and will throw a
PatternSyntaxException (|java.util.regex.PatternSyntaxException|) if it
encounters them.
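
Some of the differences listed above can be observed directly: repeated find
calls resume after the previous match, \1 acts as a back reference, and a
dangling * is rejected while a dangling ] is accepted. The class FindDemo and
the helpers findAll and compiles are hypothetical names for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class FindDemo {
    // Collects every match; each find() call resumes where the last one ended.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // True if the expression compiles without a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1b22c333"));       // [1, 22, 333]
        // \1 refers back to whatever group one matched: a doubled word character.
        System.out.println(findAll("(\\w)\\1", "hello book"));  // [ll, oo]
        System.out.println(compiles("*a"));                     // false
        System.out.println(compiles("abc]"));                   // true
    }
}
```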
 372
 373
 374
 375 For a more precise description of the behavior of regular expression
 376 constructs, please see Mastering Regular Expressions, 2nd Edition, Jeffrey E.
 377 F. Friedl, O'Reilly and Associates, 2002.
 378
 379
 380 *int_java.util.regex.Pattern.CANON_EQ*
 381
 382 A compiled representation of a regular expression.
 383
 384 A regular expression, specified as a string, must first be compiled into an
 385 instance of this class. The resulting pattern can then be used to create a
 386 (|java.util.regex.Matcher|) object that can match arbitrary </code>character
 387 sequences<code>(|java.lang.CharSequence|) against the regular expression. All
 388 of the state involved in performing a match resides in the matcher, so many
 389 matchers can share the same pattern.
 390
 391 A typical invocation sequence is thus
 392
 393
 394
 395 Pattern p = Pattern. compile(|java.util.regex.Pattern|) ("a*b"); Matcher m = p.
 396 matcher(|java.util.regex.Pattern|) ("aaaaab"); boolean b = m.
 397 matches(|java.util.regex.Matcher|) ();
 398
 399 A matches(|java.util.regex.Pattern|) method is defined by this class as a
 400 convenience for when a regular expression is used just once. This method
 401 compiles an expression and matches an input sequence against it in a single
 402 invocation. The statement
 403
 404
 405
 406 boolean b = Pattern.matches("a*b", "aaaaab");
 407
 408 is equivalent to the three statements above, though for repeated matches it is
 409 less efficient since it does not allow the compiled pattern to be reused.
 410
 411 Instances of this class are immutable and are safe for use by multiple
 412 concurrent threads. Instances of the (|java.util.regex.Matcher|) class are not
 413 safe for such use.
 414
 415 Summary of regular-expression constructs
 416
 417
 418
 419 Construct Matches
 420
 421 Characters
 422
 423 x The character x \\ The backslash character \0n The character with octal value
 424 0n (0<=n<=7) \0nn The character with octal value 0nn (0<=n<=7) \0mnn The
 425 character with octal value 0mnn (0<=m<=3, 0<=n<=7) \xhh The character with
 426 hexadecimalvalue0xhh uhhhh The character with hexadecimalvalue0xhhhh \t The tab
 427 character ('u0009') \n The newline (line feed) character ('u000A') \r The
 428 carriage-return character ('u000D') \f The form-feed character ('u000C') \a The
 429 alert (bell) character ('u0007') \e The escape character ('u001B') \cx The
 430 control character corresponding to x
 431
 432 Character classes
 433
 434 [abc] a, b, or c (simple class) [^abc] Any character except a, b, or c
 435 (negation) [a-zA-Z] a through z or A through Z, inclusive (range) [a-d[m-p]] a
 436 through d, or m through p: [a-dm-p] (union) [a-z d, e, or f (intersection) [a-z
 437 a through z, except for b and c: [ad-z] (subtraction) [a-z a through z, and not
 438 m through p: [a-lq-z](subtraction)
 439
 440 Predefined character classes
 441
 442 . Any character (may or may not match line terminators) \d A digit: [0-9] \D A
 443 non-digit: [^0-9] \s A whitespace character: [ \t\n\x0B\f\r] \S A
 444 non-whitespace character: [^\s] \w A word character: [a-zA-Z_0-9] \W A non-word
 445 character: [^\w]
 446
 447 POSIX character classes (US-ASCII only)
 448
 449 \p{Lower} A lower-case alphabetic character: [a-z] \p{Upper} An upper-case
 450 alphabetic character:[A-Z] \p{ASCII} All ASCII:[\x00-\x7F] \p{Alpha} An
 451 alphabetic character:[\p{Lower}\p{Upper}] \p{Digit} A decimal digit: [0-9]
 452 \p{Alnum} An alphanumeric character:[\p{Alpha}\p{Digit}] \p{Punct} Punctuation:
 453 One of !"#$%?@[\]^_`{|}~ [\!"#\$%\\?@\[\\\]\^_`\{\|\}~]
 454 [\X21-\X2F\X31-\X40\X5B-\X60\X7B-\X7E] --> \p{Graph} A visible character:
 455 [\p{Alnum}\p{Punct}] \p{Print} A printable character: [\p{Graph}\x20] \p{Blank}
 456 A space or a tab: [ \t] \p{Cntrl} A control character: [\x00-\x1F\x7F]
 457 \p{XDigit} A hexadecimal digit: [0-9a-fA-F] \p{Space} A whitespace character: [
 458 \t\n\x0B\f\r]
 459
 460 java.lang.Character classes (simple java character type)
 461
 462 \p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
 463 \p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
 464 \p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
 465 \p{javaMirrored} Equivalent to java.lang.Character.isMirrored()
 466
 467 Classes for Unicode blocks and categories
 468
 469 \p{InGreek} A character in the Greekblock (simple block) \p{Lu} An uppercase
 470 letter (simple category) \p{Sc} A currency symbol \P{InGreek} Any character
 471 except one in the Greek block (negation) [\p{L} Any letter except an uppercase
 472 letter (subtraction)
 473
 474 Boundary matchers
 475
 476 ^ The beginning of a line $ The end of a line \b A word boundary \B A non-word
 477 boundary \A The beginning of the input \G The end of the previous match \Z The
 478 end of the input but for the final terminator, ifany \z The end of the input
 479
 480 Greedy quantifiers
 481
 482 X? X, once or not at all X* X, zero or more times X+ X, one or more times X{n}
 483 X, exactly n times X{n,} X, at least n times X{n,m} X, at least n but not more
 484 than m times
 485
 486 Reluctant quantifiers
 487
 488 X?? X, once or not at all X*? X, zero or more times X+? X, one or more times
 489 X{n}? X, exactly n times X{n,}? X, at least n times X{n,m}? X, at least n but
 490 not more than m times
 491
 492 Possessive quantifiers
 493
 494 X?+ X, once or not at all X*+ X, zero or more times X++ X, one or more times
 495 X{n}+ X, exactly n times X{n,}+ X, at least n times X{n,m}+ X, at least n but
 496 not more than m times
 497
 498 Logical operators
 499
 500 XY X followed by Y X|Y Either X or Y (X) X, as a capturing group
 501
 502 Back references
 503
 504 \n Whatever the nth capturing group matched
 505
 506 Quotation
 507
 508 \ Nothing, but quotes the following character \Q Nothing, but quotes all
 509 characters until \E \E Nothing, but ends quoting started by \Q ?[\]^{|} -->
 510
 511 Special constructs (non-capturing)
 512
 513 (?:X) X, as a non-capturing group (?idmsux-idmsux) Nothing, but turns match
 514 flags on - off (?idmsux-idmsux:X) X, as a non-capturing group with the given
 515 flags on - off (?=X) X, via zero-width positive lookahead (?!X) X, via
 516 zero-width negative lookahead (?<=X) X, via zero-width positive lookbehind
 517 (?<!X) X, via zero-width negative lookbehind (?>X) X, as an independent,
 518 non-capturing group
 519
 520
 521
 522
 523
 524 Backslashes, escapes, and quoting
 525
 526 The backslash character ('\') serves to introduce escaped constructs, as
 527 defined in the table above, as well as to quote characters that otherwise would
 528 be interpreted as unescaped constructs. Thus the expression \\ matches a single
 529 backslash and \{ matches a left brace.
 530
 531 It is an error to use a backslash prior to any alphabetic character that does
 532 not denote an escaped construct; these are reserved for future extensions to
 533 the regular-expression language. A backslash may be used prior to a
 534 non-alphabetic character regardless of whether that character is part of an
 535 unescaped construct.
 536
 537 Backslashes within string literals in Java source code are interpreted as
 538 required by the Java Language Specification as either Unicode escapes or other
 539 character escapes. It is therefore necessary to double backslashes in string
 540 literals that represent regular expressions to protect them from interpretation
 541 by the Java bytecode compiler. The string literal "b", for example, matches a
 542 single backspace character when interpreted as a regular expression, while "b"
 543 matches a word boundary. The string literal "(hello)" is illegal and leads to a
 544 compile-time error; in order to match the string (hello) the string literal
 545 "(hello)" must be used.
 546
 547 Character Classes
 548
 549 Character classes may appear within other character classes, and may be
 550 composed by the union operator (implicit) and the intersection operator ( and
 551 and ). The union operator denotes a class that contains every character that is
 552 in at least one of its operand classes. The intersection operator denotes a
 553 class that contains every character that is in both of its operand classes.
 554
 555 The precedence of character-class operators is as follows, from highest to
 556 lowest:
 557
 558 1 Literal escape \x 2 Grouping [...] 3 Range a-z 4 Union [a-e][i-u] 5
 559 Intersection [a-z
 560
 561 Note that a different set of metacharacters are in effect inside a character
 562 class than outside a character class. For instance, the regular expression .
 563 loses its special meaning inside a character class, while the expression -
 564 becomes a range forming metacharacter.
 565
 566 Line terminators
 567
 568 A line terminator is a one- or two-character sequence that marks the end of a
 569 line of the input character sequence. The following are recognized as line
 570 terminators:
 571
 572
 573
 574 A newline (line feed) character('\n'),
 575
 576 A carriage-return character followed immediately by a newline
 577 character("\r\n"),
 578
 579 A standalone carriage-return character('\r'),
 580
 581 A next-line character('u0085'),
 582
 583 A line-separator character('u2028'), or
 584
 585 A paragraph-separator character('u2029).
 586
 587 If (|java.util.regex.Pattern|) mode is activated, then the only line
 588 terminators recognized are newline characters.
 589
 590 The regular expression . matches any character except a line terminator unless
 591 the (|java.util.regex.Pattern|) flag is specified.
 592
 593 By default, the regular expressions ^ and $ ignore line terminators and only
 594 match at the beginning and the end, respectively, of the entire input sequence.
 595 If (|java.util.regex.Pattern|) mode is activated then ^ matches at the
 596 beginning of input and after any line terminator except at the end of input.
 597 When in (|java.util.regex.Pattern|) mode $ matches just before a line
 598 terminator or the end of the input sequence.
 599
 600 Groups and capturing
 601
 602 Capturing groups are numbered by counting their opening parentheses from left
 603 to right. In the expression ((A)(B(C))), for example, there are four such
 604 groups:
 605
 606 1 ((A)(B(C))) 2 (A) 3 (B(C)) 4 (C)
 607
 608 Group zero always stands for the entire expression.
 609
 610 Capturing groups are so named because, during a match, each subsequence of the
 611 input sequence that matches such a group is saved. The captured subsequence may
 612 be used later in the expression, via a back reference, and may also be
 613 retrieved from the matcher once the match operation is complete.
 614
 615 The captured input associated with a group is always the subsequence that the
 616 group most recently matched. If a group is evaluated a second time because of
 617 quantification then its previously-captured value, if any, will be retained if
 618 the second evaluation fails. Matching the string "aba" against the expression
 619 (a(b)?)+, for example, leaves group two set to "b". All captured input is
 620 discarded at the beginning of each match.
 621
 622 Groups beginning with (? are pure, non-capturing groups that do not capture
 623 text and do not count towards the group total.
 624
 625 Unicode support
 626
 627 This class is in conformance with Level 1 of Unicode Technical Standard #18:
 628 Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.
 629
 630 Unicode escape sequences such as u2014 in Java source code are processed as
 631 described in ¤3.3 of the Java Language Specification. Such escape sequences are
 632 also implemented directly by the regular-expression parser so that Unicode
 633 escapes can be used in expressions that are read from files or from the
 634 keyboard. Thus the strings "u2014" and "\\u2014", while not equal, compile into
 635 the same pattern, which matches the character with hexadecimal value 0x2014.
 636
 637 Unicode blocks and categories are written with the \p and \P constructs as in
 638 Perl. \p{prop} matches if the input has the property prop, while \P{prop} does
 639 not match if the input has that property. Blocks are specified with the prefix
 640 In, as in InMongolian. Categories may be specified with the optional prefix Is:
 641 Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and
 642 categories can be used both inside and outside of a character class.
 643
 644 The supported categories are those of
 645
 646 The Unicode Standard in the version specified by the
 647 Character(|java.lang.Character|) class. The category names are those defined in
 648 the Standard, both normative and informative. The block names supported by
 649 Pattern are the valid block names accepted and defined by
 650 UnicodeBlock.forName(|java.lang.Character.UnicodeBlock|) .
 651
 652 Categories that behave like the java.lang.Character boolean ismethodname
 653 methods (except for the deprecated ones) are available through the same
 654 \p{prop} syntax where the specified property has the name javamethodname.
 655
 656 Comparison to Perl 5
 657
 658 The Pattern engine performs traditional NFA-based matching with ordered
 659 alternation as occurs in Perl 5.
 660
 661 Perl constructs not supported by this class:
 662
 663
 664
 665 The conditional constructs (?{X}) and (?(condition)X|Y),
 666
 667 The embedded code constructs (?{code}) and (??{code}),
 668
 669 The embedded comment syntax (?#comment), and
 670
 671 The preprocessing operations \l u, \L, and \U.
 672
 673
 674
 675 Constructs supported by this class but not by Perl:
 676
 677
 678
 679 Possessive quantifiers, which greedily match as much as they can and do not
 680 back off, even when doing so would allow the overall match to succeed.
 681
 682 Character-class union and intersection as described above.
 683
 684
 685
 686 Notable differences from Perl:
 687
 688
 689
 690 In Perl, \1 through \9 are always interpreted as back references; a
 691 backslash-escaped number greater than 9 is treated as a back reference if at
 692 least that many subexpressions exist, otherwise it is interpreted, if possible,
 693 as an octal escape. In this class octal escapes must always begin with a zero.
 694 In this class, \1 through \9 are always interpreted as back references, and a
 695 larger number is accepted as a back reference if at least that many
 696 subexpressions exist at that point in the regular expression, otherwise the
 697 parser will drop digits until the number is smaller or equal to the existing
 698 number of groups or it is one digit.
 699
Perl uses the g flag to request a match that resumes where the last match left
off. This functionality is provided implicitly by the
Matcher(|java.util.regex.Matcher|) class: repeated invocations of the
find(|java.util.regex.Matcher|) method will resume where the last match left
off, unless the matcher is reset.
 705
 706 In Perl, embedded flags at the top level of an expression affect the whole
 707 expression. In this class, embedded flags always take effect at the point at
 708 which they appear, whether they are at the top level or within a group; in the
 709 latter case, flags are restored at the end of the group just as in Perl.
 710
Perl is forgiving about malformed matching constructs, as in the expression *a,
as well as dangling brackets, as in the expression abc], and treats them as
literals. This class also accepts dangling brackets but is strict about
dangling metacharacters like +, ? and *, and will throw a
PatternSyntaxException(|java.util.regex.PatternSyntaxException|) if it
encounters them.
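
The scoping of embedded flags noted above can be sketched as follows. The class, field, and method names are hypothetical; the two expressions were chosen only to show a flag taking effect mid-expression and a flag being restored at the end of a group.

```java
import java.util.regex.Pattern;

public class EmbeddedFlagDemo {
    // (?i) appears after 'a', so only 'b' is matched case-insensitively.
    static final Pattern MID_FLAG = Pattern.compile("a(?i)b");

    // (?i) is set inside the group and restored when the group ends,
    // so only 'a' is matched case-insensitively.
    static final Pattern GROUP_FLAG = Pattern.compile("(?:(?i)a)b");

    static boolean midFlag(String s) {
        return MID_FLAG.matcher(s).matches();
    }

    static boolean groupFlag(String s) {
        return GROUP_FLAG.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(midFlag("aB"));    // true
        System.out.println(midFlag("Ab"));    // false
        System.out.println(groupFlag("Ab"));  // true
        System.out.println(groupFlag("aB"));  // false
    }
}
```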
 716
 717
 718
 719 For a more precise description of the behavior of regular expression
 720 constructs, please see Mastering Regular Expressions, 2nd Edition, Jeffrey E.
 721 F. Friedl, O'Reilly and Associates, 2002.
 722
 723
 724 *int_java.util.regex.Pattern.CASE_INSENSITIVE*
 725
 726 A compiled representation of a regular expression.
 727
A regular expression, specified as a string, must first be compiled into an
instance of this class. The resulting pattern can then be used to create a
Matcher(|java.util.regex.Matcher|) object that can match arbitrary character
sequences(|java.lang.CharSequence|) against the regular expression. All
of the state involved in performing a match resides in the matcher, so many
matchers can share the same pattern.
 734
A typical invocation sequence is thus

    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");
    boolean b = m.matches();
 742
 743 A matches(|java.util.regex.Pattern|) method is defined by this class as a
 744 convenience for when a regular expression is used just once. This method
 745 compiles an expression and matches an input sequence against it in a single
 746 invocation. The statement
 747
 748
 749
 750 boolean b = Pattern.matches("a*b", "aaaaab");
 751
 752 is equivalent to the three statements above, though for repeated matches it is
 753 less efficient since it does not allow the compiled pattern to be reused.
 754
Instances of this class are immutable and are safe for use by multiple
concurrent threads. Instances of the Matcher(|java.util.regex.Matcher|) class
are not safe for such use.
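
One way to apply the reuse and thread-safety notes above is to compile the pattern once and create a fresh matcher for each input. This is a minimal sketch; the class name, the constant, and the helper method are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternReuseDemo {
    // Compiled once; a Pattern is immutable, so it may be shared freely.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // A new Matcher is created per input because matchers hold the
    // mutable match state and must not be shared between threads.
    static List<String> keepMatching(List<String> inputs) {
        List<String> kept = new ArrayList<String>();
        for (String s : inputs) {
            Matcher m = A_STAR_B.matcher(s);
            if (m.matches()) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [aaaaab, b]
        System.out.println(keepMatching(Arrays.asList("aaaaab", "b", "ba")));
    }
}
```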
 758
 759 Summary of regular-expression constructs
 760
 761
 762
Construct           Matches

Characters

x                   The character x
\\                  The backslash character
\0n                 The character with octal value 0n (0 <= n <= 7)
\0nn                The character with octal value 0nn (0 <= n <= 7)
\0mnn               The character with octal value 0mnn (0 <= m <= 3,
                    0 <= n <= 7)
\xhh                The character with hexadecimal value 0xhh
\uhhhh              The character with hexadecimal value 0xhhhh
\t                  The tab character ('\u0009')
\n                  The newline (line feed) character ('\u000A')
\r                  The carriage-return character ('\u000D')
\f                  The form-feed character ('\u000C')
\a                  The alert (bell) character ('\u0007')
\e                  The escape character ('\u001B')
\cx                 The control character corresponding to x

Character classes

[abc]               a, b, or c (simple class)
[^abc]              Any character except a, b, or c (negation)
[a-zA-Z]            a through z or A through Z, inclusive (range)
[a-d[m-p]]          a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]        d, e, or f (intersection)
[a-z&&[^bc]]        a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]       a through z, and not m through p: [a-lq-z] (subtraction)

Predefined character classes

.                   Any character (may or may not match line terminators)
\d                  A digit: [0-9]
\D                  A non-digit: [^0-9]
\s                  A whitespace character: [ \t\n\x0B\f\r]
\S                  A non-whitespace character: [^\s]
\w                  A word character: [a-zA-Z_0-9]
\W                  A non-word character: [^\w]

POSIX character classes (US-ASCII only)

\p{Lower}           A lower-case alphabetic character: [a-z]
\p{Upper}           An upper-case alphabetic character: [A-Z]
\p{ASCII}           All ASCII: [\x00-\x7F]
\p{Alpha}           An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}           A decimal digit: [0-9]
\p{Alnum}           An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}           Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}           A visible character: [\p{Alnum}\p{Punct}]
\p{Print}           A printable character: [\p{Graph}\x20]
\p{Blank}           A space or a tab: [ \t]
\p{Cntrl}           A control character: [\x00-\x1F\x7F]
\p{XDigit}          A hexadecimal digit: [0-9a-fA-F]
\p{Space}           A whitespace character: [ \t\n\x0B\f\r]

java.lang.Character classes (simple java character type)

\p{javaLowerCase}   Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}   Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}  Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}    Equivalent to java.lang.Character.isMirrored()

Classes for Unicode blocks and categories

\p{InGreek}         A character in the Greek block (simple block)
\p{Lu}              An uppercase letter (simple category)
\p{Sc}              A currency symbol
\P{InGreek}         Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]  Any letter except an uppercase letter (subtraction)

Boundary matchers

^                   The beginning of a line
$                   The end of a line
\b                  A word boundary
\B                  A non-word boundary
\A                  The beginning of the input
\G                  The end of the previous match
\Z                  The end of the input but for the final terminator, if any
\z                  The end of the input

Greedy quantifiers

X?                  X, once or not at all
X*                  X, zero or more times
X+                  X, one or more times
X{n}                X, exactly n times
X{n,}               X, at least n times
X{n,m}              X, at least n but not more than m times

Reluctant quantifiers

X??                 X, once or not at all
X*?                 X, zero or more times
X+?                 X, one or more times
X{n}?               X, exactly n times
X{n,}?              X, at least n times
X{n,m}?             X, at least n but not more than m times

Possessive quantifiers

X?+                 X, once or not at all
X*+                 X, zero or more times
X++                 X, one or more times
X{n}+               X, exactly n times
X{n,}+              X, at least n times
X{n,m}+             X, at least n but not more than m times

Logical operators

XY                  X followed by Y
X|Y                 Either X or Y
(X)                 X, as a capturing group

Back references

\n                  Whatever the nth capturing group matched

Quotation

\                   Nothing, but quotes the following character
\Q                  Nothing, but quotes all characters until \E
\E                  Nothing, but ends quoting started by \Q

Special constructs (non-capturing)

(?:X)               X, as a non-capturing group
(?idmsux-idmsux)    Nothing, but turns match flags on - off
(?idmsux-idmsux:X)  X, as a non-capturing group with the given flags on - off
(?=X)               X, via zero-width positive lookahead
(?!X)               X, via zero-width negative lookahead
(?<=X)              X, via zero-width positive lookbehind
(?<!X)              X, via zero-width negative lookbehind
(?>X)               X, as an independent, non-capturing group
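
A few of the constructs summarized above can be compared side by side. The sketch below is illustrative only: the class name and the firstMatch helper are hypothetical, and the inputs were picked to contrast a greedy quantifier, a reluctant quantifier, and a zero-width lookahead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the text of the first match of regex in input, or null if
    // there is no match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy .+ runs to the last '>' in the input.
        System.out.println(firstMatch("<.+>", "<a><b>"));          // <a><b>
        // Reluctant .+? stops at the first '>' that completes a match.
        System.out.println(firstMatch("<.+?>", "<a><b>"));         // <a>
        // The lookahead requires "px" to follow but does not consume it.
        System.out.println(firstMatch("\\d+(?=px)", "12em 34px")); // 34
    }
}
```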
 863
 864
 865
 866
 867
 868 Backslashes, escapes, and quoting
 869
 870 The backslash character ('\') serves to introduce escaped constructs, as
 871 defined in the table above, as well as to quote characters that otherwise would
 872 be interpreted as unescaped constructs. Thus the expression \\ matches a single
 873 backslash and \{ matches a left brace.
 874
 875 It is an error to use a backslash prior to any alphabetic character that does
 876 not denote an escaped construct; these are reserved for future extensions to
 877 the regular-expression language. A backslash may be used prior to a
 878 non-alphabetic character regardless of whether that character is part of an
 879 unescaped construct.
 880
Backslashes within string literals in Java source code are interpreted as
required by the Java Language Specification as either Unicode escapes or other
character escapes. It is therefore necessary to double backslashes in string
literals that represent regular expressions to protect them from interpretation
by the Java bytecode compiler. The string literal "\b", for example, matches a
single backspace character when interpreted as a regular expression, while
"\\b" matches a word boundary. The string literal "\(hello\)" is illegal and
leads to a compile-time error; in order to match the string (hello) the string
literal "\\(hello\\)" must be used.
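
The doubling rule above can be seen in a short sketch. The class and method names are hypothetical; quote(String) is the method of this class listed in the method index, used here as an alternative to escaping by hand.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // "\\bcat\\b" in Java source is the regular expression \bcat\b,
    // i.e. the word "cat" between two word boundaries.
    static boolean hasWordCat(String input) {
        return Pattern.compile("\\bcat\\b").matcher(input).find();
    }

    // "\\(hello\\)" in Java source is the regular expression \(hello\),
    // which matches the literal text (hello).
    static boolean isParenHello(String input) {
        return Pattern.matches("\\(hello\\)", input);
    }

    // Pattern.quote builds a literal pattern, so metacharacters in the
    // argument lose their special meaning.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        System.out.println(hasWordCat("a cat sat"));           // true
        System.out.println(hasWordCat("concatenate"));         // false
        System.out.println(isParenHello("(hello)"));           // true
        System.out.println(matchesLiterally("a.b", "axb"));    // false
    }
}
```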
 890
 891 Character Classes
 892
Character classes may appear within other character classes, and may be
composed by the union operator (implicit) and the intersection operator (&&).
The union operator denotes a class that contains every character that is
in at least one of its operand classes. The intersection operator denotes a
class that contains every character that is in both of its operand classes.

The precedence of character-class operators is as follows, from highest to
lowest:

1    Literal escape    \x
2    Grouping          [...]
3    Range             a-z
4    Union             [a-e][i-u]
5    Intersection      [a-z&&[aeiou]]
 904
Note that a different set of metacharacters is in effect inside a character
class than outside a character class. For instance, the regular expression .
loses its special meaning inside a character class, while the expression -
becomes a range-forming metacharacter.
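
The union and intersection operators described above can be tried out with a small sketch. The class and helper names are hypothetical; the expressions are taken from the summary table.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Thin wrapper so each case reads as expression versus input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-d[m-p]]", "n"));    // true: union
        System.out.println(matches("[a-z&&[def]]", "e"));  // true: intersection
        System.out.println(matches("[a-z&&[def]]", "a"));  // false
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // false: subtraction
        // Inside a class the dot is an ordinary character.
        System.out.println(matches("[.]", "x"));           // false
        System.out.println(matches("[.]", "."));           // true
    }
}
```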
 909
 910 Line terminators
 911
A line terminator is a one- or two-character sequence that marks the end of a
line of the input character sequence. The following are recognized as line
terminators:

A newline (line feed) character ('\n'),

A carriage-return character followed immediately by a newline character
("\r\n"),

A standalone carriage-return character ('\r'),

A next-line character ('\u0085'),

A line-separator character ('\u2028'), or

A paragraph-separator character ('\u2029').
 930
If UNIX_LINES(|java.util.regex.Pattern|) mode is activated, then the only line
terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless
the DOTALL(|java.util.regex.Pattern|) flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and only
match at the beginning and the end, respectively, of the entire input sequence.
If MULTILINE(|java.util.regex.Pattern|) mode is activated then ^ matches at the
beginning of input and after any line terminator except at the end of input.
When in MULTILINE(|java.util.regex.Pattern|) mode $ matches just before a line
terminator or the end of the input sequence.
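
The effect of the three flags on line terminators can be checked with a short sketch. The class and method names are hypothetical; the flags are the MULTILINE, UNIX_LINES, and DOTALL constants of this class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineTerminatorDemo {
    // Counts how many whole lines consisting only of word characters are
    // found in input when the expression is compiled with the given flags.
    static int countWordLines(String input, int flags) {
        Matcher m = Pattern.compile("^\\w+$", flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Tests whether the dot in "a.b" matches a newline under the given flags.
    static boolean dotMatchesNewline(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        String text = "one\ntwo\r\nthree";
        // Default: ^ and $ only match at the ends of the whole input.
        System.out.println(countWordLines(text, 0));                  // 0
        // MULTILINE: every line terminator delimits a line.
        System.out.println(countWordLines(text, Pattern.MULTILINE));  // 3
        // With UNIX_LINES only '\n' terminates a line, so "two\r" is
        // no longer a line made purely of word characters.
        System.out.println(countWordLines(text,
                Pattern.MULTILINE | Pattern.UNIX_LINES));             // 2
        System.out.println(dotMatchesNewline(0));                     // false
        System.out.println(dotMatchesNewline(Pattern.DOTALL));        // true
    }
}
```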
 943
 944 Groups and capturing
 945
 946 Capturing groups are numbered by counting their opening parentheses from left
 947 to right. In the expression ((A)(B(C))), for example, there are four such
 948 groups:
 949
1    ((A)(B(C)))
2    (A)
3    (B(C))
4    (C)
 951
 952 Group zero always stands for the entire expression.
 953
 954 Capturing groups are so named because, during a match, each subsequence of the
 955 input sequence that matches such a group is saved. The captured subsequence may
 956 be used later in the expression, via a back reference, and may also be
 957 retrieved from the matcher once the match operation is complete.
 958
 959 The captured input associated with a group is always the subsequence that the
 960 group most recently matched. If a group is evaluated a second time because of
 961 quantification then its previously-captured value, if any, will be retained if
 962 the second evaluation fails. Matching the string "aba" against the expression
 963 (a(b)?)+, for example, leaves group two set to "b". All captured input is
 964 discarded at the beginning of each match.
 965
 966 Groups beginning with (? are pure, non-capturing groups that do not capture
 967 text and do not count towards the group total.
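
The numbering and retention rules above can be observed through the matcher. In this sketch the class and method names are hypothetical; groupCount() and group(int) are methods of the Matcher class.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Number of capturing groups in the expression (group zero excluded).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Returns groups 0..n after matching the entire input, or null if the
    // input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i < result.length; i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("((A)(B(C)))"));  // 4
        System.out.println(groupCount("(?:a)(b)"));     // 1
        // Prints [ABC, ABC, A, BC, C]
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // Group two keeps "b" from the first iteration: prints [aba, a, b]
        System.out.println(Arrays.toString(groups("(a(b)?)+", "aba")));
    }
}
```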
 968
 969 Unicode support
 970
 971 This class is in conformance with Level 1 of Unicode Technical Standard #18:
 972 Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.
 973
Unicode escape sequences such as \u2014 in Java source code are processed as
described in §3.3 of the Java Language Specification. Such escape sequences are
also implemented directly by the regular-expression parser so that Unicode
escapes can be used in expressions that are read from files or from the
keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into
the same pattern, which matches the character with hexadecimal value 0x2014.
 980
 981 Unicode blocks and categories are written with the \p and \P constructs as in
 982 Perl. \p{prop} matches if the input has the property prop, while \P{prop} does
 983 not match if the input has that property. Blocks are specified with the prefix
 984 In, as in InMongolian. Categories may be specified with the optional prefix Is:
 985 Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and
 986 categories can be used both inside and outside of a character class.
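
The block, category, and escape forms described above can be exercised with a short sketch. The class and helper names are hypothetical, and the sample characters (a Greek alpha and the character 0x2014) were chosen only for illustration.

```java
import java.util.regex.Pattern;

public class UnicodeDemo {
    // Thin wrapper so each case reads as expression versus input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1";  // GREEK SMALL LETTER ALPHA
        String dash = "\u2014";   // escape resolved by the Java compiler

        System.out.println(matches("\\p{InGreek}", alpha));        // true
        System.out.println(matches("\\P{InGreek}", "a"));          // true
        System.out.println(matches("\\p{Lu}", "A"));               // true
        System.out.println(matches("\\p{L}", "a"));                // true
        System.out.println(matches("\\p{IsL}", "a"));              // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));  // false
        // Here the escape is resolved by the regular-expression parser.
        System.out.println(matches("\\u2014", dash));              // true
    }
}
```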
 987
The supported categories are those of The Unicode Standard in the version
specified by the Character(|java.lang.Character|) class. The category names
are those defined in the Standard, both normative and informative. The block
names supported by Pattern are the valid block names accepted and defined by
UnicodeBlock.forName(|java.lang.Character.UnicodeBlock|).
 995
Categories that behave like the java.lang.Character boolean isMethodName
methods (except for the deprecated ones) are available through the same
\p{prop} syntax, where the specified property has the name javaMethodName;
for example, \p{javaLowerCase} corresponds to Character.isLowerCase().
 999
1000 Comparison to Perl 5
1001
1002 The Pattern engine performs traditional NFA-based matching with ordered
1003 alternation as occurs in Perl 5.
1004
Perl constructs not supported by this class:

The conditional constructs (?(condition)X) and (?(condition)X|Y),

The embedded code constructs (?{code}) and (??{code}),

The embedded comment syntax (?#comment), and

The preprocessing operations \l, \u, \L, and \U.
1016
1017
1018
1019 Constructs supported by this class but not by Perl:
1020
1021
1022
1023 Possessive quantifiers, which greedily match as much as they can and do not
1024 back off, even when doing so would allow the overall match to succeed.
1025
1026 Character-class union and intersection as described above.
1027
1028
1029
1030 Notable differences from Perl:
1031
1032
1033
1034 In Perl, \1 through \9 are always interpreted as back references; a
1035 backslash-escaped number greater than 9 is treated as a back reference if at
1036 least that many subexpressions exist, otherwise it is interpreted, if possible,
1037 as an octal escape. In this class octal escapes must always begin with a zero.
1038 In this class, \1 through \9 are always interpreted as back references, and a
1039 larger number is accepted as a back reference if at least that many
1040 subexpressions exist at that point in the regular expression, otherwise the
1041 parser will drop digits until the number is smaller or equal to the existing
1042 number of groups or it is one digit.
1043
Perl uses the g flag to request a match that resumes where the last match left
off. This functionality is provided implicitly by the
Matcher(|java.util.regex.Matcher|) class: repeated invocations of the
find(|java.util.regex.Matcher|) method will resume where the last match left
off, unless the matcher is reset.
1049
1050 In Perl, embedded flags at the top level of an expression affect the whole
1051 expression. In this class, embedded flags always take effect at the point at
1052 which they appear, whether they are at the top level or within a group; in the
1053 latter case, flags are restored at the end of the group just as in Perl.
1054
Perl is forgiving about malformed matching constructs, as in the expression *a,
as well as dangling brackets, as in the expression abc], and treats them as
literals. This class also accepts dangling brackets but is strict about
dangling metacharacters like +, ? and *, and will throw a
PatternSyntaxException(|java.util.regex.PatternSyntaxException|) if it
encounters them.
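
Two of the differences above, resuming with find() and strictness about dangling metacharacters, can be sketched as follows. The class and method names are hypothetical and serve only this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class FindDemo {
    // Collects every match; each find() call resumes where the previous
    // match ended, which plays the role of Perl's g flag.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<String>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Reports whether the expression is accepted by Pattern.compile.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1b22c333"));  // [1, 22, 333]
        System.out.println(compiles("abc]"));              // true
        System.out.println(compiles("*a"));                // false
    }
}
```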
1060
1061
1062
1063 For a more precise description of the behavior of regular expression
1064 constructs, please see Mastering Regular Expressions, 2nd Edition, Jeffrey E.
1065 F. Friedl, O'Reilly and Associates, 2002.
1066
1067
1068 *int_java.util.regex.Pattern.COMMENTS*
1069
1410
1411
1412 *int_java.util.regex.Pattern.DOTALL*
1413
1414 A compiled representation of a regular expression.
1415
1416 A regular expression, specified as a string, must first be compiled into an
1417 instance of this class. The resulting pattern can then be used to create a
(|java.util.regex.Matcher|) object that can match arbitrary character
sequences(|java.lang.CharSequence|) against the regular expression. All
1420 of the state involved in performing a match resides in the matcher, so many
1421 matchers can share the same pattern.
1422
1423 A typical invocation sequence is thus
1424
1425
1426
    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");
    boolean b = m.matches();
1430
1431 A matches(|java.util.regex.Pattern|) method is defined by this class as a
1432 convenience for when a regular expression is used just once. This method
1433 compiles an expression and matches an input sequence against it in a single
1434 invocation. The statement
1435
1436
1437
1438 boolean b = Pattern.matches("a*b", "aaaaab");
1439
1440 is equivalent to the three statements above, though for repeated matches it is
1441 less efficient since it does not allow the compiled pattern to be reused.
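
The trade-off described above can be sketched as follows. This is a minimal
illustration, not part of the original documentation; the class name ReuseDemo,
the constant A_STAR_B and the method countFullMatches are hypothetical names
chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ReuseDemo {
    // Compiled once and reused for every input (illustrative name).
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    /** Counts the inputs that match the whole expression. */
    static int countFullMatches(List<String> inputs) {
        int count = 0;
        for (String input : inputs) {
            // A fresh Matcher per input; the Pattern itself is shared.
            if (A_STAR_B.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countFullMatches(Arrays.asList("aaaaab", "b", "ba")));
        // One-off check: the expression is compiled again on every call.
        System.out.println(Pattern.matches("a*b", "aaaaab"));
    }
}
```

For a single check the static matches method is the shorter form; the compiled
constant is likely to pay off only when the same expression is applied many
times.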
1442
1443 Instances of this class are immutable and are safe for use by multiple
1444 concurrent threads. Instances of the (|java.util.regex.Matcher|) class are not
1445 safe for such use.
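
One way to respect this rule is to share the Pattern and create a Matcher
inside each call, so that no matcher is ever visible to two threads. The sketch
below assumes that arrangement; SharedPatternDemo and countWords are
hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SharedPatternDemo {
    // Safe to share between threads: Pattern instances are immutable.
    static final Pattern WORD = Pattern.compile("\\w+");

    static int countWords(CharSequence text) {
        // The Matcher holds the match state, so one is created per call.
        Matcher m = WORD.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> System.out.println(countWords("one two three")));
        Thread t2 = new Thread(() -> System.out.println(countWords("four five")));
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```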
1446
1447 Summary of regular-expression constructs
1448
1449
1450
1451 Construct Matches
1452
1453 Characters
1454
x        The character x
\\       The backslash character
\0n      The character with octal value 0n (0 <= n <= 7)
\0nn     The character with octal value 0nn (0 <= n <= 7)
\0mnn    The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh     The character with hexadecimal value 0xhh
\uhhhh   The character with hexadecimal value 0xhhhh
\t       The tab character ('\u0009')
\n       The newline (line feed) character ('\u000A')
\r       The carriage-return character ('\u000D')
\f       The form-feed character ('\u000C')
\a       The alert (bell) character ('\u0007')
\e       The escape character ('\u001B')
\cx      The control character corresponding to x
1463
1464 Character classes
1465
[abc]          a, b, or c (simple class)
[^abc]         Any character except a, b, or c (negation)
[a-zA-Z]       a through z or A through Z, inclusive (range)
[a-d[m-p]]     a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]   d, e, or f (intersection)
[a-z&&[^bc]]   a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]  a through z, and not m through p: [a-lq-z] (subtraction)
1471
1472 Predefined character classes
1473
.    Any character (may or may not match line terminators)
\d   A digit: [0-9]
\D   A non-digit: [^0-9]
\s   A whitespace character: [ \t\n\x0B\f\r]
\S   A non-whitespace character: [^\s]
\w   A word character: [a-zA-Z_0-9]
\W   A non-word character: [^\w]
1478
1479 POSIX character classes (US-ASCII only)
1480
\p{Lower}    A lower-case alphabetic character: [a-z]
\p{Upper}    An upper-case alphabetic character: [A-Z]
\p{ASCII}    All ASCII: [\x00-\x7F]
\p{Alpha}    An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}    A decimal digit: [0-9]
\p{Alnum}    An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}    Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    A visible character: [\p{Alnum}\p{Punct}]
\p{Print}    A printable character: [\p{Graph}\x20]
\p{Blank}    A space or a tab: [ \t]
\p{Cntrl}    A control character: [\x00-\x1F\x7F]
\p{XDigit}   A hexadecimal digit: [0-9a-fA-F]
\p{Space}    A whitespace character: [ \t\n\x0B\f\r]
1491
1492 java.lang.Character classes (simple java character type)
1493
1494 \p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
1495 \p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
1496 \p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
1497 \p{javaMirrored} Equivalent to java.lang.Character.isMirrored()
1498
1499 Classes for Unicode blocks and categories
1500
\p{InGreek}          A character in the Greek block (simple block)
\p{Lu}               An uppercase letter (simple category)
\p{Sc}               A currency symbol
\P{InGreek}          Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]   Any letter except an uppercase letter (subtraction)
1505
1506 Boundary matchers
1507
^    The beginning of a line
$    The end of a line
\b   A word boundary
\B   A non-word boundary
\A   The beginning of the input
\G   The end of the previous match
\Z   The end of the input but for the final terminator, if any
\z   The end of the input
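
The difference between \Z and \z, and the effect of \b, can be checked with a
small sketch such as the one below. BoundaryDemo and found are hypothetical
names used only for this illustration.

```java
import java.util.regex.Pattern;

final class BoundaryDemo {
    /** Returns true if the expression matches anywhere in the input. */
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // \Z tolerates one final line terminator, \z does not.
        System.out.println(found("foo\\Z", "foo\n"));           // true
        System.out.println(found("foo\\z", "foo\n"));           // false
        // \b requires a word boundary on both sides of "cat".
        System.out.println(found("\\bcat\\b", "a cat sat"));    // true
        System.out.println(found("\\bcat\\b", "concatenate"));  // false
    }
}
```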
1511
1512 Greedy quantifiers
1513
X?        X, once or not at all
X*        X, zero or more times
X+        X, one or more times
X{n}      X, exactly n times
X{n,}     X, at least n times
X{n,m}    X, at least n but not more than m times

Reluctant quantifiers

X??       X, once or not at all
X*?       X, zero or more times
X+?       X, one or more times
X{n}?     X, exactly n times
X{n,}?    X, at least n times
X{n,m}?   X, at least n but not more than m times

Possessive quantifiers

X?+       X, once or not at all
X*+       X, zero or more times
X++       X, one or more times
X{n}+     X, exactly n times
X{n,}+    X, at least n times
X{n,m}+   X, at least n but not more than m times
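
The three quantifier families accept the same repetition counts but differ in
how much input they take and whether they give it back. The following sketch
applies one quantifier of each kind to the same input; QuantifierDemo and
firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    /** Returns the first match of the expression in the input, or null. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: takes everything, then backs off until "foo" fits.
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo
        // Reluctant: takes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", input));  // xfoo
        // Possessive: takes everything and never backs off, so "foo" cannot match.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```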
1529
1530 Logical operators
1531
XY    X followed by Y
X|Y   Either X or Y
(X)   X, as a capturing group
1533
1534 Back references
1535
1536 \n Whatever the nth capturing group matched
1537
1538 Quotation
1539
\    Nothing, but quotes the following character
\Q   Nothing, but quotes all characters until \E
\E   Nothing, but ends quoting started by \Q
1542
1543 Special constructs (non-capturing)
1544
(?:X)               X, as a non-capturing group
(?idmsux-idmsux)    Nothing, but turns match flags on - off
(?idmsux-idmsux:X)  X, as a non-capturing group with the given flags on - off
(?=X)               X, via zero-width positive lookahead
(?!X)               X, via zero-width negative lookahead
(?<=X)              X, via zero-width positive lookbehind
(?<!X)              X, via zero-width negative lookbehind
(?>X)               X, as an independent, non-capturing group
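
The lookaround constructs match a position without consuming input, which is
easiest to see from what ends up in the matched text. A minimal sketch, with
LookaroundDemo and firstMatch as hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundDemo {
    /** Returns the first match of the expression in the input, or null. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive lookbehind: digits preceded by a dollar sign; the sign is not part of the match.
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42"));
        // Positive lookahead: digits followed by " dollars"; the suffix is not part of the match.
        System.out.println(firstMatch("\\d+(?= dollars)", "7 euros and 42 dollars"));
        // Negative lookahead: a q that is not followed by u.
        System.out.println(firstMatch("q(?!u)", "Iraq"));
        System.out.println(firstMatch("q(?!u)", "quit"));
    }
}
```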
1551
1552
1553
1554
1555
1556 Backslashes, escapes, and quoting
1557
1558 The backslash character ('\') serves to introduce escaped constructs, as
1559 defined in the table above, as well as to quote characters that otherwise would
1560 be interpreted as unescaped constructs. Thus the expression \\ matches a single
1561 backslash and \{ matches a left brace.
1562
1563 It is an error to use a backslash prior to any alphabetic character that does
1564 not denote an escaped construct; these are reserved for future extensions to
1565 the regular-expression language. A backslash may be used prior to a
1566 non-alphabetic character regardless of whether that character is part of an
1567 unescaped construct.
1568
Backslashes within string literals in Java source code are interpreted as
required by the Java Language Specification as either Unicode escapes or other
character escapes. It is therefore necessary to double backslashes in string
literals that represent regular expressions to protect them from interpretation
by the Java bytecode compiler. The string literal "\b", for example, matches a
single backspace character when interpreted as a regular expression, while
"\\b" matches a word boundary. The string literal "\(hello\)" is illegal and
leads to a compile-time error; in order to match the string (hello) the string
literal "\\(hello\\)" must be used.
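
The doubling rule can be tried out directly. The sketch below is illustrative
only; EscapeDemo and fullMatch are hypothetical names, while
Pattern.quote(String) is the method listed in this class's method summary.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    /** Returns true if the whole input matches the expression. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "\b" is the backspace character itself, so it matches a backspace.
        System.out.println(fullMatch("\b", "\b"));
        // "\\b" reaches the regex parser as \b, a word boundary.
        System.out.println(Pattern.compile("\\bcat\\b").matcher("a cat").find());
        // Escaped parentheses match literal parentheses.
        System.out.println(fullMatch("\\(hello\\)", "(hello)"));
        // Pattern.quote builds a literal pattern without manual escaping.
        System.out.println(fullMatch(Pattern.quote("(hello)"), "(hello)"));
    }
}
```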
1578
1579 Character Classes
1580
1581 Character classes may appear within other character classes, and may be
composed by the union operator (implicit) and the intersection operator (&&).
The union operator denotes a class that contains every character that is
1584 in at least one of its operand classes. The intersection operator denotes a
1585 class that contains every character that is in both of its operand classes.
1586
1587 The precedence of character-class operators is as follows, from highest to
1588 lowest:
1589
1  Literal escape  \x
2  Grouping        [...]
3  Range           a-z
4  Union           [a-e][i-u]
5  Intersection    [a-z&&[aeiou]]
1592
Note that a different set of metacharacters is in effect inside a character
class than outside a character class. For instance, the regular expression .
loses its special meaning inside a character class, while the expression -
becomes a range-forming metacharacter.
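
Union, intersection and subtraction can be checked one character at a time. The
sketch below uses the class expressions from the summary table; CharClassDemo
and in are hypothetical names.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    /** Returns true if the single-character input belongs to the class expression. */
    static boolean in(String charClass, String ch) {
        return Pattern.matches(charClass, ch);
    }

    public static void main(String[] args) {
        // Union: a through d, or m through p.
        System.out.println(in("[a-d[m-p]]", "n") + " " + in("[a-d[m-p]]", "e"));
        // Intersection: only d, e, or f.
        System.out.println(in("[a-z&&[def]]", "e") + " " + in("[a-z&&[def]]", "a"));
        // Subtraction through intersection with a negated class.
        System.out.println(in("[a-z&&[^bc]]", "d") + " " + in("[a-z&&[^bc]]", "b"));
        // Inside a class the dot is a literal dot.
        System.out.println(in("[.]", ".") + " " + in("[.]", "x"));
    }
}
```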
1597
1598 Line terminators
1599
1600 A line terminator is a one- or two-character sequence that marks the end of a
1601 line of the input character sequence. The following are recognized as line
1602 terminators:
1603
1604
1605
A newline (line feed) character ('\n'),

A carriage-return character followed immediately by a newline character
("\r\n"),

A standalone carriage-return character ('\r'),

A next-line character ('\u0085'),

A line-separator character ('\u2028'), or

A paragraph-separator character ('\u2029').
1618
If UNIX_LINES(|java.util.regex.Pattern|) mode is activated, then the only line
terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless
the DOTALL(|java.util.regex.Pattern|) flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and only
match at the beginning and the end, respectively, of the entire input sequence.
If MULTILINE(|java.util.regex.Pattern|) mode is activated then ^ matches at the
beginning of input and after any line terminator except at the end of input.
When in MULTILINE(|java.util.regex.Pattern|) mode $ matches just before a line
terminator or the end of the input sequence.
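
The effect of the MULTILINE, DOTALL and UNIX_LINES flags on ^, $ and . can be
observed with a short sketch. LineModeDemo, countLines and dotMatches are
hypothetical names; the flags are the constants defined by this class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineModeDemo {
    /** Counts matches of ^\w+$ in the input under the given flags. */
    static int countLines(String input, int flags) {
        Matcher m = Pattern.compile("^\\w+$", flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Returns true if a.b matches the whole input under the given flags. */
    static boolean dotMatches(String input, int flags) {
        return Pattern.compile("a.b", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "one\ntwo\r\nthree";
        System.out.println(countLines(text, 0));                    // 0
        System.out.println(countLines(text, Pattern.MULTILINE));    // 3
        System.out.println(dotMatches("a\nb", 0));                  // false
        System.out.println(dotMatches("a\nb", Pattern.DOTALL));     // true
        // With UNIX_LINES a carriage return is no longer a line terminator.
        System.out.println(dotMatches("a\rb", 0));                  // false
        System.out.println(dotMatches("a\rb", Pattern.UNIX_LINES)); // true
    }
}
```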
1631
1632 Groups and capturing
1633
1634 Capturing groups are numbered by counting their opening parentheses from left
1635 to right. In the expression ((A)(B(C))), for example, there are four such
1636 groups:
1637
1  ((A)(B(C)))
2  (A)
3  (B(C))
4  (C)
1639
1640 Group zero always stands for the entire expression.
1641
1642 Capturing groups are so named because, during a match, each subsequence of the
1643 input sequence that matches such a group is saved. The captured subsequence may
1644 be used later in the expression, via a back reference, and may also be
1645 retrieved from the matcher once the match operation is complete.
1646
1647 The captured input associated with a group is always the subsequence that the
1648 group most recently matched. If a group is evaluated a second time because of
1649 quantification then its previously-captured value, if any, will be retained if
1650 the second evaluation fails. Matching the string "aba" against the expression
1651 (a(b)?)+, for example, leaves group two set to "b". All captured input is
1652 discarded at the beginning of each match.
1653
1654 Groups beginning with (? are pure, non-capturing groups that do not capture
1655 text and do not count towards the group total.
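
Group numbering and the retained capture from the "aba" example can be
inspected through the matcher. The following is a minimal sketch; GroupDemo and
groups are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    /** Returns groups 0..groupCount() after a full match, or an empty array. */
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i < result.length; i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
        // Group two keeps "b" although its last evaluation matched nothing.
        System.out.println(Arrays.toString(groups("(a(b)?)+", "aba")));
        // A group beginning with (?: does not count towards the total.
        System.out.println(Pattern.compile("(?:a)(b)").matcher("ab").groupCount());
    }
}
```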
1656
1657 Unicode support
1658
1659 This class is in conformance with Level 1 of Unicode Technical Standard #18:
1660 Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.
1661
Unicode escape sequences such as \u2014 in Java source code are processed as
described in §3.3 of the Java Language Specification. Such escape sequences are
also implemented directly by the regular-expression parser so that Unicode
escapes can be used in expressions that are read from files or from the
keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into
the same pattern, which matches the character with hexadecimal value 0x2014.
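
The two spellings can be compared in a short sketch. In the first literal the
Java compiler performs the substitution; in the second the six characters reach
the regular-expression parser unchanged. UnicodeEscapeDemo and matchesDash are
hypothetical names.

```java
import java.util.regex.Pattern;

final class UnicodeEscapeDemo {
    // The single character U+2014, written as an escape to keep the source ASCII-only.
    static final String DASH = "\u2014";

    /** Returns true if the expression matches exactly the character U+2014. */
    static boolean matchesDash(String regex) {
        return Pattern.matches(regex, DASH);
    }

    public static void main(String[] args) {
        String compilerEscape = "\u2014";   // one character after compilation
        String parserEscape = "\\u2014";    // backslash, u, 2, 0, 1, 4
        System.out.println(compilerEscape.equals(parserEscape));  // false
        System.out.println(matchesDash(compilerEscape));          // true
        System.out.println(matchesDash(parserEscape));            // true
    }
}
```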
1668
1669 Unicode blocks and categories are written with the \p and \P constructs as in
1670 Perl. \p{prop} matches if the input has the property prop, while \P{prop} does
1671 not match if the input has that property. Blocks are specified with the prefix
1672 In, as in InMongolian. Categories may be specified with the optional prefix Is:
1673 Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and
1674 categories can be used both inside and outside of a character class.
1675
The supported categories are those of The Unicode Standard in the version
specified by the Character(|java.lang.Character|) class. The category names are
those defined in the Standard, both normative and informative. The block names
supported by Pattern are the valid block names accepted and defined by
UnicodeBlock.forName(|java.lang.Character.UnicodeBlock|).
1683
1684 Categories that behave like the java.lang.Character boolean ismethodname
1685 methods (except for the deprecated ones) are available through the same
1686 \p{prop} syntax where the specified property has the name javamethodname.
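
The block, category and java.lang.Character property syntaxes can be exercised
as below. This is a sketch only; UnicodeClassDemo and is are hypothetical
names, and the non-ASCII test characters are written as escapes to keep the
source ASCII-only.

```java
import java.util.regex.Pattern;

final class UnicodeClassDemo {
    /** Returns true if the single-character input has the given property expression. */
    static boolean is(String property, String ch) {
        return Pattern.matches(property, ch);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";  // Latin small letter e with acute
        String alpha = "\u03b1";   // Greek small letter alpha
        // The Is prefix on a category is optional.
        System.out.println(is("\\p{L}", eAcute) + " " + is("\\p{IsL}", eAcute));
        // Blocks use the In prefix; \P negates.
        System.out.println(is("\\p{InGreek}", alpha) + " " + is("\\P{InGreek}", "a"));
        // Properties backed by java.lang.Character methods.
        System.out.println(is("\\p{javaLowerCase}", "a") + " " + is("\\p{javaLowerCase}", "A"));
        // Categories inside a character class: any letter except an uppercase one.
        System.out.println(is("[\\p{L}&&[^\\p{Lu}]]", "a") + " " + is("[\\p{L}&&[^\\p{Lu}]]", "A"));
    }
}
```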
1687
1688 Comparison to Perl 5
1689
1690 The Pattern engine performs traditional NFA-based matching with ordered
1691 alternation as occurs in Perl 5.
1692
1693 Perl constructs not supported by this class:
1694
1695
1696
The conditional constructs (?(condition)X) and (?(condition)X|Y),

The embedded code constructs (?{code}) and (??{code}),

The embedded comment syntax (?#comment), and

The preprocessing operations \l, \u, \L, and \U.
1704
1705
1706
1707 Constructs supported by this class but not by Perl:
1708
1709
1710
1711 Possessive quantifiers, which greedily match as much as they can and do not
1712 back off, even when doing so would allow the overall match to succeed.
1713
1714 Character-class union and intersection as described above.
1715
1716
1717
1718 Notable differences from Perl:
1719
1720
1721
1722 In Perl, \1 through \9 are always interpreted as back references; a
1723 backslash-escaped number greater than 9 is treated as a back reference if at
1724 least that many subexpressions exist, otherwise it is interpreted, if possible,
1725 as an octal escape. In this class octal escapes must always begin with a zero.
In this class, \1 through \9 are always interpreted as back references, and a
larger number is accepted as a back reference if at least that many
subexpressions exist at that point in the regular expression; otherwise the
parser will drop digits until the number is smaller than or equal to the
existing number of groups or it is one digit.
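
The digit-dropping rule is easiest to see with an expression that has a single
group. In the sketch below, \11 cannot refer to group eleven, so it is read as
a reference to group one followed by the literal character 1. BackrefDemo and
fullMatch are hypothetical names.

```java
import java.util.regex.Pattern;

final class BackrefDemo {
    /** Returns true if the whole input matches the expression. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \1 repeats whatever group one matched.
        System.out.println(fullMatch("(a)\\1", "aa"));
        // Only one group exists, so \11 is \1 followed by a literal 1.
        System.out.println(fullMatch("(a)\\11", "aa1"));
        // An octal escape must start with a zero: \011 is the tab character.
        System.out.println(fullMatch("\\011", "\t"));
    }
}
```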
1731
1732 Perl uses the g flag to request a match that resumes where the last match left
1733 off. This functionality is provided implicitly by the
1734 (|java.util.regex.Matcher|) class: Repeated invocations of the
1735 find(|java.util.regex.Matcher|) method will resume where the last match left
1736 off, unless the matcher is reset.
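
A loop over find() therefore plays the role of Perl's g flag. The sketch below
collects every match in order; FindAllDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindAllDemo {
    /** Collects every match of the expression, in the order find() reports them. */
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<String>();
        // Each call to find() resumes where the previous match ended.
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1b22c333"));  // [1, 22, 333]

        Matcher m = Pattern.compile("\\d+").matcher("a1b22c333");
        m.find();
        m.find();
        m.reset();  // the next find() starts from the beginning again
        m.find();
        System.out.println(m.group());  // 1
    }
}
```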
1737
1738 In Perl, embedded flags at the top level of an expression affect the whole
1739 expression. In this class, embedded flags always take effect at the point at
1740 which they appear, whether they are at the top level or within a group; in the
1741 latter case, flags are restored at the end of the group just as in Perl.
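
The position-dependent behavior of embedded flags can be seen with the
case-insensitivity flag. The sketch below is illustrative; EmbeddedFlagDemo and
fullMatch are hypothetical names.

```java
import java.util.regex.Pattern;

final class EmbeddedFlagDemo {
    /** Returns true if the whole input matches the expression. */
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // (?i) appears after the a, so only the b is case-insensitive.
        System.out.println(fullMatch("a(?i)b", "aB"));    // true
        System.out.println(fullMatch("a(?i)b", "AB"));    // false
        // Inside a group the flag is restored when the group ends.
        System.out.println(fullMatch("((?i)a)b", "Ab"));  // true
        System.out.println(fullMatch("((?i)a)b", "AB"));  // false
    }
}
```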
1742
1743 Perl is forgiving about malformed matching constructs, as in the expression *a,
1744 as well as dangling brackets, as in the expression abc], and treats them as
1745 literals. This class also accepts dangling brackets but is strict about
1746 dangling metacharacters like +, ? and *, and will throw a
1747 (|java.util.regex.PatternSyntaxException|) if it encounters them.
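
Because the exception is thrown at compile time, an expression from an
untrusted source can be validated before use. A minimal sketch, with
SyntaxCheckDemo and compiles as hypothetical names:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class SyntaxCheckDemo {
    /** Returns true if the expression compiles without a syntax error. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A dangling * is rejected.
        System.out.println(compiles("*a"));                  // false
        // A dangling ] is accepted and treated as a literal.
        System.out.println(compiles("abc]"));                // true
        System.out.println(Pattern.matches("abc]", "abc]")); // true
    }
}
```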
1748
1749
1750
1751 For a more precise description of the behavior of regular expression
1752 constructs, please see Mastering Regular Expressions, 2nd Edition, Jeffrey E.
1753 F. Friedl, O'Reilly and Associates, 2002.
1754
1755
1756 *int_java.util.regex.Pattern.LITERAL*
1757


*int_java.util.regex.Pattern.MULTILINE*

A compiled representation of a regular expression.

A regular expression, specified as a string, must first be compiled into an
instance of this class. The resulting pattern can then be used to create a
Matcher(|java.util.regex.Matcher|) object that can match arbitrary character
sequences(|java.lang.CharSequence|) against the regular expression. All of the
state involved in performing a match resides in the matcher, so many matchers
can share the same pattern.

A typical invocation sequence is thus

Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaab");
boolean b = m.matches();

A matches(|java.util.regex.Pattern|) method is defined by this class as a
convenience for when a regular expression is used just once. This method
compiles an expression and matches an input sequence against it in a single
invocation. The statement

boolean b = Pattern.matches("a*b", "aaaaab");

is equivalent to the three statements above, though for repeated matches it is
less efficient since it does not allow the compiled pattern to be reused.
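
The trade-off described above can be sketched as follows. This is only an
illustration under stated assumptions: the class name PatternReuseSketch and
the helper countMatches are hypothetical, and the only library calls used are
Pattern.compile, Pattern.matcher, Matcher.matches and Pattern.matches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class PatternReuseSketch {
    // Compiled once and shared; Pattern instances are immutable.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Counts the inputs that match the whole expression, reusing the
    // compiled pattern and creating a fresh matcher per input.
    static int countMatches(List<String> inputs) {
        int count = 0;
        for (String input : inputs) {
            if (A_STAR_B.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(Arrays.asList("aaaaab", "b", "abc")));
        // One-off form: the expression is compiled again on every call.
        System.out.println(Pattern.matches("a*b", "aaaaab"));
    }
}
```

Keeping the compiled pattern in a static final field is one common way to get
the reuse; any scope that outlives the individual match calls would do.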

Instances of this class are immutable and are safe for use by multiple
concurrent threads. Instances of the Matcher(|java.util.regex.Matcher|) class
are not safe for such use.

Summary of regular-expression constructs

Construct    Matches

Characters

x            The character x
\\           The backslash character
\0n          The character with octal value 0n (0 <= n <= 7)
\0nn         The character with octal value 0nn (0 <= n <= 7)
\0mnn        The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh         The character with hexadecimal value 0xhh
\uhhhh       The character with hexadecimal value 0xhhhh
\t           The tab character ('\u0009')
\n           The newline (line feed) character ('\u000A')
\r           The carriage-return character ('\u000D')
\f           The form-feed character ('\u000C')
\a           The alert (bell) character ('\u0007')
\e           The escape character ('\u001B')
\cx          The control character corresponding to x

Character classes

[abc]          a, b, or c (simple class)
[^abc]         Any character except a, b, or c (negation)
[a-zA-Z]       a through z or A through Z, inclusive (range)
[a-d[m-p]]     a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]   d, e, or f (intersection)
[a-z&&[^bc]]   a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]  a through z, and not m through p: [a-lq-z] (subtraction)

Predefined character classes

.     Any character (may or may not match line terminators)
\d    A digit: [0-9]
\D    A non-digit: [^0-9]
\s    A whitespace character: [ \t\n\x0B\f\r]
\S    A non-whitespace character: [^\s]
\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]

POSIX character classes (US-ASCII only)

\p{Lower}    A lower-case alphabetic character: [a-z]
\p{Upper}    An upper-case alphabetic character: [A-Z]
\p{ASCII}    All ASCII: [\x00-\x7F]
\p{Alpha}    An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}    A decimal digit: [0-9]
\p{Alnum}    An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}    Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    A visible character: [\p{Alnum}\p{Punct}]
\p{Print}    A printable character: [\p{Graph}\x20]
\p{Blank}    A space or a tab: [ \t]
\p{Cntrl}    A control character: [\x00-\x1F\x7F]
\p{XDigit}   A hexadecimal digit: [0-9a-fA-F]
\p{Space}    A whitespace character: [ \t\n\x0B\f\r]

java.lang.Character classes (simple java character type)

\p{javaLowerCase}    Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}    Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}   Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}     Equivalent to java.lang.Character.isMirrored()

Classes for Unicode blocks and categories

\p{InGreek}          A character in the Greek block (simple block)
\p{Lu}               An uppercase letter (simple category)
\p{Sc}               A currency symbol
\P{InGreek}          Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]   Any letter except an uppercase letter (subtraction)

Boundary matchers

^     The beginning of a line
$     The end of a line
\b    A word boundary
\B    A non-word boundary
\A    The beginning of the input
\G    The end of the previous match
\Z    The end of the input but for the final terminator, if any
\z    The end of the input

Greedy quantifiers

X?        X, once or not at all
X*        X, zero or more times
X+        X, one or more times
X{n}      X, exactly n times
X{n,}     X, at least n times
X{n,m}    X, at least n but not more than m times

Reluctant quantifiers

X??       X, once or not at all
X*?       X, zero or more times
X+?       X, one or more times
X{n}?     X, exactly n times
X{n,}?    X, at least n times
X{n,m}?   X, at least n but not more than m times

Possessive quantifiers

X?+       X, once or not at all
X*+       X, zero or more times
X++       X, one or more times
X{n}+     X, exactly n times
X{n,}+    X, at least n times
X{n,m}+   X, at least n but not more than m times

Logical operators

XY     X followed by Y
X|Y    Either X or Y
(X)    X, as a capturing group

Back references

\n     Whatever the nth capturing group matched

Quotation

\      Nothing, but quotes the following character
\Q     Nothing, but quotes all characters until \E
\E     Nothing, but ends quoting started by \Q

Special constructs (non-capturing)

(?:X)                X, as a non-capturing group
(?idmsux-idmsux)     Nothing, but turns match flags on - off
(?idmsux-idmsux:X)   X, as a non-capturing group with the given flags on - off
(?=X)                X, via zero-width positive lookahead
(?!X)                X, via zero-width negative lookahead
(?<=X)               X, via zero-width positive lookbehind
(?<!X)               X, via zero-width negative lookbehind
(?>X)                X, as an independent, non-capturing group

Backslashes, escapes, and quoting

The backslash character ('\') serves to introduce escaped constructs, as
defined in the table above, as well as to quote characters that otherwise would
be interpreted as unescaped constructs. Thus the expression \\ matches a single
backslash and \{ matches a left brace.

It is an error to use a backslash prior to any alphabetic character that does
not denote an escaped construct; these are reserved for future extensions to
the regular-expression language. A backslash may be used prior to a
non-alphabetic character regardless of whether that character is part of an
unescaped construct.

Backslashes within string literals in Java source code are interpreted as
required by the Java Language Specification as either Unicode escapes or other
character escapes. It is therefore necessary to double backslashes in string
literals that represent regular expressions to protect them from interpretation
by the Java bytecode compiler. The string literal "\b", for example, matches a
single backspace character when interpreted as a regular expression, while
"\\b" matches a word boundary. The string literal "\(hello\)" is illegal and
leads to a compile-time error; in order to match the string (hello) the string
literal "\\(hello\\)" must be used.
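
The doubling rule can be checked with a short program. This is a sketch under
stated assumptions: the class BackslashSketch and its three helper methods are
hypothetical names introduced for illustration, and only Pattern.compile,
Pattern.matches and Matcher.find are library calls.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class BackslashSketch {
    // The literal "\\bcat\\b" reaches the regex parser as \bcat\b,
    // so cat must be delimited by word boundaries.
    static boolean containsWordCat(String input) {
        return Pattern.compile("\\bcat\\b").matcher(input).find();
    }

    // The literal "\\(hello\\)" reaches the parser as \(hello\),
    // which matches the parentheses literally.
    static boolean isParenthesizedHello(String input) {
        return Pattern.matches("\\(hello\\)", input);
    }

    // The literal "\b" is a single backspace character by the time the
    // parser sees it, so the pattern matches exactly that character.
    static boolean isBackspace(String input) {
        return Pattern.matches("\b", input);
    }

    public static void main(String[] args) {
        System.out.println(containsWordCat("a cat sat"));
        System.out.println(containsWordCat("concatenate"));
        System.out.println(isParenthesizedHello("(hello)"));
        System.out.println(isBackspace("\b"));
    }
}
```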

Character Classes

Character classes may appear within other character classes, and may be
composed by the union operator (implicit) and the intersection operator (&&).
The union operator denotes a class that contains every character that is in at
least one of its operand classes. The intersection operator denotes a class
that contains every character that is in both of its operand classes.

The precedence of character-class operators is as follows, from highest to
lowest:

1    Literal escape    \x
2    Grouping          [...]
3    Range             a-z
4    Union             [a-e][i-u]
5    Intersection      [a-z&&[aeiou]]

Note that a different set of metacharacters is in effect inside a character
class than outside a character class. For instance, the regular expression .
loses its special meaning inside a character class, while the expression -
becomes a range-forming metacharacter.
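
A small program can confirm how union, intersection (written && in this
class) and subtraction behave. This is a hedged sketch: CharClassSketch and
inClass are hypothetical names, and each check tests a single character
against a class with Pattern.matches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class CharClassSketch {
    // Tests one character against a character-class expression.
    static boolean inClass(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // Union: a nested class adds its members.
        System.out.println(inClass("[a-d[m-p]]", 'n'));
        // Intersection: only characters present in both operands.
        System.out.println(inClass("[a-z&&[def]]", 'e'));
        // Subtraction: intersection with a negated class.
        System.out.println(inClass("[a-z&&[^bc]]", 'b'));
        // Inside a class the dot is an ordinary character.
        System.out.println(inClass("[.]", 'x'));
    }
}
```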

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a
line of the input character sequence. The following are recognized as line
terminators:

A newline (line feed) character ('\n'),

A carriage-return character followed immediately by a newline character
("\r\n"),

A standalone carriage-return character ('\r'),

A next-line character ('\u0085'),

A line-separator character ('\u2028'), or

A paragraph-separator character ('\u2029').

If UNIX_LINES(|java.util.regex.Pattern|) mode is activated, then the only line
terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless
the DOTALL(|java.util.regex.Pattern|) flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and only
match at the beginning and the end, respectively, of the entire input sequence.
If MULTILINE(|java.util.regex.Pattern|) mode is activated then ^ matches at the
beginning of input and after any line terminator except at the end of input.
When in MULTILINE(|java.util.regex.Pattern|) mode $ matches just before a line
terminator or the end of the input sequence.
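
The effect of the three line-related flags (Pattern.MULTILINE, Pattern.DOTALL
and Pattern.UNIX_LINES) can be observed by counting matches with and without
each flag. The sketch below is illustrative only: LineModeSketch and
countFinds are hypothetical names, and the inputs are made up for the
demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class LineModeSketch {
    // Counts how often the expression is found under the given flags.
    static int countFinds(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String twoLines = "one\ntwo";
        // Default: ^ and $ anchor to the whole input only.
        System.out.println(countFinds("^\\w+$", 0, twoLines));
        // MULTILINE: ^ and $ also match around line terminators.
        System.out.println(countFinds("^\\w+$", Pattern.MULTILINE, twoLines));
        // DOTALL: the dot also matches a line terminator.
        System.out.println(countFinds("a.b", Pattern.DOTALL, "a\nb"));
        // UNIX_LINES: a carriage return no longer counts as a terminator.
        System.out.println(countFinds("a.b", Pattern.UNIX_LINES, "a\rb"));
    }
}
```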

Groups and capturing

Capturing groups are numbered by counting their opening parentheses from left
to right. In the expression ((A)(B(C))), for example, there are four such
groups:

1    ((A)(B(C)))
2    (A)
3    (B(C))
4    (C)

Group zero always stands for the entire expression.

Capturing groups are so named because, during a match, each subsequence of the
input sequence that matches such a group is saved. The captured subsequence may
be used later in the expression, via a back reference, and may also be
retrieved from the matcher once the match operation is complete.

The captured input associated with a group is always the subsequence that the
group most recently matched. If a group is evaluated a second time because of
quantification then its previously-captured value, if any, will be retained if
the second evaluation fails. Matching the string "aba" against the expression
(a(b)?)+, for example, leaves group two set to "b". All captured input is
discarded at the beginning of each match.

Groups beginning with (? are pure, non-capturing groups that do not capture
text and do not count towards the group total.
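
Group numbering and the retained-capture rule can be inspected through
Matcher.group(int). The following is a minimal sketch: GroupSketch and
groupOf are hypothetical names, and the helper simply throws if the input does
not match as a whole.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class GroupSketch {
    // Matches the whole input and returns the text captured by one group.
    static String groupOf(String regex, String input, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("input does not match: " + input);
        }
        return m.group(index);
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        for (int i = 0; i <= 4; i++) {
            System.out.println(i + ": " + groupOf("((A)(B(C)))", "ABC", i));
        }
        // The second pass through the outer group does not match (b),
        // so group two keeps the value captured on the first pass.
        System.out.println(groupOf("(a(b)?)+", "aba", 2));
        // (?:...) does not capture, so (c) is group one here.
        System.out.println(groupOf("(?:ab)(c)", "abc", 1));
    }
}
```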

Unicode support

This class is in conformance with Level 1 of Unicode Technical Standard #18:
Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.

Unicode escape sequences such as \u2014 in Java source code are processed as
described in §3.3 of the Java Language Specification. Such escape sequences are
also implemented directly by the regular-expression parser so that Unicode
escapes can be used in expressions that are read from files or from the
keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile
into the same pattern, which matches the character with hexadecimal value
0x2014.
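
The two spellings of the escape can be compared directly. This is a sketch
with hypothetical names (UnicodeEscapeSketch, bothFormsMatch); it relies only
on Pattern.matches and String.equals.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class UnicodeEscapeSketch {
    // Returns true when the two strings differ but both patterns
    // match the single character with value 0x2014.
    static boolean bothFormsMatch() {
        // The compiler has already replaced this escape: one character.
        String expandedByCompiler = "\u2014";
        // Six characters: the regex parser expands the escape itself.
        String expandedByParser = "\\u2014";
        String target = String.valueOf((char) 0x2014);
        return !expandedByCompiler.equals(expandedByParser)
                && Pattern.matches(expandedByCompiler, target)
                && Pattern.matches(expandedByParser, target);
    }

    public static void main(String[] args) {
        System.out.println(bothFormsMatch());
    }
}
```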

Unicode blocks and categories are written with the \p and \P constructs as in
Perl. \p{prop} matches if the input has the property prop, while \P{prop} does
not match if the input has that property. Blocks are specified with the prefix
In, as in InMongolian. Categories may be specified with the optional prefix Is:
both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and
categories can be used both inside and outside of a character class.
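
Property names can be tried out one code point at a time. The sketch below is
an illustration, not a reference: UnicodePropertySketch and hasProperty are
hypothetical names, and the helper just wraps the name in \p{...} before
calling Pattern.matches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class UnicodePropertySketch {
    // Tests a single code point against the named block or category.
    static boolean hasProperty(String prop, int codePoint) {
        String input = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{" + prop + "}", input);
    }

    public static void main(String[] args) {
        // The Is prefix is optional for categories.
        System.out.println(hasProperty("L", 'a') + " " + hasProperty("IsL", 'a'));
        // Blocks use the In prefix; 0x03B1 is GREEK SMALL LETTER ALPHA.
        System.out.println(hasProperty("InGreek", 0x03B1));
        // Properties can also be combined inside a character class.
        System.out.println(Pattern.matches("[\\p{L}&&[^\\p{Lu}]]", "a"));
    }
}
```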

The supported categories are those of The Unicode Standard in the version
specified by the Character(|java.lang.Character|) class. The category names are
those defined in the Standard, both normative and informative. The block names
supported by Pattern are the valid block names accepted and defined by
UnicodeBlock.forName(|java.lang.Character.UnicodeBlock|).

Categories that behave like the java.lang.Character boolean isMethodName
methods (except for the deprecated ones) are available through the same
\p{prop} syntax where the specified property has the name javaMethodName.

Comparison to Perl 5

The Pattern engine performs traditional NFA-based matching with ordered
alternation as occurs in Perl 5.

Perl constructs not supported by this class:

The conditional constructs (?{X}) and (?(condition)X|Y),

The embedded code constructs (?{code}) and (??{code}),

The embedded comment syntax (?#comment), and

The preprocessing operations \l \u, \L, and \U.

Constructs supported by this class but not by Perl:

Possessive quantifiers, which greedily match as much as they can and do not
back off, even when doing so would allow the overall match to succeed.

Character-class union and intersection as described above.
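
The "do not back off" behavior of possessive quantifiers is easiest to see
next to the greedy and reluctant forms of the same expression. This is a
minimal sketch; PossessiveSketch and wholeMatch are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of java.util.regex.
class PossessiveSketch {
    // Whole-input match with a throwaway pattern.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Greedy: a* takes all three a's, then gives one back for the final a.
        System.out.println(wholeMatch("a*a", "aaa"));
        // Reluctant: a*? grows only as far as needed.
        System.out.println(wholeMatch("a*?a", "aaa"));
        // Possessive: a*+ keeps all three a's, so the final a cannot match.
        System.out.println(wholeMatch("a*+a", "aaa"));
    }
}
```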

Notable differences from Perl:

In Perl, \1 through \9 are always interpreted as back references; a
backslash-escaped number greater than 9 is treated as a back reference if at
least that many subexpressions exist, otherwise it is interpreted, if possible,
as an octal escape. In this class octal escapes must always begin with a zero.
In this class, \1 through \9 are always interpreted as back references, and a
larger number is accepted as a back reference if at least that many
subexpressions exist at that point in the regular expression, otherwise the
parser will drop digits until the number is smaller than or equal to the
existing number of groups or it is one digit.

Perl uses the g flag to request a match that resumes where the last match left
off. This functionality is provided implicitly by the
Matcher(|java.util.regex.Matcher|) class: repeated invocations of the
find(|java.util.regex.Matcher|) method will resume where the last match left
off, unless the matcher is reset.

In Perl, embedded flags at the top level of an expression affect the whole
expression. In this class, embedded flags always take effect at the point at
which they appear, whether they are at the top level or within a group; in the
latter case, flags are restored at the end of the group just as in Perl.

Perl is forgiving about malformed matching constructs, as in the expression *a,
as well as dangling brackets, as in the expression abc], and treats them as
literals. This class also accepts dangling brackets but is strict about
dangling metacharacters like +, ? and *, and will throw a
PatternSyntaxException(|java.util.regex.PatternSyntaxException|) if it
encounters them.
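
The four differences listed above can each be exercised with a one-line
check. The program below is a sketch under stated assumptions:
PerlDifferencesSketch and its helpers matches, findAll and compiles are
hypothetical names, and the sample expressions were chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of java.util.regex.
class PerlDifferencesSketch {
    // Whole-input match with a throwaway pattern.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Collects every match; each find() call resumes where the previous
    // match ended, which covers the role of Perl's g flag.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Reports whether the expression is accepted by the parser.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // With one group, \11 is back reference 1 followed by a literal 1.
        System.out.println(matches("(a)\\11", "aa1"));
        // Octal escapes need the leading zero: \0101 is the letter A.
        System.out.println(matches("\\0101", "A"));
        System.out.println(findAll("\\d+", "a1b22c333"));
        // (?i) affects only what follows it.
        System.out.println(matches("a(?i)b", "aB") + " " + matches("a(?i)b", "AB"));
        // A flag set inside a group is restored when the group ends.
        System.out.println(matches("(?:(?i)a)b", "Ab") + " " + matches("(?:(?i)a)b", "AB"));
        // Dangling metacharacter is rejected; dangling bracket is accepted.
        System.out.println(compiles("*a") + " " + compiles("abc]"));
    }
}
```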

For a more precise description of the behavior of regular expression
constructs, please see Mastering Regular Expressions, 2nd Edition, Jeffrey E.
F. Friedl, O'Reilly and Associates, 2002.


*int_java.util.regex.Pattern.UNICODE_CASE*


*int_java.util.regex.Pattern.UNIX_LINES*

A compiled representation of a regular expression.

A regular expression, specified as a string, must first be compiled into an
instance of this class. The resulting pattern can then be used to create a
Matcher(|java.util.regex.Matcher|) object that can match arbitrary character
sequences(|java.lang.CharSequence|) against the regular expression. All of the
state involved in performing a match resides in the matcher, so many matchers
can share the same pattern.
2798
A typical invocation sequence is thus

    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");
    boolean b = m.matches();
2806
2807 A matches(|java.util.regex.Pattern|) method is defined by this class as a
2808 convenience for when a regular expression is used just once. This method
2809 compiles an expression and matches an input sequence against it in a single
2810 invocation. The statement
2811
2812
2813
2814 boolean b = Pattern.matches("a*b", "aaaaab");
2815
2816 is equivalent to the three statements above, though for repeated matches it is
2817 less efficient since it does not allow the compiled pattern to be reused.
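
A brief sketch of the reuse that the paragraph above recommends: the pattern is compiled once and a fresh matcher is created per input. ReuseSketch, AB and countMatching are hypothetical names used only for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ReuseSketch {
    // Compiled once; Pattern instances are immutable and can be shared.
    private static final Pattern AB = Pattern.compile("a*b");

    // Counts the inputs that match a*b in their entirety.
    static int countMatching(List<String> inputs) {
        int n = 0;
        for (String s : inputs) {
            if (AB.matcher(s).matches()) {  // a new Matcher per input, same Pattern
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countMatching(Arrays.asList("aaaaab", "b", "ba", ""))); // prints 2
        System.out.println(Pattern.matches("a*b", "aaaaab"));                      // prints true
    }
}
```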
2818
2819 Instances of this class are immutable and are safe for use by multiple
2820 concurrent threads. Instances of the (|java.util.regex.Matcher|) class are not
2821 safe for such use.
2822
Summary of regular-expression constructs

Construct             Matches

Characters

x                     The character x
\\                    The backslash character
\0n                   The character with octal value 0n (0 <= n <= 7)
\0nn                  The character with octal value 0nn (0 <= n <= 7)
\0mnn                 The character with octal value 0mnn
                      (0 <= m <= 3, 0 <= n <= 7)
\xhh                  The character with hexadecimal value 0xhh
\uhhhh                The character with hexadecimal value 0xhhhh
\t                    The tab character ('\u0009')
\n                    The newline (line feed) character ('\u000A')
\r                    The carriage-return character ('\u000D')
\f                    The form-feed character ('\u000C')
\a                    The alert (bell) character ('\u0007')
\e                    The escape character ('\u001B')
\cx                   The control character corresponding to x

Character classes

[abc]                 a, b, or c (simple class)
[^abc]                Any character except a, b, or c (negation)
[a-zA-Z]              a through z or A through Z, inclusive (range)
[a-d[m-p]]            a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]          d, e, or f (intersection)
[a-z&&[^bc]]          a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]         a through z, and not m through p: [a-lq-z]
                      (subtraction)

Predefined character classes

.                     Any character (may or may not match line terminators)
\d                    A digit: [0-9]
\D                    A non-digit: [^0-9]
\s                    A whitespace character: [ \t\n\x0B\f\r]
\S                    A non-whitespace character: [^\s]
\w                    A word character: [a-zA-Z_0-9]
\W                    A non-word character: [^\w]

POSIX character classes (US-ASCII only)

\p{Lower}             A lower-case alphabetic character: [a-z]
\p{Upper}             An upper-case alphabetic character: [A-Z]
\p{ASCII}             All ASCII: [\x00-\x7F]
\p{Alpha}             An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}             A decimal digit: [0-9]
\p{Alnum}             An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}             Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}             A visible character: [\p{Alnum}\p{Punct}]
\p{Print}             A printable character: [\p{Graph}\x20]
\p{Blank}             A space or a tab: [ \t]
\p{Cntrl}             A control character: [\x00-\x1F\x7F]
\p{XDigit}            A hexadecimal digit: [0-9a-fA-F]
\p{Space}             A whitespace character: [ \t\n\x0B\f\r]

java.lang.Character classes (simple java character type)

\p{javaLowerCase}     Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}     Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}      Equivalent to java.lang.Character.isMirrored()

Classes for Unicode blocks and categories

\p{InGreek}           A character in the Greek block (simple block)
\p{Lu}                An uppercase letter (simple category)
\p{Sc}                A currency symbol
\P{InGreek}           Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]    Any letter except an uppercase letter (subtraction)

Boundary matchers

^                     The beginning of a line
$                     The end of a line
\b                    A word boundary
\B                    A non-word boundary
\A                    The beginning of the input
\G                    The end of the previous match
\Z                    The end of the input but for the final terminator,
                      if any
\z                    The end of the input

Greedy quantifiers

X?                    X, once or not at all
X*                    X, zero or more times
X+                    X, one or more times
X{n}                  X, exactly n times
X{n,}                 X, at least n times
X{n,m}                X, at least n but not more than m times

Reluctant quantifiers

X??                   X, once or not at all
X*?                   X, zero or more times
X+?                   X, one or more times
X{n}?                 X, exactly n times
X{n,}?                X, at least n times
X{n,m}?               X, at least n but not more than m times

Possessive quantifiers

X?+                   X, once or not at all
X*+                   X, zero or more times
X++                   X, one or more times
X{n}+                 X, exactly n times
X{n,}+                X, at least n times
X{n,m}+               X, at least n but not more than m times

Logical operators

XY                    X followed by Y
X|Y                   Either X or Y
(X)                   X, as a capturing group

Back references

\n                    Whatever the nth capturing group matched

Quotation

\                     Nothing, but quotes the following character
\Q                    Nothing, but quotes all characters until \E
\E                    Nothing, but ends quoting started by \Q

Special constructs (non-capturing)

(?:X)                 X, as a non-capturing group
(?idmsux-idmsux)      Nothing, but turns match flags on - off
(?idmsux-idmsux:X)    X, as a non-capturing group with the given flags
                      on - off
(?=X)                 X, via zero-width positive lookahead
(?!X)                 X, via zero-width negative lookahead
(?<=X)                X, via zero-width positive lookbehind
(?<!X)                X, via zero-width negative lookbehind
(?>X)                 X, as an independent, non-capturing group
2927
2928
2929
2930
2931
2932 Backslashes, escapes, and quoting
2933
2934 The backslash character ('\') serves to introduce escaped constructs, as
2935 defined in the table above, as well as to quote characters that otherwise would
2936 be interpreted as unescaped constructs. Thus the expression \\ matches a single
2937 backslash and \{ matches a left brace.
2938
2939 It is an error to use a backslash prior to any alphabetic character that does
2940 not denote an escaped construct; these are reserved for future extensions to
2941 the regular-expression language. A backslash may be used prior to a
2942 non-alphabetic character regardless of whether that character is part of an
2943 unescaped construct.
2944
Backslashes within string literals in Java source code are interpreted as
required by the Java Language Specification as either Unicode escapes or other
character escapes. It is therefore necessary to double backslashes in string
literals that represent regular expressions to protect them from interpretation
by the Java bytecode compiler. The string literal "\b", for example, matches a
single backspace character when interpreted as a regular expression, while
"\\b" matches a word boundary. The string literal "\(hello\)" is illegal and
leads to a compile-time error; in order to match the string (hello) the string
literal "\\(hello\\)" must be used.
2954
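The doubling rule can be sketched as follows. The class EscapeSketch and its constants are hypothetical names; each regular expression is shown as it must appear inside a Java string literal.

```java
import java.util.regex.Pattern;

class EscapeSketch {
    // Regex \bcat\b: the word "cat" delimited by word boundaries.
    static final Pattern WORD = Pattern.compile("\\bcat\\b");

    // Regex \(hello\): literal parentheses rather than a capturing group.
    static final Pattern PAREN = Pattern.compile("\\(hello\\)");

    // Regex \\: a single backslash, so the Java literal needs four of them.
    static final Pattern BACKSLASH = Pattern.compile("\\\\");

    public static void main(String[] args) {
        System.out.println(WORD.matcher("a cat sat").find());    // true
        System.out.println(WORD.matcher("concatenate").find());  // false
        System.out.println(PAREN.matcher("(hello)").matches());  // true
        System.out.println(BACKSLASH.matcher("\\").matches());   // true
    }
}
```
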
2955 Character Classes
2956
Character classes may appear within other character classes, and may be
composed by the union operator (implicit) and the intersection operator (&&).
The union operator denotes a class that contains every character that is in at
least one of its operand classes. The intersection operator denotes a class
that contains every character that is in both of its operand classes.
2962
The precedence of character-class operators is as follows, from highest to
lowest:

    1    Literal escape    \x
    2    Grouping          [...]
    3    Range             a-z
    4    Union             [a-e][i-u]
    5    Intersection      [a-z&&[aeiou]]
2968
Note that a different set of metacharacters is in effect inside a character
class than outside a character class. For instance, the regular expression .
loses its special meaning inside a character class, while the expression -
becomes a range-forming metacharacter.
2973
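A short sketch of union, intersection and subtraction inside character classes, written under the assumption that each class is matched against a single-character input. CharClassSketch and its in helper are hypothetical names.

```java
import java.util.regex.Pattern;

class CharClassSketch {
    // True if the single-character string s belongs to the character class cls.
    static boolean in(String cls, String s) {
        return Pattern.matches(cls, s);
    }

    public static void main(String[] args) {
        System.out.println(in("[a-d[m-p]]", "n"));   // true: union of a-d and m-p
        System.out.println(in("[a-d[m-p]]", "e"));   // false
        System.out.println(in("[a-z&&[def]]", "e")); // true: intersection
        System.out.println(in("[a-z&&[^bc]]", "b")); // false: subtraction removes b and c
        System.out.println(in("[.]", "x"));          // false: . is a literal inside a class
    }
}
```
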
2974 Line terminators
2975
2976 A line terminator is a one- or two-character sequence that marks the end of a
2977 line of the input character sequence. The following are recognized as line
2978 terminators:
2979
A newline (line feed) character ('\n'),

A carriage-return character followed immediately by a newline character
("\r\n"),

A standalone carriage-return character ('\r'),

A next-line character ('\u0085'),

A line-separator character ('\u2028'), or

A paragraph-separator character ('\u2029').
2994
If UNIX_LINES(|java.util.regex.Pattern|) mode is activated, then the only line
terminators recognized are newline characters.

The regular expression . matches any character except a line terminator unless
the DOTALL(|java.util.regex.Pattern|) flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and only
match at the beginning and the end, respectively, of the entire input sequence.
If MULTILINE(|java.util.regex.Pattern|) mode is activated then ^ matches at the
beginning of input and after any line terminator except at the end of input.
When in MULTILINE(|java.util.regex.Pattern|) mode $ matches just before a line
terminator or the end of the input sequence.
3007
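The effect of the line-related flags can be sketched as below. LineModeSketch and its found helper are hypothetical; the flag constants are the fields of this class.

```java
import java.util.regex.Pattern;

class LineModeSketch {
    // True if regex, compiled with the given flags, is found anywhere in input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "a\nb\nc";
        System.out.println(found("^b$", 0, text));                    // false
        System.out.println(found("^b$", Pattern.MULTILINE, text));    // true
        System.out.println(found("a.b", 0, "a\nb"));                  // false
        System.out.println(found("a.b", Pattern.DOTALL, "a\nb"));     // true
        System.out.println(found("a.b", 0, "a\rb"));                  // false
        System.out.println(found("a.b", Pattern.UNIX_LINES, "a\rb")); // true
    }
}
```

In the last pair, the carriage return stops being a line terminator once UNIX_LINES is set, so the dot is free to match it.
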
3008 Groups and capturing
3009
3010 Capturing groups are numbered by counting their opening parentheses from left
3011 to right. In the expression ((A)(B(C))), for example, there are four such
3012 groups:
3013
    1    ((A)(B(C)))
    2    (A)
    3    (B(C))
    4    (C)
3015
3016 Group zero always stands for the entire expression.
3017
3018 Capturing groups are so named because, during a match, each subsequence of the
3019 input sequence that matches such a group is saved. The captured subsequence may
3020 be used later in the expression, via a back reference, and may also be
3021 retrieved from the matcher once the match operation is complete.
3022
3023 The captured input associated with a group is always the subsequence that the
3024 group most recently matched. If a group is evaluated a second time because of
3025 quantification then its previously-captured value, if any, will be retained if
3026 the second evaluation fails. Matching the string "aba" against the expression
3027 (a(b)?)+, for example, leaves group two set to "b". All captured input is
3028 discarded at the beginning of each match.
3029
3030 Groups beginning with (? are pure, non-capturing groups that do not capture
3031 text and do not count towards the group total.
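
The numbering and retention rules in this section can be observed with a small sketch. GroupSketch and its groups helper are hypothetical names; the expressions are the ones used in the text above, plus one non-capturing example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSketch {
    // Returns groups 0..groupCount() of a full match, or null if input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parentheses, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC"))); // [ABC, ABC, A, BC, C]
        // Group two keeps "b" although its second evaluation matched nothing.
        System.out.println(Arrays.toString(groups("(a(b)?)+", "aba")));    // [aba, a, b]
        // (?:...) does not count towards the group total.
        System.out.println(Arrays.toString(groups("(?:a)(b)", "ab")));     // [ab, b]
    }
}
```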
3032
3033 Unicode support
3034
3035 This class is in conformance with Level 1 of Unicode Technical Standard #18:
3036 Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.
3037
Unicode escape sequences such as \u2014 in Java source code are processed as
described in §3.3 of the Java Language Specification. Such escape sequences are
also implemented directly by the regular-expression parser so that Unicode
escapes can be used in expressions that are read from files or from the
keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into
the same pattern, which matches the character with hexadecimal value 0x2014.
3044
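The two routes for a Unicode escape can be compared with the sketch below. UnicodeEscapeSketch and its constants are hypothetical names; the behaviour shown is the one the paragraph above describes.

```java
import java.util.regex.Pattern;

class UnicodeEscapeSketch {
    // The Java compiler converts the escape, so the regex is the single character U+2014.
    static final Pattern VIA_COMPILER = Pattern.compile("\u2014");

    // The doubled backslash survives compilation, so the regex parser
    // receives a six-character escape and interprets it itself.
    static final Pattern VIA_PARSER = Pattern.compile("\\u2014");

    public static void main(String[] args) {
        String input = "\u2014";
        System.out.println(VIA_COMPILER.pattern().length());       // 1
        System.out.println(VIA_PARSER.pattern().length());         // 6
        System.out.println(VIA_COMPILER.matcher(input).matches()); // true
        System.out.println(VIA_PARSER.matcher(input).matches());   // true
    }
}
```
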
3045 Unicode blocks and categories are written with the \p and \P constructs as in
3046 Perl. \p{prop} matches if the input has the property prop, while \P{prop} does
3047 not match if the input has that property. Blocks are specified with the prefix
3048 In, as in InMongolian. Categories may be specified with the optional prefix Is:
3049 Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and
3050 categories can be used both inside and outside of a character class.
3051
The supported categories are those of The Unicode Standard in the version
specified by the Character(|java.lang.Character|) class. The category names are
those defined in the Standard, both normative and informative. The block names
supported by Pattern are the valid block names accepted and defined by
UnicodeBlock.forName(|java.lang.Character.UnicodeBlock|).
3059
3060 Categories that behave like the java.lang.Character boolean ismethodname
3061 methods (except for the deprecated ones) are available through the same
3062 \p{prop} syntax where the specified property has the name javamethodname.
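
A hedged sketch of the \p and \P constructs described in this section, tested against single characters. UnicodePropertySketch and its is helper are hypothetical names; the Greek capital omega (U+03A9) is used as a sample non-ASCII letter.

```java
import java.util.regex.Pattern;

class UnicodePropertySketch {
    // True if the whole of s matches regex.
    static boolean is(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        String omega = "\u03A9";
        System.out.println(is("\\p{L}", omega));                // true: category, no prefix
        System.out.println(is("\\p{IsL}", omega));              // true: same category with Is
        System.out.println(is("\\p{InGreek}", omega));          // true: block, In prefix
        System.out.println(is("\\P{InGreek}", "a"));            // true: negation
        System.out.println(is("[\\p{L}&&[^\\p{Lu}]]", "A"));    // false: uppercase subtracted
        System.out.println(is("\\p{javaLowerCase}", "a"));      // true: Character.isLowerCase
    }
}
```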
3063
3064 Comparison to Perl 5
3065
3066 The Pattern engine performs traditional NFA-based matching with ordered
3067 alternation as occurs in Perl 5.
3068
Perl constructs not supported by this class:

The conditional constructs (?{X}) and (?(condition)X|Y),

The embedded code constructs (?{code}) and (??{code}),

The embedded comment syntax (?#comment), and

The preprocessing operations \l \u, \L, and \U.
3080
3081
3082
3083 Constructs supported by this class but not by Perl:
3084
3085
3086
3087 Possessive quantifiers, which greedily match as much as they can and do not
3088 back off, even when doing so would allow the overall match to succeed.
3089
3090 Character-class union and intersection as described above.
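
The refusal of possessive quantifiers to back off can be seen by comparing a greedy and a possessive form of the same expression. PossessiveSketch is a hypothetical class name; the expressions are illustrative choices, not taken from the original text.

```java
import java.util.regex.Pattern;

class PossessiveSketch {
    // Greedy: .* first takes everything, then gives characters back so that foo can match.
    static boolean greedy(String input) {
        return Pattern.matches(".*foo", input);
    }

    // Possessive: .*+ takes everything and never gives anything back, so foo cannot match.
    static boolean possessive(String input) {
        return Pattern.matches(".*+foo", input);
    }

    public static void main(String[] args) {
        System.out.println(greedy("xfoo"));                  // true
        System.out.println(possessive("xfoo"));              // false
        System.out.println(Pattern.matches("a*+b", "aaab")); // true: no backing off was needed
    }
}
```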
3091
3092
3093
3094 Notable differences from Perl:
3095
3096
3097
3098 In Perl, \1 through \9 are always interpreted as back references; a
3099 backslash-escaped number greater than 9 is treated as a back reference if at
3100 least that many subexpressions exist, otherwise it is interpreted, if possible,
3101 as an octal escape. In this class octal escapes must always begin with a zero.
3102 In this class, \1 through \9 are always interpreted as back references, and a
3103 larger number is accepted as a back reference if at least that many
3104 subexpressions exist at that point in the regular expression, otherwise the
3105 parser will drop digits until the number is smaller or equal to the existing
3106 number of groups or it is one digit.
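
A small sketch of the back-reference and octal rules in the paragraph above. BackReferenceSketch and its matches helper are hypothetical names, and the expressions are illustrative assumptions rather than examples from the original text.

```java
import java.util.regex.Pattern;

class BackReferenceSketch {
    // True if the whole of input matches regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \1 refers back to whatever group one captured.
        System.out.println(matches("(\\w)\\1", "oo"));   // true
        System.out.println(matches("(\\w)\\1", "ox"));   // false
        // Only one group exists, so \11 is read as \1 followed by a literal 1.
        System.out.println(matches("(a)\\11", "aa1"));   // true
        // Octal escapes begin with a zero: 0101 octal is 65, the letter A.
        System.out.println(matches("\\0101", "A"));      // true
    }
}
```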
3107
3108 Perl uses the g flag to request a match that resumes where the last match left
3109 off. This functionality is provided implicitly by the
3110 (|java.util.regex.Matcher|) class: Repeated invocations of the
3111 find(|java.util.regex.Matcher|) method will resume where the last match left
3112 off, unless the matcher is reset.
3113
3114 In Perl, embedded flags at the top level of an expression affect the whole
3115 expression. In this class, embedded flags always take effect at the point at
3116 which they appear, whether they are at the top level or within a group; in the
3117 latter case, flags are restored at the end of the group just as in Perl.
3118
3119 Perl is forgiving about malformed matching constructs, as in the expression *a,
3120 as well as dangling brackets, as in the expression abc], and treats them as
3121 literals. This class also accepts dangling brackets but is strict about
3122 dangling metacharacters like +, ? and *, and will throw a
3123 (|java.util.regex.PatternSyntaxException|) if it encounters them.
3124
3125
3126
3127 For a more precise description of the behavior of regular expression
3128 constructs, please see Mastering Regular Expressions, 2nd Edition, Jeffrey E.
3129 F. Friedl, O'Reilly and Associates, 2002.
3130
3131
3132
3133 *java.util.regex.Pattern.compile(String)*
3134
3135 public static |java.util.regex.Pattern| compile(java.lang.String regex)
3136
3137 Compiles the given regular expression into a pattern.
3138
3139     regex - The expression to be compiled
3140
3141 *java.util.regex.Pattern.compile(String,int)*
3142
3143 public static |java.util.regex.Pattern| compile(
3144   java.lang.String regex,
3145   int flags)
3146
3147 Compiles the given regular expression into a pattern with the given flags.
3148
3149     regex - The expression to be compiled
    flags - Match flags, a bit mask that may include CASE_INSENSITIVE,
       MULTILINE, DOTALL, UNICODE_CASE, and CANON_EQ
3153
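Because the flags argument is a bit mask, several flags are combined with the bitwise OR operator and can be tested again through flags(). The sketch below assumes hypothetical names FlagSketch, HELLO and hasFlag.

```java
import java.util.regex.Pattern;

class FlagSketch {
    // Two flags combined into one bit mask.
    static final Pattern HELLO =
            Pattern.compile("^hello$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // True if the given flag bit is set in the pattern's match flags.
    static boolean hasFlag(Pattern p, int flag) {
        return (p.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        System.out.println(HELLO.matcher("one\nHELLO\ntwo").find());  // true
        System.out.println(hasFlag(HELLO, Pattern.CASE_INSENSITIVE)); // true
        System.out.println(hasFlag(HELLO, Pattern.DOTALL));           // false
    }
}
```
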
3154 *java.util.regex.Pattern.flags()*
3155
3156 public int flags()
3157
3158 Returns this pattern's match flags.
3159
3160
3161     Returns: The match flags specified when this pattern was compiled
3162 *java.util.regex.Pattern.matcher(CharSequence)*
3163
3164 public |java.util.regex.Matcher| matcher(java.lang.CharSequence input)
3165
3166 Creates a matcher that will match the given input against this pattern.
3167
3168     input - The character sequence to be matched
3169
3170     Returns: A new matcher for this pattern
3171 *java.util.regex.Pattern.matches(String,CharSequence)*
3172
3173 public static boolean matches(
3174   java.lang.String regex,
3175   java.lang.CharSequence input)
3176
3177 Compiles the given regular expression and attempts to match the given input
3178 against it.
3179
3180 An invocation of this convenience method of the form
3181
3182
3183
3184 Pattern.matches(regex, input);
3185
3186 behaves in exactly the same way as the expression
3187
3188
3189
3190 Pattern.compile(regex).matcher(input).matches()
3191
3192 If a pattern is to be used multiple times, compiling it once and reusing it
3193 will be more efficient than invoking this method each time.
3194
3195     regex - The expression to be compiled
3196     input - The character sequence to be matched
3197
3198 *java.util.regex.Pattern.pattern()*
3199
3200 public |java.lang.String| pattern()
3201
3202 Returns the regular expression from which this pattern was compiled.
3203
3204
3205     Returns: The source of this pattern
3206 *java.util.regex.Pattern.quote(String)*
3207
3208 public static |java.lang.String| quote(java.lang.String s)
3209
3210 Returns a literal pattern String for the specified String.
3211
3212 This method produces a String that can be used to create a Pattern that would
3213 match the string s as if it were a literal pattern. Metacharacters or escape
3214 sequences in the input sequence will be given no special meaning.
3215
3216     s - The string to be literalized
3217
3218     Returns: A literal string replacement
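
A minimal sketch of quote in use, assuming the text to be matched comes from outside the program and may contain metacharacters. QuoteSketch and matchesLiterally are hypothetical names.

```java
import java.util.regex.Pattern;

class QuoteSketch {
    // True if input is exactly the string literal, with no character of
    // literal treated as a regular-expression metacharacter.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("a.b", "axb"));  // true: . matches any character
        System.out.println(matchesLiterally("a.b", "axb")); // false: . is now a literal dot
        System.out.println(matchesLiterally("a.b", "a.b")); // true
    }
}
```
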
3219 *java.util.regex.Pattern.split(CharSequence)*
3220
public |java.lang.String|[] split(java.lang.CharSequence input)
3222
3223 Splits the given input sequence around matches of this pattern.
3224
3225 This method works as if by invoking the two-argument
3226 split(|java.util.regex.Pattern|) method with the given input sequence and a
3227 limit argument of zero. Trailing empty strings are therefore not included in
3228 the resulting array.
3229
3230 The input "boo:and:foo", for example, yields the following results with these
3231 expressions:
3232
    Regex    Result
    :        { "boo", "and", "foo" }
    o        { "b", "", ":and:f" }
3234
3235     input - The character sequence to be split
3236
3237     Returns: The array of strings computed by splitting the input around matches of this
3238              pattern
3239 *java.util.regex.Pattern.split(CharSequence,int)*
3240
public |java.lang.String|[] split(
  java.lang.CharSequence input,
  int limit)
3244
3245 Splits the given input sequence around matches of this pattern.
3246
3247 The array returned by this method contains each substring of the input sequence
3248 that is terminated by another subsequence that matches this pattern or is
3249 terminated by the end of the input sequence. The substrings in the array are in
3250 the order in which they occur in the input. If this pattern does not match any
3251 subsequence of the input then the resulting array has just one element, namely
3252 the input sequence in string form.
3253
3254 The limit parameter controls the number of times the pattern is applied and
3255 therefore affects the length of the resulting array. If the limit n is greater
3256 than zero then the pattern will be applied at most n-1 times, the array's
3257 length will be no greater than n, and the array's last entry will contain all
3258 input beyond the last matched delimiter. If n is non-positive then the pattern
3259 will be applied as many times as possible and the array can have any length. If
3260 n is zero then the pattern will be applied as many times as possible, the array
3261 can have any length, and trailing empty strings will be discarded.
3262
3263 The input "boo:and:foo", for example, yields the following results with these
3264 parameters:
3265
    Regex    Limit    Result
    :        2        { "boo", "and:foo" }
    :        5        { "boo", "and", "foo" }
    :        -2       { "boo", "and", "foo" }
    o        5        { "b", "", ":and:f", "", "" }
    o        -2       { "b", "", ":and:f", "", "" }
    o        0        { "b", "", ":and:f" }
3269
3270     input - The character sequence to be split
3271     limit - The result threshold, as described above
3272
3273     Returns: The array of strings computed by splitting the input around matches of this
3274              pattern
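
The rows of the table above can be reproduced with a short sketch. SplitSketch and its split helper are hypothetical names that simply wrap the two-argument method.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitSketch {
    // Splits input around matches of regex, applying the given limit.
    static String[] split(String regex, String input, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        String input = "boo:and:foo";
        System.out.println(Arrays.toString(split(":", input, 2)));  // [boo, and:foo]
        System.out.println(Arrays.toString(split(":", input, -2))); // [boo, and, foo]
        // A negative limit keeps the trailing empty strings.
        System.out.println(split("o", input, -2).length);           // 5
        // A limit of zero discards them, as does the one-argument split.
        System.out.println(split("o", input, 0).length);            // 3
        System.out.println(Pattern.compile("o").split(input).length); // 3
    }
}
```
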
3275 *java.util.regex.Pattern.toString()*
3276
3277 public |java.lang.String| toString()
3278
3279 Returns the string representation of this pattern. This is the regular
3280 expression from which this pattern was compiled.
3281
3282
3283     Returns: The string representation of this pattern
3284