This section provides an overview of the most important syntax elements.
Characters
The following table shows characters.
Table 1. Characters
| Construct | Matches |
| x | The character x |
| \\ | The backslash character |
| \0n | The character with octal value 0n (0 <= n <= 7) |
| \0nn | The character with octal value 0nn (0 <= n <= 7) |
| \0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) |
| \xhh | The character with hexadecimal value 0xhh |
| \uhhhh | The character with hexadecimal value 0xhhhh |
| \t | The tab character ('\u0009') |
| \n | The newline (line feed) character ('\u000A') |
| \r | The carriage-return character ('\u000D') |
| \f | The form-feed character ('\u000C') |
| \a | The alert (bell) character ('\u0007') |
| \e | The escape character ('\u001B') |
| \cx | The control character corresponding to x |
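As a brief illustration of the escapes in Table 1, the following Java sketch checks a few of them with java.util.regex.Pattern. The class name CharacterEscapesDemo and its helper method are hypothetical, chosen only for this example. Note that each regex backslash must be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

final class CharacterEscapesDemo {
    // Returns true if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\x41", "A"));      // hexadecimal value 0x41
        System.out.println(matches("\\0101", "A"));     // octal value 0101
        System.out.println(matches("\\t", "\t"));       // tab character
        System.out.println(matches("\\\\", "\\"));      // a single backslash
        System.out.println(matches("\\cA", "\u0001"));  // control character for A
    }
}
```

Each call above is expected to print true, because the escape on the left denotes exactly the character on the right.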
Character classes
The following table shows character classes.
Table 2. Character classes
| Construct | Matches |
| [abc] | a, b, or c (simple class) |
| [^abc] | Any character except a, b, or c (negation) |
| [a-zA-Z] | a through z or A through Z, inclusive (range) |
| [a-d[m-p]] | a through d, or m through p: [a-dm-p] (union) |
| [a-z&&[def]] | d, e, or f (intersection) |
| [a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction) |
| [a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z] (subtraction) |
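The union, intersection, and subtraction forms are easiest to see by filtering the alphabet through each class. The following is a minimal sketch; the class CharacterClassDemo and the method keep are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

final class CharacterClassDemo {
    // Keeps only those characters of input that match the given character class.
    static String keep(String charClass, String input) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder kept = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                kept.append(c);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String letters = "abcdefghijklmnopqrstuvwxyz";
        System.out.println(keep("[a-d[m-p]]", letters));    // union: abcdmnop
        System.out.println(keep("[a-z&&[def]]", letters));  // intersection: def
        System.out.println(keep("[a-z&&[^bc]]", letters));  // subtraction: all but b and c
        System.out.println(keep("[a-z&&[^m-p]]", letters)); // subtraction: all but m through p
    }
}
```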
Predefined character classes
The following table shows predefined character classes.
Table 3. Predefined character classes
| Construct | Matches |
| . | Any character (may or may not match line terminators) |
| \d | A digit: [0-9] |
| \D | A non-digit: [^0-9] |
| \s | A whitespace character: [ \t\n\x0B\f\r] |
| \S | A non-whitespace character: [^\s] |
| \w | A word character: [a-zA-Z_0-9] |
| \W | A non-word character: [^\w] |
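The predefined classes are typically combined with String methods that accept a regular expression. The sketch below assumes nothing beyond the standard String API; the class and method names are hypothetical.

```java
final class PredefinedClassDemo {
    // Removes every character that is not a digit.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits the text on runs of whitespace characters.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("Tel. 555-0100"));            // 5550100
        System.out.println(String.join(",", words("a \t b\nc")));   // a,b,c
        System.out.println("snake_case1".matches("\\w+"));          // true
        System.out.println("\n".matches("."));                      // false without dotall mode
    }
}
```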
POSIX character classes
The following POSIX character classes apply to US-ASCII only.
Table 4. POSIX character classes that apply to US-ASCII only
| Construct | Matches |
| \p{Lower} | A lower-case alphabetic character: [a-z] |
| \p{Upper} | An upper-case alphabetic character: [A-Z] |
| \p{ASCII} | All ASCII: [\x00-\x7F] |
| \p{Alpha} | An alphabetic character: [\p{Lower}\p{Upper}] |
| \p{Digit} | A decimal digit: [0-9] |
| \p{Alnum} | An alphanumeric character: [\p{Alpha}\p{Digit}] |
| \p{Punct} | Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ |
| \p{Graph} | A visible character: [\p{Alnum}\p{Punct}] |
| \p{Print} | A printable character: [\p{Graph}\x20] |
| \p{Blank} | A space or a tab: [ \t] |
| \p{Cntrl} | A control character: [\x00-\x1F\x7F] |
| \p{XDigit} | A hexadecimal digit: [0-9a-fA-F] |
| \p{Space} | A whitespace character: [ \t\n\x0B\f\r] |
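A short sketch of the POSIX classes in use follows. It also shows the US-ASCII restriction: with default flags, an accented lower-case letter is not matched by \p{Lower}. The class and method names are hypothetical.

```java
final class PosixClassDemo {
    // Removes all US-ASCII punctuation characters.
    static String stripPunctuation(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    // Returns true if the text consists of one or more hexadecimal digits.
    static boolean isHex(String s) {
        return s.matches("\\p{XDigit}+");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("Hello, world!"));   // Hello world
        System.out.println(isHex("CAFE01"));                     // true
        System.out.println(isHex("xyz"));                        // false
        // e with acute accent is lower case, but outside US-ASCII.
        System.out.println("\u00e9".matches("\\p{Lower}"));      // false
    }
}
```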
java.lang.Character classes
The following classes are equivalent to methods of java.lang.Character (simple Java character type).
Table 5. java.lang.Character classes of simple Java character type
| Construct | Matches |
| \p{javaLowerCase} | Equivalent to java.lang.Character.isLowerCase() |
| \p{javaUpperCase} | Equivalent to java.lang.Character.isUpperCase() |
| \p{javaWhitespace} | Equivalent to java.lang.Character.isWhitespace() |
| \p{javaMirrored} | Equivalent to java.lang.Character.isMirrored() |
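The equivalence stated in Table 5 can be spot-checked by comparing the regular expression result with the corresponding java.lang.Character method. This is only a sketch over a few sample characters, not a proof; the class and method names are hypothetical.

```java
final class JavaCharacterClassDemo {
    // Returns true if \p{javaLowerCase} and Character.isLowerCase agree for c.
    static boolean lowerCaseAgrees(char c) {
        boolean byRegex = String.valueOf(c).matches("\\p{javaLowerCase}");
        return byRegex == Character.isLowerCase(c);
    }

    // Returns true if \p{javaMirrored} and Character.isMirrored agree for c.
    static boolean mirroredAgrees(char c) {
        boolean byRegex = String.valueOf(c).matches("\\p{javaMirrored}");
        return byRegex == Character.isMirrored(c);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'a', 'A', '\u00e9', '(', '1'}) {
            System.out.println(lowerCaseAgrees(c) && mirroredAgrees(c));
        }
    }
}
```

Unlike \p{Lower}, the class \p{javaLowerCase} is not limited to US-ASCII, so it also matches an accented lower-case letter.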
Classes for Unicode blocks and categories
The following table shows classes for Unicode blocks and categories.
Table 6. Classes for Unicode blocks and categories
| Construct | Matches |
| \p{InGreek} | A character in the Greek block (simple block) |
| \p{Lu} | An uppercase letter (simple category) |
| \p{Sc} | A currency symbol |
| \P{InGreek} | Any character except one in the Greek block (negation) |
| [\p{L}&&[^\p{Lu}]] | Any letter except an uppercase letter (subtraction) |
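The following sketch exercises each construct in Table 6 on a single character. The class name UnicodeClassDemo is hypothetical; non-ASCII characters are written as Unicode escapes so that the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

final class UnicodeClassDemo {
    // Returns true if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String alpha = "\u03b1"; // Greek small letter alpha
        String euro = "\u20ac";  // euro sign
        System.out.println(matches("\\p{InGreek}", alpha));          // true
        System.out.println(matches("\\P{InGreek}", "a"));            // true
        System.out.println(matches("\\p{Lu}", "A"));                 // true
        System.out.println(matches("\\p{Sc}", euro));                // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "a"));    // true
        System.out.println(matches("[\\p{L}&&[^\\p{Lu}]]", "A"));    // false
    }
}
```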
Boundary matchers
The following table shows boundary matchers.
Table 7. Boundary matchers
| Construct | Matches |
| ^ | The beginning of a line |
| $ | The end of a line |
| \b | A word boundary |
| \B | A non-word boundary |
| \A | The beginning of the input |
| \G | The end of the previous match |
| \Z | The end of the input but for the final terminator, if any |
| \z | The end of the input |
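Boundary matchers consume no characters, so their effect is easiest to observe by counting matches. The sketch below uses a hypothetical helper, count. It also shows that, with default flags, ^ matches only at the beginning of the input, and at the beginning of every line once multiline mode (?m) is enabled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryDemo {
    // Counts the non-overlapping matches of regex in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("\\bcat\\b", "cat concat cat")); // 2: concat is skipped
        System.out.println(count("^\\w+", "one\ntwo"));           // 1: start of input only
        System.out.println(count("(?m)^\\w+", "one\ntwo"));       // 2: start of each line
        System.out.println(count("c\\Z", "abc\n"));               // 1: final terminator ignored
        System.out.println(count("c\\z", "abc\n"));               // 0: c is not the very end
    }
}
```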
Greedy quantifiers
The following table shows greedy quantifiers.
Table 8. Greedy quantifiers
| Construct | Matches |
| X? | X, once or not at all |
| X* | X, zero or more times |
| X+ | X, one or more times |
| X{n} | X, exactly n times |
| X{n,} | X, at least n times |
| X{n,m} | X, at least n but not more than m times |
Reluctant quantifiers
The following table shows reluctant quantifiers.
Table 9. Reluctant quantifiers
| Construct | Matches |
| X?? | X, once or not at all |
| X*? | X, zero or more times |
| X+? | X, one or more times |
| X{n}? | X, exactly n times |
| X{n,}? | X, at least n times |
| X{n,m}? | X, at least n but not more than m times |
Possessive quantifiers
The following table shows possessive quantifiers.
Table 10. Possessive quantifiers
| Construct | Matches |
| X?+ | X, once or not at all |
| X*+ | X, zero or more times |
| X++ | X, one or more times |
| X{n}+ | X, exactly n times |
| X{n,}+ | X, at least n times |
| X{n,m}+ | X, at least n but not more than m times |
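The greedy, reluctant, and possessive tables list identical descriptions, so the difference between them shows up only in how much text each form takes. A greedy quantifier takes as much as possible and backs off if needed, a reluctant one takes as little as possible, and a possessive one takes as much as possible and never backs off. The following sketch, built around a hypothetical helper named firstMatch, applies all three to the same input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        System.out.println(firstMatch("<.+>", input));   // greedy: <a><b>
        System.out.println(firstMatch("<.+?>", input));  // reluctant: <a>
        // Possessive: .++ swallows the final > and does not give it back.
        System.out.println(firstMatch("<.++>", input));  // null
        System.out.println(firstMatch("\\d{2,3}", "12345"));   // greedy: 123
        System.out.println(firstMatch("\\d{2,3}?", "12345"));  // reluctant: 12
    }
}
```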
Logical operators
The following table shows logical operators.
Table 11. Logical operators
| Construct | Matches | Notes |
| XY | X followed by Y | - |
| X|Y | Either X or Y | Use this with care. In the regular-expression implementation of the Java runtime environment, alternations might cause a stack overflow at run time, depending on the size of the text to be analyzed. A pattern that works fine on a short input text might still fail on a longer input text. |
| (X) | X, as a capturing group | Use capturing groups to assign matching subparts of a regular expression to annotation features. Capturing groups are numbered from left to right, beginning with 1. Group 0 matches the whole regular expression. |
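The group numbering described above can be observed with java.util.regex.Matcher. The sketch below is only an illustration with a hypothetical helper, groups, and an invented sample pattern. It returns group 0 (the whole match) followed by the capturing groups in left-to-right order.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Returns group 0 followed by all capturing groups, or null if input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Concatenation, three capturing groups, and one alternation.
        String regex = "(\\w+)@(\\w+)\\.(com|org)";
        for (String g : groups(regex, "alice@example.org")) {
            System.out.println(g);
        }
        // Prints alice@example.org, alice, example, org on separate lines.
    }
}
```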
Back reference
The following table shows the back reference.
Table 12. Back reference
| Construct | Matches |
| \n | Whatever the nth capturing group matched |
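A common use of a back reference is to find an immediately repeated word. The following sketch uses \1 to refer to whatever the first capturing group matched; the class and method names are hypothetical.

```java
final class BackReferenceDemo {
    // Collapses a word that is repeated one or more times into a single occurrence.
    static String collapseRepeats(String s) {
        // (\w+) captures a word; (\s+\1\b)+ matches the same word again.
        return s.replaceAll("\\b(\\w+)(\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapseRepeats("this is is a test")); // this is a test
        System.out.println(collapseRepeats("the the the end"));   // the end
    }
}
```

In the replacement string, $1 refers to the same capturing group that \1 refers to inside the pattern.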
Quotations
The following table shows quotations.
Table 13. Quotations
| Construct | Matches |
| \ | Nothing, but quotes the following character |
| \Q | Nothing, but quotes all characters until \E |
| \E | Nothing, but ends quoting started by \Q |
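Quoting matters whenever literal text contains characters that have a special meaning in a regular expression. The sketch below matches the literal text 1+1=2 in three ways: unquoted (which fails, because + is a quantifier), with a single quoted character, and with \Q and \E. It also uses the standard method Pattern.quote, which wraps a string in \Q and \E. The class name is hypothetical.

```java
import java.util.regex.Pattern;

final class QuotationDemo {
    // Returns true if input is exactly the given literal text.
    static boolean matchesLiteral(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        String text = "1+1=2";
        System.out.println(text.matches("1+1=2"));          // false: + means one or more
        System.out.println(text.matches("1\\+1=2"));        // true: + is quoted
        System.out.println(text.matches("\\Q1+1=2\\E"));    // true: everything is quoted
        System.out.println(matchesLiteral(text, text));     // true
    }
}
```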
Special constructs (non-capturing)
The following table shows special constructs.
Table 14. Special constructs (non-capturing)
| Construct | Matches |
| (?:X) | X, as a non-capturing group |
| (?idmsux-idmsux) | Nothing, but turns match flags i d m s u x on - off |
| (?idmsux-idmsux:X) | X, as a non-capturing group with the given flags i d m s u x on - off |
| (?=X) | X, via zero-width positive lookahead |
| (?!X) | X, via zero-width negative lookahead |
| (?<=X) | X, via zero-width positive lookbehind |
| (?<!X) | X, via zero-width negative lookbehind. Does not match if X occurs immediately before the current position. For example, if X=Bill\s, the regular expression pattern (?<!Bill\s)Ford matches only terms that match the string Ford and that are not preceded by the term Bill. |
| (?>X) | X, as an independent, non-capturing group |
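The Bill and Ford example can be tried directly. The following sketch writes the negative lookbehind as (?<!Bill\s)Ford and adds one lookahead and one independent group for comparison. The class LookaroundDemo and the helper firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Negative lookbehind: Ford, unless preceded by Bill and a whitespace character.
        System.out.println(firstMatch("(?<!Bill\\s)Ford", "Henry Ford")); // Ford
        System.out.println(firstMatch("(?<!Bill\\s)Ford", "Bill Ford"));  // null
        // Positive lookahead: the unit is checked but is not part of the match.
        System.out.println(firstMatch("\\d+(?= USD)", "100 USD"));        // 100
        // Independent group: a+ keeps every a, so the following "ab" cannot match.
        System.out.println(firstMatch("(?>a+)ab", "aaab"));               // null
        System.out.println(firstMatch("a+ab", "aaab"));                   // aaab
    }
}
```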
Match flags
The following table shows the most important match flags.
Table 15. The most important match flags
| Construct | Matches |
| (?i) | Case insensitive matching |
| (?d) | Enables UNIX lines mode |
| (?m) | Enables multiline mode |
| (?s) | Enables dotall mode |
| (?u) | Enables Unicode-aware case folding |
| (?x) | Permits whitespace and comments in pattern. In this mode, whitespace is ignored, and embedded comments starting with # are ignored until the end of a line |
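The effect of several flags can be seen by matching the same input with and without them. The sketch below uses embedded flags and, in the last line, the equivalent constant Pattern.CASE_INSENSITIVE passed to Pattern.compile. The class name MatchFlagDemo is hypothetical.

```java
import java.util.regex.Pattern;

final class MatchFlagDemo {
    // Returns true if the whole input matches the regular expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("(?i)hello", "HELLO"));        // true
        System.out.println(matches("a.b", "a\nb"));               // false
        System.out.println(matches("(?s)a.b", "a\nb"));           // true: dot matches newline
        // Lower-case and upper-case e with acute accent: (?i) alone covers US-ASCII only.
        System.out.println(matches("(?i)\u00e9", "\u00c9"));      // false
        System.out.println(matches("(?iu)\u00e9", "\u00c9"));     // true
        // Comments mode: whitespace and # comments in the pattern are ignored.
        String commented = "(?x) \\d+ # first number\n - \\d+ # second number";
        System.out.println(matches(commented, "12-34"));          // true
        System.out.println(Pattern.compile("hello", Pattern.CASE_INSENSITIVE)
                .matcher("HeLLo").matches());                     // true
    }
}
```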
Copyright 1993-2006 Sun Microsystems, Inc. Reprinted with permission