A regular expression is a way of telling awk to select records that
contain certain strings of characters. For example, the instruction:
/ri/ { print }
tells awk to print all records that contain the string ri. Regular
expressions are always enclosed in slashes, as in the instruction
above. For a discussion of regular expressions beyond their usage in
awk, see Appendix C. Regular Expressions (regexp) in
z/OS UNIX System Services Command Reference.
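As a minimal sketch of the /ri/ instruction, the following shell snippet pipes three records through awk. The records and the variable names are assumptions made up for illustration; they are not data from this document.

```shell
# Hypothetical sample records: a name followed by a hobby.
records='Linda tennis
Lori reading
John bridge'

# Select every record that contains the string "ri" anywhere in it.
matched=$(printf '%s\n' "$records" | awk '/ri/ { print }')

# Lori and bridge both contain "ri"; Linda tennis does not.
printf '%s\n' "$matched"
```

Note that the match is not tied to any particular field: ri is found in the first field of one record and in the second field of another.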
The following characters have special meanings when you use them
in regular expressions:
- Character
- Meaning
- ^
- Stands for the beginning of the string being matched (here, a field). For example:
$2 ~ /^b/ { print }
prints any record whose second field begins with b.
- $
- Stands for the end of the string being matched (here, a field). For example:
$2 ~ /g$/ { print }
prints any record with a second field that ends with g.
- .
- Matches any single character (except the newline). For example:
$2 ~ /i.g/ { print }
selects the records whose second field contains ing, and also
selects the records whose second field contains bridge (idg).
- |
- Means or. For example:
/Linda|Lori/
is a regular expression that matches either of the strings Linda or Lori.
- *
- Indicates zero or more repetitions of the preceding character. For example:
/ab*c/
matches abc, abbc, abbbc, and so on. It also matches ac (zero repetitions of b).
Since . matches any character except the newline, .* matches
an arbitrary string of zero or more characters. For example:
$2 ~ /^r.*g$/ { print }
prints any record with a second field that begins with r,
ends in g, and has any set of characters between
(for example, reading, or role playing when fields can contain spaces).
- +
- Is similar to *, but stands for one or more repetitions of a
character. For example:
/ab+c/
matches abc, abbc, and so on, but does not match ac.
- \{m,n\}
- Indicates m to n repetitions of a character (where m and n are
both integers). For example:
/ab\{2,4\}c/
matches abbc, abbbc, and abbbbc, and nothing else.
- ?
- Is similar to *, but stands for zero or one occurrence
of a character. For example:
/ab?c/
matches ac and abc, but not abbc, abbbc, and so on.
- [X]
- Matches any one of the set of characters X given inside
the square brackets. For example:
$1 ~ /^[LJ]/ { print }
prints any record whose first field begins with either L or J.
As a special case: [:lower:] inside the square brackets
stands for any lowercase letter, [:upper:] inside
the square brackets stands for any uppercase letter, [:alpha:] inside
the square brackets stands for any letter, and [:digit:] inside
the square brackets stands for any digit. Thus:
/[[:digit:][:alpha:]]/
matches a digit or letter.
- [^X]
- Matches any one character that is not in the set X. For
example:
$1 ~ /^[^LJ]/ { print }
prints any record with a first field that does not begin with L or J.
Similarly:
$1 ~ /^[^[:digit:]]/ { print }
prints any record with a first field that does not begin
with a digit.
- (X)
- Matches anything that the regular expression X does. You
can use parentheses to control how other special characters behave.
For example, * normally applies to the single character
immediately preceding it. This means that:
/abc*d/
matches abd, abcd, abccd, and so on. However:
/a(bc)*d/
matches ad, abcd, abcbcd, abcbcbcd, and so on.
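The special characters in the list above can be tried out with a short shell sketch such as the following. The sample records, test strings, and variable names are all hypothetical and chosen only for illustration. The repetition patterns are wrapped in ^ and $ so that each one must match the whole record; without the anchors, a pattern such as /ab*c/ would also select any longer record that merely contains a match.

```shell
# Hypothetical sample records: field 1 is a name, field 2 is a hobby.
records='Linda bridge
Lori reading
John running
Mary bowling'

# ^ and $ anchor the match to the start and end of the second field.
starts_b=$(printf '%s\n' "$records" | awk '$2 ~ /^b/ { print $1 }')
r_to_g=$(printf '%s\n' "$records" | awk '$2 ~ /^r.*g$/ { print $1 }')

# | matches either alternative; [^LJ] matches any one character except L or J.
either=$(printf '%s\n' "$records" | awk '/Linda|Lori/ { print $1 }')
not_lj=$(printf '%s\n' "$records" | awk '$1 ~ /^[^LJ]/ { print $1 }')

# Hypothetical test strings for the repetition operators, one per record.
strings='ac
abc
abbc
ad
abcbcd'

# *, + and ? differ only in how many b characters they accept.
star=$(printf '%s\n' "$strings" | awk '/^ab*c$/ { print }')
plus=$(printf '%s\n' "$strings" | awk '/^ab+c$/ { print }')
opt=$(printf '%s\n' "$strings" | awk '/^ab?c$/ { print }')

# Parentheses make * apply to the whole group bc rather than to c alone.
group=$(printf '%s\n' "$strings" | awk '/^a(bc)*d$/ { print }')

# Unquoted expansions join each multi-line result onto one line.
echo "starts with b:" $starts_b   # Linda Mary
echo "r ... g:" $r_to_g           # Lori John
echo "Linda|Lori:" $either        # Linda Lori
echo "not L or J:" $not_lj        # Mary
echo "ab*c:" $star                # ac abc abbc
echo "ab+c:" $plus                # abc abbc
echo "ab?c:" $opt                 # ac abc
echo "a(bc)*d:" $group            # ad abcbcd
```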
The characters with special meanings are:
^ $ . * + ? [ ] ( ) |
These are known as metacharacters.
When a metacharacter appears in a regular expression, it usually
has its special meaning. If you want to use one of these characters
literally (without its special meaning), put a backslash in front
of the character. For example:
/\$1/ { print }
prints all records that contain a dollar sign $ followed by a 1.
If you simply entered:
/$1/ { print }
awk would search for records where the end of the record was followed
by a 1, which is impossible.
Because the backslash has this special meaning, \ is
also considered a metacharacter. If you want to create a regular expression
that matches a backslash, you must therefore use two backslashes \\.
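The two escapes described above can be sketched in the shell as follows. The sample records and variable names are assumptions for illustration only. The awk programs are placed in single quotes so that the shell passes the backslashes and dollar signs to awk unchanged.

```shell
# Hypothetical sample records containing literal $ and \ characters.
records='price $1
price 1
path a\b'

# \$ matches a literal dollar sign instead of the end of the record.
dollar=$(printf '%s\n' "$records" | awk '/\$1/ { print }')

# \\ matches a single literal backslash.
backslash=$(printf '%s\n' "$records" | awk '/\\/ { print }')

printf '%s\n' "$dollar"      # price $1
printf '%s\n' "$backslash"   # path a\b
```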