Regular expressions
A regular expression is a way of telling awk to
select records that contain certain strings of characters. For example,
the instruction:
/ri/ { print }
tells awk to
print all records that contain the string ri. Regular
expressions are always enclosed in slashes, as shown in the
instruction just discussed. For a discussion of regular expressions
beyond their usage in awk, see Appendix C. Regular Expressions (regexp) in z/OS UNIX System Services Command Reference.
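As a quick illustration, the following shell sketch feeds three records to awk and applies the /ri/ pattern shown above. The sample names and the variable names `records` and `out` are assumptions made up for this example.

```shell
# Three sample records (hypothetical data), one per line.
records='Marie
Linda
Lori'

# Print only the records that contain the string "ri".
out=$(printf '%s\n' "$records" | awk '/ri/ { print }')
printf '%s\n' "$out"
# prints:
# Marie
# Lori
```

Linda is skipped because the letters r and i never appear next to each other in that record.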
The following characters have special meanings when you use them
in regular expressions:
- Character
- Meaning
- ^
- Stands for the beginning of the string being matched (a field, or the whole record when no field is named). For example:
$2 ~ /^b/ { print }
prints any record whose second field begins with b.
- $
- Stands for the end of the string being matched (a field, or the whole record when no field is named). For example:
$2 ~ /g$/ { print }
prints any record with a second field that ends with g.
- .
- Matches any single character (except the newline). For example:
$2 ~ /i.g/ { print }
selects the records whose second field contains i and g separated by any single character, such as ing, or idg as in bridge.
- |
- Means or. For example:
/Linda|Lori/
is a regular expression that matches either of the strings Linda or Lori.
- *
- Indicates zero or more repetitions of a character. For example:
/ab*c/
matches abc, abbc, abbbc, and so on. It also matches ac (zero repetitions of b). Since . matches any character except the newline, .* matches an arbitrary string of zero or more characters. For example:
$2 ~ /^r.*g$/ { print }
prints any record with a second field that begins with r, ends in g, and has any set of characters between (for example, reading and role playing).
- +
- Is similar to *, but stands for one or
more repetitions of a character. For example:
/ab+c/
matches abc, abbc, and so on, but does not match ac.
- \{m,n\}
- Indicates m to n repetitions of a character (where m and n are
both integers). For example:
/ab\{2,4\}c/
matches abbc, abbbc, and abbbbc, and nothing else.
- ?
- Is similar to *, but stands for zero or one occurrence
of a character. For example:
/ab?c/
matches ac and abc, but not abbc, and so on.
- [X]
- Matches any one of the set of characters X given inside
the square brackets. For example:
$1 ~ /^[LJ]/ { print }
prints any record whose first field begins with either L or J. As a special case: [:lower:] inside the square brackets stands for any lowercase letter, [:upper:] inside the square brackets stands for any uppercase letter, [:alpha:] inside the square brackets stands for any letter, and [:digit:] inside the square brackets stands for any digit.
Thus:
/[[:digit:][:alpha:]]/
matches a digit or letter.
- [^X]
- Matches any one character that is not in the set X. For
example:
$1 ~ /^[^LJ]/ { print }
prints any record with a first field that does not begin with L or J.
$1 ~ /^[^[:digit:]]/ { print }
prints any record with a first field that does not begin with a digit.
- (X)
- Matches anything that the regular expression X does. You
can use parentheses to control how other special characters behave.
For example, * normally applies to the single character
immediately preceding it. This means that:
/abc*d/
matches abd, abcd, abccd, and so on. However:
/a(bc)*d/
matches ad, abcd, abcbcd, abcbcbcd, and so on.
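The behavior of several of these characters can be checked with a short shell sketch. The sample records, the candidate strings, and all variable names below are assumptions made up for this illustration. The repetition patterns are anchored with ^ and $ so that each one is tested against a whole line rather than any substring of it.

```shell
# Hypothetical sample data: a first name, then a hobby, one record per line.
data='Linda reading
Lori bridge
Jim rowing
Mark golf'

# Second field begins with r and ends with g.
out1=$(printf '%s\n' "$data" | awk '$2 ~ /^r.*g$/ { print $1 }')   # Linda, Jim

# Second field contains i, any one character, then g (ing or idg).
out2=$(printf '%s\n' "$data" | awk '$2 ~ /i.g/ { print $1 }')      # Linda, Lori, Jim

# First field does not begin with L or J.
out3=$(printf '%s\n' "$data" | awk '$1 ~ /^[^LJ]/ { print $1 }')   # Mark

# Record contains either Linda or Lori.
out4=$(printf '%s\n' "$data" | awk '/Linda|Lori/ { print $1 }')    # Linda, Lori

# Candidate strings for comparing *, +, ? and grouping, one per line.
words='ac
abc
abbc
ad
abcbcd'

star=$(printf '%s\n' "$words" | awk '/^ab*c$/ { print }')      # ac, abc, abbc
plus=$(printf '%s\n' "$words" | awk '/^ab+c$/ { print }')      # abc, abbc
opt=$(printf '%s\n' "$words" | awk '/^ab?c$/ { print }')       # ac, abc
group=$(printf '%s\n' "$words" | awk '/^a(bc)*d$/ { print }')  # ad, abcbcd

printf '%s\n' "$out1" "$out2" "$out3" "$out4" "$star" "$plus" "$opt" "$group"
```

Without the anchors, a pattern such as /ab*c/ would also select abcbcd, because that string contains abc as a substring.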
The characters with special meanings are:
^ $ . * + ? [ ] ( ) |
These are known as metacharacters.
When a metacharacter appears in a regular expression, it usually
has its special meaning. If you want to use one of these characters
literally (without its special meaning), put a backslash in front
of the character. For example:
/\$1/ { print }
prints all records that contain a dollar sign $ followed
by a 1. If you simply entered:
/$1/ { print }
awk would search for records where the end of the record was followed by a 1,
which is impossible.
Because the backslash has this special meaning, \ is also considered a metacharacter. If you want to create a regular expression that matches a backslash, you must therefore use two backslashes \\.
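A minimal shell sketch of both escapes follows. The sample records and the variable names `lines`, `dollar`, and `slash` are assumptions made up for this example. The awk programs are placed in single quotes so that the shell passes the backslashes and dollar signs to awk unchanged.

```shell
# Hypothetical sample records containing literal metacharacters.
lines='cost $1
cost 1
path a\b'

# \$ matches a literal dollar sign, so only the first record is selected.
dollar=$(printf '%s\n' "$lines" | awk '/\$1/ { print }')

# \\ matches a single literal backslash, so only the third record is selected.
slash=$(printf '%s\n' "$lines" | awk '/\\/ { print }')

printf '%s\n' "$dollar" "$slash"
# prints:
# cost $1
# path a\b
```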