Regular expressions

A regular expression is a way of telling awk to select records that contain certain strings of characters. For example, the instruction:

/ri/ { print }

tells awk to print all records that contain the string ri. A regular expression written directly in a program is enclosed in slashes, as in the instruction just shown. For a discussion of regular expressions beyond their use in awk, see Appendix C. Regular Expressions (regexp) in z/OS UNIX System Services Command Reference.
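As a minimal sketch of how such a pattern selects records, the following shell commands pipe three records into awk. The records are hypothetical and chosen only for illustration; they are not taken from any data file used elsewhere in this documentation.

```shell
# Hypothetical sample records, one per line.
out=$(printf '%s\n' 'Jim reading' 'Lori bridge' 'Maria swimming' |
    awk '/ri/ { print }')

# Only the records containing the string ri (Lori, Maria) are selected.
printf '%s\n' "$out"
```

The first record is not selected because the letters r and i never appear next to each other in it.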

The following characters have special meanings when you use them in regular expressions:

Character

Meaning

^

Stands for the beginning of the string being matched, which is a field when the expression is applied to a field. For example:

$2 ~ /^b/ { print }

prints any record whose second field begins with b.

$

Stands for the end of the string being matched, which is a field when the expression is applied to a field. For example:

$2 ~ /g$/ { print }

prints any record with a second field that ends with g.

.

Matches any single character (except the newline). For example:

$2 ~ /i.g/ { print }

selects the records whose second field contains an i and a g separated by any single character. This includes fields containing ing, and also fields containing bridge (idg).

|

Means or. For example:

/Linda|Lori/

is a regular expression that matches either of the strings Linda or Lori.

*

Indicates zero or more repetitions of a character. For example:

/ab*c/

matches abc, abbc, abbbc, and so on. It also matches ac (zero repetitions of b). Since . matches any character except the newline, .* matches an arbitrary string of zero or more characters. For example:

$2 ~ /^r.*g$/ { print }

prints any record with a second field that begins with r, ends in g, and has any set of characters between (for example, reading and role playing).

+

Is similar to *, but stands for one or more repetitions of a character. For example:

/ab+c/

matches abc, abbc, and so on, but does not match ac.

\{m,n\}

Indicates m to n repetitions of a character (where m and n are both integers). For example:

/ab\{2,4\}c/

matches abbc, abbbc, and abbbbc, and nothing else.

?

Is similar to *, but stands for zero or one occurrence of a character. For example:

/ab?c/

matches ac and abc, but not abbc, and so on.

[X]

Matches any one of the set of characters X given inside the square brackets. For example:

$1 ~ /^[LJ]/ { print }

prints any record whose first field begins with either L or J. As a special case, certain names inside the square brackets stand for whole classes of characters: [:lower:] stands for any lowercase letter, [:upper:] for any uppercase letter, [:alpha:] for any letter, and [:digit:] for any digit.

Thus:

/[[:digit:][:alpha:]]/

matches a digit or letter.

[^X]

Matches any one character that is not in the set X. For example:

$1 ~ /^[^LJ]/ { print }

prints any record with a first field that does not begin with L or J. Similarly:

$1 ~ /^[^[:digit:]]/ { print }

prints any record with a first field that does not begin with a digit.

(X)

Matches anything that the regular expression X does. You can use parentheses to control how other special characters behave. For example, * normally applies to the single character immediately preceding it. This means that:

/abc*d/

matches abd, abcd, abccd, and so on. However:

/a(bc)*d/

matches ad, abcd, abcbcd, abcbcbcd, and so on.
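The following sketch combines several of the characters described above: the ^ and $ anchors with .*, a negated bracket expression, and parentheses with *. The two-field records (a name followed by a hobby) and the variable names are hypothetical, introduced only for this illustration.

```shell
# Hypothetical two-field records: a name, then a hobby.
data='Linda reading
Lori bridge
John ring
Bob rowing'

# Names whose second field begins with r and ends with g.
r_to_g=$(printf '%s\n' "$data" | awk '$2 ~ /^r.*g$/ { print $1 }')

# Names whose first field does not begin with L or J.
not_lj=$(printf '%s\n' "$data" | awk '$1 ~ /^[^LJ]/ { print $1 }')

# Records that consist of a, zero or more repetitions of bc, then d.
grouped=$(printf '%s\n' ad abd abcd abccd abcbcd |
    awk '/^a(bc)*d$/ { print }')

printf '%s\n' "$r_to_g" "$not_lj" "$grouped"
```

With these records, the first command selects Linda, John, and Bob, the second selects only Bob, and the third selects ad, abcd, and abcbcd. The strings abd and abccd are rejected by the third command because the * applies to the whole group bc rather than to the single character c.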

The characters with special meanings are:

^   $   .   *   +   ?   [   ]   (   )   |

These are known as metacharacters.

When a metacharacter appears in a regular expression, it usually has its special meaning. If you want to use one of these characters literally (without its special meaning), put a backslash in front of the character. For example:

/\$1/ { print }

prints all records that contain a dollar sign ($) followed by a 1. If you simply entered:

/$1/ { print }

awk would search for records where the end of the record was followed by a 1, which is impossible.

Because the backslash has this special meaning, \ is also considered a metacharacter. If you want to create a regular expression that matches a backslash, you must therefore use two backslashes \\.
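A short sketch of both escapes follows, assuming the awk program is passed in single quotes so that the shell leaves the $ and \ characters untouched. The sample records and variable names are hypothetical.

```shell
# Hypothetical records; the prices and the path are made up for illustration.

# \$ matches a literal dollar sign, so only the record containing $1 is selected.
dollar=$(printf '%s\n' 'cost $1' 'cost 1$' 'cost 21' |
    awk '/\$1/ { print }')

# \\ matches a single literal backslash.
backslash=$(printf '%s\n' 'C:\temp' '/usr/tmp' |
    awk '/\\/ { print }')

printf '%s\n' "$dollar" "$backslash"
```

If the awk program were enclosed in double quotes instead, the shell would process the $ and \ characters before awk saw them, so single quotes are the safer choice for programs that contain regular expressions.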