Regular expressions

A regular expression is a pattern that describes a set of strings and is used to match text. By default, DFSORT treats regular expressions as case-insensitive: the expression can be specified in lowercase, uppercase, or mixed case, and it matches the field data regardless of case. DFSORT compares the NULL-terminated string specified by the INCLUDE or OMIT field against the compiled regular expression. To prepare for the comparison, DFSORT internally appends a NULL byte to the specified regular expression and then compiles it. If an error is detected during compilation, DFSORT terminates with an error message.
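The compile-then-compare behavior described above can be sketched in Python. This is only an illustration under stated assumptions: it uses Python's re module rather than DFSORT's own matcher, and the helper names compile_pattern and field_matches are hypothetical.

```python
import re


def compile_pattern(pattern: str) -> re.Pattern:
    """Compile a pattern case-insensitively; reject an invalid pattern up front."""
    try:
        return re.compile(pattern, re.IGNORECASE)
    except re.error as exc:
        # Stand-in for "terminates with an error message" (hypothetical behavior).
        raise ValueError(f"invalid regular expression {pattern!r}: {exc}") from exc


def field_matches(compiled: re.Pattern, field: str) -> bool:
    """Return True if the compiled pattern is found anywhere in the field."""
    return compiled.search(field) is not None


matcher = compile_pattern("ibm")
print(field_matches(matcher, "Made by IBM"))   # True
print(field_matches(matcher, "made by Ibm"))   # True
print(field_matches(matcher, "no match here")) # False
```

Compiling once and reusing the compiled object for every record mirrors the order of operations in the text: the pattern is validated before any comparison takes place.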

This support is based on the following standards and extensions:

XPG4 (X/Open Common Applications Environment Specification, System Interfaces and Headers, Issue 4.)
XPG4.2 (X/Open Common Applications Environment Specification, System Interfaces and Headers, Issue 4, Version 2.)
Single UNIX Specification, Version 3 (IEEE Std 1003.1-2001.)
z/OS UNIX (functions that provide z/OS UNIX support beyond the defined standards.)

Two versions of regular expressions are supported:

Basic regular expressions (BRE)
Extended regular expressions (ERE)

Regular expressions can be made up of normal characters or special characters, sometimes called metacharacters. Basic and extended regular expressions differ only in the metacharacters they can contain.

The basic regular expression metacharacters are:

^ $ . * \( \) [ \{ \} \

The extended regular expression metacharacters are:

| ^ $ . * + ? ( ) [ { }
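To make the difference between the two metacharacter sets concrete, the following sketch rewrites a basic regular expression into the equivalent extended form. It is a simplified illustration, not DFSORT's implementation: the function name bre_to_ere is hypothetical, and bracket expressions and the positional rules for * and the anchors are deliberately ignored.

```python
import re

GROUPING = "(){}"     # written \( \) \{ \} in BRE, bare in ERE
ERE_ONLY = "(){}+?|"  # ordinary characters in BRE, special in ERE


def bre_to_ere(bre: str) -> str:
    """Rewrite a simple BRE as an ERE (assumes no bracket expressions)."""
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # \( \) \{ \} are operators in BRE: drop the backslash.
            # Any other escape is copied through unchanged.
            out.append(nxt if nxt in GROUPING else ch + nxt)
            i += 2
        elif ch in ERE_ONLY:
            # Ordinary in BRE but special in ERE: add a backslash.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(bre_to_ere(r"\(ab\)\{2\}"))  # (ab){2}
print(bre_to_ere("a+b"))           # a\+b
```

The translated pattern can then be handed to an ERE-style engine; for these simple cases Python's re module accepts the same syntax.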

Table 1. Regular expression metacharacters
Symbol	Description
.	The period symbol matches any one character except the terminal newline character.
[character-character]	The hyphen symbol, within square brackets, means “through.” It fills in the intervening characters according to the current collating sequence. For example, [a-z] can be equivalent to [abc...xyz] or [aAbBcC...xXyYzZ].
[string]	A string within square brackets specifies any of the characters in string. Thus [abc], if compared to other strings, would match any that contained a, b, or c.
{n} {n,} {n,u}	Integer values enclosed in {} indicate the number of times to apply the preceding regular expression. n is the minimum number, and u is the maximum number. {n} matches exactly n occurrences, {n,} matches n or more occurrences, and {n,u} matches at least n and at most u occurrences. In a basic regular expression, write the braces as \{ and \}.
*	The asterisk symbol matches 0 or more occurrences of the preceding expression. For example, a*e matches e, ae, and aaaaae, so it finds a match in each of the following: 99ae9, aaaaae, a999e99.
$	The dollar symbol matches the end of the string.
character+	The plus symbol matches one or more occurrences of the preceding character (extended regular expressions only). Thus, smith+ern matches, for example, smithern and smithhhern.
[^string]	The caret symbol, when it is the first character inside square brackets, negates the characters within the square brackets. Thus [^abc] matches any single character other than a, b, or c; a string made up only of the characters a, b, and c contains no match.
(expression)	Groups a sub-expression so that an operator, such as * or +, works on the whole sub-expression enclosed in parentheses. For example, in (a(cb+)*)$ the * applies to the group (cb+).
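The rows of Table 1 can be tried out with a short script. The sketch below assumes Python's re module, whose syntax matches the extended regular expression forms shown in the table for these cases; it does not run DFSORT, and the helper name found is hypothetical.

```python
import re


def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# .      any one character
print(found(r"a.c", "abc"))                             # True
# {n,u}  between n and u occurrences of the preceding expression
print(re.fullmatch(r"a{2,3}", "aa") is not None)        # True
print(re.fullmatch(r"a{2,3}", "aaaa") is not None)      # False
# *      zero or more occurrences of the preceding expression
print(re.search(r"a*e", "99ae9").group())               # ae
print(re.search(r"a*e", "a999e99").group())             # e
# +      one or more occurrences of the preceding character
print(re.fullmatch(r"smith+ern", "smithhhern") is not None)  # True
# [^...] any single character not listed
print(found(r"[^abc]", "abcd"))                         # True
print(found(r"[^abc]", "cab"))                          # False
# $      end of the string
print(found(r"ibm$", "big ibm"))                        # True
# (...)  grouping, so that + applies to the whole group
print(re.fullmatch(r"(ab)+", "ababab") is not None)     # True
```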

Note:

Do not use multibyte characters.
You can use the ] (right square bracket) as an ordinary character within a pair of square brackets, but only if it immediately follows the opening left square bracket or [^. For example, []-] matches the ] and - characters.
All the preceding symbols are special. To match the symbol itself, precede it with \. For example, a\.e matches only the string a.e, whereas a.e also matches strings such as axe.
You can use the - (hyphen) as an ordinary character within square brackets, but only if it is the first or last character in the bracket expression. For example, the expression []--0] matches either the ] or else the characters - through 0. Otherwise, use \-.
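The bracket and escaping rules in these notes can be checked with a few lines of Python. This is a sketch using Python's re module as a stand-in; it happens to follow the same rules for these two cases, but other bracket expressions may behave differently than they do under DFSORT.

```python
import re

# ] placed first and - placed last are both ordinary characters
# inside the square brackets.
bracket = re.compile(r"[]-]")
print(bracket.findall("a]b-c"))  # [']', '-']

# An escaped period matches only a literal period ...
literal_dot = re.compile(r"a\.e")
print(literal_dot.search("a.e") is not None)  # True
print(literal_dot.search("axe") is not None)  # False

# ... whereas an unescaped period matches any one character.
any_char = re.compile(r"a.e")
print(any_char.search("axe") is not None)     # True

# re.escape adds the backslashes needed to treat text literally.
print(re.escape("a.e"))
```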

The following patterns are given as examples, along with descriptions of what they match:

abc: Matches any record containing the three letters abc in that order.
a.c: Matches any record containing the letter a, followed by any one character, followed by the letter c.
^.$: Matches any record containing exactly one character.
.* [a-z]+ .*: Matches any record containing a word, made up of alphabetic characters in any case, delimited by at least one space on each side.
Johny.*Johny: Matches any record containing at least two occurrences of the string Johny.
^ibm: Matches records beginning with "ibm".
ibm$: Matches records ending with "ibm".
^ibm$: Matches records consisting of exactly "ibm".
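The example patterns above can be exercised against a small set of sample records. The sketch below assumes Python's re module with case-insensitive matching as an approximation of the default behavior described earlier; the sample records and the helper name include are hypothetical, and records are treated as plain strings without padding.

```python
import re

# Hypothetical sample records.
records = [
    "abc",
    "x",
    "say hello world",
    "Johny met Johny",
    "IBM",
    "ibm mainframe",
    "big ibm",
]

patterns = [
    "abc",
    "a.c",
    "^.$",
    ".* [a-z]+ .*",
    "Johny.*Johny",
    "^ibm",
    "ibm$",
    "^ibm$",
]


def include(pattern: str, rows: list) -> list:
    """Keep the rows in which the pattern is found, ignoring case."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [row for row in rows if compiled.search(row)]


for pattern in patterns:
    print(f"{pattern!r:>16} -> {include(pattern, records)}")
```

An omit-style filter would be the complement: keep the rows for which the search finds no match.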