Using regular expressions

A regular expression is a pattern that is used to match characters in a string. Here are examples of some of the most common regular expression patterns:
Account
Matches the characters Account. By default, searches are case-sensitive.
[A-Z]
Matches one uppercase letter.
[A-Z]{3}
Matches three consecutive uppercase letters.
[0-9]{5}
Matches five consecutive digits.
[0-9]+
Matches one or more digits.
[^a-z]
Matches any one character that is not a lowercase letter a to z.
\s
Matches one whitespace character, such as space or tab.
\S
Matches any character that is not whitespace.
Note: The only metacharacters that are supported are \s and \S. All other metacharacters require the equivalent regular expression, for example:
\d = [0-9]
\p{L} = [A-Za-z]
\r = 0x0D

For additional information about the syntax of compiled regular expressions, see https://www.ibm.com/docs/en/zos/2.5.0?topic=functions-regcomp-compile-regular-expression.
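
The common patterns in the list above can be tried out before they are placed in ACIF parameters. The following is a minimal sketch that uses the Python re module as a stand-in; that choice is an assumption, and the regular expression engine that ACIF uses can differ in details. The sample strings and the first_match helper are hypothetical.

```python
import re

# Patterns modeled on the list above, each paired with a made-up sample string.
samples = {
    r"Account": "Account 1234",
    r"[A-Z]{3}": "id ABC",
    r"[0-9]{5}": "zip 90210",
    r"[0-9]+": "page 7",
    r"[^a-z]": "abcXdef",
    r"\s": "a b",
    r"\S": "  x",
}


def first_match(pattern, text):
    """Return the first substring of text that matches pattern, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


for pattern, text in samples.items():
    print(f"{pattern!r:12} on {text!r:16} -> {first_match(pattern, text)!r}")
```

Because searches are case-sensitive, first_match("Account", "account") finds nothing in this sketch.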

ACIF can use a regular expression in the TRIGGER and FIELD parameters. In the TRIGGER parameter, the regular expression specifies the pattern to search for. In the FIELD parameter, the regular expression is applied to the characters that are extracted from the field in a way that is similar to using a mask. The regular expression must be specified in the code page indicated by the CPGID parameter.

If the CPGID of the document is EBCDIC, the regular expression can be specified as text. For example:
CPGID=037
TRIGGER1=*,*,'PAGE',(TYPE=GROUP)
TRIGGER2=*,25,REGEX='[A-Z]{3}-[A-Z]{6}',(TYPE=FLOAT)
FIELD1=0,9,2,(TRIGGER=1,BASE=TRIGGER)
FIELD2=0,38,10,(TRIGGER=2,BASE=0,REGEX='[A-Z] [0-9]{3}-\S+')
INDEX1='Page',FIELD1,(TYPE=GROUP,BREAK=YES)
INDEX2='Source-ID',FIELD2
In the example, TRIGGER2 uses a regular expression that specifies a pattern of three uppercase letters, a hyphen, and six uppercase letters. The text "SUB-SOURCE" matches the pattern. FIELD2 uses a regular expression that specifies one uppercase letter, a space, three digits, a hyphen, and one or more non-whitespace characters. The character strings Q 010-1, I 000-RS, and L 133-1B match the regular expression pattern.
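
The matches that are described for TRIGGER2 and FIELD2 can be checked outside ACIF. This sketch assumes that the Python re module behaves like the ACIF engine for these simple patterns; the matches_whole helper is hypothetical.

```python
import re

TRIGGER2_REGEX = r"[A-Z]{3}-[A-Z]{6}"
FIELD2_REGEX = r"[A-Z] [0-9]{3}-\S+"


def matches_whole(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches_whole(TRIGGER2_REGEX, "SUB-SOURCE"))
for value in ["Q 010-1", "I 000-RS", "L 133-1B"]:
    print(value, matches_whole(FIELD2_REGEX, value))
```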
If the CPGID parameter of the document is not EBCDIC, the regular expression must be specified in hexadecimal in the code page that is indicated by the CPGID parameter. For example:
CPGID=850
TRIGGER1 = *,1,REGEX = X'5B302D395D7B337D' /* [0-9]{3} */
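
The hexadecimal form of a regular expression can be generated rather than typed by hand. The following sketch assumes that the Python codec cp850 corresponds to CPGID 850; the regex_to_hex helper is hypothetical and is not part of ACIF.

```python
def regex_to_hex(pattern, codec):
    """Encode pattern in the given code page and format it as an X'...' literal."""
    return "X'" + pattern.encode(codec).hex().upper() + "'"


# Prints X'5B302D395D7B337D', the value used in the example above.
print(regex_to_hex("[0-9]{3}", "cp850"))
```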

Using a regular expression on the TRIGGER parameter

On the TRIGGER parameter, use the regular expression instead of a text string. A regular expression can be used on both a group trigger and a floating trigger. The maximum length of the regular expression is 250 bytes.

If an asterisk is specified for the column, ACIF searches the entire record for the string that matches the regular expression. If a column is specified, ACIF searches the text starting in that column for the string that matches the regular expression. The regular expression must match text that begins in that column. If a column range is specified, ACIF searches only the text within the column range for the string that matches the regular expression. The regular expression must match text that begins in one of the columns specified by the column range.
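
The three column forms can be modeled as follows. This is a sketch only: it uses the Python re module in place of the ACIF engine, the trigger_matches helper and the sample record are hypothetical, and for a column range it enforces only that the match begins inside the range (whether the match must also end inside the range is not modeled).

```python
import re


def trigger_matches(record, pattern, column="*"):
    """Model a trigger search.

    column is "*" (search the entire record), an integer N (the match must
    begin in 1-based column N), or a tuple (first, last) (the match must
    begin in one of the columns first..last).
    """
    compiled = re.compile(pattern)
    if column == "*":
        return compiled.search(record) is not None
    first, last = (column, column) if isinstance(column, int) else column
    for start in range(first, last + 1):
        index = start - 1  # convert the 1-based column to a 0-based index
        if index < len(record) and compiled.match(record, index):
            return True
    return False


record = "   INV-2024 paid"                               # INV begins in column 4
print(trigger_matches(record, r"INV-[0-9]{4}", "*"))      # anywhere in the record
print(trigger_matches(record, r"INV-[0-9]{4}", 1))        # must begin in column 1
print(trigger_matches(record, r"INV-[0-9]{4}", (2, 6)))   # must begin in columns 2 to 6
```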

The maximum record length to which the regular expression can be applied is 2 KB (2048 bytes). If longer records are in the file, use a trigger column range to specify a subset of the record. When the regular expression matches the text in a record, ACIF looks for the next trigger, or, if all the group triggers are found, ACIF collects the fields.

Using a regular expression on the FIELD parameter

On the FIELD parameter, use the regular expression instead of a mask. A mask and a regular expression cannot both be specified on the same FIELD parameter. The maximum length of the regular expression is 250 bytes.

The regular expression can be specified on a field based on a group trigger, a field based on a floating trigger, or a transaction field. Masks can be specified only on fields based on floating triggers and transaction fields. The maximum length of a field that can be specified in the FIELD parameter is 250 bytes.

ACIF extracts the text specified by the column and length values. After the field is extracted, ACIF applies the regular expression to the text. Any text that matches the regular expression is extracted for the field. If the matching text is shorter than the length specified in the FIELD parameter, it is padded with blanks until it equals the length. If the regular expression does not match any text in the field, or if the record is too short to contain the entire field, the following rules apply:
  • For a field based on a group trigger, the default value that is specified on the FIELD parameter is used. If no default value is specified, ACIF ends with error message APK488S.
  • If the record is only long enough to contain part of the field, the regular expression is applied only to the portion of the record that is present.
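
The extraction steps described above (cut the field by column and length, apply the regular expression, pad the match with blanks) can be sketched as follows. The Python re module stands in for the ACIF engine, and the extract_field helper and the sample record are hypothetical.

```python
import re


def extract_field(record, column, length, pattern):
    """Return the matched text padded with blanks to length, or None if nothing matches."""
    # Slicing a short record yields only the portion that is present.
    text = record[column - 1 : column - 1 + length]
    found = re.search(pattern, text)
    if found is None:
        return None
    return found.group(0).ljust(length)


record = " " * 37 + "Q 010-1 rest"     # field text begins in column 38
value = extract_field(record, 38, 10, r"[A-Z] [0-9]{3}-\S+")
print(repr(value))
```

In this sketch a return value of None stands for the no-match case; what ACIF does in that case depends on the field type.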

Using default values when regular expressions do not match

If the regular expression does not match any text in the field, a default value might be used. Whether a default value is used depends on the field type:
  • GROUP field
    • If a regular expression does not match any text in the GROUP field, the default value that is specified on the FIELD parameter is used. If no default value is specified, ACIF ends processing with error message APK488S.
    • If the record is only long enough to contain part of the field, the regular expression is applied only to the portion of the record that is present.
    • If the record is not long enough to contain even the first byte of the field, the default value that is specified on the FIELD parameter is used. If no default value is specified, ACIF ends with error message APK449S.
  • FLOAT field
    • If a regular expression does not match any text in the FLOAT field, no error occurs, and the default value that is specified on the FIELD parameter is not used.
    • If the record is only long enough to contain part of the field, the regular expression is applied only to the portion of the record that is present.
    • If the record is not long enough to contain even the first byte of the field, the default value that is specified on the FIELD parameter is used. If no default value is specified, ACIF ends processing with error message APK449S.
  • Transaction fields (GROUPRANGE and PAGERANGE)
    • If the regular expression does not match any text in the transaction field, no error occurs, and processing continues. A default value cannot be specified for a transaction field.
    • If the record is not long enough to contain the entire field, no error occurs, and processing continues.
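
The rules in the list above can be summarized as a small decision function. This is a model of the documented behavior under stated assumptions, not ACIF code: the resolve_field helper is hypothetical, a RuntimeError stands for ACIF ending with the named message, and a return value of None stands for "no value extracted, processing continues".

```python
import re


def resolve_field(kind, text, length, pattern, default=None):
    """Model the value of a field for kind 'GROUP', 'FLOAT', or 'TRANSACTION'.

    text is the portion of the field that is present in the record; an empty
    string means the record does not contain even the first byte of the field.
    """
    if kind not in ("GROUP", "FLOAT", "TRANSACTION"):
        raise ValueError("unknown field kind: " + kind)
    if kind != "TRANSACTION" and text == "":
        # GROUP and FLOAT fields: the record is too short for any of the field.
        if default is None:
            raise RuntimeError("APK449S")
        return default
    found = re.search(pattern, text)
    if found:
        return found.group(0).ljust(length)
    if kind == "GROUP":
        # No match in a GROUP field: the default is used, or ACIF ends.
        if default is None:
            raise RuntimeError("APK488S")
        return default
    # FLOAT and transaction fields: no error, and no default is applied.
    return None


print(resolve_field("GROUP", "no digits", 5, r"[0-9]+", default="00000"))
print(resolve_field("FLOAT", "no digits", 5, r"[0-9]+", default="00000"))
```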

Other considerations for using regular expressions

All text to which the regular expression is applied is converted to UTF-16. Keep in mind:
  • Performance might be slower when you are using a regular expression than when you are using a text string.
  • If the CPGID value is incorrect, the conversion might fail with error message APK2080I.

If the regular expression is not valid, ACIF fails with error message APK484S.
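
Some problems can be caught before ACIF runs. The following sketch checks two limits that this topic states: the 250-byte maximum length and the restriction that \s and \S are the only supported metacharacters. The lint_regex helper is hypothetical, the cp037 default codec is an assumption about the code page in use, and passing this check does not guarantee that ACIF accepts the expression.

```python
import re

MAX_REGEX_BYTES = 250  # limit stated for both TRIGGER and FIELD


def lint_regex(pattern, codec="cp037"):
    """Return a list of problems found in pattern; an empty list means none found."""
    problems = []
    if len(pattern.encode(codec)) > MAX_REGEX_BYTES:
        problems.append("longer than 250 bytes")
    # Walk the escape sequences pair by pair so that an escaped backslash
    # followed by a letter is not mistaken for a metacharacter.
    for escaped in re.findall(r"\\(.)", pattern):
        if escaped.isalpha() and escaped not in ("s", "S"):
            problems.append(f"unsupported metacharacter \\{escaped}")
    return problems


print(lint_regex(r"[0-9]{3}-\S+"))
print(lint_regex(r"\d{3}"))
```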

Examples of using regular expressions

Using a regular expression for a trigger

TRIGGER1=*,1,REGEX='P[A-Z]{3} ',(TYPE=GROUP)

In this example, the regular expression matches text that begins in column 1 and consists of the letter P, three uppercase letters, and a space. For example, the text PAGE followed by a space matches.
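
A quick check of which records this trigger would accept, assuming that the Python re module behaves like the ACIF engine for this pattern. The begins_in_column_1 helper and the sample records are hypothetical.

```python
import re

TRIGGER1_REGEX = r"P[A-Z]{3} "


def begins_in_column_1(record):
    """Return True if the pattern matches text that begins in column 1."""
    return re.match(TRIGGER1_REGEX, record) is not None


for record in ["PAGE 12", "PART A", "PAGE12", " PAGE 1", "Page 1"]:
    print(repr(record), begins_in_column_1(record))
```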

Using a regular expression to extract a date

TRIGGER1=*,1,'1'
FIELD1=0,13,18,(REGEX='[A-Z][a-z]+ [0-9]+, [0-9]{4}',DEFAULT='January 1, 1970')
INDEX1='Date',FIELD1

In this example, the regular expression matches a date that consists of an uppercase letter, one or more lowercase letters, a space, one or more digits, a comma, a space, and four digits. For example, July 4, 1956. If no text in the field matches the regular expression pattern, the default value of January 1, 1970 is used.
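
The date extraction can be approximated as follows. The sketch assumes the Python re module in place of the ACIF engine; the extract_date helper and the sample records are hypothetical, and whether ACIF pads the default value is not modeled.

```python
import re

DATE_REGEX = r"[A-Z][a-z]+ [0-9]+, [0-9]{4}"


def extract_date(record, column=13, length=18, default="January 1, 1970"):
    """Apply the date pattern to the 18-byte field that begins in column 13."""
    text = record[column - 1 : column - 1 + length]
    found = re.search(DATE_REGEX, text)
    # Pad a match with blanks to the field length; otherwise use the default.
    return found.group(0).ljust(length) if found else default


with_date = "1" + " " * 11 + "July 4, 1956   more text"
without_date = "1" + " " * 11 + "no date here"
print(repr(extract_date(with_date)))
print(repr(extract_date(without_date)))
```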

Using a regular expression with a transaction field

TRIGGER1=*,1,'1'
FIELD1=0,30,3
FIELD2=*,*,12,(OFFSET=(59:70),ORDER=BYROW,REGEX='[0-9]{3}-[0-9]{2}-[0-9]{4}')
INDEX1='DEPT',FIELD1,(TYPE=GROUP)
INDEX2='SOCIAL SECURITY NUMBER',FIELD2,(TYPE=GROUPRANGE)

In this example, the regular expression is used to extract social security numbers that consist of three digits, a hyphen, two digits, a hyphen, and four digits.
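
The row-by-row search in columns 59 to 70 can be sketched as follows. The Python re module stands in for the ACIF engine, the collect_transaction_values helper is hypothetical, and the sample numbers are made up. How ACIF turns the collected values into a GROUPRANGE index is not modeled here.

```python
import re

SSN_REGEX = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"


def collect_transaction_values(records, first_col=59, last_col=70):
    """Return the text that matches the pattern in the column range of each record."""
    values = []
    for record in records:
        text = record[first_col - 1 : last_col]
        found = re.search(SSN_REGEX, text)
        if found:
            # Rows without a match are skipped; no error is raised for them.
            values.append(found.group(0))
    return values


records = [
    " " * 58 + "123-45-6789 ",
    " " * 58 + "TOTAL       ",
    " " * 58 + "987-65-4321 ",
    "short record",
]
print(collect_transaction_values(records))
```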