Identifying simple pattern classes

Simple pattern classes identify the type of each token in the input data so that a pattern can match the data and the associated pattern actions can run.

Simple pattern classes are represented by single characters.

Within patterns, some single character classes conflict with the syntax of the pattern tables, so you must precede them with the backslash (\) escape character. Escape the following single character classes: the hyphen (-), slash (/), number sign (#), left and right parentheses (( and )), and ampersand (&).

Take care when specifying SEPLIST and STRIPLIST entries. For example, to recognize the ampersand as a single token, include it in the SEPLIST but not in the STRIPLIST. If the backslash is in the SEPLIST, its class is \ (backslash). To use a backslash in a pattern, escape it with another backslash, that is, write a double backslash (\\). See also Applying parsing rules to a list.
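The interaction between SEPLIST and STRIPLIST can be pictured with a short Python sketch. It is only an illustration under simplifying assumptions, not the product's tokenizer: the function name tokenize and its default separator and strip lists are hypothetical, and the real parsing rules handle more cases.

```python
def tokenize(text, seplist=" ,&", striplist=" "):
    """Split text into tokens (hypothetical simplification).

    A character in seplist ends the current token. It becomes a token of
    its own unless it is also in striplist, in which case it is discarded.
    """
    tokens = []
    current = ""
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append(current)
                current = ""
            if ch not in striplist:
                tokens.append(ch)  # separator kept as its own token
        elif ch in striplist:
            continue  # stripped without splitting the token
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(tokenize("1ST & MAIN ST"))  # ['1ST', '&', 'MAIN', 'ST']
```

With the ampersand in the strip list as well (striplist=" &"), the same call returns ['1ST', 'MAIN', 'ST'], so no pattern could match the ampersand.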

The NULL class (0) is not included in this list of single character classes. The NULL class is used in the classifications (.CLS) or in the RETYPE action to make a token NULL. Because a NULL class never matches anything, it is never used in a pattern.

The simple pattern classes are as follows:
Table 1. List and description of simple pattern classes
Class Description
A - Z User-supplied class from the classifications

The classes A - Z correspond to classes that you code in the classifications. For example, if APARTMENT is given the class of U in the classifications, then APARTMENT matches a simple pattern of U.

^ Numeric

The class ^ (caret) represents a single number, for example, the number 123. However, the number 1,230 uses three tokens: the number 1, a comma, and the number 230.

? One or more consecutive words that are not in the classifications

The class ? (question mark) represents one or more consecutive alphabetic words. For example, MAIN, CHERRY HILL, and SATSUMA PLUM TREE HILL each match a single ? class, provided that none of these words is in the classifications for the rule set. Class ? is useful for street names when multi-word and single-word street names must be treated identically.
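One way to picture how a run of unclassified words collapses into a single ? match is the following Python sketch. It is an illustration only: group_unknown_words and the sample classifications dictionary are hypothetical, and the product's actual matching logic is not shown in this document.

```python
def group_unknown_words(tokens, classifications):
    """Merge consecutive unclassified alphabetic words into one '?' group.

    classifications maps an uppercase word to a user class letter
    (hypothetical stand-in for the rule set's classifications).
    Returns a list of (class, text) pairs.
    """
    groups = []
    for tok in tokens:
        cls = classifications.get(tok.upper())
        if cls is None and tok.isalpha():
            if groups and groups[-1][0] == "?":
                # Extend the current run of unclassified words.
                groups[-1] = ("?", groups[-1][1] + " " + tok)
            else:
                groups.append(("?", tok))
        elif cls is None and tok.isdigit():
            groups.append(("^", tok))  # numeric token
        else:
            groups.append((cls, tok))  # classified, or not handled here (None)
    return groups


tokens = ["123", "SATSUMA", "PLUM", "TREE", "HILL", "RD"]
print(group_unknown_words(tokens, {"RD": "T"}))
# [('^', '123'), ('?', 'SATSUMA PLUM TREE HILL'), ('T', 'RD')]
```

A single-word street name such as MAIN produces the same class sequence (^, ?, T), which is why one pattern can cover both cases.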

+ A single alphabetic word that is not in the classifications
The class + (plus sign) is useful for separating the parts of an unknown string. For example, in a name like OWAIN LIAM JONES, copy the individual words to the given name, middle name, and family name columns as follows:
+ | + | +
COPY [1] {GivenName}
COPY [2] {MiddleName}
COPY [3] {FamilyName}
& A single token of any type
The class & (ampersand) represents a single token of any class. For example, a pattern to match a single token following an apartment type is:
U | &

SUITE 11 is recognized by this pattern. However, in a case such as APT 1ST FLOOR, only APT 1ST is recognized by this pattern.

\&
Type the backslash (\) escape character before the ampersand to use the ampersand as a literal.
> | \& | ? | T
1ST & MAIN ST is recognized by this pattern. The token 1ST is a leading numeric token (>), and T represents the street type.
> Leading numeric
The class > (greater than symbol) represents a token that consists of numbers followed by letters. For example, an address with a house number like 123A, such as 123A MAPLE AVE, can be matched as follows:
> | ? | T

123A MAPLE AVE is recognized by this pattern. The token 123A contains numbers and alphabetic characters, but the numbers are leading. In this example, T represents the street type.

< Leading alphabetic character
The class < (less than symbol) represents a token that consists of alphabetic characters followed by numbers. It matches tokens such as the following examples:
  • A123
  • ALPHA77

Each token contains alphabetic characters and numbers, but the alphabetic characters are leading.

@ Complex mix
The class @ (at sign) represents tokens that have a complex mixture of alphabetic characters and numerics, such as A123B and 345BCD789. For example, area information like Hamilton ON L8N 2P1 can be matched as follows:
+ | P | @ | @

In this example, P represents Province. The first @ represents L8N and the second @ represents 2P1.
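The character-based classes described so far (^, +, >, <, and @) can be summarized in a short Python sketch. It is a rough approximation for illustration, not the product's classification logic: the function classify, the classifications dictionary, and the uppercase lookup are all assumptions made here.

```python
import re


def classify(token, classifications=None):
    """Return an approximate simple pattern class for one token.

    classifications is a hypothetical dict that maps an uppercase word
    to a user-supplied class letter (A - Z).
    """
    classifications = classifications or {}
    user_class = classifications.get(token.upper())
    if user_class is not None:
        return user_class
    if re.fullmatch(r"[0-9]+", token):
        return "^"  # numeric
    if re.fullmatch(r"[A-Za-z]+", token):
        return "+"  # single unclassified alphabetic word
    if re.fullmatch(r"[0-9]+[A-Za-z]+", token):
        return ">"  # leading numeric, for example 123A
    if re.fullmatch(r"[A-Za-z]+[0-9]+", token):
        return "<"  # leading alphabetic, for example A123
    if re.fullmatch(r"[A-Za-z0-9]+", token):
        return "@"  # complex mix, for example L8N
    return None  # punctuation and other cases are outside this sketch


tokens = ["Hamilton", "ON", "L8N", "2P1"]
print([classify(t, {"ON": "P"}) for t in tokens])  # ['+', 'P', '@', '@']
```

The printed class sequence corresponds to the pattern + | P | @ | @ for the sample area information.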

~ Special punctuation

The class ~ (tilde) represents special characters that are not in the SEPLIST. For example, if a SEPLIST does not contain the dollar sign and percent sign, then you might use the following pattern:

~ | +

In this example, $ HELLO and % OFF match the pattern.

k One or more Chinese numeric characters
/ Literal
The class / (slash) is useful for fractional addresses like 123 1/2 MAPLE AVE, which matches the following pattern:
^ | ^ | / | ^ | ? | T
\/ Backslash, forward slash

You can use the backslash (\) escape character with the slash in the same manner that you use the / (slash) class.

- Literal
The class - (hyphen) is often used for address ranges. For example, an address range like 123-127 matches the following pattern:
^ | - | ^
\-

You can use the backslash (\) escape character with the hyphen in the same manner you use the - (hyphen) class.

\# Literal. You must use this class with the backslash (\) escape character, for example: \#.
The class # (number sign) is often used as a unit prefix. For example, an address like suite #12 or unit #9A matches the following pattern:
U | \# | &
( ) Literal

The classes ( and ) (parentheses) are used to enclose operands or user variables in pattern syntax. The following example of pattern syntax includes a leading numeric operand, (n), and a trailing character operand, (-c):

> | ? | T
COPY [1](n) {HouseNumber}
COPY [1](-c) {HouseNumberSuffix}
COPY [2] {StreetName}
COPY_A [3] {StreetSuffixType}
EXIT

This example can recognize the address 123A MAPLE AVE. The number 123 is recognized as the house number, and the letter A is recognized as the house number suffix.
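The effect of the two operands on a leading numeric token can be approximated in Python. This is a minimal sketch for illustration, not the product's implementation: split_leading_numeric is a hypothetical helper, and it assumes the token consists of digits followed only by letters.

```python
import re


def split_leading_numeric(token):
    """Split a token such as 123A into its leading number and trailing letters.

    Roughly mirrors what the (n) and (-c) operands select from a
    leading numeric (>) token in the example (hypothetical helper).
    """
    match = re.fullmatch(r"([0-9]+)([A-Za-z]+)", token)
    if match is None:
        raise ValueError(f"not a leading numeric token: {token!r}")
    return match.group(1), match.group(2)


house_number, suffix = split_leading_numeric("123A")
print(house_number, suffix)  # 123 A
```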

\( and \)

Use the backslash (\) escape character with the opening parenthesis or closing parenthesis to filter out parenthetical remarks. To remove a parenthetical remark such as (see Joe, Room 202), you specify this pattern:

\( | ** | \)
RETYPE [1] 0
RETYPE [2] 0
RETYPE [3] 0

The code example removes the parentheses and the contents of the parenthetical remark. When you retype these tokens to NULL, you remove the parenthetical remark from consideration by any patterns that appear later in the pattern-action file.
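The outcome of retyping a parenthetical remark to NULL can be pictured with a small Python sketch that drops those tokens from a token list. It is an analogy under stated assumptions rather than the product's behavior: drop_parenthetical is a hypothetical helper, NULL tokens are modeled by omitting them, and the handling of nested parentheses is a choice made for this sketch.

```python
def drop_parenthetical(tokens):
    """Return tokens with any parenthetical remark removed.

    Tokens from an opening parenthesis through its matching closing
    parenthesis are omitted, which models retyping them to NULL.
    """
    kept = []
    depth = 0  # how many parentheses are currently open
    for tok in tokens:
        if tok == "(":
            depth += 1
        elif tok == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(tok)
    return kept


tokens = ["JOHN", "SMITH", "(", "SEE", "JOE", ",", "ROOM", "202", ")", "SUITE", "4"]
print(drop_parenthetical(tokens))  # ['JOHN', 'SMITH', 'SUITE', '4']
```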