Identifying simple pattern classes

Simple pattern classes identify the type of each token in the input data so that a pattern can match the data and trigger the associated pattern actions.

Simple pattern classes are represented by single characters.

Within patterns, you must use the backslash (\) escape character to prevent the syntax of the pattern tables from interfering with certain single character classes. Use the backslash (\) escape character with the following single character classes: the hyphen (-), slash (/), number sign (#), left and right parentheses () and ampersand (&).

Take care when specifying SEPLIST and STRIPLIST entries. For example, to recognize the ampersand as a single token, include it in the SEPLIST but not in the STRIPLIST. If the backslash is in the SEPLIST, its class is \ (backslash). To use a backslash in a pattern, escape it with a second backslash (\\). See also Applying parsing rules to a list.

The NULL class (0) is not included in this list of single character classes. The NULL class is used in the classifications (.CLS) or in the RETYPE action to make a token NULL. Because a NULL class never matches to anything, it is never used in a pattern.

The simple pattern classes are as follows:
Table 1. List and description of simple pattern classes
Class Description
A - Z User-supplied class from the classifications

The classes A - Z correspond to classes that you code in the classifications. For example, if APARTMENT is given the class of U in the classifications, then APARTMENT matches a simple pattern of U.

^ Numeric

The class ^ (caret) represents a single number, for example, the number 123. However, the number 1,230 uses three tokens: the number 1, a comma, and the number 230.

? One or more consecutive words that are not in classifications.

The class ? (question mark) represents one or more consecutive alphabetic words. For example, MAIN, CHERRY HILL, and SATSUMA PLUM TREE HILL each match to a single ? class provided none of these words are in the classifications for the rule set. Class ? is useful for street names when multi-word and single-word street names must be treated identically.

+ A single alphabetic word that is not in classifications
The class + (plus sign) is useful for separating the parts of an unknown string. For example, in a name like OWAIN LIAM JONES, copy the individual words to columns with given name, middle name, and family name as follows:

+ | + | +
COPY [1] {GivenName}
COPY [2] {MiddleName}
COPY [3] {FamilyName}
& A single token of any type
The class & (ampersand) represents a single token of any class. For example, a pattern to match a single token following an apartment type is:

U | &

SUITE 11 is recognized by this pattern. However, in a case such as APT 1ST FLOOR, only APT 1ST is recognized by this pattern.

\&
Type the backslash (\) escape character before the ampersand to use the ampersand as a literal.

> | \& | ? | T
1ST & MAIN ST is recognized by this pattern. The token 1ST has leading numbers, so it matches the > class.
> Leading numeric
The class > (greater than symbol) represents a token that begins with numbers and ends with letters. For example, an address like 123A MAPLE AVE can be matched as follows:

> | ? | T

123A is recognized by this pattern. The token contains numbers and alphabetic characters but the numbers are leading. In this example, T represents street type.

< Leading alphabetic character
The class < (less than symbol) represents a token that begins with letters and ends with numbers. It is useful for tokens such as the following:
  • A123
  • ALPHA77

The token contains alphabetic characters and numbers but the alphabetic characters are leading.

@ Complex mix
The class @ (at sign) represents tokens that have a complex mixture of alphabetic characters and numerics, for example, A123B and 345BCD789. Area information like Hamilton ON L8N 2P1 can be matched as follows:

+ | P | @ | @ 

In this example, P represents Province. The first @ represents L8N and the second @ represents 2P1.

~ Special punctuation

The class ~ (tilde) represents special characters that are not in the SEPLIST. For example, if a SEPLIST does not contain the dollar sign and percent sign, then you might use the following pattern:


~ | +  

In this example, $ HELLO and % OFF match the pattern.

k One or more Chinese numeric characters
/ Literal
The class / (slash) is useful for fractional addresses like 123 1/2 MAPLE AVE, which matches the following pattern:

^ | ^ | / | ^ | ? | T
\/ Backslash, forward slash

You can use the backslash (\) escape character with the slash in the same manner that you use the / (slash) class.

- Literal
The class - (hyphen) is often used for address ranges. For example, an address range like 123-127 matches the following pattern:

^ | - | ^
\-

You can use the backslash (\) escape character with the hyphen in the same manner you use the - (hyphen) class.

\# Literal. You must use it with the backslash (\) escape character, for example: \#.
The class # (number sign) is often used as a unit prefix. For example, an address like suite #12 or unit #9A matches the following pattern:

U | \# | &
()

Literal

The classes ( and ) (parentheses) are used to enclose operands or user variables in a pattern syntax. An example of a pattern syntax that includes a leading numeric operator and a trailing character operator is as follows:

> | ? | T
COPY [1](n) {HouseNumber}
COPY [1](-c) {HouseNumberSuffix}
COPY [2] {StreetName}
COPY_A [3] {StreetSuffixType}
EXIT

The pattern syntax example can recognize the address 123A MAPLE AVE. The numbers 123 are recognized as the house number and the letter A is recognized as a house number suffix.

\( and \)

Use the backslash (\) escape character with the opening parenthesis or closing parenthesis to filter out parenthetical remarks. To remove a parenthetical remark such as (see Joe, Room 202), you specify this pattern:


\( | ** | \)
RETYPE [1] 0
RETYPE [2] 0
RETYPE [3] 0

The code example removes the parentheses and the contents of the parenthetical remark. In addition, when you retype these fields to NULL you essentially remove the parenthetical statement from consideration by any patterns that are further down in the pattern-action file.
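The shape-based classes in the table (^, +, >, < and @) can be illustrated with a small Python sketch. This is not how the product classifies tokens. The function name classify_token and the regular expressions are assumptions made for illustration, and the sketch ignores the classifications, the SEPLIST, and the STRIPLIST entirely.

```python
import re
from typing import Optional


def classify_token(token: str) -> Optional[str]:
    """Return a simple pattern class symbol for one token, judged by shape only.

    Hypothetical helper: it does not consult any classifications, so a word
    such as APARTMENT is reported as + rather than a user-supplied class.
    """
    if re.fullmatch(r"[0-9]+", token):
        return "^"  # numeric, for example 123
    if re.fullmatch(r"[A-Za-z]+", token):
        return "+"  # single alphabetic word, for example MAPLE
    if re.fullmatch(r"[0-9]+[A-Za-z]+", token):
        return ">"  # leading numeric, for example 123A
    if re.fullmatch(r"[A-Za-z]+[0-9]+", token):
        return "<"  # leading alphabetic, for example A123
    if re.fullmatch(r"[0-9A-Za-z]+", token):
        return "@"  # complex mix, for example L8N
    return None  # punctuation and literals are outside this sketch


if __name__ == "__main__":
    for word in ["123", "MAPLE", "123A", "ALPHA77", "L8N", "2P1"]:
        print(word, classify_token(word))
```

The order of the checks matters: the complex mix test comes last so that it only catches alphanumeric tokens that are neither purely numeric, purely alphabetic, leading numeric, nor leading alphabetic.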

Applying subfield classes (1 to 9, -1 to -9)

The subfield classes 1 to 9 and -1 to -9 are used to parse individual words of a ? string.

The number 1 represents the first word, the number 2 represents the second word, the number -1 represents the last word, and the number -2 represents the next to last word. If the referenced word does not exist, the pattern does not match. For example, if you are processing company names and want only the first word, a company name like WILLIAMS BIG SUPER CELL COMPANY matches the following patterns (assume that COMPANY is in the classifications (.CLS) as type C).

Pattern            Description
+ | + | + | + | C  WILLIAMS is operand [1], BIG is operand [2], SUPER is operand [3], CELL is operand [4], and COMPANY is operand [5]
? | C              WILLIAMS BIG SUPER CELL is operand [1], COMPANY is operand [2]
1 | C              WILLIAMS is operand [1], COMPANY is operand [2]
2 | C              BIG is operand [1], COMPANY is operand [2]
-1 | C             CELL is operand [1], COMPANY is operand [2]
-2 | C             SUPER is operand [1], COMPANY is operand [2]

You can combine single alphabetic classes (+) with subfield classes. For example, for a series of consecutive unknown tokens like CHERRY HILL SANDS, you can use the following pattern:


+ | -1

The + matches to the word CHERRY and the –1 matches to SANDS. The operand [1] is CHERRY and operand [2] is SANDS.
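The way a subfield class picks one word out of a ? string can be sketched in Python. The function name subfield is hypothetical, and the sketch only models word selection, not the pattern matcher itself. A return value of None stands for the case where the referenced word does not exist and the pattern does not match.

```python
from typing import List, Optional


def subfield(words: List[str], n: int) -> Optional[str]:
    """Return the word selected by subfield class n, or None if it does not exist.

    Positive n counts from the first word (1 is the first word).
    Negative n counts from the last word (-1 is the last word).
    """
    if n == 0 or abs(n) > len(words):
        return None  # no such word, so the pattern would not match
    return words[n - 1] if n > 0 else words[n]


if __name__ == "__main__":
    unknown_string = "WILLIAMS BIG SUPER CELL".split()
    for n in (1, 2, -1, -2, 5):
        print(n, subfield(unknown_string, n))
```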

Specifying subfield ranges

When matching to a pattern, you can specify a range of words.

The format is as follows:


(beg:end)

Examples are:

Specification  Description
(1:3)          Specifies a range of words 1 to 3
(-3:-1)        Specifies a range from the third word from the last to the last word
(1:-1)         Specifies a range from the first word to the last word (note that using ? instead makes the action more efficient)

If you have the address 123 - A B Main St, you can use the following pattern:


^ | - | (1:2)
COPY [3] {HouseNumberSuffix}
RETYPE [2] 0
RETYPE [3] 0

This pattern copies A and B to the {HouseNumberSuffix} (house number suffix) field. It also retypes A and B, and the hyphen, to NULL tokens to remove them from consideration by any further patterns in the file.
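A (beg:end) range can be modeled as a slice over the words of an unknown string. The following Python sketch uses a hypothetical function, subfield_range. Treating a missing word or a reversed range as a non-match (None) is an assumption carried over from the behavior described for single subfield classes.

```python
from typing import List, Optional


def subfield_range(words: List[str], beg: int, end: int) -> Optional[List[str]]:
    """Return the words selected by (beg:end), or None if the range does not exist."""

    def to_index(position: int) -> Optional[int]:
        # Convert a 1-based or negative word position to a 0-based index.
        if position == 0 or abs(position) > len(words):
            return None
        return position - 1 if position > 0 else len(words) + position

    start, stop = to_index(beg), to_index(end)
    if start is None or stop is None or start > stop:
        return None  # assumed: the pattern does not match
    return words[start:stop + 1]


if __name__ == "__main__":
    unknown_string = ["A", "B", "MAIN"]
    print(subfield_range(unknown_string, 1, 2))    # words 1 to 2
    print(subfield_range(unknown_string, -3, -1))  # third from last to last
    print(subfield_range(["ELM"], 1, 2))           # second word is missing
```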

Applying the universal class

You can combine the universal (**) class with other operands to restrict the tokens grabbed by the class. The universal class can be null, signifying no tokens.

The class ** matches all tokens. For example, if you use a pattern of **, you match 123 MAIN ST and 123 MAIN ST, LOS ANGELES, CA 90016, and so on. The following pattern matches to all tokens before the type (which can be no tokens) and the type:


** | T

Thus, 123 N MAPLE AVE matches with operand [1] being 123 N MAPLE and operand [2] being AVE.

The universal class can be null. No tokens are required to precede the type. AVENUE also matches this pattern with operand [1] being NULL.
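The behavior of ** | T can be sketched in Python as a split at a street type token. The function name match_universal_then_type and the pre-assigned class list are hypothetical. Stopping at the first T token is an assumption, based on the statement in this section that ** does not contain additional street type tokens.

```python
from typing import List, Optional, Tuple


def match_universal_then_type(
    tokens: List[str], classes: List[str], end_class: str = "T"
) -> Optional[Tuple[List[str], str]]:
    """Model the pattern ** | T.

    Returns (operand 1 tokens, operand 2 token), or None if no token has the
    end class. Operand 1 may be an empty list, which models a null ** class.
    """
    for index, token_class in enumerate(classes):
        if token_class == end_class:
            return tokens[:index], tokens[index]
    return None


if __name__ == "__main__":
    # 123 N MAPLE AVE, with assumed classes ^, D, + and T
    print(match_universal_then_type(["123", "N", "MAPLE", "AVE"], ["^", "D", "+", "T"]))
    # AVENUE alone: operand 1 is empty
    print(match_universal_then_type(["AVENUE"], ["T"]))
```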

In the following pattern, the ** refers to all tokens between the numeric and the street type:


^ | ** | T

In the example, the class ^ (caret) and the type (T) class define the start and end of the ** class. The class ** can contain numbers that are in addition to the class ^ but not any additional street type tokens.

You can specify a range of tokens for an ** operand. For example, the following pattern matches a numeric followed by at least two nonstreet-type tokens followed by a street type:


^ | ** (1:2) | T

Operand [2] consists of exactly two nonstreet-type tokens. This pattern matches 123 CHERRY TREE DR, but not 123 ELM DR, because only one token follows the number. You can specify ranges from a single token, such as (1:1), to all the tokens, such as (1:-1).

The pattern **(1:1) results in much slower processing time than the equivalent & to match to any single token. However, you do not want to use & in a pattern with **, such as ** | &, because the first token encountered is used by &. Value checks or appropriate conditions that are applied by using & with ** can make sense. For example:


** | & = "123", "ABC"

No conditional values or expressions are permitted for operands with **.

Using the end of field specifier ($)

The $ specifier does not match any real token, but denotes the end of the field.

A pattern condition without the $ specifier can represent a portion of the field, such as the city, state, and postal code information in a U.S. address. For example, Littleton, MA 01460, and LITTLETON MA 01460-6245 match the following pattern:


? | S | ^

However, the hyphen (-) and the four-digit extension of the ZIP+4 code, 6245, are not part of the match. To include the full postal code, 01460-6245, in the match condition, use the following pattern:


? | S | ^ | - | ^ | $ 

Because $ denotes the end of the field, this pattern matches only when no input data follows the postal code.

Using floating positioning specifier

You use positioning specifiers to modify the placement of the pattern matching.

For the patterns documented so far, the pattern had to match the first token in the field. For example, the following pattern matches MAPLE AVE and CHERRY HILL RD, but does not match 123 MAPLE AVE, because a number is the first token:


? | T

You can use floating specifiers to scan the input field for a particular pattern. The asterisk (*) is a positioning specifier that marks the class immediately following it as a floating class. The input is searched for the pattern, from left to right, until there is a match or the entire field has been scanned.

If you have apartment numbers in the address, to simplify your data, you might want to scan for, process, and retype the apartment numbers to NULL so that basic patterns can process the core address. For example, addresses such as 123 MAIN ST APT 34 and 770 KING ST FL 3 RM 101 contain a basic street address with additional information. The following patterns search for the unit and floor information, populate the appropriate dictionary fields, and remove the unit and floor information from further processing. U is the class for unit and F is the class for floor.


*U | ^ 
COPY_A [1] {UnitType}
COPY [2] {UnitValue}
RETYPE [1] 0 
RETYPE [2] 0


*F | ^ 
COPY_A [1] {FloorType}
COPY [2] {FloorValue}
RETYPE [1] 0 
RETYPE [2] 0

Retyping the tokens to NULL removes the tokens from consideration by any patterns later in the pattern-action file, so later patterns do not need to account for every combination of unit and floor information. Now, the data to be processed is 123 MAIN ST and 770 KING ST. Both entries have the pattern: ^ | ? | T.

Processing a portion of the data by using the floating specifier simplifies the two input fields and makes the input pattern the same. The standardization task is made easier.

Floating positioning specifiers operate by scanning the tokens until a match is found. If all operands match at the current token, the pattern matches. If the operands do not match, the scanner advances one token to the right and repeats the process. This is like moving a template across the input string. If the template matches, the process is done. Otherwise, the template advances to the next token.

Note: There can only be one match to a pattern in an input string. After the actions are processed, control goes to the next pattern, even though there might be other matches on the line.
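The template analogy can be sketched in Python. The sketch assumes that every token has already been assigned a single class symbol and that each operand matches exactly one token, so it does not model multi-word classes such as ? or **. The function name float_match is hypothetical.

```python
from typing import List, Optional


def float_match(token_classes: List[str], pattern: List[str]) -> Optional[int]:
    """Slide the pattern across the token classes from left to right.

    Returns the index of the first token where every operand matches,
    or None if the template never fits. Only the first match is reported,
    mirroring the note that a pattern matches once per input string.
    """
    width = len(pattern)
    for start in range(len(token_classes) - width + 1):
        if token_classes[start:start + width] == pattern:
            return start
    return None


if __name__ == "__main__":
    # 123 MAIN ST APT 34, with assumed classes ^, +, T, U and ^
    classes = ["^", "+", "T", "U", "^"]
    print(float_match(classes, ["U", "^"]))  # position of APT
    print(float_match(classes, ["F", "^"]))  # no floor information
```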

The asterisk must be followed by a class. For example, the following operands are valid with a floating positioning specifier followed by a standard class:


* U
* ?
* ^

There can be more than one floating positioning specifier in a pattern. For example, the following operands match to JOHN DOE 123 CHERRY HILL NORTH RD:

*^ | ? | *T

Operand [1] is 123. Operand [2] is CHERRY HILL. Operand [3] is RD. NORTH is classified as a directional (D) so it is not included in the unknown string (?).

Using the reverse floating positioning specifier

The reverse floating positioning specifier, indicated by a number sign (#), is similar to the floating positioning specifier (*) except that scanning proceeds from right to left instead of from left to right.

You can use this specifier to search for items that appear at the end of a field, such as postal code, state, and apartment designations.

The reverse floating positioning specifier can appear only in the first operand of a pattern, because it is used to position the pattern. For example, if you want to find a postal code and you have given the state name the class S, the following pattern scans from right to left for a state name followed by a number:


#S | ^

If you have an input string CALIFORNIA 45 PRODUCTS, PHOENIX ARIZONA 12345 DEPT 45, the right to left scan positions the pattern at ARIZONA. The number that follows completes the match.

If no match is found, scanning continues to the left until a state followed by a number is found. If you use the standard left-to-right floating positioning specifier instead (*S | ^), CALIFORNIA 45 is incorrectly interpreted as a state name and postal code.
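The difference between the two scan directions can be sketched in Python. The sketch assumes that each token already has a single class symbol. The class list for the sample input, including the class given to the comma and to DEPT, is an assumption for illustration. The function name scan is hypothetical.

```python
from typing import List, Optional


def scan(token_classes: List[str], pattern: List[str], reverse: bool = False) -> Optional[int]:
    """Return the start index of the first match found in the chosen direction."""
    last_start = len(token_classes) - len(pattern)
    if reverse:
        starts = range(last_start, -1, -1)  # right to left, like #
    else:
        starts = range(0, last_start + 1)   # left to right, like *
    for start in starts:
        if token_classes[start:start + len(pattern)] == pattern:
            return start
    return None


if __name__ == "__main__":
    tokens = ["CALIFORNIA", "45", "PRODUCTS", ",", "PHOENIX", "ARIZONA", "12345", "DEPT", "45"]
    classes = ["S", "^", "+", ",", "+", "S", "^", "+", "^"]
    pattern = ["S", "^"]
    print(tokens[scan(classes, pattern)])                # left to right
    print(tokens[scan(classes, pattern, reverse=True)])  # right to left
```

With these assumed classes, the left to right scan stops at CALIFORNIA, and the right to left scan stops at ARIZONA.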

Using the fixed position specifier

The fixed position specifier is positioned at a particular operand in the input string.

Sometimes it is necessary to position the pattern matching at a particular operand in the input string. This is handled by the %n fixed position specifier. Examples of the fixed position specifier are:

Fixed Position Specifier  Value
%1                        Matches to the first token
%2                        Matches to the second token
%-1                       Matches to the last token
%-2                       Matches to the second from last token

The positions can be qualified by following the %n with a token type. Some examples are:

Fixed Position Specifier  Value
%2^                       Matches to the second numeric token
%-1^                      Matches to the last numeric token
%3T                       Matches to the third street type token
%2?                       Matches to the second set of one or more unknown alphabetic tokens

You can use the fixed position specifier (%) in only two ways:

  • As the first operand of a pattern
  • As the first and second operands of a pattern

The following pattern is allowed and matches the second numeric token as operand [1] and the third leading alpha token that follows as operand [2]:


%2^ | %3<

The fixed position specifier treats each token according to its class. The following examples illustrate how to use the fixed position specifier for the input field:


John Doe 
123 Martin Luther St 
Salt Lake

Fixed Position Specifier  Description
%1 1                      Matches to the first word in the first string: JOHN
%1 2                      Matches to the second word in the first string: DOE
%2 ?                      Matches to the second string of unknown alphabetic words: MARTIN LUTHER
%2 1                      Matches to the first word in the second string: MARTIN
%-2 -1                    Matches to the last word in the next to the last string: LUTHER
%3+                       Matches to the third single alphabetic word: MARTIN
%-1 ?                     Matches to the last string of unknown alphabetic words: SALT LAKE
%-1+                      Matches to the last single alphabetic word: LAKE

The position specifier does not continue scanning if a pattern fails to match (unlike * and #).

Assuming the input value S is classified as a D for direction, the following pattern matches the 789 S in the string 123 A 456 B 789 S:


%3^ | D

That same pattern does not match 123 A 456 B 789 C 124 S because the third number (789) is not followed by a direction.
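The fixed position behavior for %3^ | D can be sketched in Python. The sketch assumes that each token already has a single class symbol, and that the single letters A, B and C are classed as +. The function name fixed_match is hypothetical. The key point is that only one position is tried: if the operands after it do not match, there is no further scanning.

```python
from typing import List, Optional


def fixed_match(token_classes: List[str], n: int, cls: str, rest: List[str]) -> Optional[int]:
    """Model %n<cls> followed by the operands in rest.

    Finds the nth token of class cls (negative n counts from the end),
    then checks that the tokens right after it have the classes in rest.
    Returns the index of the positioned token, or None if there is no match.
    """
    positions = [i for i, c in enumerate(token_classes) if c == cls]
    if n == 0 or abs(n) > len(positions):
        return None  # there is no nth token of that class
    start = positions[n - 1] if n > 0 else positions[n]
    following = token_classes[start + 1:start + 1 + len(rest)]
    return start if following == rest else None  # no rescanning on failure


if __name__ == "__main__":
    # 123 A 456 B 789 S
    print(fixed_match(["^", "+", "^", "+", "^", "D"], 3, "^", ["D"]))
    # 123 A 456 B 789 C 124 S
    print(fixed_match(["^", "+", "^", "+", "^", "+", "^", "D"], 3, "^", ["D"]))
```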

Negation class qualifier

The exclamation point (!) is used to indicate NOT.

The following pattern language syntax shows how to use the negative to specify matching:

Pattern  Description
!T       Match to any token except a street type
!?       Match to any token except an unknown alphabetic string

The following example matches SUITE 3 and APT GROUND, but not SUITE CIRCLE, because CIRCLE is classified as a street type (T):


*U | !T
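A negated operand can be modeled as a class comparison that is inverted when the operand starts with an exclamation point. The following Python sketch applies that idea to the pattern *U | !T. It assumes one class symbol per token, and the function names operand_matches and float_match are hypothetical. How the product treats a negated operand at the end of the input is not modeled.

```python
from typing import List, Optional


def operand_matches(operand: str, token_class: str) -> bool:
    """Compare one operand with one token class, honoring a leading ! as NOT."""
    if operand.startswith("!"):
        return token_class != operand[1:]
    return token_class == operand


def float_match(token_classes: List[str], pattern: List[str]) -> Optional[int]:
    """Scan left to right for the first position where every operand matches."""
    width = len(pattern)
    for start in range(len(token_classes) - width + 1):
        window = token_classes[start:start + width]
        if all(operand_matches(op, cls) for op, cls in zip(pattern, window)):
            return start
    return None


if __name__ == "__main__":
    pattern = ["U", "!T"]
    print(float_match(["U", "^"], pattern))  # SUITE 3
    print(float_match(["U", "+"], pattern))  # APT GROUND
    print(float_match(["U", "T"], pattern))  # SUITE CIRCLE
```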

The phrase RT 123 can be considered to be a street name only if no unknown word follows it, as one does in RT 123 MAPLE AVE. You can use the following pattern to exclude the cases where an unknown word follows:


*T | ^ | !?

This pattern matches RT 123, but not RT 123 MAPLE AVE, because an unknown alphabetic word follows the numeric operand.

You can combine the negation class with the floating class (*) only at the beginning of a pattern. For example, when processing street addresses, you might want to expand ST to SAINT where appropriate.

For example, change 123 ST CHARLES ST to 123 SAINT CHARLES ST, but do not convert 123 MAIN ST REAR APT to 123 MAIN SAINT REAR APT. You can use the following pattern and action set:


*!? | S | +
RETYPE [2] ? "SAINT"

The previous example requires that no unknown class precede the value ST because tokens with this value have their own class of S.