Parsing elements (PRAGMA)

The standardization process begins by identifying tokens within the incoming data. A token can be a single character, a word, or multiple words that are not separated by spaces.

The parsing parameters of the table in the pattern-action file define the tokens. For example, for Latin-based languages, 123-456 has three tokens: 123, the hyphen (-), and 456. A hyphen separates words and is considered to be a token in itself.

Spaces separate tokens and are then stripped from the input. For example, 123 MAIN ST consists of three tokens: 123, MAIN, and ST.

Using SEPLIST and STRIPLIST

SEPLIST and STRIPLIST are specification statements that are placed between the \PRAGMA_START and \PRAGMA_END lines in a pattern-action file.

You can override the default assumptions by specifying one or both of the following statements:

SEPLIST. Uses any character in the list to separate tokens.
STRIPLIST. Removes any character in the list.

Any character that is in both lists separates tokens but does not appear as a token itself. The clearest example is the space character: one or more spaces are stripped, but they still mark where one word ends and the next begins. For this reason, include the space character in both SEPLIST and STRIPLIST.

If you include SEPLIST and STRIPLIST, make them the first set of statements in the .pat file, preceded by a \PRAGMA_START line and followed by a \PRAGMA_END line. For example:


\PRAGMA_START
SEPLIST " ,"
STRIPLIST " -"
\PRAGMA_END

Enclose the characters in the list in quotation marks.

Applying parsing rules to a list

The special token class (~) represents special characters that are not included in the SEPLIST and STRIPLIST. These characters (!, \, @, ~, %) require special handling.

When adding special characters, consider the following rules:

Do not use the quotation mark in the SEPLIST or STRIPLIST unless you precede it with the backslash (\) escape character.
The backslash (\) is the escape character that you use in a pattern but it must itself be escaped (\\).
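
The two escaping rules can be illustrated with a short Python sketch. It is not the product's parser: the function name unescape_list is hypothetical, and the sketch assumes that a backslash simply makes the next character literal.

```python
def unescape_list(body: str) -> str:
    """Return the literal characters named by the text between the quotation marks.

    Assumption: a backslash makes the following character literal,
    so \\" yields a quotation mark and two backslashes yield one.
    """
    chars = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            # Escaped character: keep the character after the backslash.
            chars.append(body[i + 1])
            i += 2
        else:
            chars.append(body[i])
            i += 1
    return "".join(chars)


# Raw strings keep the backslashes exactly as they would be typed in the file.
print(unescape_list(r" !?*@$,.;:-\\''"))  # the doubled backslash becomes one
print(unescape_list(r' \",'))             # the escaped quotation mark is kept
```

Read this way, the \\ that appears in the STRIPLIST examples in this section stands for a single backslash character.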

In the following example, the space is in both lists and the hyphen is in the STRIPLIST but not the SEPLIST. Hyphens are stripped without separating tokens, so STRATFORD-ON-AVON is treated as the single token STRATFORDONAVON.


SEPLIST: " !?%$,.;:()/#&"
STRIPLIST: " !?*@$,.;:-\\''"

In the following example, the hyphen is in both lists. Because the SEPLIST is applied before the STRIPLIST, STRATFORD-ON-AVON in the incoming data is parsed into three tokens: STRATFORD, ON, and AVON.


SEPLIST: " !?%$,.;:-()/#&"
STRIPLIST: " !?*@$,.;:-\\''"

In the following example, the comma is in the SEPLIST but not the STRIPLIST, so it separates tokens and is kept as a token itself. This allows the city name and state to be found (SALT LAKE CITY, UTAH). Any other special characters are classified as a special type.


SEPLIST: " !?%$,.;:()-/#&"
STRIPLIST: " !?*@$.;:\\''"
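
The three examples above can be reproduced with a minimal Python sketch of the separate-then-strip behavior. The tokenize function is hypothetical and written only for illustration. It assumes that a character in the SEPLIST always ends the current token, that it becomes a token itself only when it is absent from the STRIPLIST, and that a character found only in the STRIPLIST is dropped without ending the token.

```python
def tokenize(text, seplist, striplist):
    """Split text into tokens using SEPLIST/STRIPLIST-style character lists."""
    tokens = []
    current = []
    for ch in text:
        if ch in seplist:
            # A separator always ends the token being built.
            if current:
                tokens.append("".join(current))
                current = []
            # A separator that is not stripped is kept as its own token.
            if ch not in striplist:
                tokens.append(ch)
        elif ch in striplist:
            # Stripped but not a separator: drop it and keep building the token.
            continue
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# In these Python literals, "\\" is one backslash, matching the escaped lists above.
# Example 1: hyphen stripped but not a separator.
print(tokenize("STRATFORD-ON-AVON", " !?%$,.;:()/#&", " !?*@$,.;:-\\''"))
# Example 2: hyphen in both lists.
print(tokenize("STRATFORD-ON-AVON", " !?%$,.;:-()/#&", " !?*@$,.;:-\\''"))
# Example 3: comma separates and is kept as a token.
print(tokenize("SALT LAKE CITY, UTAH", " !?%$,.;:()-/#&", " !?*@$.;:\\''"))
```

The sketch ignores token classes and the TOK setting. It only shows how membership in one list, both lists, or neither list changes the token boundaries.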

Each rule set has its own lists. If a rule set does not specify a list, the following default lists are used:


SEPLIST: " !?%$,.;:()-/#&"
STRIPLIST: " !?*@$,.;:\\''"

When overriding the default SEPLIST and STRIPLIST, do not cause collisions with the predefined class meanings because the class of a special character changes if it is included in the SEPLIST.

If a special character is included in the SEPLIST and not in the STRIPLIST, the token class for that character becomes the character itself.

For example, ^ is the numeric class specifier. If you add this character to SEPLIST and not to STRIPLIST, any token consisting of ^ is given the class ^. That token would then match the numeric class (^) in a pattern-action file, even though the token is not a number.
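
A quick way to reason about such collisions is to list the characters that are in the SEPLIST but not in the STRIPLIST, because those are the characters that become their own class. The Python sketch below does this. Its function names are hypothetical, and RESERVED holds only the two class symbols named in this section (^ and ~), which is an assumption and not the product's full set of class symbols.

```python
DEFAULT_SEPLIST = " !?%$,.;:()-/#&"
DEFAULT_STRIPLIST = " !?*@$,.;:\\''"  # "\\" is a single backslash in Python

# Assumption: only the class symbols mentioned in this section are listed.
RESERVED = "^~"


def self_classed(seplist, striplist):
    """Characters that separate tokens and are kept, so each is its own class."""
    return [ch for ch in seplist if ch not in striplist]


def class_collisions(seplist, striplist, reserved=RESERVED):
    """Self-classed characters that are also used as class symbols."""
    return [ch for ch in self_classed(seplist, striplist) if ch in reserved]


print(self_classed(DEFAULT_SEPLIST, DEFAULT_STRIPLIST))
print(class_collisions(DEFAULT_SEPLIST + "^", DEFAULT_STRIPLIST))
```

An empty result from class_collisions means that, for the symbols listed in RESERVED, the overriding lists do not change any class meaning.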

Specifying the tokenizer

In the PRAGMA section of the pattern-action file, you can use the TOK command to specify the regional setting that you want to use in a rule set, and thereby indicate the way you want tokens handled.

TOK is an optional specification statement. If you do not specify a tokenizer, the tokenizer used is based on the regional setting of the computer on which you run investigation or standardization. You can choose one of the following tokenizers:

Tokenizer	Description
Latin-based	For languages such as English, Spanish, French, and German. Tokens are typically separated by spaces.
CJK	For Chinese, Japanese and Korean languages. Tokens are typically not separated.

The syntax for TOK is as follows:


TOK locale

The locale value is a locale identifier in the International Components for Unicode (ICU) format.

During standardization, the CJK tokenizer is used if the TOK command is followed by a locale variable that begins with one of the following codes:

If the locale variable is any other value, the Latin-based tokenizer is used.

For example, if you specify TOK en_US, the tokenizer includes Latin-based language considerations in the tokenization approach. If you specify TOK jp_JP, the tokenizer includes locale-specific (CJK) considerations in the tokenization approach.