Parsing elements (PRAGMA)
The standardization process begins by identifying tokens within the incoming data. A token can be a single character, a word, or multiple words that are not separated by spaces.
The parsing parameters of the table in the pattern-action file define the tokens. For example, for Latin-based languages, 123-456 has three tokens: 123, the hyphen (-), and 456. A hyphen separates words and is considered to be a token in itself.
Spaces separate tokens and are also stripped from the input, so they never appear as tokens themselves. For example, 123 MAIN ST consists of three tokens: 123, MAIN, and ST.
Using SEPLIST and STRIPLIST
SEPLIST and STRIPLIST are specification statements that are placed between the \PRAGMA_START and \PRAGMA_END lines in a pattern-action file.
- SEPLIST. Uses any character in the list to separate tokens.
- STRIPLIST. Removes any character in the list.
Any character that is in both lists separates tokens but does not appear as a token itself. The best example is spaces. One or more spaces are stripped but the space indicates where one word ends and another begins. Include the space character in both SEPLIST and STRIPLIST.
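The four possible treatments of a character can be summarized in a short Python sketch. This is only an illustration of the rules described above, not the product's implementation; the function name `char_role` and the returned wording are hypothetical.

```python
def char_role(ch: str, seplist: str, striplist: str) -> str:
    """Describe how one character is treated, given the two lists.

    Hypothetical helper: it only restates the documented rules.
    """
    in_sep = ch in seplist
    in_strip = ch in striplist
    if in_sep and in_strip:
        # In both lists: marks a token boundary but is not kept.
        return "separates tokens, not kept"
    if in_sep:
        # SEPLIST only: marks a boundary and becomes a token itself.
        return "separates tokens, kept as its own token"
    if in_strip:
        # STRIPLIST only: removed from the input.
        return "removed without separating"
    return "ordinary character"


seplist = " ,"
striplist = " -"
for ch in [" ", ",", "-", "A"]:
    print(repr(ch), "->", char_role(ch, seplist, striplist))
```

With these lists, the space falls in the first category, the comma in the second, the hyphen in the third, and the letter A in the last.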
If you want to include SEPLIST and STRIPLIST, put them as the first set of statements in the .pat file, preceded by a \PRAGMA_START line and followed by a \PRAGMA_END line. For example:
\PRAGMA_START
SEPLIST " ,"
STRIPLIST " -"
\PRAGMA_END
Enclose the characters in the list in quotation marks.
Applying parsing rules to a list
The special token class (~) represents special characters that are not included in the SEPLIST and STRIPLIST. These characters (!, \, @, ~, %) require special handling.
When adding special characters, consider the following rules:
- Do not use the quotation mark in the SEPLIST or STRIPLIST unless you precede it with the backslash (\) escape character.
- The backslash (\) is the escape character that you use in a pattern but it must itself be escaped (\\).
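The two escaping rules can be applied mechanically when a list statement is generated by a script. The following Python sketch assumes the statement form `KEYWORD "characters"` used in the \PRAGMA_START example; the helper name `format_list_statement` is hypothetical and is not part of the product.

```python
def format_list_statement(keyword: str, chars: str) -> str:
    """Build a SEPLIST or STRIPLIST statement with escaping applied.

    Hypothetical helper. Backslashes are doubled first, so that the
    backslash added in front of a quotation mark is not doubled again.
    """
    escaped = chars.replace("\\", "\\\\").replace('"', '\\"')
    return f'{keyword} "{escaped}"'


# A plain list needs no escaping.
print(format_list_statement("SEPLIST", " ,"))

# A list containing a backslash and a quotation mark.
print(format_list_statement("STRIPLIST", ' -\\"'))
```

The order of the two replacements matters: escaping the quotation mark first would introduce a backslash that the second step would then double.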
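In this example, the hyphen is only in the STRIPLIST. Hyphens are removed without separating tokens, so STRATFORD-ON-AVON in the incoming data becomes the single token STRATFORDONAVON.
SEPLIST: " !?%$,.;:()/#&"
STRIPLIST: " !?*@$,.;:-\\''"
In this example, the hyphen is in both lists. Because the SEPLIST is applied before the STRIPLIST, STRATFORD-ON-AVON in the incoming data is parsed into three tokens: STRATFORD, ON, and AVON.
SEPLIST: " !?%$,.;:-()/#&"
STRIPLIST: " !?*@$,.;:-\\''"
In this example, the hyphen is only in the SEPLIST. The hyphen separates tokens and is itself a token, so STRATFORD-ON-AVON is parsed into five tokens: STRATFORD, -, ON, -, and AVON.
SEPLIST: " !?%$,.;:()-/#&"
STRIPLIST: " !?*@$.;:\\''"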
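The effect of placing the hyphen in one list, the other, or both can be reproduced with a small Python model of the tokenizer. This is a minimal sketch that assumes the semantics described in this section (separators end the current token, separators that are not stripped become tokens, stripped characters are dropped); the function `tokenize` is hypothetical and does not reflect how the product is implemented.

```python
from typing import List


def tokenize(text: str, seplist: str, striplist: str) -> List[str]:
    """Split text into tokens using SEPLIST and STRIPLIST style rules."""
    tokens: List[str] = []
    current = ""
    for ch in text:
        if ch in seplist:
            # A separator always ends the token being built.
            if current:
                tokens.append(current)
                current = ""
            # It is kept as a token only if it is not also stripped.
            if ch not in striplist:
                tokens.append(ch)
        elif ch in striplist:
            # Stripped but not a separator: drop it, keep building.
            continue
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


name = "STRATFORD-ON-AVON"

# Hyphen only in the STRIPLIST.
print(tokenize(name, " !?%$,.;:()/#&", " !?*@$,.;:-\\''"))
# ['STRATFORDONAVON']

# Hyphen in both lists.
print(tokenize(name, " !?%$,.;:-()/#&", " !?*@$,.;:-\\''"))
# ['STRATFORD', 'ON', 'AVON']

# Hyphen only in the SEPLIST.
print(tokenize(name, " !?%$,.;:()-/#&", " !?*@$.;:\\''"))
# ['STRATFORD', '-', 'ON', '-', 'AVON']
```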
Each rule set has its own lists. If no list is coded for a rule set, the following default lists are used:
SEPLIST: " !?%$,.;:()-/#&"
STRIPLIST: " !?*@$,.;:\\''"
When overriding the default SEPLIST and STRIPLIST, do not cause collisions with the predefined class meanings because the class of a special character changes if it is included in the SEPLIST.
If a special character is included in the SEPLIST and not in the STRIPLIST, the token class for that character becomes the character itself.
For example, ^ is the numeric class specifier. If you add this character to SEPLIST and not to STRIPLIST, any token consisting of ^ is given the class ^. This token would then match the numeric class (^) in a pattern-action file, which is the kind of collision to avoid.
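A custom pair of lists can be checked for this kind of collision before it is placed in a rule set. The Python sketch below is an illustration only: the function name `find_class_collisions` is hypothetical, and the set of class specifiers is supplied by the caller. Only ^ (numeric) is taken from this section; add the other specifiers that your rule set uses.

```python
from typing import Iterable, List


def find_class_collisions(
    seplist: str, striplist: str, class_specifiers: Iterable[str]
) -> List[str]:
    """Return characters that would become their own token class.

    A character collides when it is a class specifier, appears in the
    SEPLIST, and is not removed by the STRIPLIST.
    """
    specifiers = set(class_specifiers)
    return [
        ch for ch in seplist
        if ch in specifiers and ch not in striplist
    ]


# Only the numeric specifier from this section is listed here.
specifiers = {"^"}

print(find_class_collisions(" ,^", " ", specifiers))   # ['^']
print(find_class_collisions(" ,^", " ^", specifiers))  # []
print(find_class_collisions(" ,", " ", specifiers))    # []
```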
Specifying the tokenizer
In the PRAGMA section of the pattern-action file, you can use the TOK command to specify the regional setting that you want to use in a rule set, and thereby indicate the way you want tokens handled.
| Tokenizer | Description |
|---|---|
| Latin-based | For languages such as English, Spanish, French, and German. Tokens are typically separated by spaces. |
| CJK | For Chinese, Japanese, and Korean languages. Tokens are typically not separated. |
TOK locale
Locale is a locale identifier that follows the International Components for Unicode (ICU) locale standard. Language codes include:
- ja
- zh
- ko
- vi
For example, if you specify TOK en_US, the tokenizer includes Latin-based language considerations in the tokenization approach. If you specify TOK ja_JP, the tokenizer includes locale-specific (CJK) considerations in the tokenization approach.
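The relationship between a locale and the two tokenizer families in the table can be sketched in Python. This is an assumption-laden illustration, not the product's selection logic: the function `tokenizer_family` is hypothetical, and the mapping treats only the Chinese, Japanese, and Korean language codes named in the table as CJK, with every other language falling back to Latin-based.

```python
# Assumption: language codes for the CJK row of the table.
CJK_LANGUAGES = {"zh", "ja", "ko"}


def tokenizer_family(locale: str) -> str:
    """Guess the tokenizer family from an ICU-style locale such as en_US."""
    # The language code is the part before the first separator.
    language = locale.replace("-", "_").split("_")[0].lower()
    return "CJK" if language in CJK_LANGUAGES else "Latin-based"


print(tokenizer_family("en_US"))  # Latin-based
print(tokenizer_family("ja_JP"))  # CJK
print(tokenizer_family("zh"))     # CJK
```

In ICU locale identifiers the language code for Japanese is ja, so the Japanese locale for Japan is written ja_JP.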