Specifying the tokenizer
In the PRAGMA section of the pattern-action file, you can use the TOK command to specify the regional setting that you want to use in a rule set, and thereby indicate the way you want tokens handled.
TOK is an optional specification statement. If you do not specify a tokenizer, the tokenizer used is based on the regional setting of the computer on which you run investigation or standardization. You can choose one of the following tokenizers:
Tokenizer | Description |
---|---|
Latin-based | For languages such as English, Spanish, French, and German. Tokens are typically separated by spaces. |
CJK | For Chinese, Japanese and Korean languages. Tokens are typically not separated. |
The syntax for TOK is as follows:
TOK locale
Here, locale is a locale identifier in the International Components for Unicode (ICU) format, such as en_US. During standardization, the CJK tokenizer is used if the TOK command is followed by a locale value that begins with one of the following codes:
- ja
- zh
- ko
- vi
For example, if you specify TOK en_US, the tokenizer includes Latin-based language considerations in the tokenization approach. If you specify TOK ja_JP, the tokenizer includes locale-specific (CJK) considerations in the tokenization approach.
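
The prefix rule described above can be sketched in a few lines of Python. This is an illustration only, not product code: the function names and the CJK_PREFIXES constant are hypothetical, and the sketch assumes that "begins with" means a plain, case-insensitive string prefix match on the locale value. It does not model the fallback to the computer's regional setting when TOK is omitted.

```python
# Illustrative sketch of the locale prefix rule; all names here are hypothetical.
CJK_PREFIXES = ("ja", "zh", "ko", "vi")


def tokenizer_for_locale(locale: str) -> str:
    """Return the tokenizer family that the prefix rule would select."""
    # ICU locale identifiers start with a language code, e.g. "ja_JP" or "en_US".
    normalized = locale.strip().lower()
    if normalized.startswith(CJK_PREFIXES):
        return "CJK"
    return "Latin-based"


def tokenizer_for_tok_statement(statement: str) -> str:
    """Parse a 'TOK locale' statement and apply the prefix rule to its locale."""
    parts = statement.split()
    if len(parts) != 2 or parts[0] != "TOK":
        raise ValueError(f"expected 'TOK locale', got: {statement!r}")
    return tokenizer_for_locale(parts[1])
```

Under these assumptions, tokenizer_for_tok_statement("TOK zh_CN") selects the CJK tokenizer, and tokenizer_for_tok_statement("TOK en_US") selects the Latin-based tokenizer.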