Specifying the tokenizer

In the PRAGMA section of the pattern-action file, you can use the TOK command to specify the regional setting (locale) for a rule set. The locale determines which tokenizer is used and therefore how the input is split into tokens.

TOK is an optional specification statement. If you do not specify a tokenizer, the tokenizer used is based on the regional setting of the computer on which you run investigation or standardization. You can choose one of the following tokenizers:
Tokenizer     Description
Latin-based   For languages such as English, Spanish, French, and German. Tokens are typically separated by spaces.
CJK           For Chinese, Japanese, and Korean languages. Tokens are typically not separated.
The syntax for TOK is as follows:

TOK locale
Here, locale is an identifier that follows the International Components for Unicode (ICU) locale standard, such as en_US.
During standardization, the CJK tokenizer is used if the TOK command is followed by a locale variable that begins with one of the following codes:
  • ja
  • zh
  • ko
  • vi
If the locale variable is any other value, the Latin-based tokenizer is used.
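
The selection rule above can be sketched in a few lines of Python. This is only an illustration of the prefix check as described here, not the product's implementation; the function name select_tokenizer and the constant CJK_PREFIXES are hypothetical, and the sketch assumes the comparison is a plain, case-sensitive "begins with" test.

```python
# Hypothetical illustration of the documented rule; not product code.
# Language codes that select the CJK tokenizer, as listed above.
CJK_PREFIXES = ("ja", "zh", "ko", "vi")


def select_tokenizer(locale: str) -> str:
    """Return the tokenizer name implied by the locale given to TOK.

    Assumes a simple case-sensitive prefix match on the locale string.
    """
    if locale.startswith(CJK_PREFIXES):
        return "CJK"
    # Any other locale value falls back to the Latin-based tokenizer.
    return "Latin-based"


if __name__ == "__main__":
    for loc in ("en_US", "zh_CN", "ko_KR", "fr_FR"):
        print(loc, "->", select_tokenizer(loc))
```

Under these assumptions, en_US and fr_FR map to the Latin-based tokenizer, and zh_CN and ko_KR map to the CJK tokenizer.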

For example, if you specify TOK en_US, the tokenizer includes Latin-based language considerations in the tokenization approach. If you specify TOK ja_JP, the tokenizer includes locale-specific (CJK) considerations in the tokenization approach.