In the PRAGMA section of the pattern-action file, use the TOK command to specify the regional setting (locale) for a rule set. The locale indicates how you want tokens handled, according to one of the following tokenization approaches:

| Tokenizer | Description |
|---|---|
| Latin-based | For languages such as English, Spanish, French, and German. Tokens are typically separated by spaces. |
| CJK | For Chinese, Japanese, and Korean. Tokens are typically not separated by spaces. |

TOK locale

The locale value follows the International Components for Unicode (ICU) locale standard. For example, if you specify TOK en_US, the tokenizer includes Latin-based language considerations in the tokenization approach. If you specify TOK ja_JP, the tokenizer includes locale-specific (CJK) considerations in the tokenization approach.
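
To make the distinction between the two approaches concrete, the following Python sketch maps a locale value to a tokenizer family and splits text accordingly. It is an illustration only and is not how the product implements tokenization: the names `tokenizer_family` and `naive_tokens` are hypothetical, treating every non-CJK language as Latin-based is a simplification, and splitting CJK text into single characters is an assumption that ignores real word segmentation.

```python
# Hypothetical sketch, not product code: shows how a locale value such as
# en_US or ja_JP could select one of the two tokenization approaches.

# ISO 639-1 language codes for Chinese, Japanese, and Korean.
CJK_LANGUAGES = {"zh", "ja", "ko"}


def tokenizer_family(locale: str) -> str:
    """Return 'CJK' or 'Latin-based' for a locale value such as 'en_US'."""
    # The language code is the first part of the locale value.
    # Accept either '_' or '-' as the separator.
    language = locale.replace("-", "_").split("_")[0].lower()
    # Simplification: anything that is not CJK is treated as Latin-based.
    return "CJK" if language in CJK_LANGUAGES else "Latin-based"


def naive_tokens(text: str, locale: str) -> list:
    """Split text with a deliberately simple rule for each family."""
    if tokenizer_family(locale) == "CJK":
        # Assumption: one token per non-space character. A real CJK
        # tokenizer would apply locale-specific segmentation rules.
        return [ch for ch in text if not ch.isspace()]
    # Latin-based text: tokens are separated by whitespace.
    return text.split()


if __name__ == "__main__":
    print(tokenizer_family("en_US"), naive_tokens("123 Main Street", "en_US"))
    print(tokenizer_family("ja_JP"), naive_tokens("東京都", "ja_JP"))
```

The sketch reads only the language part of the locale value; the region part (US, JP) is ignored here, although an actual tokenizer might take it into account.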