Word Separator Tables

The single-byte character set (SBCS) EBCDIC code page is taken from the CCSID of the current job. The characters in the word list are mapped from the job's code page to the multinational code page 500, except for Greek and Turkish: Greek is mapped to code page 875, and Turkish is mapped to code page 1026.
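The mapping rule above can be sketched as follows. This is only an illustration, not the system's implementation: the function names and the "greek"/"turkish" language tags are hypothetical, and it assumes that Python's built-in EBCDIC codecs (cp037, cp500, cp875, cp1026) are an acceptable stand-in for the corresponding code pages.

```python
def target_code_page(language):
    """Return the code page the word list is mapped to.

    The language tags are hypothetical; the source only states that
    Greek maps to 875, Turkish to 1026, and everything else to 500.
    """
    if language == "greek":
        return 875
    if language == "turkish":
        return 1026
    return 500


def remap_word(word_bytes, job_code_page, language=""):
    """Re-encode one word from the job's code page to the target code page.

    Assumes Python ships a codec named cpNNN for both code pages.
    Raises UnicodeEncodeError if a character has no code point in the target.
    """
    source_codec = f"cp{job_code_page:03d}"
    target_codec = f"cp{target_code_page(language):03d}"
    return word_bytes.decode(source_codec).encode(target_codec)


if __name__ == "__main__":
    # A word typed in a job that uses code page 37, remapped to code page 500.
    job_bytes = "word!".encode("cp037")
    print(remap_word(job_bytes, 37).hex())
```

Going through Unicode as an intermediate step keeps the sketch short. A table-driven byte-to-byte translation would work equally well.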


Delimiter Categories

Each character in a code page is assigned to one of three delimiter categories (always a delimiter, sometimes a delimiter, or never a delimiter), as shown in the following tables.

Table 1. Always Delimiters

Table 2. Never Delimiters

Table 3. Sometimes Delimiters



Considerations for the Sometimes Delimiters Table

Categories E through K are usually called possible delimiters because they function as delimiters only in certain contexts.

Characters . (period), ! (exclamation point), and ? (question mark) have a special status. When identical characters from this category occur together in a sequence, the individual characters do not act as delimiters; the entire sequence of characters forms a single token. However, the sequence of characters taken together does act as a delimiter because the sequence forms a token separate from characters that precede and follow it. For example, the text streams:
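As an illustration of the run behavior described above (these sample strings are made up and are not the examples from the original text), the following sketch splits a string so that each run of identical period, exclamation point, or question mark characters becomes its own token. The function name is hypothetical. The sketch treats only runs of two or more identical characters and leaves single occurrences untouched, because their handling depends on context that is not described here.

```python
import re

# Two or more of the SAME character from the set . ! ?
RUN = re.compile(r"([.!?])\1+")


def split_on_runs(text):
    """Split text so that each run of identical . ! or ? is a single token.

    The run acts as a delimiter: text before and after it lands in
    separate tokens. Single . ! ? characters are left in place.
    """
    tokens = []
    pos = 0
    for match in RUN.finditer(text):
        if match.start() > pos:
            tokens.append(text[pos:match.start()])  # text preceding the run
        tokens.append(match.group(0))               # the run itself, one token
        pos = match.end()
    if pos < len(text):
        tokens.append(text[pos:])                   # text following the last run
    return tokens


if __name__ == "__main__":
    print(split_on_runs("wait...what"))
```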

A simple token table is a 256-element array of unsigned characters. The simple token category value for a character is found in the element indexed by that character's code point. Each code point (character) must be assigned exactly one category. If, according to the above definition of sets, a character is a member of more than one category, it should be assigned to the highest-level category (that is, the category whose letter name comes latest in alphabetical order).
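A minimal sketch of building such a table, under stated assumptions: the function name is hypothetical, the category value stored in each element is assumed to be the character code of the category letter (the actual encoding is not given here), and the category memberships in the usage example are invented purely for illustration.

```python
def build_simple_token_table(category_sets):
    """Build a 256-element table from {category_letter: set_of_code_points}.

    A code point that appears in several categories receives the category
    whose letter comes latest in the alphabet. Every code point must end
    up with exactly one category, otherwise ValueError is raised.
    """
    table = bytearray(256)
    assigned = [False] * 256
    # Process letters in ascending order so later letters overwrite earlier ones.
    for letter in sorted(category_sets):
        for code_point in category_sets[letter]:
            table[code_point] = ord(letter)  # assumed encoding of the category
            assigned[code_point] = True
    missing = [cp for cp in range(256) if not assigned[cp]]
    if missing:
        raise ValueError(f"{len(missing)} code points have no category")
    return bytes(table)


if __name__ == "__main__":
    # Invented memberships: everything in A, two code points also in E and F.
    sets = {"A": set(range(256)), "E": {0x4B}, "F": {0x4B, 0x4F}}
    table = build_simple_token_table(sets)
    print(chr(table[0x4B]), chr(table[0x4F]), chr(table[0x40]))
```

Looking up a character's category is then a single index operation, `table[code_point]`, which is the reason for using a flat 256-element array.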

The categories assigned to each character in the three code pages are shown in the following tables. To see the characters that correspond to these tables, refer to the topic that shows the code pages in i5/OS globalization.

Table 4. Simple Token Table for Code Page 500

Table 5. Simple Token Table for Code Page 875 Greek Support

Table 6. Simple Token Table for Code Page 1026 Turkish Support



Notes:

  1. (1) PCFILE is a file assigned the document type of PCFILE (Document Interchange Architecture type of 14) by the Client Access program.

