Linguistic processing for Chinese, Japanese, and Korean documents

You can process documents that are in Chinese, Japanese, or Korean by using dictionary-based segmentation or by using n-gram segmentation.

For a search engine, getting good search results depends in large part on the techniques that are used to process text. After the text is extracted from the document, the first step in text processing is to identify the individual words in the text; this step is referred to as segmentation. For many languages, white space (blanks, the end of a line, and certain punctuation) can be used to recognize word boundaries. However, Chinese, Japanese, and Korean do not use white space between characters to separate words, so other techniques must be used.
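
As a rough illustration (not part of the product), the following Python sketch shows why splitting on white space works for English text but fails for the equivalent Japanese text. The Japanese string, 神奈川県知事選挙 (election for governor of Kanagawa prefecture), is the eight-character example that is discussed later in this topic.

    # English: white space marks the word boundaries.
    english = "election for governor of Kanagawa prefecture"
    print(english.split())
    # ['election', 'for', 'governor', 'of', 'Kanagawa', 'prefecture']

    # Japanese: no white space between words, so split() returns
    # the entire text as a single token.
    japanese = "神奈川県知事選挙"
    print(japanese.split())
    # ['神奈川県知事選挙']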

The OmniFind Text Search Server for DB2® for i provides the following two methods to support the linguistic processing of Chinese, Japanese, and Korean:
  • Dictionary-based word segmentation (also called morphological analysis)
  • N-gram segmentation

Dictionary-based word segmentation

Dictionary-based word segmentation uses a language-specific dictionary to identify words in the sequence of characters in the document. This technique provides precise search results, because the dictionaries are used to identify word boundaries. However, dictionary-based word segmentation can miss matches when the words in a query do not align with the word boundaries that the dictionary identified in the document.
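
The dictionaries and algorithms that the text search server uses are internal to the product, but a greedy longest-match lookup against a word list conveys the general idea. The following Python sketch uses a hypothetical miniature dictionary built from the A through H example later in this topic:

    # Hypothetical miniature dictionary; the real server ships with
    # language-specific dictionaries that are far larger.
    dictionary = {"ABC", "DEF", "GH", "EF"}

    def dictionary_segment(text, dictionary):
        """Greedy longest-match segmentation against a word dictionary."""
        tokens = []
        i = 0
        while i < len(text):
            # Try the longest candidate first, falling back to one character.
            for length in range(len(text) - i, 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in dictionary:
                    tokens.append(candidate)
                    i += length
                    break
        return tokens

    print(dictionary_segment("ABCDEFGH", dictionary))
    # ['ABC', 'DEF', 'GH']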

N-gram segmentation

N-gram segmentation avoids the problem of identifying word boundaries and instead indexes overlapping sequences of characters. Because the OmniFind Text Search Server for DB2 for i uses overlapping sequences of two characters, this technique is also called bi-gram segmentation.
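
A minimal Python sketch of bi-gram segmentation (an illustration of the technique, not the server's implementation):

    def bigram_segment(text):
        """Index overlapping pairs of adjacent characters (bi-grams)."""
        return [text[i:i + 2] for i in range(len(text) - 1)]

    print(bigram_segment("ABCDEFGH"))
    # ['AB', 'BC', 'CD', 'DE', 'EF', 'FG', 'GH']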

N-gram segmentation always returns all matching documents that contain the search terms; however, this technique might also return documents that contain the characters of the query but not the intended words.

By default, the OmniFind Text Search Server for DB2 for i comes with a pre-configured index that uses n-gram segmentation for Chinese, Japanese, and Korean.

To see how both types of linguistic processing work, examine the following text in a document: election for governor of Kanagawa prefecture. In Japanese, this text contains eight characters, which are represented here as A B C D E F G H. A sample query that users might enter is election for governor, which in Japanese contains four characters, represented as E F G H. (The query characters are the last four characters of the document text.)

If you use n-gram segmentation processing:

After the document is indexed, the search engine segments the text election for governor of Kanagawa prefecture into the following sets of characters: AB BC CD DE EF FG GH

The sample query election for governor is segmented into the following sets of characters: EF FG GH. If you search with the sample query election for governor, the document is found, because all of the query tokens appear, in the same order, among the tokens of the document text.
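
The following sketch reproduces this example; bigram_segment is the same illustrative helper as in the earlier sketch:

    def bigram_segment(text):
        return [text[i:i + 2] for i in range(len(text) - 1)]

    doc_tokens = bigram_segment("ABCDEFGH")   # election for governor of Kanagawa prefecture
    query_tokens = bigram_segment("EFGH")     # election for governor

    print(doc_tokens)    # ['AB', 'BC', 'CD', 'DE', 'EF', 'FG', 'GH']
    print(query_tokens)  # ['EF', 'FG', 'GH']

    # Every query token occurs in the document tokens ...
    print(all(t in doc_tokens for t in query_tokens))          # True
    # ... and the query tokens even appear as one consecutive run.
    n = len(query_tokens)
    print(any(doc_tokens[i:i + n] == query_tokens
              for i in range(len(doc_tokens) - n + 1)))        # True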

When you enable n-gram segmentation, you might see more results but possibly less precise results. For example, in Japanese, if you search with the query Kyoto and a document in your index contains the text City of Tokyo, the document is found. The reason is that the two characters of Kyoto (京都) also appear as an adjacent pair in the characters of City of Tokyo (東京都).
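
The same bi-gram sketch shows why this false match happens: the characters of Kyoto form one of the overlapping pairs that are indexed for City of Tokyo.

    def bigram_segment(text):
        return [text[i:i + 2] for i in range(len(text) - 1)]

    doc_tokens = bigram_segment("東京都")   # City of Tokyo
    query_tokens = bigram_segment("京都")   # Kyoto

    print(doc_tokens)    # ['東京', '京都']
    print(query_tokens)  # ['京都']

    # The only query token, 京都, is indexed for the Tokyo document,
    # so a search for Kyoto returns the document about Tokyo.
    print(all(t in doc_tokens for t in query_tokens))   # True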

If you do not use n-gram segmentation processing:

After the document is indexed, the search engine segments the text election for governor of Kanagawa prefecture into the following sets of characters: ABC DEF GH.

The sample query election for governor is segmented into the following sets of characters: EF GH. The token EF does not appear among the tokens of the document text, even though the document text contains the characters E and F inside the token DEF.

Because the document was indexed with the token DEF, and DEF is not the same token as EF, the document is less likely to be found by using the sample query.
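
The following sketch reproduces this example with the same hypothetical dictionary and greedy longest-match helper as in the earlier sketch:

    dictionary = {"ABC", "DEF", "GH", "EF"}

    def dictionary_segment(text, dictionary):
        tokens, i = [], 0
        while i < len(text):
            for length in range(len(text) - i, 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in dictionary:
                    tokens.append(candidate)
                    i += length
                    break
        return tokens

    doc_tokens = dictionary_segment("ABCDEFGH", dictionary)   # ['ABC', 'DEF', 'GH']
    query_tokens = dictionary_segment("EFGH", dictionary)     # ['EF', 'GH']

    # EF was never indexed for the document (it was absorbed into DEF),
    # so the query token EF finds no match.
    print(all(t in doc_tokens for t in query_tokens))   # False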

When you do not enable n-gram segmentation, you probably receive more precise results but possibly fewer results.