Linguistic processing for Chinese, Japanese, and Korean documents
You can process documents that are in Chinese, Japanese, or Korean by using n-gram segmentation.
For a search engine, good search results depend in large part on the techniques that are used to process text. After the text is extracted from the document, the first step in text processing is to identify the individual words in the text, which is referred to as segmentation. For many languages, white space (blanks, the end of a line, and certain punctuation) can be used to recognize word boundaries. However, Chinese, Japanese, and Korean do not use white space to separate words, so other techniques must be used.
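The following minimal sketch illustrates why white space is not enough. It is only an illustration, not the search server's own tokenizer, and the Japanese string is an assumed rendering of the phrase "election for governor" that does not come from this documentation.

```python
# Whitespace-based segmentation: works for English, fails for Japanese.
english = "election for governor"
japanese = "知事選挙"  # assumed Japanese rendering of "election for governor"

# str.split() with no arguments splits on runs of white space.
english_tokens = english.split()
japanese_tokens = japanese.split()

print(english_tokens)   # three separate words
print(japanese_tokens)  # one unsegmented token: no white space to split on
```

The English phrase is split into three words, while the Japanese phrase comes back as a single token because it contains no white space.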
The text search server provides n-gram segmentation to support the linguistic processing of Chinese, Japanese, and Korean and, by default, comes with a pre-configured index.
N-gram segmentation
N-gram segmentation avoids the problem of identifying word boundaries, and instead indexes overlapping pairs of characters. Because two characters are used, this technique is also called bi-gram segmentation.
N-gram segmentation always returns all documents that contain the search terms; however, this technique can also return documents that do not actually match the query.
To show how n-gram segmentation works, examine the following text in a document: election for governor of Kanagawa prefecture. In Japanese, this text contains eight characters. For this example, the eight characters are represented as A B C D E F G H. A sample query that users might enter is election for governor, which is four characters and is represented as E F G H. In this case, the four characters of the sample query are the last four characters of the document text.
After the document is indexed, the search engine segments the text election for governor of Kanagawa prefecture into the following sets of characters: AB BC CD DE EF FG GH.
The sample query election for governor is segmented into the following sets of characters: EF FG GH. If you search with the sample query election for governor, the document will be found by the query, because the query tokens appear among the document tokens in the same order.
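The segmentation and matching in this example can be sketched as follows. This is a minimal illustration that uses the placeholder letters from the example; the function names bigrams and contains_sequence are hypothetical and are not part of the text search server.

```python
def bigrams(text):
    """Return the overlapping two-character tokens of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def contains_sequence(doc_tokens, query_tokens):
    """Return True if query_tokens occur consecutively, in order, in doc_tokens."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


# Placeholder letters stand in for the Japanese characters of the example.
doc_tokens = bigrams("ABCDEFGH")   # election for governor of Kanagawa prefecture
query_tokens = bigrams("EFGH")     # election for governor

print(" ".join(doc_tokens))
print(" ".join(query_tokens))
print(contains_sequence(doc_tokens, query_tokens))
```

An eight-character text yields seven bi-grams and a four-character query yields three, because each token starts one character after the previous one.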
When n-gram segmentation is used, you will see more results, but the results might be less precise. For example, in Japanese, if you search with the query Kyoto and a document in your index contains the text City of Tokyo, the query Kyoto will return the document with the text City of Tokyo. The reason for this result is that City of Tokyo and Kyoto share the same two adjacent Japanese characters, so the bi-gram for the query also appears among the bi-grams for the document.
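The following sketch reproduces this kind of false match. The Japanese strings are assumptions, because this documentation gives only the English glosses: 東京都 is assumed for City of Tokyo and 京都 for Kyoto. The bigrams helper is hypothetical and only illustrates the technique.

```python
def bigrams(text):
    """Return the overlapping two-character tokens of text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


document = "東京都"  # assumed Japanese text for "City of Tokyo"
query = "京都"       # assumed Japanese text for "Kyoto"

doc_tokens = bigrams(document)
query_tokens = bigrams(query)

# The query matches when every query bi-gram is also a document bi-gram.
matches = all(token in doc_tokens for token in query_tokens)

print(doc_tokens)
print(query_tokens)
print(matches)
```

The only bi-gram of the query is also the second bi-gram of the document, so the document is returned even though it is about a different city.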