Linguistic support for hybrid segmentation

Hybrid segmentation combines the high precision of dictionary-based segmentation with the high recall of nondictionary-based n-gram segmentation.

Dictionary-based segmentation uses language-specific dictionaries to determine the words in a given document. Through morphological analysis, in which words and word alternatives are identified and indexed, dictionary-based segmentation provides high search quality because query terms are matched precisely against word forms in the index.

Nondictionary-based segmentation uses rules, rather than language-specific dictionaries, to tokenize data. For example, a word might be defined as a sequence of n characters, which might or might not include white space. This type of segmentation is useful for languages in which word boundaries are not clearly delimited.
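
As an illustration of the n-character rule described above, the following minimal Python sketch produces overlapping character n-grams. The function name char_ngrams and the sample string are assumptions made for this example; they are not part of the product.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every overlapping run of n characters in text.

    White space is treated like any other character, so an n-gram
    might or might not contain it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Illustrative input: a four-character Japanese string with no white space.
print(char_ngrams("東京都庁", 2))  # ['東京', '京都', '都庁']
```

The middle bigram spans the boundary between two words. Overlapping n-grams of this kind can match more queries, which helps recall, but some of those matches are not real words in the text.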

Hybrid segmentation, which combines these two methods, can be used for languages such as Chinese, Japanese, and Korean that have dictionaries but do not use white space to delimit word boundaries.
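
One way to picture the combination is to index the union of dictionary tokens and character bigrams. The Python sketch below assumes a greedy longest-match dictionary lookup and a tiny hypothetical dictionary; it illustrates the idea only and does not describe the product's actual algorithm.

```python
def dictionary_tokens(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation (an assumed, simplified strategy).

    Characters that start no dictionary word are skipped.
    """
    tokens = []
    max_len = max((len(word) for word in dictionary), default=1)
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            i += 1  # no dictionary word starts here
    return tokens


def bigrams(text: str) -> list[str]:
    """Overlapping two-character n-grams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def hybrid_terms(text: str, dictionary: set[str]) -> set[str]:
    """Index terms from both methods: dictionary words plus bigrams."""
    return set(dictionary_tokens(text, dictionary)) | set(bigrams(text))


# Hypothetical one-entry dictionary, used only for this example.
terms = hybrid_terms("東京都庁", {"東京都庁"})
print(sorted(terms))
```

In this sketch the dictionary contributes the whole word as a precise term, and the bigrams add shorter terms that allow partial matches.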

When hybrid segmentation is enabled, search quality improves in terms of both precision and recall. Precision, which is the number of relevant documents in the search results as a proportion of the total number of documents returned, is one of the benefits of morphological analysis. Recall, which is the number of relevant documents returned as a proportion of the total number of potentially relevant documents, is one of the benefits of n-gram segmentation. The two measures must be considered together: for example, perfect recall might be achieved by returning all documents in the index, but the corresponding precision would then be very poor.
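
The two measures can be worked through with a small Python sketch. The document identifiers and the function name precision_recall are invented for this example.

```python
def precision_recall(returned: set[str], relevant: set[str]) -> tuple[float, float]:
    """Compute (precision, recall) for one query.

    precision = relevant documents returned / all documents returned
    recall    = relevant documents returned / all relevant documents
    """
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative index of ten documents, two of which are relevant.
index = {f"doc{i}" for i in range(1, 11)}
relevant = {"doc1", "doc2"}

# Returning the whole index gives perfect recall but poor precision.
print(precision_recall(index, relevant))  # (0.2, 1.0)
```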

To use hybrid segmentation, select the option to enable both morphological and n-gram segmentation when you create a collection. If you decide that you want to use hybrid segmentation after you create the collection, you can enable it by configuring parsing options.