Linguistic support for nondictionary-based segmentation

For documents in languages that are not supported by the lexical analysis technology, Watson Explorer Content Analytics provides basic support in the form of Unicode-based white space segmentation, n-gram segmentation, and hybrid segmentation.

Unicode-based white space segmentation
This method of linguistic processing uses the white space (or blank space) between words as a word delimiter.
N-gram segmentation
This method of linguistic processing treats each overlapping sequence of n characters as a single word. N-gram segmentation can be used for languages such as Chinese, Japanese, Korean, Thai, and Hebrew, and it returns better results than the basic form of Unicode-based white space segmentation does. N-gram segmentation is a simple method that is sufficient for many retrieval tasks, but it is supported only in enterprise search collections.
Hybrid segmentation
This method of linguistic processing combines dictionary-based and nondictionary-based segmentation. You can select this method when you configure an enterprise search collection or a content analytics collection.

These methods are independent of any language dictionary and do not include sophisticated linguistic processing technology, such as base-form reduction.
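
To illustrate the difference between the first two methods, the following Python sketch shows white space segmentation and overlapping n-gram segmentation on plain strings. It is an illustration only and does not reflect how Watson Explorer Content Analytics implements these methods; the function names `whitespace_segment` and `ngram_segment`, the default of n = 2, and the handling of runs shorter than n are assumptions made for this example.

```python
def whitespace_segment(text: str) -> list[str]:
    """Split text into words, using Unicode white space as the delimiter."""
    # str.split() with no argument splits on runs of Unicode white space
    # and drops empty strings.
    return text.split()


def ngram_segment(text: str, n: int = 2) -> list[str]:
    """Return overlapping character n-grams for each white-space-delimited run.

    Assumption for this sketch: a run shorter than n characters is
    emitted unchanged as a single token.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens: list[str] = []
    for run in text.split():
        if len(run) <= n:
            tokens.append(run)
            continue
        # Slide a window of n characters across the run, one character at a time.
        for start in range(len(run) - n + 1):
            tokens.append(run[start:start + n])
    return tokens


if __name__ == "__main__":
    print(whitespace_segment("enterprise search collection"))
    print(ngram_segment("東京都庁", n=2))
```

For text without spaces between words, white space segmentation returns the whole run as a single token, whereas the n-gram sketch produces overlapping fragments that a partial query can still match.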