Linguistic support for nondictionary-based segmentation

For documents in languages that are not supported by the lexical analysis technology, Watson Explorer Content Analytics provides basic support in the form of Unicode-based white space segmentation, n-gram segmentation, and hybrid segmentation.

Unicode-based white space segmentation: This method of linguistic processing uses the white space (or blank space) between words as a word delimiter.
N-gram segmentation: This method of linguistic processing treats each overlapping sequence of n characters as a single word. N-gram segmentation, which can be used for languages such as Chinese, Japanese, Korean, Thai, and Hebrew, returns better results than basic Unicode-based white space segmentation does. This simple method of segmentation, which is sufficient for many retrieval tasks, is supported only in enterprise search collections.
Hybrid segmentation: This method of linguistic processing combines dictionary-based and nondictionary-based segmentation. You can select this method when you configure an enterprise search collection or a content analytics collection.
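
The difference between the first two methods can be illustrated with a minimal Python sketch. This is not the product's implementation. The function names `whitespace_segment` and `ngram_segment` and the default value n = 2 are assumptions that are made only for this example.

```python
from typing import List


def whitespace_segment(text: str) -> List[str]:
    """Split text into words at runs of Unicode white space (hypothetical helper)."""
    # str.split() with no arguments splits on any Unicode white space
    # and discards empty strings.
    return text.split()


def ngram_segment(text: str, n: int = 2) -> List[str]:
    """Return every overlapping sequence of n characters (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Text shorter than n cannot form a full n-gram; keep it as one token.
    if len(text) <= n:
        return [text] if text else []
    # Slide a window of width n across the text, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(whitespace_segment("basic white space segmentation"))
    print(ngram_segment("東京都庁", 2))
```

For text that is written without spaces between words, white space segmentation returns the whole string as a single token, whereas the n-gram sketch produces overlapping tokens that a query can match.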

These methods are independent of any language dictionary and do not include sophisticated linguistic processing technology, such as base-form reduction.