For documents in languages that are not supported by the
lexical analysis technology, Watson Explorer Content Analytics provides
basic support in the form of Unicode-based white space segmentation,
n-gram segmentation, and hybrid segmentation.
- Unicode-based white space segmentation
  - This method of linguistic processing uses the white space (or
    blank space) between words as a word delimiter.
- N-gram segmentation
  - This method of linguistic processing treats each overlapping
    sequence of n characters as a word. N-gram segmentation, which
    can be used for languages such as Chinese, Japanese, Korean, Thai,
    and Hebrew, returns better results than the basic form of Unicode-based
    white space segmentation does. This simple method of segmentation,
    which is sufficient for many retrieval tasks, is supported only in
    enterprise search collections.
- Hybrid segmentation
  - This method of linguistic processing combines dictionary-based
    and nondictionary-based segmentation. You can select this method when
    you configure an enterprise search collection or a content analytics
    collection.
These methods are independent of any language dictionary and do
not include sophisticated linguistic processing technology, such as
base-form reduction.
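
The difference between the first two methods can be illustrated with a minimal Python sketch. This is not the product's implementation: the function names `whitespace_segment` and `ngram_segment` are hypothetical, the default of n = 2 is an assumption chosen for illustration, and the n-gram function removes white space before it builds n-grams, which is also an assumption.

```python
def whitespace_segment(text: str) -> list[str]:
    """Split text into tokens on runs of Unicode white space."""
    # str.split() with no arguments splits on any Unicode white space,
    # including characters such as the ideographic space (U+3000).
    return text.split()


def ngram_segment(text: str, n: int = 2) -> list[str]:
    """Return every overlapping sequence of n characters as a token.

    Illustrative assumption: white space is removed first, so n-grams
    never span or contain blanks.
    """
    chars = "".join(text.split())
    if len(chars) < n:
        # Text shorter than n yields a single short token (or nothing).
        return [chars] if chars else []
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


if __name__ == "__main__":
    # White space segmentation works when words are separated by blanks.
    print(whitespace_segment("full text search"))
    # For text without blanks between words, white space segmentation
    # returns the whole string as one token, whereas n-gram segmentation
    # still produces searchable units.
    print(whitespace_segment("東京都庁"))
    print(ngram_segment("東京都庁"))
```

In this sketch, a query for a two-character term can match one of the overlapping bigrams even though the text contains no word delimiters, which is why n-gram segmentation tends to suit languages that are written without spaces between words.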