Linguistic support in Watson Content Analytics

The linguistic analysis functions that are provided with Watson Content Analytics include document language detection and segmentation.

When a document is processed, parsing and tokenization functions determine the language of that document and break the stream of input text into distinct units, or tokens.

During a search, the user or search application must specify the query language. The query string is segmented and analyzed, and then the index is searched.

Document and query string analysis includes:

Basic nondictionary-based support includes white space segmentation, n-gram segmentation, and sentence segmentation.
Dictionary-based linguistic support includes morphological analysis, such as word segmentation, sentence segmentation, and lemmatization.
Linguistic processing involves lexical analysis, which creates alternative representations of the input text by associating all available dictionary data with the tokens that are recognized in that text. This advanced language processing improves search quality; for example, lemmatization allows a query for one form of a word to match documents that contain other forms of the same word.
Hybrid segmentation support includes a mixture of nondictionary-based n-gram segmentation and dictionary-based morphological analysis. This approach can improve search quality for collections that include documents in languages that have dictionaries but do not use white space to delimit word boundaries, such as Chinese, Japanese, and Korean.
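
The three segmentation approaches described above can be illustrated with a minimal Python sketch. This is not the Watson Content Analytics implementation: the function names, the greedy longest-match lookup, and the one-word sample dictionary are all assumptions chosen for illustration, and a real system relies on language-specific dictionaries and morphological rules.

```python
def whitespace_segment(text):
    """Nondictionary-based: split text into tokens at runs of white space."""
    return text.split()


def ngram_segment(text, n=2):
    """Nondictionary-based: overlapping character n-grams, white space removed."""
    chars = "".join(text.split())
    if len(chars) <= n:
        # Text shorter than n yields a single short token (or nothing).
        return [chars] if chars else []
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def hybrid_segment(text, dictionary, n=2):
    """Hybrid (illustrative assumption): greedy longest dictionary match,
    with n-gram segmentation as the fallback for runs of unmatched characters."""
    tokens = []
    unknown = ""  # characters not covered by any dictionary entry so far
    max_len = max((len(word) for word in dictionary), default=0)
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match:
            if unknown:
                tokens.extend(ngram_segment(unknown, n))
                unknown = ""
            tokens.append(match)
            i += len(match)
        else:
            unknown += text[i]
            i += 1
    if unknown:
        tokens.extend(ngram_segment(unknown, n))
    return tokens


if __name__ == "__main__":
    print(whitespace_segment("full text search"))
    print(ngram_segment("東京都庁"))
    # Hypothetical one-entry dictionary, for illustration only.
    print(hybrid_segment("東京都庁", {"東京"}))
```

In this sketch, white space segmentation suits languages that delimit words with spaces, while the n-gram and hybrid functions show how text without such delimiters can still be broken into searchable tokens. The hybrid function prefers dictionary words where they exist and falls back to n-grams elsewhere.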