Japanese document handling

If the text of a document is in Japanese, Watson Content Analytics performs relevant word segmentation by using morphological analysis technology that is optimized for the Japanese language.

Content analytics collections

An enhanced linguistic analysis engine called JJSA is used to analyze Japanese documents in content analytics collections. JJSA provides dependency information between words so that users can create rules that match relations between words. JJSA ignores sentences that contain only ASCII characters. JJSA also ignores sentences that are 50 characters or longer, because long English sentences have a negative effect on Japanese linguistic analysis.
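The two skip rules described above can be illustrated with a minimal sketch. This is not JJSA code: the function name `should_analyze` and the constant `MAX_SENTENCE_LENGTH` are hypothetical, and the sketch assumes that sentence length is counted in Unicode code points, which the text does not specify.

```python
# Illustrative sketch only; names are hypothetical and not part of JJSA.
# Assumption: length is measured in Unicode code points.
MAX_SENTENCE_LENGTH = 50  # sentences of this length or longer are skipped


def should_analyze(sentence: str) -> bool:
    """Return True if a sentence would pass both skip rules."""
    # Rule 1: skip sentences that contain only ASCII characters.
    if sentence.isascii():
        return False
    # Rule 2: skip sentences that are 50 characters or longer.
    if len(sentence) >= MAX_SENTENCE_LENGTH:
        return False
    return True
```

Under these assumptions, a short Japanese sentence such as "東京に行きます。" passes both checks, while "Hello world." is skipped by the first rule and any sentence of 50 or more characters is skipped by the second.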

Enterprise search collections

In an enterprise search collection, Watson Content Analytics performs relevant word segmentation by using morphological analysis technology that is optimized for the Japanese language.

Word decomposition: Japanese uses a large number of compound words. These words are decomposed into tokens of optimal size to achieve better search results. Inflected words and particles (postpositions) are also decomposed to improve search performance.
Orthographic variants: Watson Content Analytics uses a variant dictionary to map typical Katakana variants to their base forms (similar to a lemma). As a result, a query finds all documents that contain the Katakana word in the query string, including documents that use an orthographic variant of it.
The system also supports typical Okurigana variants, which are Kanji word endings that are written in Hiragana.
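The idea of mapping orthographic variants to a base form can be sketched as follows. This is a simplified illustration, not the product's implementation: the dictionary entries, the choice of base forms, and the names `VARIANT_TO_BASE`, `to_base_form`, `build_index`, and `search` are all assumptions made for the example, and the documents are assumed to be already segmented into tokens.

```python
from typing import Dict, List, Set

# Hypothetical variant dictionary: each variant maps to one base form.
# The entries and the chosen base forms are illustrative assumptions.
VARIANT_TO_BASE: Dict[str, str] = {
    "コンピューター": "コンピュータ",  # Katakana variant (long vowel mark)
    "ヴァイオリン": "バイオリン",      # Katakana variant (ヴァ / バ)
    "引き受け": "引受",                # Okurigana variant
    "引受け": "引受",                  # Okurigana variant
}


def to_base_form(token: str) -> str:
    """Map a token to its base form; unknown tokens are returned unchanged."""
    return VARIANT_TO_BASE.get(token, token)


def build_index(docs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Index pre-segmented documents under the base form of each token."""
    index: Dict[str, Set[str]] = {}
    for doc_id, tokens in docs.items():
        for token in tokens:
            index.setdefault(to_base_form(token), set()).add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query_token: str) -> Set[str]:
    """Normalize the query token the same way, then look it up."""
    return index.get(to_base_form(query_token), set())


docs = {
    "d1": ["コンピューター", "を", "買う"],
    "d2": ["コンピュータ", "の", "修理"],
    "d3": ["バイオリン"],
}
index = build_index(docs)
print(sorted(search(index, "コンピューター")))  # prints ['d1', 'd2']
```

Because the same normalization is applied to both the indexed tokens and the query token, either spelling of the word retrieves both documents in this example.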