Indexing Multiple Languages

The default indexing settings in the Watson™ Explorer Engine default collection are biased toward English content. If your content is mainly in another language, or in a mix of languages, you may want to change some settings to index your data more effectively. For information on searching with different languages outside of the context of indexing, see Language Configuration in Watson Explorer Engine.

Note: This information does not apply to lexical analysis language streams. Each lexical analysis stream performs language-appropriate segmentation and tokenization using resources that are contained within a PEAR file, and no other configuration is needed. See Lexical Analysis Streams for more information.

Segmenting is the process of splitting an input into separate words. It is heavily influenced by the tokenization process, which determines which characters are parts of words. Many languages share similar segmentation properties. English and German, for example, use the same segmenter, which essentially divides words at spaces. This is the default segmenter behavior.

Some languages, however, are not segmented with spaces. Japanese, Chinese, and Thai are examples. Because of this, Watson Explorer Engine provides special segmenters for these languages.
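The difference can be illustrated with a minimal Python sketch. It is not Watson Explorer Engine code: the whitespace_segment function is a hypothetical stand-in that only approximates the space-based behavior described above.

```python
def whitespace_segment(text: str) -> list[str]:
    """Split on runs of whitespace (a rough stand-in for space-based segmentation)."""
    return text.split()


english = "Watson indexes documents"
japanese = "私は東京に住んでいます"  # "I live in Tokyo", written without spaces

english_tokens = whitespace_segment(english)
japanese_tokens = whitespace_segment(japanese)

print(english_tokens)   # three separate words
print(japanese_tokens)  # the whole sentence comes back as a single token
```

Because the Japanese sentence contains no spaces, a space-based segmenter treats it as one long "word", which makes searching for the individual words impractical.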

When your collection will contain only languages that have the same segmentation, it's easy to set Watson Explorer Engine to index those languages properly. The following general rules apply:

  • If the languages are not segmented by spaces, use the appropriate base collection when creating your collection. For example, if your collection contains only Japanese, use japanese-default. You can find a list of base collections by adding a new collection and browsing the top portion of the list of possible base collections. If there is no default collection for the non-space-segmented language of your data, chinese-default is a suitable substitute. Each of the default collections defines a stream with the proper options set for its particular language, which simplifies the process of adding a language stream to a collection.
  • Otherwise, use default as the base collection. It defines a stream that is suitable for space-segmented languages.

However, if your collection contains more than one type of segmentation, or if there are other options that you wish to set per-language, one stream definition will not suffice for the entire collection. Use language-specific indexing to assign a different stream definition for each detected language.
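Conceptually, language-specific indexing is a lookup from each document's detected language to a stream definition. The Python sketch below illustrates only that idea. The language codes, the stream names, and the select_stream function are hypothetical and do not reflect the product's actual configuration format.

```python
# Hypothetical mapping from a detected language code to a stream definition name.
# These names are illustrative only; they are not Watson Explorer Engine identifiers.
STREAMS_BY_LANGUAGE = {
    "ja": "japanese-stream",
    "zh": "chinese-stream",
    "th": "thai-stream",
}

# Assumed fallback for space-segmented languages such as English or German.
DEFAULT_STREAM = "default-stream"


def select_stream(language_code: str) -> str:
    """Return the stream for a detected language, falling back to the default."""
    return STREAMS_BY_LANGUAGE.get(language_code.lower(), DEFAULT_STREAM)


for code in ("ja", "de", "TH"):
    print(code, "->", select_stream(code))
```

In this sketch, a language without its own entry falls back to the space-segmented default, which mirrors the role of the default base collection described above.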

Note: It may be tempting to add a second index stream to a collection with mixed languages. This approach should work, but it is not recommended. Using language-specific indexing, or using chinese-default as your base collection, is usually a better choice. If you do add a second index stream, first read about multi-stream indexing and be sure that you understand the implications in terms of index size and query speed.

When processing Japanese, you can choose between two processing modes, which use one or more segmenters:

  • Standard (default) applies the default Japanese segmenter and indexes surface forms. Surface forms may contain Kanji, Katakana, and Hiragana, and are sometimes ambiguous in meaning.
  • Base+Reading applies two different segmenters (japanese-base and japanese-reading). It indexes both the normalized base form, which is similar to stemming in space-segmented languages, and the reading form, in which Kanji and Katakana characters are given specific pronunciations and meanings. This processing mode may provide higher search accuracy but creates a larger search index because each word is indexed twice.

Optionally, when searching non-space-segmented languages, you can change a setting in your search collection that affects how multiple-word queries are interpreted. This setting is in the Configuration > Searching > Advanced section of the Watson Explorer Engine administration tool. Click edit to show the available options. The Phrase logic option controls the interpretation of soft phrases in the user's input. A soft phrase is a search term entered by the user that the segmenter splits into multiple "words". By default, such a query is interpreted as a phrase search and returns only documents containing that exact phrase. When Phrase logic is set to near, the search returns documents containing the individual terms in close proximity. The latter is usually more natural for queries in non-space-segmented languages, so we recommend that you change this option.
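The difference between the two interpretations can be sketched in Python. This is an illustration of phrase matching versus proximity matching in general, not the engine's implementation. The function names and the window size of 5 tokens are assumptions, and the document is assumed to be already segmented.

```python
def matches_phrase(doc_tokens: list[str], query_tokens: list[str]) -> bool:
    """True if the query terms occur consecutively and in order."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def matches_near(doc_tokens: list[str], query_tokens: list[str], window: int = 5) -> bool:
    """True if all query terms occur within some span of `window` tokens (assumed size)."""
    needed = set(query_tokens)
    return any(
        needed <= set(doc_tokens[start:start + window])
        for start in range(len(doc_tokens))
    )


# A pre-segmented document: "research at a university in Tokyo".
doc = ["東京", "の", "大学", "で", "研究"]
# The user typed 東京大学, which the segmenter split into two "words".
query = ["東京", "大学"]

phrase_hit = matches_phrase(doc, query)
near_hit = matches_near(doc, query)
print(phrase_hit, near_hit)
```

Here the exact-phrase interpretation misses the document because の sits between the two terms, while the proximity interpretation finds it.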

Tip: If you need to edit the language settings after you have already created a collection, look at the stream definitions in the default collections and set your stream definitions to match them. After making any changes to the stream definitions, you must re-crawl the collection.