Language-Specific Indexing

The Watson™ Explorer Engine language detection facility is active by default, but on its own it only adds a content item that identifies the language in which specific content is written, so that you can search by a specific language. When crawling data that is written in another language, especially a non-Western language, you can also configure the streams used when processing the documents in that search collection, as explained in Indexing Multiple Languages.

Note: Languages are processed differently when using lexical analysis language streams. See Lexical Analysis Streams for more information.

In addition to enabling you to search or index entire documents based on the language that they are written in, Watson Explorer Engine also enables you to index documents that contain multiple languages so that each content element in those documents can be indexed in the way that is most appropriate for the language in which that content is expressed.

The Language detection and Language specific indexing options are configured in the Normalization converter. This converter is one of the final converters that Watson Explorer Engine runs before indexing begins. To view and optionally modify the current values of these options for a given search collection:

  1. Select the Configuration tab's Converting sub-tab for that search collection.
  2. Scroll to the end of the list of converters that is displayed on this screen, and click edit to the right of the Converter Component: Normalization converter. An editable version of the converter displays.
  3. Scroll down until you see the Language detection header and click the arrow to the left of that header to expand the group of options that is associated with it.
  4. To activate language-specific indexing, select true from the list of values beside the Use language specific indexing? option. (The default value is "false".)
    Note: If Watson Explorer Engine cannot determine the language in which a content element is written, it indexes that content based on the default language settings for that collection.

    Other values that you may want to configure are Contents for language specific indexing and Language specific indexing minimum bytes. Contents for language specific indexing identifies the specific content elements that you want to index based on the primary language that they contain. The default is '*', which means that all content elements are indexed in a language-specific fashion. Language specific indexing minimum bytes specifies the minimum size of a content element that should be indexed in a language-specific fashion.

    Advanced users may wish to configure the Language specific indexing options setting, which controls the index stream that is used for each detected language.

  5. After modifying any of these values, click OK or Apply at the top of the editable Normalization converter to save your changes. To discard your changes without saving them, click Cancel.
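
The way these options interact can be sketched in a few lines of Python. This is only an illustration of the behavior described in the steps above, not Watson Explorer Engine code: the class, the function, the field names, the "english" default language, and the fallback to the default language for elements below the minimum size are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageIndexingOptions:
    """Hypothetical stand-in for the Normalization converter options."""
    use_language_specific_indexing: bool = False  # documented default is "false"
    contents: str = "*"                # '*' means all content elements
    minimum_bytes: int = 0             # assumed value, not the product default
    default_language: str = "english"  # assumed collection default


def choose_indexing_language(name: str, text: str,
                             detected_language: Optional[str],
                             opts: LanguageIndexingOptions) -> str:
    """Return the language whose settings would be used for one content element."""
    if not opts.use_language_specific_indexing:
        return opts.default_language
    # Assumption: a non-'*' value is a whitespace-separated list of content names.
    if opts.contents != "*" and name not in opts.contents.split():
        return opts.default_language
    # Assumption: elements smaller than the minimum use the default settings.
    if len(text.encode("utf-8")) < opts.minimum_bytes:
        return opts.default_language
    # Documented: undetected languages fall back to the collection default.
    if detected_language is None:
        return opts.default_language
    return detected_language
```

For example, with `LanguageIndexingOptions(use_language_specific_indexing=True, minimum_bytes=10)`, a 21-byte Japanese title detected as "japanese" would be indexed with the Japanese settings, while a 6-byte one would use the default settings.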

Tip: When using language-specific indexing in a search collection, you may also want to explicitly set the segmenter for that collection's main index stream to mixed, a segmenter that is designed to segment text containing both Asian and English characters. For more information about streams, see Index Streams.

The mixed segmenter tokenizes the following Unicode characters as single tokens (unigrams):

  • All characters that the Unicode standard identifies as "ideographic", except for full-width ASCII characters
  • All characters that the Unicode standard identifies as "complex context" (see the Complex Context section of the Unicode Line Breaking documentation for additional information)
  • Katakana characters

When using this segmenter, you do not need to specially handle or exempt any Unicode ranges for Korean characters.
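
To make the unigram behavior concrete, the following Python sketch imitates it on a small scale. It is not the mixed segmenter itself: the function names are hypothetical, and the three code point ranges are a rough approximation chosen for illustration (CJK Unified Ideographs for "ideographic", Thai as one example of "complex context", and the Katakana block), whereas the real segmenter relies on the Unicode character classes listed above.

```python
# Approximate code point ranges, assumed for illustration only.
UNIGRAM_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (stand-in for "ideographic")
    (0x30A0, 0x30FF),  # Katakana block
    (0x0E00, 0x0E7F),  # Thai (one example of "complex context")
)


def is_unigram_char(ch: str) -> bool:
    """True if the character should become a token on its own."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UNIGRAM_RANGES)


def mixed_segment(text: str) -> list:
    """Split text into unigrams for the ranges above and whole words otherwise."""
    tokens = []
    word = []

    def flush():
        # Emit the word collected so far, if any.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if is_unigram_char(ch):
            flush()
            tokens.append(ch)   # each such character is a single token
        elif ch.isalnum():
            word.append(ch)     # other letters and digits accumulate into a word
        else:
            flush()             # whitespace and punctuation end the current word
    flush()
    return tokens


print(mixed_segment("IBM 東京 カタカナ search"))
```

In this sketch, full-width ASCII letters and Korean Hangul are not in the unigram ranges, so they are grouped into words like other alphanumeric text.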