Language Detection in Watson Explorer Engine

Language detection in Watson Explorer Engine is performed by statistical classification in its Normalization converter. The language detection process compares the language model for a specified content element against language models that have been precalculated from known representative samples of those languages. The Language Detection settings for the Normalization converter enable you to change the content element(s) that are used for language detection, how many bytes of each element are used in the detection process, and so on. Watson Explorer Engine currently provides 14 language models: English, Spanish, French, Swedish, Italian, Dutch, Japanese, Chinese, Korean, Thai, Portuguese, German, Russian, and Catalan.

Watson Explorer Engine's language models are partitioned into two sets: Chinese, Japanese, Korean, and Thai (collectively referred to here as CJK) and non-CJK languages. An initial test that counts the number of CJK and non-CJK characters in the text determines which set of language models is used: if more CJK characters are found, the CJK language models are used; otherwise, the non-CJK models are used. The reason for this separation is that CJK language models are huge relative to language models such as English. For example, in mixed English and Japanese text, a relatively small amount of English text could otherwise produce a reasonably large match with the English model.
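The pre-test described above can be sketched as a simple character count. This is a minimal illustration, not Watson Explorer Engine's actual implementation; the function names and Unicode ranges are assumptions chosen to cover common CJK and Thai characters.

```python
# Hypothetical codepoint ranges covering common CJK and Thai characters.
CJK_RANGES = [
    (0x3040, 0x30FF),   # Japanese hiragana and katakana
    (0x4E00, 0x9FFF),   # CJK unified ideographs
    (0xAC00, 0xD7AF),   # Korean hangul syllables
    (0x0E00, 0x0E7F),   # Thai
]

def is_cjk(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

def select_model_set(text: str) -> str:
    """Count CJK vs. non-CJK characters to pick a set of language models."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    non_cjk = sum(1 for ch in text if ch.isalpha() and not is_cjk(ch))
    return "cjk" if cjk > non_cjk else "non-cjk"

print(select_model_set("The quick brown fox"))   # → non-cjk
print(select_model_set("吾輩は猫である"))           # → cjk
```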

Once the set of language models to use is determined, Watson Explorer Engine calculates a language model for the input text using the 300 most common (in other words, most highly ranked) sequences of between 1 and 4 characters (n-grams) that appear within any word in the input text. The degree of match between two language models is calculated by using an "out of place" measure, where each n-gram that appears in the new language model is compared with those in each existing language model. Each comparison contributes to the overall comparison score in the following way: if the n-gram is also present in the existing language model, the difference between its rank in the two models is added to the score; if it is not present, a fixed maximum out-of-place penalty is added instead.

After comparing the language model of the input text against each of the precalculated language models, the best language match is the one with the lowest score.
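The profile-building and scoring steps above can be sketched as follows. This is a minimal illustration in the spirit of classic rank-order ("out of place") profile matching, not Watson Explorer Engine's actual code; the penalty value and sample texts are assumptions.

```python
from collections import Counter

PROFILE_SIZE = 300                # the 300 most common n-grams, per the text
MAX_OUT_OF_PLACE = PROFILE_SIZE   # assumed penalty for an n-gram not in the model

def build_profile(text: str) -> list[str]:
    """Rank the 1- to 4-character n-grams that appear within each word."""
    counts = Counter()
    for word in text.lower().split():
        for n in range(1, 5):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(PROFILE_SIZE)]

def out_of_place_score(input_profile: list[str], model_profile: list[str]) -> int:
    """Sum rank differences; absent n-grams contribute the maximum penalty."""
    ranks = {gram: r for r, gram in enumerate(model_profile)}
    score = 0
    for r, gram in enumerate(input_profile):
        score += abs(r - ranks[gram]) if gram in ranks else MAX_OUT_OF_PLACE
    return score

def detect(text: str, models: dict[str, list[str]]) -> str:
    """The best language match is the model with the lowest score."""
    profile = build_profile(text)
    return min(models, key=lambda lang: out_of_place_score(profile, models[lang]))

# Toy models built from tiny (unrealistically small) samples:
models = {
    "english": build_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "spanish": build_profile("el perro rapido salta sobre el gato perezoso y el raton"),
}
print(detect("the dog and the fox", models))   # → english
```

In practice the precalculated models would be built from large representative corpora, not single sentences.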

The language match is then further qualified (and ties are eliminated) by generating a confidence score for that match. The confidence score is the difference between the two lowest scores (in other words, the two best scores), divided by a divisor that consists of a general comfort value for the quality of the language match minus the difference between the best and worst scores. The comfort value takes into account whether the language model for the input text provided sufficient data for a meaningful language analysis: it is calculated by multiplying the number of n-grams in a standard language model (300) by 300 minus the number of n-grams that were actually present in the model of the text that you are trying to identify. The calculations done to compute the confidence score for the language match are therefore:

    COMFORT = 300 * (300 - input-n-grams)
    CONFIDENCE = (best2 - best1) / (COMFORT - best1 + worst)

If the CONFIDENCE score is less than 0.01, the Normalization converter assumes that no language has been successfully detected and returns "unknown" (or whatever label has been set in the converter as the default language).
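The confidence calculation can be illustrated with a short sketch. Variable names follow the formulas in the text, COMFORT is computed by multiplication as the prose describes, and the score values and 2-entry minimum are illustrative assumptions, not Watson Explorer Engine's actual code.

```python
def confidence(scores: dict[str, float], input_ngrams: int) -> tuple[str, float]:
    """Return the best-matching language and its confidence score.

    scores maps each candidate language to its out-of-place score
    (lower is better); input_ngrams is the number of n-grams actually
    present in the input text's model (at most 300).
    """
    ordered = sorted(scores.values())
    best1, best2, worst = ordered[0], ordered[1], ordered[-1]
    comfort = 300 * (300 - input_ngrams)
    conf = (best2 - best1) / (comfort - best1 + worst)
    best_lang = min(scores, key=scores.get)
    # Below 0.01 the converter reports the default ("unknown") language.
    return (best_lang, conf) if conf >= 0.01 else ("unknown", conf)

# Made-up scores: a full 300-n-gram input, so COMFORT is 0 and the
# confidence reduces to the margin between the two best scores,
# normalized by the best-to-worst spread.
print(confidence({"english": 1200.0, "french": 2400.0, "german": 9000.0}, 300))
```

Note how a short input (few n-grams) inflates COMFORT, which enlarges the divisor and drives the confidence down, so sparse inputs tend to fall below the 0.01 threshold.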