Language Detection in Watson Explorer Engine
Language detection in Watson™ Explorer Engine is done through statistical classification by the Normalization converter. The language detection process compares the language model for a specified content element against language models that have been precalculated from known representative samples of those languages. The Language Detection settings for the Normalization converter enable you to change the content elements that are used for language detection, how many bytes of each element are used in the detection process, and so on. Watson Explorer Engine currently provides 14 language models: English, Spanish, French, Swedish, Italian, Dutch, Japanese, Chinese, Korean, Thai, Portuguese, German, Russian, and Catalan.
The language models are partitioned into Chinese, Japanese, Korean, and Thai (CJK) languages and non-CJK languages. An initial test counts the number of CJK and non-CJK characters in the text to determine which set of language models is used: if more CJK characters are found, the CJK language models are used; otherwise, the non-CJK models are used. The models are separated in this way because language models for CJK languages are much larger than language models for languages such as English.
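The CJK pre-test can be sketched as follows. This is an illustrative simplification, not the product's actual implementation, and the exact Unicode ranges that Watson Explorer Engine tests are not documented; the ranges and function names below are assumptions:

```python
def is_cjk(ch: str) -> bool:
    """Rough test for CJK/Thai characters (illustrative ranges only)."""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x30FF      # Japanese hiragana and katakana
        or 0x4E00 <= cp <= 0x9FFF   # CJK unified ideographs
        or 0xAC00 <= cp <= 0xD7AF   # Korean hangul syllables
        or 0x0E00 <= cp <= 0x0E7F   # Thai
    )

def pick_model_set(text: str) -> str:
    """Count CJK versus non-CJK letters and choose the model set."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    non_cjk = sum(1 for ch in text if ch.isalpha() and not is_cjk(ch))
    return "CJK" if cjk > non_cjk else "non-CJK"
```

Because this test only counts characters, it is cheap to run before the more expensive n-gram comparison begins.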
Once the set of language models to use is determined, Watson Explorer Engine calculates a language model for the input text using the 300 most common (in other words, most highly ranked) sequences of between 1 and 4 characters (n-grams) that appear within any word in the input text. The degree of match between two language models is calculated by using an "out of place" measure, where each n-gram that appears in the new language model is compared with those in each existing language model. The result of this comparison contributes to the overall comparison score in the following way:
- if an n-gram appears in both language models, that n-gram contributes the absolute value of rank1 - rank2 to the score, where rank1 is the rank of the n-gram in the new language model and rank2 is the rank of the n-gram in the existing model. The difference in ranks identifies the distance between the two n-grams in terms of frequency.
- if an n-gram from the new language model does not appear in the existing language model, that n-gram contributes 300 to the score, because the difference between the ranks of the two n-grams would be at least 300 when one of them does not appear in the language model.
After comparing the language model of the input text against each of the precalculated language models, the best language match is the one with the lowest score.
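The profile construction and the "out of place" comparison described above can be sketched as follows. This is a minimal illustration, not the product's actual code; the function names and the exact tokenization are assumptions:

```python
from collections import Counter

PROFILE_SIZE = 300  # n-grams kept per language model

def build_profile(text: str, max_rank: int = PROFILE_SIZE) -> dict:
    """Rank the most frequent 1- to 4-character n-grams found within words."""
    counts = Counter()
    for word in text.split():
        for n in range(1, 5):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(max_rank)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(input_profile: dict, model_profile: dict) -> int:
    """Sum of rank distances; a missing n-gram contributes the maximum (300)."""
    score = 0
    for gram, rank1 in input_profile.items():
        if gram in model_profile:
            score += abs(rank1 - model_profile[gram])
        else:
            score += PROFILE_SIZE
    return score
```

The best match is then simply the precalculated model with the lowest score, for example `min(models, key=lambda name: out_of_place(profile, models[name]))`.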
The language match is then further qualified (and ties are eliminated) by generating a confidence score for that match. This confidence score is calculated by dividing the difference between the two lowest scores (in other words, the two best scores) by a divisor that consists of a general comfort value for the quality of the language match, minus the best score, plus the worst score. The comfort value takes into account whether the language model for the input text provided sufficient data for a meaningful language analysis. It is calculated by multiplying the number of n-grams in a standard language model (300) by the difference between 300 and the number of n-grams that were actually present in the language model of the text you are trying to identify. The calculations done to compute the confidence score for the language match are therefore:
COMFORT = 300 * (300 - input-n-grams)
CONFIDENCE = (best2 - best1) / (COMFORT - best1 + worst)
If the CONFIDENCE score is less than 0.01, the Normalization converter assumes that no language has been successfully detected and returns "unknown" (or whatever label has been set in the Normalization converter for the default language).
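The final qualification step can be put into code as follows. This is a minimal sketch that follows the multiplicative comfort calculation described above; the `classify` function and its parameter names are hypothetical, not part of the product's API:

```python
UNKNOWN = "unknown"  # or the converter's configured default-language label

def classify(scores: dict, input_ngrams: int, threshold: float = 0.01) -> str:
    """Pick the best language match and qualify it with a confidence score.

    scores       -- out-of-place score per candidate language (lower is better)
    input_ngrams -- number of n-grams actually present in the input's model
    """
    ordered = sorted(scores.values())
    best1, best2, worst = ordered[0], ordered[1], ordered[-1]
    comfort = 300 * (300 - input_ngrams)
    confidence = (best2 - best1) / (comfort - best1 + worst)
    if confidence < threshold:
        return UNKNOWN
    return min(scores, key=scores.get)
```

Note how a short input works against a confident answer: the fewer n-grams the input text yields, the larger the comfort term, the larger the divisor, and the lower the resulting confidence.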