Supported languages

Pretrained models

Language name Language code Task types supported
All languages Detag 1, Lang-Detect 2, Syntax (Izumo) 3
Arabic ar Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Relations (Transformer), Sentiment (CNN, BERT, Transformer), Target-Mentions
Chinese (Simplified) zh-cn Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (CNN, BERT, Transformer), Target-Mentions
Chinese (Traditional) zh-tw Entity-Mentions (RBR), Keywords, Noun-Phrases, Target-Mentions
Czech cs Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Danish da Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Dutch nl Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (CNN, BERT, Transformer), Target-Mentions
German de Categories, Concepts, Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Relations (Transformer), Sentiment (CNN, BERT, Transformer), Target-Mentions
English en Categories, Concepts, Emotion, Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Relations (SIRE, Transformer), Sentiment (CNN, BERT, Transformer), Target-Mentions, Tone
Finnish fi Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
French fr Concepts, Emotion, Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Relations (Transformer), Sentiment (CNN, BERT, Transformer), Target-Mentions, Tone
Hebrew he Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Hindi hi Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Italian it Concepts, Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Relations (Transformer), Sentiment (CNN, BERT, Transformer), Target-Mentions
Japanese ja Concepts, Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Relations (Transformer), Sentiment (CNN, BERT, Transformer), Target-Mentions
Korean ko Concepts, Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Relations (Transformer), Sentiment (CNN, BERT, Transformer), Target-Mentions
Norwegian Bokmål nb Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Norwegian Nynorsk nn Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Portuguese pt Concepts, Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Relations (Transformer), Sentiment (CNN, BERT, Transformer), Target-Mentions
Polish pl Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Romanian ro Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Russian ru Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Slovak sk Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Spanish es Concepts, Entity-Mentions (RBR, BiLSTM, BERT, Transformer), Keywords, Noun-Phrases, Relations (Transformer), Sentiment (CNN, BERT, Transformer), Target-Mentions
Swedish sv Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)
Turkish tr Entity-Mentions (RBR, BERT, Transformer), Keywords, Noun-Phrases, Sentiment (BERT, Transformer)

1 The Detag task is language agnostic.

2 Lang-Detect is supported for the languages described in List of Supported Languages below.

3 Syntax support for different parsers (sentence detection, tokenization, lemmatization, parts-of-speech and dependency parsing) is described in List of Supported Languages below.
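The pretrained-model table lends itself to a simple lookup. The following is a minimal sketch, not part of the Watson NLP API: the names `PRETRAINED_TASKS`, `LANGUAGE_AGNOSTIC_TASKS`, and `supports` are hypothetical, and only a few rows of the table are copied in for illustration.

```python
# Hypothetical excerpt of the pretrained-model table; only four rows are copied here.
PRETRAINED_TASKS = {
    "en": {"Categories", "Concepts", "Emotion", "Entity-Mentions", "Keywords",
           "Noun-Phrases", "Relations", "Sentiment", "Target-Mentions", "Tone"},
    "fr": {"Concepts", "Emotion", "Entity-Mentions", "Keywords", "Noun-Phrases",
           "Relations", "Sentiment", "Target-Mentions", "Tone"},
    "zh-tw": {"Entity-Mentions", "Keywords", "Noun-Phrases", "Target-Mentions"},
    "cs": {"Entity-Mentions", "Keywords", "Noun-Phrases", "Sentiment"},
}

# Footnote 1: the Detag task is language agnostic.
LANGUAGE_AGNOSTIC_TASKS = {"Detag"}


def supports(language_code: str, task: str) -> bool:
    """Return True if the excerpt lists a pretrained model for this task and language."""
    if task in LANGUAGE_AGNOSTIC_TASKS:
        return True
    return task in PRETRAINED_TASKS.get(language_code.lower(), set())


print(supports("en", "Tone"))          # English lists Tone
print(supports("zh-tw", "Sentiment"))  # Traditional Chinese does not list Sentiment
```

Lang-Detect and Syntax are deliberately left out of the sketch, because their coverage depends on the per-language list rather than on this table.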

NLP tasks

List of Supported Languages

Watson NLP supports 31 languages for core functions. According to a study of the content languages used by websites worldwide, these languages cover 92.6% of websites.

Supported languages

Language name Locale code Language identification Sentence segmentation Tokenization PoS tagging Dependency parsing
Afrikaans af ✓ ✓ ✓ ✓ ✓
Arabic ar ✓ ✓ ✓ ✓ ✓
Bosnian bs ✓* ✓ ✓ ✓ ✓
Catalan ca ✓ ✓ ✓ ✓
Chinese (Simplified) zh_CN ✓ ✓ ✓ ✓
Chinese (Traditional) zh_TW ✓ ✓ ✓ ✓
Croatian hr ✓ ✓ ✓ ✓ ✓
Czech cs ✓ ✓ ✓ ✓ ✓
Danish da ✓ ✓ ✓ ✓ ✓
Dutch nl ✓ ✓ ✓ ✓ ✓
English en ✓ ✓ ✓ ✓ ✓
Finnish fi ✓ ✓ ✓ ✓ ✓
French fr ✓ ✓ ✓ ✓ ✓
German de ✓ ✓ ✓ ✓ ✓
Greek el ✓ ✓ ✓ ✓
Hebrew he ✓ ✓ ✓ ✓
Hindi hi ✓ ✓ ✓ ✓ ✓
Italian it ✓ ✓ ✓ ✓ ✓
Japanese ja ✓ ✓ ✓ ✓ ✓
Korean ko ✓ ✓ ✓ ✓
Norwegian Bokmål nb ✓ ✓ ✓ ✓ ✓
Norwegian Nynorsk nn ✓ ✓ ✓ ✓ ✓
Polish pl ✓ ✓ ✓ ✓
Portuguese pt ✓ ✓ ✓ ✓ ✓
Romanian ro ✓ ✓ ✓ ✓ ✓
Russian ru ✓ ✓ ✓ ✓ ✓
Serbian sr ✓ ✓ ✓ ✓ ✓
Slovak sk ✓ ✓ ✓ ✓ ✓
Spanish es ✓ ✓ ✓ ✓ ✓
Swedish sv ✓ ✓ ✓ ✓ ✓
Turkish tr ✓ ✓ ✓ ✓

Additionally, the following languages are supported for language identification.

Language name Locale code Language identification
Albanian sq ✓
Armenian hy ✓
Azerbaijani az ✓
Bangla bn ✓
Bashkir ba ✓
Basque eu ✓
Belarusian be ✓
Bulgarian bg ✓
Chuvash cv ✓
Esperanto eo ✓
Estonian et ✓
Georgian ka ✓
Gujarati gu ✓
Haitian Creole ht ✓
Hungarian hu ✓
Icelandic is ✓
Irish ga ✓
Kazakh kk ✓
Khmer km ✓
Kurdish ku ✓
Kyrgyz ky ✓
Latvian lv ✓
Lithuanian lt ✓
Malay ms ✓
Malayalam ml ✓
Maltese mt ✓
Mongolian mn ✓
Pashto ps ✓
Persian fa ✓
Punjabi pa ✓
Slovenian sl ✓
Somali so ✓
Tamil ta ✓
Telugu te ✓
Thai th ✓
Ukrainian uk ✓
Urdu ur ✓
Vietnamese vi ✓
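The two tables can be combined into a single check that tells an application how far it can process text in a given language. This is a sketch under assumptions: `CORE_LANGUAGES`, `IDENTIFICATION_ONLY`, and `support_level` are illustrative names, and the code sets are transcribed from the tables in this section rather than queried from Watson NLP.

```python
# Locale codes transcribed from the core-function table (31 entries).
CORE_LANGUAGES = {
    "af", "ar", "bs", "ca", "zh_CN", "zh_TW", "hr", "cs", "da", "nl", "en",
    "fi", "fr", "de", "el", "he", "hi", "it", "ja", "ko", "nb", "nn", "pl",
    "pt", "ro", "ru", "sr", "sk", "es", "sv", "tr",
}

# Locale codes transcribed from the identification-only table (38 entries).
IDENTIFICATION_ONLY = {
    "sq", "hy", "az", "bn", "ba", "eu", "be", "bg", "cv", "eo", "et", "ka",
    "gu", "ht", "hu", "is", "ga", "kk", "km", "ku", "ky", "lv", "lt", "ms",
    "ml", "mt", "mn", "ps", "fa", "pa", "sl", "so", "ta", "te", "th", "uk",
    "ur", "vi",
}


def support_level(locale_code: str) -> str:
    """Classify a locale code by the table it appears in."""
    if locale_code in CORE_LANGUAGES:
        return "core"
    if locale_code in IDENTIFICATION_ONLY:
        return "identification-only"
    return "unsupported"


for code in ("en", "uk", "xx"):
    print(code, support_level(code))
```

A result of "core" only means the language appears in the first table; individual columns such as dependency parsing still need to be checked per row.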

Language Dialects, Writing Systems

Arabic

Watson NLP supports Standard Arabic (SA), which is used across the Middle East and North Africa. It is reported to be less accurate for Dialectal Arabic (DA), such as Egyptian Arabic.

Chinese

Watson NLP supports Simplified Chinese (zh_CN), used in Mainland China, and Traditional Chinese (zh_TW), used in Taiwan.

Watson NLP is not tested for Cantonese, which is used in Hong Kong. Note the following points for Cantonese:

  1. Vocabulary: Cantonese uses the same grammar but a different vocabulary, with some overlap. Written Cantonese might be covered to some extent, but coverage may be incomplete.

  2. Character set: Watson NLP supports Unicode, which includes Cantonese characters. However, these characters were not well standardized before 2004 (e.g. GCCS, HKSCS-1999, HKSCS-2001). Data from some older systems may be affected and may be incompatible with systems based on the latest Unicode, including Watson NLP.

  3. Lemma: Watson NLP always normalizes the lemma of Chinese words to Simplified Chinese.

Portuguese

Watson NLP supports both European Portuguese (pt_PT), used in Portugal, and Brazilian Portuguese (pt_BR), used in Brazil. The two differ somewhat in orthography, but the differences are not significant. Watson NLP handles both as Portuguese (pt), using combined dictionaries and models.
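Because regional Portuguese locales share one model while Chinese locales do not, an application may want to normalize incoming locale strings before choosing a model. The sketch below is one possible approach: `model_language` and `REGION_SENSITIVE` are hypothetical names, and the underscore form (`zh_CN`) is an assumption taken from the core-function table; other parts of this page write the same codes as `zh-cn`.

```python
# Languages whose region must be kept because separate entries exist per region.
REGION_SENSITIVE = {"zh"}


def model_language(locale: str) -> str:
    """Reduce a locale such as 'pt_BR' or 'pt-PT' to the code used in the tables."""
    parts = locale.replace("-", "_").split("_")
    language = parts[0].lower()
    if language in REGION_SENSITIVE and len(parts) > 1:
        # Keep the region for Chinese: zh_CN and zh_TW are listed separately.
        return f"{language}_{parts[1].upper()}"
    # For everything else, including pt_PT and pt_BR, drop the region.
    return language


for locale in ("pt_BR", "pt-PT", "zh-cn", "en"):
    print(locale, "->", model_language(locale))
```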

Serbian

The Serbian language has two writing systems: Cyrillic script and Latin script. There is a direct transliteration between the two. Watson NLP supports both scripts.

Bosnian and Croatian

Bosnian and Croatian belong to the same language family and are difficult to distinguish in written form. Currently, the Watson NLP language detection module outputs hr for Bosnian text.
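Downstream code that routes documents by detected language may want to account for this behavior. The following sketch assumes the detected code is already available as a plain string; `AMBIGUOUS_DETECTIONS` and `possible_languages` are hypothetical names and not part of the Watson NLP API.

```python
# Detected codes that may stand for more than one language, per the note above.
AMBIGUOUS_DETECTIONS = {
    "hr": ("hr", "bs"),  # Bosnian text is reported as Croatian
}


def possible_languages(detected_code: str) -> tuple:
    """Return every language a detected code might represent."""
    return AMBIGUOUS_DETECTIONS.get(detected_code, (detected_code,))


print(possible_languages("hr"))  # ('hr', 'bs')
print(possible_languages("en"))  # ('en',)
```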