Lemmatization

This section describes implementation notes on lemmatization.

Overview

Watson NLP provides lemmatization. A lemma is the base form of a word, equivalent to a headword in a paper dictionary. Many languages derive various forms from the base form according to meaning or use. For example:

An English noun has a singular form and a plural form (e.g. singer: singer, singers).
An English verb has inflected forms according to tense, person, and so on (e.g. sing: sing, sings, singing, sang, sung).

A lemma is specific to a part of speech (e.g. singer (noun) and sing (verb) have different lemmas, though they share the same origin). It also depends on the context (e.g. My thought (noun, singular) -> thought; I thought (verb, past tense) -> think).

Applications often use lemmas to group a set of word forms into one for 1) document search (e.g. including derived forms in matches), and 2) statistical analysis (e.g. facets in Watson Discovery).

Implementation

In Watson NLP, the lemma is determined in the following steps:

The tokenizer looks up dictionaries and finds the possible lemma entries (e.g. thought -> {think, thought}).
The PoS tagger disambiguates them according to the context (e.g. I thought (verb, past tense) -> think, My thought (noun, singular) -> thought).
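
The two steps above can be sketched as follows. This is a toy illustration, not Watson NLP's implementation: the dictionary contents, the tag names, and the functions `candidate_lemmas` and `lemmatize` are all hypothetical, and the PoS tag is passed in by hand rather than produced by a tagger.

```python
from typing import Dict, Optional

# Hypothetical dictionary: surface form -> {PoS tag: lemma}.
# These few entries are illustrative only.
LEMMA_DICT: Dict[str, Dict[str, str]] = {
    "thought": {"VERB": "think", "NOUN": "thought"},
    "sang": {"VERB": "sing"},
    "singers": {"NOUN": "singer"},
}


def candidate_lemmas(surface: str) -> Dict[str, str]:
    """Step 1: dictionary lookup returns every possible (PoS, lemma) entry."""
    return LEMMA_DICT.get(surface.lower(), {})


def lemmatize(surface: str, pos_tag: str) -> Optional[str]:
    """Step 2: keep the candidate whose PoS agrees with the given tag.

    Returns None when the word is not in the dictionary or when no
    candidate entry carries the given tag.
    """
    return candidate_lemmas(surface).get(pos_tag)


print(lemmatize("thought", "VERB"))  # "I thought"  -> think
print(lemmatize("thought", "NOUN"))  # "My thought" -> thought
print(lemmatize("qwerty", "NOUN"))   # not in the dictionary -> None
```

Keeping the lookup and the disambiguation separate mirrors the description: the dictionary only proposes candidates, and the context-dependent decision is left to the tagger.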

It does not generate a lemma when:

The token is an out-of-vocabulary word (e.g. URL: http://www.ibm.com/watson?q=ai, number: 100, unknown: qwerty).
The lemma entries do not match the PoS tagging result (e.g. if the PoS tagger for some reason tags thought as an adverb, the tag matches none of the possible entries given by the tokenizer).

In JSON output, Watson NLP intentionally leaves the lemma of those tokens empty. You may simply use the surface text as a substitute for the lemma, or normalize the surface text into a canonical form (e.g. extracting the domain part from a URL). The right choice depends on the requirements of your application.
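
A minimal sketch of such a fallback is shown below. The token field names `text` and `lemma` and the function `lemma_or_fallback` are assumptions made for illustration; check the actual JSON output of your Watson NLP version for the real field names.

```python
from urllib.parse import urlparse


def lemma_or_fallback(token: dict) -> str:
    """Return the lemma if present, otherwise a normalized surface form.

    `token` is assumed to look like {"text": ..., "lemma": ...}.
    """
    lemma = token.get("lemma")
    if lemma:
        return lemma
    surface = token["text"]
    # Example normalization: reduce an http(s) URL to its domain part.
    try:
        parsed = urlparse(surface)
    except ValueError:
        return surface
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return parsed.netloc.lower()
    # Default: substitute the surface text as-is.
    return surface


tokens = [
    {"text": "thought", "lemma": "think"},
    {"text": "http://www.ibm.com/watson?q=ai", "lemma": ""},
    {"text": "100", "lemma": ""},
]
print([lemma_or_fallback(t) for t in tokens])
# ['think', 'www.ibm.com', '100']
```

Whether a URL should be reduced to its domain, kept whole, or dropped is an application decision; the function only shows where that decision would live.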

In CoNLL-U format output, Watson NLP copies the surface text of those tokens to the lemma. This makes it possible to compare the Watson NLP output with a UD (Universal Dependencies) corpus as ground truth and calculate accuracy scores.
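
As a rough illustration of such a comparison, the sketch below computes lemma accuracy between two CoNLL-U texts. It assumes both sides have identical tokenization, which a real evaluation cannot take for granted (it would align tokens first). The functions `read_lemmas`, `lemma_accuracy`, and `row` are hypothetical helpers, and the sample data is made up.

```python
from typing import List, Tuple


def read_lemmas(conllu_text: str) -> List[Tuple[str, str]]:
    """Collect (FORM, LEMMA) pairs from CoNLL-U text."""
    pairs = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line (sentence boundary) or comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed rows, ranges ("1-2"), empty nodes ("1.1")
        pairs.append((cols[1], cols[2]))  # FORM is column 2, LEMMA is column 3
    return pairs


def lemma_accuracy(system: str, gold: str) -> float:
    """Fraction of tokens whose system lemma equals the gold lemma."""
    sys_pairs, gold_pairs = read_lemmas(system), read_lemmas(gold)
    if len(sys_pairs) != len(gold_pairs):
        raise ValueError("tokenization differs; align tokens before scoring")
    if not gold_pairs:
        return 0.0
    correct = sum(s[1] == g[1] for s, g in zip(sys_pairs, gold_pairs))
    return correct / len(gold_pairs)


def row(idx: int, form: str, lemma: str, upos: str) -> str:
    """Build one 10-column CoNLL-U row with the unused columns left as '_'."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_", "_", "_", "_", "_"])


GOLD = "\n".join(["# text = I thought", row(1, "I", "I", "PRON"), row(2, "thought", "think", "VERB")])
SYSTEM = "\n".join(["# text = I thought", row(1, "I", "I", "PRON"), row(2, "thought", "thought", "NOUN")])

print(lemma_accuracy(SYSTEM, GOLD))  # 0.5
```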

Difference between Stem and Lemma

A stem is a common prefix of a set of words (e.g. {singer, singers, sing, sings, singing} -> sing-). It is used as a substitute for the lemma when 1) lemmatization is not supported, or 2) computational resources are limited.

A stem does not distinguish parts of speech (e.g. {singer, singers, sing, sings, singing} -> sing-).
It does not take the context into account (e.g. thought -> thought, always).
Sometimes words with the same lemma fall into different stems (e.g. {sing, sings, singing} -> sing, {sang} -> sang, {sung} -> sung).
Sometimes different words fall into the same stem (e.g. {experiment, experience} -> experi-).
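
The behaviors listed above can be reproduced with a deliberately naive suffix-stripping stemmer. This is a toy written for this illustration, not a published stemming algorithm and not part of Watson NLP; the suffix list, the minimum stem length, and the name `naive_stem` are assumptions.

```python
# Suffixes are tried in this order; the longer ones come first.
SUFFIXES = ("ment", "ence", "ers", "ing", "er", "s")
MIN_STEM_LENGTH = 3  # avoid stripping "sing" down to "s"


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, ignoring PoS and context."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


# Noun and verb forms collapse into one stem.
print({w: naive_stem(w) for w in ["singer", "singers", "sing", "sings", "singing"]})
# Irregular forms with the same lemma keep separate stems.
print({w: naive_stem(w) for w in ["sang", "sung"]})
# Unrelated words can share a stem.
print({w: naive_stem(w) for w in ["experiment", "experience"]})
```

Because the function sees only the string, "thought" always maps to "thought", whether it is used as a noun or a verb.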

Difference between Synonym and Lemma

A synonym is a word that has the same meaning as another word (e.g. {thought, idea, opinion, view}). Synonyms are also useful for grouping a set of words (e.g. facets in Watson Discovery), but they do not necessarily share the same base form. How synonyms are grouped depends on the use case of the application. For example:

Someone in the medical domain may want to add {cerebration, intellection, mentation} to the synonyms of thought.
Others may want to exclude view from the synonyms of thought, because view can belong to another synonym group {view, vista, panorama, scene} and has a slightly different meaning.

Also, it is sometimes not obvious what the synonyms and lemma of a proper noun or an abbreviation are. For example:

Sun. is an abbreviation of Sunday. Most people would probably agree that the lemma of Sun. is Sunday. Similarly, St. is an abbreviation of street. In this case, however, it would be better to leave it as is when it is used in an address.
IBM stands for IBM Corporation, or International Business Machines. They could have the same lemma and be synonyms.
EPS means Earnings Per Share in the finance domain. They could have the same lemma and be synonyms in that domain. But EPS can have different meanings in different domains (Wikipedia has more than 40 entries for EPS).

Applications should allow users to define their own synonyms according to their use case and domain.
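
One way an application might support this is a per-domain synonym table supplied by the user. The sketch below is only an assumption about how such a table could look; the group contents follow the examples in this section, and `build_synonym_index` and `canonical` are hypothetical names.

```python
from typing import Dict, List


def build_synonym_index(groups: List[List[str]]) -> Dict[str, str]:
    """Map every word to the first word of its user-defined group."""
    index = {}
    for group in groups:
        head = group[0]
        for word in group:
            index[word.lower()] = head
    return index


def canonical(word: str, index: Dict[str, str]) -> str:
    """Return the group representative, or the word itself if ungrouped."""
    return index.get(word.lower(), word)


# Hypothetical user-defined groups for three different use cases.
GENERAL = build_synonym_index([
    ["thought", "idea", "opinion"],
    ["view", "vista", "panorama", "scene"],
])
MEDICAL = build_synonym_index([
    ["thought", "idea", "opinion", "cerebration", "intellection", "mentation"],
])
FINANCE = build_synonym_index([["Earnings Per Share", "EPS"]])

print(canonical("mentation", MEDICAL))  # thought
print(canonical("mentation", GENERAL))  # mentation (not grouped here)
print(canonical("EPS", FINANCE))        # Earnings Per Share
print(canonical("EPS", GENERAL))        # EPS (meaning is domain-specific)
```

Keeping one index per domain means the same surface word can be grouped differently, or not at all, depending on which table the application loads.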