Lemmatization
This section provides implementation notes on lemmatization.
Overview
Watson NLP provides lemmatization. A lemma is the base form of a word. It is equivalent to a headword in a paper dictionary (vocabulary). Many languages derive various forms from the base form according to meaning or use. For example:
- An English noun has a singular form and a plural form (e.g. singer: singer, singers).
- An English verb has inflected forms according to tense, person, and so on (e.g. sing: sing, sings, singing, sang, sung).
A lemma distinguishes part of speech (PoS): for example, singer (noun) and sing (verb) have different lemmas, though they share the same origin. It also depends on the context (e.g. My thought (noun, singular) -> thought, I thought (verb, past tense) -> think).
Applications often use lemmas to group a set of words into one for 1) document search (e.g. matching derived forms as well as the query word), and 2) statistical analysis (e.g. facets in Watson Discovery).
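The two uses above can be sketched as follows. This is a minimal illustration, not Watson NLP or Watson Discovery code: the (surface text, lemma) pairs are hand-written sample data, and the names DOCS, build_lemma_index, and lemma_counts are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical analyzer output: (surface text, lemma) pairs per document.
# In a real application these pairs would come from a lemmatizer.
DOCS = {
    "doc1": [("She", "she"), ("sang", "sing"), ("yesterday", "yesterday")],
    "doc2": [("He", "he"), ("sings", "sing"), ("daily", "daily")],
    "doc3": [("Two", "two"), ("singers", "singer"), ("arrived", "arrive")],
}


def build_lemma_index(docs):
    """Map each lemma to the set of document ids that contain it (search)."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for _surface, lemma in tokens:
            index[lemma].add(doc_id)
    return index


def lemma_counts(docs):
    """Count lemma occurrences across all documents (a simple facet)."""
    return Counter(lemma for tokens in docs.values() for _surface, lemma in tokens)


index = build_lemma_index(DOCS)
counts = lemma_counts(DOCS)
print(sorted(index["sing"]))  # ['doc1', 'doc2']
print(counts["sing"])         # 2
```

A query for sing matches documents containing sang and sings, but not the one containing singers, because singer is a separate lemma.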
Implementation
In Watson NLP, the lemma is determined by the following steps:
- The tokenizer looks up dictionaries and finds possible lemma entries (e.g. thought -> {think, thought}).
- The PoS tagger disambiguates them according to the context (e.g. I thought (verb, past tense) -> think, My thought (noun, singular) -> thought).
Watson NLP does not generate a lemma when:
- The token is an out-of-vocabulary word (e.g. URL: http://www.ibm.com/watson?q=ai, number: 100, unknown: qwerty).
- The lemma entries do not match the PoS tagging result (e.g. if the PoS tagger tags thought as an adverb for some reason, it does not match any of the possible entries given by the tokenizer).
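The lookup and disambiguation steps, together with the two cases where no lemma is produced, can be sketched with a toy dictionary. This is only an illustration under stated assumptions: LEXICON, candidate_lemmas, and lemmatize are hypothetical names, the entries are hand-written, and the PoS tag is passed in by the caller rather than predicted by a tagger. It does not reflect Watson NLP's internal data structures.

```python
# Toy dictionary: surface form -> {PoS tag: lemma}. Entries are illustrative only.
LEXICON = {
    "thought": {"VERB": "think", "NOUN": "thought"},
    "sang": {"VERB": "sing"},
    "singers": {"NOUN": "singer"},
}


def candidate_lemmas(surface):
    """Step 1: dictionary lookup. Returns {} for out-of-vocabulary words."""
    return LEXICON.get(surface.lower(), {})


def lemmatize(surface, pos_tag):
    """Step 2: pick the candidate that agrees with the PoS tag, else None."""
    return candidate_lemmas(surface).get(pos_tag)


print(lemmatize("thought", "VERB"))  # think
print(lemmatize("thought", "NOUN"))  # thought
print(lemmatize("qwerty", "NOUN"))   # None (out of vocabulary)
print(lemmatize("thought", "ADV"))   # None (no entry matches the tag)
```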
In JSON output, Watson NLP intentionally leaves the lemma empty for tokens whose lemma is not generated. You may simply use the surface text as a substitute for the lemma, or apply some normalization to the surface text to canonicalize it (e.g. extracting the domain part from a URL). The right choice depends on the requirements of your application.
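One possible fallback policy is sketched below. The token is modeled as a plain dict with hypothetical "text" and "lemma" keys, which may not match the actual JSON field names. Lowercasing the surface text and using the host name for URLs are example choices, not recommendations from Watson NLP.

```python
from urllib.parse import urlparse


def lemma_or_fallback(token):
    """Return the lemma, or a substitute derived from the surface text."""
    lemma = token.get("lemma")
    if lemma:
        return lemma
    text = token["text"]
    try:
        parsed = urlparse(text)
    except ValueError:
        # Malformed URL-like text: fall through to the plain-text fallback.
        parsed = None
    if parsed is not None and parsed.scheme in ("http", "https") and parsed.hostname:
        return parsed.hostname  # e.g. keep only the domain part of a URL
    return text.lower()  # simple normalization of the surface text


print(lemma_or_fallback({"text": "thought", "lemma": "think"}))
print(lemma_or_fallback({"text": "http://www.ibm.com/watson?q=ai", "lemma": ""}))
print(lemma_or_fallback({"text": "Qwerty", "lemma": ""}))
```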
In CoNLL-U format output, Watson NLP copies the surface text to the lemma field for tokens whose lemma is not generated. This makes it possible to compare the Watson NLP output with a UD corpus (ground truth) and calculate accuracy scores.
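The copy-through behavior can be illustrated with a small formatter. This is a sketch, not Watson NLP's exporter: the token dict keys ("text", "lemma", "upos") and the function name to_conllu_line are assumptions, and only the first four of the ten CoNLL-U columns are filled in.

```python
def to_conllu_line(index, token):
    """Format one token as a 10-column, tab-separated CoNLL-U line.

    Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    Columns not covered by this sketch are written as "_".
    """
    form = token["text"]
    lemma = token.get("lemma") or form  # copy the surface text when empty
    upos = token.get("upos") or "_"
    return "\t".join([str(index), form, lemma, upos] + ["_"] * 6)


tokens = [
    {"text": "I", "lemma": "I", "upos": "PRON"},
    {"text": "typed", "lemma": "type", "upos": "VERB"},
    {"text": "qwerty", "lemma": "", "upos": "X"},
]
lines = [to_conllu_line(i, t) for i, t in enumerate(tokens, start=1)]
print("\n".join(lines))
```

With the lemma column always populated, a lemma accuracy score can be computed by comparing that column with the gold corpus token by token.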
Difference between Stem and Lemma
A stem is a common prefix of a set of words (e.g. {singer, singers, sing, sings, singing} -> sing-). It is used as a substitute for the lemma when 1) lemmatization is not supported, or 2) computational resources are limited.
- It does not distinguish part of speech (e.g. {singer, singers, sing, sings, singing} -> sing-).
- It does not take the context into account (e.g. thought -> thought, always).
- Sometimes words with the same lemma fall into different stems (e.g. {sing, sings, singing} -> sing, {sang} -> sang, {sung} -> sung).
- Sometimes different words fall into the same stem (e.g. {experiment, experience} -> experi-).
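A deliberately naive suffix stripper is enough to reproduce these behaviors. The suffix list, the minimum stem length, and the name toy_stem are assumptions chosen to fit the examples above. Real stemmers such as the Porter stemmer use much richer rules.

```python
# Illustrative suffix list only, ordered longest first.
SUFFIXES = ("ment", "ence", "ers", "ing", "er", "s")
MIN_STEM_LEN = 3  # avoid reducing "sing" to "s"


def toy_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LEN letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


words = ["singer", "singers", "sing", "sings", "singing",
         "sang", "sung", "thought", "experiment", "experience"]
for w in words:
    print(w, "->", toy_stem(w))
```

The function sees only the word itself, so it cannot use part of speech or context: thought always maps to thought, the noun singer collapses into the verb stem sing, and sang and sung stay apart from sing.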
Difference between Synonym and Lemma
A synonym is a word that has the same meaning as another word (e.g. {thought, idea, opinion, view}). Synonyms are also useful for grouping a set of words (e.g. facets in Watson Discovery). Synonyms do not necessarily have the same base form. The grouping of synonyms depends on the use case of the application. For example:
- Someone in the medical domain may want to add {cerebration, intellection, mentation} to the synonyms of thought.
- Others may want to exclude view from the synonyms of thought, because view can belong to another synonym group {view, vista, panorama, scene} and has a slightly different meaning.
Also, it is sometimes not obvious what the synonyms and the lemma of a proper noun or an abbreviation are. For example:
- Sun. is an abbreviation of Sunday. Most people would probably agree that the lemma of Sun. is Sunday. Similarly, St. is an abbreviation of street. However, in this case it would be better to leave it as is when it is used in an address.
- IBM stands for IBM Corporation, or International Business Machines. They could have the same lemma and be synonyms.
- EPS means Earnings Per Share in the finance domain. They could have the same lemma and be synonyms in that domain. But EPS can have different meanings in other domains (Wikipedia has more than 40 entries for EPS).
Applications should allow users to define their own synonyms according to the customer's use case and domain.
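One way to support user-defined synonyms is to keep a separate set of synonym groups per domain. The sketch below uses a hypothetical SYNONYMS table and synonyms_of helper. The groups are taken from the examples in this section and are not shipped with any product.

```python
# Hypothetical, user-defined synonym groups keyed by domain.
SYNONYMS = {
    "general": [
        {"thought", "idea", "opinion"},
        {"view", "vista", "panorama", "scene"},
    ],
    "medical": [
        {"thought", "cerebration", "intellection", "mentation"},
    ],
    "finance": [
        {"EPS", "Earnings Per Share"},
    ],
}


def synonyms_of(term, domain):
    """Return the terms grouped with `term` in the given domain, including itself."""
    for group in SYNONYMS.get(domain, []):
        if term in group:
            return set(group)  # copy, so callers cannot modify the table
    return {term}  # no group defined: the term stands alone


print(sorted(synonyms_of("thought", "medical")))
print(sorted(synonyms_of("EPS", "finance")))
print(sorted(synonyms_of("EPS", "general")))
```

Because the lookup is scoped by domain, EPS expands to Earnings Per Share only in the finance domain, and view stays out of the thought group in the general domain.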