Stemming and lemmatization

Authors

Jacob Murel, Ph.D., Senior Technical Content Creator

Eda Kavlakoglu, Business Development + Partnerships, IBM Research

What are stemming and lemmatization?

In natural language processing (NLP), stemming and lemmatization are text preprocessing techniques that reduce the inflected forms of words across a text data set to one common root word or dictionary form, also known as a “lemma” in computational linguistics.1

Stemming and lemmatization are particularly helpful in information retrieval systems like search engines where users may submit a query with one word (for example, meditate) but expect results that use any inflected form of the word (for example, meditates, meditation, etc.). Stemming and lemmatization further aim to improve text processing in machine learning algorithms.
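For example, here is a minimal sketch of how an information retrieval system might use stemming to match a query against documents. The document set and matching rule are purely illustrative, not part of any particular search library:

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")

def stems(text):
    # Reduce every token in the text to its stemmed form
    return {stemmer.stem(token) for token in word_tokenize(text.lower())}

query = "meditate"
documents = [
    "She meditates every morning.",      # inflected verb form
    "Meditation improves focus.",        # derived noun form
    "The committee mediates disputes.",  # unrelated word
]

# A document matches if it shares at least one stem with the query;
# the first two documents should match, the third should not
query_stems = stems(query)
for doc in documents:
    if query_stems & stems(doc):
        print("Match:", doc)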

Diagram exemplifying stemming of morphological variants for dance

Why stemming and lemmatization?

Researchers debate whether artificial intelligence can reason, and this debate has extended to computational linguistics. Can chatbots and deep learning models only process linguistic forms, or can they understand semantics?2 Whatever one believes on this matter, it remains that machine learning models must be trained to recognize different words as morphological variants of one base word; that is, they process words according to morphology, not semantics. By reducing derivational word forms to one stem word, stemming and lemmatization help information retrieval systems and deep learning models equate morphologically related words.

For many text mining tasks including text classification, clustering, indexing, and more, stemming and lemmatization help improve accuracy by shrinking the dimensionality of the feature space a machine learning algorithm must handle and by grouping morphologically related words. This reduction in dimensionality can, in turn, improve the accuracy and precision of statistical models in NLP, such as topic models and word vector models.3
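To make the dimensionality point concrete, the short sketch below counts unique tokens in a sentence before and after stemming with NLTK's Snowball stemmer. The sentence is only an illustration, but the collapse of several surface forms of dance into one stem shows how the feature space shrinks:

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")
text = "She dances, he danced, they dance, and everyone loves dancing."

# Keep only alphabetic tokens, lowercased
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]

# Four surface forms of "dance" collapse to a single stem, so a model
# that assigns one feature per unique token has fewer features to learn
print("Unique tokens:", len(set(tokens)))
print("Unique stems:", len({stemmer.stem(t) for t in tokens}))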


Stemming versus lemmatization

Stemming and lemmatization function as one stage in text mining pipelines that convert raw text data into a structured format for machine processing. Both stemming and lemmatization strip affixes from inflected word forms, leaving only a root form.4 These processes amount to removing characters from the beginning and end of word tokens. The resulting roots, or base words, are then passed along for further processing. Beyond this basic similarity, stemming and lemmatization have key differences in how they reduce different forms of a word to one common base form.

How stemming works

Stemming algorithms differ widely, though they share some general modes of operation. Stemmers eliminate word suffixes by running input word tokens against a predefined list of common suffixes. If a token ends in one of these suffixes, the stemmer removes the suffix string, provided the token does not violate any rules or conditions attached to that suffix. Some stemmers (for example, the Lovins stemmer) run the resulting stems through an additional set of rules to correct for malformed roots.
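As a toy illustration of this suffix-list approach (deliberately far simpler than any real stemmer such as Porter or Lovins), each suffix below carries a condition that the remaining stem must satisfy before the suffix is stripped:

# Toy suffix-stripping stemmer, for illustration only: each rule pairs
# a suffix with a condition the remaining stem must satisfy
SUFFIX_RULES = [
    ("ing", lambda stem: len(stem) >= 3),  # "thinking" -> "think", but not "ring" -> "r"
    ("ly", lambda stem: len(stem) >= 3),   # "quickly" -> "quick"
    ("es", lambda stem: len(stem) >= 2),
    ("s", lambda stem: len(stem) >= 3),
]

def toy_stem(word):
    for suffix, condition in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if condition(stem):  # strip only when the rule's condition holds
                return stem
            break  # suffix matched, but the condition blocked removal
    return word

for w in ["thinking", "ring", "quickly", "makes"]:
    print(w, "->", toy_stem(w))  # note the malformed root "mak" for "makes"

Real stemmers differ mainly in how sophisticated these conditions are; the next section shows the condition the Porter family uses.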

The most widely used stemming algorithms are the Porter stemmer and its updated version, the Snowball stemmer. To better understand stemming, we can run the following passage from Shakespeare’s Hamlet through the Snowball stemmer: “There is nothing either good or bad but thinking makes it so.”

The Python natural language toolkit (NLTK) contains built-in functions for the Snowball and Porter stemmers. After tokenizing the Hamlet quotation using NLTK, we can pass the tokenized text through the Snowball stemmer using this code:

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

# ignore_stopwords=True tells the stemmer to pass stopwords through unstemmed
stemmer = SnowballStemmer("english", ignore_stopwords=True)
text = "There is nothing either good or bad but thinking makes it so."

# Split the passage into word tokens, then stem each token
words = word_tokenize(text)
stemmed_words = [stemmer.stem(word) for word in words]

print("Original:", text)
print("Tokenized:", words)
print("Stemmed:", stemmed_words)

The code outputs:

Original: There is nothing either good or bad but thinking makes it so.
Tokenized: ['There', 'is', 'nothing', 'either', 'good', 'or', 'bad', 'but', 'thinking', 'makes', 'it', 'so', '.']
Stemmed: ['there', 'is', 'noth', 'either', 'good', 'or', 'bad', 'but', 'think', 'make', 'it', 'so', '.']

The Snowball and Porter stemmer algorithms eliminate suffixes by a more mathematical method than other stemmers. In short, each stemmer runs every word token against a list of rules specifying suffix strings to remove according to the number of vowel and consonant groups in the token.5 Of course, because the English language follows general but not absolute lexical rules, this systematic criterion can return errors, such as noth.
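The number of vowel and consonant groups is formalized in Porter's algorithm as the measure m: every word can be written in the form [C](VC)^m[V], where C and V are maximal runs of consonants and vowels, and a suffix rule fires only when m exceeds a threshold. The sketch below computes a simplified version of the measure; it ignores Porter's special handling of the letter y:

import re

def porter_measure(word):
    # Simplification: treat a, e, i, o, u as vowels and ignore
    # Porter's special handling of "y"
    groups = re.findall(r"[aeiou]+|[^aeiou]+", word.lower())
    # Collapse the word into alternating V/C groups, then count VC pairs:
    # the word has the form [C](VC)^m[V], and m is the measure
    pattern = "".join("V" if g[0] in "aeiou" else "C" for g in groups)
    return pattern.count("VC")

# Porter's rule "(m > 0) ING -> (null)" strips -ing only when the
# remaining stem has positive measure, so "think" (m = 1) qualifies
for w in ["tree", "trouble", "private", "think"]:
    print(w, "m =", porter_measure(w))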

The stemmer removes -ing, a common ending signifying the present progressive. In the Hamlet quote, however, removing -ing erroneously produces the stemmed noth. This can inhibit subsequent linguistic analysis from associating nothing with similar nouns, such as anything and something. Additionally, the stemmer leaves the irregular verb is unchanged. The Snowball stemmer similarly leaves other conjugations of to be, such as was and are, unstemmed. This can inhibit models from properly associating irregular conjugations of a given verb.
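This behavior is easy to verify with the same Snowball stemmer; each irregular form below should pass through unchanged:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# No suffix rule applies to these irregular conjugations of "to be",
# so the stemmer never conflates them with one another
for verb in ["is", "was", "are", "been"]:
    print(verb, "->", stemmer.stem(verb))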

How lemmatization works

Literature generally defines stemming as the process of stripping affixes from words to obtain stemmed word strings, and lemmatization as the larger enterprise of reducing morphological variants to one dictionary base form.6 The practical distinction between stemming and lemmatization is that, where stemming merely removes common suffixes from the end of word tokens, lemmatization ensures the output word is an existing normalized form of the word (that is, its lemma) that can be found in the dictionary.7
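The contrast is easy to see side by side. The sketch below compares Snowball stems with WordNet lemmas, passing an explicit part of speech to the lemmatizer. It assumes the NLTK WordNet data has been downloaded; the mapping of better to good comes from WordNet's exception lists:

from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

# (word, WordNet POS): "v" = verb, "a" = adjective
examples = [("studies", "v"), ("better", "a"), ("was", "v")]
for word, pos in examples:
    # A stem need not be a dictionary word ("studi"); a lemma always is
    print(word,
          "| stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos=pos))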

Because lemmatization aims to output dictionary base forms, it requires more robust morphological analysis than stemming. Part-of-speech (POS) tagging is a crucial step in lemmatization: it assigns each word a tag signifying its syntactic function in the sentence. The Python NLTK provides a function for the WordNet lemmatization algorithm, with which we can lemmatize the Hamlet passage:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag

def get_wordnet_pos(tag):
    # Map Penn Treebank POS tags (from pos_tag) to WordNet POS constants
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun for any other tag
        return wordnet.NOUN

def lemmatize_passage(text):
    words = word_tokenize(text)  # split the passage into tokens
    pos_tags = pos_tag(words)    # tag each token with its part of speech
    lemmatizer = WordNetLemmatizer()
    # Lemmatize each token using its mapped POS tag
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    return ' '.join(lemmatized_words)

text = "There is nothing either good or bad but thinking makes it so."
result = lemmatize_passage(text)

print("Original:", text)
print("Tokenized:", word_tokenize(text))
print("Lemmatized:", result)

The code returns:

Original: There is nothing either good or bad but thinking makes it so.
Tokenized: ['There', 'is', 'nothing', 'either', 'good', 'or', 'bad', 'but', 'thinking', 'makes', 'it', 'so', '.']
Lemmatized: There be nothing either good or bad but think make it so .

The WordNetLemmatizer, much like the Snowball stemmer, reduces verb conjugations to base forms, for example, thinking to think and makes to make. Unlike the Snowball stemming algorithm, however, the lemmatizer identifies nothing as a noun and appropriately leaves its -ing ending unaltered, while further reducing is to its base form be. In this way, the lemmatizer more appropriately conflates irregular verb forms.
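The POS tag is what makes this possible. Without one, the WordNetLemmatizer treats every word as a noun by default, so verb forms slip through, as this small check illustrates:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Default POS is noun: "thinking" is a valid noun, so it is unchanged
print(lemmatizer.lemmatize("thinking"))       # thinking
# Tagged as a verb, the same token reduces to its base form
print(lemmatizer.lemmatize("thinking", "v"))  # think
# Irregular verbs resolve through WordNet's exception lists
print(lemmatizer.lemmatize("is", "v"))        # be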


Limitations

Stemming and lemmatization primarily support the normalization of English-language text data. Both text normalization techniques also support several other Roman-script languages, such as French, German, and Spanish, and the Snowball stemmer further supports some non-Roman scripts, such as Russian Cyrillic. Development of stemming and lemmatization algorithms for other languages, notably Arabic, is a recent and ongoing area of research. Arabic is particularly challenging due to its agglutinative morphology, orthographic variations, and lexical ambiguity, among other features.8 Together, these features complicate any systematic method for identifying base word forms among morphological variants, at least when compared to English words.
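NLTK's Snowball implementation exposes its bundled languages directly, and each language ships with its own suffix rules. A quick way to inspect the list and try a non-English stemmer (outputs vary by language and stemmer version):

from nltk.stem.snowball import SnowballStemmer

# Languages bundled with NLTK's Snowball implementation
print(SnowballStemmer.languages)

# Each language applies its own suffix rules
print(SnowballStemmer("french").stem("continuellement"))
print(SnowballStemmer("russian").stem("программирование"))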

Beyond this general limitation, stemming and lemmatization have their respective disadvantages. As illustrated with the Hamlet example, stemming is a relatively heuristic, rule-based process of character string removal, and two common errors arise from it. Over-stemming is when two semantically distinct words are reduced to the same root (for example, news to new); under-stemming is when two semantically related words are not reduced to the same root (for example, knavish and knave remain knavish and knave respectively).9 Additionally, stemming only strips suffixes from words and so cannot account for irregular verb forms or prefixes as lemmatization does. Of course, stemming is relatively simple and straightforward to implement, while lemmatization can be more computationally expensive and time-consuming depending on the size of the data processed.
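Both error types are easy to reproduce with the Snowball stemmer. The word sets below are classic examples; exact outputs can vary across stemmer versions:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Over-stemming: semantically distinct words collapse to one root
print([stemmer.stem(w) for w in ["universal", "university", "universe"]])

# Under-stemming: morphologically related words keep distinct roots
print([stemmer.stem(w) for w in ["knavish", "knave"]])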
