Stemming algorithms differ widely, though they share some general modes of operation. A stemmer removes suffixes by checking each input word token against a predefined list of common suffixes. If a matching suffix is found, and the word does not violate any rules or conditions attached to that suffix, the stemmer strips the suffix string from the word. Some stemmers (for example, the Lovins stemmer) run the resulting stems through an additional set of rules to correct malformed roots.
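This mode of operation can be sketched in a few lines of Python. The suffix list and conditions below are illustrative inventions, not the actual Lovins or Porter rule sets: each suffix carries a condition that the remaining stem must satisfy before removal.

```python
# Illustrative sketch of rule-based suffix stripping. The suffixes and
# conditions here are invented for demonstration; real stemmers such as
# Lovins or Porter use much larger, carefully ordered rule lists.

def has_vowel(stem):
    """Condition: the remaining stem must still contain a vowel."""
    return any(ch in "aeiou" for ch in stem)

def min_length(n):
    """Condition: the remaining stem must be at least n characters."""
    return lambda stem: len(stem) >= n

# (suffix, condition) pairs, checked in order
SUFFIX_RULES = [
    ("ing", has_vowel),
    ("ness", min_length(3)),
    ("es", min_length(2)),
    ("s", min_length(3)),
]

def strip_suffix(word):
    for suffix, condition in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if condition(stem):
                return stem
            return word  # suffix matched, but a condition blocks removal
    return word

print(strip_suffix("thinking"))  # think
print(strip_suffix("sing"))      # sing -- "s" has no vowel, so -ing stays
```

The second example shows why conditions matter: stripping -ing from sing would leave the vowel-less fragment s, so the rule blocks the removal.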
The most widely used algorithm is the Porter stemming algorithm and its updated version, the Snowball stemmer. To better understand stemming, we can run the following passage from Shakespeare’s Hamlet through the Snowball stemmer: “There is nothing either good or bad but thinking makes it so.”
The Python natural language toolkit (NLTK) contains built-in functions for the Snowball and Porter stemmers. After tokenizing the Hamlet quotation using NLTK, we can pass the tokenized text through the Snowball stemmer using this code:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize  # requires the NLTK "punkt" tokenizer data
stemmer = SnowballStemmer("english")
text = "There is nothing either good or bad but thinking makes it so."
words = word_tokenize(text)
stemmed_words = [stemmer.stem(word) for word in words]
print("Original:", text)
print("Tokenized:", words)
print("Stemmed:", stemmed_words)
The code outputs:
Original: There is nothing either good or bad but thinking makes it so.
Tokenized: ['There', 'is', 'nothing', 'either', 'good', 'or', 'bad', 'but', 'thinking', 'makes', 'it', 'so', '.']
Stemmed: ['there', 'is', 'noth', 'either', 'good', 'or', 'bad', 'but', 'think', 'make', 'it', 'so', '.']
The Snowball and Porter stemming algorithms eliminate suffixes by a more mathematical method than other stemmers. In short, the stemmer runs every word token against a list of rules specifying suffix strings to remove according to the number of vowel and consonant groups in a token.5 Of course, because the English language follows general but not absolute lexical rules, the stemming algorithm’s systematic criterion can produce errors, such as noth.
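The “vowel and consonant groups” criterion can be sketched as Porter’s measure m: a word is viewed as the pattern [C](VC)^m[V], where V is a maximal run of vowels and C a maximal run of consonants, and m counts the VC groups. A minimal sketch, omitting Porter’s special handling of the letter y:

```python
# Sketch of Porter's measure m: a word has the form [C](VC)^m [V].
# m counts the transitions from a vowel run to a consonant run.
# Porter's special treatment of "y" is omitted for simplicity.

VOWELS = set("aeiou")

def measure(word):
    """Count the VC groups (Porter's m) in a lowercase word."""
    m = 0
    prev_is_vowel = False
    for ch in word.lower():
        is_vowel = ch in VOWELS
        if prev_is_vowel and not is_vowel:
            m += 1  # a vowel run just ended and a consonant run begins
        prev_is_vowel = is_vowel
    return m

print(measure("tree"))     # 0  (CV pattern, no VC group)
print(measure("trouble"))  # 1  (C-V-C-V)
print(measure("nothing"))  # 2  (C-V-C-V-C)
```

Porter’s rule for -ing only requires that the remaining stem contain a vowel; noth satisfies that condition, so the suffix is removed despite producing a malformed root.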
The stemmer removes -ing, a common ending signifying the present progressive. In the Hamlet quote, however, removing -ing erroneously produces the stem noth. This can prevent subsequent linguistic analysis from associating nothing with similar nouns, such as anything and something. Additionally, the stemmer leaves the irregular verb is unchanged. The Snowball stemmer similarly leaves other conjugations of to be, such as was and are, unstemmed. This can prevent models from properly associating the irregular conjugations of a given verb with one another.
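A common workaround, sketched below under the assumption of a tiny invented exception table (this is not part of the Snowball algorithm), is to consult a lookup of irregular forms before falling back to suffix stripping, so that is, was, and are all normalize to the same token:

```python
# Hypothetical normalize(): consult an exception table of irregular
# verb forms before applying a stemmer. The table is a tiny
# illustrative fragment, not a complete list of English irregulars.

IRREGULAR = {
    "is": "be", "was": "be", "are": "be", "were": "be", "been": "be",
}

def simple_stem(word):
    # Stand-in for a real stemmer such as NLTK's SnowballStemmer;
    # here it only strips a final "-ing" or "-s" for illustration.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(word):
    word = word.lower()
    return IRREGULAR.get(word, simple_stem(word))

print([normalize(w) for w in ["is", "was", "are", "thinking"]])
# ['be', 'be', 'be', 'think']
```

In practice, lemmatizers address irregular forms in essentially this way, through dictionary lookup rather than suffix rules alone.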