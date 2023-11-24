The Lovins stemmer is the first published stemming algorithm. Essentially, it functions as a heavily parametrized find-and-replace function. It compares every input token against a list of common suffixes, with each suffix conditioned by one of 29 rules. If one of the list’s suffixes is found in a token, and removing the suffix does not violate any of the associated suffix’s conditions, the algorithm removes that suffix from the token. The stemmed token is then run through another set of rules, correcting for common malformations in stemmed roots, such as double letters (for example, hopping becomes hopp becomes hop).6

This code uses the Python stemming library7 to stem the tokenized Shakespeare quotation:

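from stemming.lovins import stem
from nltk.tokenize import word_tokenize

text = "Love looks not with the eyes but with the mind, and therefore is winged Cupid painted blind."

# Split the quotation into word and punctuation tokens
words = word_tokenize(text)

# Stem each token with the Lovins stemmer
stemmed_words = [stem(word) for word in words]

print("Stemmed:", stemmed_words)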

The code outputs:

Stemmed: ['Lov', 'look', 'not', 'with', 'th', 'ey', 'but', 'with', 'th', 'mind', ',', 'and', 'therefor', 'is', 'wing', 'Cupid', 'paint', 'blind', '.']

The output shows how the Lovins stemmer correctly reduces conjugated and inflected forms to base forms (for example, painted becomes paint, winged becomes wing, and looks becomes look). But the Lovins stemming algorithm also returns a number of ill-formed stems, such as lov, th, and ey. The last of these comes from eyes: the stemmer removes the plural ending but strips one character too many, leaving ey rather than eye. These malformed root words result from removing too many characters. As is often the case in machine learning, such errors help reveal underlying processes.
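To see exactly which characters the stemmer removed, the tokens can be lined up against the stems in the output. The sketch below hard-codes both lists. The token list is an assumption about what word_tokenize returns for this sentence (it matches the 19 entries in the output), and removed_suffix is a helper name introduced here for illustration only.

```python
# Tokens assumed to be what word_tokenize produces for the quotation.
tokens = ["Love", "looks", "not", "with", "the", "eyes", "but", "with", "the",
          "mind", ",", "and", "therefore", "is", "winged", "Cupid", "painted",
          "blind", "."]

# Stems copied from the output shown in the article.
stems = ["Lov", "look", "not", "with", "th", "ey", "but", "with", "th",
         "mind", ",", "and", "therefor", "is", "wing", "Cupid", "paint",
         "blind", "."]


def removed_suffix(token, stem):
    """Return the characters stripped from the end of token.

    Returns None when the stem is not a prefix of the token, which would
    mean the stemmer rewrote the word rather than only trimming it.
    """
    if not token.startswith(stem):
        return None
    return token[len(stem):]


# Keep only the tokens that actually lost characters.
removed = {}
for token, stem in zip(tokens, stems):
    suffix = removed_suffix(token, stem)
    if suffix:
        removed[token] = suffix

for token, suffix in removed.items():
    print(f"{token}: -{suffix}")
```

Read this way, the output shows -e removed from Love, the, and therefore, -es from eyes, -s from looks, and -ed from winged and painted.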

When compared against the Lovins stemmer’s list of suffixes, the longest suffix fitting both love and the is the single-character -e. The only condition attached to this suffix is “No restrictions on stem,” meaning the stemmer may remove -e no matter the remaining stem’s length. Unfortunately, neither of the stems lov or th contains any of the characteristics the Lovins algorithm uses to identify malformed words, such as double letters or irregular plurals.8
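The two stages described so far (longest-suffix removal under a condition, then a recoding pass that repairs stems such as hopp) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Lovins algorithm itself: the three-entry suffix table, the min_length_3 condition, the MIN_STEM guard, and all function names are hypothetical stand-ins for the published suffix list and its conditions.

```python
# Toy sketch only: a tiny, hypothetical suffix table, not the Lovins list.

MIN_STEM = 2  # guard assumed for this sketch so stems never shrink below two letters


def no_restriction(stem):
    """Condition that always allows removal ("No restrictions on stem")."""
    return True


def min_length_3(stem):
    """Hypothetical condition: allow removal only if three letters remain."""
    return len(stem) >= 3


# Each suffix is paired with the condition its removal must satisfy.
TOY_SUFFIXES = {
    "ing": min_length_3,
    "ed": min_length_3,
    "e": no_restriction,
}

# Letters whose doubled form at the end of a stem is collapsed when recoding.
UNDOUBLE = "bdglmnprst"


def strip_suffix(token):
    """Remove the longest listed suffix whose condition the stem satisfies."""
    for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix):
            stem = token[: -len(suffix)]
            if len(stem) >= MIN_STEM and TOY_SUFFIXES[suffix](stem):
                return stem
    return token  # no suffix applied


def recode(stem):
    """Collapse a trailing double letter, e.g. hopp -> hop."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in UNDOUBLE:
        return stem[:-1]
    return stem


def toy_stem(token):
    """Run both stages: suffix removal, then recoding."""
    return recode(strip_suffix(token.lower()))


for word in ["hopping", "painted", "love", "the"]:
    print(word, "->", toy_stem(word))
```

In this sketch, hopping passes through hopp and is repaired to hop, while lov and th reach the recoding stage with nothing for it to correct, which mirrors why those stems survive in the real output.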

When such malformed stems escape the algorithm, the Lovins stemmer can reduce semantically unrelated words to the same stem. For example, the, these, and this all reduce to th. Of course, these three words are all demonstratives, and so share a grammatical function. But other demonstratives, such as that and those, do not reduce to th. This means the Lovins-generated stems do not properly represent word groups.
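One way to surface this kind of conflation is to group words by the stem they receive and look for groups with more than one member. The sketch below is a minimal illustration: group_by_stem and reported_stem are names introduced here, and reported_stem is only a lookup of the three stems reported above. It returns any other word unchanged as a placeholder, which is not real Lovins output for that or those.

```python
from collections import defaultdict


def group_by_stem(words, stem_fn):
    """Map each stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem_fn(word)].append(word)
    return dict(groups)


# Stems as reported in the text above; nothing else is known here.
REPORTED_STEMS = {"the": "th", "these": "th", "this": "th"}


def reported_stem(word):
    """Placeholder stemmer: look up a reported stem, else return the word as is.

    The fallback is an assumption for illustration, not actual stemmer output.
    """
    return REPORTED_STEMS.get(word, word)


words = ["the", "these", "this", "that", "those"]
groups = group_by_stem(words, reported_stem)

# A stem shared by several distinct words marks a conflated group.
conflated = {stem: members for stem, members in groups.items() if len(members) > 1}
print(conflated)
```

To check a real stemmer, the same function can be called with the stem function imported from stemming.lovins in place of reported_stem.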