Computers are adept at performing rapid searches of large stores of information for a given search term, but the search capability required for a spell checking application goes beyond exact string matching. In this article, I'll describe some of the history of search algorithms, including phonetic matching algorithms such as the Soundex and Metaphone and string similarity types such as the Dynamic Programming algorithm. I'll explain both the strengths and weaknesses of these algorithms in relation to spell checking, and then introduce one final variant, the Aspell algorithm, which was written specifically for spell checking applications.
The Aspell algorithm, which combines the best features of previous search-and-match algorithms, is the underlying framework of Jazzy, the spell checker API for the Java platform. In the second half of this article, you'll see how Jazzy applies the Aspell algorithm in a Java framework. I'll show you the steps by which Jazzy identifies a misspelled word and then provides a likely correction. I'll close the article with a working example that demonstrates the ease with which Jazzy can help you incorporate its spell checking features into your Java applications.
Spelling family names correctly can be a challenge. People with unusual family names often find their name has been mangled when making a booking on the phone. Even common names can be misspelled due to minor variations in spelling, for instance Smith and Smyth are two variations on a common name that sound the same.
The rich variation in the spelling of names, in particular, has led to some interesting spell checking algorithms. The first algorithm type I'll look at is the phonetic matching algorithm, which aims to solve the problem of "which names match those that sound like x." This algorithm type is quite common in search databases and other reference applications. When researching a family history, for example, users should be able to retrieve both exact matches and similar ones, thus allowing for the possibility that the family name has changed over the centuries or has been incorrectly spelled in some records.
Soundex algorithms have been used to index all US censuses from 1920 onwards, and are a staple of family history software. The original Soundex algorithm was patented in 1918 by Margaret K. Odell and Robert C. Russell (see Resources), who sought to "provide an index wherein names are entered and grouped phonetically rather than according to the alphabetic construction of the names."
Essentially, Soundex algorithms function by mapping each letter of a given
alphabet to a numerical code representing its phonetic group. In this scheme,
letters such as d and t are in the same group since they sound alike (in fact each letter is vocalized by a similar mechanism) and vowels are omitted altogether. By applying this mapping to a whole word a phonetic "key" for the word is produced. Words that sound alike will usually have the same key. For example, the Soundex key for both Smith and Smyth is S530.
One of the most common Soundex variants is the one popularized in Donald E. Knuth's The Art of Computer Programming. You can see a Java implementation of the algorithm in Listing 1. Note that the algorithm employs Java regular expressions, which are only available in release 1.4 onwards.
Listing 1. Knuth's Soundex
public class KnuthSoundex implements PhoneticEncoder {
// ABCDEFGHIJKLMNOPQRSTUVWXYZ
private static final String SOUNDEX_DIGITS = "01230120022455012623010202";
public String calculateCode(String string) {
String word = string.toUpperCase(); // 01 ASHCROFT
word = word.replaceAll("[^A-Z]", ""); // 02
if (word.length() == 0) { // 03
return ""; // 04
} // 05
char first = word.charAt(0); // 06
word = first + word.substring(1).replaceAll("[HW]", ""); // 07 ASCROFT
StringBuffer sndx = new StringBuffer(); // 08
for (int i = 0; i < word.length(); i++) { // 09
sndx.append(SOUNDEX_DIGITS.charAt((int) (word.charAt(i) - 'A'))); // 10
} // 11
word = sndx.toString().replaceAll("(.)\\1+", "$1"); // 12 026013
word = first + word.substring(1); // 13 A26013
word = word.replaceAll("0", ""); // 14 A2613
return (word + "000").substring(0, 4); // 15 A261
}
}
|
The above code is quite terse, so I'll go over what it does line by line:
- Lines 01 to 05 normalize the input to capital letters and drop all other characters.
- Line 06 ensures the first character of the word is unchanged.
- Line 07 drops subsequent
HorWcharacters. - Lines 08 to 11 replace each letter in the word with its phonetic code.
- Line 12 removes adjacent phonetic codes that are the same. (Note that this means, unlike vowels, intervening
HandWcharacters do not act as a barrier to combining runs of letters with the same code.) - Like line 06, line 13 ensures the first character of the word is unchanged.
- Line 14 drops all vowels.
- Line 15 constructs the Soundex by truncating the word to four characters (possibly padded with
0characters).
To really understand the algorithm it helps to run through it by hand. The right-hand column of the code tracks the value of the word variable starting with the input name Ashcroft. This is a good test case for the algorithm since the s and the c combine despite the intervening h. (Many Soundex implementations found on genealogy Web sites don't implement this rule correctly.)
Unfortunately, the Soundex algorithm is a poor candidate for spell checking. For a start, words that sound distinct may have the same soundex. For example, White and Wood both have the soundex W300. This isn't surprising, since the Soundex algorithm was designed to group names that sound alike, not just those that
sound identical. While this feature might be a desirable for some
applications -- such as one intended to help telephone operators
recognize names spoken in a variety of accents -- it is not useful
for a spell-checking application, since it produces far too many
matches. For instance, the misspelling algorithum matches the
following words from my sample dictionary:
alacritous, alacrity, alcheringa, alcoran, algeria, algerian, algerians, algiers, algor, algorism, algorithm, algorithmic, algorithmically, algorithms, alizarin, alizarine, alkoran, alleger, allegers, allegoric, allegorical, allegorically, allegories, allegorist, allegorists, allegorizes, allegory, allegretto, allegrettos, allegro, allegros, allocheiria, allochiria, allocortex, allograft, allograph, allographic, allographs
Even taking into account the extra matches due to variants of the same word (allegoric, allegorical, allegorically), you would normally require much more stringent matching from a spell-checking algorithm. As you'll recall, the Soundex algorithm also truncates each soundex code to four characters, further boosting the number of matches by ignoring the ends of long words. And the trouble doesn't end there.
Just as different sounding words may have the same soundex, the
reverse situation can also occur: words that sound identical, called homophones, may have different codes. This is
the result of some letters being silent, such as the p in
Thompson (T512) versus
Thomson (T525), or the gh in
Leigh (L200) versus Lee
(L000). In a similar vein, the initial letter of
a word may vary without affecting its sound, such as the c in
Carr (C600) versus the k
of Karr (K600). The Soundex algorithm
brings this problem on itself by failing to map the initial letter of each word to a phonetic digit.
The so-called homophone problem actually stems from the fact that the
English language has very irregular spelling, perhaps more so than any
other language. While there are many minor variations of the Soundex algorithm, they all have scant knowledge of English spelling rules, not to mention the exceptions to these rules. As a consequence of this irregularity, the Soundex is not particularly well suited to spell checking in the English language. For example, the Soundex gives the phonetically misspelled lam (L500) a different soundex code from the correct form, lamb (L510). A spell check application
based on Soundex would not, therefore, offer lamb as a suggested correction for the misspelling lam.
It was this state of affairs that led Lawrence Phillips to find a replacement for the Soundex algorithm, called the Metaphone.
The idea behind the Metaphone algorithm, first published in
Computer Language magazine in 1990 (see Resources), is to explicitly code common rules of
English pronunciation that the Soundex doesn't attempt to address. For
example, the Metaphone algorithm includes an explicit rule that drops
the letter b where it occurs after the letter m at the end
of a word. This rule ensures that lam and lamb will have
the same encoding (LM), thus enabling a spell
check application to offer the correct replacement for lam.
The Metaphone algorithm uses 16 consonant classes represented by the following characters:
B X S K J T F H L M N P R 0 W Y |
The 0 is a zero, used to represent the
th sound. As in the Soundex algorithm, the first letter is retained and the final code is truncated to four characters, although it is not padded if shorter. Repeated letters and vowels are generally dropped, as are vowels. The bulk of the Metaphone algorithm is a set of rules that map letter combinations into the consonant classes. A Java implementation runs to a few hundred lines of code, as exemplified by the Metaphone code in the Apache Jakarta Commons
Codec project (see Resources). In Listing 2, you
see what happens when you use the Apache Metaphone class
as a JUnit test case to check some code for the word lamb.
Listing 2. Using the Apache Metaphone class
import junit.framework.TestCase;
import org.apache.commons.codec.language.Metaphone;
public class ApacheMetaphoneTest extends TestCase {
public void test() {
Metaphone metaphone = new Metaphone();
assertEquals("LM", metaphone.encode("lam"));
assertEquals("LM", metaphone.metaphone("lam"));
assertEquals(metaphone.encode("lamb"), metaphone.encode("lam"));
assertTrue(metaphone.isMetaphoneEqual("lamb", "lam"));
}
}
|
The Metaphone algorithm generally improves on the Soundex, although
there are some flaws in its rules. Metaphone author Phillips points out,
for instance, that Bryan (BRYN) and
Brian (BRN) should have the same code.
Phillips published his attempt to improve on the Metaphone's fuzzy matching (so called) in the June 2000 issue of the C/C++ Users Journal.
The DoubleMetaphone algorithm tinkered somewhat with the original's
consonant classes, and broke with the Soundex algorithm by encoding all leading vowels as A. More fundamentally, the DoubleMetaphone
was written to return different codes for words that can be pronounced in more than one way. For example, hegemony can be pronounced with a soft or a hard g, so the algorithm returns both HJMN
and HKMN. Despite these examples, most words in the Metaphone algorithm return a single key. You can see the Apache DoubleMetaphone class being exercised in Listing 3.
Listing 3. Using the Apache DoubleMetaphone class
import junit.framework.TestCase;
import org.apache.commons.codec.language.DoubleMetaphone;
public class ApacheDoubleMetaphoneTest extends TestCase {
public void test() {
DoubleMetaphone metaphone = new DoubleMetaphone();
assertEquals("HJMN", metaphone.encode("hegemony"));
assertEquals("HJMN", metaphone.doubleMetaphone("hegemony"));
assertEquals("HJMN", metaphone.doubleMetaphone("hegemony", false));
assertEquals("HKMN", metaphone.doubleMetaphone("hegemony", true));
}
}
|
While both Soundex and Metaphone algorithms solve phonetic fuzzy matching very well, no spell checking application is complete without typo correction. A
typo occurs when your fingers slip on the keyboard and type labm
(LBM) instead of lamb (LM). A phonetic matching algorithm cannot match this type
of misspelling with its replacement, since the two words sound different. To solve a case like this, your spell checking application must include a string similarity algorithm.
Remember those puzzles where you had to transform one word into another while changing only one letter at a time? For example, ship can be transformed into crow via the intermediate words shop, chop, and crop. This game gives you a way to clearly understand the concept of distance between two words: distance is the number of steps it takes to transform one word into another, provided that you change only one letter at a time and use actual dictionary words for each step. I'll call this the puzzle distance. In this example, the puzzle distance between ship and crow is 4.
Although we often think of distance as being a physical measurement between two points in space, mathematicians define it more generally, calling it a metric. This definition allows you to use the concept of distance in different applications; here you're interested in the distance between two character strings, or words. The idea is that for a misspelling, you should look for words that are suitably "close" -- using the definition of distance -- to the misspelled word. Any definition of a distance metric has to satisfy a handful of properties to qualify; for example, a distance can never be negative.
While sequence comparison has many facets (see Resources), the trick for your purpose is to find a definition of distance that lends itself to good spelling corrections. The puzzle distance defined above is unsuitable for at least one good reason: A misspelled word is not always one letter away from a correctly spelled one. For instance, there are no "stepping stones" from the misspelling puzzel to any correctly spelled English word. Fortunately, a number of metrics have already been devised that are appropriate for spell checking.
The Dynamic Programming algorithm
The Dynamic Programming algorithm is essentially a brute-force method that considers all the different ways of transforming a source word to a target to find the least cost, or distance between the two. Levenshtein distance, a particular implementation of the Dynamic Programming algorithm, permits three types of operation for transforming the source word x to the target word y:
- The substitution of one character of x by a character in y
- The deletion of a character of x
- The insertion of a character in y
Each operation has a certain cost, and the total distance is the smallest total cost for transforming word x to word y. It is intuitively plausible that an algorithm based on these operations would work well for spelling correction, since typos are nothing more than these operations interpreted as keying errors. (In fact, a Levenshtein distance is also known as an edit distance.) For example, when I key the word wrong as wromg (hitting the m key instead of the n key), it is a substitution error; when I key it as wromng (hitting the m key as well as the n key), it is a deletion error; and when I key it as wrog (missing the n key), it is an insertion error.
The Dynamic Programming algorithm is best understood by drawing a grid whose rows correspond to the letters of the source, and whose columns correspond to the letters of the target. The cell at (i, j) represents the smallest distance between the first i letters of the source and the first j letters of the target.
For the Levenshtein distance, the cost of the deletions and insertions is 1. The cost of substitutions is 1 if the characters differ, otherwise it is 0. To start the algorithm you fill in the first row, which corresponds to an empty source word, so it is the cost of inserting 0, 1, ..., j letters. Similarly the first column corresponds to an empty target word, so it is the cost of deleting 0, 1, ..., i letters. If you take the transformation of pzzel to puzzle as an example, you have the grid shown in Figure 1.
Figure 1. The first stage of the Levenshtein algorithm

Next, you calculate the values in each remaining cell by considering its three neighbors: above, to the left, and diagonally upward and to the left. This is shown schematically in Figure 2.
Figure 2. How to calculate costs for cells
| Diagonal | Above |
| Left |
Min( Diagonal + substitution cost, Above + delete cost, Left + insert cost ) |
The resulting grid for the example is shown in Figure 3. The cost in the bottom right-hand cell, 3, is the Levenshtein distance between pzzel and puzzle.
Figure 3. The final stage of the Levenshtein algorithm

Properties of the Levenshtein algorithm
As a bonus, the Levenshtein algorithm also gives you a series of operations, also called alignments, which constitute the transformation. A pair of words often has more than one alignment. Alignments correspond to minimum-cost paths from the top left-hand cell to the bottom right-hand cell that follow the arrows on the diagram. For instance, the alignment depicted in Listing 4 (and shown as red arrows in Figure 3), may be read, character by character, as the following series of operations:
- Substitute p with p (cost 0)
- Insert u (cost 1)
- Substitute z with z (cost 0)
- Substitute z with z (cost 0)
- Insert l (cost 1)
- Substitute e with e (cost 0)
- Delete l (cost 1)
Listing 4. An alignment between pzzel and puzzle
p-zz-el puzzle- |
A Java implementation of the Levenshtein algorithm
A simple, illustrative Java implementation of the Levenshtein
algorithm is shown in Listing 5. The
LevenshteinDistanceMetric class is similar to the
StringUtils class from the Apache Jakarta Commons project.
A limitation of both of these implementations is that they do not scale
to large strings, since the storage requirements are O(mn), where
m and n are the length of the source and target words,
respectively. If you need only to compute the distance and not
alignments, as is usually the case, it is easy to reduce this to
O(n), since only the previous row is required to compute the
next. A fix has been proposed for the Apache version (see Resources), but it has not been incorporated as of
the current time of writing (version 2.0).
Note that the running time for the Levenshtein algorithm is always O(mn). As a result, this algorithm is too slow for finding the closest match for a misspelling in very large dictionaries.
Listing 5. An implementation of the Levenshtein distance algorithm
public class LevenshteinDistanceMetric implements SequenceMetric {
/**
* Calculates the distance between Strings x and y using the
* <b>Dynamic Programming</b> algorithm.
*/
public final int distance(String x, String y) {
int m = x.length();
int n = y.length();
int[][] T = new int[m + 1][n + 1];
T[0][0] = 0;
for (int j = 0; j < n; j++) {
T[0][j + 1] = T[0][j] + ins(y, j);
}
for (int i = 0; i < m; i++) {
T[i + 1][0] = T[i][0] + del(x, i);
for (int j = 0; j < n; j++) {
T[i + 1][j + 1] = min(
T[i][j] + sub(x, i, y, j),
T[i][j + 1] + del(x, i),
T[i + 1][j] + ins(y, j)
);
}
}
return T[m][n];
}
private int sub(String x, int xi, String y, int yi) {
return x.charAt(xi) == y.charAt(yi) ? 0 : 1;
}
private int ins(String x, int xi) {
return 1;
}
private int del(String x, int xi) {
return 1;
}
private int min(int a, int b, int c) {
return Math.min(Math.min(a, b), c);
}
}
|
So far, I've considered two approaches to spell checking: phonetic matching and sequence comparison. Since neither of these on its own provides a complete solution, an algorithm was written to combine the two. To quote from the GNU Aspell manual:
The magic behind [Aspell] comes from merging Lawrence Philips excellent metaphone algorithm and Ispell's near miss strategy which is inserting a space or hyphen, interchanging two adjacent letters, changing one letter, deleting a letter, or adding a letter.
Jazzy is a GPL/LGPLed Java-based spell checker API based on the Aspell algorithm, which was originally written in C++.
The Aspell algorithm and Jazzy
If the word being spell checked is not in the dictionary, then the Aspell algorithm assumes it is misspelled. In this case, the algorithm uses the following steps to build a ranked list of suggested corrections:
- Add close phonetic matches of the misspelling: Add all
dictionary words that have the same phonetic code as the misspelled word
and whose edit distance from the misspelled word is less than a
given threshold.
- Add close phonetic matches of the misspelling's near
misses: Find all phonetic codes for words that are one edit
operation from the misspelled word. For these codes, add all dictionary
words that have the same phonetic code as the misspelled word and
whose edit distance from the misspelled word is less than a given
threshold.
- Best guess: If no suggestions have been found, add all
dictionary words that have the same phonetic code as the misspelled word
and with the smallest edit distance from the misspelled
word.
- Sort: Sort the words by edit distance, keeping words found at each step together.
The strength of the Aspell algorithm is the way it uses edit distance at both the word level and the phonetic code level. In practice, this introduces enough fuzziness to produce good suggestions for misspelled words.
The edit distance used in Jazzy differs from the Levenshtein distance defined earlier. As well as substitutions, deletions, and insertions, Jazzy includes operations to swap adjacent letters and to change the case of a letter. The cost of the operations is configurable. The default phonetic encoding is Metaphone, although it is possible to use a phonetic transformations rules file (see Resources), which is a table-driven way of defining transformations like Metaphone. The table-driven approach makes it easy to configure a Jazzy-based spell checker to support other languages.
From here forward, I'll concentrate on describing the steps to actually build a spell checker using the Jazzy API. Listing 6 illustrates how to write a Java spell checker using Jazzy.
Listing 6. A simple spell checker
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.util.Iterator;
import java.util.List;
import com.swabunga.spell.engine.SpellDictionary;
import com.swabunga.spell.engine.SpellDictionaryHashMap;
import com.swabunga.spell.event.SpellCheckEvent;
import com.swabunga.spell.event.SpellCheckListener;
import com.swabunga.spell.event.SpellChecker;
import com.swabunga.spell.event.StringWordTokenizer;
public class Suggest {
public static class SuggestionListener implements SpellCheckListener {
public void spellingError(SpellCheckEvent event) {
System.out.println("Misspelling: " + event.getInvalidWord());
List suggestions = event.getSuggestions();
if (suggestions.isEmpty()) {
System.out.println("No suggestions found.");
} else {
System.out.print("Suggestions: ");
for (Iterator i = suggestions.iterator(); i.hasNext();) {
System.out.print(i.next());
if (i.hasNext()) {
System.out.print(", ");
}
}
System.out.println();
}
}
}
public static void main(String[] args) throws Exception {
if (args.length < 1) {
System.err.println("Usage: Suggest <dictionary file>");
System.exit(1);
}
SpellDictionary dictionary = new SpellDictionaryHashMap(new File(args[0]));
SpellChecker spellChecker = new SpellChecker(dictionary);
spellChecker.addSpellCheckListener(new SuggestionListener());
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
while (true) {
System.out.print("Enter line to spell check (return to exit): ");
String line = in.readLine();
if (line.length() == 0) {
break;
}
spellChecker.checkSpelling(new StringWordTokenizer(line));
}
}
}
|
The main() method
creates a SpellDictionary from a file
specified on the command line. The SpellDictionaryHashMap implementation stores the
words in memory, which is fast but not always appropriate for large
dictionaries. (Disk-based implementations are provided for
applications where memory is an issue.) The SpellDictionary is used to construct a SpellChecker object, which has a SpellCheckListener registered with it, before being
fed lines of input from standard input. The event-based design fits
naturally with the user-driven applications in which spell checkers are typically embedded. In this case, the listener (SuggestionListener) simply writes any misspellings
and a list of suggestions to standard output whenever it receives a
SpellCheckEvent. Listing 7 shows you a sample run.
Listing 7. Spell checking with Jazzy
Enter line to spell check (return to exit): choklut biskit Misspelling: choklut Suggestions: chocolate Misspelling: biskit Suggestions: biscuit Enter line to spell check (return to exit): |
Whereas this example is very simple, a more sophisticated application could use Jazzy's support for user dictionary management, performing such tasks as adding new words to the dictionary, ignoring words, automatically replacing repeat misspellings with a chosen correction, and so on. See the API documentation for SpellCheckEvent (in Resources) for details.
At the time of this writing, the Jazzy API is still alpha software in version 0.5. As a relatively young API, Jazzy is open to improvement and extension. For starters, Jazzy could more closely mirror some of the improvements being made to its cousin, Aspell. More ambitiously, Jazzy would be an ideal framework for prototyping a context- and grammar-aware spell checker using some of the features of natural language processing rather than a simple list of words.
As it is, Jazzy is a solid, although relatively basic API for developing spell checking software on the Java platform. Because Jazzy is open source, anyone can contribute to its ongoing development. The API can also be utilized as a framework and extended for in-house application development. See the Resources section to learn more about the algorithms discussed in the article, as well as the Java platform's Jazzy new spell checker API.
| Name | Size | Download method |
|---|---|---|
| j-jazzy.zip | 397KB | HTTP |
Information about download methods
- This Eclipse plug-in uses
Jazzy to provide spell-checking capabilities for Java, JavaScript, Java
properties files, XML, HTML, JSP, and PHP.
- The Art of Computer Programming, Volume 3, Sorting and Searching (Addison-Wesley; 1998), by Donald E. Knuth contains the definitive description of the Soundex algorithm.
- "Hanging on the Metaphone" (Computer Language, December 1990) by Lawrence Philips is where the first description of the Metaphone algorithm appeared.
- Chas Emerick of the Apache Jakarta Commons Lang project has proposed a solution to the problem of
using Levenshtein distance on very large strings, although the fix had not been incorporated into version 2.0 as of the current writing.
- For more (mathematical) details on Levenshtein distance, and related string distance metrics consult "Sequence comparison" by Christian Charras and Thierry Lecroq.
- Visit the Aspell homepage for a complete guide to the Aspell algorithm.
- Exact String Matching Algorithms (King's College London Publications; 2004) is a tour de force in the related area of exact text searching, also by Christian Charras and Thierry Lecroq.
- Try this interactive Java applet for calculating Levenshtein distance and showing alignments.
- Choosing a good word list can be just as important as using a good spell check algorithm.
- Download Jazzy, the Java open source spell checker.
- Download GNU Aspell, a free and open source spell checker.
- Download Apache Jakarta Commons Codec for an open source implementation of Soundex, Metaphone, and Double Metaphone.
- Download Apache Jakarta Commons Lang for an open source implementation of Levenshtein distance.
- You'll find articles about every aspect of Java programming in
the developerWorks Java technology
zone.
- Browse for books on these and other technical topics.
Tom White is lead Java developer at Kizoom, a leading UK software company in the delivery of personalized travel information to mobile devices. Clients include the UK's national train operator, the London public transport system, and UK national bus companies. Since its founding in 1999, Kizoom has used all the disciplines of extreme programming. Tom has been writing Java programs full time since 1996, using most of the standard and enterprise Java APIs, from client Swing GUIs and graphics to back-end messaging systems. He has a first class honors degree in mathematics from Cambridge University. When not programming, Tom enjoys making his young daughter laugh and watching 1930s Hollywood films. Contact Tom at tom@tiling.org.




