
None of us are too old to remember that strange time when writers had to use word processors without the benefits of an automated spellchecker. It was a time when the jokes came fast and furious, even to the point of The Guardian newspaper becoming known as The Grauniad because of the legendary typos sent to print by its tired subbing staff.
Pretty soon though, word processors started to come with spell-checking utilities that would check the text for spelling errors. At first they were activated manually - you had to select the option to check the spelling in your document - but soon Microsoft shipped a version of Word (Word 95, in, er, 1995) that was able to check your spelling as you typed and would underline misspelled words with a red squiggly line.
Alongside these user interface improvements came algorithms to enable the word-processing program to suggest corrections for the misspellings encountered in the document. Later still came grammar checkers, which are still in their infancy and can even now get things very wrong (for instance, as I type this paragraph, Word 2007's grammar checker is insisting that the phrase 'these user' in the first sentence should be changed to 'this user').
Nowadays, the spellchecker is so ingrained in word-processing applications that we no longer think about checking our spellings manually. Indeed, the jokes have now moved on to mock the sometimes-wacky suggestions that correction algorithms produce. And because we often assume that if there are no red squiggly lines in the document then the text is good, a new issue has arisen: the possibility of correctly spelled yet inappropriate words in the text going unnoticed (my personal issues are 'from' and 'form', and 'class' and 'calls').
So the spellchecker is now invaluable for people from all works of life. But how do they work, and how might we go about writing one?
Before we go any further however, if you enjoyed discovering how things work, don't forget to check out our article discussing how Google’s PageRank and indexing work!
There are two algorithms at play in the most basic kind of spellchecker. The first algorithm takes a body of text and splits it up, identifying the separate words. These words are then checked to see if they're spelt correctly using the second algorithm.
There are several ways to break up a chunk of text into its component words. Sometimes the run-time of the language you use may have a function that takes a string and returns an array of the words within it. Another option is to use the regular expressions package that comes with your run-time (especially if you're using one of the interpreted languages such as JavaScript, Perl, PHP or Python) to split up the words. These are relatively easy solutions, but we're going to take a look at the issue from the first principles of spellchecking, and instead use a state machine.
To create a state machine, we need first to determine the states. After that, we can define the transitions between the states, and what happens within those transitions. For our 'separation of text into words' state machine, we will be using two states that we can call 'InWord' and 'NotInWord'. Let's start off by looking at the NotInWord state.
The easiest transition to think about is the transition from the InWord state to the NotInWord state. I would categorise this as reading a white-space character (like a space, new line or tab) or as reading a punctuation character (like a comma, semicolon, colon, full stop or dash). This makes the transition from NotInWord to InWord much simpler: it's any character that isn't white space or punctuation. While we're in the InWord state we collect characters, appending them to a string (which we can call Word, imaginatively enough). In the transition to NotInWord, we release this Word value. the image below shows this very simple state machine.

It's worth noting that the reason we define the transitions in this way is that it's more culture-neutral than doing things the other way around, where we define the characters that can appear in a word. The problem with this is that there are many nonstandard characters, including accented characters such as the 'é' in 'café' or the 'ï' in 'naïve'. Because the number of characters that can be defined as white space or punctuation is much smaller than the number of possible characters constituting a word, it's easier to define the transitions using the former type.
Now that we can receive a series of words one at a time from the text, we can think about how to spellcheck them. The first thing is to assume that any words that have digits in them are spelt correctly. So words like '2009', '1st', and 'Q1' are always assumed to be correct spellings. In normal text, I would say that this is generally the best assumption to make. The alternative - flagging all words with digits in them - is likely to produce a lot of false negatives and therefore annoy the writer. A possible middle ground is to accept all words that consist of digits only and reject all others.
So once we have a candidate, how do we check its spelling? By far the simplest algorithm that we can use is a hash table or dictionary. The first step is to locate a complete word list. There are several to be had on the Internet; just Google 'word list' to find one. Write a routine to read this list (it'll usually be in a one-word-per-line format) and add each word to a hash table.
Now the process becomes much faster: for each word identified in the text, check to see if it's present in the hash table. The reason for using a hash table here is that this checking process is linear in time (it doesn't depend on the number of entries) compared to a simple array (time is proportional to the number of entries) or to a sorted array or binary search tree (time is proportional to the logarithm of the number of entries).
There are two main reasons for a misspelled word. The first is physical: the writer knows exactly how to spell the word, but in the process of getting his thoughts down as text, he mistypes it. He may hit a key that's next to the key he wanted ('thr' instead of 'the'), transpose a couple of letters ('teh'), miss a key completely ('te') or inadvertently manage to type an extra adjacent letter ('thge').
The second reason for a misspelling is that the writer has guessed how to spell a word. Here the misspelled word is likely to be phonetically similar to the actual word. An example is 'definately' instead of 'definitely'. Because there are two distinct reasons for misspellings, we have to use two different algorithms for identifying possible corrections. The algorithm used to correct words that have been mistyped is based on the Levenshtein method, which calculates the distance between the misspelled word and all the words in the word list (see the box on page 122). However, the original Levenshtein method is too lengthy for modern spellchecking. A far better algorithm for suggesting corrections is the Damerau algorithm, which generates variants of the mistyped word according to the errors listed below and checks them against the word list.
Let's see this method in action. Suppose our misspelled word is 'usre':

The above image shows the process behind generating alternative spellings from mistyped words.
So, for the cost of generating 237 words and checking each in the word hash table, we have three suggestions for correcting the original word: 'sure', 'user' and 'use'. As image below shows, these are the first three suggestions that Word 2007 produces for 'usre', suggesting that our spellchecker is working on the right lines. This is vastly faster than calculating the Levenshtein distance for every word in the list we're using, which contains 170,000 words.

Now that we've analysed how to suggest spelling corrections for a mistyped word, let's have a look at how to generate suggestions for a genuinely misspelled word (where the writer tries a phonetic approximation). Our previous example ('definately') would already have been corrected by the Damerau algorithm, so let's contrive a more difficult example: 'enonesiate' for 'enunciate'. In this case, it's time to try the Soundex method.
In 1918, Robert Russell and Margaret Odell devised and patented an algorithm for approximating and encoding the phonetic value of a word. They called it Soundex, and it's implemented like this:

The Soundex algorithm is a good approximation for encoding a word phonetically, but it has a big problem when used for spelling corrections: it requires that the initial letter of the phonetic misspelled guess be correct. (It would not notice that 'cereal' and 'serial' were the same phonetically, for example.) Other problems with the algorithm are that it generates fixed-size codes and that it does not distinguish between different-sounding phonemes.
In 1990, Lawrence Phillips devised another phonetic algorithm that corrects many of the issues with Soundex. It's called Metaphone, and it uses a much larger set of rules to create a phonetic encoding of a word. While more accurate, it is also much harder to implement. Eight years later, Phillips produced an improved version which he named the Double Metaphone algorithm. This version produces two codes for every word.
To use the Soundex method to suggest corrections, you would need to create another hash table that's populated with the Soundex codes for all of the words in the word list. Obviously there will be duplicates generated by the Soundex algorithm for all these words, so you would have to save a list of 'soundalikes' for each Soundex key in the hash table rather than the single original word. Once you have created this hash table, suggesting spelling corrections is merely a case of calculating the Soundex value for the misspelled word, keying it into the Soundex hash table and retrieving the list of soundalikes given for the misspelling. In our example, 'enonesiate' would get converted to a value of E552. Since E552 is also the Soundex for 'enunciate', it will be provided as a suggestion for the correction.
The Russian scientist and mathematician Vladimir Levenshtein devised the Levenshtein distance algorithm in 1965 as part of his investigations into information theory and error-correcting codes. The Levenshtein distance is the minimum number of edits needed to get from one word to another, where an edit is defined as the insertion of a letter, the deletion of a letter or the substitution of one letter for another.
For small strings, calculating the Levenshtein distance is a quick operation. However, calculating the edit distance between a misspelled word and all the words in a large word list could take a significant amount of time, especially as after the calculation the results have to be sorted into distance order.
In reality, we'd want to be much more choosy about what our spellchecker suggests as a spelling correction. Even though the Levenshtein distance from 'begind' to 'end' is only three (delete three letters, essentially), there's no way that any self-respecting writer would accept 'end' as a spelling correction for what was clearly supposed to be 'begin'. Clearly a more efficient algorithm was needed.
The Levenshtein distance is still an important concept, however. In his paper on spelling corrections in 1964, F J Damerau noticed that four errors caused 80 to 90 per cent of misspellings: leaving out a letter, adding a letter, using a wrong letter or transposing two adjacent letters. These errors all come to a Levenshtein distance of one. He created a variant algorithm based on this research called the Damerau-Levenshtein distance.
As you can see, the process of checking spellings and suggesting corrections is not an exact science, but there's no denying that it has made our lives a little easier and our publications a little less unpredictable.
Copyright Future Publishing Limited (company registered number 2008885), a company registered in England and Wales whose registered office is at Beauford Court, 30 Monmouth Street, Bath, BA1 2BW, UK
I think the labels for the transitions from InWord in the state machine diagram are labelled the wrong way around...
Submitted by Daniel Watkins on 27 May 2009 - 12:37pm.
Soundex was developed to encode names, not words. Anglo-Saxon surnames, if memory serves.
Submitted by SlowWalker on 27 May 2009 - 2:46pm.
It's constant in time, not linear, if it does not depend on its input size
Submitted by Anonymous on 28 May 2009 - 12:24am.
Using a hash or a binary search is questionable - the hash may be constant time, but the hashing function adds its own overhead, and there'll be a search anyway if there are collisions. On the other hand, the dictionary on my machine has 98568 words - a maximum of 17 comparisons in a binary search.
Otherwise, this is a good introduction to spellchecking.
Submitted by Izkata on 28 June 2009 - 8:43pm.
Izkata: The binary search has the disadvantage of poor locality of reference (reading from 17 different memory locations).
The dictionary is btw static, so you may be able to make a perfect hash function.
Hard to say what will be best in practice (yet easy to test). There is also the option of using a ternary search tree, which in theory should be faster than the normal binary search tree (since we only compare one character at a time, rather than a string compare).
Submitted by Allan Odgaard on 6 August 2009 - 7:35pm.
Post new comment