spell-checkinggoogle-wave

Context Specific Spelling Engine


I'm sure more than a few of you will have seen the Google Wave demonstration. I was wondering about the spell checking technology specificially. How revolutionary is a spell checker which works by figuring out where a word appears contextually within a sentence to make these suggestions ?

I haven't seen this technique before, but are there examples of this elsewhere?
and if so are there code examples and literature into its workings ?


Solution

  • My 2 cents. Given the fact that translate.google.com is a statistical machine translation engine and "The Unreasonable Effectiveness of Data" from A Halevy, P Norvig (Director of Research at Google) & F Pereira: I make the assumption (bet) that this is a statistically driven spell checker.

    How it could work: you collect a very large corpus of the language you want to spell check. You store this corpus as phrase-tables in adapted datastructures (suffix arrays for example if you have to count the n-grams subsets) that keep track of the count (an so an estimated probability of) the number of n-grams.

    For example, if your corpus is only constitued of:

    I had bean soup last diner.
    

    From this entry, you will generate the following bi-grams (sets of 2 words):

    I had, had bean, bean soup, soup last, last diner
    

    and the tri-grams (sets of 3 words):

    I had bean, had bean soup, bean soup last, soup last diner 
    

    But they will be pruned by tests of statistical relevance, for example: we can assume that the tri-gram

    I had bean 
    

    will disappear of the phrase-table.

    Now, spell checking is only going to look is this big phrase-tables and check the "probabilities". (You need a good infrastructure to store this phrase-tables in an efficient data structure and in RAM, Google has it for translate.google.com, why not for that ? It's easier than statistical machine translation.)

    Ex: you type

    I had been soup
    

    and in the phrase-table there is a

    had bean soup
    

    tri-gram with a much higher probability than what you just typed! Indeed, you only need to change one word (this is a "not so distant" tri-gram) to have a tri-gram with a much higher probability. There should be an evaluating function dealing with the trade-off distance/probability. This distance could even be calculated in terms of characters: we are doing spell checking, not machine translation.

    This is only my hypothetical opinion. ;)