python text-processing spell-checking spelling

How to efficiently use spell correction for a large text corpus in Python


Consider the following code for spell correction:

from autocorrect import spell
import re

WORD = re.compile(r'\w+')
def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]
def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))      
    return sptext    

print(spell_correct(text)) 

Here is the output of the code above:

[Screenshot of the notebook output; image not available]

How can I stop this output from being displayed in a Jupyter notebook? With a large number of text documents, it produces a lot of output.

My second question is: how can I improve the speed and accuracy of this code when applying it to a large dataset? For example, note that the word "veri" is not corrected in the output. Is there a better way to do this? I would appreciate alternative solutions that run faster.


Solution

  • As @khelwood said in the comments, you should use autocorrect.Speller instead of the older spell function:

    from autocorrect import Speller
    import re
    
    
    # Build the English spell checker once and reuse it for every word.
    spell = Speller(lang="en")
    WORD = re.compile(r'\w+')
    def reTokenize(doc):
        tokens = WORD.findall(doc)
        return tokens
    
    text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]
    def spell_correct(text):
        sptext = []
        for doc in text:
            sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))      
        return sptext    
    
    print(spell_correct(text)) 
    
    #Output
    #['hi welcome to spelling', 'this is just an example but consider a veri big corpus']
    

    As an alternative, you could replace the explicit loop with a list comprehension, which may give a small speedup. You could also use the pyspellchecker library, which corrects the word 'veri' in this example:

    from spellchecker import SpellChecker
    import re
    
    WORD = re.compile(r'\w+')
    spell = SpellChecker()
    
    def reTokenize(doc):
        tokens = WORD.findall(doc)
        return tokens
    
    text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]
    
    def spell_correct(text):
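        # correction() can return None in recent pyspellchecker releases when it
        # finds no candidate, so fall back to the original word in that case.
        sptext = [' '.join([(spell.correction(w) or w).lower() for w in reTokenize(doc)]) for doc in text]
        return sptext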
    
    print(spell_correct(text)) 
    

    Output:

    ['hi welcome to spelling', 'this is just an example but consider a very big corpus']
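
For a very large corpus, most of the time is likely spent correcting the same words over and over. A minimal sketch of one way to reduce that is to cache the correction of each distinct word, so the spell checker runs once per unique word rather than once per token. This is only a sketch under some assumptions: make_cached_corrector, toy_correct and TOY_FIXES are hypothetical names introduced for illustration, the toy corrector stands in for a real library so the example runs on its own, and words are lowercased before correction (unlike the code above, which lowercases afterwards) to get more cache hits.

```python
from functools import lru_cache
import re

WORD = re.compile(r'\w+')

def make_cached_corrector(correct_word, maxsize=None):
    """Wrap a word-level corrector so each distinct word is corrected once."""
    @lru_cache(maxsize=maxsize)
    def cached(word):
        # Fall back to the original word if the corrector returns None.
        return correct_word(word) or word
    return cached

def spell_correct(docs, correct_word):
    """Correct every document, reusing cached results across documents."""
    cached = make_cached_corrector(correct_word)
    # Lowercasing first (an assumption of this sketch) lets "Welcmoe" and
    # "welcmoe" share one cache entry.
    return [' '.join(cached(w.lower()) for w in WORD.findall(doc)) for doc in docs]

# Toy stand-in for a real spell checker (hypothetical, for illustration only).
TOY_FIXES = {"welcmoe": "welcome", "speling": "spelling"}
calls = []  # records how often the underlying corrector is actually invoked

def toy_correct(word):
    calls.append(word)
    return TOY_FIXES.get(word, word)

docs = ["Hi, welcmoe to speling.", "Welcmoe to speling again."]
result = spell_correct(docs, toy_correct)
print(result)      # ['hi welcome to spelling', 'welcome to spelling again']
print(len(calls))  # 5 corrector calls for 8 tokens
```

To try this with a real library, pass spell.correction from pyspellchecker, or the Speller instance from autocorrect, as correct_word. How much time this saves depends on how often words repeat in the corpus, so it is worth measuring on a sample first.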