python text-processing spell-checking spelling

How to efficiently use spell correction for a large text corpus in Python

Consider the following for spell-correction:

from autocorrect import spell
import re

WORD = re.compile(r'\w+')
def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]
def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))      
    return sptext    

print(spell_correct(text))

Here is the output for above piece of code:

How I can stop displaying the output in jupyter notebook? Particularly if we have a large number of text documents, it will be lots of outputs.

My second question is: how can I improve the speed and accuracy (please check the word "veri" in the output for example) of the code when applying on a large data? Is there any better way to do this? I appreciate your response and (alternative) solutions with better speed.

Solution

As @khelwood said in the comments, you should use autocorrect.Speller:

from autocorrect import Speller
import re


spell=Speller(lang="en")
WORD = re.compile(r'\w+')
def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]
def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))      
    return sptext    

print(spell_correct(text)) 

#Output
#['hi welcome to spelling', 'this is just an example but consider a veri big corpus']

As an alternative, you could use a list comprehension to maybe increase the speed, and also you could use the library pyspellchecker, which improves the accuracy of the word 'veri' in this case:

from spellchecker import SpellChecker
import re

WORD = re.compile(r'\w+')
spell = SpellChecker()

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext =  [' '.join([spell.correction(w).lower() for w in reTokenize(doc)])  for doc in text]    
    return sptext    

print(spell_correct(text))

Output:

['hi welcome to spelling', 'this is just an example but consider a very big corpus']