Consider the following for spell-correction:
from autocorrect import spell
import re

WORD = re.compile(r'\w+')

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))
    return sptext

print(spell_correct(text))
Here is the output of the code above:
How can I stop the output from being displayed in a Jupyter notebook? With a large number of text documents, there will be a lot of output.
My second question is: how can I improve the speed and accuracy of this code on a large dataset (see the word "veri" in the output, for example)? Is there a better way to do this? I would appreciate any answers and alternative solutions that run faster.
As @khelwood said in the comments, you should use autocorrect.Speller:
from autocorrect import Speller
import re

spell = Speller(lang="en")
WORD = re.compile(r'\w+')

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))
    return sptext

print(spell_correct(text))
# Output
# ['hi welcome to spelling', 'this is just an example but consider a veri big corpus']
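On the first question: a notebook only shows output when you call print() or leave a bare expression as the last line of a cell, so assigning the result to a variable is usually enough. If a function you call prints on its own, one option is to capture its stdout. Here is a minimal sketch, where noisy_step is a hypothetical stand-in for such a function and not part of either library:

```python
import contextlib
import io

def noisy_step(docs):
    # Hypothetical stand-in for any function that prints while it works.
    for doc in docs:
        print("processing:", doc)
    return [doc.lower() for doc in docs]

docs = ["Hi There", "Another Doc"]

# Everything printed inside the with-block goes into the buffer
# instead of the cell output; the return value is kept as usual.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = noisy_step(docs)

# Inspect only a small slice when you want to check the result.
preview = result[:1]
```

In Jupyter itself, ending the last line of a cell with a semicolon or starting the cell with the %%capture magic has a similar effect.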
As an alternative to autocorrect, you could use the pyspellchecker library, which handles the word 'veri' correctly in this case. Replacing the loop with a list comprehension may also give a small speed-up:
from spellchecker import SpellChecker
import re

WORD = re.compile(r'\w+')
spell = SpellChecker()

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.","This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    # Recent pyspellchecker versions can return None from correction() when
    # no candidate is found, so fall back to the original word in that case.
    sptext = [' '.join([(spell.correction(w) or w).lower() for w in reTokenize(doc)]) for doc in text]
    return sptext

print(spell_correct(text))
Output:
['hi welcome to spelling', 'this is just an example but consider a very big corpus']
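For a very large corpus, most of the time is likely spent correcting the same words over and over. One way to cut that down is to correct each distinct word only once and reuse the result. The sketch below is an assumption of mine, not something either library provides: correct_corpus and toy_correct are hypothetical names, and the toy corrector only stands in for a real one so the example runs on its own. Note that it lowercases before correcting, which differs slightly from the code above.

```python
import re

WORD = re.compile(r'\w+')

def correct_corpus(docs, correct_word):
    """Correct every distinct lowercase token once, then rebuild the documents."""
    tokenized = [[w.lower() for w in WORD.findall(doc)] for doc in docs]
    vocab = {w for tokens in tokenized for w in tokens}
    # `or w` keeps the original token if the corrector returns None.
    fixed = {w: (correct_word(w) or w) for w in vocab}
    return [' '.join(fixed[w] for w in tokens) for tokens in tokenized]

# Toy corrector (hypothetical), used only to make the sketch self-contained.
TOY_FIXES = {"welcmoe": "welcome", "speling": "spelling"}
calls = []

def toy_correct(word):
    calls.append(word)  # record how often the corrector is actually invoked
    return TOY_FIXES.get(word, word)

docs = ["Hi, welcmoe to speling.", "Welcmoe again, welcmoe!"]
corrected = correct_corpus(docs, toy_correct)
```

With the real libraries you would pass spell.correction (pyspellchecker) or the Speller instance (autocorrect) as correct_word. How much this helps depends on how repetitive the vocabulary is, so it is worth timing on a sample of your data.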