I am doing an exercise where I have to find the incorrectly spelled entries in a text dataset using Python. I have checked multiple blogs, but all of them show how to autocorrect misspellings. I don't want to autocorrect them; I just want to separate the misspelled entries from the rest of the dataset.
Sample Dataset:
1. Kurtas for women
2. parti wear dresses
3. denim jeans
4. overcot
Expected Output:
1. parti wear dresses
2. overcot
With pyspellchecker, you can check each line for unknown words and, if it contains any, write that line to a new file. Optionally, you can also add custom words (such as Kurtas) to the dictionary with load_words so that they are not flagged as misspelled.
# pip install pyspellchecker
from spellchecker import SpellChecker

sp = SpellChecker()  # language="en" by default

# add custom words that should not be flagged, if needed
sp.word_frequency.load_words(["Kurtas"])

# parenthesized context managers require Python 3.10+
with (
    open("file.txt", "r") as in_f,
    open("newf.txt", "w") as out_f,
):
    for line in in_f:
        # unknown() returns the set of words missing from the dictionary
        if sp.unknown(line.split()):
            out_f.write(line)
Output (newf.txt):
parti wear dresses
overcot
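One caveat: splitting on whitespace leaves punctuation attached to words, so a line such as "denim jeans," could be flagged only because of the trailing comma. The sketch below is one possible way to handle that, and it also reports which words were unknown on each line. The names `flag_lines`, `find_unknown` and `WORD_RE` are hypothetical, chosen for illustration; `find_unknown` is assumed to behave like `SpellChecker.unknown`, taking a list of words and returning the ones it does not recognise.

```python
import re
from typing import Callable, Iterable

# Runs of ASCII letters, optionally with internal apostrophes (e.g. "women's").
# This is an assumption about what counts as a word; adjust for your data.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def flag_lines(
    lines: Iterable[str],
    find_unknown: Callable[[list[str]], set[str]],
) -> list[tuple[str, set[str]]]:
    """Return (line, unknown_words) for every line with an unknown word.

    `find_unknown` is assumed to work like pyspellchecker's
    `SpellChecker.unknown`: given a list of words, it returns the
    subset that is not in the dictionary.
    """
    flagged = []
    for line in lines:
        # Extract words only, so punctuation never reaches the checker.
        words = WORD_RE.findall(line)
        unknown = set(find_unknown(words))
        if unknown:
            flagged.append((line.rstrip("\n"), unknown))
    return flagged
```

With the setup from the snippet above, passing `in_f` as `lines` and `sp.unknown` as `find_unknown` should return each flagged line together with the words that triggered it, which you can then write to a file or inspect directly.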