python machine-learning nlp word2vec autocorrect

Get a dictionary of misspelled words in a dataframe


I am working on a sentiment analysis problem. I tried to use autocorrect, but given the size of the corpus it requires a lot of computing power, which I don't have access to. So I came up with a different approach: create a dictionary of {key = 'incorrect', value = 'correct'} and then manually correct all the words.

The problem is: how do I get that dictionary of misspelled words from the dataframe? Is this link the same as the solution to my problem? (Rather than misspelled words, should I look for OOV words?)

If not, please suggest a better method.

Code used for autocorrect:

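# Notebook cell: install the package first
!pip install autocorrect
from autocorrect import spell
# Run spell() on every whitespace-separated token of every row
train['text'] = [' '.join([spell(i) for i in x.split()]) for x in train['text']]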

Solution

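  • How many ways can you spell a word correctly? Only one.

    Now, how many ways can you spell a word incorrectly? I would say infinitely many.

    This answers your question:

    "Rather than misspelled words should I look for OOV words?"

    Yes: you cannot list every possible misspelling in advance, but you can list the words you do know and treat everything else as OOV.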

    Now, how then can you get the features if they are misspelled? One way is to use the Levenshtein distance (minimum edit distance): compare a misspelled word to the words in your dictionary and check whether the distance to any of them is small. That is probably what is behind the autocorrect package. You can check some more information about it in this link.

    So, in short, you probably have to either discard OOV words or spend some computational resources on them, since a computer cannot "guess" the intended word without doing some computation.
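To make the idea concrete, here is a minimal sketch of the OOV plus edit-distance approach. It is not what the autocorrect package does internally. The function names (`levenshtein`, `oov_counts`, `suggest_corrections`), the toy vocabulary, the sample texts, and the `max_distance=2` threshold are all assumptions made for illustration. The distance is computed only once per unique OOV word, so the cost depends on the number of distinct unknown words and the vocabulary size rather than on the number of tokens in the corpus.

```python
from collections import Counter
import re


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def oov_counts(texts, vocabulary):
    """Count every lowercase token that is not in the vocabulary."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in vocabulary:
                counts[token] += 1
    return counts


def suggest_corrections(oov_words, vocabulary, max_distance=2):
    """Map each OOV word to its closest vocabulary word, if one is close enough."""
    candidates = sorted(vocabulary)  # fixed order so ties resolve deterministically
    suggestions = {}
    for word in oov_words:
        best, distance = min(((v, levenshtein(word, v)) for v in candidates),
                             key=lambda pair: pair[1])
        if distance <= max_distance:
            suggestions[word] = best
    return suggestions


# Hypothetical toy data; any iterable of strings works as `texts`,
# so a column such as train['text'] could be passed instead.
vocabulary = {"the", "movie", "was", "great", "terrible", "acting"}
texts = ["The moive was graet", "the acting was terible", "graet movie xyzzyq"]

counts = oov_counts(texts, vocabulary)                  # which unknown words occur, and how often
corrections = suggest_corrections(counts, vocabulary)   # {incorrect: correct} candidates
cleaned = [" ".join(corrections.get(tok, tok) for tok in text.lower().split())
           for text in texts]
```

The resulting `corrections` dictionary is only a list of candidates. Reviewing it by hand, starting with the most frequent entries in `counts`, fits the manual-correction plan from the question, and words with no close match (like `xyzzyq` here) can be discarded or left as they are.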