Tags: nlp, spell-checking, misspelling

How to handle misspelled words in documents for text mining tasks?


I have a set of informal documents (a couple of thousand) to which I want to apply topic modeling (MALLET). The problem is that the documents contain a considerable number of misspelled words. Most are intentional, such as short forms and local lingo like `'juz' -> 'just'` and `'alr' -> 'already'`. Several variations of these exist, owing to the different authors' peculiar styles of writing.

After feeding them to MALLET, I was somewhat bothered to find that one of the generated topics is actually a set of misspelled stopwords. I believe these words are mostly used in a small subset of documents from the same author, hence MALLET picked them up.

My question is: should I spell-check and correct these misspelled words, and perhaps save the corrected text somewhere, before conducting further tasks on them? I suppose this would mean that I need to manually verify the corrections before committing them, right? What would be the most "efficient" way to do this?

Or should I simply ignore these misspelled words?
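For concreteness, the kind of correction pass I have in mind is a simple dictionary-based normalization, sketched below. The mapping holds only the two examples above; any real table would be hand-built per corpus and manually verified before I commit the corrected text:

```python
import re

# Hand-built mapping from known short forms / local lingo to standard words.
# Placeholder: only the two examples from above; the real table would be
# assembled per corpus and manually verified before committing.
CORRECTIONS = {
    "juz": "just",
    "alr": "already",
}

def normalize(text: str) -> str:
    """Replace whole-word occurrences of known misspellings."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        return CORRECTIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z']+", fix, text)

print(normalize("I juz finished it, alr told him."))
# -> I just finished it, already told him.
```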


Solution

  • What do you do with stopwords at the moment? If you are doing topic modelling, it would make sense to filter them out; if so, why not filter out these misspelled terms too? (A sketch of this appears at the end of this answer.)

    [Edit in response to reply]

    There is some research on handling stopwords within LDA in a more principled way. Two papers spring to mind:

    1. Term Weighting Schemes for Latent Dirichlet Allocation
    2. Rethinking LDA: Why Priors Matter

    [1] uses a term-weighting scheme which apparently helps in a predictive task they set up; [2] uses an asymmetric prior over the document-topic distributions, which apparently leads to a few topics that absorb the stopwords and other words common to the entire corpus (the second sketch below shows one way to experiment with this).

    It seems to me that automatically inferring stopwords and other non-topic words within LDA is still an open research question.
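    To make the filtering point concrete, here is a minimal sketch of dropping both ordinary stopwords and the corpus-specific misspelled variants during tokenisation, before the documents ever reach MALLET. Both stoplists here are placeholders built from the question's examples; the real misspelled list would come from inspecting the bad topic MALLET produced:

    ```python
    # Extend the usual English stoplist with the corpus-specific misspelled
    # variants, then drop both during tokenisation. Both sets are placeholders.
    STANDARD_STOPWORDS = {"the", "a", "is", "it", "to", "of", "and", "with"}
    MISSPELLED_STOPWORDS = {"juz", "alr"}
    STOPWORDS = STANDARD_STOPWORDS | MISSPELLED_STOPWORDS

    def tokenise(text: str) -> list[str]:
        """Lowercase, split on whitespace, and drop stoplisted tokens."""
        return [t for t in text.lower().split() if t not in STOPWORDS]

    docs = ["It is juz a test", "alr done with the test"]
    print([tokenise(d) for d in docs])
    # -> [['test'], ['done', 'test']]
    ```

    If you would rather filter at import time, MALLET's import commands accept --remove-stopwords and, as far as I recall, an --extra-stopwords file for exactly this kind of corpus-specific list.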
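    And if you want to try the asymmetric-prior behaviour from [2] without implementing it yourself, gensim's LdaModel exposes an asymmetric document-topic prior. A rough sketch on a made-up toy corpus (gensim is my assumption here, purely for illustration; the question's pipeline is MALLET):

    ```python
    from gensim import corpora, models

    # Toy corpus: in practice these would be the tokenised documents.
    texts = [
        ["juz", "went", "market", "juz", "alr"],
        ["alr", "finished", "report", "juz"],
        ["market", "prices", "report", "trends"],
    ]
    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]

    # alpha='asymmetric' uses a fixed asymmetric document-topic prior; with
    # enough topics, high-frequency "stopword-like" terms tend to collect in
    # a small number of topics rather than polluting all of them.
    lda = models.LdaModel(bow, num_topics=3, id2word=dictionary,
                          alpha="asymmetric", passes=10, random_state=0)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)
    ```

    MALLET can also optimise its hyperparameters during training (the --optimize-interval option on train-topics, if I remember the name correctly), which in practice yields asymmetric priors and is probably the easiest way to try this on your existing pipeline.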