nlp, text-classification, naive-bayes, countvectorizer, perplexity

How to compute perplexity in text classification?


I'm doing dialect text classification with scikit-learn, Naive Bayes, and CountVectorizer. So far I'm only classifying 3 dialects. I'm going to add a new one (or actually, the formal language behind those dialects). The problem is that the new text I'm going to add shares a lot of words with the other 3 dialects. So I read the following in a research paper:

We train an n-gram model for each dialect from the collected data. To train the MSA model, we select sentences from Arabic UN corpus and news collections. All the dialect and MSA models share the same vocabulary, thus perplexity can be compared properly. At classification time, given an input sentence, the classifier computes the perplexity for each dialect type and choose the one with minimum perplexity as the label.

By MSA they mean Modern Standard Arabic, which is the formal language behind those dialects. How are they calculating the perplexity? Are they just using Naive Bayes, or is there more to it?


Solution

  • From what I see here, the quoted work is not using a Naive Bayes classifier at all; the approach is different from what you're suggesting.

    The approach proposed there is to train an individual n-gram language model for each dialect to be classified (including MSA). To classify a given input, the text is scored with each language model. The lower the perplexity an LM assigns, the higher the probability. So if the LM trained on dialect A assigns a lower perplexity (i.e. a higher probability) to the input than the LM for dialect B does, the input is more likely to be in dialect A. A code sketch of this pipeline is given at the end of this answer.

    Perplexity is the inverse probability of some text normalized by the number of words (source).

    For a sentence W,
    Perplexity(W) = P(W)^(-1/N), where N is the number of words in the sentence, and P(W) is the probability of W according to an LM.
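
    As a quick numeric check of the formula, here is a toy computation in Python; the sentence length N = 4 and the probability P(W) = 1/256 are made-up values, not from the paper:

        # Toy numbers: a 4-word sentence W assumed to have LM probability 1/256.
        N = 4
        p_w = 1 / 256
        perplexity = p_w ** (-1 / N)  # (1/256)^(-1/4) = 256^(1/4)
        print(perplexity)             # 4.0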

    In short, the perplexity of the input under each language model is computed, the values are compared, and the dialect whose model yields the minimum perplexity is chosen as the label.
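
    Below is a minimal sketch of that pipeline using NLTK's nltk.lm module (assumes NLTK >= 3.4). To be clear, this is my illustration, not the paper's code: the toy corpora, transliterated tokens, dialect names, and the choice of Laplace-smoothed bigram models are all placeholder assumptions. It builds one shared vocabulary across all models so that, as the quote notes, the perplexities are comparable:

        # Sketch only: toy corpora with transliterated placeholder tokens,
        # Laplace-smoothed bigram LMs, one per dialect plus one for MSA.
        from nltk.lm import Laplace, Vocabulary
        from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
        from nltk.util import ngrams

        N = 2  # bigram models, purely as an illustration

        # Hypothetical training data: tokenized sentences per dialect/MSA.
        corpora = {
            "dialect_A": [["kifak", "shu", "akhbarak"]],
            "dialect_B": [["ezayak", "amel", "eh"]],
            "MSA":       [["kayfa", "haluka", "alyawm"]],
        }

        # One shared vocabulary over all (padded) training tokens, so that
        # perplexities from different models can be compared, as in the quote.
        all_tokens = [tok for sents in corpora.values()
                      for sent in sents
                      for tok in pad_both_ends(sent, n=N)]
        shared_vocab = Vocabulary(all_tokens, unk_cutoff=1)

        # Train one language model per dialect on the shared vocabulary.
        models = {}
        for dialect, sentences in corpora.items():
            train_ngrams, _ = padded_everygram_pipeline(N, sentences)
            lm = Laplace(N, vocabulary=shared_vocab)
            lm.fit(train_ngrams)
            models[dialect] = lm

        def classify(tokens):
            """Return the dialect whose LM assigns the lowest perplexity."""
            test_ngrams = list(ngrams(pad_both_ends(tokens, n=N), N))
            return min(models, key=lambda d: models[d].perplexity(test_ngrams))

        print(classify(["kayfa", "haluka"]))  # expected: "MSA" on this toy data

    Whichever model reports the lowest perplexity supplies the label, which is exactly the decision rule described in the quote. A Naive Bayes classifier with a CountVectorizer, as in your current setup, is a different model family; to reproduce this approach you would train per-dialect language models instead.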