python pandas nltk frequency-distribution

Frequency distribution is not returning words but letters


I am trying to find which words appear most often, but each time I run FreqDist it returns the most common letters instead of the most common words.

FreqDist({' ': 496, 'e': 306, 't': 205, 'a': 182, 's': 181, 'n': 160, 'o': 146, 'r': 142, 'i': 118, 'l': 110, ...})

Here is my code:

newdf['tokens1'] = newdf['review'].apply(word_tokenize)
newdf['tokens1'] = newdf['tokens1'].apply(str)

for i in range(newdf.shape[1]):
    # Add each comment.
    review_comments = review_comments + newdf['tokens1'][i]
from nltk.probability import FreqDist
fdist = FreqDist(review_comments)
fdist

returns

FreqDist({' ': 496, 'e': 306, 't': 205, 'a': 182, 's': 181, 'n': 160, 'o': 146, 'r': 142, 'i': 118, 'l': 110, ...})

Solution

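  • review_comments is a single string, and FreqDist counts the items of whatever iterable it is given, so a string is counted character by character. You first need to split it into words with word_tokenize: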

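    from nltk.probability import FreqDist
    from nltk.tokenize import word_tokenize

    # Split the combined review text into word tokens before counting.
    tokens = word_tokenize(review_comments)
    fdist = FreqDist(tokens)
    fdist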
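
To see why the original code counts letters, here is a minimal sketch using only the standard library. FreqDist subclasses collections.Counter, so Counter stands in for it here. The token_lists data is made up for illustration and plays the role of the tokenized newdf['review'] column. None of these names come from the question.

```python
from collections import Counter

# Hypothetical stand-in for the tokenized reviews: one list of tokens per row,
# similar to what word_tokenize would produce for each review.
token_lists = [
    ["the", "food", "was", "good"],
    ["the", "service", "was", "slow"],
]

# Converting each token list to str and concatenating yields one long string.
# Counting a string iterates over its characters, which reproduces the problem.
as_string = "".join(str(tokens) for tokens in token_lists)
char_counts = Counter(as_string)

# Counting a flat list of tokens gives word frequencies instead.
all_tokens = [token for tokens in token_lists for token in tokens]
word_counts = Counter(all_tokens)

print(char_counts.most_common(3))
print(word_counts.most_common(2))
```

With the real data, the same idea would be to drop the .apply(str) step, flatten the token lists from the column, and pass the result to FreqDist. Note also that newdf.shape[1] is the number of columns, so a loop over rows would normally use newdf.shape[0] or len(newdf).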