python nltk word-frequency find-occurrences

Print the 10 most frequently occurring words of a text, including and excluding stopwords


I took the question from here and made some changes. I have the following code:

from nltk.corpus import stopwords

def content_text(text):
    # Keep only the tokens that are not English stopwords
    stop_words = stopwords.words('english')
    content = [w for w in text if w.lower() not in stop_words]
    return content

How can I print the 10 most frequently occurring words of a text, (1) including and (2) excluding stopwords?


Solution

  • NLTK has a FreqDist class that counts how often each token occurs:

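    import nltk

    # `text` is assumed to be the raw input string
    allWords = nltk.tokenize.word_tokenize(text)
    # Distribution over all lowercased tokens, stopwords included
    allWordDist = nltk.FreqDist(w.lower() for w in allWords)

    stopwords = nltk.corpus.stopwords.words('english')
    # Lowercase before the stopword test so that "The" is filtered like "the"
    allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)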
    

    To extract the 10 most common words:

    # most_common(10) returns a list of (word, count) tuples, not a dict
    mostCommon = [word for word, count in allWordDist.most_common(10)]
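
If you want to try the idea without downloading the NLTK data first, here is a minimal sketch using only the standard library. In NLTK 3, FreqDist is a subclass of collections.Counter, so most_common() behaves the same way and returns (word, count) pairs. The sample text, the tiny stopword set, the regex tokenizer, and the top_words helper are all made up for illustration; with NLTK you would use word_tokenize and stopwords.words('english') instead.

```python
import re
from collections import Counter

# Hypothetical sample text and a tiny hand-picked stopword set, used only
# so the sketch runs without any NLTK downloads.
SAMPLE_TEXT = "The cat and the dog saw the bird, and the bird saw the cat."
SAMPLE_STOPWORDS = {"the", "and", "a", "of", "to", "in"}


def top_words(text, n=10, stopwords=None):
    """Return up to n of the most frequent lowercased words as (word, count) pairs."""
    # Simple regex tokenizer standing in for nltk.tokenize.word_tokenize.
    words = re.findall(r"[a-z']+", text.lower())
    if stopwords is not None:
        # Words are already lowercased, so the stopword test is case-insensitive.
        words = [w for w in words if w not in stopwords]
    return Counter(words).most_common(n)


including = top_words(SAMPLE_TEXT)
excluding = top_words(SAMPLE_TEXT, stopwords=SAMPLE_STOPWORDS)

print("Including stopwords:", including)
print("Excluding stopwords:", excluding)
```

Both calls return a list of tuples, so if you only need the words themselves you can unpack them with a list comprehension such as [word for word, count in including].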