I am looking to remove stopwords from my text to optimise my frequency distribution results.
My initial frequency distribution code is:
# Determine the frequency distribution
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

tokens = word_tokenize(review_comments)
fdist = FreqDist(tokens)
fdist
This returns
FreqDist({"'": 521, ',': 494, "'the": 22, 'a': 16, "'of": 16, "'is": 12, "'to": 10, "'for": 9, "'it": 8, "'that": 8, ...})
I want to remove the stopwords with the following code
# Keep only alphanumeric items (to eliminate punctuation marks, etc.).
filtered = [word for word in review_comments if word.isalnum()]
# Remove all the stopwords
# Download the stopword list.
nltk.download('stopwords')
from nltk.corpus import stopwords
# Create a set of English stopwords.
english_stopwords = set(stopwords.words('english'))
# Create a filtered list of tokens without stopwords.
filtered2 = [x for x in filtered if x.lower() not in english_stopwords]
# Define an empty string variable.
filtered2_string = ''
for value in filtered:
    # Add each filtered token word to the string.
    filtered2_string = filtered2_string + value + ''
Now I run the fdist again
from nltk.tokenize import word_tokenize
trial = word_tokenize(filtered2_string)
fdist1 = FreqDist(trial)
fdist1
This returns:
FreqDist({'whenitcomestoadmsscreenthespaceonthescreenitselfisatanabsolutepremiumthefactthat50ofthisspaceiswastedonartandnotterriblyinformativeorneededartaswellmakesitcompletelyuselesstheonlyreasonthatigaveit2starsandnot1wasthattechnicallyspeakingitcanatleaststillstanduptoblockyournotesanddicerollsotherthanthatitdropstheballcompletelyanopenlettertogaleforce9yourunpaintedminiaturesareverynotbadyourspellcardsaregreatyourboardgamesaremehyourdmscreenshoweverarefreakingterribleimstillwaitingforasinglescreenthatisntpolluted': 1})
For reference, review_comments was built by concatenating the comments from my DataFrame:
review_comments = ''
for i in range(newdf.shape[1]):
    # Add each comment.
    review_comments = review_comments + newdf['tokens1'][i]
How do I keep the spaces when the stopwords are removed so that the words are counted individually?
I removed the stopwords and reran the frequency distribution, hoping to get the most frequent words.
Cleaning in NLP tasks is generally performed on tokens rather than on the characters of a string, so that you can leverage the built-in functionality/methods. (You can always do this from scratch with your own logic on characters if you need to.) The stopwords in nltk are themselves tokens, meant to be matched against the tokens of your text corpus. In your code, filtered is built by iterating over the string review_comments, so it contains single characters rather than words, and the loop then concatenates them with an empty string instead of a space, which is why everything collapses into one long token. Filter the tokens instead, and add any extra tokens you want to eliminate to the stopword set. For example, if you want the English stopwords and punctuation removed, do something like:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

tokens = word_tokenize(review_comments)

## Add any additional punctuation/words you want to eliminate here, like below
english_stop_plus_punct = set(stopwords.words('english') + ["call"] +
                              list(string.punctuation + "“”’"))
filtered2 = [x for x in tokens if x.lower() not in english_stop_plus_punct]

fdist1 = nltk.FreqDist(filtered2)
fdist1
#### FreqDist({'presence': 3, 'meaning': 2, 'might': 2, 'Many': 1, 'psychologists': 1, 'knowing': 1, 'life': 1, 'drive': 1, 'look': 1, ...})
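If you only want the top few words rather than the full distribution, FreqDist behaves like collections.Counter, so you can use most_common:
# Show the 10 most frequent tokens after cleaning
fdist1.most_common(10)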
Example text from a write-up on "meaning of life":
review_comments = """ Many psychologists call knowing your life’s meaning “presence,” and the drive to look for it “search.” They are not mutually exclusive: You might or might not search, whether you already have a sense of meaning or not. Some people low in presence don’t bother searching—they are “stuck.” Some are high in presence but keep searching—we can call them “seekers.” """