Tags: python, nltk, tokenize, frequency, stop-words

Removing stopwords also removes spaces between words during frequency distribution


I am looking to remove stopwords from my text to optimise my frequency distribution results.

My initial frequency distribution code is:

# Determine the frequency distribution.
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

tokens = word_tokenize(review_comments)
fdist = FreqDist(tokens)
fdist

This returns

FreqDist({"'": 521, ',': 494, "'the": 22, 'a': 16, "'of": 16, "'is": 12, "'to": 10, "'for": 9, "'it": 8, "'that": 8, ...})

I want to remove the stopwords with the following code:

# Delete anything that is not alphanumeric.
# Filter out tokens that are neither letters nor numbers (to eliminate punctuation marks, etc.).
filtered = [word for word in review_comments if word.isalnum()]

# Remove all the stopwords
# Download the stopword list.
nltk.download('stopwords')
from nltk.corpus import stopwords

# Create a set of English stopwords.
english_stopwords = set(stopwords.words('english'))

# Create a filtered list of tokens without stopwords.
filtered2 = [x for x in filtered if x.lower() not in english_stopwords]

# Define an empty string variable.
filtered2_string = ''

for value in filtered:
    # Add each filtered token word to the string.
    filtered2_string = filtered2_string + value + ''
    

Now I run the FreqDist again:

from nltk.tokenize import word_tokenize
trial = nltk.word_tokenize(filtered2_string)
fdist1 = FreqDist(trial)
fdist1

This returns

FreqDist({'whenitcomestoadmsscreenthespaceonthescreenitselfisatanabsolutepremiumthefactthat50ofthisspaceiswastedonartandnotterriblyinformativeorneededartaswellmakesitcompletelyuselesstheonlyreasonthatigaveit2starsandnot1wasthattechnicallyspeakingitcanatleaststillstanduptoblockyournotesanddicerollsotherthanthatitdropstheballcompletelyanopenlettertogaleforce9yourunpaintedminiaturesareverynotbadyourspellcardsaregreatyourboardgamesaremehyourdmscreenshoweverarefreakingterribleimstillwaitingforasinglescreenthatisntpolluted': 1})

For reference, review_comments is built by concatenating the comments from my dataframe:

review_comments = ''
for i in range(newdf.shape[1]):
    # Add each comment.
    review_comments = review_comments + newdf['tokens1'][i]


How do I remove the stopwords without also removing the spaces between words, so that each word is counted individually?




I removed the stopwords and reran the frequency distribution, hoping to get the most frequent words.

Solution

  • Cleaning in NLP tasks is generally performed on tokens rather than on the characters of a string, so that you can leverage the built-in functionality/methods. However, you can always do this from scratch with your own logic on characters if you need to. The stopwords in nltk are provided as tokens, to be used for cleaning up your text corpus. You can add any further tokens you need to eliminate to the list. For example, if you want the English stopwords and punctuation removed, do something like:

    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    tokens = word_tokenize(review_comments)

    ## Add any additional punctuation/words you want to eliminate here, like below
    english_stop_plus_punct = set(stopwords.words('english') + ["call"] +
                                  list(string.punctuation + "“”’"))

    filtered2 = [x for x in tokens if x.lower() not in english_stop_plus_punct]

    fdist1 = nltk.FreqDist(filtered2)
    fdist1

    #### FreqDist({'presence': 3, 'meaning': 2, 'might': 2, 'Many': 1, 'psychologists': 1, 'knowing': 1, 'life': 1, 'drive': 1, 'look': 1, ...})
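
    Since FreqDist subclasses collections.Counter, you can also pull out just the top entries once the cleaning is done, for example:

    # Show the ten most frequent tokens after cleaning.
    fdist1.most_common(10)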
    

    Example text from a write-up on "meaning of life":

    review_comments = """ Many psychologists call knowing your life’s meaning “presence,” and the drive to look for it “search.” They are not mutually exclusive: You might or might not search, whether you already have a sense of meaning or not. Some people low in presence don’t bother searching—they are “stuck.” Some are high in presence but keep searching—we can call them “seekers.” """