Tags: python, word-cloud

WordCloud: problem with displaying bigrams


I want to implement a word cloud from scraped Twitter data. The problem is that the word "states" occurs 214 times, while "state" occurs 64 times. There is only one tweet in which the combination "United States" occurs. Despite this, my word cloud is formed with this combination instead of the individual words.

My code to generate the word cloud:

raw_tweets = []

STOPWORDS = [
    'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
    'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
    'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
    'themselves', 'what', 'which', 'who', 'would', 'whom', 'this', 'that', 'these', 'those',
    'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
    'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
    'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
    'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
    'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
    'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
    'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
    'very', 't', 'can', 'will', 'just', 'don', 'should', 'now'
]

for tweet in df['Tweet']:
    raw_tweets.append(tweet)

raw_string = ''.join(raw_tweets)
no_links = re.sub(r'http\S+', '', raw_string)
no_unicode = re.sub(r"\\[a-z][a-z]?[0-9]+", '', no_links)
no_special_characters = re.sub('[^A-Za-z ]+', '', no_unicode)

words = no_special_characters.split(" ")
words = [w for w in words if len(w) > 2]
words = [w.lower() for w in words]

import numpy as np
import matplotlib.pyplot as plt
import re
from PIL import Image
from wordcloud import WordCloud
from IPython.display import Image as im

mask = np.array(Image.open('Logo_location')) 

wc = WordCloud(background_color="white", max_words=2000, mask=mask, stopwords=STOPWORDS, relative_scaling=1)
wc.generate(','.join(words))

f = plt.figure(figsize=(13,13))
plt.imshow(wc, interpolation='bilinear')
plt.title('Twitter Generated Cloud', size=30)
plt.axis("off")
plt.show()

(image: generated word cloud)


Solution

  • ... the word "states" occurs 214 times, while "state" - 64 times. There is only one tweet in which the combination of the words "United States" occurs. Despite this, my word cloud is formed with this combination.

    You are generating a word cloud, not a keyphrase cloud. "United" and "States" just happen to land side by side for this particular mask; the outcome might be different for a different mask. Also, you are performing .split(" ") on tweets joined with ''.join(), so the output is already a list of words. (I would highly suggest you use ' '.join() instead; otherwise, the end of one tweet fuses with the beginning of the next.)
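    A minimal illustration of the fusion problem, using two made-up tweets:

    ```python
    # With ''.join, the last word of one tweet fuses with the first word of
    # the next, creating a bogus token that exists in no tweet.
    tweets = ["I live in the states", "United States of America"]

    fused = ''.join(tweets)       # "I live in the statesUnited States of America"
    separated = ' '.join(tweets)  # "I live in the states United States of America"

    print('statesUnited' in fused.split())      # True - a fused bogus token
    print('statesUnited' in separated.split())  # False
    ```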

    Your current code doesn't include two-word phrases such as "United States". If you want to include them, you could build them from adjacent words:

    phrases = [words[i]+' '+words[i+1] for i in range(0, len(words)-1)]
    

    If you want to exclude phrases that occur only once:

    unique_phrases = set(phrases)
    repeated_phrases = []
    joined = " ".join(words)
    for phrase in unique_phrases:
        # Note: this is substring counting, which can overcount if a phrase
        # happens to appear inside a longer run of characters.
        if joined.count(phrase) > 1:
            repeated_phrases.append(phrase)
    

    Combined, for the input of:

    tweets = ["I live in the states", "Stack Overflow", "United States of America", "Stack" ,"United States", "Overflow", "State", "States", "States of America"]
    

    The output would be:

    repeated_phrases = ['states of', 'united states', 'of america']
    

    Lastly, if you concatenate words and repeated_phrases and then generate a word cloud, the output will include both "State" and "United States". You'll want to tune the repeated-phrase threshold: 1 is too low in general, but it works for my short example.
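    The whole approach can be sketched as a function that builds a combined frequency dict (a sketch in plain Python; the helper name and threshold parameter are my own, and the resulting dict could be fed to WordCloud's generate_from_frequencies instead of generate):

    ```python
    from collections import Counter

    def combined_frequencies(words, min_phrase_count=2):
        """Count single words and adjacent word pairs, keeping only pairs
        that occur at least min_phrase_count times (a tunable threshold)."""
        word_counts = Counter(words)
        phrases = [words[i] + ' ' + words[i + 1] for i in range(len(words) - 1)]
        phrase_counts = Counter(phrases)
        freqs = dict(word_counts)
        for phrase, n in phrase_counts.items():
            if n >= min_phrase_count:
                freqs[phrase] = n
        return freqs

    words = "united states of america united states state".split()
    freqs = combined_frequencies(words)
    # "united states" occurs twice, so it survives the threshold,
    # while one-off pairs such as "states of" are dropped.
    print(freqs['united states'])  # 2
    print(freqs['state'])          # 1
    ```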

    Edit: the docs mention the collocations parameter, which generates bigrams for the given input (it is enabled by default). You could also pass your words as wc.generate(" ".join(words)), which would generate bigrams out of the box, but there will still be a whole lot of visually small and meaningless bigrams such as "of states", etc.
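    One way to cut down those meaningless bigrams (a sketch in plain Python, separate from the WordCloud API; the stopword set here is a small subset for illustration) is to drop any pair that contains a stopword before counting:

    ```python
    # Keep only bigrams where neither word is a stopword, which removes
    # visually small pairs like "of states" while keeping "united states".
    STOPWORDS = {'of', 'the', 'in', 'a', 'an'}  # illustrative subset

    def meaningful_bigrams(words):
        pairs = zip(words, words[1:])
        return [w1 + ' ' + w2 for w1, w2 in pairs
                if w1 not in STOPWORDS and w2 not in STOPWORDS]

    words = ['united', 'states', 'of', 'america']
    print(meaningful_bigrams(words))  # ['united states']
    ```

    "states of" and "of america" are filtered out because they contain "of", so only the informative pair survives.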