
Emoji count and analysis using python pandas


I am working on a sentiment analysis project, and many of the comments contain emojis.

I would like to know whether my code is correct and whether it can be optimized.

Code to count the smileys

import pandas as pd
import regex as re
import emoji

# Assuming your DataFrame is called 'df' and the column with comments is 'Document'
comments = df['Document']

# Initialize an empty dictionary to store smiley counts and types
smiley_data = {'Smiley': [], 'Count': [], 'Type': []}

# Define a regular expression pattern to match smileys
pattern = r'([\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF])'

# Iterate over the comments
for comment in comments:
    # Extract smileys and their types from the comment
    smileys = re.findall(pattern, comment)
    
    # Increment the count and store the smileys and their types
    for smiley in smileys:
        if smiley in smiley_data['Smiley']:
            index = smiley_data['Smiley'].index(smiley)
            smiley_data['Count'][index] += 1
        else:
            smiley_data['Smiley'].append(smiley)
            smiley_data['Count'].append(1)
            smiley_data['Type'].append(emoji.demojize(smiley))
            
# Create a DataFrame from the smiley data
smiley_df = pd.DataFrame(smiley_data)

# Sort the DataFrame by count in descending order
smiley_df = smiley_df.sort_values(by='Count', ascending=False)

# Print the smiley data
smiley_df

What I am least sure about is whether the pattern below captures all the smileys:

# Define a regular expression pattern to match smileys
pattern = r'([\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF])'

I would also like to know what else I can do with this analysis. Is there something to build on top of it, some charts maybe?

I am also sharing a test dataset that generates smiley counts similar to those in my real data. Note that the test dataset contains only the smileys I already know about; anything else that appears in the real data will not be in it.

Test Dataset

import random
import pandas as pd

smileys = ['👍', '👌', '😍', '🏻', '😊', '🙂', '👎', '😃', '🏼', '💩']

# Additional smileys to complete the required count
additional_smileys = ['😄', '😎', '🤩', '😘', '🤗', '😆', '😉', '😋', '😇', '🥳', '🙌', '🎉', '🔥', '🥰', '🤪', '😜', '🤓',
                      '😚', '🤭', '🤫', '😌', '🥱', '🥶', '🤮', '🤡', '😑', '😴', '🙄', '😮', '🤥', '😢', '🤐', '🙈', '🙊',
                      '👽', '🤖', '🦄', '🐼', '🐵', '🦁', '🐸', '🦉']

# Combine the required smileys and additional smileys
all_smileys = smileys + additional_smileys

# Set a random seed for reproducibility
random.seed(42)

# Generate a single review
def generate_review(with_smiley=False):
    review = "This movie"
    if with_smiley:
        review += " " + random.choice(all_smileys)
    review += " is "
    review += random.choice(["amazing", "excellent", "fantastic", "brilliant", "great", "good", "okay", "average",
                             "mediocre", "disappointing", "terrible", "awful", "horrible"])
    review += random.choice(["!", "!!", "!!!", ".", "..", "..."]) + " "
    review += random.choice(["Highly recommended", "Definitely worth watching", "A must-see", "I loved it",
                             "Not worth your time", "Skip it"]) + random.choice(["!", "!!", "!!!"])
    return review

# Generate the random dataset
def generate_dataset():
    dataset = []
    review_count = 5000

    # Generate reviews with top smileys
    for smiley, count, _ in top_smileys:
        while count > 0:
            review = generate_review(with_smiley=True)
            if smiley in review:
                dataset.append(review)
                count -= 1

    # Generate reviews with additional smileys
    additional_smileys_count = len(additional_smileys)
    additional_smileys_per_review = review_count - len(dataset)
    additional_smileys_per_review = min(additional_smileys_per_review, additional_smileys_count)

    for _ in range(additional_smileys_per_review):
        review = generate_review(with_smiley=True)
        dataset.append(review)

    # Generate reviews without smileys
    while len(dataset) < review_count:
        review = generate_review()
        dataset.append(review)

    # Shuffle the dataset
    random.shuffle(dataset)
    return dataset

# List of top smileys and their counts
top_smileys = [
    ('👍', 331, ':thumbs_up:'),
    ('👌', 50, ':OK_hand:'),
    ('😍', 41, ':smiling_face_with_heart-eyes:'),
    ('🏻', 38, ':light_skin_tone:'),
    ('😊', 35, ':smiling_face_with_smiling_eyes:'),
    ('🙂', 14, ':slightly_smiling_face:'),
    ('👎', 12, ':thumbs_down:'),
    ('😃', 12, ':grinning_face_with_big_eyes:'),
    ('🏼', 10, ':medium-light_skin_tone:'),
    ('💩', 10, ':pile_of_poo:')
]

# Generate the dataset
dataset = generate_dataset()

# Create a data frame with 'Document' column
df = pd.DataFrame({'Document': dataset})

# Display the DataFrame
df

Thank you in advance!


Solution

  • Update

    If you prefer to use the emoji package, you can do:

    import emoji
    
    text = df['Document'].str.cat(sep='\n')
    out = (pd.DataFrame(emoji.emoji_list(text)).value_counts('emoji')
             .rename_axis('Smiley').rename('Count').reset_index()
             .assign(Type=lambda x: x['Smiley'].apply(emoji.demojize)))
    

    Output:

    >>> out
       Smiley  Count                              Type
    0       👍    331                       :thumbs_up:
    1       👌     50                         :OK_hand:
    2       🏻     41                 :light_skin_tone:
    3       😍     41    :smiling_face_with_heart-eyes:
    4       😊     35  :smiling_face_with_smiling_eyes:
    5       🙂     15           :slightly_smiling_face:
    6       👎     14                     :thumbs_down:
    7       😃     13     :grinning_face_with_big_eyes:
    8       🏼     10          :medium-light_skin_tone:
    9       💩     10                     :pile_of_poo:
    10      😜      3        :winking_face_with_tongue:
    11      🦉      3                             :owl:
    12      🤖      2                           :robot:
    13      😑      2             :expressionless_face:
    14      👽      2                           :alien:
    15      🤫      2                   :shushing_face:
    16      😢      2                     :crying_face:
    17      🤪      2                       :zany_face:
    18      🙈      2              :see-no-evil_monkey:
    19      🙊      2            :speak-no-evil_monkey:
    20      😇      1          :smiling_face_with_halo:
    21      🤮      1                   :face_vomiting:
    22      🤭      1       :face_with_hand_over_mouth:
    23      🤡      1                      :clown_face:
    24      🤗      1    :smiling_face_with_open_hands:
    25      🙄      1          :face_with_rolling_eyes:
    26      😆      1         :grinning_squinting_face:
    27      🐸      1                            :frog:
    28      😮      1            :face_with_open_mouth:
    29      🐼      1                           :panda:
    30      😚      1   :kissing_face_with_closed_eyes:
    31      😎      1    :smiling_face_with_sunglasses:
    32      😘      1             :face_blowing_a_kiss:
    

    You can use str.extractall to avoid the loop, then value_counts to count the number of occurrences. Finally, "demojize" each smiley (the slowest part):

    out = (df['Document'].str.extractall(pattern).value_counts()
                         .rename_axis('Smiley').rename('Count').reset_index()
                         .assign(Type=lambda x: x['Smiley'].apply(emoji.demojize)))
    

    Output:

    >>> out
       Smiley  Count                              Type
    0       👍    331                       :thumbs_up:
    1       👌     50                         :OK_hand:
    2       🏻     41                 :light_skin_tone:
    3       😍     41    :smiling_face_with_heart-eyes:
    4       😊     35  :smiling_face_with_smiling_eyes:
    5       🙂     15           :slightly_smiling_face:
    6       👎     14                     :thumbs_down:
    7       😃     13     :grinning_face_with_big_eyes:
    8       💩     10                     :pile_of_poo:
    9       🏼     10          :medium-light_skin_tone:
    10      😜      3        :winking_face_with_tongue:
    11      😑      2             :expressionless_face:
    12      🙈      2              :see-no-evil_monkey:
    13      😢      2                     :crying_face:
    14      🙊      2            :speak-no-evil_monkey:
    15      👽      2                           :alien:
    16      😎      1    :smiling_face_with_sunglasses:
    17      😘      1             :face_blowing_a_kiss:
    18      😚      1   :kissing_face_with_closed_eyes:
    19      🐸      1                            :frog:
    20      😇      1          :smiling_face_with_halo:
    21      😮      1            :face_with_open_mouth:
    22      😆      1         :grinning_squinting_face:
    23      🙄      1          :face_with_rolling_eyes:
    24      🐼      1                           :panda:
    
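A side note on the :light_skin_tone: and :medium-light_skin_tone: rows: those code points (U+1F3FB to U+1F3FF) are skin-tone modifiers, and they fall inside the \U0001F300-\U0001F5FF range of the pattern, so a single-character class counts them as separate smileys. The sketch below is one possible way to keep a modifier attached to the emoji it follows. It assumes that in the real data the modifier comes directly after a base emoji (in the test dataset it stands alone, so it would simply not be matched here). The names SKIN_TONE, BASE, EMOJI_WITH_TONE and count_emojis are made up for illustration, and it uses only the standard library rather than pandas.

```python
import re
from collections import Counter

# Skin-tone modifiers occupy U+1F3FB..U+1F3FF.
SKIN_TONE = "[\U0001F3FB-\U0001F3FF]"

# Same ranges as the pattern in the question, except that the modifier
# code points are carved out of the U+1F300..U+1F5FF range.
BASE = (
    "[\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F3FA"
    "\U0001F400-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF]"
)

# A base emoji optionally followed by one skin-tone modifier.
EMOJI_WITH_TONE = re.compile(f"{BASE}{SKIN_TONE}?")


def count_emojis(comments):
    """Count emojis, keeping a trailing skin-tone modifier with its base."""
    counts = Counter()
    for comment in comments:
        counts.update(EMOJI_WITH_TONE.findall(comment))
    return counts


# Hypothetical sample comments: thumbs up with and without a light skin tone.
sample = [
    "Great \U0001F44D\U0001F3FB",
    "Nice \U0001F44D",
    "Wow \U0001F44D\U0001F3FB!",
]
counts = count_emojis(sample)
print(counts.most_common())
```

Whether to count the toned and plain variants together or separately depends on the analysis; for sentiment they probably mean the same thing, so stripping the modifier before counting is another reasonable choice.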

    "Is the pattern correct? Am I missing any emoticons?"

    Your pattern is incomplete. I don't know the full list you want to extract, but the code below helps you debug it:

    #     add latin1 codes --v
    pattern2 = '([\\U00000000-\\U000000FF\\U0001F600-\\U0001F64F\\U0001F300-\\U0001F5FF\\U0001F680-\\U0001F6FF\\U0001F1E0-\\U0001F1FF])'
    
    other = df['Document'].str.replace(pattern2, '', regex=True)
    print(other[other != ''])
    
    # Output / Missed emojis
    1149    🤗
    1238    🦉
    1305    🤫
    1424    🤫
    1978    🤭
    2611    🤮
    2623    🦉
    2959    🤡
    3717    🤪
    4045    🦉
    4067    🤖
    4699    🤖
    4975    🤪
    Name: Document, dtype: object
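To see why these are missed, it can help to print their code points. All eight distinct emojis in the output above sit in the U+1F900 to U+1F9FF block (Supplemental Symbols and Pictographs), which the pattern does not include. The sketch below checks this and tries the pattern with that one block added. The names missed, original, extended and the derived lists are illustrative only, and the extended pattern is an assumption about what you want to capture, not a complete emoji pattern.

```python
import re

# The distinct emojis reported as missed in the output above.
missed = [
    "\U0001F917",  # smiling face with open hands
    "\U0001F989",  # owl
    "\U0001F92B",  # shushing face
    "\U0001F92D",  # face with hand over mouth
    "\U0001F92E",  # face vomiting
    "\U0001F921",  # clown face
    "\U0001F92A",  # zany face
    "\U0001F916",  # robot
]

# The character class from the question.
original = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]"
)

# Same class with the U+1F900..U+1F9FF block added.
extended = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF"
    "\U0001F900-\U0001F9FF]"
)

code_points = [f"U+{ord(ch):04X}" for ch in missed]
missed_by_original = [ch for ch in missed if not original.fullmatch(ch)]
missed_by_extended = [ch for ch in missed if not extended.fullmatch(ch)]

print(code_points)
print(len(missed_by_original), len(missed_by_extended))
```

Even the extended class leaves out other blocks that contain emojis, for example Miscellaneous Symbols (U+2600 to U+26FF) and Symbols and Pictographs Extended-A (U+1FA70 to U+1FAFF), so hand-maintained ranges tend to drift out of date. If completeness matters more than speed, the emoji package approach shown in the update is likely the safer option.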