I'm currently studying the basics of data analysis with Python in Colab, and for that I'm using my IMDb watchlist as a dataset.
In the Genres column, a single cell can hold several movie genres (which makes things more difficult). I'm trying to calculate the proportion of each genre in this dataset and then plot it, perhaps with a pie or barh chart.
So I created variables to store the value_counts() of each genre as True or False, as you can see below:
action = df['Genres'].str.contains('Action').value_counts()
animation = df['Genres'].str.contains('Animation').value_counts()
biography = df['Genres'].str.contains('Biography').value_counts()
comedy = df['Genres'].str.contains('Comedy').value_counts()
crime = df['Genres'].str.contains('Crime').value_counts()
drama = df['Genres'].str.contains('Drama').value_counts()
documentary = df['Genres'].str.contains('Documentary').value_counts()
family = df['Genres'].str.contains('Family').value_counts()
fantasy = df['Genres'].str.contains('Fantasy').value_counts()
film_noir = df['Genres'].str.contains('Film-Noir').value_counts()
history = df['Genres'].str.contains('History').value_counts()
horror = df['Genres'].str.contains('Horror').value_counts()
mystery = df['Genres'].str.contains('Mystery').value_counts()
# \b word boundaries keep 'Music' from also matching 'Musical'
music = df['Genres'].str.contains(r'\bMusic\b').value_counts()
musical = df['Genres'].str.contains('Musical').value_counts()
romance = df['Genres'].str.contains('Romance').value_counts()
scifi = df['Genres'].str.contains('Sci-Fi').value_counts()
sport = df['Genres'].str.contains('Sport').value_counts()
thriller = df['Genres'].str.contains('Thriller').value_counts()
war = df['Genres'].str.contains('War').value_counts()
western = df['Genres'].str.contains('Western').value_counts()
Then I put these variables into a DataFrame:
genres = pd.DataFrame(
[action, animation, biography,
comedy, crime, drama,
documentary, family, fantasy,
film_noir, history, horror,
mystery, music, musical,
romance, scifi, sport,
thriller, war, western],
)
genres.head(5)
The problem is in the output:
I'd like the first column to display the variable names instead of 'Genres', which is what is shown now. Is that possible?
To avoid using a relatively slow for loop:
Let's suppose we have the following DataFrame:
Genres
0 Comedy, Horror
1 Comedy, Drama, War
2 Mistery, Romance, Thriller
Proposed code
import pandas as pd
# create the original DataFrame
df = pd.DataFrame({'Genres': ['Comedy, Horror', 'Comedy, Drama, War', 'Mistery, Romance, Thriller']})
# split the genres by comma and remove leading spaces
df['Genres'] = df['Genres'].str.split(',').apply(lambda x: [i.strip() for i in x])
# explode the list into separate rows
df = df.explode('Genres')
# Counting Matrix using crosstab method
genre_counts = pd.crosstab(index=df.index, columns=df['Genres'], margins=False).to_dict('index')
genre_counts = pd.DataFrame(genre_counts)
# count the number of 0s and 1s in each row
counts = ( genre_counts.apply(lambda row: [sum(row == 0), sum(row == 1)], axis=1) )
# Final count with 2 columns 'False' and 'True'
counts = pd.DataFrame(counts.tolist(), index=counts.index, columns=['False', 'True'])
print(counts)
Visualization
          False  True
Comedy        1     2
Drama         2     1
Horror        2     1
Mistery       2     1
Romance       2     1
Thriller      2     1
War           2     1
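As a possible shortcut, the same False/True table can be sketched with Series.str.get_dummies, which splits each cell on the separator and builds one 0/1 column per genre. This is an alternative to the explode/crosstab steps above, not a replacement for them, and it assumes the genres are always separated by exactly ', ' (comma plus space). The names dummies, true_counts and proportions are illustrative only. Because each genre is matched as a whole token, 'Music' and 'Musical' would be counted separately.

```python
import pandas as pd

# same sample data as in the example above
df = pd.DataFrame({'Genres': ['Comedy, Horror', 'Comedy, Drama, War', 'Mistery, Romance, Thriller']})

# one 0/1 indicator column per genre (assumes the separator is exactly ', ')
dummies = df['Genres'].str.get_dummies(sep=', ')

# number of rows that contain each genre, indexed by genre name
true_counts = dummies.sum()

# two-column table: rows without the genre, rows with the genre
counts = pd.DataFrame({'False': len(dummies) - true_counts, 'True': true_counts})

# share of rows that contain each genre
proportions = true_counts / len(dummies)

print(counts)
```

If matplotlib is available, proportions.sort_values().plot.barh() should give a horizontal bar chart with the genre names as labels. Since one movie can have several genres, these shares do not sum to 1, so a bar chart is probably a better fit than a pie chart.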