How can I simplify the code below and make it more efficient using chained operations? Currently, I am creating intermediate objects and using a for
loop.
I use this data: https://www.kaggle.com/datasets/gregorut/videogamesales
Pb_Sales = df.groupby('Publisher')[['Global_Sales']].sum().sort_values('Global_Sales',ascending=False)
total = Pb_Sales['Global_Sales'].sum()
Pb_Sales['Proportion']=Pb_Sales['Global_Sales']/total
index_position = 0
threshold = 0.02
index_list = list(Pb_Sales.index)
for num in Pb_Sales['Proportion']:
if num < threshold:
index_list[index_position]= 'Other'
index_position=index_position+1
Pb_Sales.index = index_list
Pb_Sales = Pb_Sales.groupby(Pb_Sales.index).sum().sort_values('Global_Sales',ascending = False)
plt.title("most profitable publisehr")
plt.ylabel("")
chart_Pie= Pb_Sales['Global_Sales'].plot(kind = 'pie',figsize=(10,5),legend = False)
chart_Pie.set_ylabel("")
plt.show()
The approach below renders the visualisation using a chained operation. I don't use for
loops, and no intermediate variables are created in the global namespace.
The returned object is a matplotlib
figure, but you could also modify it to return the data used for the pie chart.
The dataset is relatively small, so I have focused on clarity rather than performance optimisation.
import pandas as pd
df = pd.read_csv('vgsales.csv')
(
df
#All-time sales per publisher
.groupby('Publisher', as_index=False)['Global_Sales'].sum()
#Optionally rename `Global_Sales` since it's now all-time sales
.rename(columns={'Global_Sales': 'alltime_sales'})
#Express as a percentage
.assign(
alltime_sales_percent=
lambda df_: df_['alltime_sales'].div(df_['alltime_sales'].sum()).mul(100)
)
#Publishers whose all-time sales is <2% get named "Other"
.assign(
major_publisher_name=
lambda df_: df_['Publisher'].mask(df_['alltime_sales_percent'] < 2, 'Other')
)
#Optionally view the current df before the final step
# Useful for debugging intermediate steps
.pipe(lambda df_: display(df_) or df_)
#Add up sales over the 'Other' publishers, and sort
.groupby('major_publisher_name')['alltime_sales_percent'].sum()
.sort_values(ascending=False)
#Plot
.plot.pie(
title='All-time sales share of major publishers',
ylabel='', figsize=(5, 5), cmap='tab20b'
)
);