[SOLVED] How to do pandas grouping, filtering, and a pie chart using chained operations?

How to do pandas grouping, filtering, and a pie chart using chained operations?

How can I simplify the code below and make it more efficient using chained operations? Currently, I am creating intermediate objects and using a for loop.

I use this data: https://www.kaggle.com/datasets/gregorut/videogamesales

Pb_Sales = df.groupby('Publisher')[['Global_Sales']].sum().sort_values('Global_Sales',ascending=False)
total = Pb_Sales['Global_Sales'].sum()
Pb_Sales['Proportion']=Pb_Sales['Global_Sales']/total
index_position = 0
threshold = 0.02
index_list = list(Pb_Sales.index)
for num in Pb_Sales['Proportion']:
    if num < threshold:
        index_list[index_position]= 'Other'
    index_position=index_position+1
Pb_Sales.index = index_list

Pb_Sales = Pb_Sales.groupby(Pb_Sales.index).sum().sort_values('Global_Sales',ascending = False)
plt.title("most profitable publisehr")
plt.ylabel("")
chart_Pie= Pb_Sales['Global_Sales'].plot(kind = 'pie',figsize=(10,5),legend = False)
chart_Pie.set_ylabel("")
plt.show()

Solution

The approach below renders the visualisation using a chained operation. I don't use for loops, and no intermediate variables are created in the global namespace.

The returned object is a matplotlib figure, but you could also modify it to return the data used for the pie chart.

The dataset is relatively small, so I have focused on clarity rather than performance optimisation.

import pandas as pd

df = pd.read_csv('vgsales.csv')

(
    df
    
    #All-time sales per publisher
    .groupby('Publisher', as_index=False)['Global_Sales'].sum()

    #Optionally rename `Global_Sales` since it's now all-time sales
    .rename(columns={'Global_Sales': 'alltime_sales'})

    #Express as a percentage
    .assign(
        alltime_sales_percent=
        lambda df_: df_['alltime_sales'].div(df_['alltime_sales'].sum()).mul(100)
    )

    #Publishers whose all-time sales is <2% get named "Other"
    .assign(
        major_publisher_name=
        lambda df_: df_['Publisher'].mask(df_['alltime_sales_percent'] < 2, 'Other')
    )

    #Optionally view the current df before the final step
    # Useful for debugging intermediate steps
    .pipe(lambda df_: display(df_) or df_)

    #Add up sales over the 'Other' publishers, and sort
    .groupby('major_publisher_name')['alltime_sales_percent'].sum()
    .sort_values(ascending=False)

    #Plot
    .plot.pie(
        title='All-time sales share of major publishers',
        ylabel='', figsize=(5, 5), cmap='tab20b'
    )
);