pythonpandasmatplotlib

How to do pandas grouping, filtering, and a pie chart using chained operations?


How can I simplify the code below and make it more efficient using chained operations? Currently, I am creating intermediate objects and using a for loop.

I use this data: https://www.kaggle.com/datasets/gregorut/videogamesales

Pb_Sales = df.groupby('Publisher')[['Global_Sales']].sum().sort_values('Global_Sales',ascending=False)
total = Pb_Sales['Global_Sales'].sum()
Pb_Sales['Proportion']=Pb_Sales['Global_Sales']/total
index_position = 0
threshold = 0.02
index_list = list(Pb_Sales.index)
for num in Pb_Sales['Proportion']:
    if num < threshold:
        index_list[index_position]= 'Other'
    index_position=index_position+1
Pb_Sales.index = index_list

Pb_Sales = Pb_Sales.groupby(Pb_Sales.index).sum().sort_values('Global_Sales',ascending = False)
plt.title("most profitable publisehr")
plt.ylabel("")
chart_Pie= Pb_Sales['Global_Sales'].plot(kind = 'pie',figsize=(10,5),legend = False)
chart_Pie.set_ylabel("")
plt.show()

Solution

  • The approach below renders the visualisation using a chained operation. I don't use for loops, and no intermediate variables are created in the global namespace.

    enter image description here

    The returned object is a matplotlib figure, but you could also modify it to return the data used for the pie chart.

    The dataset is relatively small, so I have focused on clarity rather than performance optimisation.

    import pandas as pd
    
    df = pd.read_csv('vgsales.csv')
    
    (
        df
        
        #All-time sales per publisher
        .groupby('Publisher', as_index=False)['Global_Sales'].sum()
    
        #Optionally rename `Global_Sales` since it's now all-time sales
        .rename(columns={'Global_Sales': 'alltime_sales'})
    
        #Express as a percentage
        .assign(
            alltime_sales_percent=
            lambda df_: df_['alltime_sales'].div(df_['alltime_sales'].sum()).mul(100)
        )
    
        #Publishers whose all-time sales is <2% get named "Other"
        .assign(
            major_publisher_name=
            lambda df_: df_['Publisher'].mask(df_['alltime_sales_percent'] < 2, 'Other')
        )
    
        #Optionally view the current df before the final step
        # Useful for debugging intermediate steps
        .pipe(lambda df_: display(df_) or df_)
    
        #Add up sales over the 'Other' publishers, and sort
        .groupby('major_publisher_name')['alltime_sales_percent'].sum()
        .sort_values(ascending=False)
    
        #Plot
        .plot.pie(
            title='All-time sales share of major publishers',
            ylabel='', figsize=(5, 5), cmap='tab20b'
        )
    );