pythonpandasdataframecategoriesisin

How do you remove already filtered category values from DataFrames from plots and pivot tables?


My dataframes show video game titles, platforms, year of release, revenue, etc.

I have filtered the original dataframe "df_samplegames", which has 29 different platforms (type category), into a new dataframe "df_finalgames" that should only include the 6 platforms: '3DS', 'PSV', 'WiiU', 'PS4', 'XOne', 'PC'.

# Original dataframe:
df_samplegames['platform'].unique()

# Result:
['GEN', 'NES', 'NG', 'SCD', 'SNES', ..., '3DS', 'PSV', 'WiiU', 'PS4', 'XOne']
Length: 29
Categories (29, object): ['GEN', 'NES', 'NG', 'SCD', ..., 'PSV', 'WiiU', 'PS4', 'XOne']
# New dataframe:
df_finalgames = df_samplegames[df_samplegames['platform'].isin(
    ['3DS', 'PSV', 'WiiU', 'PS4', 'XOne', 'PC'])].sort_values(by = ['year', 'platform']).reset_index(drop = True)
df_finalgames['platform'].unique()

# Result:
['PC', '3DS', 'PSV', 'WiiU', 'PS4', 'XOne']
Categories (6, object): ['PC', '3DS', 'PSV', 'WiiU', 'PS4', 'XOne']

But whenever I want to use the latter dataframe to create a pairplot, or a pivot table, it keeps including the filtered out platform categories.

# Pivot table creation:
df_platgames = df_finalgames[['name', 'platform', 'total_sales']]
pvt_platgames = df_platgames.pivot_table(
    values = 'total_sales', index = 'name', columns = 'platform', aggfunc = 'sum')
pvt_platgames

# Result, note the column headers:
platform    2600    3DO 3DS DC  DS  GB  GBA GC  GEN GG  ... SAT SCD SNES    TG16    WS  Wii WiiU    X360    XB  XOne
name                                                                                    
Beyblade Burst  0.0 0.0 0.03    0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Fire Emblem Fates   0.0 0.0 1.67    0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Frozen: Olaf's Quest    0.0 0.0 0.59    0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# Etc...

How can I force my code to ignore those platforms that I don't need? Thank you all in advance.

Sample Data for df_samplegames:

        name    platform    year    genre   na_sales    eu_sales    jp_sales    other_sales critic_score    user_score  rating  total_sales
9706    Press Your Luck 2010 Edition    DS  2009    Misc    0.18    0.00    0.00    0.01    68.967679   7.125046    E   0.19
3526    Stadium Games   GBA 2003    Sports  0.03    0.01    0.00    0.00    68.967679   7.125046    RP  0.04
1865    Evil Dead: Hail to the King PS  2000    Adventure   0.15    0.10    0.00    0.02    51.000000   6.000000    M   0.27
5117    Connect Four / Perfection / Trouble GBA 2005    Misc    0.19    0.07    0.00    0.00    68.967679   7.125046    RP  0.26
340 Rampo   SAT 1995    Adventure   0.00    0.00    0.03    0.00    68.967679   7.125046    RP  0.03


Solution

  • Various methods (such as pivot_table) will include “unused” categories by default.

    df2.pivot_table(index="name", columns="platform", values="total_sales", 
                    aggfunc="sum")
    
    # platform  w  x  y  z
    # name
    # b         0  2  0  0
    # d         0  0  0  4
    

    observed=True will only include used categories.

    df2.pivot_table(index="name", columns="platform", values="total_sales", 
                    aggfunc="sum", observed=True)
    
    # platform    x    z
    # name
    # b         2.0  NaN
    # d         NaN  4.0
    

    Your code should actually warn (on at least v2.3.2) as the default will change.

    FutureWarning: The default value of observed=False is deprecated and will change to observed=True in a future version of pandas. Specify observed=False to silence this warning and retain the current behavior

    Example dfs

    import pandas as pd
    
    df1 = pd.DataFrame({
        "name": ["a", "b", "c", "d"],
        "platform": ["w", "x", "y", "z"],
        "total_sales": [1, 2, 3, 4]
    })
    
    df1 = df1.astype({"platform": "category"})
    df2 = df1[df1["platform"].isin(["x", "z"])]