My dataframes show video game titles, platforms, year of release, revenue, etc.
I have filtered the original dataframe "df_samplegames", which has 29 different platforms (type category), into a new dataframe "df_finalgames" that should only include the 6 platforms: '3DS', 'PSV', 'WiiU', 'PS4', 'XOne', 'PC'.
# Original dataframe:
df_samplegames['platform'].unique()
# Result:
['GEN', 'NES', 'NG', 'SCD', 'SNES', ..., '3DS', 'PSV', 'WiiU', 'PS4', 'XOne']
Length: 29
Categories (29, object): ['GEN', 'NES', 'NG', 'SCD', ..., 'PSV', 'WiiU', 'PS4', 'XOne']
# New dataframe:
df_finalgames = df_samplegames[df_samplegames['platform'].isin(
['3DS', 'PSV', 'WiiU', 'PS4', 'XOne', 'PC'])].sort_values(by = ['year', 'platform']).reset_index(drop = True)
df_finalgames['platform'].unique()
# Result:
['PC', '3DS', 'PSV', 'WiiU', 'PS4', 'XOne']
Categories (6, object): ['PC', '3DS', 'PSV', 'WiiU', 'PS4', 'XOne']
But whenever I want to use the latter dataframe to create a pairplot, or a pivot table, it keeps including the filtered out platform categories.
# Pivot table creation:
df_platgames = df_finalgames[['name', 'platform', 'total_sales']]
pvt_platgames = df_platgames.pivot_table(
values = 'total_sales', index = 'name', columns = 'platform', aggfunc = 'sum')
pvt_platgames
# Result, note the column headers:
platform 2600 3DO 3DS DC DS GB GBA GC GEN GG ... SAT SCD SNES TG16 WS Wii WiiU X360 XB XOne
name
Beyblade Burst 0.0 0.0 0.03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Fire Emblem Fates 0.0 0.0 1.67 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Frozen: Olaf's Quest 0.0 0.0 0.59 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
# Etc...
How can I force my code to ignore those platforms that I don't need? Thank you all in advance.
Sample Data for df_samplegames:
name platform year genre na_sales eu_sales jp_sales other_sales critic_score user_score rating total_sales
9706 Press Your Luck 2010 Edition DS 2009 Misc 0.18 0.00 0.00 0.01 68.967679 7.125046 E 0.19
3526 Stadium Games GBA 2003 Sports 0.03 0.01 0.00 0.00 68.967679 7.125046 RP 0.04
1865 Evil Dead: Hail to the King PS 2000 Adventure 0.15 0.10 0.00 0.02 51.000000 6.000000 M 0.27
5117 Connect Four / Perfection / Trouble GBA 2005 Misc 0.19 0.07 0.00 0.00 68.967679 7.125046 RP 0.26
340 Rampo SAT 1995 Adventure 0.00 0.00 0.03 0.00 68.967679 7.125046 RP 0.03
Various methods (such as pivot_table
) will include “unused” categories by default.
df2.pivot_table(index="name", columns="platform", values="total_sales",
aggfunc="sum")
# platform w x y z
# name
# b 0 2 0 0
# d 0 0 0 4
observed=True
will only include used categories.
df2.pivot_table(index="name", columns="platform", values="total_sales",
aggfunc="sum", observed=True)
# platform x z
# name
# b 2.0 NaN
# d NaN 4.0
Your code should actually warn (on at least v2.3.2
) as the default will change.
FutureWarning: The default value of
observed=False
is deprecated and will change toobserved=True
in a future version of pandas. Specifyobserved=False
to silence this warning and retain the current behavior
import pandas as pd
df1 = pd.DataFrame({
"name": ["a", "b", "c", "d"],
"platform": ["w", "x", "y", "z"],
"total_sales": [1, 2, 3, 4]
})
df1 = df1.astype({"platform": "category"})
df2 = df1[df1["platform"].isin(["x", "z"])]