I am dramatically simplifying this question (compared to the initial draft).
Given a dataframe with a categorical type, if I subset the dataframe so it only contains a few of the categorical values, the subsetted dataframe still remembers the full list of category levels.
When I groupby
on this, smaller, dataframe, I continue to get values which I don't expect.
Here is some code which replicates this:
import pandas as pd
df = pd.DataFrame({
'colA':['a', 'b', 'b', 'c', 'c', 'c'],
'colB':1
})
df['colA'] = df['colA'].astype('category') #<= this is important
df
output
colA colB
0 a 1
1 b 1
2 b 1
3 c 1
4 c 1
5 c 1
df.groupby('colA').max()
output
colB
colA
a 1
b 1
c 1
smalldf = df.iloc[:3]
smalldf['colA']
output
0 a
1 b
2 b
Name: colA, dtype: category
Categories (3, object): ['a', 'b', 'c']
Notice above that 'c' is still in the category levels
smalldf.groupby('colA').max()
output
colB
colA
a 1.0
b 1.0
c NaN
This 'c' here is unexpected. However, I suppose it can be explained away because subsetting doesn't remove the category levels.
The problem gets worse when you write out the smaller dataframe
FEATHER_PATH = Path.joinpath(Path.home(), 'Downloads', 'test.feather')
smalldf.to_feather(FEATHER_PATH)
pd.read_feather(FEATHER_PATH)['colA']
0 a
1 b
2 b
Name: colA, dtype: category
Categories (3, object): ['a', 'b', 'c']
Say you write out the smaller dataframe and pass it to someone. They take a look at the file and only see colA
values of 'a' and 'b'. However, when they do a groupby, 'c' shows up. 'c' is no where to be found in the actual data!
They would have to explicitly look for category levels to find the 'c' hidden in there!
I'm not sure if this behavior is wrong, but can certainly can be extremely confusing!
Not sure if this can be left here as a warning to others.
It's not wrong, but it is a bit hidden in the docs. You can find it for example here:
DataFrame methods like DataFrame.sum() also show “unused” categories.
Groupby will also show “unused” categories
Regarding the writing/reading: Without it you would loose type information, which might hurt as well. So there's a tradeoff to balance.
You should be able to handle it by using the observed=True
option of .groupby()
:
observed: bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
For example:
smalldf.groupby('colA', observed=True).max()
delivers
colB
colA
a 1
b 1