When applying an aggregation to a grouped pandas DataFrame, the aggregated output appears to contain different values for all-missing-value groups, depending on the dtype of the DataFrame column. Below is a minimal example; column 'a' contains one non-missing value (an integer, a string and a tuple, respectively), one NaN and one None in each case:
import pandas as pd
import numpy as np

a1 = pd.DataFrame({'a': [3, np.nan, None], 'b': [0, 1, 2]})          # 'a' becomes float64
a2 = pd.DataFrame({'a': ['tree', np.nan, None], 'b': [0, 1, 2]})     # 'a' becomes object
a3 = pd.DataFrame({'a': [(0, 1, 2), np.nan, None], 'b': [0, 1, 2]})  # 'a' becomes object

# .first() and .agg('first') give identical results here
a1.groupby('b')['a'].first()
a2.groupby('b')['a'].first()
a3.groupby('b')['a'].first()
a1.groupby('b')['a'].agg('first')
a2.groupby('b')['a'].agg('first')
a3.groupby('b')['a'].agg('first')
Looking at the dtypes of column 'a', it can be seen that these are float64, object and object for a1, a2 and a3, respectively. The None in a1 is converted to NaN at DataFrame creation; a quick check below confirms this.
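For reference, this can be checked directly on the frames defined above:

a1['a'].dtype, a2['a'].dtype, a3['a'].dtype
# (dtype('float64'), dtype('O'), dtype('O'))
a1['a']
# 0    3.0
# 1    NaN
# 2    NaN
# Name: a, dtype: float64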
Therefore I would have the following expected output behavior:

- a1: NaN for rows 1 and 2 (that is the case)
- a2: NaN and None for rows 1 and 2 (not the case)
- a3: NaN and None for rows 1 and 2 (not the case)

Actual output:
# for a1
b
0    3.0
1    NaN
2    NaN
Name: a, dtype: float64

# for a2
b
0    tree
1    None
2    None
Name: a, dtype: object

# for a3
b
0    (0, 1, 2)
1    None
2    None
Name: a, dtype: object
Why does the aggregation change the data from NaN to None for row 1 in a2 and a3? As the column is of dtype object anyway, there should be no issue in returning NaN and None for rows 1 and 2, respectively; and we are not in a scenario here where any group to be aggregated contains both NaN and None. The documentation (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html) is not very precise on this behavior either; it just mentions that the returned value for all-NA columns is NA.
Update:
As mentioned in @mozway's answer further below, for pure NaN/None groups, skipna=False can be used to preserve NaN and None, respectively. However, this does not work when having both mixed non-missing/missing-value groups and all-missing groups (e.g. [[np.nan, None, 'tree'], [np.nan, None]]), where we would still like to get the first non-missing value, as that would require passing skipna=True. A sketch of this trade-off follows below.
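A minimal sketch of this mixed scenario (the frame below is illustrative; it combines one mixed group and one all-missing group):

df = pd.DataFrame({'a': [np.nan, None, 'tree', np.nan, None],
                   'b': [0, 0, 0, 1, 1]})
df.groupby('b')['a'].first(skipna=True)   # group 0: 'tree', but group 1's NaN comes back as None
df.groupby('b')['a'].first(skipna=False)  # group 1 keeps its NaN, but group 0 returns NaN instead of 'tree'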
By default, groupby.first skips the NaN values.
DataFrameGroupBy.first(numeric_only=False, min_count=-1, skipna=True)
Compute the first entry of each column within each group.
Defaults to skipping NA elements.
Thus, the aggregation ignores all your NaNs and outputs the default NA value for your dtype (NaN for numeric, None for object).
You should use skipna=False:
a2.groupby('b')['a'].first(skipna=False)
# with agg
a3.groupby('b')['a'].agg('first', skipna=False)
Output:
# for a2
b
0 tree
1 NaN
2 None
Name: a, dtype: object
# for a3
b
0 (0, 1, 2)
1 NaN
2 None
Name: a, dtype: object
If you have an object Series and a mix of NaN/None, then (with skipna=False) the first object is returned (as expected):
(pd.DataFrame({'a': [np.nan, None, None, np.nan, 'X'],
               'b': [0, 0, 1, 1, 2]})
   .groupby('b')['a'].first(skipna=False)
)
b
0 NaN
1 None
2 X
Name: a, dtype: object
Custom first function:
If you want the first non-null value, or the first null while keeping the original object:
def first(s):
    # first non-NA value if any, otherwise the group's original first value
    return next(iter(s.dropna()), s.iloc[0])
(pd.DataFrame({'a': [np.nan, None, None, np.nan, np.nan, 'X'],
               'b': [0, 0, 1, 1, 2, 2]})
   .groupby('b')['a'].agg(first)
)
Output:
b
0 NaN
1 None
2 X
Name: a, dtype: object
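Since the fallback is s.iloc[0], each all-NA group keeps its own first value: group 0 returns NaN, group 1 returns None, and group 2 returns the first non-null value 'X'.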