In the code example below I am grouping a pandas series using the same series but with a modified index.
The groups in the end make no sense. There is no warning or error.
Could you please help me understand what is going on? The modified index clearly has an effect, but what exactly happens?
import numpy as np
import pandas as pd
# define series
sf = pd.Series([10, 10, 20, 30, 30, 30], index=np.arange(6)+2)
print(sf)
# 2    10
# 3    10
# 4    20
# 5    30
# 6    30
# 7    30
# dtype: int64
# group by using the series itself <- makes sense
grouped = sf.groupby(sf)
for name, group in grouped:
    print(f"Group: {name}")
    print(group)
# Group: 10
# 2    10
# 3    10
# dtype: int64
# Group: 20
# 4    20
# dtype: int64
# Group: 30
# 5    30
# 6    30
# 7    30
# dtype: int64
# change index in the group by series and examine groups <- does not make sense
grouped = sf.groupby(sf.reset_index(drop=True))
for name, group in grouped:
    print(f"Group: {name}")
    print(group)
# Group: 20.0
# 2    10
# dtype: int64
# Group: 30.0
# 3    10
# 4    20
# 5    30
# dtype: int64
See the documentation of the by parameter of groupby:
by : mapping, function, label, pd.Grouper or list of such
Used to determine the groups for the groupby. If by is a function, it's called on each value of the object's index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series' values are first aligned; see .align() method).
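To see that alignment step in isolation, here is a minimal sketch. It assumes the alignment behaves like a left-join of the key series onto the index of sf; the names keys and aligned_keys are only for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

sf = pd.Series([10, 10, 20, 30, 30, 30], index=np.arange(6) + 2)
keys = sf.reset_index(drop=True)  # same values, but index 0..5

# Align the keys to sf's index (2..7). Labels 6 and 7 do not exist
# in keys, so they become NaN and the dtype is upcast to float.
_, aligned_keys = sf.align(keys, join="left")
print(aligned_keys)

# Grouping by the misaligned Series gives the same group sizes as
# grouping by the aligned values; rows with a NaN key are dropped.
print(sf.groupby(keys).size())
print(sf.groupby(aligned_keys.to_numpy()).size())
```

The NaN keys explain why the rows with index 6 and 7 do not show up in any group: groupby drops missing keys by default.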
The crucial part is the alignment. What happens is essentially the following: first, both series are joined on the index
sf.rename("ser_orig").to_frame().join(sf.reset_index(drop=True).rename("ser_reset"))
   ser_orig  ser_reset
2        10       20.0
3        10       30.0
4        20       30.0
5        30       30.0
6        30        NaN
7        30        NaN
and then the result is grouped by the column ser_reset:
for name, group in (
    sf.rename("ser_orig").to_frame().join(sf.reset_index(drop=True).rename("ser_reset"))
    .groupby("ser_reset")
):
    print(f"Group: {name}")
    print(group)
Group: 20.0
   ser_orig  ser_reset
2        10       20.0
Group: 30.0
   ser_orig  ser_reset
3        10       30.0
4        20       30.0
5        30       30.0
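If the intention was to pair keys with rows by position rather than by index label, one possible workaround (a sketch, not necessarily what the question needs) is to pass the keys as a plain NumPy array, which has no index to align on. The names keys and by_position are hypothetical.

```python
import numpy as np
import pandas as pd

sf = pd.Series([10, 10, 20, 30, 30, 30], index=np.arange(6) + 2)
keys = sf.reset_index(drop=True)  # index 0..5, does not match sf

# A NumPy array carries no index, so groupby matches the keys to the
# rows of sf by position and no alignment takes place.
by_position = sf.groupby(keys.to_numpy()).size()
print(by_position)
```

With this, the groups are the same as in the first sf.groupby(sf) example, because the key values line up with the rows one by one.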