python pandas group-by

Why does grouping a pandas Series by the same Series make no sense?


In the code example below, I am grouping a pandas Series by the same Series, but with a modified index.

The resulting groups make no sense, and there is no warning or error.

Could you please help me understand what is going on? The modified index clearly has an effect, but what exactly happens?


import numpy as np
import pandas as pd

# define series
sf = pd.Series([10, 10, 20, 30, 30, 30], index=np.arange(6)+2)
print(sf)
# 2    10
# 3    10
# 4    20
# 5    30
# 6    30
# 7    30
# dtype: int64

# group by using the series itself <- makes sense
grouped = sf.groupby(sf)
for name, group in grouped:
    print(f"Group: {name}")
    print(group)
# Group: 10
# 2    10
# 3    10
# dtype: int64
# Group: 20
# 4    20
# dtype: int64
# Group: 30
# 5    30
# 6    30
# 7    30
# dtype: int64

# change index in the group by series and examine groups <- does not make sense
grouped = sf.groupby(sf.reset_index(drop=True))
for name, group in grouped:
    print(f"Group: {name}")
    print(group)
# Group: 20.0
# 2    10
# dtype: int64
# Group: 30.0
# 3    10
# 4    20
# 5    30
# dtype: int64

Solution

  • See the pandas documentation for the by parameter of groupby:

    by : mapping, function, label, pd.Grouper or list of such

    Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method).

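    The crucial part is the alignment: the Series passed as by is first aligned to the index of sf, and only then are its values used as group keys. So what happens is essentially the following. First, join both Series on the index: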

    sf.rename("ser_orig").to_frame().join(sf.reset_index(drop=True).rename("ser_reset"))
    
       ser_orig  ser_reset
    2        10       20.0
    3        10       30.0
    4        20       30.0
    5        30       30.0
    6        30        NaN
    7        30        NaN
    

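    and then group the result by the column ser_reset. Index labels 6 and 7 do not exist in the reset index, so their key is NaN and groupby drops them by default; the NaN values also force the keys to float, which is why the group names are 20.0 and 30.0: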

    for name, group in (
        sf.rename("ser_orig").to_frame().join(sf.reset_index(drop=True).rename("ser_reset"))
        .groupby("ser_reset")
    ):
        print(f"Group: {name}")
        print(group)
    
    Group: 20.0
       ser_orig  ser_reset
    2        10       20.0
    Group: 30.0
       ser_orig  ser_reset
    3        10       30.0
    4        20       30.0
    5        30       30.0
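
If the intent was to group by the values in their original order, regardless of index labels, one option is to pass a plain NumPy array instead of a Series, since an array carries no index to align on. The following is a minimal sketch under that assumption; the names key, aligned_key, sizes_with_nan and by_position are introduced here for illustration only, and dropna=False requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

sf = pd.Series([10, 10, 20, 30, 30, 30], index=np.arange(6) + 2)
key = sf.reset_index(drop=True)  # same values, index 0..5

# Roughly what groupby uses as keys: `key` aligned to sf's index (2..7).
# Labels 6 and 7 are missing from `key`, so they become NaN.
aligned_key = key.reindex(sf.index)
print(aligned_key)

# By default the NaN keys are dropped; dropna=False keeps them as a group.
sizes_with_nan = sf.groupby(key, dropna=False).size()
print(sizes_with_nan)

# A NumPy array has no index, so the keys are matched by position.
by_position = sf.groupby(key.to_numpy()).size()
print(by_position)
```

With the array as key, the groups are again 10, 20 and 30 with sizes 2, 1 and 3, the same as for sf.groupby(sf).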