
python pandas groupby/apply: what exactly is passed to the apply function?


Python newbie here. I'm trying to understand how the pandas groupby and apply methods work. I found this simple example, which I paste below:

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)

The dataframe df looks like this:

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690

So far, so good. I would then like to transform my data so that from every group of teams I'd only keep the very first element from the Points column. Having first checked that df['Points'][0] does indeed give me the first Points element of df, I tried this:

df.groupby('Team').apply(lambda x : x['Points'][0])

thinking that the argument x to the lambda function is another pandas dataframe. However, Python raises an error:

File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

which seems to have something to do with a HashTable but I am unable to understand why. I then thought that maybe what is passed to the lambda is not a dataframe, so I ran this:

df.groupby('Team').apply(lambda x : (type(x), x.shape))

with output:

Team
Devils    (<class 'pandas.core.frame.DataFrame'>, (2, 4))
Kings     (<class 'pandas.core.frame.DataFrame'>, (3, 4))
Riders    (<class 'pandas.core.frame.DataFrame'>, (4, 4))
Royals    (<class 'pandas.core.frame.DataFrame'>, (2, 4))
kings     (<class 'pandas.core.frame.DataFrame'>, (1, 4))
dtype: object

which, IIUC, shows that the argument to the lambda is indeed a pandas dataframe holding each team's subset of df.

I know I can get the desired result by running:

df.groupby('Team').apply(lambda x : x['Points'].iloc[0])

I just want to understand why df['Points'][0] works and x['Points'][0] doesn't from within the apply function. Thank you for reading!


Solution

  • When you call df.groupby('Team').apply(lambda x: ...) you are essentially chopping up the dataframe by Team and passing each chunk to the lambda function. Each chunk keeps the index labels its rows had in df:

          Team  Rank  Year  Points
    0   Riders     1  2014     876
    1   Riders     2  2015     789
    8   Riders     2  2016     694
    11  Riders     2  2017     690
    ------------------------------
    2   Devils     2  2014     863
    3   Devils     3  2015     673
    ------------------------------
    4    Kings     3  2014     741
    6    Kings     1  2016     756
    7    Kings     1  2017     788
    ------------------------------
    5    kings     4  2015     812
    ------------------------------
    9   Royals     4  2014     701
    10  Royals     1  2015     804
    

    df['Points'][0] works because you are telling pandas to "get the value at label 0 of the Points series", which exists.

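    .apply(lambda x: x['Points'][0]) doesn't work because [0] is still a label lookup, and only one chunk (Riders) contains the label 0. For every other chunk the lookup fails, hence the KeyError. x['Points'].iloc[0] works because iloc selects by position, so it returns the first row of each chunk regardless of its label.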


    Having said that, apply is generic, so it's pretty slow compared to the built-in vectorized aggregate functions. You can use first, which returns the first non-null value in each group:

    df.groupby('Team')['Points'].first()
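
To make the label-versus-position point concrete, here is a minimal sketch that rebuilds the dataframe from the question and inspects each chunk. The variable names (labels, has_label_0, first_by_position, first_builtin) are made up for illustration, and the sketch calls apply on the Points column rather than on the whole dataframe, which is a small departure from the code in the question.

```python
import pandas as pd

# Same data as in the question (note the lowercase 'kings' entry).
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

# Iterating over the groupby yields (group key, chunk) pairs.
# Each chunk keeps the row labels it had in df.
labels = {team: list(chunk.index) for team, chunk in df.groupby('Team')}

# Only chunks that contain the label 0 can answer a label lookup for 0.
has_label_0 = {team: 0 in chunk.index for team, chunk in df.groupby('Team')}

# Positional access works for every chunk.
first_by_position = df.groupby('Team')['Points'].apply(lambda s: s.iloc[0])

# The built-in aggregation gives the same values here (no missing data).
first_builtin = df.groupby('Team')['Points'].first()

print(labels)
print(has_label_0)
print(first_builtin)
```

Printing has_label_0 should show True for Riders only, which matches the chunks drawn above and explains why the label lookup fails for the other teams.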