Python newbie here. I'm trying to understand how the pandas groupby and apply methods work. I found this simple example, which I paste below:
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
The dataframe df looks like this:
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690
So far, so good. I would then like to transform my data so that from every group of teams I'd only keep the very first element from the Points column. Having first checked that df['Points'][0] does indeed give me the first Points element of df, I tried this:
df.groupby('Team').apply(lambda x : x['Points'][0])
thinking that the argument x to the lambda function is another pandas dataframe. However, Python raises an error:
File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
which seems to have something to do with a HashTable, but I am unable to understand why. I then thought that maybe what is passed to the lambda is not a dataframe, so I ran this:
df.groupby('Team').apply(lambda x : (type(x), x.shape))
with output:
Team
Devils (<class 'pandas.core.frame.DataFrame'>, (2, 4))
Kings (<class 'pandas.core.frame.DataFrame'>, (3, 4))
Riders (<class 'pandas.core.frame.DataFrame'>, (4, 4))
Royals (<class 'pandas.core.frame.DataFrame'>, (2, 4))
kings (<class 'pandas.core.frame.DataFrame'>, (1, 4))
dtype: object
which, IIUC, shows that the argument to the lambda is indeed a pandas dataframe holding each team's subset of df.
I know I can get the desired result by running:
df.groupby('Team').apply(lambda x : x['Points'].iloc[0])
I just want to understand why df['Points'][0] works and x['Points'][0] doesn't from within the apply function. Thank you for reading!
When you call df.groupby('Team').apply(lambda x: ...), you are essentially chopping up the dataframe by Team and passing each chunk to the lambda function:
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
8   Riders     2  2016     694
11  Riders     2  2017     690
------------------------------
2   Devils     2  2014     863
3   Devils     3  2015     673
------------------------------
4    Kings     3  2014     741
6    Kings     1  2016     756
7    Kings     1  2017     788
------------------------------
5    kings     4  2015     812
------------------------------
9   Royals     4  2014     701
10  Royals     1  2015     804
df['Points'][0] works because you are telling pandas to "get the value at label 0 of the Points series", which exists.
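.apply(lambda x: x['Points'][0]) doesn't work because each chunk keeps the row labels it had in df, and only one chunk (Riders) has a label 0. For every other chunk the label lookup fails, hence the KeyError. x['Points'].iloc[0] works because iloc selects by position rather than by label.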
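To make the label-versus-position point concrete, here is a minimal sketch that rebuilds only the Team and Points columns from the question and inspects the row labels each chunk carries. The variable names (group_labels, devils, label_zero_missing) are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# Each chunk keeps the row labels it had in df, so only Riders contains label 0.
group_labels = {team: list(chunk.index) for team, chunk in df.groupby('Team')}
print(group_labels)

# Take one chunk that does not contain label 0.
devils = df[df['Team'] == 'Devils']

print(devils['Points'].iloc[0])  # position-based: first row of the chunk
print(devils['Points'].loc[2])   # label-based: label 2 exists in this chunk

# With an integer index, [0] is a label lookup, so it fails on this chunk.
try:
    devils['Points'][0]
    label_zero_missing = False
except KeyError:
    label_zero_missing = True
print(label_zero_missing)
```

The Devils chunk only has labels 2 and 3, so asking for label 0 fails even though the chunk clearly has a first row.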
Having said that, apply is generic, so it's pretty slow compared to the built-in vectorized aggregate functions. You can use first:
df.groupby('Team')['Points'].first()
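As a quick sanity check, the sketch below compares first with the positional apply version on the same Team and Points data. One caveat worth knowing: by default first returns the first non-null value in each group, so the two could differ if a group started with a missing value. There are no missing values here, so they should agree. The names via_first and via_apply are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})

# Built-in aggregation: first Points value per team.
via_first = df.groupby('Team')['Points'].first()

# Generic apply on the Points column only, picking the first row by position.
via_apply = df.groupby('Team')['Points'].apply(lambda s: s.iloc[0])

print(via_first)
print((via_first == via_apply).all())
```

Both results are indexed by Team, so they can be compared element by element.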