I have the following pandas
dataframe df
:
cluster tag amount name
1 0 200 Michael
2 1 1200 John
2 1 900 Daniel
2 0 3000 David
2 0 600 Jonny
3 0 900 Denisse
3 1 900 Mike
3 1 3000 Kely
3 0 2000 Devon
What I need to do is add another column in df
that writes for each row
, the name
(from the name column) that has the highest amount
where the tag
is 1. In other words, the solution looks like this:
cluster tag amount name highest_amount
1 0 200 Michael NaN
2 1 1200 John John
2 1 900 Daniel John
2 0 3000 David John
2 0 600 Jonny John
3 0 900 Denisse Kely
3 1 900 Mike Kely
3 1 3000 Kely Kely
3 0 2000 Devon Kely
I've tried something like this:
df.group('clusters')['name','amount'].transform('max')[df['tag']==1]
but the problem with this is that the name does note repeat on every row. It will look like this:
cluster tag amount name highest_amount
1 0 200 Michael NaN
2 1 1200 John John
2 1 900 Daniel John
2 0 3000 David NaN
2 0 600 Jonny NaN
3 0 900 Denisse NaN
3 1 900 Mike Kely
3 1 3000 Kely Kely
3 0 2000 Devon NaN
Can someone please let me know how to add a condition with split apply combine, and have the solution repeated on each row?
You can do this as a two-stage process. First calculate a mapping series, then map by cluster:
s = df.query('tag == 1')\
.sort_values('amount', ascending=False)\
.drop_duplicates('cluster')\
.set_index('cluster')['name']
df['highest_name'] = df['cluster'].map(s)
print(df)
cluster tag amount name highest_name
0 1 0 200 Michael NaN
1 2 1 1200 John John
2 2 1 900 Daniel John
3 2 0 3000 David John
4 2 0 600 Jonny John
5 3 0 900 Denisse Kely
6 3 1 900 Mike Kely
7 3 1 3000 Kely Kely
8 3 0 2000 Devon Kely
If you want to use groupby
, here's one way:
def func(x):
names = x.query('tag == 1').sort_values('amount', ascending=False)['name']
return names.iloc[0] if not names.empty else np.nan
df['highest_name'] = df['cluster'].map(df.groupby('cluster').apply(func))