I am trying to find a clean way to add a column to a dataframe holding the rank of each group/subgroup pair, based on the sum of a value. Here is a simple example:
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c"],
    "subgroup": ["i", "ii", "i", "ii", "i", "ii", "i", "ii", "i", "ii", "ii"],
    "value": [2, 4, 2, 3, 5, 1, 2, 4, 1, 5, 11],
})
The output I'd like to have is:
|group |subgroup| value | rank |
|------|--------|-------|------|
| "a" | "i" | 2 | 3 |
| "a" | "ii" | 4 | 4 |
| "a" | "i" | 2 | 3 |
| "a" | "ii" | 3 | 4 |
| "a" | "i" | 5 | 3 |
| "b" | "ii" | 1 | 2 |
| "b" | "i" | 2 | 5 |
| "b" | "ii" | 4 | 2 |
| "b" | "i" | 1 | 5 |
| "b" | "ii" | 5 | 2 |
| "c" | "ii" | 11 | 1 |
because, for example, the sum of 'value' for the subgroup ["a", "i"] is 9, making it the third-largest subgroup. I know it sounds easy to do, but I can't find an efficient way to do it with pandas.
Use a groupby.sum, then rank and merge:
out = df.merge(
    df.groupby(['group', 'subgroup'])['value']
      .sum().rank(ascending=False)
      .rename('rank').reset_index()
)
Output:
group subgroup value rank
0 a i 2 3.0
1 a ii 4 4.0
2 a i 2 3.0
3 a ii 3 4.0
4 a i 5 3.0
5 b ii 1 2.0
6 b i 2 5.0
7 b ii 4 2.0
8 b i 1 5.0
9 b ii 5 2.0
10 c ii 11 1.0
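The ranks print as floats because `Series.rank` returns a float Series. If integer ranks like those in the question's table are preferred, one possible variant is to rank with `method="dense"` and cast before merging. This is only a sketch: the `ranks` name and the explicit `on=`/`how="left"` arguments are my additions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c"],
    "subgroup": ["i", "ii", "i", "ii", "i", "ii", "i", "ii", "i", "ii", "ii"],
    "value": [2, 4, 2, 3, 5, 1, 2, 4, 1, 5, 11],
})

# One row per (group, subgroup) with its integer rank by total value.
# method="dense" keeps ranks whole numbers even if two subgroups tie.
ranks = (
    df.groupby(["group", "subgroup"])["value"]
      .sum()
      .rank(ascending=False, method="dense")
      .astype(int)
      .rename("rank")
      .reset_index()
)

# A left merge keeps the original row order of df.
out = df.merge(ranks, on=["group", "subgroup"], how="left")
print(out)
```

With the default `method="average"`, tied sums would produce fractional ranks such as 2.5, and a plain integer cast would truncate them.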
Intermediates:
group subgroup sum rank
0 a i 9 3.0
1 a ii 7 4.0
2 b i 3 5.0
3 b ii 10 2.0
4 c ii 11 1.0
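For comparison, a merge-free alternative is possible with `groupby.transform`, which broadcasts each subgroup's sum back onto its rows. This is a different technique from the one shown above, offered as a sketch; the `subgroup_sum` name is hypothetical. Because the ranking here runs over rows rather than over one row per subgroup, `method="dense"` is needed so that the repeated sums do not inflate the ranks.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "c"],
    "subgroup": ["i", "ii", "i", "ii", "i", "ii", "i", "ii", "i", "ii", "ii"],
    "value": [2, 4, 2, 3, 5, 1, 2, 4, 1, 5, 11],
})

# Per-row total of 'value' for the row's (group, subgroup) pair.
subgroup_sum = df.groupby(["group", "subgroup"])["value"].transform("sum")

# Dense ranking treats every row of a subgroup as the same rank
# and leaves no gaps between consecutive ranks.
df["rank"] = subgroup_sum.rank(ascending=False, method="dense").astype(int)
print(df)
```

On this sample data the result matches the merge-based output, with integer ranks. The two approaches diverge only when different subgroups have equal sums, where the default `rank` method averages the tied positions and the dense method does not.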