In a dataframe I need to drop/filter out duplicate rows based the combined columns A and B. In the example DataFrame
A B C D
0 1 1 3 9
1 1 2 4 8
2 1 3 5 7
3 1 3 4 6
4 1 4 5 5
5 1 4 6 4
rows 2 and 3, and 5 and 6 are duplicates and one of them should be dropped, keeping the row with the lowest value of
2 * C + 3 * D
To do this, I created a new temporary score column, S
df['S'] = 2 * df['C'] + 3 * df['D']
and finally to return the index of the minimum value for S
df.loc[df.groupby(['A', 'B'])['S'].idxmin()]
del ['S']
The result is
A B C D
0 1 1 3 9
1 1 2 4 8
3 1 3 4 6
5 1 4 6 4
However, is there a more efficient way of doing this, without having to add (and later drop) a new column?
Here is a way by using sort_values()
and duplicated()
df.loc[~df.sort_values('A',key = lambda x: df['C'].mul(2).add(df['D'].mul(3))).duplicated(subset = ['A','B'])]
Output:
A B C D
0 1 1 3 9
1 1 2 4 8
3 1 3 4 6
5 1 4 6 4