I have a dataframe:
import pandas as pd
data = pd.DataFrame({"col1": ["a", "a", "a", "a", "a", "a"],
"col2": [0,0,0,1,1, 1],
"col3": [1,2,3,4,5, 6]})
data
col1 col2 col3
0 a 0 1
1 a 0 2
2 a 0 3
3 a 1 4
4 a 1 5
5 a 1 6
I'm trying to remove the duplicate rows where col2 == 1 and keep only the last entry.
Using the code below I was able to keep the first entry and drop the others.
data[~(data.duplicated(["col2"]) & data.col2.eq(1))]
col1 col2 col3
0 a 0 1
1 a 0 2
2 a 0 3
3 a 1 4
How can I remove duplicates based on one category in a column and keep the last entry?
Desired Output
col1 col2 col3
0 a 0 1
1 a 0 2
2 a 0 3
3 a 1 6
Use boolean indexing with the help of groupby.cumcount (or duplicated):
# is col2 not equal to 1?
m1 = data['col2'].ne(1)
# is the row the last of the group?
m2 = data.groupby('col2').cumcount(ascending=False).eq(0)
# or
# m2 = ~data['col2'].duplicated(keep='last')
# keep rows matching either condition
out = data[m1 | m2]
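As a quick sanity check (the boolean values below assume the six-row example frame above), the two masks and their combination look like this:
# m1: col2 != 1
print(m1.tolist())          # [True, True, True, False, False, False]
# m2: last row of each col2 group
print(m2.tolist())          # [False, False, True, False, False, True]
# keep a row if either mask is True -> rows 0, 1, 2 and 5
print((m1 | m2).tolist())   # [True, True, True, False, False, True]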
Reversed logic:
# is col2 equal to 1?
m1 = data['col2'].eq(1)
# is the row a duplicate that is NOT the last occurrence?
m2 = data['col2'].duplicated(keep='last')
# drop the rows matching both conditions
out = data[~(m1 & m2)]
Output:
col1 col2 col3
0 a 0 1
1 a 0 2
2 a 0 3
5 a 1 6
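If you prefer the compact 0..3 index shown in the desired output rather than the original row labels, you can optionally reset the index afterwards:
# optional: renumber the rows to match the desired output's index
out = out.reset_index(drop=True)
out
col1 col2 col3
0 a 0 1
1 a 0 2
2 a 0 3
3 a 1 6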