pythonpandasdata-manipulationdrop-duplicates

Pandas drop duplicates based on one group and keep the last value


I have a dataframe:

import pandas as pd

data = pd.DataFrame({"col1": ["a", "a", "a", "a", "a", "a"],
                     "col2": [0,0,0,1,1, 1],
                     "col3": [1,2,3,4,5, 6]})

data


  col1  col2    col3
0   a   0   1
1   a   0   2
2   a   0   3
3   a   1   4
4   a   1   5
5   a   1   6

I'm trying to remove the duplicates based on col2 == 1 and keep the last entry

Using the below code I was able to keep the first and drop others.

data[~(data.duplicated(["col2"]) & data.col2.eq(1))]
col1    col2    col3
0   a   0   1
1   a   0   2
2   a   0   3
3   a   1   4

How to remove duplicates based on one category in a column and keep the last entry?

Desired Output

  col1  col2    col3
0   a   0   1
1   a   0   2
2   a   0   3
3   a   1   6

Solution

  • Use boolean indexing with help of groupby.cumcount (or duplicated):

    # is the row not a 1 in col2?
    m1 = data['col2'].ne(1)
    # is the row the last of the group?
    m2 = data.groupby('col2').cumcount(ascending=False).eq(0)
    # or
    # m2 = ~data['col2'].duplicated(keep='last')
    
    # keep rows matching either condition
    out = data[m1|m2]
    

    Reversed logic:

    # is the group a 1?
    m1 = data['col2'].eq(1)
    # is the row NOT the last one?
    m2 = data['col2'].duplicated(keep='last')
    
    # drop the rows matching both conditions
    out = data[~(m1&m2)]
    

    Output:

      col1  col2  col3
    0    a     0     1
    1    a     0     2
    2    a     0     3
    5    a     1     6