Drop all duplicate rows across multiple columns in Python Pandas


The pandas drop_duplicates function is great for "uniquifying" a dataframe. I would like to drop all rows which are duplicates across a subset of columns. Is this possible?

    A   B   C
0   foo 0   A
1   foo 1   A
2   foo 1   B
3   bar 1   A

As an example, I would like to drop rows that match on columns A and C, so this should drop rows 0 and 1.


Solution

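  • This is much easier in pandas now: drop_duplicates takes a keep parameter, and keep=False drops every row that has a duplicate in the subset columns instead of keeping the first or last occurrence.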

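    import pandas as pd

    df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})
    # keep=False drops every row whose (A, C) pair occurs more than once; a new DataFrame is returned
    result = df.drop_duplicates(subset=["A", "C"], keep=False)
    print(result)  # only rows 2 and 3 remain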
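
For comparison, here is a small sketch of how the three keep settings behave on the question's data, along with what should be an equivalent boolean mask built with DataFrame.duplicated. The variable names (first, last, none, via_mask) are only illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar"],
    "B": [0, 1, 1, 1],
    "C": ["A", "A", "B", "A"],
})

# keep="first" (the default) retains the first row of each duplicate group.
first = df.drop_duplicates(subset=["A", "C"], keep="first")
# keep="last" retains the last row of each duplicate group.
last = df.drop_duplicates(subset=["A", "C"], keep="last")
# keep=False retains no row from any duplicate group.
none = df.drop_duplicates(subset=["A", "C"], keep=False)

# Same result as keep=False: flag every member of a duplicate group, then invert.
mask = df.duplicated(subset=["A", "C"], keep=False)
via_mask = df[~mask]

print(list(first.index))  # [0, 2, 3]
print(list(last.index))   # [1, 2, 3]
print(list(none.index))   # [2, 3]
```

The mask form can be handy when you also want to inspect the dropped rows, since df[mask] gives exactly the rows that keep=False removes.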