pandas dataframe distinct-values

How to filter a pandas dataframe by unique column values


I have a pandas DataFrame of email addresses and I want to keep only the unique emails within each row. I tried the code below, but it does not work: the result is identical to the original DataFrame.

import pandas as pd

df = pd.DataFrame({'z': [1, 2, 3, 4],
                   'a': ['dave@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'dave@hotmail.com'],
                   'b': ['mary@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
                   'c': ['jane@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com']})
df.to_csv('../output/try.csv', index=False)

df = pd.read_csv('../output/try.csv')
df2 = df.drop_duplicates(subset=['a', 'b', 'c'])
df2.to_csv('../output/try2.csv', index=False)

I've seen solutions that work when the columns hold numbers, but my columns hold email strings, and for some reason the drop_duplicates call above does nothing with them.


Solution

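  • DataFrame.drop_duplicates compares whole rows (restricted to the columns in subset) and drops any row that repeats an earlier one. No row in your data repeats another, so nothing is removed; it makes no difference whether the values are strings or numbers. You want to remove duplicates within each row instead, so apply Series.drop_duplicates to every row by passing axis=1: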

    cols = ['a', 'b', 'c']
    # For each row, keep the first occurrence of every value; repeats become NaN
    df[cols] = df[cols].apply(pd.Series.drop_duplicates, axis=1)
    

       z                 a                 b                 c
    0  1  dave@hotmail.com  mary@hotmail.com  jane@hotmail.com
    1  2     peter@cnn.com               NaN               NaN
    2  3      paul@pbs.com               NaN               NaN
    3  4  dave@hotmail.com  mike@hotmail.com               NaN
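
To see both points for yourself, here is a minimal self-contained sketch using the sample data from the question. It first checks that no (a, b, c) row repeats, which is why drop_duplicates left the frame unchanged, and then blanks the repeats within each row with a boolean mask. The mask approach is an alternative to the apply call above, not something from the original answer, and the names row_dupes, within_row and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "z": [1, 2, 3, 4],
    "a": ["dave@hotmail.com", "peter@cnn.com", "paul@pbs.com", "dave@hotmail.com"],
    "b": ["mary@hotmail.com", "peter@cnn.com", "paul@pbs.com", "mike@hotmail.com"],
    "c": ["jane@hotmail.com", "peter@cnn.com", "paul@pbs.com", "mike@hotmail.com"],
})
cols = ["a", "b", "c"]

# Row-level check: True where a row's (a, b, c) combination repeats an earlier row.
# Every combination here is distinct, so drop_duplicates has nothing to drop.
row_dupes = df.duplicated(subset=cols)

# Within-row check: True where a value already appeared earlier in the same row.
within_row = df[cols].apply(pd.Series.duplicated, axis=1).astype(bool)

# Replace the flagged repeats with NaN, leaving the original frame untouched.
result = df.copy()
result[cols] = df[cols].mask(within_row)

print(row_dupes.tolist())
print(result)
```

Working on a copy keeps the original df available for comparison; assign back to df[cols] instead if you want the change in place.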