[SOLVED] How to filter a pandas dataframe by unique column values

I have a pandas data frame with emails and I want to keep only the unique emails within each row. I tried the code below, but it does not work: it returns the original data frame unchanged. The code below builds the original data frame.

import pandas as pd

df = pd.DataFrame({
    'z': [1, 2, 3, 4],
    'a': ['dave@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'dave@hotmail.com'],
    'b': ['mary@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
    'c': ['jane@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
})
df.to_csv('../output/try.csv', index=False)

df = pd.read_csv('../output/try.csv')
df2 = df.drop_duplicates(subset=['a', 'b', 'c'])
df2.to_csv('../output/try2.csv', index=False)

I've seen solutions that work with numbers in the columns, but my columns hold strings, and for some reason it does not work with email strings. The call that does nothing is df2 = df.drop_duplicates(subset=['a', 'b', 'c']).

Solution

DataFrame.drop_duplicates checks for duplicate rows: it compares the subset columns of one row against those of the other rows, along the index axis. Here you need to check for duplicates within each row instead, so you have to apply Series.drop_duplicates to each row along the column axis (axis=1).
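A quick check makes the difference visible. This is only a sketch on the sample data from the question; the variable names row_is_repeat and distinct_per_row are made up for illustration. No row repeats an earlier row across a, b, c, which is why drop_duplicates leaves the frame unchanged, yet several rows contain the same email more than once.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'z': [1, 2, 3, 4],
    'a': ['dave@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'dave@hotmail.com'],
    'b': ['mary@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
    'c': ['jane@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
})
cols = ['a', 'b', 'c']

# duplicated() compares whole rows (restricted to cols) against earlier rows
row_is_repeat = df.duplicated(subset=cols)
print(row_is_repeat.tolist())     # [False, False, False, False]

# nunique(axis=1) counts the distinct values inside each row
distinct_per_row = df[cols].nunique(axis=1)
print(distinct_per_row.tolist())  # [3, 1, 1, 2]
```

Applying the de-duplication per row instead gives the wanted result: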

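cols = ['a', 'b', 'c']
# Drop repeats within each row; apply() realigns each result to the
# original columns and fills the dropped positions with NaN.
df[cols] = df[cols].apply(pd.Series.drop_duplicates, axis=1)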

   z                 a                 b                 c
0  1  dave@hotmail.com  mary@hotmail.com  jane@hotmail.com
1  2     peter@cnn.com               NaN               NaN
2  3      paul@pbs.com               NaN               NaN
3  4  dave@hotmail.com  mike@hotmail.com               NaN
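If you prefer to make the NaN filling explicit rather than rely on apply realigning the shortened rows, one possible variant is to mask the repeated values in each row. This is a sketch, not the method from the answer above: the helper blank_repeats and the variable result are hypothetical names, and it assumes the same sample data as in the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'z': [1, 2, 3, 4],
    'a': ['dave@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'dave@hotmail.com'],
    'b': ['mary@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
    'c': ['jane@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
})
cols = ['a', 'b', 'c']

def blank_repeats(row):
    # row.duplicated() is True for the second and later occurrences of a
    # value within this row; mask() replaces those positions with NaN.
    return row.mask(row.duplicated())

# Work on a copy so the original frame stays available for comparison
result = df.copy()
result[cols] = df[cols].apply(blank_repeats, axis=1)
print(result)
```

On this data the printed frame should match the output shown above: the first occurrence of each email in a row is kept in place and later repeats become NaN.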