I have a pandas data frame of email addresses and I want to keep only the unique emails within each row. I tried the code below, but it does not work: it returns the original data frame unchanged. The original data frame is the one constructed here:
import pandas as pd

df = pd.DataFrame({
    'z': [1, 2, 3, 4],
    'a': ['dave@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'dave@hotmail.com'],
    'b': ['mary@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
    'c': ['jane@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
})
df.to_csv('../output/try.csv', index=False)
df = pd.read_csv('../output/try.csv')
df2 = df.drop_duplicates(subset=['a', 'b', 'c'])
df2.to_csv('../output/try2.csv', index=False)
I've seen solutions that work when the columns hold numbers, but my columns hold strings, and for some reason the same approach does not work with email strings. The call that has no effect is this one:
df2 = df.drop_duplicates(subset=['a', 'b', 'c'])
DataFrame.drop_duplicates compares rows with each other (using only the subset columns) and drops rows that repeat an earlier row, so it works along the index axis. Here you need to check for duplicates within each row, so you have to apply Series.drop_duplicates to each row along the column axis.
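To see the first point concretely, here is a minimal sketch on the question's data. It uses DataFrame.duplicated with the same subset to show that no row repeats an earlier row, which is why drop_duplicates returns the frame unchanged. The name row_dupes is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'z': [1, 2, 3, 4],
    'a': ['dave@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'dave@hotmail.com'],
    'b': ['mary@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
    'c': ['jane@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
})

# True would mark a row whose (a, b, c) combination already appeared in an earlier row.
row_dupes = df.duplicated(subset=['a', 'b', 'c'])
print(row_dupes.tolist())  # [False, False, False, False]

# No row repeats another, so drop_duplicates has nothing to remove.
df2 = df.drop_duplicates(subset=['a', 'b', 'c'])
print(df2.equals(df))  # True
```

So the string dtype is not the problem; the comparison is simply happening between rows rather than inside them.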
cols = ['a', 'b', 'c']
df[cols] = df[cols].apply(pd.Series.drop_duplicates, axis=1)
   z                 a                 b                 c
0  1  dave@hotmail.com  mary@hotmail.com  jane@hotmail.com
1  2     peter@cnn.com               NaN               NaN
2  3      paul@pbs.com               NaN               NaN
3  4  dave@hotmail.com  mike@hotmail.com               NaN
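For reference, the same approach as a self-contained sketch that can be run end to end. The reindex step is an addition, not part of the code above: it assumes you want to guard against a column disappearing from the result when every one of its values is a within-row duplicate, which would otherwise make the assignment fail. The name deduped is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'z': [1, 2, 3, 4],
    'a': ['dave@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'dave@hotmail.com'],
    'b': ['mary@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
    'c': ['jane@hotmail.com', 'peter@cnn.com', 'paul@pbs.com', 'mike@hotmail.com'],
})
cols = ['a', 'b', 'c']

# Drop repeated values within each row (first occurrence is kept).
# pandas fills the dropped positions with NaN when it reassembles the rows.
deduped = df[cols].apply(pd.Series.drop_duplicates, axis=1)

# Assumption: reindex keeps all three columns in the expected order,
# even if one of them was dropped from every row.
df[cols] = deduped.reindex(columns=cols)
print(df)
```

If you later want the remaining emails as a list per row, something like df[cols].apply(lambda r: r.dropna().tolist(), axis=1) should work on this result.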