I have a data frame whose column values are lists, and I want to find the differences between two columns.
import pandas as pd
data = {'NAME': ['JOHN', 'MARY', 'CHARLIE'],
        'A': [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
        'B': [[2, 3, 4], [3, 4, 5], [4, 5, 6]]}
df = pd.DataFrame(data)
Why doesn't the following work?
df = df.assign(X1 = lambda x: [y for y in x['A'] if y not in x['B']])
I get this error:
TypeError: unhashable type: 'list'
I don't understand why?
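This is where lambdas get interesting. With df.assign, the lambda is called once with the entire DataFrame, not once per row, so x['B'] is the whole column (a Series). The test `y not in x['B']` then asks whether y is one of the Series' index labels, which requires hashing y, and a list cannot be hashed. That is where the TypeError comes from. These two lambdas give the same result:
df = df.assign(X1 = lambda x: [y for y in x['A']]) #element-by-element copy; x is the entire DataFrame
df = df.assign(X1 = lambda x: x['A']) #returns the column directly; x is still the entire DataFrame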
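To check this yourself, here is a small sketch. The names `seen`, `probe` and `message` are only for illustration. It records what assign passes to the callable, then repeats the failing membership test outside the lambda:

```python
import pandas as pd

df = pd.DataFrame({
    'NAME': ['JOHN', 'MARY', 'CHARLIE'],
    'A': [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
    'B': [[2, 3, 4], [3, 4, 5], [4, 5, 6]],
})

# Record the type of the argument that assign hands to the callable.
seen = []

def probe(x):
    seen.append(type(x).__name__)
    return x['A']

df.assign(X1=probe)
print(seen)  # ['DataFrame']

# `in` on a Series tests the index labels, not the values.
print(0 in df['B'])                    # True: 0 is an index label
print([2, 3, 4] in df['B'].tolist())   # True: a plain list compares values

# Looking a list up in the index means hashing it, which fails.
message = None
try:
    [1, 2, 3] in df['B']
except TypeError as err:
    message = str(err)
print(message)
```

The last print should show the same "unhashable type" message as in the question.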
One (lengthy) way to do what you are asking is to iterate through each row, and then compare the nested lists:
df = df.assign(X1 = lambda x: [[y for y in x['A'][i] if y not in x['B'][i]] for i in range(len(x['A']))]) #relies on the default 0..n-1 index
which can be simplified to one of the following:
df = df.assign(X1 = [[y for y in r.A if y not in r.B] for i, r in df.iterrows()]) #similar structure to your initial solution
df = df.assign(X2 = [list(set(r.A).difference(r.B)) for i, r in df.iterrows()]) #more efficient, especially for larger sets
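One caveat with the set version: a set drops duplicates and does not guarantee order, so the two lines above can return different lists. A small sketch with made-up values (not taken from the question), plus a hybrid that I would consider if order matters; the variable names are hypothetical:

```python
a = [3, 1, 3, 2]
b = [2]

# List comprehension: keeps order and duplicates.
kept_order = [y for y in a if y not in b]

# Set difference: duplicates collapse, order is not guaranteed.
via_set = list(set(a).difference(b))

# Hybrid: fast set lookup, but still walks `a` in order.
b_lookup = set(b)
hybrid = [y for y in a if y not in b_lookup]

print(kept_order)       # [3, 1, 3]
print(sorted(via_set))  # [1, 3]
print(hybrid)           # [3, 1, 3]
```

For the data in the question there are no duplicates and each difference has one element, so both approaches agree there.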
Edit: Pulling in elements of @user19077881's answer, the row-wise operations do work if you use df.apply with axis=1 instead of df.assign, because the lambda then receives one row at a time:
df['X1'] = df.apply(lambda r: [y for y in r.A if y not in r.B], axis=1)
df['X2'] = df.apply(lambda r: list(set(r.A).difference(r.B)), axis=1)
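Putting it together, here is a self-contained sketch using the question's data. The column names X2 and X3 and the zip variant are my own additions for comparison, not part of the answers referenced above:

```python
import pandas as pd

df = pd.DataFrame({
    'NAME': ['JOHN', 'MARY', 'CHARLIE'],
    'A': [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
    'B': [[2, 3, 4], [3, 4, 5], [4, 5, 6]],
})

# Row-wise apply: r is one row (a Series), so r.A and r.B are plain lists.
df['X1'] = df.apply(lambda r: [y for y in r.A if y not in r.B], axis=1)
df['X2'] = df.apply(lambda r: list(set(r.A).difference(r.B)), axis=1)

# Variant without apply: pair the two columns up with zip.
df['X3'] = [[y for y in a if y not in b] for a, b in zip(df['A'], df['B'])]

print(df[['NAME', 'X1', 'X2', 'X3']])
```

All three columns should hold [1], [2] and [3] for JOHN, MARY and CHARLIE respectively.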