[SOLVED] Intersection of 2 columns within a single Dataframe pandas


import pandas as pd

df = pd.DataFrame({'Environment': [['AppleOS X','postgres','Apache','tomcat']], 'Description': [['Apache', 'Commons', 'Base32', 'decoding', 'invalid', 'rejecting', '.', 'via','valid', '.']] })

                             Environment                                                                Description
0  [AppleOS X, postgres, Apache, tomcat]  [Apache, Commons, Base32, decoding, invalid, rejecting, ., via, valid, .]

I am new to pandas and dataframes, and I have a question about finding the intersection of the two columns shown above.

Objective:

Environment and Description are two columns in a dataframe. The objective is to create a new column with the intersection of strings present in the first two columns.

Existing Implementation:

def f(param):
    return set.intersection(set(param['Environment']),set(param['Description']))

df['unique_words'] = df.apply(f, axis=1)
print(df['unique_words'])

I found this set intersection syntax at https://www.kite.com/python/answers/how-to-find-the-intersection-of-two-lists-in-python

Problem:

I am not sure how the syntax above works, but it returns {}.

Expected Output:

Since 'Apache' is present in both columns, ['Apache'] should be the value in the new column created in the dataframe.

Please let me know if anyone has written a similar function. Any help is appreciated.

Solution

Use set.intersection.
Map str.lower over the values in each list.
In terms of natural language processing, the list values should all be converted to lowercase so that the comparison is case-insensitive.
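
As a minimal sketch of what these two steps do on plain Python lists, before involving the dataframe at all (the names env and desc are only illustrative and hold the values from the sample row in the question):

```python
# Illustrative lists copied from the sample row in the question
env = ['AppleOS X', 'postgres', 'Apache', 'tomcat']
desc = ['Apache', 'Commons', 'Base32', 'decoding', 'invalid',
        'rejecting', '.', 'via', 'valid', '.']

# Lowercase every string in the first list and collect the results in a set
env_lower = set(map(str.lower, env))

# set.intersection accepts any iterable, so the second list only needs map()
common = env_lower.intersection(map(str.lower, desc))

print(common)  # {'apache'}
```

Because the result is a set, it has no guaranteed order and contains each word at most once.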

# assumes the dataframe has only these two columns; .iloc selects each value by position
df['common_words'] = df.apply(lambda x: list(set(map(str.lower, x.iloc[0])).intersection(map(str.lower, x.iloc[1]))), axis=1)

# if there are many columns, specify the two desired columns to compare and access them by name
df['common_words'] = df[['Environment', 'Description']].apply(lambda x: list(set(map(str.lower, x['Environment'])).intersection(map(str.lower, x['Description']))), axis=1)

# display(df)
                             Environment                                                                Description common_words
0  [AppleOS X, postgres, Apache, tomcat]  [Apache, Commons, Base32, decoding, invalid, rejecting, ., via, valid, .]     [apache]
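
Since sets are unordered, the order of the words in common_words is not guaranteed when more than one word matches. If the order matters, one possible alternative is sketched below. It is not part of the original answer: it assumes every cell holds a list of strings, and common_in_order is a hypothetical helper name. It keeps the matches in the order they appear in Environment and avoids df.apply by zipping the two columns.

```python
import pandas as pd

df = pd.DataFrame({
    'Environment': [['AppleOS X', 'postgres', 'Apache', 'tomcat']],
    'Description': [['Apache', 'Commons', 'Base32', 'decoding', 'invalid',
                     'rejecting', '.', 'via', 'valid', '.']],
})

def common_in_order(env, desc):
    """Hypothetical helper: lowercased words of env that also occur in desc, in env order."""
    desc_lower = {word.lower() for word in desc}  # set gives fast membership tests
    seen = set()                                  # tracks words already emitted
    result = []
    for word in env:
        lowered = word.lower()
        if lowered in desc_lower and lowered not in seen:
            seen.add(lowered)
            result.append(lowered)
    return result

# Build the new column row by row from the two list columns
df['common_words'] = [
    common_in_order(env, desc)
    for env, desc in zip(df['Environment'], df['Description'])
]

print(df['common_words'].tolist())  # [['apache']]
```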