I have two topic models, topics1 and topics2. They were created from very similar but different datasets, so the words representing each topic/cluster, as well as the topic numbers, will differ between the two. A toy example:
import pandas as pd

topics1 = pd.DataFrame({'topic_num': [1, 2, 3],
                        'words': [['red', 'blue', 'green'],
                                  ['blue', 'sky', 'cloud'],
                                  ['eat', 'food', 'nomnom']]})
topics2 = pd.DataFrame({'topic_num': [1, 2, 3],
                        'words': [['blue', 'sky', 'airplane'],
                                  ['blue', 'green', 'yellow'],
                                  ['mac', 'bit', 'byte']]})
For each topic in topics1, I would like to find the topic in topics2 with the maximum number of matching words. In the example above, topic_num 1 in topics1 would match topic_num 2 in topics2, and topic_num 2 in topics1 would match topic_num 1 in topics2. In both cases, 2 of the 3 words in each row match across the dataframes.
Is there a way to find this using built-in pandas functions such as eq()? My solution just iterated across every word in topics1 and every word in topics2.
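For reference, the iterative approach described here can be sketched roughly as follows (best_match is a hypothetical helper name, not part of any library):

```python
import pandas as pd

topics1 = pd.DataFrame({'topic_num': [1, 2, 3],
                        'words': [['red', 'blue', 'green'],
                                  ['blue', 'sky', 'cloud'],
                                  ['eat', 'food', 'nomnom']]})
topics2 = pd.DataFrame({'topic_num': [1, 2, 3],
                        'words': [['blue', 'sky', 'airplane'],
                                  ['blue', 'green', 'yellow'],
                                  ['mac', 'bit', 'byte']]})

def best_match(words, other):
    # count shared words against every row of the other frame
    counts = other['words'].map(lambda w: len(set(words) & set(w)))
    # return None when no topic shares any word
    return other['topic_num'].iloc[counts.idxmax()] if counts.max() > 0 else None

matches = {t: best_match(w, topics2)
           for t, w in zip(topics1['topic_num'], topics1['words'])}
# matches == {1: 2, 2: 1, 3: None}
```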
You will have to compute all combinations in any case.
You could use sets and broadcast the comparisons:
# set topic_num as the index and convert each list of words to a set
s1 = topics1.set_index('topic_num')['words'].map(set)
s2 = topics2.set_index('topic_num')['words'].map(set)

# broadcast the pairwise set intersections of s1 and s2 into a square
# array, wrap it in a DataFrame, then take the length of each intersection
tmp = (pd.DataFrame(s1.to_numpy()[:, None] & s2.to_numpy(),
                    index=s1.index, columns=s2.index)
         .map(len))  # use .applymap(len) on pandas < 2.1
# get the column id with the max intersection length per row,
# masking rows where no intersection was non-empty
out = tmp.idxmax(axis=1).where(tmp.gt(0).any(axis=1))
Output:
topic_num
1 2.0
2 1.0
3 NaN
dtype: float64
Intermediate tmp:
topic_num 1 2 3
topic_num
1 1 2 0
2 2 1 0
3 0 0 0
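To see what the broadcast step does in isolation: with object-dtype NumPy arrays of sets, the & operator is applied elementwise, so adding an axis on one side yields every pairwise intersection. A minimal sketch with made-up two-element arrays:

```python
import numpy as np

# two small object arrays of sets
a = np.array([{'red', 'blue'}, {'sky'}], dtype=object)
b = np.array([{'blue'}, {'sky', 'cloud'}], dtype=object)

# a[:, None] has shape (2, 1); broadcasting against b (shape (2,))
# gives a (2, 2) array of all pairwise intersections
inter = a[:, None] & b
# inter[0, 0] == {'blue'}, inter[1, 1] == {'sky'}, inter[1, 0] == set()
```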
Note that this logic can be wrapped in cdist, as rightly suggested by @Onyambu. However, I would first convert the lists to sets to avoid doing the conversion repeatedly inside cdist (which would be expensive). This requires a slightly odd trick, wrapping each set in a one-element list, since cdist expects 2D input:
from scipy.spatial.distance import cdist

# each "point" is a one-element list holding a set, giving cdist the 2D
# input it expects; the metric measures the intersection size
dst = lambda x, y: len(x[0] & y[0])

val = cdist([[set(x)] for x in topics1['words']],
            [[set(x)] for x in topics2['words']], dst)

pos = val.argmax(1)
out = pd.DataFrame({'topic1': topics1['topic_num'],
                    'topic2': (topics2['topic_num']
                               .iloc[pos]
                               .where(val[topics1.index, pos] != 0)
                               .values)})
# topic1 topic2
# 0 1 2.0
# 1 2 1.0
# 2 3 NaN
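If you'd rather stay closer to built-in pandas functions, as asked, one possible alternative (a sketch, not a drop-in replacement) is to explode both frames to one word per row, inner-join on the word, and count pairs. Note that topics with no overlap (like topic 3 here) simply drop out instead of appearing as NaN, and duplicate words within a row would inflate the counts:

```python
import pandas as pd

topics1 = pd.DataFrame({'topic_num': [1, 2, 3],
                        'words': [['red', 'blue', 'green'],
                                  ['blue', 'sky', 'cloud'],
                                  ['eat', 'food', 'nomnom']]})
topics2 = pd.DataFrame({'topic_num': [1, 2, 3],
                        'words': [['blue', 'sky', 'airplane'],
                                  ['blue', 'green', 'yellow'],
                                  ['mac', 'bit', 'byte']]})

# one word per row, then inner-join on the shared word
pairs = (topics1.explode('words')
                .merge(topics2.explode('words'), on='words',
                       suffixes=('_1', '_2'))
                .groupby(['topic_num_1', 'topic_num_2'])
                .size())

# for each topic in topics1, keep the topics2 topic with the most matches
best = pairs.groupby(level=0).idxmax().map(lambda t: t[1])
# best.to_dict() == {1: 2, 2: 1}
```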