I have two datasets, dataset1 and dataset2, which share a common column called SAX
that holds string values.
dataset1=
SAX
0 glngsyu
1 zicobgm
2 eerptow
3 cqbsynt
4 zvmqben
.. ...
475 rfikekw
476 bnbzvqx
477 rsuhgax
478 ckhloio
479 lbzujtw
480 rows × 2 columns
and
dataset2=
SAX
0 glngsyu
1 zicobgm
2 eerptow
3 cqbsynt
4 zvmqben
.. ...
475 rfikekw
476 bnbzvqx
477 rsuhgax
478 ckhloio
479 lbzujtw
480 rows × 2 columns
I need the output to be a column holding, for each row, the minimum number of edits (operations) required to convert the "SAX" value in dataset1 into the "SAX" value in dataset2. Is there a way to accomplish that?
Thanks.
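You can use the Levenshtein distance from the textdistance module (installable with pip install textdistance), which counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other: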
from textdistance import levenshtein
# Align the two SAX columns by index in a single DataFrame (SAX_1 from dataset1, SAX_2 from dataset2)
df = dataset1[['SAX']].merge(dataset2[['SAX']], left_index=True, right_index=True, suffixes=('_1', '_2'))
# Compute the Levenshtein distance row by row
df['distance'] = df.apply(lambda x: levenshtein.distance(x['SAX_1'], x['SAX_2']), axis=1)
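If installing textdistance is not an option, the same distance can be computed with a small dynamic-programming function. The following is only a minimal sketch, not the library's implementation: the function name edit_distance and the sample lists sax_1 and sax_2 are hypothetical stand-ins for the two SAX columns.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev is the previous row of the DP table: prev[j] is the distance
    # between the prefix of a processed so far and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if they match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical sample values standing in for dataset1['SAX'] and dataset2['SAX']
sax_1 = ["glngsyu", "zicobgm", "kitten"]
sax_2 = ["glngsyu", "zicobgx", "sitting"]

# Pair the values positionally and compute one distance per row
distances = [edit_distance(a, b) for a, b in zip(sax_1, sax_2)]
print(distances)  # [0, 1, 3]
```

With the real data, and assuming both dataframes have the same row order, the list comprehension could be fed with zip(dataset1['SAX'], dataset2['SAX']) and the result assigned to a new column.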