I am using the SequenceMatcher ratio to match the rows of two DataFrames by their best similarity score.
For each row, I want to first check whether the score between A and AA is good, then whether the score between B and BB is good, then whether the score between C and CC is good, and only then add the line.
df1:

           A   B    C
    0  pizza  ze    3
    1   polo  fe    5
    2  ninja  fi  NaN

df2:

          AA  BB     CC
    0     za  ze    NaN
    1     po  ka      8
    2     fe  fe      6
    3  pizza  fi      3
    4   polo  ko      5
    5  ninja   3  pizza
I want a DataFrame like this:

           A   B    C     AA  BB     CC  score
    0  pizza  ze    3  pizza  ze      3    100
    1   polo  fe    5   polo  ko      5     75
    2  ninja  fi  NaN  ninja   3  pizza     30

I tried this function, but it doesn't work:

    from difflib import SequenceMatcher

    import numpy as np
    import pandas as pd

    def similar(a, b):
        # Similarity ratio between two sequences, from 0.0 to 1.0
        ratio = SequenceMatcher(None, a, b).ratio()
        return ratio

    order = []
    score = []
    for index, row in df1.iterrows():
        # Best A/AA ratio over all rows of df2
        maxima = [similar(row['A'], j) for j in df2['AA']]
        best_ratio = max(maxima)
        if best_ratio > 0.9:
            # Best B/BB ratio over all rows of df2
            maxima2 = [similar(row['B'], j) for j in df2['BB']]
            best_ratio2 = max(maxima2)
            if best_ratio2 > 0.9:
                # Best C/CC ratio; its position decides which df2 row is kept
                maxima3 = [similar(row['C'], j) for j in df2['CC']]
                best_ratio = max(maxima3)
                best_row = np.argmax(maxima3)
                order.append(best_row)
                score.append(best_ratio)
    df2 = df2.iloc[order].reset_index()
    merge = pd.concat([df1, df2], axis=1)

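
Two things likely get in the way of the attempt above: SequenceMatcher expects sequences such as strings, so the numeric and NaN cells in C and CC are a probable source of errors, and each column is searched on its own, so the row that wins on CC need not be the row that matched on AA. Below is a minimal sketch of a row-wise alternative. It works on plain lists of dicts so that it runs with the standard library only. The helper names (`as_text`, `best_matches`), the ranking rule (A first, then B, then C as tie-breakers) and the score (mean of the three ratios, times 100) are assumptions made for illustration, so the scores will not necessarily equal the ones in the desired output.

```python
from difflib import SequenceMatcher
import math


def as_text(value):
    """Return a cell as text; None/NaN become '' and 3.0 becomes '3'."""
    if value is None:
        return ""
    if isinstance(value, float):
        if math.isnan(value):
            return ""
        if value.is_integer():
            return str(int(value))
    return str(value)


def similar(a, b):
    """SequenceMatcher ratio computed on the text form of two cells."""
    return SequenceMatcher(None, as_text(a), as_text(b)).ratio()


def best_matches(left_rows, right_rows, column_pairs):
    """For each left row, return (index of best right row, score 0-100).

    Candidates are ranked by their per-column ratios in the order given by
    column_pairs: the first pair has priority, later pairs break ties.
    The score is the mean ratio times 100 (an assumption, not a standard).
    """
    results = []
    for left in left_rows:
        ratios = [
            tuple(similar(left[a], right[b]) for a, b in column_pairs)
            for right in right_rows
        ]
        best = max(range(len(ratios)), key=ratios.__getitem__)
        score = round(100 * sum(ratios[best]) / len(column_pairs))
        results.append((best, score))
    return results


# Same data as in the question, written as records.
left_rows = [
    {"A": "pizza", "B": "ze", "C": 3},
    {"A": "polo", "B": "fe", "C": 5},
    {"A": "ninja", "B": "fi", "C": float("nan")},
]
right_rows = [
    {"AA": "za", "BB": "ze", "CC": float("nan")},
    {"AA": "po", "BB": "ka", "CC": 8},
    {"AA": "fe", "BB": "fe", "CC": 6},
    {"AA": "pizza", "BB": "fi", "CC": 3},
    {"AA": "polo", "BB": "ko", "CC": 5},
    {"AA": "ninja", "BB": 3, "CC": "pizza"},
]
pairs = [("A", "AA"), ("B", "BB"), ("C", "CC")]

matches = best_matches(left_rows, right_rows, pairs)
print(matches)  # [(3, 67), (4, 67), (5, 33)]
```

With pandas, the records can be obtained with `df1.to_dict("records")` and `df2.to_dict("records")`. The matched indices can then be passed to `df2.iloc[...]` followed by `reset_index(drop=True)` before the `pd.concat`, and the scores assigned as a new column. A minimum-score threshold, like the 0.9 in the attempt, could be applied to the returned score if weak matches should be dropped.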
The best approach is to use TF-IDF to find the best ratio.
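
As a rough illustration of that idea, here is a dependency-free sketch that builds TF-IDF vectors over character bigrams and picks the candidate with the highest cosine similarity. It only compares A against AA. The function names, the bigram size, the smoothed IDF formula and fitting the weights on both columns together are all assumptions of this sketch. In practice a library implementation such as scikit-learn's `TfidfVectorizer` with a character analyzer is a more common choice.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Character n-grams of a lower-cased string padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(texts, n=2):
    """L2-normalised TF-IDF vectors (as dicts) over character n-grams."""
    counts = [Counter(char_ngrams(t, n)) for t in texts]
    # Number of texts in which each n-gram occurs at least once.
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(texts)
    vectors = []
    for c in counts:
        vec = {
            g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


queries = ["pizza", "polo", "ninja"]                        # column A
candidates = ["za", "po", "fe", "pizza", "polo", "ninja"]   # column AA

# Fit the weights on both columns so they share one vocabulary.
vectors = tfidf_vectors(queries + candidates)
query_vecs = vectors[:len(queries)]
cand_vecs = vectors[len(queries):]

best = []
for q in query_vecs:
    sims = [cosine(q, c) for c in cand_vecs]
    i = max(range(len(sims)), key=sims.__getitem__)
    best.append(candidates[i])
print(best)  # ['pizza', 'polo', 'ninja']
```

Compared with SequenceMatcher, this weights rare character sequences more heavily than common ones, and the vectors can be computed once per column instead of once per pair of cells.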