I have been trying the last 2 weeks to solve this problem, and i am almost at the goal.
Case: Overall depiction of what i am trying
Example: Let us say i have Cell X1 and i match it which each cell in Y(1,2,3) X1 match the most with Y3.
UPDATED What i have:
This code is able to match with sequencematcher and print the matches, however i only get one output match instead of a list of maximum matches:
import pandas as pd
from difflib import SequenceMatcher
data1 = {'Fruit': ['Apple','Pear','mango','Pinapple'],
'nr1': [22000,25000,27000,35000],
'nr2': [1,2,3,4]}
data2 = {'Fruit': ['Apple','Pear','mango','Pinapple'],
'nr1': [22000,25000,27000,35000],
'nr2': [1,2,3,4]}
df1 = pd.DataFrame(data1, columns = ['Fruit', 'nr1', 'nr2'])
df2 = pd.DataFrame(data2, columns = ['nr1','Fruit', 'nr2'])
#Single out specefic columns to match
col1=(df1.iloc[:,[0]])
col2=(df2.iloc[:,[1]])
#function to match 2 values similarity
def similar(a,b):
ratio = SequenceMatcher(None, a, b).ratio()
matches = a, b
return ratio, matches
for i in col1:
print(max(similar(i,j) for j in col2))
Output: (1.0, ('Fruit', 'Fruit'))
How do i fix so that it will give me all the max matches and how do i extract the respective rows the matches are located in?
This should work:
import pandas as pd
import numpy as np
from difflib import SequenceMatcher
def similar(a, b):
ratio = SequenceMatcher(None, a, b).ratio()
return ratio
data1 = {'Fruit': ['Apple', 'Pear', 'mango', 'Pinapple'],
'nr1': [22000, 25000, 27000, 35000],
'nr2': [1, 2, 3, 4]}
data2 = {'Fruit': ['Apple', 'mango', 'peer', 'Pinapple'],
'nr1': [22000, 25000, 27000, 35000],
'nr2': [1, 2, 3, 4]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
order = []
for index, row in df1.iterrows():
maxima = [similar(row['Fruit'], j) for j in df2['Fruit']]
best_ratio = max(maxima)
best_row = np.argmax(maxima)
order.append(best_row)
df2 = df2.iloc[order].reset_index()
pd.concat([df1, df2], axis=1)