pythonexcelpandasdataframesequencematcher

Python How to loop sequence match Dataframes through specific columns and extra the rows


I have been trying the last 2 weeks to solve this problem, and i am almost at the goal.

Case: Overall depiction of what i am trying

Example: Let us say i have Cell X1 and i match it which each cell in Y(1,2,3) X1 match the most with Y3.

UPDATED What i have:

This code is able to match with sequencematcher and print the matches, however i only get one output match instead of a list of maximum matches:

import pandas as pd
from difflib import SequenceMatcher

data1 = {'Fruit': ['Apple','Pear','mango','Pinapple'],
        'nr1': [22000,25000,27000,35000],
        'nr2': [1,2,3,4]}

data2 = {'Fruit': ['Apple','Pear','mango','Pinapple'],
        'nr1': [22000,25000,27000,35000],
        'nr2': [1,2,3,4]}

df1 = pd.DataFrame(data1, columns = ['Fruit', 'nr1', 'nr2'])
df2 = pd.DataFrame(data2, columns = ['nr1','Fruit', 'nr2'])

#Single out specefic columns to match
col1=(df1.iloc[:,[0]])
col2=(df2.iloc[:,[1]])

#function to match 2 values similarity
def similar(a,b):
    ratio = SequenceMatcher(None, a, b).ratio()
    matches = a, b
    return ratio, matches

for i in col1:
    print(max(similar(i,j) for j in col2))

Output: (1.0, ('Fruit', 'Fruit'))

How do i fix so that it will give me all the max matches and how do i extract the respective rows the matches are located in?


Solution

  • This should work:

    import pandas as pd
    import numpy as np
    from difflib import SequenceMatcher
    
    
    def similar(a, b):
        ratio = SequenceMatcher(None, a, b).ratio()
        return ratio
    
    
    data1 = {'Fruit': ['Apple', 'Pear', 'mango', 'Pinapple'],
             'nr1': [22000, 25000, 27000, 35000],
             'nr2': [1, 2, 3, 4]}
    
    data2 = {'Fruit': ['Apple', 'mango', 'peer', 'Pinapple'],
             'nr1': [22000, 25000, 27000, 35000],
             'nr2': [1, 2, 3, 4]}
    
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)
    
    order = []
    for index, row in df1.iterrows():
        maxima = [similar(row['Fruit'], j) for j in df2['Fruit']]
        best_ratio = max(maxima)
        best_row = np.argmax(maxima)
    
        order.append(best_row)
    
    df2 = df2.iloc[order].reset_index()
    
    pd.concat([df1, df2], axis=1)