pythonpandasdataframeif-statementsequencematcher

if and statement between to pandas dataframes


I have 2 datasets, using data from df1 I want to identify duplicate data in df2 using 4 conditions.

If a row of df1 'Name' column matches more than 80% with any row of 'Name' column in df2

(AND)

(df1['Class'] == df2['Class'] (OR) df1['Amt $'] == df2['Amt $'])

(AND)

If row of 'Category' column in df1 matches more than 80% with any row item of 'Category' column in df2

if all conditions are met then keep only the new data in df2 and delete the other rows.

df1

Name    Class   Amt $   Category
Apple      1    5       Fruit
Banana     2    8       Fruit
Cat        3    4       Animal

df2

Index   Name              Class Amt $   Category
    1   Apple is Red       1    5       Fruit
    2   Banana             2    8       fruits
    3   Cat is cute        3    4       animals
    4   Green Apple        1    5       fruis
    5   Banana is Yellow   2    8       fruet
    6   Cat                3    4       anemal
    7   Apple              1    5       anemal
    8   Ripe Banana        2    8       frut
    9   Royal Gala Apple   1    5       Fruit
    10  Cats               3    4       animol
    11  Green Banana       2    8       Fruit
    12  Green Apple        1    5       fruits
    13  White Cat          3    4       Animal
    14  Banana is sweet    2    8       appel
    15  Apple is Red       1    5       fruits
    16  Ginger Cat         3    4       fruits
    17  Cat house          3    4       animals
    18  Royal Gala Apple   1    5       fret
    19  Banana is Yellow   2    8       fruit market
    20  Cat is cute        3    4       anemal


for i in df1['Name']:
    for u in df2['Name']:
        for k in df1['Class']:
            for l in df2['Class']:
                for m in df1['Amt $']:
                    for n in df2['Amt $']:
                        for o in df1['Category']:
                            for p in df2['Category']:
                                if SequenceMatcher(None, i, u).ratio() > .8 and k == l and m == n and SequenceMatcher(None, o, p).ratio() > 0.8:
                                    print(i, u)

Desired output dataframe should like something like this:

Name              Class Amt $   Category
Apple is Red        1   5       Fruit
Banana              2   8       fruits
Cat is cute         3   4       animals
Green Apple         1   5       fruis
Banana is Yellow    2   8       fruet
Cat                 3   4       anemal
Ripe Banana         2   8       frut
Royal Gala Apple    1   5       Fruit
Cats                3   4       animol
Green Banana        2   8       Fruit
Green Apple         1   5       fruits
White Cat           3   4       Animal
Apple is Red        1   5       fruits
Cat house           3   4       animals
Banana is Yellow    2   8       fruit market
Cat is cute         3   4       anemal

Please help me with the best solution! :)


Solution

  • First you have to iterate through your both dfs and match using the conditions and set a variable in df2.

    df2['match'] = False
    for idx2, row2 in df2.iterrows():
        match = False
        for idx1, row1 in df1.iterrows():
            if (SequenceMatcher(None, row1['Name'], row2['Name']).ratio())>=0.8 and \
                    (SequenceMatcher(None, row1['Category'], row2['Category']).ratio())>=0.8 and \
                    (row1['Class'] == row2['Class'] or row1['Amt $'] == row2['Amt $']):
                match = True
                break
        df2.at[idx2, 'match'] = match
    

    Once you have the matches, then you remove the duplicates from the ones that are matches df2['match']==True.

    df2[df2['match']==True].drop_duplicates(keep='first')
    

    Next you can join the above result with the non-matches df2['match']==False

    df2[df2['match']==False].append(df2[df2['match']==True].drop_duplicates(keep='first'))
    

    Here I am assuming you want to remove direct duplicates. Do you want to remove the duplicates based on the conditions or direct duplicates?

    Based on the test data set you have here 'Apple' and 'Apple is red' are 80% match. But SequenceMatcher(None, 'Apple', 'Apple is Red').ratio() gives only 0.5882352941176471. Similarly SequenceMatcher(None, 'Fruit', 'fruits').ratio() is only 0.7272727272727273. Are you expecting anything else here? or is the expected result not right?

    Anyway, I hope this gives you an idea on the approach.

    EDIT 1 If you want to get the matching df1['Name'].

    I have only reset df2['match'] as a string instead of boolean and assigned df1['Name'] to df2['match'] instead of assigning it to True. Then in the final df I am concatenating the df2 rows that has df2['match']==False and the non duplicate rows of df2['match']==True. Hope this helps.

    df2['match'] = ''
    for idx2, row2 in df2.iterrows():
        match = ''
        for idx1, row1 in df1.iterrows():
            if (SequenceMatcher(None, row1['Name'], row2['Name']).ratio())>=0.5 and \
                    (SequenceMatcher(None, row1['Category'], row2['Category']).ratio())>=0.5 and \
                    (row1['Class'] == row2['Class'] or row1['Amt $'] == row2['Amt $']):
                match = row1['Name']
                break
        df2.at[idx2, 'match'] = match
    
    
    print(df2[df2['match']==''].append(df2[df2['match']!=''].drop_duplicates(keep='first')))