pythonpandasdataframestring-matchingrapidfuzz

Using rapidfuzz on a dataframe


I have 4 columns which are BuisnessID, Name, BuisnessID_y, Name_y and I want to match Name with Name_y with a 90% similarity score, and if not 90% then drop those rows. Sample input

df

BusinessID      NAME        BusinessID_y  NAME_y

1013120869  MANOJ WANKHADE  1013404164    SLIMI
1013120869  MANOJ WANKHADE  1013831688    AMOL SHAHAKAR
1013120869  MANOJ WANKHADE  1013376009    PRATHMESH AGRAWAL
1013120869  MANOJ WANKHADE  1013376009    PRATHMESH AGRAWAL
1013120869  MANOJ WANKHADE  1013478922    AMBRISH PANDRIKAR

I am new to python and am not sure how to do this. Also, I have 500k records so any another approach other rapid-fuzz would be great


Solution

  • >>> import pandas as pd
    >>> import rapidfuzz
    >>> df['matching_ratio'] = df.apply(lambda x:rapidfuzz.fuzz.ratio(x.NAME, x.NAME_y), axis=1).to_list()
    >>> df
    
       BusinessID            NAME  BusinessID_y             NAME_y  matching_ratio
    0  1013120869  MANOJ WANKHADE    1013404164              SLIMI       10.526316
    1  1013120869  MANOJ WANKHADE    1013831688      AMOL SHAHAKAR       44.444444
    2  1013120869  MANOJ WANKHADE    1013376009  PRATHMESH AGRAWAL       25.806452
    3  1013120869  MANOJ WANKHADE    1013376009  PRATHMESH AGRAWAL       25.806452
    4  1013120869  MANOJ WANKHADE    1013478922  AMBRISH PANDRIKAR       38.709677
    
    
    >>> df[df.matching_ratio > 26] # change this '26' value to '90' as your requirmetn
    
       BusinessID            NAME  BusinessID_y             NAME_y  matching_ratio
    1  1013120869  MANOJ WANKHADE    1013831688      AMOL SHAHAKAR       44.444444
    4  1013120869  MANOJ WANKHADE    1013478922  AMBRISH PANDRIKAR       38.709677