pythonpandasdataframesimilarityfuzzy-comparison

Change Typo Column Values with Right Word based on Columns in Other Dataframe


I have two dataframe, the first one is location ,

location = pd.DataFrame({'city': ['RIYADH','SEOUL','BUSAN','TOKYO','OSAKA'],
                    'country': ['Saudi Arabia','South Korea','South Korea','Japan','Japan']})

the other one is customer,

customer = pd.DataFrame({'id': [1001,2002,3003,4004,5005,6006,7007,8008,9009],
                        'city': ['tokio','Sorth KOREA','riadh','JAPANN','tokyo','osako','Arab Saudi','SEOUL','buSN']})

I want to change the typo word in location column in customer dataframe with the right one in city/country from location dataframe. So the output will be like this:

id    location
1001  TOKYO
2002  South Korea
3003  RIYADH
4004  Japan
5005  TOKYO
6006  OSAKA
7007  Saudi Arabia
8008  SEOUL
9009  BUSAN

Solution

  • A possible solution, based on RapidFuzz:

    from rapidfuzz import process
    
    out = (customer.assign(
        aux = customer['city']
        .map(lambda x: 
            process.extractOne(x, location['city']+'*'+location['country'])[0])))
    
    out[['aux1', 'aux2']] = out['aux'].str.split(r'\*', expand=True)
    out['city'] = out.apply(lambda x: 
        process.extractOne(x['city'], x.loc['aux1':'aux2'])[0], axis=1)
    out = out.drop(columns=['aux', 'aux1', 'aux2'])
    

    Output:

         id          city
    0  1001         TOKYO
    1  2002   South Korea
    2  3003        RIYADH
    3  4004         Japan
    4  5005         TOKYO
    5  6006         OSAKA
    6  7007  Saudi Arabia
    7  8008         SEOUL
    8  9009         BUSAN
    

    EDIT

    This tries to offer a solution for the OP's below comment:

    from rapidfuzz import process
    
    def get_match(x, y, score):
        match = process.extractOne(x, y)
        return np.nan if match[1] < score else match[0]
    
    out = (customer.assign(
        aux=customer['city']
        .map(lambda x:
             process.extractOne(x, location['city']+'*'+location['country'])[0])))
    
    out[['aux1', 'aux2']] = out['aux'].str.split(r'\*', expand=True)
    out['city'] = out.apply(lambda x: get_match(
        x['city'], x.loc['aux1':'aux2'], 92), axis=1)
    out = out.drop(columns=['aux', 'aux1', 'aux2'])
    

    Output:

         id   city
    0  1001    NaN
    1  2002    NaN
    2  3003    NaN
    3  4004    NaN
    4  5005  TOKYO
    5  6006    NaN
    6  7007    NaN
    7  8008  SEOUL
    8  9009    NaN