python pandas regex dataframe

Literal match of strings in dataframe to other dataframe with multiple match options


I have a dataframe (df) with values in a column 'country' that I wish to standardize using another dataframe called 'country_codes'. A value from df can match any item in 'country_codes', but the resulting dataframe should contain the corresponding country_codes['country'] value, i.e. the standard value.

The code mostly works and does return the standard country value, but the regex does not require the whole string to match: it also fires on a partial hit (in this case 'Example1' inside 'Example1 and'); see the sketch after the function below.

Additional question: is it possible to preserve the 'year' data in the final output dataframe without hard-coding the name 'year', since there could be multiple float columns?

The function and desired outputs are shown below:

import re
from functools import reduce

import pandas as pd

def match_country_codes(df, country_codes):
    # Build a regex that matches any code value as a whole word
    pattern = '|'.join(
        rf'\b{re.escape(c)}\b'
        for c in country_codes[['country', 'alpha1', 'alpha2']].values.flatten()
    )
    # New column holding the first pattern hit found inside df['country']
    df['matched_country'] = df['country'].str.extract(f'({pattern})', flags=re.IGNORECASE)
    # Merge with 'country_codes' once per code column to recover the full country name
    df1 = df.merge(country_codes, left_on='matched_country', right_on='country', how='left')
    df2 = df.merge(country_codes, left_on='matched_country', right_on='alpha1', how='left')
    df3 = df.merge(country_codes, left_on='matched_country', right_on='alpha2', how='left')

    dataframes = [df1, df2, df3]
    # Combine the three frames on 'country_y' (merge_dataframes is a helper defined elsewhere)
    result = reduce(merge_dataframes, dataframes)
    # Drop rows with None or NaN values in the 'country_y' column
    result = result.dropna(subset=['country_y'])
    return result
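
For reference, str.extract searches for the pattern anywhere in the cell, which is why 'Example1 and' yields a hit on 'Example1'. A minimal sketch of a whole-cell variant using str.fullmatch (one possible fix, not the original code):

import re

# Same code values as above, but each cell must equal one alternative in full,
# so 'Example1 and' no longer matches 'Example1'.
values = country_codes[['country', 'alpha1', 'alpha2']].values.flatten()
pattern = '|'.join(re.escape(c) for c in values)
whole_cell = df['country'].str.fullmatch(pattern, case=False)
df_matched = df[whole_cell.fillna(False)]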

Sample dataframe:

df = pd.DataFrame({'country': ['foobar', 'foo and bar', 'Example1 and', 'PQR'],
                   'year': [2018, 2019, 'NA', 2017]})
country_codes = pd.DataFrame({'country': ['FooBar', 'Example1', 'foo and bar and foo', 'Example'],
                              'alpha1': ['foobar', 'Bosnia', 'ABC', 'DEF'],
                              'alpha2': ['GHI', 'JKL', 'MNO', 'PQR']})

Output:

result = match_country_codes(df, country_codes)
result

Desired Output:

data = {'country_y': ['FooBar', 'Example']}

index_values = [0, 3]

desired_output = pd.DataFrame(data, index=index_values)
desired_output

Thank you


Solution

  • A slight modification to your approach: join the incoming dataframe to country_codes on each of the code columns (using set_index), then take the first non-NaN value across the resulting country columns in each row. Because the lookup is written back onto df as a new column, 'year' and any other columns are preserved without naming them explicitly.

    lookup_cols = ['country_1', 'country_2', 'country_3']
    df['country_y'] = (df
        .join(country_codes.set_index('country', drop=False), on='country', rsuffix='_1')
        .join(country_codes.set_index('alpha1'), on='country', rsuffix='_2')
        .join(country_codes.set_index('alpha2'), on='country', rsuffix='_3')
        [lookup_cols]
        # first non-NaN country name per row; bfill pulls it into the
        # leftmost column (same effect as the deprecated groupby(axis=1).first())
        .bfill(axis=1)
        .iloc[:, 0]
    )
    df = df.dropna(subset=['country_y'])
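
    To see the coalescing step in isolation, here is a small standalone sketch (toy values and hypothetical column names) of picking the first non-NaN value across columns:

    import numpy as np
    import pandas as pd

    lookups = pd.DataFrame({
        'country_1': [np.nan, 'FooBar'],
        'country_2': ['Example1', np.nan],
        'country_3': [np.nan, np.nan],
    })
    # bfill along the columns moves each row's first non-NaN value into
    # the leftmost column; iloc[:, 0] then selects that column.
    coalesced = lookups.bfill(axis=1).iloc[:, 0]
    # coalesced.tolist() -> ['Example1', 'FooBar']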
    

    For this sample data:

    data = {
      'country': [None, 'foobar', 'foo and bar', 'Example1 and', 'PQR', 'Example', 'Bosnia', None, 'JKL', 'foobar'],
      'year': [None, None, 2018, 2019, 2017, 2020, 2017, None, 2019, 2019]
    }
    df = pd.DataFrame(data)
    

    Input:

            country    year
    0          None     NaN
    1        foobar     NaN
    2   foo and bar  2018.0
    3  Example1 and  2019.0
    4           PQR  2017.0
    5       Example  2020.0
    6        Bosnia  2017.0
    7          None     NaN
    8           JKL  2019.0
    9        foobar  2019.0
    

    The output will be:

       country    year country_y
    1   foobar     NaN    FooBar
    4      PQR  2017.0   Example
    5  Example  2020.0   Example
    6   Bosnia  2017.0  Example1
    8      JKL  2019.0  Example1
    9   foobar  2019.0    FooBar
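
    Applied to the question's own sample frames, the same steps keep rows 0 and 3 and carry 'year' through untouched, which also answers the follow-up about not naming 'year' explicitly. Worked out by hand from the joins above, the expected result is:

      country  year country_y
    0  foobar  2018    FooBar
    3     PQR  2017   Example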