I have a dataframe (df) with values in a column 'country' that I want to standardize using another dataframe called 'country_codes'. A value from df can match against any item in 'country_codes', and the resulting dataframe should contain the corresponding country_codes['country'] value, i.e. the standard value.
The code mostly works and does return the standard country value, but the regex doesn't match the exact string — it matches a substring instead (in this case: 'Example1' for the cell 'Example1 and').
Additional question: is there a way to preserve the 'year' data in the final output dataframe without naming 'year' explicitly, since there could be multiple float columns?
The function and the desired output are shown below:
import re
from functools import reduce
import pandas as pd

def match_country_codes(df, country_codes):
    # Build a regex alternation of all code values, each bounded by \b
    pattern = '|'.join(rf'\b{re.escape(c)}\b'
                       for c in country_codes[['country', 'alpha1', 'alpha2']].values.flatten())
    # New column holding whatever part of df['country'] the pattern matched
    df['matched_country'] = df['country'].str.extract(f'({pattern})', flags=re.IGNORECASE)
    # Merge with 'country_codes' once per code column to get the full country names
    df1 = df.merge(country_codes, left_on='matched_country', right_on='country', how='left')
    df2 = df.merge(country_codes, left_on='matched_country', right_on='alpha1', how='left')
    df3 = df.merge(country_codes, left_on='matched_country', right_on='alpha2', how='left')
    dataframes = [df1, df2, df3]
    # Combine the three frames on 'country_y' (merge_dataframes is a helper not shown here)
    result = reduce(merge_dataframes, dataframes)
    # Drop rows with None or NaN values in the 'country_y' column
    result = result.dropna(subset=['country_y'])
    return result
Sample dataframe:
df = pd.DataFrame({'country': ['foobar', 'foo and bar', 'Example1 and', 'PQR'],
                   'year': [2018, 2019, 'NA', 2017]})

country_codes = pd.DataFrame({'country': ['FooBar', 'Example1', 'foo and bar and foo', 'Example'],
                              'alpha1': ['foobar', 'Bosnia', 'ABC', 'DEF'],
                              'alpha2': ['GHI', 'JKL', 'MNO', 'PQR']})
Output:
result = match_country_codes(df, country_codes)
result
Desired Output:
data = {'country_y': ['FooBar', 'Example']}
index_values = [0, 3]
desired_output = pd.DataFrame(data, index=index_values)
desired_output
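To make the problem concrete, here is just the pattern-building and extract step run on the sample data — 'Example1 and' yields the partial match 'Example1' instead of failing:

```python
import re
import pandas as pd

# Same sample data as above
country_codes = pd.DataFrame({'country': ['FooBar', 'Example1', 'foo and bar and foo', 'Example'],
                              'alpha1': ['foobar', 'Bosnia', 'ABC', 'DEF'],
                              'alpha2': ['GHI', 'JKL', 'MNO', 'PQR']})
pattern = '|'.join(rf'\b{re.escape(c)}\b'
                   for c in country_codes[['country', 'alpha1', 'alpha2']].values.flatten())
s = pd.Series(['foobar', 'foo and bar', 'Example1 and', 'PQR'])
# 'Example1 and' matches only 'Example1'; 'foo and bar' matches nothing
matched = s.str.extract(f'({pattern})', flags=re.IGNORECASE)
print(matched)
```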
Thank you
A slight modification to your approach: join the incoming dataframe to country_codes on each of the three columns (using set_index), then take the first non-NaN value in each row's country values.
df['country_y'] = (df
    .join(country_codes.set_index('country', drop=False), on='country', rsuffix='_1')
    .join(country_codes.set_index('alpha1'), on='country', rsuffix='_2')
    .join(country_codes.set_index('alpha2'), on='country', rsuffix='_3')
    [['country_1', 'country_2', 'country_3']]
    # first non-NaN match per row: back-fill across columns and take the
    # first column (groupby(..., axis=1) is deprecated in recent pandas)
    .bfill(axis=1)
    .iloc[:, 0]
)
df = df.dropna(subset=['country_y'])
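The row-wise "first non-NaN" (coalesce) step can be seen in isolation on a toy frame — back-filling along axis=1 and taking the first column; the column names here are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical columns standing in for country_1 / country_2 / country_3
tmp = pd.DataFrame({'a': [np.nan, 'x', np.nan],
                    'b': ['y', np.nan, np.nan],
                    'c': ['z', 'w', np.nan]})
# bfill along axis=1 pulls each row's first non-NaN value into the first column
coalesced = tmp.bfill(axis=1).iloc[:, 0]
print(coalesced.tolist())  # ['y', 'x', nan]
```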
For this sample data:
data = {
'country': [None, 'foobar', 'foo and bar', 'Example1 and', 'PQR', 'Example', 'Bosnia', None, 'JKL', 'foobar'],
'year': [None, None, 2018, 2019, 2017, 2020, 2017, None, 2019, 2019]
}
df = pd.DataFrame(data)
Input:
country year
0 None NaN
1 foobar NaN
2 foo and bar 2018.0
3 Example1 and 2019.0
4 PQR 2017.0
5 Example 2020.0
6 Bosnia 2017.0
7 None NaN
8 JKL 2019.0
9 foobar 2019.0
The output will be:
country year country_y
1 foobar NaN FooBar
4 PQR 2017.0 Example
5 Example 2020.0 Example
6 Bosnia 2017.0 Example1
8 JKL 2019.0 Example1
9 foobar 2019.0 FooBar
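For completeness, the whole approach as one self-contained sketch (pandas only; it reproduces the output shown above):

```python
import pandas as pd

def match_country_codes(df, country_codes):
    # Join against each code column in turn; the right-hand 'country'
    # column picks up a numeric suffix on every overlap.
    matched = (df
        .join(country_codes.set_index('country', drop=False), on='country', rsuffix='_1')
        .join(country_codes.set_index('alpha1'), on='country', rsuffix='_2')
        .join(country_codes.set_index('alpha2'), on='country', rsuffix='_3')
        [['country_1', 'country_2', 'country_3']]
        .bfill(axis=1)   # first non-NaN match per row
        .iloc[:, 0]
    )
    out = df.copy()
    out['country_y'] = matched
    # Keep only rows that matched; all other columns (e.g. 'year') survive
    return out.dropna(subset=['country_y'])

country_codes = pd.DataFrame({
    'country': ['FooBar', 'Example1', 'foo and bar and foo', 'Example'],
    'alpha1': ['foobar', 'Bosnia', 'ABC', 'DEF'],
    'alpha2': ['GHI', 'JKL', 'MNO', 'PQR'],
})
df = pd.DataFrame({
    'country': [None, 'foobar', 'foo and bar', 'Example1 and', 'PQR',
                'Example', 'Bosnia', None, 'JKL', 'foobar'],
    'year': [None, None, 2018, 2019, 2017, 2020, 2017, None, 2019, 2019],
})
result = match_country_codes(df, country_codes)
print(result)
```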