pythonpandasdataframepython-unicodepandas-merge

pandas' dataframes merge challenge with identical strings but different unicodes


I have a problem using pd.merge when some of the rows in the two columns in the two datasets I use to merge the two datasets have different unicodes even though the strings are identical. Here is one example:

I have two datasets data1 and data2 both of which have 2 columns in common, 'state' and 'county', which will be the columns I use to merge the two datasets. I checked datatype for both 'state' and 'county' in two datasets. They are all of class 'str'.

By using

data_merge = pd.merge(data1, data2, on=['county','state'],how='right') 

I should have a match between data1 row 308 and data2 row 20691 but it's not a match due to the 'county' in data1 row 308 and the 'county' in row 20691 have different unicodes:

enter image description here

I looked into the unicode of these two words (unicode1 is the unicode of county in data1 and unicode2 is the unicode of county in data2) and they are indeed different:

enter image description here

How do I go about merging these two datasets with this issue? Is there a way I can tell pd.merge to ignore the unicode differences? Thank you!


Solution

  • Well, the difference in unicode is a lowercase k vs a capital K - which you can prove to yourself with print(chr(107),chr(75)). Merging on lowercase will probably solve your problem.

    However, let's say you like the capitalizations you have on the right side, and want to preserve them. Because you're doing a right merge, we can know that the indexes will be the same coming out of the merge as going in. Therefore, copy the capitalized values before the merge and replace them after.

    import pandas as pd
    
    print(chr(107), chr(75))  # k K
    
    data1 = pd.DataFrame({'county': ['mckinley', 'bell'], 'state': ['NM', 'tx'],'amount': [1, 2]})
    data2 = pd.DataFrame({'county': ['mckinley', 'beLL', 'harris'], 'state': ['NM', 'TX','tx']})
    
    #Optional: To preserve caps
    orig_caps = data2[['county','state']].copy()
    
    for item in ['county', 'state']:
        data1[item] = data1[item].str.lower()
        data2[item] = data2[item].str.lower()
    
    data_merge = pd.merge(data1, data2, on=['county', 'state'], how='right')
    data_merge[['county','state']] = orig_caps
    print(data_merge)
    

    Yields:

         county state  amount
    0  mckinley    NM     1.0
    1      beLL    TX     2.0
    2    harris    tx     NaN