pythonpandasdataframemergemultiple-columns

Python - Edit and Replace column values in a dataframe based on a column in another dataframe


I'm new to python and struck in a problem.

df1 = pd.DataFrame({'col1': ['apple', 'banana', 'cherry', 'apple', 'cherry']})
df2 = pd.DataFrame({'col1': ['app Banana', 'Cherry', 'banana', 'apple', 'bnapple', 'apple ch']})

Suppose I have df2 and df1 which have col1 values. df1 has correct full names but may have duplicates (which I need to keep in both dataframes due to other col data being different).

df2 has either bad names or incorrect case, which I want to correct. I want df2 column to reflect what df1 col says. the resulting column for df2 would be -

df2 = pd.DataFrame({'col1': ['banana', 'cherry', 'banana', 'apple', 'apple', 'apple']})

TIA!

I don't know how to. I tried merge query of dataframe but that just gave out everything in df1 + df2.


Solution

  • Craft a regex from df1 and extract from df2:

    import re
    pat = '(%s)' % '|'.join(map(re.escape, df1['col1'].unique()))
    # '(apple|banana|cherry)'
    
    df2['out'] = df2['col1'].str.extract(pat, flags=re.I, expand=False)
    
    # variant to have lowercase
    df2['out2'] = df2['col1'].str.casefold().str.extract(pat, flags=re.I, expand=False)
    

    If you only want to match full words of df2 modify the regex to add word boundaries (\b):

    pat = r'\b(%s)\b' % '|'.join(map(re.escape, df1['col1'].unique()))
    

    Output:

             col1     out    out2
    0  app Banana  Banana  banana
    1      Cherry  Cherry  cherry
    2      banana  banana  banana
    3       apple   apple   apple
    4     bnapple   apple   apple
    5    apple ch   apple   apple