pythonpandasdataframemergelines-of-code

Find matching value in column and create another column pandas dataframe


Suppose I have the following dataframe:

ID  Country Employee    Location
1   AE       Jay        AAA
2   AE       Mary       aa
3   AE       Peter      bbb
3   AE       Peter      ddd
6   DK       Donk       ddd
7   CZ       Cesar      fff
7   CZ       Cesar      GGg
7   CZ       Cesar      
8   CZ       Carlos     #

I need to use the below dataframe to confirm if the Location values are valid (according to their Country) and create an extra column named "Legacy Location Name" with the following:

lookup df:

Country Location
AE      bbb
AE      aaa
AE      ccc
DK      ddd
DK      eee
DK      fff
CZ      ggg
CZ      hhh

Output Expected

ID  Country Employee    Location    Legacy Location
1   AE      Jay         AAA         CORRECT VALUE
2   AE      Mary        bbb         aa
3   AE      Peter       bbb         CORRECT VALUE
3   AE      Peter       bbb         ddd
6   DK      Donk        ddd         CORRECT VALUE
7   CZ      Cesar       ggg         fff
7   CZ      Cesar       GGg         CORRECT VALUE
7   CZ      Cesar                   LOCATION NOT PROVIDED
8   CZ      Carlos      ggg         #

What is the best way to achieve it?

Thank you!


Solution

  • Not complicated, but requires many steps:

    s = (lookup_df.drop_duplicates('Country')
         .set_index('Country')['Location']
         )
    
    out = (df
     # handle location independently of case
     .assign(Location=df['Location'].str.casefold())
     # identify the correct values by merging 
     .merge(lookup_df.assign(**{'Legacy Location': 'CORRECT VALUE'}),
              how='left')
     # replace invalid locations
     .assign(**{'Location': lambda d: df['Location'].mask(d['Legacy Location'].isna()).fillna(df['Country'].map(s).mask(df['Location'].isna())),
     # add previous invalid locations
                'Legacy Location': lambda d: d['Legacy Location'].fillna(df['Location'].fillna('LOCATION NOT PROVIDED'))})
     
     )
    
    print(out)
    

    NB. Assuming all empty cells are NaNs for simplicity.

    Output:

       ID Country Employee Location        Legacy Location
    0   1      AE      Jay      AAA          CORRECT VALUE
    1   2      AE     Mary      bbb                     aa
    2   3      AE    Peter      bbb          CORRECT VALUE
    3   3      AE    Peter      bbb                    ddd
    4   6      DK     Donk      ddd          CORRECT VALUE
    5   7      CZ    Cesar      ggg                    fff
    6   7      CZ    Cesar      GGg          CORRECT VALUE
    7   7      CZ    Cesar      NaN  LOCATION NOT PROVIDED
    8   8      CZ   Carlos      ggg                      #