python pandas dataframe group-by lines-of-code

Search for LOV columns in dataframe and replace with codes using other dataframe

Imagine I have a dirty dataframe of employees with their ID, and Contract related information per country.

Some columns of this dataframe are LOV columns (depending on the country, some columns are LOV for just one country, others some or all of them) and some LOV columns are mandatory and some are not (that is just used to understand if a blank value is accepted or not).

We would need to check, using another mapping dataframe:

if the values provided exist in the mapping dataframe and,
if so, replace the value provided with the corresponding code on the dataframe.

If the value provided is not on the list, create a new column on the main dataframe named "Errors" where it says the name of the column it errored (if more than 1 column, maybe save the name in a list on that column).

So from this dataframe:

ID  Country Contract      Type
1   CZ      Permanent     BOFF
1   ES      Fixed-term    .
2   CZ      Contractor    Front-Office
3   PT      Permanent   
4   PT      2022-01-01    Employee
4   PT      Fixed-term    Office
4   ES                    Employee
5   SK      Permanent     Employee

And using this mapping:

Country   Field       Values          Code      Mandadory
CZ        Contract    Permanent       PE        Yes
CZ        Contract    Fixed-term      FX        Yes
CZ        Contract    Contractor      CT        Yes
ES        Contract    Permanent       PERMA     No
SK        Contract    Permanent       PER-01    Yes
SK        Contract    Fixed-term      FIX-01    Yes
ES        Type        Office          OFF       Yes
CZ        Type        Back-Office     BOFF      Yes 
CZ        Type        Front-Office    FOFF      Yes 
PT        Type        Employee        EMP       No
PT        Type        Front-Office    FRONT     No

Would result in this dataframe:

ID  Country Contract    Type       Errors
1   CZ      PE          BOFF       ['Type']
1   ES      Fixed-term  .          ['Contract','Type']
2   CZ      CT          FOFF    
3   PT      Permanent       
4   PT      2022-01-01  FRONT      ['Type']
4   PT      Fixed-term  Office     ['Type']
4   ES      Employee               ['Contract','Type']
5   SK      PER-01      Employee

Thank you so much for the support!

Solution

With the dataframes you provided:

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 3, 4, 4, 4, 5],
        "Country": ["CZ", "ES", "CZ", "PT", "PT", "PT", "ES", "SK"],
        "Contract": [
            "Permanent",
            "Fixed-term",
            "Contractor",
            "Permanent",
            "2022-01-01",
            "Fixed-term",
            "",
            "Permanent",
        ],
        "Type": [
            "BOFF",
            ".",
            "Front-Office",
            "",
            "Employee",
            "Office",
            "Employee",
            "Employee",
        ],
    }
)

other = pd.DataFrame(
    {
        "Country": ["CZ", "CZ", "CZ", "ES", "SK", "SK", "ES", "CZ", "CZ", "PT", "PT"],
        "Field": [
            "Contract",
            "Contract",
            "Contract",
            "Contract",
            "Contract",
            "Contract",
            "Type",
            "Type",
            "Type",
            "Type",
            "Type",
        ],
        "Values": [
            "Permanent",
            "Fixed-term",
            "Contractor",
            "Permanent",
            "Permanent",
            "Fixed-term",
            "Office",
            "Back-Office",
            "Front-Office",
            "Employee",
            "Front-Office",
        ],
        "Code": [
            "PE",
            "FX",
            "CT",
            "PERMA",
            "PER-01",
            "FIX-01",
            "OFF",
            "BOFF",
            "FOFF",
            "EMP",
            "FRONT",
        ],
        "Mandadory": [
            "Yes",
            "Yes",
            "Yes",
            "No",
            "Yes",
            "Yes",
            "Yes",
            "Yes",
            "Yes",
            "No",
            "No",
        ],
    }
)

Here is one way to do it with Pandas merge and apply:

# Merge dataframes on relevant columns with filter on "Contract"
new_df = pd.merge(
    left=df,
    right=other.loc[other["Field"] == "Contract", ["Country", "Values", "Code"]],
    how="left",
    left_on=["Country", "Contract"],
    right_on=["Country", "Values"],
)

# Update "Contract" column
new_df["Contract"] = new_df.apply(
    lambda x: x["Code"] if not pd.isna(x["Code"]) else x["Contract"], axis=1
)

# Add new "Errors" column
new_df["Errors"] = new_df.apply(
    lambda x: [] if not pd.isna(x["Code"]) else ["Contract"], axis=1
)
new_df = new_df.drop(columns=["Values", "Code"])

# Merge dataframes on relevant columns with filter on "Type"
new_df = pd.merge(
    left=new_df,
    right=other.loc[other["Field"] == "Type", ["Country", "Values", "Code"]],
    how="left",
    left_on=["Country", "Type"],
    right_on=["Country", "Values"],
)

# Update "Type" column
new_df["Type"] = new_df.apply(
    lambda x: x["Code"] if not pd.isna(x["Code"]) else x["Type"], axis=1
)

# Update "Errors" column
new_df["Errors"] = new_df.apply(
    lambda x: x["Errors"] if not pd.isna(x["Code"]) else x["Errors"] + ["Type"], axis=1
)
new_df = new_df.drop(columns=["Values", "Code"])

Then:

print(new_df)
# Output

   ID Country    Contract      Type            Errors
0   1      CZ          PE      BOFF            [Type]
1   1      ES  Fixed-term         .  [Contract, Type]
2   2      CZ          CT      FOFF                []
3   3      PT   Permanent            [Contract, Type]
4   4      PT  2022-01-01       EMP        [Contract]
5   4      PT  Fixed-term    Office  [Contract, Type]
6   4      ES              Employee  [Contract, Type]
7   5      SK      PER-01  Employee            [Type]