pythonpandasdataframegroup-bylines-of-code

Search for LOV columns in dataframe and replace with codes using other dataframe


Imagine I have a dirty dataframe of employees with their ID, and Contract related information per country.

Some columns of this dataframe are LOV columns (depending on the country, some columns are LOV for just one country, others some or all of them) and some LOV columns are mandatory and some are not (that is just used to understand if a blank value is accepted or not).

We would need to check, using another mapping dataframe:

If the value provided is not on the list, create a new column on the main dataframe named "Errors" where it says the name of the column it errored (if more than 1 column, maybe save the name in a list on that column).

So from this dataframe:

ID  Country Contract      Type
1   CZ      Permanent     BOFF
1   ES      Fixed-term    .
2   CZ      Contractor    Front-Office
3   PT      Permanent   
4   PT      2022-01-01    Employee
4   PT      Fixed-term    Office
4   ES                    Employee
5   SK      Permanent     Employee

And using this mapping:

Country   Field       Values          Code      Mandadory
CZ        Contract    Permanent       PE        Yes
CZ        Contract    Fixed-term      FX        Yes
CZ        Contract    Contractor      CT        Yes
ES        Contract    Permanent       PERMA     No
SK        Contract    Permanent       PER-01    Yes
SK        Contract    Fixed-term      FIX-01    Yes
ES        Type        Office          OFF       Yes
CZ        Type        Back-Office     BOFF      Yes 
CZ        Type        Front-Office    FOFF      Yes 
PT        Type        Employee        EMP       No
PT        Type        Front-Office    FRONT     No

Would result in this dataframe:

ID  Country Contract    Type       Errors
1   CZ      PE          BOFF       ['Type']
1   ES      Fixed-term  .          ['Contract','Type']
2   CZ      CT          FOFF    
3   PT      Permanent       
4   PT      2022-01-01  FRONT      ['Type']
4   PT      Fixed-term  Office     ['Type']
4   ES      Employee               ['Contract','Type']
5   SK      PER-01      Employee    

Thank you so much for the support!


Solution

  • With the dataframes you provided:

    import pandas as pd
    
    df = pd.DataFrame(
        {
            "ID": [1, 1, 2, 3, 4, 4, 4, 5],
            "Country": ["CZ", "ES", "CZ", "PT", "PT", "PT", "ES", "SK"],
            "Contract": [
                "Permanent",
                "Fixed-term",
                "Contractor",
                "Permanent",
                "2022-01-01",
                "Fixed-term",
                "",
                "Permanent",
            ],
            "Type": [
                "BOFF",
                ".",
                "Front-Office",
                "",
                "Employee",
                "Office",
                "Employee",
                "Employee",
            ],
        }
    )
    
    other = pd.DataFrame(
        {
            "Country": ["CZ", "CZ", "CZ", "ES", "SK", "SK", "ES", "CZ", "CZ", "PT", "PT"],
            "Field": [
                "Contract",
                "Contract",
                "Contract",
                "Contract",
                "Contract",
                "Contract",
                "Type",
                "Type",
                "Type",
                "Type",
                "Type",
            ],
            "Values": [
                "Permanent",
                "Fixed-term",
                "Contractor",
                "Permanent",
                "Permanent",
                "Fixed-term",
                "Office",
                "Back-Office",
                "Front-Office",
                "Employee",
                "Front-Office",
            ],
            "Code": [
                "PE",
                "FX",
                "CT",
                "PERMA",
                "PER-01",
                "FIX-01",
                "OFF",
                "BOFF",
                "FOFF",
                "EMP",
                "FRONT",
            ],
            "Mandadory": [
                "Yes",
                "Yes",
                "Yes",
                "No",
                "Yes",
                "Yes",
                "Yes",
                "Yes",
                "Yes",
                "No",
                "No",
            ],
        }
    )
    

    Here is one way to do it with Pandas merge and apply:

    # Merge dataframes on relevant columns with filter on "Contract"
    new_df = pd.merge(
        left=df,
        right=other.loc[other["Field"] == "Contract", ["Country", "Values", "Code"]],
        how="left",
        left_on=["Country", "Contract"],
        right_on=["Country", "Values"],
    )
    
    # Update "Contract" column
    new_df["Contract"] = new_df.apply(
        lambda x: x["Code"] if not pd.isna(x["Code"]) else x["Contract"], axis=1
    )
    
    # Add new "Errors" column
    new_df["Errors"] = new_df.apply(
        lambda x: [] if not pd.isna(x["Code"]) else ["Contract"], axis=1
    )
    new_df = new_df.drop(columns=["Values", "Code"])
    
    # Merge dataframes on relevant columns with filter on "Type"
    new_df = pd.merge(
        left=new_df,
        right=other.loc[other["Field"] == "Type", ["Country", "Values", "Code"]],
        how="left",
        left_on=["Country", "Type"],
        right_on=["Country", "Values"],
    )
    
    # Update "Type" column
    new_df["Type"] = new_df.apply(
        lambda x: x["Code"] if not pd.isna(x["Code"]) else x["Type"], axis=1
    )
    
    # Update "Errors" column
    new_df["Errors"] = new_df.apply(
        lambda x: x["Errors"] if not pd.isna(x["Code"]) else x["Errors"] + ["Type"], axis=1
    )
    new_df = new_df.drop(columns=["Values", "Code"])
    

    Then:

    print(new_df)
    # Output
    
       ID Country    Contract      Type            Errors
    0   1      CZ          PE      BOFF            [Type]
    1   1      ES  Fixed-term         .  [Contract, Type]
    2   2      CZ          CT      FOFF                []
    3   3      PT   Permanent            [Contract, Type]
    4   4      PT  2022-01-01       EMP        [Contract]
    5   4      PT  Fixed-term    Office  [Contract, Type]
    6   4      ES              Employee  [Contract, Type]
    7   5      SK      PER-01  Employee            [Type]