pythonpandasjoinouter-join

How to Populate Null Values in Columns After Outer Join in Python Pandas


My goal is to join two dataframes from different sources in Python using Pandas and then fill null values in columns with corresponding values in the same column.

The dataframes have similar columns, but some text/object columns may have different values due to variations in the data sources. For instance, the "Name" column in one dataframe might contain "Nick M." while in the other it's "Nick Maison". However, certain columns such as "Date" (formatted as YYYY-MM-DD), "Order ID" (numeric), and "Employee ID" (numeric) have consistent values across both dataframes (we join dataframes based on them). Worth mentioning, some columns may not even exist in one or another dataframe, but should also be filled.

import pandas as pd

# Create DataFrame df1

df1_data = {

'Date (df1)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19"],
'Order Id (df1)': [1, 2, 3, 4, 5, 1, 2],
'Employee Id (df1)': [825, 825, 825, 825, 825, 825, 825],
'Name (df1)': ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.'],
'Region (df1)': ['SD', 'SD', 'SD', 'SD', 'SD', 'SD', 'SD'],
'Value (df1)': [25, 37, 18, 24, 56, 77, 25]

}

df1 = pd.DataFrame(df1_data)

# Create DataFrame df2

df2_data = {

'Date (df2)': ['2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19", "2024-03-19", "2024-03-19"],
'Order Id (df2)': [1, 2, 3, 1, 2, 3, 4],
'Employee Id (df2)': [825, 825, 825, 825, 825, 825, 825],  
'Name (df2)': ['Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason'],  
'Region (df2)': ['San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego'],  
'Value (df2)': [25, 37, 19, 22, 17, 9, 76]  

}

df2 = pd.DataFrame(df2_data)

# Combine DataFrames

outer_joined_df = pd.merge(

                            df1,
                            df2,
                            how = 'outer',
                            left_on = ['Date (df1)', 'Employee Id (df1)', "Order Id (df1)"],
                            right_on = ['Date (df2)', 'Employee Id (df2)', "Order Id (df2)"]

                        )

# Display the result

outer_joined_df

Here is the output of joined dataframes. Null values colored in yellow should be filled.

enter image description here

I tried below code and it works for Date, Order Id and Employee Id columns as expected (because they are the same across two dataframes and we join based on them), but not for other, because they may have different values. Basically, the logic in this code is if Null, then fill with values from the same row in specified column. However, since values may be different, filled column becomes messy, because it has multiple variations of the same value.

outer_joined_df['Date (df1)'] = outer_joined_df['Date (df1)'].combine_first(outer_joined_df['Date (df2)'])
outer_joined_df['Date (df2)'] = outer_joined_df['Date (df2)'].combine_first(outer_joined_df['Date (df1)'])

outer_joined_df['Order Id (df1)'] = outer_joined_df['Order Id (df1)'].combine_first(outer_joined_df['Order Id (df2)'])
outer_joined_df['Order Id (df2)'] = outer_joined_df['Order Id (df2)'].combine_first(outer_joined_df['Order Id (df1)'])

outer_joined_df['Employee Id (df1)'] = outer_joined_df['Employee Id (df1)'].combine_first(outer_joined_df['Employee Id (df2)'])
outer_joined_df['Employee Id (df2)'] = outer_joined_df['Employee Id (df2)'].combine_first(outer_joined_df['Employee Id (df1)'])

outer_joined_df['Name (df1)'] = outer_joined_df['Name (df1)'].combine_first(outer_joined_df['Name (df2)'])
outer_joined_df['Name (df2)'] = outer_joined_df['Name (df2)'].combine_first(outer_joined_df['Name (df1)'])

outer_joined_df['Region (df1)'] = outer_joined_df['Region (df1)'].combine_first(outer_joined_df['Region (df2)'])
outer_joined_df['Region (df2)'] = outer_joined_df['Region (df2)'].combine_first(outer_joined_df['Region (df1)'])

Here is the output:

enter image description here

As you can see, it populated the data, but not the way I want.

Output I need:

enter image description here


Solution

  • # a list with all column names, minus `(dfx)`
    columns = ["Date", "Order Id", "Employee Id", "Name", "Region", "Value"]
    
    # create a dict with a relation between values in df1 and df2, both ways
    value_relations = {}
    for col in columns:
        relations = (
            outer_joined_df[[f"{col} (df1)", f"{col} (df2)"]]
            .drop_duplicates()
            .dropna()
            .to_dict("tight")
            .get("data")
        )
        value_relations[col] = {k: v for k, v in relations}
        value_relations[col].update({v: k for k, v in relations})
    
        # fill values of df1 with the related value of df2
        outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
            outer_joined_df[f"{col} (df2)"].map(value_relations[col])
        )
        # fill values of df2 with the related value of df1
        outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
            outer_joined_df[f"{col} (df1)"].map(value_relations[col])
        )
    
       Date (df1)  Order Id (df1)  Employee Id (df1) Name (df1) Region (df1)  ...  Order Id (df2) Employee Id (df2)  Name (df2)  Region (df2) Value (df2)
    0  2024-03-18             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        25.0
    1  2024-03-18             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        37.0
    2  2024-03-18             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego        19.0
    3  2024-03-18             4.0              825.0    Nick M.           SD  ...             NaN             825.0  Nick Mason     San Diego         NaN
    4  2024-03-18             5.0              825.0    Nick M.           SD  ...             NaN             825.0  Nick Mason     San Diego         NaN
    5  2024-03-19             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        22.0
    6  2024-03-19             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        17.0
    7  2024-03-19             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego         9.0
    8  2024-03-19             NaN              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        76.0
    

    If you want to fill the remaining null values, add this at the end of each loop:

        # fill remaining null values of df1
        outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
            outer_joined_df[f"{col} (df2)"]
        )
        # fill remaining null values of df2
        outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
            outer_joined_df[f"{col} (df1)"]
        )
    
       Date (df1)  Order Id (df1)  Employee Id (df1) Name (df1) Region (df1)  ...  Order Id (df2) Employee Id (df2)  Name (df2)  Region (df2) Value (df2)
    0  2024-03-18             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        25.0
    1  2024-03-18             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        37.0
    2  2024-03-18             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego        19.0
    3  2024-03-18             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        24.0
    4  2024-03-18             5.0              825.0    Nick M.           SD  ...             5.0             825.0  Nick Mason     San Diego        56.0
    5  2024-03-19             1.0              825.0    Nick M.           SD  ...             1.0             825.0  Nick Mason     San Diego        22.0
    6  2024-03-19             2.0              825.0    Nick M.           SD  ...             2.0             825.0  Nick Mason     San Diego        17.0
    7  2024-03-19             3.0              825.0    Nick M.           SD  ...             3.0             825.0  Nick Mason     San Diego         9.0
    8  2024-03-19             4.0              825.0    Nick M.           SD  ...             4.0             825.0  Nick Mason     San Diego        76.0