My goal is to join two dataframes from different sources in Python using Pandas and then fill null values in columns with corresponding values in the same column.
The dataframes have similar columns, but some text/object columns may have different values due to variations in the data sources. For instance, the "Name" column in one dataframe might contain "Nick M." while in the other it's "Nick Maison". However, certain columns such as "Date" (formatted as YYYY-MM-DD), "Order ID" (numeric), and "Employee ID" (numeric) have consistent values across both dataframes (we join dataframes based on them). Worth mentioning, some columns may not even exist in one or another dataframe, but should also be filled.
import pandas as pd
# Create DataFrame df1
df1_data = {
'Date (df1)': ['2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19"],
'Order Id (df1)': [1, 2, 3, 4, 5, 1, 2],
'Employee Id (df1)': [825, 825, 825, 825, 825, 825, 825],
'Name (df1)': ['Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.', 'Nick M.'],
'Region (df1)': ['SD', 'SD', 'SD', 'SD', 'SD', 'SD', 'SD'],
'Value (df1)': [25, 37, 18, 24, 56, 77, 25]
}
df1 = pd.DataFrame(df1_data)
# Create DataFrame df2
df2_data = {
'Date (df2)': ['2024-03-18', '2024-03-18', '2024-03-18', "2024-03-19", "2024-03-19", "2024-03-19", "2024-03-19"],
'Order Id (df2)': [1, 2, 3, 1, 2, 3, 4],
'Employee Id (df2)': [825, 825, 825, 825, 825, 825, 825],
'Name (df2)': ['Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason', 'Nick Mason'],
'Region (df2)': ['San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego', 'San Diego'],
'Value (df2)': [25, 37, 19, 22, 17, 9, 76]
}
df2 = pd.DataFrame(df2_data)
# Combine DataFrames
outer_joined_df = pd.merge(
df1,
df2,
how = 'outer',
left_on = ['Date (df1)', 'Employee Id (df1)', "Order Id (df1)"],
right_on = ['Date (df2)', 'Employee Id (df2)', "Order Id (df2)"]
)
# Display the result
outer_joined_df
Here is the output of joined dataframes. Null values colored in yellow should be filled.
I tried below code and it works for Date, Order Id and Employee Id columns as expected (because they are the same across two dataframes and we join based on them), but not for other, because they may have different values. Basically, the logic in this code is if Null, then fill with values from the same row in specified column. However, since values may be different, filled column becomes messy, because it has multiple variations of the same value.
outer_joined_df['Date (df1)'] = outer_joined_df['Date (df1)'].combine_first(outer_joined_df['Date (df2)'])
outer_joined_df['Date (df2)'] = outer_joined_df['Date (df2)'].combine_first(outer_joined_df['Date (df1)'])
outer_joined_df['Order Id (df1)'] = outer_joined_df['Order Id (df1)'].combine_first(outer_joined_df['Order Id (df2)'])
outer_joined_df['Order Id (df2)'] = outer_joined_df['Order Id (df2)'].combine_first(outer_joined_df['Order Id (df1)'])
outer_joined_df['Employee Id (df1)'] = outer_joined_df['Employee Id (df1)'].combine_first(outer_joined_df['Employee Id (df2)'])
outer_joined_df['Employee Id (df2)'] = outer_joined_df['Employee Id (df2)'].combine_first(outer_joined_df['Employee Id (df1)'])
outer_joined_df['Name (df1)'] = outer_joined_df['Name (df1)'].combine_first(outer_joined_df['Name (df2)'])
outer_joined_df['Name (df2)'] = outer_joined_df['Name (df2)'].combine_first(outer_joined_df['Name (df1)'])
outer_joined_df['Region (df1)'] = outer_joined_df['Region (df1)'].combine_first(outer_joined_df['Region (df2)'])
outer_joined_df['Region (df2)'] = outer_joined_df['Region (df2)'].combine_first(outer_joined_df['Region (df1)'])
Here is the output:
As you can see, it populated the data, but not the way I want.
Output I need:
# a list with all column names, minus `(dfx)`
columns = ["Date", "Order Id", "Employee Id", "Name", "Region", "Value"]
# create a dict with a relation between values in df1 and df2, both ways
value_relations = {}
for col in columns:
relations = (
outer_joined_df[[f"{col} (df1)", f"{col} (df2)"]]
.drop_duplicates()
.dropna()
.to_dict("tight")
.get("data")
)
value_relations[col] = {k: v for k, v in relations}
value_relations[col].update({v: k for k, v in relations})
# fill values of df1 with the related value of df2
outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
outer_joined_df[f"{col} (df2)"].map(value_relations[col])
)
# fill values of df2 with the related value of df1
outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
outer_joined_df[f"{col} (df1)"].map(value_relations[col])
)
Date (df1) Order Id (df1) Employee Id (df1) Name (df1) Region (df1) ... Order Id (df2) Employee Id (df2) Name (df2) Region (df2) Value (df2)
0 2024-03-18 1.0 825.0 Nick M. SD ... 1.0 825.0 Nick Mason San Diego 25.0
1 2024-03-18 2.0 825.0 Nick M. SD ... 2.0 825.0 Nick Mason San Diego 37.0
2 2024-03-18 3.0 825.0 Nick M. SD ... 3.0 825.0 Nick Mason San Diego 19.0
3 2024-03-18 4.0 825.0 Nick M. SD ... NaN 825.0 Nick Mason San Diego NaN
4 2024-03-18 5.0 825.0 Nick M. SD ... NaN 825.0 Nick Mason San Diego NaN
5 2024-03-19 1.0 825.0 Nick M. SD ... 1.0 825.0 Nick Mason San Diego 22.0
6 2024-03-19 2.0 825.0 Nick M. SD ... 2.0 825.0 Nick Mason San Diego 17.0
7 2024-03-19 3.0 825.0 Nick M. SD ... 3.0 825.0 Nick Mason San Diego 9.0
8 2024-03-19 NaN 825.0 Nick M. SD ... 4.0 825.0 Nick Mason San Diego 76.0
If you want to fill the remaining null values, add this at the end of each loop:
# fill remaining null values of df1
outer_joined_df[f"{col} (df1)"] = outer_joined_df[f"{col} (df1)"].fillna(
outer_joined_df[f"{col} (df2)"]
)
# fill remaining null values of df2
outer_joined_df[f"{col} (df2)"] = outer_joined_df[f"{col} (df2)"].fillna(
outer_joined_df[f"{col} (df1)"]
)
Date (df1) Order Id (df1) Employee Id (df1) Name (df1) Region (df1) ... Order Id (df2) Employee Id (df2) Name (df2) Region (df2) Value (df2)
0 2024-03-18 1.0 825.0 Nick M. SD ... 1.0 825.0 Nick Mason San Diego 25.0
1 2024-03-18 2.0 825.0 Nick M. SD ... 2.0 825.0 Nick Mason San Diego 37.0
2 2024-03-18 3.0 825.0 Nick M. SD ... 3.0 825.0 Nick Mason San Diego 19.0
3 2024-03-18 4.0 825.0 Nick M. SD ... 4.0 825.0 Nick Mason San Diego 24.0
4 2024-03-18 5.0 825.0 Nick M. SD ... 5.0 825.0 Nick Mason San Diego 56.0
5 2024-03-19 1.0 825.0 Nick M. SD ... 1.0 825.0 Nick Mason San Diego 22.0
6 2024-03-19 2.0 825.0 Nick M. SD ... 2.0 825.0 Nick Mason San Diego 17.0
7 2024-03-19 3.0 825.0 Nick M. SD ... 3.0 825.0 Nick Mason San Diego 9.0
8 2024-03-19 4.0 825.0 Nick M. SD ... 4.0 825.0 Nick Mason San Diego 76.0