I'm trying to merge two dataframes in Pandas based on a common column, ID. After merging, I want to update a column, Target Value, in my original dataframe (df) using a column from the second dataframe (reference). However, not all rows are being updated as expected, even though the ID values are exact match. For example, it would update only 4 out of 10 rows I have in my dataframe.
This is my original dataframe (df):
ID Other Column Target Value
0 ID00123 Value A NaN
1 ID00456 Value B NaN
2 ID00789 Value C NaN
3 ID01012 Value D NaN
4 ID01345 Value E NaN
This is my reference dataframe (reference):
ID Reference Value
0 ID00123 RefVal1
1 ID00456 RefVal2
2 ID00789 RefVal3
3 ID01012 RefVal4
4 ID01345 RefVal5
This is my last attempt of merging two dataframes and updating the original dataframe:
# Merge dataframes on 'ID'
merged = df.merge(reference, on='ID', how='left', indicator=True)
# Create a mask for rows to update
mask = (
(merged['_merge'] == 'both') & # Exists in both DataFrames
(df['Target Value'].isna()) # 'Target Value' is NaN
)
# Update 'Target Value' in df
df.loc[mask, 'Target Value'] = merged.loc[mask, 'Reference Value']
This is how merged dataframe looks like:
ID Other Column Target Value Reference Value _merge
0 ID00123 Value A NaN RefVal1 both
1 ID00456 Value B NaN RefVal2 both
2 ID00789 Value C NaN RefVal3 both
3 ID01012 Value D NaN RefVal4 both
4 ID01345 Value E NaN RefVal5 both
And this is how my original dataframe (df) looks like after 'updating' it:
ID Other Column Target Value
0 ID00123 Value A RefVal1
1 ID00456 Value B RefVal2
2 ID00789 Value C NaN # This row remains unchanged
3 ID01012 Value D NaN # This row remains unchanged
4 ID01345 Value E NaN # This row remains unchanged
I have tried inner join as well, still only 4 out of 10 Target Values would be updated.
I have checked datatypes of columns, both were Object.
Stripped leading/trailing whitespace and converted all values to lowercase:
df['ID'] = df['ID'].str.strip().str.lower()
reference['ID'] = reference['ID'].str.strip().str.lower()
Checked for unmatched values:
unmatched_df = set(df['ID']).difference(reference['ID'])
unmatched_reference = set(reference['ID']).difference(df['ID'])
print(unmatched_df, unmatched_reference) # Both sets are empty
Furthermore, there are no duplicate rows: there are literally only 10 rows in the entire dataframe.
This is not the first time I encountered this issue, but I have managed to work around it. The same issues. It matches everything - merge table gives me all matched rows, however, it updates only part of the matched rows. (For some reason, once it was not working on the deduplicated table, but it worked fine on the table with duplicated - and I am talking about table with over 50k records).
Thank you for help!
The index is not preserved during a merge
. Thus if your original df
doesn't have a RangeIndex, assigning values from merged_df
won't align the data properly.
You could reset_index
before the merge
, and set_index
before assignment:
# Merge dataframes on 'ID'
merged = df.reset_index().merge(reference, on='ID', how='left', indicator=True)
# Create a mask for rows to update
mask = (
(merged['_merge'] == 'both') & # Exists in both DataFrames
(df['Target Value'].isna()) # 'Target Value' is NaN
).to_numpy()
# Update 'Target Value' in df
df.loc[mask, 'Target Value'] = merged.set_index('index').loc[mask, 'Reference Value']
Example output:
ID Other Column Target Value
0 ID00123 Value A RefVal1
1 ID00456 Value B RefVal2
2 ID00789 Value C RefVal3
3 ID01012 Value D RefVal4
4 ID01345 Value E RefVal5