I'm trying to merge two pandas DataFrames on multiple columns. It is a many-to-one relationship. There are many of the same values in df1
but only value in df2
.
These are the example DataFrames :
df1 = pd.DataFrame({'date':['2025-03-01','2025-03-01', '2025-03-02'],'period':[1,1,2], 'company':['aa','aa','b']})
df2 = pd.DataFrame({'date':['2025-03-01', '2025-03-02'],'period':[1,2],'company':['aa','b'], 'value':[4,8]})
The result I want is this (where the value '4' is only joined once on the first match and subsequent rows are filled with empty space or null):
date period company value
0 2025-03-01 1 aa 4
1 2025-03-01 1 aa null
2 2025-03-02 2 b 8
I'm trying a left join - but I get the incorrect output:
date period company value
0 2025-03-01 1 aa 4
1 2025-03-01 1 aa 4
2 2025-03-02 2 b 8
merged = df1.merge(df2, on=['date','period','company'], how='left')
merged.loc[merged.duplicated(subset=['date','period','company']), 'value'] = np.nan
This worked for me, and it would work cleanly with large data sets as well. I replicated the code in google colab and got the following result.