I have a dataframe df1
that has a year column and numeric columns. The values of year in df1
are not unique. I would like to normalize the numeric columns of df1
at a yearly level using a specific function. For this purpose, I have another dataframe df2
that has a year column, and with the same numeric columns as in df1
. To do the renormalization, I want to divide all numeric columns in a given year with df2
for that year.
Here's sample code that achieves what I want but I'm sure there's a "smart" way of doing the same procedure.
import pandas as pd
df1 = pd.DataFrame({
'year': [1, 1, 0, 0],
'B': [4, 2, 1, 5],
'C': [5, 3, 2, 6],
'D': ["Good 1", "Good 2", "Good 1", "Good 2"]
})
df1
year B C D
0 1 4 5 Good 1
1 1 2 3 Good 2
2 0 1 2 Good 1
3 0 5 6 Good 2
df2 = pd.DataFrame({'year': [1, 0], 'B': [3, 5], 'C': [4, 7]})
df2
year B C
0 1 3 4
1 0 5 7
vars_to_replace = ['B', 'C']
This works:
for year in df1.year.unique():
df1.loc[df1.year == year, vars_to_replace] /= df2.loc[df2.year == year, vars_to_replace].values
df1
year B C D
0 1 1.333333 1.250000 Good 1
1 1 0.666667 0.750000 Good 2
2 0 0.200000 0.285714 Good 1
3 0 1.000000 0.857143 Good 2
The code achieves what I want but I'm not too happy with having to use for-loops. I've tried other alternatives but can't make it work. Any ideas? I've tried the following approach but it doesn't work because df1
isn't unique in year.
df1.set_index("year", inplace=True)
df2.set_index("year", inplace=True)
df1.loc[:, vars_to_replace] /= df2
The last command returns a "ValueError: cannot reindex on an axis with duplicate labels"
NOTE: after doing research on the questions and answers concerning this topic on SO, I was unable to find a case that covers a multicolumn operation where the indexes (year in df1
in my example) are not unique.
You were really close in your attempt. Just don't set df1
's index to year
(since it has duplicates). You could consider two alternatives:
year
plus something elsee.g. keep the original index with append=True
df1.set_index("year", inplace=True, append=True)
df2.set_index("year", inplace=True)
df1[vars_to_replace] /= df2
df1
B C D
year
0 1 1.333333 1.250000 Good 1
1 1 0.666667 0.750000 Good 2
2 0 0.200000 0.285714 Good 1
3 0 1.000000 0.857143 Good 2
df1
's index altogetherdf2.set_index("year", inplace=True)
df1[vars_to_replace] /= df2.reindex(df1['year']).reset_index(drop=True)
df1
year B C D
0 1 1.333333 1.250000 Good 1
1 1 0.666667 0.750000 Good 2
2 0 0.200000 0.285714 Good 1
3 0 1.000000 0.857143 Good 2
Note: I'm using .reset_index(drop=True)
since df1
's index is the default RangeIndex
. If df1
had a non-default index, then the correct option would be .set_axis(df1.index)
; however if df1.index
has duplicates, you'll need to find another way.