pythonpandas

Pandas - Smarter way to divide some columns of a dataframe by another dataframe


I have a dataframe df1 that has a year column and numeric columns. The values of year in df1 are not unique. I would like to normalize the numeric columns of df1 at a yearly level using a specific function. For this purpose, I have another dataframe df2 that has a year column, and with the same numeric columns as in df1. To do the renormalization, I want to divide all numeric columns in a given year with df2 for that year.

Here's sample code that achieves what I want but I'm sure there's a "smart" way of doing the same procedure.

import pandas as pd

df1 = pd.DataFrame({
    'year': [1, 1, 0, 0],
    'B': [4, 2, 1, 5],
    'C': [5, 3, 2, 6],
    'D': ["Good 1", "Good 2", "Good 1", "Good 2"]
})
df1
   year  B  C       D
0     1  4  5  Good 1
1     1  2  3  Good 2
2     0  1  2  Good 1
3     0  5  6  Good 2
df2 = pd.DataFrame({'year': [1, 0], 'B': [3, 5], 'C': [4, 7]})
df2
   year  B  C
0     1  3  4
1     0  5  7
vars_to_replace = ['B', 'C']

This works:

for year in df1.year.unique():
    df1.loc[df1.year == year, vars_to_replace] /= df2.loc[df2.year == year, vars_to_replace].values

df1
   year         B         C       D
0     1  1.333333  1.250000  Good 1
1     1  0.666667  0.750000  Good 2
2     0  0.200000  0.285714  Good 1
3     0  1.000000  0.857143  Good 2

The code achieves what I want but I'm not too happy with having to use for-loops. I've tried other alternatives but can't make it work. Any ideas? I've tried the following approach but it doesn't work because df1 isn't unique in year.

df1.set_index("year", inplace=True)
df2.set_index("year", inplace=True)
df1.loc[:, vars_to_replace] /= df2

The last command returns a "ValueError: cannot reindex on an axis with duplicate labels"

NOTE: after doing research on the questions and answers concerning this topic on SO, I was unable to find a case that covers a multicolumn operation where the indexes (year in df1 in my example) are not unique.


Solution

  • You were really close in your attempt. Just don't set df1's index to year (since it has duplicates). You could consider two alternatives:

    Use year plus something else

    e.g. keep the original index with append=True

    df1.set_index("year", inplace=True, append=True)
    df2.set_index("year", inplace=True)
    df1[vars_to_replace] /= df2
    df1
    
                   B         C       D
      year                            
    0 1     1.333333  1.250000  Good 1
    1 1     0.666667  0.750000  Good 2
    2 0     0.200000  0.285714  Good 1
    3 0     1.000000  0.857143  Good 2
    

    Avoid changing df1's index altogether

    df2.set_index("year", inplace=True)
    df1[vars_to_replace] /= df2.reindex(df1['year']).reset_index(drop=True)
    df1
    
       year         B         C       D
    0     1  1.333333  1.250000  Good 1
    1     1  0.666667  0.750000  Good 2
    2     0  0.200000  0.285714  Good 1
    3     0  1.000000  0.857143  Good 2
    

    Note: I'm using .reset_index(drop=True) since df1's index is the default RangeIndex. If df1 had a non-default index, then the correct option would be .set_axis(df1.index); however if df1.index has duplicates, you'll need to find another way.