python pandas dataframe dictionary

In Pandas, how to reference and use a value from a dictionary based on column AND index values in a dataframe?


I have data on how many times people are sick in certain locations (columns A and B) on certain dates (the index). I need to divide each value by the population of that location (column) in that year (taken from the index), which is stored in a separate dictionary.

Example dataframe:

import pandas as pd
data = [{'A': 1, 'B': 3}, {'A': 2, 'B': 20}, {'A': "Unk", 'B': 50}]
df = pd.DataFrame(data, index=[pd.to_datetime("2019-12-31")
                               , pd.to_datetime("2020-12-30")
                               , pd.to_datetime("2020-12-31")])
Out:
            A   B
2019-12-31  1   3
2020-12-30  2   20
2020-12-31  Unk 50

Population dictionary (location_year):

dic = {"A_2019": 100, "B_2019": 200, "A_2020": 120, "B_2020": 150}

While it's not necessary to have the output in the same df, the output I'm trying to achieve would be:

            A   B    A1     B1
2019-12-31  1   3   0.01    0.015
2020-12-30  2   20  0.017   0.133
2020-12-31  Unk 50  nan     0.333

I've tried lots of different approaches, but almost always get an unhashable type error.

for col in df.columns:
     df[col + "1"] = df[col]/dic[col + "_" + df.index.strftime("%Y")]
Out: `TypeError: unhashable type: 'Index'`

I guess I don't understand how pandas passes the df.index values to the dictionary. Can this be fixed, or is another approach necessary?


Solution

  • You can create a Series from your dictionary (keyed by (year, location) tuples), unstack it to a DataFrame, use reindex/set_axis to match the index of df, perform the division, then add_suffix and join:

    def split(k):
        x, y = k.split('_')
        return (int(y), x)
    
    # ensure using NaNs for missing values, not strings
    df = df.replace('Unk', pd.NA).convert_dtypes()
    
    # reshape to match the original DataFrame
    tmp = (pd.Series({split(k): v for k, v in dic.items()})
             .unstack()
             .reindex(df.index.year)  # match years in df.index
             .set_axis(df.index)      # restore full dates
          )
    
    # divide, add_suffix, join
    out = df.join(df.div(tmp).add_suffix('1'))
    # or
    # out = df.join(tmp.rdiv(df), rsuffix='1')
    

    Output:

                   A   B        A1        B1
    2019-12-31     1   3      0.01     0.015
    2020-12-30     2  20  0.016667  0.133333
    2020-12-31  <NA>  50      <NA>  0.333333
    

    Intermediate tmp:

                  A    B
    2019-12-31  100  200
    2020-12-30  120  150
    2020-12-31  120  150
    

    Variant

    Here we instead build a tmp indexed by year only, and use an intermediate rename step to align df with it:

    def split(k):
        x, y = k.split('_')
        return (int(y), x)
    
    df = df.replace('Unk', pd.NA).convert_dtypes()
    
    tmp = (pd.Series({split(k): v for k, v in dic.items()})
             .unstack()
          )
    
    out = df.join(df.rename(lambda x: x.year).div(tmp)
                    .add_suffix('1').set_axis(df.index)
                 )
    

    Intermediate tmp:

            A    B
    2019  100  200
    2020  120  150
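
On the original error: `dic[col + "_" + df.index.strftime("%Y")]` hands the whole Index of keys to the dictionary as one single key, and an Index is not hashable, hence the TypeError. If you prefer to keep the loop from the question, a minimal sketch of a fix is to look up each key separately with `Index.map`. The names `years`, `keys`, `pop` and `out` are only illustrative, and `pd.to_numeric(..., errors="coerce")` is one assumed way of turning "Unk" into NaN:

```python
import pandas as pd

data = [{'A': 1, 'B': 3}, {'A': 2, 'B': 20}, {'A': "Unk", 'B': 50}]
df = pd.DataFrame(data, index=pd.to_datetime(["2019-12-31",
                                              "2020-12-30",
                                              "2020-12-31"]))
dic = {"A_2019": 100, "B_2019": 200, "A_2020": 120, "B_2020": 150}

# One year string per row of df
years = df.index.strftime("%Y")

out = df.copy()
for col in df.columns:
    # Build one key per row, e.g. "A_2019", "A_2020", "A_2020"
    keys = col + "_" + years
    # Look up every key individually; keys missing from dic become NaN
    pop = keys.map(dic).to_numpy()
    # Coerce non-numeric entries such as "Unk" to NaN before dividing
    values = pd.to_numeric(df[col], errors="coerce")
    out[col + "1"] = values / pop

print(out)
```

This keeps the original columns untouched and should give the same ratios as the reshaping approaches above, at the cost of one dictionary lookup pass per column.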