I have data on how many times people are sick in two locations (A and B) on certain dates (the index). I need to divide each value by the population of that location (column) in that year (from the index), which is looked up in a separate dictionary.
Example DataFrame:
import pandas as pd
data = [{'A': 1, 'B': 3}, {'A': 2, 'B': 20}, {'A': "Unk", 'B': 50}]
df = pd.DataFrame(data, index=[pd.to_datetime("2019-12-31"),
                               pd.to_datetime("2020-12-30"),
                               pd.to_datetime("2020-12-31")])
Out:
              A   B
2019-12-31    1   3
2020-12-30    2  20
2020-12-31  Unk  50
Population dictionary (location_year):
dic = {"A_2019": 100, "B_2019": 200, "A_2020": 120, "B_2020": 150}
While it's not necessary to have the output in the same df, the output I'm trying to achieve would be:
              A   B     A1     B1
2019-12-31    1   3   0.01  0.015
2020-12-30    2  20  0.017  0.133
2020-12-31  Unk  50    nan  0.333
I've tried lots of different approaches, but almost always get an unhashable type error.
for col in df.columns:
    df[col + "1"] = df[col]/dic[col + "_" + df.index.strftime("%Y")]
Out: `TypeError: unhashable type: 'Index'`
I guess I don't understand how pandas passes the df.index values to the dictionary. Can this be fixed, or is another approach necessary?
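As for why the error appears: `col + "_" + df.index.strftime("%Y")` is evaluated element-wise, so it yields a whole Index of key strings rather than a single string, and a dict cannot hash an Index. Below is a minimal sketch that reproduces this in isolation and then looks the keys up one by one with `Index.map` (the names `idx`, `keys` and `pops` are only illustrative, not from the question):

```python
import pandas as pd

dic = {"A_2019": 100, "B_2019": 200, "A_2020": 120, "B_2020": 150}
idx = pd.to_datetime(["2019-12-31", "2020-12-30", "2020-12-31"])

# String concatenation with an Index is element-wise: one key per row
keys = "A_" + idx.strftime("%Y")
print(list(keys))

# Using the whole Index as a single dict key fails, because it is not hashable
try:
    dic[keys]
except TypeError as exc:
    print("TypeError:", exc)

# Index.map accepts a dict and looks up each element separately
pops = keys.map(dic)
print(list(pops))
```

So the dictionary never sees individual years; it receives the entire Index at once.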
You can create a Series from your dictionary, then `unstack` to a DataFrame, `reindex`/`set_axis`, perform your operation and `join` with `add_suffix`:
def split(k):
    x, y = k.split('_')
    return (int(y), x)
# ensure using NaNs for missing values, not strings
df = df.replace('Unk', pd.NA).convert_dtypes()
# reshape to match the original DataFrame
tmp = (pd.Series({split(k): v for k, v in dic.items()})
         .unstack()
         .reindex(df.index.year)  # match years in df.index
         .set_axis(df.index)      # restore full dates
      )
# divide, add_suffix, join
out = df.join(df.div(tmp).add_suffix('1'))
# or
# out = df.join(tmp.rdiv(df), rsuffix='1')
Output:
               A   B        A1        B1
2019-12-31     1   3      0.01     0.015
2020-12-30     2  20  0.016667  0.133333
2020-12-31  <NA>  50      <NA>  0.333333
Intermediate `tmp`:
              A    B
2019-12-31  100  200
2020-12-30  120  150
2020-12-31  120  150
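To see how `tmp` comes together, here is a small sketch of the first two steps on their own, assuming the same `dic` as in the question (the names `s` and `wide` are only for illustration). The tuple keys returned by `split` become a two-level MultiIndex, and `unstack` moves the inner level (the location) into the columns:

```python
import pandas as pd

dic = {"A_2019": 100, "B_2019": 200, "A_2020": 120, "B_2020": 150}

def split(k):
    # "A_2019" -> (2019, "A"): year first, so it ends up as the row label
    loc, year = k.split('_')
    return (int(year), loc)

# Tuple keys turn into a MultiIndex with levels (year, location)
s = pd.Series({split(k): v for k, v in dic.items()})
print(s)

# unstack pivots the last level (location) into columns, one row per year
wide = s.unstack()
print(wide)
```

From there, `reindex(df.index.year)` repeats the row of each year once per date, and `set_axis(df.index)` swaps the year labels for the full dates so that the division aligns row by row.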
Alternatively, create a `tmp` with only the years, and use an intermediate `rename` step to perform the alignment:
def split(k):
    x, y = k.split('_')
    return (int(y), x)
df = df.replace('Unk', pd.NA).convert_dtypes()
tmp = (pd.Series({split(k): v for k, v in dic.items()})
         .unstack()
      )
out = df.join(df.rename(lambda x: x.year).div(tmp)
                .add_suffix('1').set_axis(df.index)
             )
Intermediate `tmp`:
        A    B
2019  100  200
2020  120  150
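If you'd rather stay close to the loop from the question, a per-column lookup with `Index.map` should also work. This is only a sketch, under the assumption that every non-numeric entry (such as "Unk") should become NaN; `years`, `values` and `pop` are illustrative names:

```python
import pandas as pd

data = [{'A': 1, 'B': 3}, {'A': 2, 'B': 20}, {'A': "Unk", 'B': 50}]
df = pd.DataFrame(data, index=pd.to_datetime(["2019-12-31", "2020-12-30", "2020-12-31"]))
dic = {"A_2019": 100, "B_2019": 200, "A_2020": 120, "B_2020": 150}

years = df.index.strftime("%Y")  # one year string per row
out = df.copy()
for col in df.columns:
    # coerce non-numeric entries such as "Unk" to NaN (assumption, see above)
    values = pd.to_numeric(df[col], errors="coerce")
    # build one key per row and look each up in the dict; missing keys give NaN
    pop = (col + "_" + years).map(dic).to_numpy()
    out[col + "1"] = values / pop

print(out)
```

This keeps the original columns untouched (including the "Unk" string) and is easy to read, though for many columns the reshaping approach avoids the Python-level loop.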