Setup:
# create a MultiIndex
dfx = pd.MultiIndex.from_product([
list('ab'),
list('cd'),
list('xyz'),
], names=['idx1', 'idx2', 'idx3'])
# create a dataframe that fits the index
df = pd.DataFrame([None, .9, -.08, -2.11, 1.09, .38, None, None, -.37, -.86, 1.51, -.49], columns=['random_data'])
df.set_index(dfx, inplace=True)
Output:
random_data
idx1 idx2 idx3
a c x NaN
y 0.90
z -0.08
d x -2.11
y 1.09
z 0.38
b c x NaN
y NaN
z -0.37
d x -0.86
y 1.51
z -0.49
Within this index hierarchy, I am trying to accomplish the following:
[idx1, idx2, idx3]
, fill NaN
with the group mean of [idx1, idx2
][idx1, idx2, idx3]
, fill NaN
with the group mean of [idx1]
I have tried df.apply(lambda col: col.fillna(col.groupby(by='idx1').mean()))
as a way to solve #2, but I haven't been able to get it to work.
UPDATE
OK, so I have this solved in parts, but still at a loss about how to apply these conditionally:
For case #1:
df.unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack()
.
I verified that the correct value was filled by looking at this:
df.groupby(by=['idx1', 'idx2']).mean()
,
but it also replaces the missing values that I am trying to handle differently in case #2.
Similarly for #2:
df.unstack().unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack().stack()
verified the values replaced were correct by looking at
df.groupby(by=['idx1']).mean()
but it also applies to case #1, which I don't want.
OK, solved it.
First, I made a dataframe containing counts by group of non-missing values:
truth_table = df.apply(lambda row: row.count(), axis = 1).groupby(by=['idx1', 'idx2']).sum()
>> truth_table
idx1 idx2
a c 2
d 3
b c 1
d 3
dtype: int64
Then set up a dataframe (one for each case I'm trying to resolve) containing the group means:
means_ab = x.groupby(by=['idx1']).mean()
>> means_ab
idx1
a 0.0360
b -0.0525
means_abcd = x.groupby(by=['idx1', 'idx2']).mean()
>> means_abcd
idx1 idx2
a c 0.410000
d -0.213333
b c -0.370000
d 0.053333
Given the structure of my data, I know:
Case #1 is analogous to truth_table
having exactly one missing value in a given index grouping of [idx1, idx2]
(e.g., these are the NaN
values I want to replace with values from means_abcd
)
Case #2 is analogous to truth_table
having more than one missing value in a given index grouping of [idx1, idx2]
(e.g., these are the NaN
values I want to replace with values from means_ab
fix_case_2 = df.combine_first(df[truth_table > 1].fillna(means_ab, axis=1))
>> fix_case_2
idx1 idx2 idx3
a c x NaN
y 0.9000
z -0.0800
d x -2.1100
y 1.0900
z 0.3800
b c x -0.0525 *
y -0.0525 *
z -0.3700
d x -0.8600
y 1.5100
z -0.4900
df = fix_case_2.combine_first(df[truth_table == 1].fillna(means_abcd, axis=1))
>> df
idx1 idx2 idx3
a c x 0.4100 *
y 0.9000
z -0.0800
d x -2.1100
y 1.0900
z 0.3800
b c x -0.0525 *
y -0.0525 *
z -0.3700
d x -0.8600
y 1.5100
z -0.4900