pandasmulti-indexfillna

In a pandas dataframe with a MultiIndex, how to conditionally fill missing values with group means?


Setup:

# create a MultiIndex
dfx = pd.MultiIndex.from_product([
    list('ab'),
    list('cd'),
    list('xyz'),
], names=['idx1', 'idx2', 'idx3'])

# create a dataframe that fits the index
df = pd.DataFrame([None, .9, -.08, -2.11, 1.09, .38, None, None, -.37, -.86, 1.51, -.49], columns=['random_data'])
df.set_index(dfx, inplace=True)

Output:

                        random_data
idx1    idx2    idx3    
a       c       x       NaN
                y       0.90
                z      -0.08

        d       x      -2.11
                y       1.09
                z       0.38

b       c       x       NaN
                y       NaN
                z      -0.37

        d       x      -0.86
                y       1.51
                z      -0.49

Within this index hierarchy, I am trying to accomplish the following:

  1. When a value is missing within [idx1, idx2, idx3], fill NaN with the group mean of [idx1, idx2]
  2. When multiple values are missing within [idx1, idx2, idx3], fill NaN with the group mean of [idx1]

I have tried df.apply(lambda col: col.fillna(col.groupby(by='idx1').mean())) as a way to solve #2, but I haven't been able to get it to work.

UPDATE

OK, so I have this solved in parts, but still at a loss about how to apply these conditionally:

For case #1:

df.unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack().

I verified that the correct value was filled by looking at this:

df.groupby(by=['idx1', 'idx2']).mean(),

but it also replaces the missing values that I am trying to handle differently in case #2.

Similarly for #2:

df.unstack().unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack().stack()

verified the values replaced were correct by looking at

df.groupby(by=['idx1']).mean()

but it also applies to case #1, which I don't want.


Solution

  • OK, solved it.

    First, I made a dataframe containing counts by group of non-missing values:

    truth_table = df.apply(lambda row: row.count(), axis = 1).groupby(by=['idx1', 'idx2']).sum()
    
    >> truth_table
    
    idx1  idx2
    a     c       2
          d       3
    b     c       1
          d       3
    dtype: int64
    

    Then set up a dataframe (one for each case I'm trying to resolve) containing the group means:

    means_ab = x.groupby(by=['idx1']).mean()
    
    >> means_ab
    
    idx1    
    a    0.0360
    b   -0.0525
    
    means_abcd = x.groupby(by=['idx1', 'idx2']).mean()
    
    >> means_abcd
    
    idx1    idx2    
    a       c    0.410000
    d           -0.213333
    b       c   -0.370000
    d            0.053333
    

    Given the structure of my data, I know:

    fix_case_2 = df.combine_first(df[truth_table > 1].fillna(means_ab, axis=1))
    
    >> fix_case_2
    
    idx1    idx2    idx3    
    a       c       x     NaN
                    y     0.9000
                    z    -0.0800
            d       x    -2.1100
                    y     1.0900
                    z     0.3800
    b       c       x    -0.0525 *
                    y    -0.0525 *
                    z    -0.3700
            d       x    -0.8600
                    y     1.5100
                    z    -0.4900
    
    df = fix_case_2.combine_first(df[truth_table == 1].fillna(means_abcd, axis=1))
    
    >> df
    
    idx1    idx2    idx3    
    a       c       x     0.4100 *
                    y     0.9000
                    z    -0.0800
            d       x    -2.1100
                    y     1.0900
                    z     0.3800
    b       c       x    -0.0525 *
                    y    -0.0525 *
                    z    -0.3700
            d       x    -0.8600
                    y     1.5100
                    z    -0.4900