pythonpandasperformancecoding-efficiency

How to apply a function to multiple multiindex columns in Pandas?


Given a multiindex columns

         a                      ...                              
         E1                      ...        E3                    
         g1        g2        g3  ...        g1        g2        g3
0  0.548814  0.715189  0.602763  ...  0.437587  0.891773  0.963663
1  0.383442  0.791725  0.528895  ...  0.087129  0.020218  0.832620
2  0.778157  0.870012  0.978618  ...  0.118274  0.639921  0.143353
3  0.944669  0.521848  0.414662  ...  0.568434  0.018790  0.617635
4  0.612096  0.616934  0.943748  ...  0.697631  0.060225  0.666767
5  0.670638  0.210383  0.128926  ...  0.438602  0.988374  0.102045
6  0.208877  0.161310  0.653108  ...  0.158970  0.110375  0.656330
7  0.138183  0.196582  0.368725  ...  0.096098  0.976459  0.468651
8  0.976761  0.604846  0.739264  ...  0.296140  0.118728  0.317983
9  0.414263  0.064147  0.692472  ...  0.093941  0.575946  0.929296
[10 rows x 9 columns]

I would like to apply multiple column filtered by the second level (i.e., E1, E2, E3) to a functions (e.g., ration_type1 ,ration_type2, or can be more in actual implementation).

For example. Assume we are to compute the second level of E1 under the function ration_type1 and ration_type2. Then we only process the following df

    a                    
         E1                    
         g1        g2        g3
0  0.548814  0.715189  0.602763
1  0.383442  0.791725  0.528895
  .................
8  0.976761  0.604846  0.739264
9  0.414263  0.064147  0.692472

To generalise all second level, I than rely on the list comprehension as below

for each of the ration_type1 and ration_type2.

all_df1 = [ration_type1(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in [`E1`, `E2`, `E3`]]


all_df2 = [ration_type2(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in [`E1`, `E2`, `E3`]]

before concat it back to the orignal df.

However, I wonder whether there is more elegant and compact way than the list comprehension approach. This is because, in real life implementation, there can be more ration function.

The full code is as below

 import numpy as np

import pandas as pd

np.random.seed(0)

arr = np.random.rand(10,9)

tuples = [('a', 'E1', 'g1'), ('a', 'E1', 'g2'), ('a', 'E1', 'g3'), ('a', 'E2', 'g1'), ('a', 'E2', 'g2'),
          ('a', 'E2', 'g3'), ('a', 'E3', 'g1'), ('a', 'E3', 'g2'), ('a', 'E3', 'g3')]
df = pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples(tuples))

print(df)
def ration_type1(df):
    """
    (g3+g2)/g1
    # Ugly way since have to convert to numpy 1st
    """

    print(df)
    dration = 'ration_type1'
    l1, l2, _ = df.columns.tolist()[0]
    total = df.loc[:, (slice(None), slice(None), 'g2')].to_numpy() + \
            df.loc[:, (slice(None), slice(None), 'g3')].to_numpy()
    arr = total / df.loc[:, (slice(None), slice(None), 'g1')].to_numpy()

    return pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples([(l1, l2, dration)]))


def ration_type2(df):
    """
    (g3+g2+g1)/g1
    # Ugly way since have to convert to numpy 1st
    """
    dration = 'ration_type2'
    l1, l2, _ = df.columns.tolist()[0]
    total = df.loc[:, (slice(None), slice(None), 'g1')].to_numpy() + \
            df.loc[:, (slice(None), slice(None), 'g2')].to_numpy() + \
            df.loc[:, (slice(None), slice(None), 'g3')].to_numpy()
    arr = total / df.loc[:, (slice(None), slice(None), 'g1')].to_numpy()

    return pd.DataFrame(data=arr, columns=pd.MultiIndex.from_tuples([(l1, l2, dration)]))


level1_name = list(set(df.columns.get_level_values(1)))

all_df1 = [ration_type1(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in level1_name]
all_df2 = [ration_type2(df.loc[:, (slice(None), dgroup, slice(None))]) for dgroup in level1_name]

df1 = pd.concat(all_df1, axis=1)
df2 = pd.concat(all_df2, axis=1)

df=pd.concat([df,df1,df2],axis=1)

Expected output.

          a                      ...                                       
         E1                      ...                        E2           E3
         g1        g2        g3  ... ration_type2 ration_type2 ration_type2
0  0.548814  0.715189  0.602763  ...     3.401458     2.962896     5.240151
1  0.383442  0.791725  0.528895  ...     4.444124     2.754497    10.788191
2  0.778157  0.870012  0.978618  ...     3.375653     2.554145     7.622516
3  0.944669  0.521848  0.414662  ...     1.991363     5.650758     2.119612
4  0.612096  0.616934  0.943748  ...     3.549735     2.168255     2.042087
5  0.670638  0.210383  0.128926  ...     1.505949     3.960760     3.486126
6  0.208877  0.161310  0.653108  ...     4.899035     3.806001     5.822965
7  0.138183  0.196582  0.368725  ...     5.091008     2.138921    16.037821
8  0.976761  0.604846  0.739264  ...     2.376088    11.283905     2.474676
9  0.414263  0.064147  0.692472  ...     2.826423     2.391873    17.023361

[10 rows x 15 columns]

I am thinking to do something like using apply

# function for prepending 'Geek'
def multiply_by_2(number):
    return 2 * number
 
# executing the function
df[["Integers", "Float"]] = df[["Integers", "Float"]].apply(multiply_by_2)

But, I am having difficulties (due to my limited knowledge)to do it since my example involving multiindex columns


Solution

  • Not so easy if working with MultiIndex - solution filter levels with rename g values to ration_type1, ration_type2 for possible divide MultiIndex DataFrames:

    idx = pd.IndexSlice
    c = {'g1':'ration_type1','g2':'ration_type1','g3':'ration_type1'}
    df1 = df.loc[:, idx[:,:,['g3','g2']]].rename(columns=c).groupby(level=[0,1,2], axis=1).sum()
    df11 = df1.div(df.xs('g1', level=2, axis=1, drop_level=False).rename(columns=c))
    
    
    c1 = {'g1':'ration_type2','g2':'ration_type2','g3':'ration_type2'}
    df2 = df.rename(columns=c1).groupby(level=[0,1,2], axis=1).sum()
    df22 = df2.div(df.xs('g1', level=2, axis=1, drop_level=False).rename(columns=c1))
    
    df=pd.concat([df,df11,df22],axis=1)
    

    Simplier is reshape first:

    df1 = df.stack([0,1])
    df1['ration_type1'] = df1[['g2','g3']].sum(axis=1).div(df1['g1'])
    df1['ration_type2'] = df1.sum(axis=1).div(df1['g1'])
    

    print(df1)
                  g1        g2        g3  ration_type1  ration_type2
    0 a E1  0.548814  0.715189  0.602763      2.401458      7.777186
        E2  0.544883  0.423655  0.645894      1.962896      6.565312
        E3  0.437587  0.891773  0.963663      4.240151     14.929992
    1 a E1  0.383442  0.791725  0.528895      3.444124     13.426259
        E2  0.568045  0.925597  0.071036      1.754497      5.843159
        E3  0.087129  0.020218  0.832620      9.788191    123.129174
    2 a E1  0.778157  0.870012  0.978618      2.375653      6.428577
        E2  0.799159  0.461479  0.780529      1.554145      4.498872
        E3  0.118274  0.639921  0.143353      6.622516     63.615316
    3 a E1  0.944669  0.521848  0.414662      0.991363      3.040793
        E2  0.264556  0.774234  0.456150      4.650758     23.230266
        E3  0.568434  0.018790  0.617635      1.119612      4.089254
    4 a E1  0.612096  0.616934  0.943748      2.549735      7.715318
        E2  0.681820  0.359508  0.437032      1.168255      3.881690
        E3  0.697631  0.060225  0.666767      1.042087      3.535837
    5 a E1  0.670638  0.210383  0.128926      0.505949      2.260380
        E2  0.315428  0.363711  0.570197      2.960760     13.347233
        E3  0.438602  0.988374  0.102045      2.486126      9.154429
    6 a E1  0.208877  0.161310  0.653108      3.899035     23.565714
        E2  0.253292  0.466311  0.244426      2.806001     14.884143
        E3  0.158970  0.110375  0.656330      4.822965     36.161882
    7 a E1  0.138183  0.196582  0.368725      4.091008     34.696743
        E2  0.820993  0.097101  0.837945      1.138921      3.526168
        E3  0.096098  0.976459  0.468651     15.037821    172.521382
    8 a E1  0.976761  0.604846  0.739264      1.376088      3.784915
        E2  0.039188  0.282807  0.120197     10.283905    273.710140
        E3  0.296140  0.118728  0.317983      1.474676      7.454332
    9 a E1  0.414263  0.064147  0.692472      1.826423      7.235273
        E2  0.566601  0.265389  0.523248      1.391873      4.848404
        E3  0.093941  0.575946  0.929296     16.023361    187.592593
    

    Last reshape to original MultiIndex:

    df = df1.unstack([1,2]).reorder_levels([1,2,0], axis=1)