pythonpandasgroup-byapply

Group by and apply multiple custom functions on multiple columns in python pandas


Consider the following dataframe example:

id  date        hrz tenor   1       2       3       4
AAA 16/03/2010  2   6m      0.54    0.54    0.78    0.19
AAA 30/03/2010  2   6m      0.05    0.67    0.20    0.03
AAA 13/04/2010  2   6m      0.64    0.32    0.13    0.20
AAA 27/04/2010  2   6m      0.99    0.53    0.38    0.97
AAA 11/05/2010  2   6m      0.46    0.90    0.11    0.14
AAA 25/05/2010  2   6m      0.41    0.06    0.96    0.31
AAA 08/06/2010  2   6m      0.19    0.73    0.58    0.80
AAA 22/06/2010  2   6m      0.40    0.95    0.14    0.56
AAA 06/07/2010  2   6m      0.22    0.74    0.85    0.94
AAA 20/07/2010  2   6m      0.34    0.17    0.03    0.77
AAA 03/08/2010  2   6m      0.13    0.32    0.39    0.95
AAA 16/03/2010  2   1y      0.54    0.54    0.78    0.19
AAA 30/03/2010  2   1y      0.05    0.67    0.20    0.03
AAA 13/04/2010  2   1y      0.64    0.32    0.13    0.20
AAA 27/04/2010  2   1y      0.99    0.53    0.38    0.97
AAA 11/05/2010  2   1y      0.46    0.90    0.11    0.14
AAA 25/05/2010  2   1y      0.41    0.06    0.96    0.31
AAA 08/06/2010  2   1y      0.19    0.73    0.58    0.80
AAA 22/06/2010  2   1y      0.40    0.95    0.14    0.56
AAA 06/07/2010  2   1y      0.22    0.74    0.85    0.94
AAA 20/07/2010  2   1y      0.34    0.17    0.03    0.77
AAA 03/08/2010  2   1y      0.13    0.32    0.39    0.95

How can I grouby the variables id, hrz and tenor and apply the following custom functions across the dates?

 def ks_test(x):
    return scipy.stats.kstest(np.sort(x), 'uniform')[0]

 def cvm_test(x):
    n = len(x)
    i = np.arange(1, n + 1)
    x = np.sort(x)
    w2 = (1 / (12 * n)) + np.sum((x - ((2 * i - 1) / (2 * n))) ** 2)
    return w2

The desired output is the following dataframe (figure results are just examples):

id   hrz    tenor   test        1       2       3       4
AAA  2      6m      ks_test     0.04    0.06    0.02    0.03
AAA  2      6m      cvm_test    0.09    0.17    0.03    0.05
AAA  2      1y      ks_test     0.04    0.06    0.02    0.03
AAA  2      1y      cvm_test    0.09    0.17    0.03    0.05

Solution

  • Use GroupBy.agg with DataFrame.stack for reshape last level of MultiIndex in columns:

    cols = ['id','hrz', 'tenor']
    out = (df.groupby(cols)[df.columns.difference(cols + ['date'], sort=False)]
            .agg([ks_test, cvm_test])
            .rename_axis([None, 'test'], axis=1)
            .stack(future_stack=True)
            .reset_index())
    
    print (out)
        id  hrz tenor      test         1         2         3         4
    0  AAA    2    1y   ks_test  0.278182  0.166364  0.254545  0.224545
    1  AAA    2    1y  cvm_test  0.220803  0.044730  0.158839  0.118321
    2  AAA    2    6m   ks_test  0.278182  0.166364  0.254545  0.224545
    3  AAA    2    6m  cvm_test  0.220803  0.044730  0.158839  0.118321
    

    How it working:

    print (df.groupby(cols)[df.columns.difference(cols +['date'], sort=False)]
            .agg([ks_test, cvm_test]))
    
                          1                   2                  3            \
                    ks_test  cvm_test   ks_test cvm_test   ks_test  cvm_test   
    id  hrz tenor                                                              
    AAA 2   1y     0.278182  0.220803  0.166364  0.04473  0.254545  0.158839   
            6m     0.278182  0.220803  0.166364  0.04473  0.254545  0.158839   
    
                          4            
                    ks_test  cvm_test  
    id  hrz tenor                      
    AAA 2   1y     0.224545  0.118321  
            6m     0.224545  0.118321