pythonpandasdataframemulti-level

Creating a multilevel dataframe by clubbing columns with the same name under top level


Consider the following input dataframe:

        index      |        col_1            |           col_2             |
         1         |        1234             |           4567              |
         2         |        3456             |           9453              |

Each column of the dataframe is a series (timeseries), and we want to do some computations that create series of length equal to the input (for example, computing the running mean of the series of last 5 samples (op_1), and of 10 samples (op_2).

Finally, the output should be grouped under the name of the column like shown below:

        Output:

                   |        col_1          |          col_2        |
        index      | value   opr_1   opr_2 | value   opr_1   opr_2 |
         1         | 1234     10       1   | 4567     22       13  |
         2         | 3456     18       6   | 9453     21       4   |

This should allow me to access each original column's related computation under a single head col_1.

Initially, I thought of increasing the level of the input dataframe manually as:

df.columns = pd.MultiIndex.from_product([df.columns, ['value']])

But, I cannot figure out how to run apply on its second level alone (considering that I want to address the column as df['col_1']['value'] and then to put those values into the dataframe at the same level inside df['col_1']['op_1'].

So, the second approach I tried was to create a dataframe for each operation as

op_1 = df.apply(lambda x: op_1_func(x, **params))
op_2 = df.apply(lambda x: op_2_func(x, **params))

And then merge the three dataframes to create the desired multilevel view. However, I cannot figure out a way to concat the dataframes to produce the desired output. Please help!


Solution

  • Given the following example:

    import numpy as np
    import pandas as pd
    
    def opr1_func(x, y):
        return sum(x) - y
    
    
    def opr2_func(x, y):
        return np.mean(x) - y
    
    
    df = pd.DataFrame({"col1": [2, 4, 6, 8, 10], "col2": [6, 8, 12, 14, 16]})
    

    Here is one way to do it using Pandas advanced indexing with hierarchical index:

    df.columns = pd.MultiIndex.from_product([df.columns, ["value"]])
    
    for col in ["col1", "col2"]:
        for new_col, func in zip(["opr_1", "opr_2"], [opr1_func, opr2_func]):
            df[(col, new_col)] = df[(col, "value")].rolling(2).apply(lambda x: func(x, 10))
    
    df = df.reindex(
        pd.MultiIndex.from_product([["col1", "col2"], ["value", "opr_1", "opr_2"]]), axis=1
    )
    

    Then:

    print(df)
    # Output
    
       col1              col2
      value opr_1 opr_2 value opr_1 opr_2
    0     2   NaN   NaN     6   NaN   NaN
    1     4  -4.0  -7.0     8   4.0  -3.0
    2     6   0.0  -5.0    12  10.0   0.0
    3     8   4.0  -3.0    14  16.0   3.0
    4    10   8.0  -1.0    16  20.0   5.0