Tags: python, pandas, dataframe, date, shift

Python Pandas: update the last-date parameter values, with simple dynamics, grouped by id


Here's a dataframe with some parameters for each id at regular dates. It was originally shuffled randomly, but for now let's say it is sorted by fab_date and id.

import numpy as np
import pandas as pd

np.random.seed(1)
dt_to_fun = pd.DataFrame({
    'fab_date': pd.to_datetime(["2022-01-01", "2022-07-01", "2023-01-01", "2023-07-01",
                                "2022-01-01", "2022-07-01", "2023-01-01", "2023-07-01",
                                "2022-01-01", "2022-07-01", "2023-01-01", "2023-07-01"]),
    'id': ['n_01', 'n_01', 'n_01', 'n_01', 'n_02', 'n_02', 'n_02', 'n_02', 'n_03', 'n_03', 'n_03', 'n_03'],
    'param_01': np.random.choice([10, 20, 30], size=12),
    'param_02': np.random.choice([10, 30, 50], size=12)
})

dt_to_fun
    fab_date    id  param_01    param_02
0   2022-01-01  n_01    20  30
1   2022-07-01  n_01    10  50
2   2023-01-01  n_01    10  10
3   2023-07-01  n_01    20  50
4   2022-01-01  n_02    20  30
5   2022-07-01  n_02    10  50
6   2023-01-01  n_02    10  10
7   2023-07-01  n_02    20  10
8   2022-01-01  n_03    10  50
9   2022-07-01  n_03    20  10
10  2023-01-01  n_03    10  30
11  2023-07-01  n_03    30  50

The goal: for each id, replace the param value in the last-date row like this: param_01(last date) = param_01(last date) + param_01(-1 date) - param_01(-2 date). For example, for id n_03 the last value of param_02 is 50, the previous one is 30, and the one before that is 10, so the result should be 50 + 30 - 10 = 70. The same applies to every id and all param* columns. Only the rows with the last date for each id should be updated.

Resulting table should be like this:

dt_to_fun
    fab_date    id  param_01    param_02
0   2022-01-01  n_01    20  30
1   2022-07-01  n_01    10  50
2   2023-01-01  n_01    10  10
3   2023-07-01  n_01    20  10
4   2022-01-01  n_02    20  30
5   2022-07-01  n_02    10  50
6   2023-01-01  n_02    10  10
7   2023-07-01  n_02    20  -30
8   2022-01-01  n_03    10  50
9   2022-07-01  n_03    20  10
10  2023-01-01  n_03    10  30
11  2023-07-01  n_03    20  70

I've tried applymap, shift, etc., but cannot figure out how it should all fit together. I would appreciate any help.


Solution

  • IIUC, use a custom groupby.transform with diff and shift:

    cols = ['param_01', 'param_02']
    
    dt_to_fun.loc[~dt_to_fun['id'].duplicated(keep='last'), cols] = (
        dt_to_fun.groupby('id')[cols]
        .transform(lambda x: x.diff().shift())
        .add(dt_to_fun[cols])
    )
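    To see why `.diff().shift()` puts the right offset on the last row, here is a minimal sketch on a single group (the n_03 param_02 values from the question, in date order):

    ```python
    import pandas as pd

    # One id's param_02 values, sorted by date (n_03 from the question)
    s = pd.Series([50, 10, 30, 50])

    # diff() gives s[i] - s[i-1]; shift() moves that down one row,
    # so the last row holds s[-2] - s[-3] = 30 - 10 = 20
    offset = s.diff().shift()

    # Adding the offset back at the last row: 50 + 20 = 70
    result = s.iloc[-1] + offset.iloc[-1]
    ```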
    

    Or with multiple groupby.shift:

    cols = ['param_01', 'param_02']
    
    g = dt_to_fun.groupby('id')[cols]
    
    dt_to_fun.loc[~dt_to_fun['id'].duplicated(keep='last'), cols] = (
        dt_to_fun[cols] + g.shift(1) - g.shift(2)
    )
    

    Output:

         fab_date    id  param_01  param_02
    0  2022-01-01  n_01        20        30
    1  2022-07-01  n_01        10        50
    2  2023-01-01  n_01        10        10
    3  2023-07-01  n_01        20        10
    4  2022-01-01  n_02        20        30
    5  2022-07-01  n_02        10        50
    6  2023-01-01  n_02        10        10
    7  2023-07-01  n_02        20       -30
    8  2022-01-01  n_03        10        50
    9  2022-07-01  n_03        20        10
    10 2023-01-01  n_03        10        30
    11 2023-07-01  n_03        20        70
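    Both variants assume each id's rows are already in date order, as the question states. If the frame arrives shuffled instead, sorting it first restores that assumption. A minimal sketch with a single hypothetical id, using the same shift-based approach:

    ```python
    import pandas as pd

    # Hypothetical shuffled frame: one id, dates out of order
    df = pd.DataFrame({
        'fab_date': pd.to_datetime(['2023-01-01', '2022-01-01',
                                    '2023-07-01', '2022-07-01']),
        'id': ['n_03'] * 4,
        'param_02': [30, 50, 50, 10],
    })

    # Restore the order the shift-based solutions rely on
    df = df.sort_values(['id', 'fab_date']).reset_index(drop=True)

    cols = ['param_02']
    g = df.groupby('id')[cols]

    # Update only each id's last-date row: x + x(-1) - x(-2)
    df.loc[~df['id'].duplicated(keep='last'), cols] = (
        df[cols] + g.shift(1) - g.shift(2)
    )
    # Last row becomes 50 + 30 - 10 = 70
    ```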