pythonpy-datatable

Create many lagged variables


I have the following Python datatable:

import datatable
import numpy as np

np.random.seed(42)

dt = datatable.Frame({"A":np.repeat(np.arange(0, 2), 5), "B":np.random.normal(0, 1, 10)})
dt

#          A        B
#0         0        −0.342855
#1         0        0.0706784
#2         0        0.0470259
#3         0        −0.0522357
#4         0        −0.610938
#5         1        −2.62617
#6         1        0.550128
#7         1        0.538717
#8         1        −0.487166
#9         1        0.996788

I want to create 4 lagged columns for the column B for each value in column A. This will result in the following datatable:

#          A        B               B_lag_1         B_lag_2         B_lag_3         B_lag_4
#0         0        −0.342855       NA              NA              NA              NA
#1         0        0.0706784       −0.342855       NA              NA              NA
#2         0        0.0470259       0.0706784       −0.342855       NA              NA
#3         0        −0.0522357      0.0470259       0.0706784       −0.342855       NA
#4         0        −0.610938       −0.0522357      0.0470259       0.0706784       −0.342855
#5         1        −2.62617        NA              NA              NA              NA
#6         1        0.550128        −2.62617        NA              NA              NA
#7         1        0.538717        0.550128        −2.62617        NA              NA
#8         1        −0.487166       0.538717        0.550128        −2.62617        NA
#9         1        0.996788        −0.487166       0.538717        0.550128        −2.62617

How can I achieve this?


Solution

  • I never used datatable but pandas.DataFrame has groupby() and shift() and I found similar functions in datatable.


    You can use:

    import datatable as dt
    import numpy as np
    
    np.random.seed(42)
    
    df = dt.Frame({
        "A": np.repeat(np.arange(0, 2), 5), 
        "B": np.random.normal(0, 1, 10)
    })
    
    for n in range(1, 5):
        df[f'B_lag_{n}'] = df[:, dt.shift(dt.f.B, n), dt.by('A')]['B']
        
    df    
    

    Result

       |     A          B    B_lag_1    B_lag_2    B_lag_3    B_lag_4
       | int64    float64    float64    float64    float64    float64
    -- + -----  ---------  ---------  ---------  ---------  ---------
     0 |     0   0.496714  NA         NA         NA         NA       
     1 |     0  -0.138264   0.496714  NA         NA         NA       
     2 |     0   0.647689  -0.138264   0.496714  NA         NA       
     3 |     0   1.52303    0.647689  -0.138264   0.496714  NA       
     4 |     0  -0.234153   1.52303    0.647689  -0.138264   0.496714
     5 |     1  -0.234137  NA         NA         NA         NA       
     6 |     1   1.57921   -0.234137  NA         NA         NA       
     7 |     1   0.767435   1.57921   -0.234137  NA         NA       
     8 |     1  -0.469474   0.767435   1.57921   -0.234137  NA       
     9 |     1   0.54256   -0.469474   0.767435   1.57921   -0.234137
    [10 rows x 6 columns]