rdata.tableself-join

R data.table rolling join etc to add dynamics to previous period


There is a data.table, which has some data for different id's on yearly basis:

example_data <- data.table(
  id = c('id1', 'id1', 'id1', 'id2', 'id2', 'id2', 'id2', 'id2'),
  year = c(2010, 2011, 2012, 2010, 2011, 2012, 2014, 2015),
  param1 = c(10, 10, 20, 10, 5, 7, 15, 15),
  param2 = c(100, 150, 200, 50, 50, 75, 15, 10),
  param3 = c(100, 100, 0, 500, 0, 400, 500, 1000)
)
> example_data
    id year param1 param2 param3
1: id1 2010     10    100    100
2: id1 2011     10    150    100
3: id1 2012     20    200      0
4: id2 2010     10     50    500
5: id2 2011      5     50      0
6: id2 2012      7     75    400
7: id2 2014     15     15    500
8: id2 2015     15     10   1000
> 

I need to add three columns, so that for each id I get a ratio of each param for previous year. I managed at least to make three new columns with just the values of params for previous year:

try01 <- example_data[example_data, on = .(id, year < year), mult="last"]

> try01
    id year param1 param2 param3 i.param1 i.param2 i.param3
1: id1 2010     NA     NA     NA       10      100      100
2: id1 2011     10    100    100       10      150      100
3: id1 2012     10    150    100       20      200        0
4: id2 2010     NA     NA     NA       10       50      500
5: id2 2011     10     50    500        5       50        0
6: id2 2012      5     50      0        7       75      400
7: id2 2014      7     75    400       15       15      500
8: id2 2015     15     15    500       15       10     1000

This is ok, but here the original data is on the right side of the table, and previous values are on the left - it would be better to add new columns to the right instead.

And the original question is - how can I directly add not values from previous years, but the ratio " param current year / param previous year"? May be I could use shift or similar command?


Solution

  • You're on the right track with the suggestion of using shift(). That will be a lot quicker than doing a (cartesian) non-equi join and then subsetting the last value.

    param_cols <- grep("^param", names(example_data), value = TRUE)
    new_cols <- paste0(param_cols, "_ratio")
    example_data[, (new_cols) := lapply(
        .SD,
        \(x) x / shift(x)
    ),
    by = id,
    .SDcols = param_cols
    ]
    
    #        id  year param1 param2 param3 param1_ratio param2_ratio param3_ratio
    #    <char> <num>  <num>  <num>  <num>        <num>        <num>        <num>
    # 1:    id1  2010     10    100    100           NA           NA           NA
    # 2:    id1  2011     10    150    100     1.000000    1.5000000         1.00
    # 3:    id1  2012     20    200      0     2.000000    1.3333333         0.00
    # 4:    id2  2010     10     50    500           NA           NA           NA
    # 5:    id2  2011      5     50      0     0.500000    1.0000000         0.00
    # 6:    id2  2012      7     75    400     1.400000    1.5000000          Inf
    # 7:    id2  2014     15     15    500     2.142857    0.2000000         1.25
    # 8:    id2  2015     15     10   1000     1.000000    0.6666667         2.00
    

    I've assumed you want the first value to be NA as there is no previous value. However, if not, you can change the fill parameter in your shift() call. For example, if you want it to always be 1, just divide by the first value by group, shift(x, fill = x[1]).

    Similarly, you will get Inf rather than NA if the previous value is 0. If you don't want this you replace all the infinite values with NA, e.g. after the previous stage:

    example_data[,
        (new_cols) := lapply(.SD, \(x) fifelse(is.infinite(x), NA, x)),
        .SDcols = new_cols
    ]