
How to write a cumulative calculation in data.table


A sequential, cumulative calculation

I need to make a time-series calculation, where the value calculated in each row depends on the result calculated in the previous row. I am hoping to use the convenience of data.table. The actual problem is a hydrological model -- a cumulative water balance calculation, adding rainfall at each time step and subtracting runoff and evaporation as a function of the current water volume. The dataset includes different basins and scenarios (groups). Here I will use a simpler illustration of the problem.

A simplified example of the calculation looks like this, for each time step (row) i:

 v[i] <- a[i] + b[i] * v[i-1]

a and b are vectors of parameter values, and v is the result vector. For the first row (i == 1), v[i-1] is taken to be the initial value v0 = 0.

First attempt

My first thought was to use shift() in data.table. A minimal example, including the desired result v.ans, is

library(data.table)        # version 1.9.7
DT <- data.table(a = 1:4, 
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321) )
DT
#    a   b v.ans
# 1: 1 0.1 1.000
# 2: 2 0.1 2.100
# 3: 3 0.1 3.210
# 4: 4 0.1 4.321

DT[, v := NA_real_]   # initialize v as a numeric column
DT[, v := a + b * ifelse(is.na(shift(v)), 0, shift(v))][]
#    a   b v.ans v
# 1: 1 0.1 1.000 1
# 2: 2 0.1 2.100 2
# 3: 3 0.1 3.210 3
# 4: 4 0.1 4.321 4

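This doesn't work, because the right-hand side is evaluated once for the whole column before anything is assigned: shift(v) is a copy of the original column v (all NA), shifted by one row, and it is unaffected by the assignment to v.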
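The same effect can be reproduced in base R, which may make the order of evaluation easier to see. This is only an illustrative sketch: c(NA, head(v, -1)) stands in for shift(v), and the names lagged and attempt are made up for the example.

```r
a <- 1:4
b <- rep(0.1, 4)
v <- rep(NA_real_, 4)             # v before any assignment

# Stand-in for shift(v): the lag is taken from v as it is now (all NA)
lagged <- c(NA, head(v, -1))

# The whole right-hand side is computed in one vectorised step,
# so every row sees a previous value of 0
attempt <- a + b * ifelse(is.na(lagged), 0, lagged)
attempt
# [1] 1 2 3 4
```

Every row is computed from the old, empty v, so the b term never contributes.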

I also considered building the equation using cumsum() and cumprod(), but I could not see how to make that work in general either.
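For what it is worth, the recurrence can be unrolled algebraically in a restricted case: if every b[i] is nonzero, then v[i] = P[i] * (v0 + sum(a[1:i] / P[1:i])) with P = cumprod(b). The sketch below (vclosed is a made-up name) reproduces the small example, but it divides by a running product that can underflow or overflow on long series, so it is not a general replacement.

```r
# Closed form of v[i] = a[i] + b[i] * v[i-1]; valid only when all b != 0
vclosed <- function(a, b, v0 = 0) {
  P <- cumprod(b)                 # P[i] = b[1] * ... * b[i]
  P * (v0 + cumsum(a / P))
}

a <- 1:4
b <- rep(0.1, 4)
all.equal(vclosed(a, b), c(1, 2.1, 3.21, 4.321))
# [1] TRUE
```

all.equal() is used rather than identical() because the division and multiplication by P introduce small floating-point differences.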

Brute force approach

So I resort to a for loop inside a function for convenience:

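vcalc <- function(a, b, v0 = 0) {
  v <- rep(NA_real_, length(a))      # initialize v as numeric
  for (i in seq_along(a)) {          # seq_along() is safe for zero-length input
    # the first row uses the initial value v0, later rows the previous v
    v[i] <- a[i] + b[i] * ifelse(i == 1, v0, v[i-1])
  }
  return(v)
}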

This cumulative function works fine with data.table:

DT[, v := vcalc(a, b, 0)][]
#    a   b v.ans     v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
identical(DT$v, DT$v.ans)
# [1] TRUE
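
Because vcalc() only sees the vectors passed to it, the same call should extend to grouped data through by. A minimal sketch, assuming a hypothetical grouping column named basin (the real data would use its own basin and scenario columns) and rows already in time order within each group:

```r
library(data.table)

# Same loop as vcalc() above, repeated so the example is self-contained
vcalc <- function(a, b, v0 = 0) {
  v <- rep(NA_real_, length(a))
  for (i in seq_along(a)) {
    v[i] <- a[i] + b[i] * ifelse(i == 1, v0, v[i - 1])
  }
  v
}

# "basin" is a hypothetical grouping column used for illustration only
DT2 <- data.table(basin = rep(c("A", "B"), each = 4),
                  a = rep(1:4, 2),
                  b = 0.1)

# The recurrence restarts from v0 = 0 within each group
DT2[, v := vcalc(a, b, 0), by = basin]
DT2[basin == "B", v]
```

Each group should reproduce the four values of v.ans, since both groups repeat the inputs of the small example.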

My question

My question is, can I write this calculation in a more concise and efficient data.table way, without having to use the for loop and/or function definition? Using set() perhaps?

Or is there a better approach altogether?

Edit: A better loop

David's Rcpp solution below inspired me to remove the ifelse() from the for loop:

vcalc2 <- function(a, b, v0 = 0) {
  v <- rep(NA_real_, length(a))
  for (i in seq_along(a)) {
    # v0 carries the previous result forward, so no first-row test is needed
    v0 <- v[i] <- a[i] + b[i] * v0
  }
  return(v)
}

vcalc2() is 60% faster than vcalc().
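
If the goal is mainly to avoid writing the loop by hand, base R's Reduce() with accumulate = TRUE can express the same recurrence. This is a sketch for comparison only (vreduce is a made-up name); no timing is claimed for it, and it should not be assumed to be faster than vcalc2().

```r
vreduce <- function(a, b, v0 = 0) {
  # Fold over the row indices, carrying the previous v forward;
  # accumulate = TRUE keeps every intermediate value, starting with v0
  steps <- Reduce(function(prev, i) a[i] + b[i] * prev,
                  seq_along(a), accumulate = TRUE, init = v0)
  steps[-1]                       # drop the leading v0
}

all.equal(vreduce(1:4, rep(0.1, 4)), c(1, 2.1, 3.21, 4.321))
# [1] TRUE
```

It can be used inside data.table in the same way as the loop versions, e.g. DT[, v := vreduce(a, b, 0)].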


Solution

  • This may not be exactly what you are looking for, as it does not use the "data.table way" and still relies on a for loop. It should be faster, though (I assume you want data.table and the data.table way mainly to speed up your code). I use Rcpp to write a short function called HydroFun that can be called from R like any other function, once the file has been sourced. My gut feeling is that a pure data.table solution, if one exists, would be fairly complicated, because there is no obvious closed-form solution (but I may be wrong on this point...).

    The Rcpp function looks like this (saved in the file hydrofun.cpp):

    #include <Rcpp.h>
    using namespace Rcpp;
    
    // [[Rcpp::export]]
    NumericVector HydroFun(NumericVector a, NumericVector b, double v0 = 0.0) {
      // get the size of the vectors and make sure they match
      R_xlen_t vecSize = a.size();
      if (b.size() != vecSize) stop("a and b must have the same length");
    
      // initialize a numeric vector "v" (for the result)
      NumericVector v(vecSize);
      if (vecSize == 0) return v;   // nothing to compute for empty input
    
      // the first element uses the initial value v0
      v[0] = a[0] + b[0] * v0;
    
      // loop through the vector and compute each value from the previous one
      for (R_xlen_t i = 1; i < vecSize; ++i) {
        v[i] = a[i] + b[i] * v[i - 1];
      }
      return v;
    }
    

    To source and use the function in R you can do:

    Rcpp::sourceCpp("hydrofun.cpp")
    
    library(data.table)
    DT <- data.table(a = 1:4, 
                     b = 0.1,
                     v.ans = c(1, 2.1, 3.21, 4.321))
    
    DT[, v_ans2 := HydroFun(a, b, 0)]
    DT
    #    a   b v.ans v_ans2
    # 1: 1 0.1 1.000  1.000
    # 2: 2 0.1 2.100  2.100
    # 3: 3 0.1 3.210  3.210
    # 4: 4 0.1 4.321  4.321
    

    This gives the values you are looking for.

    Comparing speeds on 10,000 rows shows a speed-up of roughly 60x (median timings) over the vcalc() loop.

    library(microbenchmark)
    n <- 10000
    dt <- data.table(a = 1:n,
                     b = rnorm(n))
    
    microbenchmark(dt[, v1 := vcalc(a, b, 0)],
                   dt[, v2 := HydroFun(a, b, 0)])
    # Unit: microseconds
    #                               expr       min        lq       mean    median         uq       max neval
    #     dt[, `:=`(v1, vcalc(a, b, 0))] 28369.672 30203.398 31883.9872 31651.566 32646.8780 68727.433   100
    #  dt[, `:=`(v2, HydroFun(a, b, 0))]   381.307   421.697   512.2957   512.717   560.8585  1496.297   100
    
    identical(dt$v1, dt$v2)
    # [1] TRUE
    

    Does that help you in any way?