Tags: r, vectorization, numerical-integration

How can I efficiently vectorize the integrate function in R?


I am numerically integrating with integrate(). The function I am integrating depends on several parameters, and I would like to calculate the integral for a large number of parameter values. I currently do this with mapply().

MWE:

fc <- function(x, a, b) a*x + b
fc.wrapper <- function(p.a, p.b) integrate(fc, 0, 1, a = p.a, b = p.b)$value
a.vec <- seq(0, 1, length.out = 100)
b.vec <- seq(0, 1, length.out = 100) 
system.time(mapply(fc.wrapper, a.vec, b.vec))

Output:

   user  system elapsed 
  0.003   0.000   0.004 

The problem is that in my application this takes too long to run. Are there faster alternatives to mapply()? I have tried future_mapply(), but it doesn't seem to speed things up.

My attempt with future.apply():

library(future.apply)
plan(multisession, workers = 6L)
system.time(future_mapply(fc.wrapper, a.vec, b.vec, future.seed = TRUE))

Output:

   user  system elapsed 
  0.315   0.008   0.752

I am also open to suggestions for faster user-written alternatives to integrate().


Solution

  • For a 1-D smooth function, and with enough memory, you can use straight non-adaptive Gauss-Legendre quadrature with a relatively large number of nodes to vectorize the integration. Because the nodes and weights are fixed in advance (unlike the adaptive rule in integrate()), the integrand can be evaluated at every node for every parameter combination at once, and each integral reduces to a weighted sum, so a single matrix product computes them all. I'll demonstrate with a function with moderate computational cost:

    f1 <- function(x, A, B, C, D) {
      (A*lgamma(x) - B*log(x) + C*qgamma(pcauchy(x), 5))*exp(sin(D*x))
    }
    
    a <- 0; b <- 100 # integration extents
    A <- B <- C <- D <- seq(0, 1, length.out = 20) # parameter values
    

    I'll explore the full 20^4 parameter space using expand.grid:

    system.time({
      ba2 <- (b - a)/2 # to normalize the integration extents to [-1, 1]
      nodes <- statmod::gauss.quad(100) # calculate the nodes and weights
      x <- nodes[[1]]*ba2 + (b + a)/2 # nodes mapped from [-1, 1] to [a, b]
      w <- nodes[[2]]*ba2 # weights rescaled accordingly
      df1 <- expand.grid(x = x, A = A, B = B, C = C, D = D)
      df1 <- cbind(
        expand.grid(A = A, B = B, C = C, D = D),
        # reshape the evaluations into a length(x)-by-20^4 matrix (one column
        # per parameter combination); crossprod(w, .) then returns all 20^4
        # weighted sums, i.e. all the integrals, in one matrix product
        v = c(crossprod(w, matrix(do.call(f1, df1), length(x))))
      )
    })
    #>    user  system elapsed 
    #>   23.01    0.12   23.15
    

    Spot check the integration:

    sum(f1(x, 0.5, 0.5, 0.5, 0.5)*w)
    #> [1] 9919.591
    integrate(\(x) f1(x, 0.5, 0.5, 0.5, 0.5), a, b, rel.tol = 1e-6)
    #> 9919.598 with absolute error < 0.00023
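
    To check that 100 nodes are enough, one quick sanity test is to double the node count and compare; if the two rules agree to the tolerance you need, the smaller rule is adequate. A minimal sketch (the 200 is arbitrary):

    nodes2 <- statmod::gauss.quad(200)
    x2 <- nodes2[[1]]*ba2 + (b + a)/2
    w2 <- nodes2[[2]]*ba2
    sum(f1(x2, 0.5, 0.5, 0.5, 0.5)*w2) # compare against the 100-node value above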
    

    If the function is amenable to broadcasting, the integration can be sped up even more using outer. Here the integrand is rewritten so that, given the node vector and length-n parameter vectors, it returns a length(x)-by-n matrix with one column per parameter combination:

    f2 <- function(x, A, B, C, D) {
      (outer(lgamma(x), A) - outer(log(x), B) + outer(qgamma(pcauchy(x), 5), C))*
        exp(sin(outer(x, D)))
    }
    
    system.time({
      ba2 <- (b - a)/2
      nodes <- statmod::gauss.quad(100)
      x <- nodes[[1]]*ba2 + (b + a)/2
      w <- nodes[[2]]*ba2
      df2 <- expand.grid(A = A, B = B, C = C, D = D)
      # f2 evaluates all nodes for all parameter rows in one call, so a single
      # crossprod again yields every integral
      df2 <- transform(df2, v = c(crossprod(w, f2(x, A, B, C, D))))
    })
    #>    user  system elapsed 
    #>    1.45    0.05    1.52
    
    all.equal(df1$v, df2$v)
    #> [1] TRUE
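
    Much of the gain over the first version comes from skipping the 16-million-row expand.grid data frame (100 nodes times 20^4 parameter rows) and evaluating the integrand directly on the much smaller node and parameter vectors.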
    

    This is all using a single thread. It would be straightforward to do the above in parallel.
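
    For instance, here is a minimal sketch with future.apply, splitting the parameter grid into one chunk of rows per worker (it reuses x, w, f2, and df2 from above; the worker count is illustrative):

    library(future.apply)
    plan(multisession, workers = 6L)
    chunks <- parallel::splitIndices(nrow(df2), 6L) # one block of rows per worker
    v.par <- unlist(future_lapply(chunks, function(i) {
      # each worker integrates its block of parameter combinations
      with(df2[i, ], c(crossprod(w, f2(x, A, B, C, D))))
    }))
    all.equal(df2$v, v.par)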

    Compare to the mapply approach in the question:

    system.time({
      f3 <- function(A, B, C, D) {
        integrate(f1, a, b, A = A, B = B, C = C, D = D)$value
      }
      
      df3 <- expand.grid(A = A, B = B, C = C, D = D)
      df3 <- transform(df3, v = mapply(f3, A, B, C, D))
    })
    #>    user  system elapsed 
    #>  150.28    0.04  150.58
    
    all.equal(df1$v, df3$v)
    #> [1] "Mean relative difference: 2.018989e-05"