Tags: r, vectorization

Speeding Up Monte Carlo Simulations for Weather-Dependent Claims


I’m working on a Monte Carlo simulation project, and I need help optimizing some aspects of it. Here’s the scenario:

We model the claims (linked to weather-dependent events) as independent random variables, with the weather conditions themselves being independent under certain assumptions. The problem involves two hypotheses regarding the motorcyclists:

  1. Hypothesis A: All motorcyclists travel in the same region, subjected to the same weather conditions.

  2. Hypothesis B: Motorcyclists travel in different regions, each with independent weather conditions (modeled as independent Markov chains; see the sketch below).
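
A minimal sketch of Hypothesis B's weather model (the two states and the transition matrix P are placeholder assumptions, not part of the problem statement):

simulate_weather <- function(days, P, start = 1) {
  states <- integer(days)
  states[1] <- start
  for (d in 2:days) {
    # Next day's state is drawn from the row of P for the current state
    states[d] <- sample.int(2, 1, prob = P[states[d - 1], ])
  }
  states  # e.g., 1 = dry, 2 = rain
}

P <- matrix(c(0.8, 0.2,
              0.4, 0.6), nrow = 2, byrow = TRUE)
regional_weather <- replicate(5, simulate_weather(365, P))  # 5 independent regions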

The claim amounts (in monetary units) follow the distribution $\mathbb{P}_r$ with density proportional to $f^*(x) = \mathbf{1}_{\mathbb{R}_+}(x)\,\exp\left(-\eta \ln(\alpha + |x - x_0|)\right)$, i.e., $f^*(x) \propto (\alpha + |x - x_0|)^{-\eta}$ on $\mathbb{R}_+$,

where $\alpha > 0$, $x_0 > 0$, and $\eta > 0$ are constants (the code below additionally requires $\eta > 3$).

Over a period of 365 days, we insure $N$ motorcyclists, and the total claim amount $R$ is the sum of the individual claims.

I aim to approximate the conditional expectation $m(s) = \mathbb{E}[R - s \mid R > s]$ using a Monte Carlo method.
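
A minimal sketch of that estimator, reusing the rremb sampler defined below (the Poisson claim count lambda is a placeholder assumption, not part of the problem statement):

sim_totals <- function(nsim, params, lambda = 100) {
  # Simulate nsim period totals R, each the sum of a random number of claims
  counts <- rpois(nsim, lambda)  # hypothetical number of claims per period
  vapply(counts, function(k) if (k > 0) sum(rremb(k, params)) else 0, numeric(1))
}

m_hat <- function(s, R) mean(R[R > s] - s)  # Monte Carlo estimate of E[R - s | R > s]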

Here’s the current R code I’ve written for simulating claim amounts from $\mathbb{P}_r$:

rremb <- function(n, params) {
  # Extract parameters from the list
  alpha <- params$alpha
  x0 <- params$x0
  eta <- params$eta
  
  # Check parameter validity
  if (alpha <= 0 | x0 <= 0 | eta <= 3) {
    stop("Invalid parameters: alpha > 0, x0 > 0, and eta > 3 are required")
  }
  
  # Normalization factor K
  K <- 2 * alpha^(1 - eta) / (eta - 1)
  
  # Inverse of F* (without normalization factor)
  F_star_inv <- function(z) {
    if (z <= 0.5) {
      return((alpha + x0 - (z * (eta - 1))^(1 / (1 - eta))))
    } else {
      return((x0 - alpha + ((1 - z) * (eta - 1))^(1 / (eta - 1))))
    }
  }
  
  # Inverse of F considering K
  F_inv <- function(y) {
    return(F_star_inv(K * y))
  }
  
  # Generate n uniform random values on (0, 1)
  u <- runif(n)
  
  # Use F_inv to generate values based on P_r
  x <- sapply(u, F_inv)
  
  return(x)
}

# Parameters
params <- list(alpha = 2.0, x0 = 50.0, eta = 4.0)

# Simulate values
n <- 100000  # Adjust as needed
simulated_values <- rremb(n, params)

# Display results
print(head(simulated_values, 10))

# Optional: Plot histogram of simulated values
hist(simulated_values, breaks = 50, main = "Histogram of Simulated Values", 
     xlab = "Values", ylab = "Frequency", col = "skyblue")

While the code works correctly, it’s not as fast as I’d like for large simulations. Is there a way to speed it up? For example:

  1. Could the sapply call be replaced with something faster?

  2. Is there a way to vectorize or optimize the F_inv function?

Thank you for your help!

Attention: the support of $x$ is $\mathbb{R}_+$.


Solution

    Update 2

    For the updated question, which specifies x > 0, a quick fix is rejection sampling: draw from the unrestricted density and discard non-positive values. It generally has a small impact on sampling speed compared to rremb5 from the first update below. Worst case (x0 ≈ 0), it would be about twice as slow.

    params <- list(alpha = 2, x0 = 1, eta = 3.1)
    
    rremb6 <- function(n, params) {
      # Check parameter validity
      with(
        params,
        if (alpha <= 0 | x0 <= 0 | eta <= 3) {
          stop("Invalid parameters: alpha > 0, x0 > 0, and eta > 3 are required")
        } else {
          # Acceptance probability P(X > 0) under the unrestricted density
          p <- 1 - (1 + x0/alpha)^(1 - eta)/2
          ns <- 0
          x <- numeric(n)
          while (n > 0) {
            # Oversample by a factor of 1/p so that one pass usually suffices
            np <- ceiling(n/p)
            # Unrestricted draws: random sign times alpha*(exp(E) - 1), E ~ Exp(eta - 1)
            xp <- x0 + sign(runif(np, -0.5, 0.5))*alpha*expm1(rexp(np, eta - 1))
            # Reject draws outside the support (x <= 0)
            xp <- xp[xp > 0]
            if (length(xp) > n) xp <- xp[1:n]
            x[(ns + 1):(ns <- ns + length(xp))] <- xp
            n <- n - length(xp)
          }
          x
        }
      )
    }
    
    system.time(x <- rremb6(36e6, params))
    #>    user  system elapsed 
    #>    3.58    0.30    3.87
    length(x)
    #> [1] 36000000
    any(x < 0)
    #> [1] FALSE
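
    As a quick sanity check on the truncated sampler, compare the empirical mean with the mean of the density restricted to (0, Inf), computed by numerical integration (a sketch; f_star is the unnormalized target density from the question):

    f_star <- function(x) with(params, (alpha + abs(x - x0))^(-eta))
    Z <- integrate(f_star, 0, Inf)$value                          # normalizing constant on (0, Inf)
    m_theo <- integrate(function(x) x*f_star(x), 0, Inf)$value/Z  # mean of the truncated density
    c(theoretical = m_theo, empirical = mean(rremb6(1e6, params)))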
    

    Update

    rremb does not correctly sample from the probability density given in the question introduction. Below is a function that correctly samples from the proportional density exp(-eta*log(alpha + abs(x - x0))), i.e., (alpha + abs(x - x0))^(-eta). It relies on the fact that if E ~ Exp(eta - 1), then alpha*expm1(E) = alpha*(exp(E) - 1) has density proportional to (alpha + y)^(-eta) on (0, Inf); a random sign and a shift by x0 then recover the full symmetric density.

    rremb5 <- function(n, params) {
      # Check parameter validity
      with(
        params,
        if (alpha <= 0 | x0 <= 0 | eta <= 3) {
          stop("Invalid parameters: alpha > 0, x0 > 0, and eta > 3 are required")
        } else {
          # random sign times alpha*(exp(E) - 1) with E ~ Exp(eta - 1), shifted to x0
          x0 + sign(runif(n, -0.5, 0.5))*alpha*expm1(rexp(n, eta - 1))
        }
      )
    }
    

    The density is symmetric about x0, so the variates should have a mean close to x0. A quick check to show that rremb5 behaves as expected, while rremb does not:

    params <- list(alpha = 2.0, x0 = 50.0, eta = 4.0)
    mean(rremb(1e6, params))
    #> [1] 49.61773
    mean(rremb5(1e6, params))
    #> [1] 49.99987
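
    A fuller distributional check is possible against the analytic CDF, F(q) = 1 - (1 + (q - x0)/alpha)^(1 - eta)/2 for q >= x0 and F(q) = (1 + (x0 - q)/alpha)^(1 - eta)/2 otherwise (a sketch; premb is a helper name introduced here):

    premb <- function(q, params) with(params, {
      z <- (1 + abs(q - x0)/alpha)^(1 - eta)/2  # tail mass beyond q, on q's side of x0
      ifelse(q < x0, z, 1 - z)
    })
    ks.test(rremb5(1e5, params), premb, params = params)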
    

    Check the speed of rremb5:

    system.time(rremb5(63e6, params))
    #>    user  system elapsed 
    #>    4.10    0.15    4.23
    

    For the distribution of sums of large numbers of these variates, the central limit theorem does well, since the distribution is already symmetric. By n = 1e4, the sample mean is essentially indistinguishable from a normal distribution:

    (seed <- sample(.Machine$integer.max, 1))
    #> [1] 2111216523
    set.seed(seed)
    
    system.time(x <- rremb5(1e8, params))
    #>    user  system elapsed 
    #>    6.01    0.31    6.33
    R <- rowMeans(`dim<-`(x, c(1e4, 1e4)))
    # sd of the mean: Var(X) = 2*alpha^2/((eta - 3)*(eta - 2)), finite because eta > 3
    sigma <- with(params, alpha*sqrt(2/((eta - 3)*(eta - 2))/length(R)))
    

    Quick check for normality:

    ks.test(R, pnorm, mean = params$x0, sd = sigma)
    #> 
    #>  Asymptotic one-sample Kolmogorov-Smirnov test
    #> 
    #> data:  R
    #> D = 0.0075364, p-value = 0.621
    #> alternative hypothesis: two-sided
    
    plot(ecdf(R), col = "blue", xlim = c(49.9, 50.1))
    curve(pnorm(x, params$x0, sigma), 49.9, 50.1, col = "orange", add = TRUE)
    

    [Plot: empirical CDF of R (blue) overlaid with the fitted normal CDF (orange).]


    Original Answer

    Squeezing out a little more performance (compared to @RuiBarradas' answer; rremb3 below is the vectorized function from that answer):

    rremb4 <- function(n, params) {
      eta1 <- params$eta - 1
      
      # Check parameter validity
      if (params$alpha <= 0 | params$x0 <= 0 | eta1 <= 2) {
        stop("Invalid parameters: alpha > 0, x0 > 0, and eta > 3 are required")
      }
      
      # Draw uniforms pre-scaled by the normalization factor K
      K <- 2/params$alpha^eta1
      u <- runif(n, 0, K)
      params$x0 + if (K > eta1/2) {
        # Both branches of the inverse CDF can occur: handle them in one
        # vectorized expression via a sign trick instead of branching per element
        s <- sign(eta1/2 - u)
        s*(params$alpha - (eta1*(s < 0) + s*u)^(-s/eta1))
      } else {
        # K <= eta1/2: every u falls in the first branch (z <= 0.5 in rremb)
        params$alpha - u^(-1/eta1)
      }
    }
    

    Testing

    params <- list(alpha = 2.0, x0 = 50.0, eta = 4.0)
    n <- 100000
    
    set.seed(2024)
    v0 <- rremb(n, params)
    set.seed(2024)
    v3 <- rremb3(n, params)
    set.seed(2024)
    v4 <- rremb4(n, params)
    
    identical(v0, v3)
    #> [1] TRUE
    all.equal(v0, v4)
    #> [1] TRUE
    

    Benchmark 1

    microbenchmark::microbenchmark(
      original = rremb(n, params),
      rremb3 = rremb3(n, params),
      rremb4 = rremb4(n, params)
    ) |> print(order = "median", unit = "relative")
    #> Unit: relative
    #>      expr       min        lq      mean    median        uq      max neval
    #>    rremb4  1.000000  1.000000  1.000000  1.000000  1.000000 1.000000   100
    #>    rremb3  1.107398  1.160999  1.155243  1.156929  1.152521 1.695537   100
    #>  original 12.326559 13.862971 14.206785 14.728450 15.586803 6.436684   100
    

    Benchmark 2 (K > eta1/2)

    params <- list(alpha = 0.9, x0 = 50.0, eta = 4.0)
    set.seed(2024)
    v3 <- rremb3(n, params)
    set.seed(2024)
    v4 <- rremb4(n, params)
    all.equal(v3, v4)
    #> [1] TRUE
    
    microbenchmark::microbenchmark(
      rremb3 = rremb3(n, params),
      rremb4 = rremb4(n, params)
    ) |> print(order = "median", unit = "relative")
    #> Unit: relative
    #>    expr      min       lq     mean   median       uq      max neval
    #>  rremb4 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
    #>  rremb3 1.043331 1.074695 1.131025 1.097187 1.126365 1.554966   100
    

    Additional Checks

    In response to the comment that the results (in the form of histograms) differ from what the original code produced: the sampling performed by rremb4 is algebraically equivalent to that of rremb. Some additional checks to demonstrate:

    f <- function(x) {
      seed <- sample(.Machine$integer.max, 1)
      set.seed(seed)
      v0 <- rremb(100, x)
      set.seed(seed)
      v4 <- rremb4(100, x)
      isTRUE(all.equal(v0, v4))
    }
    
    expand.grid(alpha = log(1:5) + 0.9, x0 = 50, eta = log(1:5) + 4) |>
      asplit(1) |> lapply(as.list) |> lapply(f) |> unlist() |> all()
    #> [1] TRUE
    

    Also, I'll point out that if alpha < ((eta - 1)/2)^(1/(1 - eta)), then the samples can include NaN, since z > 1 becomes possible, making (1 - z) * (eta - 1) negative, and a negative number raised to the fractional power 1/(eta - 1) has no real root.
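
    To reproduce (for eta = 4 the threshold is ((4 - 1)/2)^(1/(1 - 4)) = 1.5^(-1/3) ≈ 0.874, so alpha = 0.5 is below it):

    params_bad <- list(alpha = 0.5, x0 = 50.0, eta = 4.0)
    any(is.nan(rremb(1e5, params_bad)))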