I’m working on a Monte Carlo simulation project, and I need help optimizing some aspects of it. Here’s the scenario:
We model the claims (linked to weather-dependent events) as independent random variables, with the weather conditions themselves being independent under certain assumptions. The situation involves two hypotheses regarding the motorcyclists:
Hypothesis A: All motorcyclists travel in the same region, subjected to the same weather conditions.
Hypothesis B: Motorcyclists travel in different regions, each with independent weather conditions (modeled as independent Markov chains).
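For instance, under Hypothesis B each region’s weather would be simulated as its own independent chain. Here is a minimal sketch of what I have in mind for one region (the two states and the transition matrix P are placeholders, not the actual model):
# Hypothetical two-state weather chain (dry/rainy) for a single region;
# each region would run its own independent copy of this.
simulate_weather <- function(days, P = matrix(c(0.8, 0.2,
                                                0.4, 0.6), 2, byrow = TRUE)) {
  states <- integer(days)
  states[1] <- 1  # start in state 1 (dry)
  for (t in 2:days) {
    states[t] <- sample(1:2, 1, prob = P[states[t - 1], ])
  }
  c("dry", "rainy")[states]
}
head(simulate_weather(365))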
The claim amounts (in monetary units) follow the distribution $\mathbb{P}_r$ with density proportional to $f^*(x) = \mathbf{1}_{\mathbb{R}_+}(x)\exp\left(-\eta\ln(\alpha + |x - x_0|)\right) = \mathbf{1}_{\mathbb{R}_+}(x)\,(\alpha + |x - x_0|)^{-\eta}$,
where $\alpha > 0$, $x_0 > 0$, $\eta > 0$ are constants.
Over a period of 365 days, we insure $N$ motorcyclists, and the total claim amount $R$ is the sum of the individual claims.
I aim to approximate the conditional expectation $m(s) = \mathbb{E}[R - s \mid R > s]$ using a Monte Carlo method.
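For reference, the estimator I have in mind for m(s) is the plain Monte Carlo average of R - s over the replicates where R > s, along these lines (rremb() is the sampler defined below, and N is a placeholder for the portfolio size):
# Sketch of the Monte Carlo estimator for m(s) = E[R - s | R > s]:
# simulate n_sim totals R, then average the exceedances over s.
estimate_m <- function(s, n_sim, N, params) {
  R <- replicate(n_sim, sum(rremb(N, params)))
  exceed <- R[R > s]
  if (length(exceed) == 0) return(NA_real_)  # no replicate exceeded s
  mean(exceed - s)
}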
Here’s the current R code I’ve written for simulating claim amounts based on $\mathbb{P}_r$:
rremb <- function(n, params) {
  # Extract parameters from the list
  alpha <- params$alpha
  x0 <- params$x0
  eta <- params$eta
  # Check parameter validity
  if (alpha <= 0 | x0 <= 0 | eta <= 3) {
    stop("Invalid parameters: alpha > 0, x0 > 0, and eta > 3 are required")
  }
  # Normalization factor K
  K <- 2 * alpha^(1 - eta) / (eta - 1)
  # Inverse of F* (without normalization factor)
  F_star_inv <- function(z) {
    if (z <= 0.5) {
      return(alpha + x0 - (z * (eta - 1))^(1 / (1 - eta)))
    } else {
      return(x0 - alpha + ((1 - z) * (eta - 1))^(1 / (eta - 1)))
    }
  }
  # Inverse of F considering K
  F_inv <- function(y) {
    return(F_star_inv(K * y))
  }
  # Generate n uniform random values on (0, 1)
  u <- runif(n)
  # Use F_inv to generate values based on P_r
  x <- sapply(u, F_inv)
  return(x)
}
# Parameters
params <- list(alpha = 2.0, x0 = 50.0, eta = 4.0)
# Simulate values
n <- 100000 # Adjust as needed
simulated_values <- rremb(n, params)
# Display results
print(head(simulated_values, 10))
# Optional: Plot histogram of simulated values
hist(simulated_values, breaks = 50, main = "Histogram of Simulated Values",
xlab = "Values", ylab = "Frequency", col = "skyblue")
While the code works correctly, it’s not as fast as I’d like for large simulations. Is there a way to speed it up? For example:
- Could the sapply call be replaced with something faster?
- Is there a way to vectorize or optimize the F_inv function?
Thank you for your help!
Attention: the support of $x$ is $\mathbb{R}_+$.
For the updated question that specifies x > 0, a quick fix is to implement rejection sampling. It will generally have a small impact on sampling speed compared to the first updated answer. Worst case (x0 ~ 0), it would be about twice as slow.
params <- list(alpha = 2, x0 = 1, eta = 3.1)
rremb6 <- function(n, params) {
  # Check parameter validity
  with(
    params,
    if (alpha <= 0 | x0 <= 0 | eta <= 3) {
      stop("Invalid parameters: alpha > 0, x0 > 0, and eta > 3 are required")
    } else {
      # Probability that a draw from the untruncated density lands in (0, Inf)
      p <- 1 - (1 + x0/alpha)^(1 - eta)/2
      ns <- 0
      x <- numeric(n)
      while (n > 0) {
        np <- ceiling(n/p)  # oversample to compensate for rejections
        xp <- x0 + sign(runif(np, -0.5, 0.5))*alpha*expm1(rexp(np, eta - 1))
        xp <- xp[xp > 0]    # reject non-positive draws
        if (length(xp) > n) xp <- xp[1:n]
        x[(ns + 1):(ns <- ns + length(xp))] <- xp  # fill the next slots
        n <- n - length(xp)
      }
      x
    }
  )
}
system.time(x <- rremb6(36e6, params))
#> user system elapsed
#> 3.58 0.30 3.87
length(x)
#> [1] 36000000
any(x < 0)
#> [1] FALSE
rremb does not correctly sample from the probability density given in the question introduction. Below is a function that correctly samples from the proportional density exp(-eta*log(alpha + abs(x - x0))).
rremb5 <- function(n, params) {
  # Check parameter validity
  with(
    params,
    if (alpha <= 0 | x0 <= 0 | eta <= 3) {
      stop("Invalid parameters: alpha > 0, x0 > 0, and eta > 3 are required")
    } else x0 + sign(runif(n, -0.5, 0.5))*alpha*expm1(rexp(n, eta - 1))
  )
}
The density is symmetric about x0, so the variates should have a mean close to x0. A quick check to show that rremb5 behaves as expected, while rremb does not:
params <- list(alpha = 2.0, x0 = 50.0, eta = 4.0)
mean(rremb(1e6, params))
#> [1] 49.61773
mean(rremb5(1e6, params))
#> [1] 49.99987
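Beyond the mean, the draws can be compared against the target density directly. The normalizing constant below is my own derivation (integrating (alpha + |x - x0|)^(-eta) over the real line gives 2*alpha^(1 - eta)/(eta - 1)), so treat this as a sketch rather than part of the original answer:
# Normalized target density on the real line:
# f(x) = (eta - 1)/2 * alpha^(eta - 1) * (alpha + |x - x0|)^(-eta)
dremb <- function(x, params) with(params,
  (eta - 1)/2 * alpha^(eta - 1) * (alpha + abs(x - x0))^(-eta))

x5 <- rremb5(1e6, params)
hist(x5, breaks = 1000, freq = FALSE, xlim = c(40, 60),
     main = "rremb5 sample vs. target density", xlab = "x")
curve(dremb(x, params), col = "red", lwd = 2, add = TRUE)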
Check the speed of rremb5:
system.time(rremb5(63e6, params))
#> user system elapsed
#> 4.10 0.15 4.23
For the distribution of sums of large numbers of these random variates, the central limit theorem does well since the distribution is already symmetric. By n = 1e4 the mean is essentially indistinguishable from a normal distribution:
(seed <- sample(.Machine$integer.max, 1))
#> [1] 2111216523
set.seed(seed)
system.time(x <- rremb5(1e8, params))
#> user system elapsed
#> 6.01 0.31 6.33
R <- rowMeans(`dim<-`(x, c(1e4, 1e4)))  # 1e4 sample means of 1e4 variates each
# sd of each sample mean; Var(X) = alpha^2*E[(exp(E) - 1)^2] = 2*alpha^2/((eta - 3)*(eta - 2)) for E ~ Exp(eta - 1)
sigma <- with(params, alpha*sqrt(2/((eta - 3)*(eta - 2)*length(R))))
Quick check for normality:
ks.test(R, pnorm, mean = params$x0, sd = sigma)
#>
#> Asymptotic one-sample Kolmogorov-Smirnov test
#>
#> data: R
#> D = 0.0075364, p-value = 0.621
#> alternative hypothesis: two-sided
plot(ecdf(R), col = "blue", xlim = c(49.9, 50.1))
curve(pnorm(x, params$x0, sigma), 49.9, 50.1, col = "orange", add = TRUE)
Squeezing out a little more performance (compared to @RuiBarradas' answer):
rremb4 <- function(n, params) {
  eta1 <- params$eta - 1
  # Check parameter validity
  if (params$alpha <= 0 | params$x0 <= 0 | eta1 <= 2) {
    stop("Invalid parameters: alpha > 0, x0 > 0, and eta > 3 are required")
  }
  K <- 2/params$alpha^eta1
  u <- runif(n, 0, K)
  params$x0 + if (K > eta1/2) {
    s <- sign(eta1/2 - u)
    s*(params$alpha - (eta1*(s < 0) + s*u)^(-s/eta1))
  } else {
    params$alpha - u^(-1/eta1)
  }
}
Testing
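(rremb3 is from @RuiBarradas' answer and is not reproduced here. To make this section self-contained, here is a hypothetical stand-in, assuming rremb3 is simply rremb's inverse-CDF transform vectorized with ifelse; under that assumption it reproduces rremb's output for the same seed:)
# Hypothetical stand-in for @RuiBarradas' rremb3: rremb's transform,
# vectorized with ifelse in place of sapply
rremb3 <- function(n, params) {
  alpha <- params$alpha; x0 <- params$x0; eta <- params$eta
  if (alpha <= 0 | x0 <= 0 | eta <= 3) {
    stop("Invalid parameters: alpha > 0, x0 > 0, and eta > 3 are required")
  }
  z <- 2 * alpha^(1 - eta) / (eta - 1) * runif(n)
  ifelse(z <= 0.5,
         alpha + x0 - (z * (eta - 1))^(1 / (1 - eta)),
         x0 - alpha + ((1 - z) * (eta - 1))^(1 / (eta - 1)))
}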
params <- list(alpha = 2.0, x0 = 50.0, eta = 4.0)
n <- 100000
set.seed(2024)
v0 <- rremb(n, params)
set.seed(2024)
v3 <- rremb3(n, params)
set.seed(2024)
v4 <- rremb4(n, params)
identical(v0, v3)
#> [1] TRUE
all.equal(v0, v4)
#> [1] TRUE
Benchmark 1
microbenchmark::microbenchmark(
original = rremb(n, params),
rremb3 = rremb3(n, params),
rremb4 = rremb4(n, params)
) |> print(order = "median", unit = "relative")
#> Unit: relative
#> expr min lq mean median uq max neval
#> rremb4 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
#> rremb3 1.107398 1.160999 1.155243 1.156929 1.152521 1.695537 100
#> original 12.326559 13.862971 14.206785 14.728450 15.586803 6.436684 100
Benchmark 2 (K > eta1/2)
params <- list(alpha = 0.9, x0 = 50.0, eta = 4.0)
set.seed(2024)
v3 <- rremb3(n, params)
set.seed(2024)
v4 <- rremb4(n, params)
all.equal(v3, v4)
#> [1] TRUE
microbenchmark::microbenchmark(
rremb3 = rremb3(n, params),
rremb4 = rremb4(n, params)
) |> print(order = "median", unit = "relative")
#> Unit: relative
#> expr min lq mean median uq max neval
#> rremb4 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
#> rremb3 1.043331 1.074695 1.131025 1.097187 1.126365 1.554966 100
In response to the comment that the results (in the form of histograms) differ from what the original code produced: the sampling performed by rremb4 is algebraically equivalent to that of rremb. Some additional checks to demonstrate:
f <- function(x) {
  seed <- sample(.Machine$integer.max, 1)
  set.seed(seed)
  v0 <- rremb(100, x)
  set.seed(seed)
  v4 <- rremb4(100, x)
  isTRUE(all.equal(v0, v4))
}
expand.grid(alpha = log(1:5) + 0.9, x0 = 50, eta = log(1:5) + 4) |>
asplit(1) |> lapply(as.list) |> lapply(f) |> unlist() |> all()
#> [1] TRUE
Also, I'll point out that if alpha < ((eta - 1)/2)^(1/(1 - eta)), then the samples from rremb can include NaN values, since z > 1 is then a possibility, resulting in a negative value for (1 - z) * (eta - 1), for which there is no real root.