r · bayesian · numerical-integration

Integrating a density to find the upper limit that gives a certain probability


I have a posterior density function, and I am trying to integrate it to obtain its cumulative distribution function (CDF), finding the upper bound of integration at which the cumulative probability equals 90%. Put another way, I am trying to solve for $q_j$ in the following equation, where $\pi(q)$ is my posterior density:

$$\int_0^{q_j} \pi(q) \, dq = 0.90$$

I have the following R code to do so:

y <- rgamma(22, 2, 0.5)
n <- length(y)

tau <- 0.9
alpha <- 0.1

# Define the log-posterior
log_posterior <- function(q) {
  mu <- mean((y < q) - tau)               # sample mean of the moment (y < q) - tau
  sigma <- max(var((y < q) - tau), 1e-8)  # its sample variance, floored for stability
  log_post <- -0.5 * log(sigma) - (n / 2) * (mu^2 / sigma)
  return(log_post)
}
  
# Create evaluation grid
a <- 0
b <- 10
n_grid <- 1000
q_grid <- seq(a, b, length.out = n_grid)
log_post_vals <- sapply(q_grid, log_posterior)
  
# Stabilize and exponentiate
post_vals <- exp(log_post_vals - max(log_post_vals))
post_dens <- post_vals / sum(post_vals * diff(q_grid)[1])
  
# Compute CDF
cdf <- cumsum(post_dens * diff(q_grid)[1])

# Find the upper limit  
(upper <- q_grid[which(cdf >= 1 - alpha)[1]])

Most of the time my solution seems to give sensible results; however, for small-ish sample sizes, the code does not seem to be very robust to the choices of a, b, and the grid size. The lower bound a has a natural value of 0 (q can't be any smaller), but how large to set b is what is giving me issues. I understand that I need to choose b large enough to cover the region where the posterior has meaningful density, but choosing b too large also has implications, which I assume are tied to the grid size (a fixed number of grid points spread over a wider interval means a coarser grid where the density actually lives). Are there suggestions for a better or more robust solution to what I am trying to do? I know I could also just obtain posterior samples of my distribution and calculate the quantity of interest that way, but since this is a "simple" one-dimensional problem, I wanted to use numerical integration to speed up the calculations.


Solution

  • I do not see the point of doing the numerical integration by hand. Perhaps we can try R's integrate function instead? This abstracts away the need to choose an upper bound, as we can simply set b = Inf. Define the integral operator as

    int <- function(f, a, b) {
      integrate(f, a, b, subdivisions = 1e5, rel.tol = 1e-10)$value
    }
    

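    As a quick sanity check of this helper (a minimal sketch): the standard normal density integrates to 1 over the whole real line, and integrate handles infinite bounds directly.

    int(dnorm, -Inf, Inf)  # ~1
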
    Another improvement is to compute the posterior density directly, for greater precision. To ensure things are well behaved, -0.5 * log(sigma) - (n / 2) * (mu^2 / sigma) needs to be defined over the full support of q, that is, [0, Inf). With some algebra, you can show that

    mu <- p - tau
    sigma <- n / (n - 1) * p * (1 - p)
    

    where p is simply mean(y < q). Then it is clear that the boundary cases q in {0, Inf} map to p in {0, 1}. You can also show that 0.5 * log(x) - (n / 2) * (mu^2 * x) diverges to negative infinity as x grows, provided mu != 0 (which holds at the boundaries, where mu is -tau or 1 - tau). So, taking x = 1/sigma, the same expression must evaluate to -Inf when p in {0, 1} (i.e., sigma = 0). Then, given any vector y, the posterior density can be defined as follows (a quick numerical check of the mu and sigma identities appears after the code):

    posterior <- function(y, tau) {
      n <- length(y)
      f <- \(x) {
        # P(y < q) for every element of x at once (vectorized)
        p <- colMeans(outer(y, x, `<`))
        mu <- p - tau
        sigma <- n / (n - 1) * p * (1 - p)
        out <- -0.5 * log(sigma) - (n / 2) * (mu^2 / sigma)
        out[sigma <= 0] <- -Inf  # boundary cases p in {0, 1}
        exp(out)
      }
      # Normalizing constant, computed once and captured by the closure
      integral <- int(f, 0, Inf)
      \(x) f(x) / integral
    }
    
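    Before moving on, the algebra for mu and sigma can be verified numerically. A minimal sketch (the throwaway names y_chk, q_chk and the cutoff q = 3 are arbitrary choices for illustration, not part of the original code):

    set.seed(1)
    y_chk <- rgamma(22, 2, 0.5)
    q_chk <- 3
    n_chk <- length(y_chk)
    p_chk <- mean(y_chk < q_chk)
    # sample mean of (y < q) - tau equals p - tau
    c(mean((y_chk < q_chk) - 0.9), p_chk - 0.9)
    # var() of a shifted 0/1 vector equals n / (n - 1) * p * (1 - p)
    c(var((y_chk < q_chk) - 0.9), n_chk / (n_chk - 1) * p_chk * (1 - p_chk))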

    Note that colMeans(outer(y, x, `<`)) is just a faster way of doing sapply(x, \(i) mean(y < i)); this trick vectorizes the internal function f, which integrate requires (its integrand must accept and return a vector). Note also that we do not need to subtract the max: the usual stabilization exp(x - max(x)) = exp(x) / exp(max(x)) only rescales by the constant exp(max(x)), which cancels between numerator and denominator by linearity of the integral. A further point is that posterior() returns a closure in which the normalizing integral is computed once, so repeated density evaluations do not recompute it. Finally, to get the 90th percentile, you can use the uniroot function to search for the point where the posterior CDF equals 0.9; the posterior CDF is just another numerical integral. Consider the following steps:

    y <- rgamma(22, 2, 0.5)
    alpha <- 0.1
    tau <- 0.9
    
    post_pdf <- posterior(y, tau)
    uniroot(\(x) int(post_pdf, 0, x) - (1 - alpha), c(0, 100))
    
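    If the fixed bracket c(0, 100) worries you (it might not contain the root for every data set), uniroot can grow the interval on its own via its extendInt argument; "upX" declares that the objective crosses zero from below, which holds here because the CDF is nondecreasing:

    uniroot(\(x) int(post_pdf, 0, x) - (1 - alpha),
            interval = c(0, 10), extendInt = "upX")
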

    The full R script follows.

    int <- function(f, a, b) {
      integrate(f, a, b, subdivisions = 1e5, rel.tol = 1e-10)$value
    }
    
    posterior <- function(y, tau) {
      n <- length(y)
      f <- \(x) {
        p <- colMeans(outer(y, x, `<`))
        mu <- p - tau
        sigma <- n / (n - 1) * p * (1 - p)
        out <- -0.5 * log(sigma) - (n / 2) * (mu ^ 2 / sigma)
        out[sigma <= 0] <- -Inf
        exp(out)
      }
      integral <- int(f, 0, Inf)
      \(x) f(x) / integral
    }
    
    y <- rgamma(22, 2, 0.5)
    alpha <- 0.1
    tau <- 0.9
    
    post_pdf <- posterior(y, tau)
    uniroot(\(x) int(post_pdf, 0, x) - (1 - alpha), c(0, 100))
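
    As a final check (a short sketch, assuming the script above has just been run), the posterior CDF evaluated at the returned root should recover 1 - alpha:

    res <- uniroot(\(x) int(post_pdf, 0, x) - (1 - alpha), c(0, 100))
    res$root                    # the estimated upper bound q_j
    int(post_pdf, 0, res$root)  # should be very close to 0.9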