rfunctionoptimizationdistributionintegrate

Can we use Base R to find the 95% of the area under a curve?


Using Base R, I was wondering if I could determine the 95% area under the curve denoted as posterior below?

More specifically, I want to move from the mode (the green dashed line) toward the tails and then stop when I have covered 95% of the curve area. Desired are the x-axis values that are the limits of this 95% area as shown in the picture below?

     prior = function(x) dbeta(x, 15.566, 7.051) 
likelihood = function(x) dbinom(55, 100, x)
 posterior = function(x) prior(x)*likelihood(x)

mode = optimize(posterior, interval = c(0, 1), maximum = TRUE, tol = 1e-12)[[1]]

curve(posterior, n = 1e4)

P.S In other words, it is highly desirable if such an Interval be the shortest 95% interval possible.

enter image description here


Solution

  • Symmetric distribution

    Even though OP's example was not exactly symmetric, it is close enough - and useful to start there since the solution is much simpler.

    You can use a combination of integrate and optimize. I wrote this as a custom function, but note that if you use this in other situations you may have to rethink the bounds for searching the quantile.

    # For a distribution with a single peak, find the symmetric!
    # interval that contains probs probability. Search over 'range'.
    f_quan <- function(fun, probs, range=c(0,1)){
    
      mode <- optimize(fun, interval = range, maximum = TRUE, tol = 1e-12)[[1]]
    
      total_area <- integrate(fun, range[1], range[2])[[1]]
    
      O <- function(d){
        parea <- integrate(fun, mode-d, mode+d)[[1]] / total_area
        (probs - parea)^2
      }
      # Bounds for searching may need some adjustment depending on the problem!
      o <- optimize(O, c(0,range[2]/2 - 1E-02))[[1]]
    
    return(c(mode-o, mode+o))
    }
    

    Use it like this,

    f <- f_quan(posterior, 0.95)
    curve(posterior, n = 1e4)
    abline(v=f, col="blue", lwd=2, lty=3)
    

    gives

    enter image description here

    Asymmetric distribution

    In the case of an asymmetric distribution, we have to search two points that meet the criterium that P(a < x < b) = Prob, where Prob is some desired probability. Since there are infinitely many intervals (a,b) that meet this, OP suggested finding the shortest one.

    Important in the solution is the definition of a domain, the region where we want to search (we cannot use -Inf, Inf, so the user has to set this to reasonable values).

    # consider interval (a,b) on the x-axis
    # integrate our function, normalize to total area, to 
    # get the total probability in the interval
    prob_ab <- function(fun, a, b, domain){
      totarea <- integrate(fun, domain[1], domain[2])[[1]]
      integrate(fun, a, b)[[1]] / totarea
    }
    
    # now given a and the probability, invert to find b
    invert_prob_ab <- function(fun, a, prob, domain){
    
      O <- function(b, fun, a, prob){
        (prob_ab(fun, a, b, domain=domain) - prob)^2
      }
    
      b <- optimize(O, c(a, domain[2]), a = a, fun=fun, prob=prob)$minimum
    
    return(b)
    }
    
    # now find the shortest interval by varying a
    # Simplification: don't search past the mode, otherwise getting close
    # to the right-hand side of domain will give serious trouble!
    prob_int_shortest <- function(fun, prob, domain){
    
      mode <- optimize(fun, interval = domain, maximum = TRUE, tol = 1e-12)[[1]]
    
      # objective function to be minimized: the width of the interval
      O <- function(a, fun, prob, domain){
        b <- invert_prob_ab(fun, a, prob, domain)
    
        b - a
      }
    
      # shortest interval that meets criterium
      abest <- optimize(O, c(0,mode), fun=fun, prob=prob, domain=domain)$minimum
    
      # now return the interval
      b <- invert_prob_ab(fun, abest, prob, domain)
    
    return(c(abest,b))
    }
    

    Now use the above code like this. I use a very asymmetric function (just assume mydist is actually some complicated pdf, not the dgamma).

    mydist <- function(x)dgamma(x, shape=2)
    curve(mydist(x), from=0,  to=10)
    abline(v=prob_int_shortest(mydist, 0.9, c(0,10)), lty=3, col="blue", lwd=2)
    

    In this example I set domain to (0,10), since clearly the interval must be in there somewhere. Note that using a very large value like (0, 1E05) does not work, because integrate has trouble with long sequences of near-zeroes. Again, for your situation, you will have to adjust the domain (unless someone has a better idea!).

    enter image description here