Tags: r, probability, kernel-density, density-plot, log-likelihood

Estimating PDF with monotonically declining density at tails


tldr: I am numerically estimating a PDF from simulated data and I need the density to monotonically decrease outside of the 'main' density region (as x -> infinity). What I have yields a density very close to zero out there, but it does not monotonically decrease.


Detailed Problem

I am estimating a simulated maximum likelihood model, which requires me to numerically evaluate the probability density function of some random variable (whose density cannot be derived analytically) at some (observed) value x. The goal is to maximize the log-likelihood of these densities, which requires them to not have spurious local maxima.
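(For concreteness, the estimated densities enter the objective roughly like the sketch below; pdf_hat and obs are placeholder names, not my actual code.)

#sketch only: the numerically estimated PDF feeds a (negative) log-likelihood
neg_loglik <- function(pdf_hat, obs) -sum(log(pdf_hat(obs)))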

Since I do not have an analytic likelihood function, I numerically simulate the random variable by drawing its random component from some known distribution and applying a non-linear transformation to it. I save the results of this simulation in a vector named Simulated_stats_.

I then use density() to approximate the PDF and approxfun() to evaluate the PDF at x:

#some example simulation
Simulated_stats_ <- runif(n=500, 10,15)+ rnorm(n=500,mean = 15,sd = 3)
#approximation for x
approxfun(density(Simulated_stats_))(x)

This works well within the range of Simulated_stats_, see image: Example PDF. The problem is that I need to be able to evaluate the PDF far outside the range of the simulated data.

So in the image above, I would need to evaluate the PDF at, say, x=50:

approxfun(density(Simulated_stats_))(50)
[1] NA

So instead I use the from and to arguments in the density() function, which correctly approximate the near-zero tails, like so:

approxfun(
 density(Simulated_stats_, from = 0, to = max(Simulated_stats_)*10)
)(50)
[1] 1.924343e-18

Which is great, under one condition - I need the density to keep declining toward zero the further x is from the range of the simulated data. That is, if I evaluate the PDF at x=51, the result must be strictly smaller than at x=50. (Otherwise, my estimator may find local maxima far from the 'true' region, since the likelihood function is not monotonic very far from the 'main' density mass, i.e. in the extrapolated region.)

To test this I evaluated the approximated PDF at fixed intervals, took logs, and plotted. The result is discouraging: far from the main density mass the probability 'jumps' up and down. Always very close to zero, but NOT monotonically decreasing.

    a <- sapply(X = seq(from = 0, to = 100, by = 0.5), FUN = function(x){
      approxfun(
        density(Simulated_stats_, from = 0, to = max(Simulated_stats_)*10)
      )(x)
    })
    aa <- cbind(seq(from = 0, to = 100, by = 0.5), a)
    plot(aa[,1], log(aa[,2]))

Result: Non-monotonic log density far from density mass

My question

Does this happen because of the kernel estimation in density(), or is it due to inaccuracies in approxfun()? (Or something else?)

What alternative methods can I use that will deliver a monotonically declining PDF far from the simulated density mass?

Or - how can I manually change the approximated PDF so that it monotonically declines the further I am from the density mass? I would happily tack on some linear trend that goes to zero...
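To illustrate the kind of manual fix I have in mind, here is a rough sketch (monotone_pdf is a made-up name): evaluate the kernel density on a grid, then run a cumulative minimum outward from the main mass so the interpolated tails can never jump back up - not literally a linear trend, but the same spirit:

#rough sketch: force the extrapolated tails of the kernel density to be monotone
d  <- density(Simulated_stats_, from = 0, to = max(Simulated_stats_)*10)
y  <- d$y
lo <- max(which(d$x <= min(Simulated_stats_)))  #last grid point left of the data
hi <- min(which(d$x >= max(Simulated_stats_)))  #first grid point right of the data
y[1:lo] <- rev(cummin(rev(y[1:lo])))            #left tail: non-decreasing toward the mass
y[hi:length(y)] <- cummin(y[hi:length(y)])      #right tail: non-increasing away from it
monotone_pdf <- approxfun(d$x, y)               #made-up helper name
monotone_pdf(51) <= monotone_pdf(50)            #non-increasing by construction (ties possible)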

Thanks!


Solution

  • One possibility is to estimate the CDF using a beta regression model; a numerical estimate of the derivative of this model can then be used to estimate the PDF at any point. Here's an example of what I was thinking. I'm not sure if it helps you at all.

    1. Import libraries
    library(mgcv)
    library(data.table)
    library(ggplot2)
    
    2. Generate your data
    set.seed(123)
    Simulated_stats_ <- runif(n=5000, 10,15)+ rnorm(n=500,mean = 15,sd = 3)
    
    3. Function to estimate the CDF using a gam beta regression model
    get_mod <- function(ss, p = seq(0.02, 0.98, 0.02)) {
      qp = quantile(ss, probs = p)   # empirical quantiles of the simulated stats
      betamod = mgcv::gam(p ~ s(qp, bs = "cs"), family = mgcv::betar())   # regress probabilities on quantiles
      return(betamod)
    }
    
    betamod <- get_mod(Simulated_stats_)
    
    4. Very basic estimate of the PDF at val, given the model that estimates the CDF
    est_pdf <- function(val, betamod, tol=0.001) {
      xvals = c(val, val + tol)
      yvals = predict(betamod, newdata = data.frame(qp = xvals), type = "response")
      as.numeric((yvals[1] - yvals[2])/(xvals[1] - xvals[2]))   # forward-difference slope of the CDF
    }
    
    5. Let's check that the estimate is monotonically increasing below the min of Simulated_stats_
    test_x = seq(0,min(Simulated_stats_), length.out=1000)
    pdf = sapply(test_x, est_pdf, betamod=betamod)
    all(pdf == cummax(pdf))
    
    [1] TRUE
    
    6. Let's check that the estimate is monotonically decreasing above the max of Simulated_stats_
    test_x = seq(max(Simulated_stats_), 60, length.out=1000)
    pdf = sapply(test_x, est_pdf, betamod=betamod)
    all(pdf == cummin(pdf))
    
    [1] TRUE
    

    Additional thoughts 3/5/22

    As discussed in the comments, using betamod to predict might slow down the estimator. This could be resolved to a great extent by writing your own predict function directly (sketched below), but there is also another possible shortcut.
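    To illustrate the 'own predict function' idea, here is a rough sketch of my own (not part of the original approach): predict.gam() can return the linear-predictor matrix via type = "lpmatrix", and since betar() uses a logit link by default, the response-scale CDF is just plogis() applied to that matrix times the coefficients. Most of the gain comes when the matrix can be cached for a fixed set of evaluation points.

    # Sketch only: hand-rolled response-scale prediction for the beta-regression CDF.
    # Assumes betamod was fitted with the default logit link, as in get_mod() above.
    manual_cdf <- function(xvals, betamod) {
      Xp <- predict(betamod, newdata = data.frame(qp = xvals), type = "lpmatrix")
      plogis(drop(Xp %*% coef(betamod)))   # inverse-logit of the linear predictor
    }

    # should agree with the built-in response-scale predictions
    all.equal(
      as.numeric(manual_cdf(c(20, 30, 40), betamod)),
      as.numeric(predict(betamod, newdata = data.frame(qp = c(20, 30, 40)), type = "response"))
    )

    The other shortcut works as follows: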

    1. Generate estimates from the betamod over the range of X, including the extremes
    k <- sapply(seq(0,max(Simulated_stats_)*10, length.out=5000), est_pdf, betamod=betamod)
    
    2. Use the approach you were initially using, i.e. a linear interpolation across the density, but rather than interpolating the density() output, interpolate over k (i.e. over the estimates from the beta model above)
    lin_int = approxfun(x=seq(0,max(Simulated_stats_)*10, length.out=5000),y=k)
    
    3. You can use the lin_int() function for prediction in the estimator, and it will be lightning fast. Note that it produces virtually the same value for a given x
    c(est_pdf(38,betamod), lin_int(38))
    [1] 0.001245894 0.001245968
    

    and it is very fast

    microbenchmark::microbenchmark(
      list = alist("betamod" = est_pdf(38, betamod), "lin_int" = lin_int(38)), times = 100
    )
    
    Unit: microseconds
        expr    min      lq     mean  median      uq    max neval
     betamod 1157.0 1170.20 1223.304 1188.25 1211.05 2799.8   100
     lin_int    1.7    2.25    3.503    4.35    4.50   10.5   100
    

    Finally, let's check the same plot you did before, but using lin_int() instead of approxfun(density(....))

    a <- sapply(X = seq(from = 0, to = 100, by = 0.5), lin_int)
    aa <- cbind( seq(from = 0, to = 100, by = 0.5), a)
    plot(aa[,1],log(aa[,2]))
    

    Result: performance of lin_int() at the extremes