rkernel-densitycdf

R: probability / numerical integral of bivariate (or multivariate) kernel density


I am using the package ks for kernel density estimation. Here's an easy example:

n <- 70
x <- rnorm(n)

library(ks)
f_kde <- kde(x) 

I am actually interested in the respective exceeding probabilities of my input data, which can be easily returned by ks having f_kde:

p_kde <- pkde(x, f_kde)

This is done in ks with a numerical integration using Simpson's rule. Unfortunately, they only implemented this for a 1d case. In a bivariate case, there's no implementation in ks of any method for returning the probabilities :

y <- rnorm(n)
f_kde <- kde(data.frame(x,y))
# does not work, but it's what I am looking for:
p_kde <- pkde(data.frane(x,y), f_kde) 

I couldnt find any package or help searching in stackoverflow to solve this issue in R (some suggestions for Python exist, but I would like to keep it in R). Any line of code or package recommendation is appreciated. Even though I am mostly interested in the bivariate case, any ideas for a multivariate case are appreciated as well.


Solution

  • kde allows multidimensional kernel estimate, so we could use kde to calculate pkde.
    For this, we calculate kde on small enough dx and dy steps using eval.points parameter : this gives us the local density estimate on a dx*dy square.
    We verify that the sum of estimates mutiplied by the surface of the squares almost equals 1:

    library(ks)
    set.seed(1)
    n <- 10000
    x <- rnorm(n)
    y <- rnorm(n)
    xy <- cbind(x,y)
    
    xmin <- -10
    xmax <- 10
    dx <- .1
    
    ymin <- -10
    ymax <- 10
    dy <- .1
    
    pts.x <- seq(xmin, xmax, dx)
    pts.y <- seq(ymin, ymax, dy)
    pts <- as.data.frame(expand.grid(x = pts.x, y = pts.y))
    f_kde <- kde(xy,eval.points=pts)
    
    pts$est <- f_kde$estimate
    
    sum(pts$est)*dx*dy
    [1] 0.9998778
    

    You can now query the pts dataframe for the cumulative probability on the area of your choice :

    library(data.table)
    setDT(pts)
    # cumulative density
    pts[x < 1 & y < 2 , .(pkde=sum(est)*dx*dy)]
            pkde
    1: 0.7951228
    
    # average density around a point
    tolerance <-.1
    pts[pmin(abs(x-1))<tolerance & pmin(abs(y-2))<tolerance, .(kde = mean(est))]
              kde
    1: 0.01465478