rexpectation-maximization

Expectation Maximization using a Poisson likelihood function


I am trying to apply the expectation-maximization algorithm to estimate missing count data but all the packages in R, such as missMethods, assume a multivariate Gaussian distribution. How would I apply the expectation-maximization algorithm to estimate missing count data assuming a Poisson distribution?

Say we have data that look like this:

x <- c(100,  96,  79, 109, 111,  NA,  93,  95, 119,  90, 121,  96,  NA,  
       NA,  85,  95, 110,  97,  87, 104, 101,  87,  87,  NA,  89,  NA, 
       113,  NA,  95,  NA, 119, 115,  NA, 105,  NA,  80,  90, 108,  90,  
       99, 111,  93,  99,  NA,  87,  89,  87, 126, 101, 106)

Applying impute_EM using missMethods (missMethods::impute_EM(x, stochastic = FALSE)) gives an answer but the data are not continuous but discrete.

I understand that questions like these require a minimum, reproducible example, but I honestly do not know where to start. Even suggested reading to point me in the right direction would be helpful.


Solution

  • Defining x0:

    x0 <- x[!is.na(x)]
    

    The Jeffreys/reference prior for a Poisson distribution with mean lambda is 1/sqrt(lambda). From the observed values, this results in lambda having a gamma reference posterior with a shape parameter sum(x0) + 0.5 and a rate parameter 1/length(x0). You could take n samples of lambda with:

    lambda <- rgamma(n, sum(x0) + 0.5, length(x0))
    

    Then sample n missing values (xm) with

    xm <- rpois(n, lambda)
    

    Alternatively, since a Gamma-Poisson compound distribution can be formulated as a negative binomial (after integrating out lambda):

    xm <- rnbinom(n, sum(x0) + 0.5, length(x0)/(length(x0) + 1L))
    

    As a function:

    MI_poisson <- function(x, n) {
      x0 <- x[!is.na(x)]
      rbind(matrix(x0, ncol = n, nrow = length(x0)),
            matrix(rnbinom(n*(length(x) - length(x0)), sum(x0) + 0.5, length(x0)/(length(x0) + 1L)), ncol = n))
    }
    

    This will return a matrix with n columns where each column contains the original vector x with all NA values imputed. Each column could be used separately in further analysis, then the results can be aggregated.