rdata-fittingpoissonfitdistrplus

Fit Poisson binomial distribution to data


I keep hitting the same error when trying to fit distribution to data using fitdistrplus. MWE is below. In short, I want to fit a Poisson binomial distribution to some data. I'm using the poibin R package for the Poisson binomial p,d,q,r functions (I've also tried poisbinom with same error). In the MWE I create dd, the vector of successes. I'm trying to use fitdist then to fit the distribution given the starting values in the start list. The error says (I think) that I'm giving it start values that have names that aren't in the dpoibin function, which is where I'm stuck.

library(fitdistrplus)
library(poibin)
set.seed(123)
dd <- rpoibin(10, pp=seq(0.1, 0.5, length.out=10))
ppp <- runif(10)
ret <- try(fitdistrplus::fitdist(dd, distr=dpoibin,
    start=list(pp = ppp)))

Error message:

Error in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, : 'start' must specify names which are arguments to 'distr'.


Solution

  • The error comes from the function fitdistrplus:::checkparamlist, which is called by fitdist to ensure the names in the list passed to start match the parameter names in the function passed to distr. When you pass a vector like ppp as a parameter in start, checkparamlist renames each element of the vector by appending an integer. This means the argument names become "pp1", "pp2", "pp3" and so on up to "pp10". Since there is no argument being passed called pp, an error is thrown.

    I'm not sure if there is a way to estimate vectorized parameters in fitdist due to this problem, but fortunately in this case we can easily just fit the distribution ourselves.

    Since we know the mean of the distribution is

    and the variance is

    (Reference)

    Then we know that if we have a sample dd, the following function will return 0 if pp fits the distribution perfectly:

    objective <- function(pp) {
      abs(mean(dd) - sum(pp)) + abs(sum(pp * (1 - pp)) - var(dd))
    }
    

    To demonstrate this works, let's take a much larger sample from rpoibin

    set.seed(123)
    
    dd  <- poibin::rpoibin(100000, pp=seq(0.1, 0.5, length.out=10))
    ppp <- runif(10)
    

    Now we find the set of values that optimizes our objective function:

    pp_opt <- optim(par = ppp, objective)$par
    
    pp_opt
    #>  [1] 0.45594175 0.08754997 0.54250499 0.28056432 0.30363457 0.28354584
    #>  [7] 0.17861750 0.21109410 0.41562763 0.23920435
    

    We can confirm this is a good fit by plotting a histogram and overlay the output of dpoibin with our calculated values for the pp parameter:

    hist(dd, freq = FALSE, breaks = 0:11 - 0.5)
    points(0:10, poibin::dpoibin(0:10, pp = pp_opt), col = "red")
    

    Note that there could be many solutions to the optimal value of pp, and we should not expect to get seq(0.1, 0.5, length.out = 10). For a start, order does not make a difference. We can see our pp_opt has a very similar mean and variance to seq(0.1, 0.5, length.out = 10), which is all that matters in terms of fitting the distribution

    mean(seq(0.1, 0.5, length.out = 10))
    #> [1] 0.3
    mean(pp_opt)
    #> [1] 0.2998285
    
    sum((1 - pp_opt) * pp_opt)
    #> [1] 1.930687
    sum((1 - seq(0.1, 0.5, length.out = 10)) * seq(0.1, 0.5, length.out = 10))
    #> [1] 1.937037
    

    In general, it is not possible to recover pp exactly from a given sample due to the ordering and the fact that an infinite number of sets have the same distribution and calculated variance.

    Created on 2023-07-18 with reprex v2.0.2