Fit Poisson binomial distribution to data

I keep hitting the same error when trying to fit distribution to data using fitdistrplus. MWE is below. In short, I want to fit a Poisson binomial distribution to some data. I'm using the poibin R package for the Poisson binomial p,d,q,r functions (I've also tried poisbinom with same error). In the MWE I create dd, the vector of successes. I'm trying to use fitdist then to fit the distribution given the starting values in the start list. The error says (I think) that I'm giving it start values that have names that aren't in the dpoibin function, which is where I'm stuck.

library(fitdistrplus)
library(poibin)
set.seed(123)
dd <- rpoibin(10, pp=seq(0.1, 0.5, length.out=10))
ppp <- runif(10)
ret <- try(fitdistrplus::fitdist(dd, distr=dpoibin,
    start=list(pp = ppp)))

Error message:

Error in checkparamlist(arg_startfix$start.arg, arg_startfix$fix.arg, : 'start' must specify names which are arguments to 'distr'.

Solution

The error comes from the function fitdistrplus:::checkparamlist, which is called by fitdist to ensure the names in the list passed to start match the parameter names in the function passed to distr. When you pass a vector like ppp as a parameter in start, checkparamlist renames each element of the vector by appending an integer. This means the argument names become "pp1", "pp2", "pp3" and so on up to "pp10". Since there is no argument being passed called pp, an error is thrown.

I'm not sure if there is a way to estimate vectorized parameters in fitdist due to this problem, but fortunately in this case we can easily just fit the distribution ourselves.

Since we know the mean of the distribution is

$\mu=\sum\limits _{{i=1}}^{n}p_{i}$

and the variance is

$\sigma^{2}=\sum\limits _{{i=1}}^{n}(1-p_{i})p_{i}$

(Reference)

Then we know that if we have a sample dd, the following function will return 0 if pp fits the distribution perfectly:

objective <- function(pp) {
  abs(mean(dd) - sum(pp)) + abs(sum(pp * (1 - pp)) - var(dd))
}

To demonstrate this works, let's take a much larger sample from rpoibin

set.seed(123)

dd  <- poibin::rpoibin(100000, pp=seq(0.1, 0.5, length.out=10))
ppp <- runif(10)

Now we find the set of values that optimizes our objective function:

pp_opt <- optim(par = ppp, objective)$par

pp_opt
#>  [1] 0.45594175 0.08754997 0.54250499 0.28056432 0.30363457 0.28354584
#>  [7] 0.17861750 0.21109410 0.41562763 0.23920435

We can confirm this is a good fit by plotting a histogram and overlay the output of dpoibin with our calculated values for the pp parameter:

hist(dd, freq = FALSE, breaks = 0:11 - 0.5)
points(0:10, poibin::dpoibin(0:10, pp = pp_opt), col = "red")

Note that there could be many solutions to the optimal value of pp, and we should not expect to get seq(0.1, 0.5, length.out = 10). For a start, order does not make a difference. We can see our pp_opt has a very similar mean and variance to seq(0.1, 0.5, length.out = 10), which is all that matters in terms of fitting the distribution

mean(seq(0.1, 0.5, length.out = 10))
#> [1] 0.3
mean(pp_opt)
#> [1] 0.2998285

sum((1 - pp_opt) * pp_opt)
#> [1] 1.930687
sum((1 - seq(0.1, 0.5, length.out = 10)) * seq(0.1, 0.5, length.out = 10))
#> [1] 1.937037

In general, it is not possible to recover pp exactly from a given sample due to the ordering and the fact that an infinite number of sets have the same distribution and calculated variance.

^{Created on 2023-07-18 with reprex v2.0.2}