I have a vector dataset dt and only a few of 500 observations is given below.
[1] 0 4 8 13 15 1 0 5 4 6 0 0 0 1 2 3 0 0 0 0 0 3 0 10 15 0 23 0 6 0 10 2 15 0 3 0 2 0
[39] 3 0 42 4 6 12 3 2 0 0 0 2 3 9 6 0 3 2 1 6 0 0 0 0 0 1 0 0 2 0 0 0 0 7 0 0 0 1
I want to check if the distribution of data follows a negative binomial distribution or Poisson distribution better using the KS test and regression. There are only a few R packages available for the KS test of which not all give a good summary and some are no longer valid.
What I have tried?
Package iZID
install.packages("iZID")
library("iZID")
dis.kstest(dt,nsim=100,bootstrap=TRUE,distri='nb')$pvalue
dis.kstest(dt,nsim=100,bootstrap=TRUE,distri='Poisson')$pvalue
Returns
> dis.kstest(dt,nsim=100,bootstrap=TRUE,distri='nb')$pvalue
[1] 0
> dis.kstest(dt,nsim=100,bootstrap=TRUE,distri='Poisson')$pvalue
[1] 0
Question
First, With the above results it’s hard to recommend which distribution is a better fit. Second, I have also used ks.test()
in dgof
package but it needs initialization of lambda
value for Poisson which is unknown and also Negative binomial should have a probability
value initialized which is unknown. Third, I receive an error suggesting that goodfit()
in vcd
package is not available hence couldn't use this tool.
Can I use a ks-test package other than dgof
, iZID
, and vcd
?
Can I use regression to compare the AIC of two distributions using glm
in R? In that case, how to build the model? What is X and Y?
dput
formatdt <-
c(0L, 4L, 8L, 13L, 15L, 1L, 0L, 5L, 4L, 6L, 0L, 0L, 0L, 1L, 2L,
3L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 10L, 15L, 0L, 23L, 0L, 6L, 0L,
10L, 2L, 15L, 0L, 3L, 0L, 2L, 0L, 3L, 0L, 42L, 4L, 6L, 12L, 3L,
2L, 0L, 0L, 0L, 2L, 3L, 9L, 6L, 0L, 3L, 2L, 1L, 6L, 0L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 2L, 0L, 0L, 0L, 0L, 7L, 0L, 0L, 0L, 1L)
First, to answer your question about KS
test packages, yes, you can try the ks
package in R.
It provides the ks.test()
function that you could use to perform a KS test
on your dataset.
For example, to perform a KS test on your dataset using a Poisson distribution, you can use the following code (of course I am assuming that you already have installed the package, before running the below code):
install.packages("ks", dependencies = T)
library(ks)
ks.test(dt, "ppois", lambda=mean(dt))
Here,
"ppois" is the Poisson distribution,
"lambda" is an estimate for the rate parameter of the Poisson distribution
and we have set it to the mean of your data
The above code returns:
Asymptotic one-sample Kolmogorov-Smirnov test
data: dt
D = 0.44349, p-value = 2.078e-13
alternative hypothesis: two-sided
Which means:
The p-value is very small (2.078e-13), thus, suggesting a strong evidence against the null hypothesis. Therefore, you can reject the null hypothesis that the data follow the specified theoretical distribution at the significance level of 0.05, and hence, conclude that the data do not follow the theoretical distribution.
Similarly, you could use the qnbinom
distribution for the negative binomial distribution.
Now, to answer your second question about comparing the AIC of two distributions using regression, what you could use is glm()
native function in R.
All you need to do is, first, create an output variable Y
that represents the count data, and then an explanatory (input) variable X
that refers to the index of each observation in your dataset.
Now you can fit two models; one, with the Poisson Distributions and, second, with the Negative Binomial Distributions, respectively, and compare their AIC values to determine which distribution provides a better fit.
library(ks)
library(MASS)
Y <- dt # create response var
X <- 1:length(Y) # create independent (input) var
modelPD <- glm(Y ~ X, family = "poisson") # fit the Poisson Distribution
modelNBD <- glm.nb(Y ~ X) # fit the Negative Binomial Distribution from MASS packages
# compare two AICs
AIC(modelPD)
AIC(modelNBD)
Which returns:
688.5675 and 334.3564
And as we know, lower the AIC value, the better the model fits the data.
Therefore, we can conclude that the Negative Binomial Distribution
provides a better fit to the data compared to the Poisson Distribution
.
Let me know if this helps...