I have a function fitting a distribution and returning a vector consisting of distribution name, mean, sd, etc. I'm testing few distributions but I can't rely on gofstat() because it goes mad when there are too many zeros to consider.
Therefore I have to compare manually AIC for several variables, decide which ones are actually of "fitdist" class and return the name of the variable with the lowest AIC. Once I have that, I compute mean, sd, etc and return.
The code currently looks like this:
library(fitdistrplus)
fit_distr <- function(data){
fe <- tryCatch(fitdist(data, "exp"), error = function(e) FALSE )
flogis <- tryCatch(fitdist(data, "logis"), error = function(e) FALSE )
fn <- tryCatch(fitdist(data, "norm"), error = function(e) FALSE)
fp <- tryCatch(fitdist(data, "pois"), error = function(e) FALSE)
fg <- tryCatch(fitdist(data, "gamma"), error = function(e) FALSE)
classFitDist <- c(class(fe), class(flogis), class(fn), class(fp),class(fg))
distributions <- classFitDist == "fitdist"
AIC <- data.frame()
if(class(fe)=="fitdist") {AIC[1,ncol(AIC)+1] <- fe$aic}
if(class(flogis)=="fitdist") {AIC[1,ncol(AIC)+1] <- flogis$aic}
if(class(fn)=="fitdist") {AIC[1,ncol(AIC)+1] <- fn$aic}
if(class(fp)=="fitdist") {AIC[1,ncol(AIC)+1] <- fp$aic}
if(class(fg)=="fitdist") {AIC[1,ncol(AIC)+1] <- fg$aic}
names(AIC) <- c("exp", "logis", "norm", "pois", "gamma")[distributions]
fit <- names(AIC[which.min(AIC)])
mean <- switch (fit,
exp = 1/fe$estimate[[1]],
logis = flogis$estimate[[1]],
norm = fn$estimate[[1]],
pois = fp$estimate[[1]],
gamma = fg$estimate[[1]]/fg$estimate[[2]]
)
sd <- switch (fit,
exp = mean,
logis = (flogis$estimate[[2]]*pi)/sqrt(3),
norm = fn$estimate[[2]],
pois = sqrt(mean),
gamma = sqrt(fg$estimate[[1]]/(fg$estimate[[2]]^2))
)
return(c(fit,mean,sd))
}
It works, but on thousands of samples to consider is very slow. I would welcome any suggestions how to optimise it and make it 'cleaner' and faster.
Btw, this is what I had before, however like I mentioned - with samples consisting of too many zeros it was having a fit (pun unintentional!)
goodnessoffit <- gofstat(list(fe, flogis, fn, fp, fg)[distributions], fitnames = c("exp", "logis", "norm", "pois","gamma")[distributions])
fit <- names(which(goodnessoffit$aic == min(goodnessoffit$aic)))
Error in ans[!test & ok] <- rep(no, length.out = length(ans))[!test & : replacement has length zero
The problem with this approach is fitdist
is inefficient. You need to come up with faster ways of finding the AIC by writing better algorithms. One way of doing that is fitting a glm
.
AIC.fitdist <- function(x, ...) x$aic
x <- rnorm(100, mean=20)
AIC(fitdist(x, 'norm'))
AIC(glm(x ~ 1 , family=gaussian)) ## same
AIC(fitdist(x, 'gamma'))
AIC(glm(x ~ 1 , family=Gamma)) ## same
Some profiling shows that fitdist
has the same computational time as glm
. That's very bad news for fitdist
because glm
is just a bloated wrapper to glm.fit
. Using glm.fit
buys you substantial time. Lastly, if you really had to prune down time (for millions, not thousands) of models, you can use a one step estimator by
> benchmark(
+ fitdist(x, 'gamma'),
+ glm(x ~ 1, family=Gamma),
+ glm.fit(rep(1, length(x)), x, family=Gamma()),
+ glm.fit(rep(1, length(x)), x, family=Gamma(), control = glm.control(maxit=1))
+ )
test replications elapsed relative user.self sys.self user.child
1 fitdist(x, "gamma") 100 0.42 7.000 0.42 0 NA
2 glm(x ~ 1, family = Gamma) 100 0.17 2.833 0.17 0 NA
3 glm.fit(rep(1, length(x)), x, family = Gamma()) 100 0.06 1.000 0.07 0 NA
4 glm.fit(rep(1, length(x)), x, family = Gamma(), control = glm.control(maxit = 1)) 100 0.06 1.000 0.06 0 NA
sys.child
1 NA
2 NA
3 NA
4 NA
aic
is a stored object in glm.fit
output.
Exponential distribution fitting can be done with survreg
in the survival package: survreg(rep(1,100), x, dist='exponential)
.
Lastly, since these are all regular exponential families, you can use the sufficient statistics to just come up with a probability distribution. For instance:
normaic <- function(x) {
4 - 2*sum(dnorm(x, mean(x), sd(x), log=T))
}
> benchmark(normaic(x), glm.fit(rep(1, 100), x)$aic)
test replications elapsed relative user.self sys.self user.child sys.child
2 glm.fit(rep(1, 100), x)$aic 100 0.04 NA 0.05 0 NA NA
1 normaic(x) 100 0.00 NA 0.00 0 NA NA