roptimizationgoodness-of-fitfitdistrplus

Quick way of finding a name of a variable with lowest component value


I have a function fitting a distribution and returning a vector consisting of distribution name, mean, sd, etc. I'm testing few distributions but I can't rely on gofstat() because it goes mad when there are too many zeros to consider.

Therefore I have to compare manually AIC for several variables, decide which ones are actually of "fitdist" class and return the name of the variable with the lowest AIC. Once I have that, I compute mean, sd, etc and return.

The code currently looks like this:

library(fitdistrplus)

fit_distr <- function(data){

fe <- tryCatch(fitdist(data, "exp"), error = function(e) FALSE )
flogis <- tryCatch(fitdist(data, "logis"), error = function(e) FALSE )
fn <- tryCatch(fitdist(data, "norm"), error = function(e) FALSE)
fp <- tryCatch(fitdist(data, "pois"), error = function(e) FALSE)
fg <- tryCatch(fitdist(data, "gamma"), error = function(e) FALSE)

classFitDist <- c(class(fe), class(flogis), class(fn), class(fp),class(fg))

distributions <- classFitDist == "fitdist"


AIC <- data.frame()
if(class(fe)=="fitdist") {AIC[1,ncol(AIC)+1] <- fe$aic}
if(class(flogis)=="fitdist") {AIC[1,ncol(AIC)+1] <- flogis$aic}
if(class(fn)=="fitdist") {AIC[1,ncol(AIC)+1] <- fn$aic}
if(class(fp)=="fitdist") {AIC[1,ncol(AIC)+1] <- fp$aic}
if(class(fg)=="fitdist") {AIC[1,ncol(AIC)+1] <- fg$aic}

names(AIC) <- c("exp", "logis", "norm", "pois", "gamma")[distributions]

fit <- names(AIC[which.min(AIC)])


mean <- switch (fit,
                          exp = 1/fe$estimate[[1]],
                          logis = flogis$estimate[[1]],
                          norm = fn$estimate[[1]],
                          pois = fp$estimate[[1]],
                          gamma = fg$estimate[[1]]/fg$estimate[[2]]
          )

sd <- switch (fit,
                        exp = mean,
                        logis = (flogis$estimate[[2]]*pi)/sqrt(3),
                        norm = fn$estimate[[2]],
                        pois = sqrt(mean),
                        gamma = sqrt(fg$estimate[[1]]/(fg$estimate[[2]]^2))
          )

return(c(fit,mean,sd))
}

It works, but on thousands of samples to consider is very slow. I would welcome any suggestions how to optimise it and make it 'cleaner' and faster.

Btw, this is what I had before, however like I mentioned - with samples consisting of too many zeros it was having a fit (pun unintentional!)

goodnessoffit <- gofstat(list(fe, flogis, fn, fp, fg)[distributions],  fitnames = c("exp", "logis", "norm", "pois","gamma")[distributions])
fit <- names(which(goodnessoffit$aic == min(goodnessoffit$aic)))

Error in ans[!test & ok] <- rep(no, length.out = length(ans))[!test & : replacement has length zero


Solution

  • The problem with this approach is fitdist is inefficient. You need to come up with faster ways of finding the AIC by writing better algorithms. One way of doing that is fitting a glm.

    AIC.fitdist <- function(x, ...) x$aic
    
    x <- rnorm(100, mean=20)
    
    AIC(fitdist(x, 'norm'))
    AIC(glm(x ~ 1 , family=gaussian)) ## same
    
    AIC(fitdist(x, 'gamma'))
    AIC(glm(x ~ 1 , family=Gamma)) ## same
    

    Some profiling shows that fitdist has the same computational time as glm. That's very bad news for fitdist because glm is just a bloated wrapper to glm.fit. Using glm.fit buys you substantial time. Lastly, if you really had to prune down time (for millions, not thousands) of models, you can use a one step estimator by

    > benchmark(
    +     fitdist(x, 'gamma'),
    +     glm(x ~ 1, family=Gamma),
    +     glm.fit(rep(1, length(x)), x, family=Gamma()),
    +     glm.fit(rep(1, length(x)), x, family=Gamma(), control = glm.control(maxit=1))
    + )
                                                                                   test replications elapsed relative user.self sys.self user.child
    1                                                               fitdist(x, "gamma")          100    0.42    7.000      0.42        0         NA
    2                                                        glm(x ~ 1, family = Gamma)          100    0.17    2.833      0.17        0         NA
    3                                   glm.fit(rep(1, length(x)), x, family = Gamma())          100    0.06    1.000      0.07        0         NA
    4 glm.fit(rep(1, length(x)), x, family = Gamma(), control = glm.control(maxit = 1))          100    0.06    1.000      0.06        0         NA
      sys.child
    1        NA
    2        NA
    3        NA
    4        NA
    

    aic is a stored object in glm.fit output.

    Exponential distribution fitting can be done with survreg in the survival package: survreg(rep(1,100), x, dist='exponential).

    Lastly, since these are all regular exponential families, you can use the sufficient statistics to just come up with a probability distribution. For instance:

    normaic <- function(x) {
      4 - 2*sum(dnorm(x, mean(x), sd(x), log=T))
    }
    
    > benchmark(normaic(x), glm.fit(rep(1, 100), x)$aic)
                             test replications elapsed relative user.self sys.self user.child sys.child
    2 glm.fit(rep(1, 100), x)$aic          100    0.04       NA      0.05        0         NA        NA
    1                  normaic(x)          100    0.00       NA      0.00        0         NA        NA