rlmmodel-comparison

AIC different between biglm and lm


I have been trying to use biglm to run linear regressions on a large dataset (approx 60,000,000 lines). I want to use AIC for model selection. However I discovered when playing with biglm on smaller datasets that the AIC variables returned by biglm are different from those returned by lm. This even applies to the example in the biglm help.

data(trees)
ff<-log(Volume)~log(Girth)+log(Height)

chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]

library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)

AIC(a)#48.18546

a_lm <- lm(ff, trees)
AIC(a_lm)#-62.71125

Can someone please explain what is happening here? Are the AICs generated with biglm safe to use for comparing biglm models on the same dataset?


Solution

  • tl;dr it looks to me like there is a pretty obvious bug in the AIC method for biglm-class objects (more specifically, in the update method), in the current (0.9-1) version, but the author of the biglm package is a smart, experienced guy, and biglm is widely used, so perhaps I'm missing something. Googling for "biglm AIC df.resid", it seems this has been discussed way back in 2009? Update: the package author/maintainer reports via e-mail that this is indeed a bug.

    Something funny seems to be going on here. The differences in AIC between models should be the same across modeling frameworks, whatever the constants that have been used and however parameters are counted (because these constants and parameter-counting should be consistent within modeling frameworks ...)

    Original example:

    data(trees)
    ff <- log(Volume)~log(Girth)+log(Height)
    chunk1<-trees[1:10,]
    chunk2<-trees[11:20,]
    chunk3<-trees[21:31,]
    library(biglm)
    a <- biglm(ff,chunk1)
    a <- update(a,chunk2)
    a <- update(a,chunk3)
    a_lm <- lm(ff, trees)
    

    Now fit a reduced model:

    ff2 <- log(Volume)~log(Girth)    
    a2 <- biglm(ff2, chunk1)
    a2 <- update(a2, chunk2)
    a2 <- update(a2 ,chunk3)
    a2_lm <- lm(ff2,trees)
    

    Now compare AIC values:

    AIC(a)-AIC(a2)
    ## [1] 1.80222
    
    AIC(a_lm)-AIC(a2_lm)
    ## [1] -20.50022
    

    Check that we haven't screwed something up:

    all.equal(coef(a),coef(a_lm))  ## TRUE
    all.equal(coef(a2),coef(a2_lm))  ## TRUE
    

    Look under the hood:

    biglm:::AIC.biglm
    ## function (object, ..., k = 2) 
    ##    deviance(object) + k * (object$n - object$df.resid)
    

    In principle this is the right formula (number of observations minus residual df should be the number of parameters fitted), but digging in, it looks like the $df.resid component of the objects hasn't been updated properly:

    a$n  ## 31, correct
    a$df.resid  ## 7, only valid before updating!
    

    Looking at biglm:::update.biglm, I would add

    object$df.resid <- object$df.resid + NROW(mm)
    

    right before or after the line that reads

    object$n <- object$n + NROW(mm)
    

    ...

    This seems like a fairly obvious bug to me, so perhaps I'm missing something obvious, or perhaps it has been fixed.

    A simple workaround would be to define your own AIC function as

    AIC.biglm <- function (object, ..., k = 2) {
        deviance(object) + k * length(coef(object))
    }
    
    AIC(a)-AIC(a2)  ## matches results from lm()
    

    (although note that AIC(a_lm) is still not equal to AIC(a), because stats:::AIC.default() uses 2*log-likelihood rather than deviance (these two measures differ in their additive coefficients) ...)