
Error when using broom (augment) and modelr (crossv_kfold) on glm with an offset term


I am trying to fit a Poisson regression model on a k-fold cross-validated data set using modelr's crossv_kfold and then get predictions using broom's augment function. The data contain a count that I want to predict, but it needs to be offset by an exposure variable. For the sake of reproducibility, I've included an augmented dataset to illustrate.

library(tidyverse)
library(modelr)
non_breaks = rpois(dim(warpbreaks)[1],20)
warp = warpbreaks %>%
    mutate(total = breaks + non_breaks)

In this example I model the number of breaks on the given categorical variables, offset by the exposure, total. If I don't include an offset term in the model, everything works fine:

library(broom)
warp_no_offset = crossv_kfold(warp, k = 10) %>%
    mutate(model = map(train, ~ glm(breaks~ wool*tension, ., family=poisson))) %>%
    mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y, type.predict = "response")))

But if I include an offset term:

warp_offset = crossv_kfold(warp, k = 10) %>%
    mutate(model = map(train, ~ glm(breaks~ offset(log(total)) + wool*tension, ., family=poisson))) %>%
    mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y, type.predict = "response")))

it throws the error:

Error in mutate_impl(.data, dots) : 
    Evaluation error: arguments imply differing number of rows: 5, 49.

Solution

  • The problem is that offset() is not being evaluated how and when you think it is. I can see how this was tricky, but the solution is simple.

    You just need to wrap the term in I(), as you would for other transformations inside a formula.

    For example:

    warp_offset = crossv_kfold(warp, k = 10) %>%
      mutate(model = map(train, ~ glm(breaks~ I(offset(log(total))) + wool*tension, ., family=poisson))) %>%
      mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y, type.predict = "response")))
    

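    will throw no error. Be aware, though, that once the term is wrapped in I(), glm no longer treats it as an offset: log(total) enters the model as an ordinary predictor with its own estimated coefficient instead of a coefficient fixed at 1, so check that this is the model you want.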
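
To see what the I() wrapper does to the model, and how predictions for held-out rows can be obtained with the offset kept as a true offset, here is a minimal sketch in base R. It assumes a single train/test split standing in for one fold. The names test_idx, fit_offset and fit_I and the call to set.seed(1) are illustrative choices that do not appear in the original post, and the sketch calls predict() directly rather than augment().

```r
# Assumption: base R only; `total` is simulated as in the question.
set.seed(1)
warp <- warpbreaks
warp$total <- warp$breaks + rpois(nrow(warp), 20)

# Hold out five rows as a stand-in for one cross-validation fold.
test_idx <- sample(nrow(warp), 5)
train <- warp[-test_idx, ]
test  <- warp[test_idx, ]

# A true offset: log(total) enters with its coefficient fixed at 1,
# so no coefficient is estimated for it.
fit_offset <- glm(breaks ~ offset(log(total)) + wool * tension,
                  data = train, family = poisson)

# Wrapped in I(): log(total) is treated as an ordinary predictor
# and receives its own estimated coefficient.
fit_I <- glm(breaks ~ I(offset(log(total))) + wool * tension,
             data = train, family = poisson)

# Compare the coefficient names of the two fits.
names(coef(fit_offset))
names(coef(fit_I))

# predict() re-evaluates the offset term on the new rows,
# so the predictions are expected counts given each row's exposure.
pred <- predict(fit_offset, newdata = test, type = "response")
pred
```

If fit_I lists one more coefficient than fit_offset, that extra term is the estimated slope for log(total), which means the two fits are different models and their predictions are not interchangeable.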