I am trying to fit a Poisson regression model on a k-fold cross-validated data set using modelr's crossv_kfold and then get predictions using broom's augment function. In the data I'm modeling I have a count I'm trying to predict, but it needs to be offset by an exposure variable. For the sake of reproducibility I've included an augmented dataset to illustrate.
library(tidyverse)
library(modelr)
non_breaks = rpois(nrow(warpbreaks), 20)
warp = warpbreaks %>%
  mutate(total = breaks + non_breaks)
So in this example I would be modeling the number of breaks on the given categorical variables and offset by the exposure, total. I'm finding that if I don't include an offset term in my model, everything works perfectly fine:
library(broom)
warp_no_offset = crossv_kfold(warp, k = 10) %>%
  mutate(model = map(train, ~ glm(breaks ~ wool*tension, ., family = poisson))) %>%
  mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y, type.predict = "response")))
But if I include an offset term:
warp_offset = crossv_kfold(warp, k = 10) %>%
  mutate(model = map(train, ~ glm(breaks ~ offset(log(total)) + wool*tension, ., family = poisson))) %>%
  mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y, type.predict = "response")))
it throws the error:
Error in mutate_impl(.data, dots) :
Evaluation error: arguments imply differing number of rows: 5, 49.
The problem is that offset() is not being evaluated how and when you think it is when predictions are made on new data. I can see how this was tricky, but the workaround is simple.
You can wrap the term in I(), as you would for other transformations inside a formula.
For example:
warp_offset = crossv_kfold(warp, k = 10) %>%
  mutate(model = map(train, ~ glm(breaks ~ I(offset(log(total))) + wool*tension, ., family = poisson))) %>%
  mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y, type.predict = "response")))
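will throw no error. Be aware, though, that once offset() is wrapped in I(), glm no longer treats the term as an offset: log(total) becomes an ordinary covariate with an estimated coefficient instead of one fixed at 1, so check that this is the model you want.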
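To see how glm treats the two formulas, here is a minimal base-R sketch (no modelr or broom). It rebuilds the warp data from the question, compares the coefficients of the two fits, and predicts on a few held-out rows passed as a plain data frame. The seed, the names fit_offset, fit_wrapped and fit_train, and the choice of held-out rows are illustrative assumptions, not part of the original code.

```r
# Base R only; `warp` is rebuilt here so the block runs on its own.
set.seed(42)  # arbitrary seed, used only to make this sketch reproducible
warp <- transform(warpbreaks, total = breaks + rpois(nrow(warpbreaks), 20))

# True offset: log(total) enters the linear predictor with its coefficient fixed at 1.
fit_offset <- glm(breaks ~ offset(log(total)) + wool * tension,
                  data = warp, family = poisson)

# Wrapped in I(): the same expression is treated as an ordinary covariate.
fit_wrapped <- glm(breaks ~ I(offset(log(total))) + wool * tension,
                   data = warp, family = poisson)

names(coef(fit_offset))   # no entry for log(total)
names(coef(fit_wrapped))  # has an extra, estimated coefficient for the wrapped term

# Held-out prediction with the true offset, using a plain data frame as newdata.
test_rows <- 1:5  # hypothetical held-out fold
fit_train <- glm(breaks ~ offset(log(total)) + wool * tension,
                 data = warp[-test_rows, ], family = poisson)
pred <- predict(fit_train, newdata = warp[test_rows, ], type = "response")
pred
```

If the fixed-coefficient offset is what you need, one option in the cross-validation pipeline is to convert each test fold to a data frame before predicting, for example with as.data.frame(.y); I have not verified that this removes the augment error in every broom version, so treat it as something to try.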