Consider a simple dataset, split into a training and testing set:
dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1))
train <- dat[1:4,]
train
# x y z
# 1 1 a 0
# 2 2 b 0
# 3 3 c 1
# 4 4 d 0
test <- dat[5,]
test
# x y z
# 5 5 e 1
When I train a logistic regression model to predict z using x and obtain test-set predictions, all is well:
mod <- glm(z~x, data=train, family="binomial")
predict(mod, newdata=test, type="response")
# 5
# 0.5546394
However, this fails on an equivalent-looking logistic regression model with a "Factor has new levels" error:
mod2 <- glm(z~.-y, data=train, family="binomial")
predict(mod2, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor y has new level e
Since I removed y from my model equation, I'm surprised to see this error message. In my application, dat is very wide, so z~.-y is the most convenient model specification. The simplest workaround I can think of is removing the y variable from my data frame and then training the model with the z~. syntax, but I was hoping for a way to use the original dataset without the need to remove columns.
You could try updating mod2$xlevels[["y"]] in the model object so that it also contains the levels that appear in the test set:
mod2 <- glm(z~.-y, data=train, family="binomial")
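# as.character() keeps this working whether y is a factor or a character column
# (levels() returns NULL for a character vector, so the union would add nothing)
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], unique(as.character(test$y)))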
predict(mod2, newdata=test, type="response")
# 5
# 0.5546394
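To see why the error appears even though y was subtracted in the formula, it may help to inspect the fitted object. The sketch below rebuilds the question's data and looks at three pieces of mod2: the term labels (what actually enters the model matrix), the variables referenced by the terms object (what model.frame still pulls from newdata), and the stored xlevels (what predict checks new data against). It uses only base R and the stats package.

```r
# Rebuild the example from the question
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1))
train <- dat[1:4, ]

mod2 <- glm(z ~ . - y, data = train, family = "binomial")

# Terms that actually enter the model matrix
term_labels <- attr(terms(mod2), "term.labels")
term_labels

# Variables the terms object still refers to, and hence still
# requests from newdata when predict() builds its model frame
vars_used <- all.vars(terms(mod2))
vars_used

# Levels recorded at fit time; predict() checks newdata against these
mod2$xlevels
```

Subtracting y removes it from the fitted terms, but the variable still seems to be carried along in the model frame, which is why its levels are checked at prediction time.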
Another option would be to exclude (but not remove) "y" from the training data by passing glm a column subset, which leaves train itself unchanged:
mod2 <- glm(z~., data=train[,!colnames(train) %in% c("y")], family="binomial")
predict(mod2, newdata=test, type="response")
# 5
# 0.5546394
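For a very wide data frame, a third possibility is to build the formula from the column names, so that y never becomes part of the model's terms and both train and test can be used unmodified. This is only a sketch: the names response, dropped, predictors, f and mod3 are hypothetical, introduced here for illustration, and reformulate() from the stats package is used to assemble the formula.

```r
# Rebuild the example from the question
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"), z = c(0, 0, 1, 0, 1))
train <- dat[1:4, ]
test <- dat[5, ]

# Hypothetical helper names, not from the original post
response <- "z"
dropped <- c("y")  # columns to leave out of the model

# All remaining columns become predictors
predictors <- setdiff(names(train), c(response, dropped))

# Build the formula explicitly instead of using the "." shorthand
f <- reformulate(predictors, response = response)

mod3 <- glm(f, data = train, family = "binomial")

# y is not referenced by the model, so its unseen level in test is never checked
pred <- predict(mod3, newdata = test, type = "response")
pred
```

Because the formula lists only the wanted columns, the unused ones can stay in the data frame, at the cost of one extra line to compute the predictor names.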