rclassificationr-caretvgam

Error adding interactions to custom VGAM::vglm model in caret


I've built a custom model with caret using vglm() from VGAM. It works fine with simple effects but when I try to add interactions it fails with a object 'x1:x2' not found error message, where x1 and x2 are predictor variables that I've entered into the model as an interaction. The problem has to do with the predictions, and unless I'm mistaken it seems to occur because either predict.train or predictvglm tries to use x1:x2 to predict classes.

I've provided a working example below.

# Set up data
set.seed(123)
n <- 100
x1 <- rnorm(n, 175, 7)
x2 <- rnorm(n, 30, 8)
cont <- 0.5 * x1 - 0.3 * x2 + 10 + rnorm(n, 0, 6)
y  <- cut(cont, breaks = quantile(cont), include.lowest = TRUE,
             labels = c("A", "B", "C", "D"), ordered = TRUE)
d <- data.frame(x1, x2, y)

# My custom caret function
vglmTrain <- list(
  label = "VGAM prop odds",
  library = "VGAM",
  loop = NULL,
  type = "Classification",
  parameters = data.frame(parameter = "parameter",
                          class = "character",
                          label = "parameter"),
  grid = function(x, y,
                  len = NULL, search = "grid") data.frame(parameter = "none"),
  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
    dat <- if(is.data.frame(x)) x else as.data.frame(x)
    dat$.outcome <- y
    if(!is.null(wts))
    {
      out <- vglm(.outcome ~ ., propodds, data = dat, weights = wts, ...)
    } else {
      out <- vglm(.outcome ~ ., propodds, data = dat, ...)
    }
    out
  },
  predict = function(modelFit, newdata, preProc = NULL, submodels = NULL) {
    probs <- predict(modelFit, data.frame(newdata), type = "response")

    predClass <- function (x) {
      n <- colnames(x)
      factor(as.vector(apply(x, 1, which.max)),
             levels = 1:length(n),
             labels = n)
    }
    predClass(probs)
  },
  prob = function(modelFit, newdata, preProc = NULL, submodels = NULL)
    predict(modelFit, data.frame(newdata), type = "response"),
  predictors = function(x, ...) names(attributes(terms(x))$dataClasses[-1]),
  levels = function(x) x@misc$ynames,
  sort = function(x) x)

Now if I try to use the function, it exits with an error if I supply the formula with an interaction.

# Load caret
library(caret)

ctrl <- trainControl(method = "cv", number = 2, verboseIter = T)

# A model with no interactions - works
f1 <- train(y ~ x1 + x2, data = d,
           method = vglmTrain,
           trControl = ctrl)

# A model with interactions - fails
f2 <- train(y ~ x1*x2, data = d,
            method = vglmTrain,
            trControl = ctrl)

Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning messages:
1: In eval(expr, envir, enclos) :
  predictions failed for Fold1: parameter=none Error in eval(expr, envir, enclos) : object 'x1:x2' not found

2: In eval(expr, envir, enclos) :
  predictions failed for Fold2: parameter=none Error in eval(expr, envir, enclos) : object 'x1:x2' not found

3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Here's my sessionInfo():

> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] splines   stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] VGAM_1.0-0      caret_6.0-64    ggplot2_2.1.0   lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3        magrittr_1.5       MASS_7.3-45        munsell_0.4.3      colorspace_1.2-6   foreach_1.4.3      minqa_1.2.4        stringr_1.0.0      car_2.1-1         
[10] plyr_1.8.3         tools_3.2.4        nnet_7.3-12        pbkrtest_0.4-6     parallel_3.2.4     grid_3.2.4         gtable_0.2.0       nlme_3.1-125       mgcv_1.8-12       
[19] quantreg_5.21      e1071_1.6-7        class_7.3-14       MatrixModels_0.4-1 iterators_1.0.8    lme4_1.1-11        Matrix_1.2-3       nloptr_1.0.4       reshape2_1.4.1    
[28] codetools_0.2-14   stringi_1.0-1      compiler_3.2.4     scales_0.4.0       SparseM_1.7       

Does anybody have any idea how to fix this?


Solution

  • Caret does handle interactions. I found a workaround however. You can first call model.matrix to create a matrix with the interactions. You need to remove the intercept as well.

    Using your f2 as an example, we specify the data not as a formula, but as x and y. The x contains the model.matrix specification with the interactions and the -1 removes the intercept. This transformed as a data.frame and your are set to go.

    f2 <- train(y = y, x = data.frame(model.matrix(y ~ x1*x2 - 1, data = d)),
                method = vglmTrain,
                trControl = ctrl)
    

    EDIT:

    After debugging the train.default and checking your modeltype specification and some others, I found a check that is done in the caret models and not in yours. The check has to do with the predict and probs functions. These both have a check on the Dataframe. If you add this check to both of these functions it works as expected.

    if (!is.data.frame(newdata)) 
      newdata <- as.data.frame(newdata)
    

    The whole function will then be:

    vglmTrain <- list(
      label = "VGAM prop odds",
      library = "VGAM",
      loop = NULL,
      type = "Classification",
      parameters = data.frame(parameter = "parameter",
                              class = "character",
                              label = "parameter"),
      grid = function(x, y,
                      len = NULL, search = "grid") data.frame(parameter = "none"),
      fit = function(x, y, wts, param, lev, last, classProbs, ...) {
        dat <- if(is.data.frame(x)) x else as.data.frame(x)
        dat$.outcome <- y
        if(!is.null(wts))
        {
          out <- vglm(.outcome ~ ., propodds, data = dat, weights = wts, ...)
        } else {
          out <- vglm(.outcome ~ ., propodds, data = dat, ...)
        }
        out
      },
      predict = function(modelFit, newdata, preProc = NULL, submodels = NULL) {
    
        if (!is.data.frame(newdata)) 
          newdata <- as.data.frame(newdata)
        probs <- predict(modelFit, newdata, type = "response")
    
        predClass <- function (x) {
          n <- colnames(x)
          factor(as.vector(apply(x, 1, which.max)),
                 levels = 1:length(n),
                 labels = n)
        }
        predClass(probs)
      },
      prob = function(modelFit, newdata, preProc = NULL, submodels = NULL) {
        if (!is.data.frame(newdata)) 
          newdata <- as.data.frame(newdata)
    
        predict(modelFit, newdata, type = "response")
      },
    
      levels = function(x) x@misc$ynames,
    
      tags = c("Cumulative Link", "Logistic Regression", "Accepts Case Weights",
               "Probit", "Logit"),
    
      sort = function(x) x)