r glmnet tidymodels

Error in augment function for lasso models


Background

I have a data set on which I have fitted a lasso model using the tidymodels package. I have now artificially created a new data set on which I would like to use this model to make predictions. This is a classification exercise, where 10 predictors are used to predict one binary variable.

Issue

When I run augment(extract_fit_parsnip(lasso_final_fit), new_data = new_investors) %>% glimpse(), R throws the following error: Error in new_data[, rownames(object$fit$beta), drop = FALSE] : subscript out of bounds.

My model was coded as follows:

log_rec <- recipe(defensive ~ ., data = modelling_train) %>%
  step_select(-date) %>%
  step_novel(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors())
lasso_final_spec <- logistic_reg(engine = "glmnet",
                                 penalty = 0.03,
                                 mixture = 1)
lasso_final_wf <- workflow(preprocessor = log_rec,
                           spec = lasso_final_spec)
lasso_final_fit <- last_fit(lasso_final_wf,
                            split = modelling_split,
                            metrics = metric_set(accuracy, sens, spec, roc_auc))

My new_investors data set was created as follows:

new_investors <- data.frame(sex = as.factor(sample(c("female", "male"), 100, T)),
                            date = as_date(sample(c("2020-03-15", "2021-03-15"), 100, T)),
                            age = sample(seq(min(modelling %>% pull(age)),
                                             max(modelling %>% pull(age))), 100, T),
                            tx = sample(seq(min(modelling %>% pull(tx)),
                                            max(modelling %>% pull(tx))), 100, T),
                            risk_before = sample(c(0, 0.25, 0.5, 0.75, 1), 100, T),
                            risk_after = sample(c(0, 0.25, 0.5, 0.75, 1), 100, T),
                            fund_value = rnorm(100, mean(cleaned %>% pull(fund_value), na.rm = T),
                                               sd(cleaned %>% pull(fund_value), na.rm = T)))

  1. I have compared new_investors to modelling_train and am very sure that the predictors I used in the model are all contained in new_investors. The column names are the same as well.
  2. I saw elsewhere that the predict function only accepts data.matrix data sets, so I tried that as well, but it did not work either.
  3. As an extension of this exercise, I was also required to build a random forest model, and when I use the exact same new_investors data set in the augment code for that model, the predictions come out perfectly fine.

I must also apologise in advance for not producing an MWE. I know this is what is usually needed, but I am not sure how to produce an MWE in this case.

What could possibly be the issue here? Any intuitive explanations will be greatly appreciated :)


Solution

  • It's hard to tell without the data, but I think you need to make predictions from the workflow. Using just the parsnip fit skips the preprocessing in the recipe. As a result, the model gets different data (sometimes silently) from what it would have received had the data been processed by the recipe.
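
    To see why the column lookup can fail on unprocessed data, here is a minimal base-R sketch. It is not parsnip's actual code: the data, the coefficient names (age, sex_male) and the hand-made dummy column are all hypothetical, and I am assuming the glmnet coefficients carry the post-recipe column names, as the error message suggests.

```r
# Hypothetical raw predictors, as they look before the recipe runs
raw <- data.frame(sex = factor(c("female", "male", "male")),
                  age = c(30, 45, 52))

# Hypothetical coefficient names after step_dummy(): "sex" became "sex_male"
coef_names <- c("age", "sex_male")

# Selecting the trained columns from the unprocessed data (as a matrix)
# fails, because "sex_male" does not exist yet
raw_mat <- data.matrix(raw)
res <- tryCatch(raw_mat[, coef_names, drop = FALSE],
                error = function(e) conditionMessage(e))
print(res)

# After dummy encoding (done by hand here), the same selection succeeds
processed <- cbind(age = raw$age,
                   sex_male = as.numeric(raw$sex == "male"))
ok <- processed[, coef_names, drop = FALSE]
print(ok)
```

    This would also be consistent with the random forest working on the same data, if that model was fitted without dummy variables, although I can't confirm that without seeing the code.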

    Here's an example of predicting from the fitted workflow:

    library(tidymodels)
    
    data("two_class_dat")
    
    set.seed(1)
    split <- initial_split(two_class_dat)
    
    lasso_final_spec <- logistic_reg(engine = "glmnet",
                                     penalty = 0.03,
                                     mixture = 1)
    lasso_final_wf <- workflow(preprocessor = Class ~ .,
                               spec = lasso_final_spec)
    lasso_final_fit <- last_fit(lasso_final_wf,
                                split = split,
                                metrics = metric_set(accuracy, sens, spec, roc_auc))
    
    # Get the fitted workflow out
    lasso_final_wflow <- 
      lasso_final_fit %>% 
      extract_workflow()
    
    set.seed(2)
    new_dat <- tibble(A = runif(3), B = runif(3))
    lasso_final_wflow %>% augment(new_dat) 
    #> # A tibble: 3 × 5
    #>       A     B .pred_class .pred_Class1 .pred_Class2
    #>   <dbl> <dbl> <fct>              <dbl>        <dbl>
    #> 1 0.185 0.168 Class1             0.945       0.0554
    #> 2 0.702 0.944 Class1             0.775       0.225 
    #> 3 0.573 0.943 Class1             0.771       0.229
    

    Created on 2022-10-11 by the reprex package (v2.0.1)

    Also, glance() pulls things out of the fitted model, so you can't use it on a set of new predictions. I suggest saving your metric set as an object and using it to evaluate these predictions (assuming you know their labels).
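
    A minimal sketch of that last suggestion, reusing two_class_dat. The names cls_metrics and labelled_new are made up for illustration, and the test split only stands in for new data whose labels you know.

```r
library(tidymodels)

data("two_class_dat")

set.seed(1)
split <- initial_split(two_class_dat)

# Save the metric set once so it can be reused on any set of predictions
cls_metrics <- metric_set(accuracy, sens, spec, roc_auc)

lasso_final_wf <- workflow(preprocessor = Class ~ .,
                           spec = logistic_reg(engine = "glmnet",
                                               penalty = 0.03,
                                               mixture = 1))
lasso_final_fit <- last_fit(lasso_final_wf,
                            split = split,
                            metrics = cls_metrics)

# Stand-in for new data with known labels (hypothetical)
labelled_new <- testing(split)

new_preds <- lasso_final_fit %>%
  extract_workflow() %>%
  augment(labelled_new)

# Class metrics use `estimate`; roc_auc uses the probability column
results <- cls_metrics(new_preds,
                       truth = Class,
                       estimate = .pred_class,
                       .pred_Class1)
results
```

    The probability column passed at the end should be the one for the event level, which is the first factor level (Class1) by default.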