r, tidymodels, recipe, r-parsnip

How can I tune the `step_impute_knn` function from the recipes package?


I want to use the step_impute_knn function from the recipes package to impute missing values. This function uses the Gower distance as its distance metric, which is suitable when the predictors are a mixture of categorical and continuous data. As far as I can see, though, there is no way to use this function with tune(), since tuning has to be done on a (parsnip) model, and the only relevant parsnip model, nearest_neighbor, doesn't offer the Gower distance as an option.

Sample data:

train <- structure(list(PassengerId = c("0001_01", "0002_01", "0003_01", 
"0003_02", "0004_01", "0005_01"), HomePlanet = c("Europa", "Earth", 
"Europa", "Europa", "Earth", NA), CryoSleep = c("False", 
"False", "False", "False", "False", "False"), Cabin = c("B/0/P", 
"F/0/S", "A/0/S", "A/0/S", "F/1/S", "F/0/P"), Destination = c("TRAPPIST-1e", 
"TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e", "PSO J318.5-22"
), Age = c(39, 24, 58, 33, 16, 44), VIP = c("False", "False", 
"True", "False", "False", "False"), RoomService = c(0, 109, 43, 
0, 303, 0), FoodCourt = c(0, 9, 3576, 1283, 70, 483), ShoppingMall = c(0, 
25, 0, 371, 151, 0), Spa = c(0, 549, 6715, 3329, 565, 291), VRDeck = c(0, 
44, 49, 193, 2, 0), Name = c("Maham Ofracculy", "Juanna Vines", 
"Altark Susent", "Solam Susent", "Willy Santantines", "Sandie Hinetthews"
), Transported = c("False", "True", "False", "False", "True", 
"True")), row.names = c(NA, 6L), class = "data.frame")

What I have so far:

train_no_na <- train %>%
  na.omit()

imp_knn_blueprint <- recipe(Transported ~ ., data = train_no_na) %>%
  step_impute_knn(HomePlanet,
                  impute_with = imp_vars(all_predictors()), neighbors = 5,
                  options = list(nthread = 1, eps = 1e-08))

imp_knn_prep <- prep(imp_knn_blueprint, training = train_no_na)
imp_knn_5 <- bake(imp_knn_prep, new_data = train)

Is there some way to use the tidymodels and parsnip workflows to tune the kNN routine that step_impute_knn uses internally? I've tried reading the code for the function, but I can't see which engine it uses.

EDIT: To be clear, I'd like to tune the neighbors parameter of step_impute_knn via some grid search, rather than having to do it manually.


Solution

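  • You can mark neighbors in step_impute_knn with tune(), just like any other tunable argument of a recipe step. No parsnip nearest-neighbor model is involved: the step does its own neighbor search with gower::gower_topn(), which is also where the options list is passed. The tuned recipe only needs to sit in a workflow together with some model so that each candidate value can be scored on resamples.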

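    library(tidymodels)

    # parsnip classification models expect a factor outcome
    train_no_na <- train_no_na %>%
      mutate(Transported = factor(Transported))

    train_folds <- vfold_cv(train_no_na, v = 3)

    # Mark neighbors for tuning instead of fixing it at 5
    imp_knn_blueprint <- recipe(Transported ~ ., data = train_no_na) %>%
      step_impute_knn(HomePlanet,
                      impute_with = imp_vars(all_predictors()), neighbors = tune(),
                      options = list(nthread = 1, eps = 1e-08))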
    
    log_spec <- logistic_reg()
    
    # Update range as appropriate
    knn_params <- extract_parameter_set_dials(imp_knn_blueprint) %>%
      update(neighbors = neighbors(c(1L, 10L)))
    
    knn_grid <- grid_regular(knn_params,
                             levels = c(
                              neighbors = 10
                             ))
    
    knn_wf <- 
      workflow() %>%
      add_model(log_spec) %>%
      add_recipe(imp_knn_blueprint)
    
    impute_knn_tune <-
      knn_wf %>%
      tune_grid(
        train_folds,
        grid = knn_grid,
        metrics = metric_set(roc_auc, accuracy)
      )
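
Once the grid search has run, the remaining steps are to inspect the metrics, pick a value of neighbors, and plug it back into the workflow. The following is a minimal, self-contained sketch of that whole loop rather than a definitive recipe. It assumes a hypothetical simulated data set (`toy`, with made-up columns `age`, `spend`, `vip`, `planet`, `transported`) instead of the data from the question, and it keeps the missing values in the resampled data so that different values of neighbors can actually lead to different imputations.

```r
library(tidymodels)

set.seed(123)

# Hypothetical toy data (not the data from the question):
# mixed numeric/categorical predictors and a two-level factor outcome
n <- 200
planet_chr <- sample(c("Earth", "Europa", "Mars"), n, replace = TRUE)
toy <- tibble(
  age         = round(runif(n, 18, 70)),
  spend       = rexp(n, rate = 1 / ifelse(planet_chr == "Europa", 600, 200)),
  vip         = factor(sample(c("False", "True"), n, replace = TRUE)),
  planet      = factor(planet_chr),
  transported = factor(ifelse(runif(n) < ifelse(planet_chr == "Europa", 0.7, 0.3),
                              "True", "False"))
)

# Inject missing values into the predictor that will be imputed
toy$planet[sample(n, 30)] <- NA

# Recipe with neighbors marked for tuning
rec <- recipe(transported ~ ., data = toy) %>%
  step_impute_knn(planet, neighbors = tune())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(logistic_reg())

folds <- vfold_cv(toy, v = 5)

# Candidate values; the column name must match the parameter id "neighbors"
grid <- tibble(neighbors = c(1L, 3L, 5L, 10L))

res <- tune_grid(
  wf,
  resamples = folds,
  grid = grid,
  metrics = metric_set(roc_auc, accuracy)
)

# Resampled performance for every candidate value
collect_metrics(res)

# Pick the value with the best ROC AUC and fix it in the workflow
best <- select_best(res, metric = "roc_auc")
final_wf <- finalize_workflow(wf, best)
final_fit <- fit(final_wf, data = toy)

# The prepped recipe inside the fitted workflow now imputes with the chosen value
imputed <- bake(extract_recipe(final_fit), new_data = toy)
```

Note that the value chosen this way is the one that helps the downstream model most on the resamples, which is not necessarily the one that reconstructs the missing entries most accurately.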