I'd like to use the step_impute_knn function from the recipes package to impute some missing values in my data. I've tested it with the default parameters (neighbors = 5, nthread = 1 and eps = 1e-08) and can see that the resulting means and standard deviations for numerical variables (for example) are fairly close to the original data after imputation.
I'd like, however, to tune these parameters to see if there is an optimal set, but I don't know where to start within the recipes package. The answers here and here are too complex or specific for me to understand.
The function step_impute_knn doesn't provide any tuning options, as far as I can see, and I'd rather not do it manually. Is there a simple way to do this?
Sample data:
train <- structure(list(PassengerId = c("0001_01", "0002_01", "0003_01",
"0003_02", "0004_01", "0005_01"), HomePlanet = c("Europa", "Earth",
"Europa", "Europa", "Earth", NA), CryoSleep = c("False",
"False", "False", "False", "False", "False"), Cabin = c("B/0/P",
"F/0/S", "A/0/S", "A/0/S", "F/1/S", "F/0/P"), Destination = c("TRAPPIST-1e",
"TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e", "PSO J318.5-22"
), Age = c(39, 24, 58, 33, 16, 44), VIP = c("False", "False",
"True", "False", "False", "False"), RoomService = c(0, 109, 43,
0, 303, 0), FoodCourt = c(0, 9, 3576, 1283, 70, 483), ShoppingMall = c(0,
25, 0, 371, 151, 0), Spa = c(0, 549, 6715, 3329, 565, 291), VRDeck = c(0,
44, 49, 193, 2, 0), Name = c("Maham Ofracculy", "Juanna Vines",
"Altark Susent", "Solam Susent", "Willy Santantines", "Sandie Hinetthews"
), Transported = c("False", "True", "False", "False", "True",
"True")), row.names = c(NA, 6L), class = "data.frame")
What I have so far:
library(recipes)  # also makes %>% available

# Fit the imputation model on complete rows only
train_no_na <- train %>%
  na.omit()
imp_knn_blueprint <- recipe(Transported ~ ., data = train_no_na) %>%
  step_impute_knn(HomePlanet,
                  impute_with = imp_vars(all_predictors()),
                  neighbors = 5,
                  options = list(nthread = 1, eps = 1e-08))
imp_knn_prep <- prep(imp_knn_blueprint, training = train_no_na)
# Apply the trained imputation to the full data, including rows with NA
imp_knn_5 <- bake(imp_knn_prep, new_data = train)
Yes, you can (although we don't consider nthread or eps tuning parameters).
You would give neighbors a value of tune() in the recipe and treat it like any other tuning parameter associated with the model.
You would use tune_grid() or one of the other tuning functions. tidymodels even understands what this particular parameter is and has a built-in default range for it (although you can pick the grid yourself).
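As a rough illustration of that built-in range, here is a minimal sketch assuming the dials package (part of tidymodels) is installed. The object name knn_grid and the 1 to 15 range are arbitrary choices made for this example, not recommendations.

```r
library(dials)

# Parameter object that tidymodels associates with `neighbors = tune()`;
# printing it shows the default range.
neighbors()

# A custom regular grid over a wider, arbitrarily chosen range (1 to 15).
knn_grid <- grid_regular(neighbors(range = c(1L, 15L)), levels = 5)
knn_grid
```

A data frame like knn_grid can be passed to the grid argument of tune_grid() in place of the default.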
There's an example of tuning recipe parameters in the tidymodels book and also on the tune_grid() help page (in the examples).
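Putting the pieces together, the following is a minimal sketch rather than a definitive setup. It assumes the tidymodels meta-package is installed and uses modeldata's penguins data as a stand-in, because the six-row sample in the question is too small to resample. The linear model, the five folds and the grid values are likewise assumptions chosen only to keep the example short.

```r
library(tidymodels)

# Stand-in data (an assumption, not the question's data): modeldata's penguins
# has missing values in `sex` once rows without an outcome are dropped.
data(penguins, package = "modeldata")
penguins <- penguins %>% filter(!is.na(body_mass_g))

set.seed(123)
folds <- vfold_cv(penguins, v = 5)

# neighbors = tune() marks the parameter for tuning instead of fixing it.
rec <- recipe(body_mass_g ~ ., data = penguins) %>%
  step_impute_knn(all_predictors(), neighbors = tune())

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg())

# Hand-picked candidate values; the column name must match the parameter id.
knn_grid <- tibble(neighbors = c(1L, 3L, 5L, 7L, 9L))
res <- tune_grid(wf, resamples = folds, grid = knn_grid)

# Rank the candidates and plug the best value back into the recipe.
show_best(res, metric = "rmse")
best <- select_best(res, metric = "rmse")
final_rec <- finalize_recipe(rec, best)
```

With the question's data the same pattern would apply with Transported ~ . as the formula, step_impute_knn(HomePlanet, neighbors = tune()) as the step, and a classification model in place of linear_reg(). Note that the candidates are ranked by how well the downstream model predicts, not by how closely the imputed values match the originals.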