Imagine there is a formula for a simple regression: y~f1+f2+f3, where f1 is a factor with levels A, B, C, and f2 and f3 are numeric. Further, I'm using the following recipe:
recipe(y~f1+f2+f3, data) %>%
  step_dummy(f1) %>%
  step_log(f3)
Question 1. Eventually the initial formula turns into y~f1_A+f1_B+f1_C+f2+log(f3), right?
Question 2. If I had added
+step_pca(comp5)
would it become y~PC1+PC2+..PC5?
Hope it makes sense.
Thanks in advance.
For the first question:
Eventually the initial formula turns into y~f1_A+f1_B+f1_C+f2+log(f3), right?
Almost! The log step doesn't rename the variable (so the logged values are just in column f3). The other parts are right.
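One way to check this yourself is to prepare the recipe and look at the column names of the baked data. The sketch below assumes a small made-up data frame (the `toy` object and its values are hypothetical, chosen only to match the variable names in the question). Note that with the default settings of step_dummy(), the first factor level acts as the reference level, so you should not expect an f1_A column.

```r
library(recipes)

# Hypothetical toy data using the question's variable names
toy <- data.frame(
  y  = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7),
  f1 = factor(c("A", "B", "C", "A", "B", "C")),
  f2 = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5),
  f3 = c(1, 10, 100, 1000, 10, 100)
)

rec <- recipe(y ~ f1 + f2 + f3, data = toy) %>%
  step_dummy(f1) %>%
  step_log(f3) %>%
  prep()

# Processed training data
baked <- bake(rec, new_data = NULL)

# Column names after the steps: f3 keeps its name, f1 becomes dummy columns
names(baked)

# The f3 column now holds the logged values
all.equal(baked$f3, log(toy$f3))
```

The baked data should contain f1_B and f1_C in place of f1, while f3 is still called f3 but holds log(f3).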
Question 2:
If I had added +step_pca(comp5), would it become y~PC1+PC2+..PC5?
Yes(ish). The names that come out of step_pca() are designed to be sortable. If you have fewer than 10 components, then the above is right. If you have 10 to 99 components, then the names are zero-padded to two digits: PC01 ... PC99.
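The naming can be seen with a small experiment. This is only a sketch: the simulated data frame `sim` and the helper `pc_names()` are hypothetical names introduced here for illustration, and the number of components is set through the num_comp argument of step_pca().

```r
library(recipes)

# Hypothetical data: 12 random numeric predictors plus an outcome y
set.seed(1)
sim <- as.data.frame(matrix(rnorm(50 * 12), ncol = 12))
sim$y <- rnorm(50)

# Hypothetical helper: names of the PCA columns produced for n components
pc_names <- function(n) {
  rec <- recipe(y ~ ., data = sim) %>%
    step_pca(all_numeric_predictors(), num_comp = n) %>%
    prep()
  grep("^PC", names(bake(rec, new_data = NULL)), value = TRUE)
}

pc_names(5)   # fewer than 10 components
pc_names(12)  # 10 or more components
```

The first call should give PC1 through PC5, and the second PC01 through PC12, so that the columns sort in component order.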
Finally, recipes don't just make a formula to do these computations (you probably didn't mean that, but just to be sure). However, there is a little-known formula method that you can use on a recipe once it is prepared:
library(tidymodels)
pen_rec <-
  recipe(island ~ species + body_mass_g, data = penguins) %>%
  step_dummy(species) %>%
  step_log(body_mass_g) %>%
  prep()
formula(pen_rec)
#> island ~ body_mass_g + species_Chinstrap + species_Gentoo
#> <environment: 0x125456170>
Created on 2023-10-28 with reprex v2.0.2