I am looking for functionality similar to https://rdrr.io/github/bfgray3/cattonum/man/catto_freq.html
but implemented as a recipes::step_*() function (https://tidymodels.github.io/recipes/reference/index.html).
Is anyone aware of an implementation for this? :)
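As far as I can see, catto_freq does count encoding, similar to CountEncoder in the scikit-learn-compatible category_encoders package.
Grouping is not supported by recipes right now, so what you can do is the count encoding with dplyr, outside of recipes.
You can try two approaches: compute the counts on train and test combined, or compute them on the training data only and apply them to the test data. Be aware of possible data leakage in the first approach, because the test rows contribute to the counts.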
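To make "count encoding" concrete, here is a minimal sketch on made-up data (the `toy` tibble and its values are hypothetical, not taken from the Titanic set). It assumes dplyr's `add_count()` with its `name` argument is available:

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- tibble(id = 1:5, Ticket = c("A", "B", "A", "C", "A"))

# Count encoding: attach to every row the number of rows sharing its level
toy_encoded <- toy %>%
  add_count(Ticket, name = "Ticket_size")

toy_encoded
```

Each row with Ticket "A" gets Ticket_size 3, the rows with "B" and "C" get 1.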
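library(tidyverse)
library(tidymodels)

# Titanic training and test sets
full <- read_csv("data/raw/train.csv")
new_data <- read_csv("data/raw/test.csv")

# Approach 1: stack train and test, then count (possible data leakage)
combined <- bind_rows(full, new_data)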
# Combined data with cattonum: Ticket is replaced by its frequency
combined_catto_freq <- combined %>%
  cattonum::catto_freq(Ticket) %>%
  rename(Ticket_size = Ticket) %>%
  arrange(desc(Ticket_size))
combined_catto_freq
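# Combined data with plain dplyr. mutate() cannot overwrite a grouping
# column, so the count goes into a new column and Ticket is dropped after.
combined_mutate <- combined %>%
  group_by(Ticket) %>%
  mutate(Ticket_size = n()) %>%
  ungroup() %>%
  select(-Ticket) %>%
  arrange(desc(Ticket_size))
combined_mutate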
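# Approach 2: compute the counts on the training data only.
# With cattonum, passing test = returns a list with $train and $test.
df2 <- full %>%
  select(-Survived) %>%
  cattonum::catto_freq(Ticket, test = new_data)

df2$train %>%
  rename(Ticket_size = Ticket) %>%
  arrange(desc(Ticket_size))

df2$test %>%
  rename(Ticket_size = Ticket) %>%
  arrange(desc(Ticket_size))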
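# Training-only counts with plain dplyr
full_mutate <- full %>%
  group_by(Ticket) %>%
  mutate(Ticket_size = n()) %>%
  ungroup() %>%
  arrange(desc(Ticket_size))
full_mutate

# Look up the training counts for the test rows.
# Tickets that never occur in the training data get NA.
new_data_mutate <- new_data %>%
  left_join(distinct(full_mutate, Ticket, Ticket_size), by = "Ticket")
new_data_mutate %>%
  arrange(desc(Ticket_size))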
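The train-only variant can be sketched on toy data to show what happens to levels that appear only in the test set. The tibbles `train_toy` and `test_toy` are hypothetical, and filling unseen levels with 0 is my assumption about a sensible default, not something cattonum or recipes prescribes:

```r
library(dplyr)

# Hypothetical toy data: "C" occurs only in the test set
train_toy <- tibble(Ticket = c("A", "A", "B"))
test_toy  <- tibble(Ticket = c("A", "C"))

# Learn the counts from the training rows only
ticket_counts <- train_toy %>%
  count(Ticket, name = "Ticket_size")

# Apply them to the test rows; unseen levels come back as NA from the join
# and are set to 0 here (an assumption, pick what suits your model)
test_encoded <- test_toy %>%
  left_join(ticket_counts, by = "Ticket") %>%
  mutate(Ticket_size = coalesce(Ticket_size, 0L))

test_encoded
```

Because the counts come from `train_toy` alone, nothing about the test set flows into the encoding.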
PS: I used the Titanic data for this example.