[SOLVED] R: frequency encoding for categorical variables via recipes package

I am looking for functionality similar to https://rdrr.io/github/bfgray3/cattonum/man/catto_freq.html

but implemented as a recipes step_*() function (https://tidymodels.github.io/recipes/reference/index.html)

Is anyone aware of an implementation for this? :)

Solution

As far as I can see, catto_freq does count (frequency) encoding, similar to CountEncoder in the Python category_encoders package.
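To make the idea concrete, here is a minimal sketch of count encoding on made-up data, using only dplyr. The tibble and the column names `ticket` and `ticket_count` are assumptions for illustration and are not from the Titanic files.

```r
library(dplyr)

# Toy data: the values and column names are made up for illustration.
toy <- tibble(ticket = c("A", "B", "A", "C", "A", "B"))

# add_count() appends, for every row, the number of rows sharing its ticket value.
toy_encoded <- toy %>%
  add_count(ticket, name = "ticket_count")

toy_encoded
```

Each row keeps its position, and every level is represented by how often it occurs in the data.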

Grouping is not supported by recipes right now, so you can do the count encoding with dplyr instead of recipes.

You can try two approaches, but be aware of possible data leakage in the first one, because its counts are computed with the test rows included.
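The leakage can be seen on a small made-up split. This is only a sketch: the `train` and `test` tibbles and the `ticket_count` column are hypothetical, chosen so that one ticket appears in both parts.

```r
library(dplyr)

# Hypothetical toy split; ticket "A" appears in both parts.
train <- tibble(ticket = c("A", "A", "B"))
test  <- tibble(ticket = c("A", "C"))

# Counts learned from the training rows only.
train_counts <- count(train, ticket, name = "ticket_count")

# Counts learned from train and test together.
combined_counts <- count(bind_rows(train, test), ticket, name = "ticket_count")

train_counts     # A = 2, B = 1
combined_counts  # A = 3, B = 1, C = 1
```

The value for "A" changes from 2 to 3 once the test rows are included, so a model trained on the combined counts has already seen something about the test set.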

Using all the data:

library(tidyverse)
library(tidymodels)

# Titanic train/test files
full <- read_csv("data/raw/train.csv")
new_data <- read_csv("data/raw/test.csv")
combined <- bind_rows(full, new_data)

# With cattonum: Ticket is replaced by its frequency
combined_catto_freq <- combined %>%
  cattonum::catto_freq(Ticket) %>%
  rename(Ticket_size = Ticket) %>%
  arrange(desc(Ticket_size))
combined_catto_freq

# With plain dplyr: a grouping column cannot be overwritten inside mutate(),
# so write the count to a new column and drop Ticket afterwards
combined_mutate <- combined %>%
  group_by(Ticket) %>%
  mutate(Ticket_size = n()) %>%
  ungroup() %>%
  select(-Ticket) %>%
  arrange(desc(Ticket_size))
combined_mutate

Not using test data for the count encoding:

# With cattonum: counts are learned from `full` and applied to `test`
df2 <- full %>%
  select(-Survived) %>%
  cattonum::catto_freq(Ticket, test = new_data)
df2$train %>%
  rename(Ticket_size = Ticket) %>%
  arrange(desc(Ticket_size))
df2$test %>%
  rename(Ticket_size = Ticket) %>%
  arrange(desc(Ticket_size))

# With plain dplyr: count each ticket in the training data only
full_mutate <- full %>%
  group_by(Ticket) %>%
  mutate(Ticket_size = n()) %>%
  ungroup() %>%
  arrange(desc(Ticket_size))
full_mutate

# Look up the training counts for the test rows;
# tickets that never appear in `full` get NA
new_data_mutate <- new_data %>%
  left_join(distinct(full_mutate, Ticket, Ticket_size), by = "Ticket")
new_data_mutate %>%
  arrange(desc(Ticket_size))
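If you need this in several places, the fit-on-train, apply-anywhere pattern can be wrapped in two small helpers. This is only a sketch: `fit_count_encoding()` and `apply_count_encoding()` are hypothetical names (they are not part of recipes or cattonum), the `n_train` column name is made up, and filling unseen levels with 0 instead of NA is an assumption you may want to change.

```r
library(dplyr)

# Hypothetical helper: learn a lookup table with one row per level of `col`.
fit_count_encoding <- function(train, col) {
  count(train, {{ col }}, name = "n_train")
}

# Hypothetical helper: attach the learned counts to any data frame.
# Levels that were never seen in training get 0 (an assumption; NA is also defensible).
apply_count_encoding <- function(data, lookup) {
  key <- setdiff(names(lookup), "n_train")
  data %>%
    left_join(lookup, by = key) %>%
    mutate(n_train = coalesce(n_train, 0L))
}

# Toy data, made up for illustration.
train <- tibble(ticket = c("A", "A", "B"))
test  <- tibble(ticket = c("A", "C"))

lookup    <- fit_count_encoding(train, ticket)
train_enc <- apply_count_encoding(train, lookup)
test_enc  <- apply_count_encoding(test, lookup)

test_enc  # "A" was seen twice in training, "C" never
```

Keeping the lookup table as a separate object mirrors what a recipes step would do between prep() and bake(): the counts are estimated once on the training set and then reused unchanged on new data.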

PS: I used the Titanic data for this example.