r r-recipes

R: frequency encoding for categorical variables via the recipes package


I am looking for functionality similar to https://rdrr.io/github/bfgray3/cattonum/man/catto_freq.html

but implemented as a recipes::step_*() function (https://tidymodels.github.io/recipes/reference/index.html)

Is anyone aware of an implementation for this? :)


Solution

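  • As far as I can see, catto_freq does count (frequency) encoding, similar to CountEncoder in the Python category_encoders package (scikit-learn-contrib).

    recipes does not support grouped operations right now, so what you can do is the count encoding with dplyr, outside of recipes.

    You can try two approaches, but be aware of possible data leakage in the first one, because the counts are computed on the training and test data together.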

    1. Using all the data:
    library(tidyverse)
    library(tidymodels)
    
    # Titanic train/test files
    full <- read_csv("data/raw/train.csv")
    new_data <- read_csv("data/raw/test.csv")
    combined <- bind_rows(full, new_data)
    
    # With cattonum: Ticket is replaced by its frequency
    combined_catto_freq <- combined %>%
      cattonum::catto_freq(Ticket) %>%
      rename(Ticket_size = Ticket) %>%
      arrange(desc(Ticket_size))
    combined_catto_freq
    
    # Same encoding with plain dplyr
    combined_mutate <- combined %>%
      group_by(Ticket) %>%
      mutate(Ticket = n()) %>%
      ungroup() %>%
      rename(Ticket_size = Ticket) %>%
      arrange(desc(Ticket_size))
    combined_mutate
    
    2. Not using the test data for the count encoding:
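    # With cattonum: frequencies come from the training data and are
    # applied to the test data (Survived is dropped because the test
    # set does not have it)
    df2 <- full %>%
      select(-Survived) %>%
      cattonum::catto_freq(Ticket, test = new_data)
    df2$train %>%
      rename(Ticket_size = Ticket) %>%
      arrange(desc(Ticket_size))
    df2$test %>%
      rename(Ticket_size = Ticket) %>%
      arrange(desc(Ticket_size))
    
    # Same idea with plain dplyr: count on the training data only
    full_mutate <- full %>%
      group_by(Ticket) %>%
      mutate(Ticket_size = n()) %>%
      ungroup() %>%
      arrange(desc(Ticket_size))
    full_mutate
    
    # Join the training counts onto the test data; tickets that never
    # appear in the training data get NA
    new_data_mutate <- new_data %>%
      left_join(distinct(full_mutate, Ticket, Ticket_size), by = "Ticket")
    new_data_mutate %>%
      arrange(desc(Ticket_size))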
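
    If you want to try the second approach without downloading the Titanic files, here is a minimal self-contained sketch of the same idea using only dplyr. The toy tibbles, the helper name encode_ticket, and the choice to give unseen tickets a count of 0 instead of NA are my own assumptions for illustration, not something cattonum or recipes prescribes.

```r
library(dplyr)

# Toy data standing in for the Titanic train/test split (made-up values).
train <- tibble(Ticket = c("A1", "A1", "B2", "C3", "A1"))
test  <- tibble(Ticket = c("A1", "B2", "Z9"))

# Learn the counts from the training set only, so nothing leaks from test.
ticket_counts <- count(train, Ticket, name = "Ticket_size")

# Hypothetical helper: join the learned counts onto any data set.
# Tickets that were not seen in training get a count of 0 (an assumption;
# drop the coalesce() call if you prefer to keep them as NA).
encode_ticket <- function(data, counts) {
  data %>%
    left_join(counts, by = "Ticket") %>%
    mutate(Ticket_size = coalesce(Ticket_size, 0L))
}

train_encoded <- encode_ticket(train, ticket_counts)
test_encoded  <- encode_ticket(test, ticket_counts)

train_encoded
test_encoded
```

    Keeping the learned counts in their own table (ticket_counts) means the same lookup can be reused on any new data later on, which is roughly what a recipes step would do between prep() and bake().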
    

    P.S.: I used the Titanic data for this example.