r data-wrangling forcats

Collapse levels of a factor when the number of observations within a level is below a limit


I would like a way to collapse levels of a factor based on the number of observations for each level.

For example, if I have the tibble below with a factor column of animals (four levels: cat, dog, hamster, goldfish), can I collapse levels with fewer than 2 observations into a level called "other"?

# A tibble: 7 × 1
  animal  
  <fct>   
1 cat     
2 cat     
3 cat     
4 dog     
5 dog     
6 hamster 
7 goldfish

This should result in the following...

# A tibble: 7 × 2
  animal   animal2
  <fct>    <fct>  
1 cat      cat    
2 cat      cat    
3 cat      cat    
4 dog      dog    
5 dog      dog    
6 hamster  other  
7 goldfish other  

I would like to be able to adjust the cut-off (e.g. groups with fewer than 5 observations), and ideally this would be done using the tidyverse.


Solution

  • You're looking for forcats::fct_lump_min, which collapses levels that appear fewer than min times into 'Other':

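    library(forcats)
    library(dplyr)

    # Example data from the question
    df <- tibble(animal = factor(c("cat", "cat", "cat", "dog", "dog", "hamster", "goldfish")))

    df %>% 
      mutate(animal2 = fct_lump_min(animal, min = 2),  # keep levels with at least 2 observations
             animal3 = fct_lump_min(animal, min = 3))  # keep levels with at least 3 observations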
    

    Output:

    # A tibble: 7 × 3
      animal   animal2 animal3
      <fct>    <fct>   <fct>  
    1 cat      cat     cat    
    2 cat      cat     cat    
    3 cat      cat     cat    
    4 dog      dog     Other  
    5 dog      dog     Other  
    6 hamster  Other   Other  
    7 goldfish Other   Other
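
The question asked for a lowercase "other" label and an adjustable cut-off. A minimal sketch of how that might look, assuming the same example data as in the question: fct_lump_min takes an other_level argument for the label, and the threshold can be kept in a variable (cutoff and res are hypothetical names used only for illustration).

```r
library(forcats)
library(dplyr)

# Example data from the question
df <- tibble(animal = factor(c("cat", "cat", "cat", "dog", "dog", "hamster", "goldfish")))

# Hypothetical cut-off variable: levels with fewer observations than this are lumped
cutoff <- 2

res <- df %>%
  mutate(animal2 = fct_lump_min(animal, min = cutoff, other_level = "other"))

# Inspect the resulting levels; the lumped level is placed last
levels(res$animal2)
```

With cutoff <- 2, hamster and goldfish should end up in "other" while cat and dog are kept. Raising cutoff to 3 would also lump dog, matching the animal3 column above.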