r dplyr group-by data-cleaning r-factor

R group by with each grouped element associated with most common factor


I want to group by column a and pick the most common level of the factor b for each unique value of a. For example:

tibble(a = c(1,1,1,2,2,2), b = factor(c('cat', 'dog', 'cat', 'cat', 'dog', 'dog'))) %>%
    reframe(b = most_common(b), .by = a)

I want this to produce:

a b
1 cat
2 dog

However, the most_common function doesn't exist. Is there an efficient R function for this purpose? This must be a common need in data cleaning, which is what I need it for. I searched and found people implementing mode functions. I could use one of those, but they seemed inefficient. Is there a better approach to this overall problem?


Solution

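  • We can use table() with max.col(), assuming the data from the question is stored in df:

    df <- data.frame(a = c(1,1,1,2,2,2), b = factor(c('cat', 'dog', 'cat', 'cat', 'dog', 'dog')))
    d <- table(df)  # contingency table: rows are values of a, columns are levels of b
    data.frame(
      a = as.numeric(row.names(d)),
      # max.col() breaks ties at random by default; "first" makes the result deterministic
      b = colnames(d)[max.col(d, ties.method = "first")]
    )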
    

    which gives

      a   b
    1 1 cat
    2 2 dog
    

    or, using dplyr:

    df %>%
      group_by(a) %>%
      # which.max() returns the first maximum, so ties go to the first level
      summarise(b = names(which.max(table(b))))
    

    which gives

    # A tibble: 2 × 2
          a b
      <dbl> <chr>
    1     1 cat
    2     2 dog
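
If a reusable most_common() in the spirit of the question is preferred, a minimal sketch could look like the following. The name most_common comes from the question and is a hypothetical helper, not a function from dplyr or base R. The sketch assumes the input is a factor, breaks ties by taking the first level, ignores NA values, and returns NA when a group has no non-missing values. It uses tabulate() on the factor's integer codes rather than table(), and it keeps b as a factor instead of converting it to character. The .by argument assumes dplyr 1.1.0 or later, as in the question.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr or base R): returns the most
# frequent level of a factor, keeping the result a factor.
most_common <- function(x) {
  stopifnot(is.factor(x))
  # Count occurrences of each level via the integer codes; NAs are ignored.
  counts <- tabulate(x, nbins = nlevels(x))
  # No non-missing values in this group: return NA with the same levels.
  if (all(counts == 0)) {
    return(factor(NA, levels = levels(x)))
  }
  # which.max() returns the first maximum, so ties go to the first level.
  factor(levels(x)[which.max(counts)], levels = levels(x))
}

df <- tibble(
  a = c(1, 1, 1, 2, 2, 2),
  b = factor(c("cat", "dog", "cat", "cat", "dog", "dog"))
)

result <- df %>%
  summarise(b = most_common(b), .by = a)

print(result)
```

Because the helper always returns exactly one value per group, summarise() fits better here than reframe(), which is meant for results of arbitrary length.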