rdplyrr-factor

Redefine factor levels and order within groups


This is the simple example for illustration. I want a summary of data presented in a predetermined order. I want to order col2 values depending on col1 values, and also include rows for factor levels within the col1 group that are not in the data (eg using group_by ( ..., .drop=FALSE). Some values in col2 appear in more than col1 group. There is no logic that can be applied to determine the order of col2. You may call it a two-level factor maybe?

For example , my input data could be:

df <- read.table(
  header = TRUE,
  sep=",",
  text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
"
)

and my required output would be

 col1       col2        n
 Beatles    John        0
 Beatles    Paul        1
 Beatles    George      1
 Beatles    Ringo       2
 UK Artists Gilbert     1
 UK Artists George      0
 Tunnels    Tom         2
 Tunnels    Dick        1
 Tunnels    Harry       0

The following , of course, does not work

col2_tunnels <- c("Tom", "Dick", "Harry")
col2_beatles <- c("John", "Paul", "George", "Ringo")
col2_artists <- c("Gilbert", "George")
col2_order <- unique(c(col2_tunnels, col2_beatles, col2_artists)) # cannot have duplicates
col1_order <- c("Beatles", "UK Artists", "Tunnels")

df %>%
  mutate(
    col1 = factor(col1, levels = col1_order),
    col2 = factor(col2, levels = col2_order)
  ) %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarise(n = n(), )

The only way forward I can see is to split the data by col1 levels and use a named list of vectors defining the factor order for each level of col1. While writing the question I found this worked

col2_fctlist <- list(
  Tunnels = c("Tom", "Dick", "Harry"),
  Beatles = c("John", "Paul", "George", "Ringo"),
  'UK Artists' = c("Gilbert", "George")
)

x <- lapply(col1_order, function(col1grp)
  df %>% filter(col1==col1grp) %>% 
    mutate(col2 = factor(col2, levels = col2_fctlist[[col1grp]])) %>% 
    group_by(col1, col2, .drop = FALSE) %>%
    summarise(n = n(), )
)

do.call(rbind, x)

Although I have found a solution that I think works for me, I'm still posting in case anybody can offer a better solution?


Solution

  • With a join:

    library(tidyverse)
    enframe(col2_fctlist, name = "col1", value = "col2") %>% unnest(col2) %>% 
      left_join(df %>% count(col1, col2)) %>% 
      replace_na(list(n = 0))
    
            col1    col2 n
    1    Tunnels     Tom 2
    2    Tunnels    Dick 1
    3    Tunnels   Harry 0
    4    Beatles    John 0
    5    Beatles    Paul 1
    6    Beatles  George 1
    7    Beatles   Ringo 2
    8 UK Artists Gilbert 1
    9 UK Artists  George 0
    

    Or with imap_dfr:

    imap_dfr(col2_fctlist, 
             ~ df %>% 
               filter(col1 == .y) %>% 
               mutate(col2 = factor(col2, levels = .x)) %>% 
               count(col2, .drop = FALSE), 
             .id = "col1")