This is the simple example for illustration. I want a summary of data presented in a predetermined order. I want to order col2 values depending on col1 values, and also include rows for factor levels within the col1 group that are not in the data (eg using group_by ( ..., .drop=FALSE). Some values in col2 appear in more than col1 group. There is no logic that can be applied to determine the order of col2. You may call it a two-level factor maybe?
For example , my input data could be:
df <- read.table(
header = TRUE,
sep=",",
text = "
col1,col2
Tunnels,Dick
Tunnels,Tom
Tunnels,Tom
Beatles,George
Beatles,Paul
Beatles,Ringo
Beatles,Ringo
UK Artists,Gilbert
"
)
and my required output would be
col1 col2 n
Beatles John 0
Beatles Paul 1
Beatles George 1
Beatles Ringo 2
UK Artists Gilbert 1
UK Artists George 0
Tunnels Tom 2
Tunnels Dick 1
Tunnels Harry 0
The following , of course, does not work
col2_tunnels <- c("Tom", "Dick", "Harry")
col2_beatles <- c("John", "Paul", "George", "Ringo")
col2_artists <- c("Gilbert", "George")
col2_order <- unique(c(col2_tunnels, col2_beatles, col2_artists)) # cannot have duplicates
col1_order <- c("Beatles", "UK Artists", "Tunnels")
df %>%
mutate(
col1 = factor(col1, levels = col1_order),
col2 = factor(col2, levels = col2_order)
) %>%
group_by(col1, col2, .drop = FALSE) %>%
summarise(n = n(), )
The only way forward I can see is to split the data by col1 levels and use a named list of vectors defining the factor order for each level of col1. While writing the question I found this worked
col2_fctlist <- list(
Tunnels = c("Tom", "Dick", "Harry"),
Beatles = c("John", "Paul", "George", "Ringo"),
'UK Artists' = c("Gilbert", "George")
)
x <- lapply(col1_order, function(col1grp)
df %>% filter(col1==col1grp) %>%
mutate(col2 = factor(col2, levels = col2_fctlist[[col1grp]])) %>%
group_by(col1, col2, .drop = FALSE) %>%
summarise(n = n(), )
)
do.call(rbind, x)
Although I have found a solution that I think works for me, I'm still posting in case anybody can offer a better solution?
With a join
:
library(tidyverse)
enframe(col2_fctlist, name = "col1", value = "col2") %>% unnest(col2) %>%
left_join(df %>% count(col1, col2)) %>%
replace_na(list(n = 0))
col1 col2 n
1 Tunnels Tom 2
2 Tunnels Dick 1
3 Tunnels Harry 0
4 Beatles John 0
5 Beatles Paul 1
6 Beatles George 1
7 Beatles Ringo 2
8 UK Artists Gilbert 1
9 UK Artists George 0
Or with imap_dfr
:
imap_dfr(col2_fctlist,
~ df %>%
filter(col1 == .y) %>%
mutate(col2 = factor(col2, levels = .x)) %>%
count(col2, .drop = FALSE),
.id = "col1")