r ggplot2 ggalluvial

Working with the ggalluvial / ggsankey libraries with missing combinations and dropouts


I'm trying to represent the movements of patients between several treatment groups measured in 3 different years. However, there are dropouts: some patients from the 1st year are missing in the 2nd year, and some patients in the 2nd year weren't in the 1st. The same happens for the 3rd year. I have a label called "none" for these combinations, but I don't want it to appear in the plot.

An example plot with only 2 years: [image: plot with 'none' values]

EDIT

I have also tried geom_sankey (https://rdrr.io/github/davidsjoberg/ggsankey/man/geom_sankey.html). It is closer to what I'm looking for, but I don't know how to omit the stratum groups without labels (NA). In this case I'm using my full data, not a dummy example. I can't share it, but I can try to create an example if needed. This is the code I've tried:

data = bind_rows(data_2015,data_2017,data_2019) %>% 
  select(sip, Year, Grp) %>%
  mutate(Grp = factor(Grp), Year = factor(Year)) %>%
  arrange(sip) %>% 
  pivot_wider(names_from = Year, values_from = Grp)

df_sankey = data %>% make_long(`2015`,`2017`,`2019`)

ggplot(df_sankey, aes(x = x, 
               next_x = next_x, 
               node = node, 
               next_node = next_node,
               fill = factor(node),
               label = node,
               color=factor(node) )) +
  geom_sankey(flow.alpha = 0.5, node.color = 1) +
  geom_sankey_label(size = 3.5, color = 1, fill = "white") +
  scale_fill_viridis_d() +
  scale_colour_viridis_d() +
  theme_sankey(base_size = 16) +
  theme(legend.position = "none") + xlab('')

Figure: [image: geom_sankey() plot]

Any idea on how to omit the missing groups every year as a stratum (without omitting them in the alluvium) would be super helpful. Thanks!


Solution

  • Solved! The solution was much easier than I thought. I'll leave it here in case someone else struggles with a similar problem.

    1. Create a wide table with one row per patient and one column per year (cohort), holding the patient's group in that year.
    # Data with 3 cohorts for years 2015, 2017 and 2019
    # Grp is a factor with 6 levels: 1 to 6
    # sip is a unique ID
    
    library(tidyverse)
    library(ggsankey)  # make_long(), geom_sankey(); GitHub package davidsjoberg/ggsankey
    
    data_wide = data %>%
      select(sip, Year, Grp) %>%
      mutate(Grp = factor(Grp, levels=c(1:6)), Year = factor(Year)) %>%
      arrange(sip) %>% 
      pivot_wider(names_from = Year, values_from = Grp)
    
    2. Using the ggsankey package, we can transform the wide table into the long format the package expects. There's already a useful function for this.
    df_sankey = data_wide %>% make_long(`2015`,`2017`,`2019`)
    
    # Each row of the tibble records one step along the X axis: the current
    # categorical value (node) and the value at the next X position (next_node):
    
    > head(df_sankey)
    # A tibble: 6 × 4
      x     node  next_x next_node
      <fct> <chr> <fct>  <chr>    
    1 2015  3     2017   2        
    2 2017  2     2019   2        
    3 2019  2     NA     NA       
    4 2015  NA    2017   1        
    5 2017  1     2019   1        
    6 2019  1     NA     NA   
    
    
    3. It looks like passing the pivot_wider() output to make_long() creates a row for every patient and year, with NA in 'node' wherever a patient was missing that year. Drop the rows with NA in 'node' and create the plot.
    df_sankey %>% drop_na(node) %>% 
    ggplot(aes(x = x, 
                   next_x = next_x, 
                   node = node, 
                   next_node = next_node,
                   fill = factor(node),
                   label = node,
                   color=factor(node) )) +
      geom_sankey(flow.alpha = 0.5, node.color = 1) +
      geom_sankey_label(size = 3.5, color = 1, fill = "white") +
      scale_fill_viridis_d() +
      scale_colour_viridis_d() +
      theme_sankey(base_size = 16) +
      theme(legend.position = "none") + xlab('')
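
To see where the NA nodes come from without the real data, here is a minimal sketch on a made-up toy table. Everything named toy* is hypothetical, and the pivot_longer() + lead() step is only a tidyr/dplyr stand-in that mimics the columns make_long() returned above; it is not the ggsankey implementation.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: patient 2 is missing in 2015,
# patient 3 drops out after 2015.
toy <- tibble(
  sip  = c(1, 1, 1, 2, 2, 3),
  Year = c(2015, 2017, 2019, 2017, 2019, 2015),
  Grp  = c(3, 2, 2, 1, 1, 4)
)

# Same reshaping as step 1: absent patient/year pairs become NA.
toy_wide <- toy %>%
  mutate(Grp = factor(Grp, levels = 1:6), Year = factor(Year)) %>%
  arrange(sip) %>%
  pivot_wider(names_from = Year, values_from = Grp)

# Stand-in for make_long() (an assumption, for illustration only):
# one row per patient and year, plus the following year's value.
toy_long <- toy_wide %>%
  pivot_longer(-sip, names_to = "x", values_to = "node") %>%
  group_by(sip) %>%
  mutate(next_x = lead(x), next_node = lead(node)) %>%
  ungroup() %>%
  select(-sip)

# Same cleaning as step 3: drop rows whose starting node is NA.
toy_clean <- toy_long %>% drop_na(node)

nrow(toy_long)              # 9 rows: 3 patients x 3 years
sum(is.na(toy_long$node))   # 3 NA nodes created by the reshaping
nrow(toy_clean)             # 6 rows left for plotting
```

Note that drop_na(node) only removes rows whose own node is NA. Rows with a valid node but an NA next_node (a patient who is absent in the following year) are kept, which matches the filtering used in step 3.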
    
