r ggplot2 ggalt

Prepare date data for dumbbell plot


I have a dataset that presents a few challenges for transformation in preparation for creating a dumbbell plot:

  1. Single Date Groups: Some groups have only one date. In these cases, the start and end dates are the same, and h_sequ is 1.
  2. Two Date Groups: Other groups have a clear start and end date, signified by h_sequ values of 1 and 2. An example of this is group 12.
  3. Three Date Groups: There are also groups with three dates, where h_sequ takes values 1, 2, and 3, such as group 33.
  4. Duplicated First Date: Group 33 is also a special case, because its first date appears twice, so h_sequ takes the values 1, 1, 2, 3.

 group h_sequ date      
   <int>  <int> <date>    
 1     1      1 2012-03-27
 2     1      1 2012-03-27
 3    10      1 2016-10-25
 4    10      1 2016-10-25
 5    12      1 2021-06-25
 6    12      2 2022-05-18
 7    31      1 2019-11-28
 8    31      1 2019-11-28
 9    31      2 2021-03-24
10    33      1 2013-09-03
11    33      1 2013-09-03
12    33      2 2019-01-04
13    33      3 2020-07-28
14    35      1 2015-10-21
15    35      2 2017-06-28

data <- structure(list(group = c(1L, 1L, 10L, 10L, 12L, 12L, 31L, 31L, 
31L, 33L, 33L, 33L, 33L, 35L, 35L), h_sequ = c(1L, 1L, 1L, 1L, 
1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L), date = structure(c(15426, 
15426, 17099, 17099, 18803, 19130, 18228, 18228, 18710, 15951, 
15951, 17900, 18471, 16729, 17345), class = "Date")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -15L))
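
The cases listed above can be checked by counting rows and distinct dates per group. This is only a minimal sketch: the tibble is rebuilt from the printed table so that the snippet runs on its own, and the names `case_summary`, `n_rows`, `n_dates` and `has_duplicates` are made up for illustration.

```r
library(dplyr)

# Same values as the printed table above, rebuilt so the snippet is self-contained
data <- tibble(
  group  = c(1L, 1L, 10L, 10L, 12L, 12L, 31L, 31L, 31L, 33L, 33L, 33L, 33L, 35L, 35L),
  h_sequ = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L),
  date   = as.Date(c("2012-03-27", "2012-03-27", "2016-10-25", "2016-10-25",
                     "2021-06-25", "2022-05-18", "2019-11-28", "2019-11-28",
                     "2021-03-24", "2013-09-03", "2013-09-03", "2019-01-04",
                     "2020-07-28", "2015-10-21", "2017-06-28"))
)

# One row per group: how many rows, how many distinct dates,
# and whether any date is repeated within the group
case_summary <- data %>%
  summarise(
    n_rows         = n(),
    n_dates        = n_distinct(date),
    has_duplicates = n_rows > n_dates,
    .by = group
  )

case_summary
```

Groups with `n_dates` equal to 1 are the single-date case, and `has_duplicates` flags the groups in which a date is repeated.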

My main question is how to handle the date column so that all of these cases can be shown in a single dumbbell plot. I have summarised each group to its minimum and maximum date, but I still need to adapt this approach to the structure of my data, where the number of dates varies by group.

So far I have this:

library(ggplot2)
library(ggalt)
library(dplyr)

data %>%
  summarise(start_date = min(date), end_date = max(date), .by = group) %>%
  ggplot(aes(x = start_date, xend = end_date, y = group)) +
  geom_dumbbell(color = "red3", size = 3)
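
To see what the min/max summary keeps and what it drops, the intermediate table can be inspected before plotting. The sketch below is an illustration only: it rebuilds the data from the printed table, and the columns `single_date` and `n_dropped` are hypothetical helpers that are not part of the original attempt.

```r
library(dplyr)

# Same values as the printed table above, rebuilt so the snippet is self-contained
data <- tibble(
  group  = c(1L, 1L, 10L, 10L, 12L, 12L, 31L, 31L, 31L, 33L, 33L, 33L, 33L, 35L, 35L),
  h_sequ = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L),
  date   = as.Date(c("2012-03-27", "2012-03-27", "2016-10-25", "2016-10-25",
                     "2021-06-25", "2022-05-18", "2019-11-28", "2019-11-28",
                     "2021-03-24", "2013-09-03", "2013-09-03", "2019-01-04",
                     "2020-07-28", "2015-10-21", "2017-06-28"))
)

ranges <- data %>%
  summarise(
    start_date  = min(date),
    end_date    = max(date),
    # TRUE when the dumbbell collapses to a single point
    single_date = start_date == end_date,
    # distinct dates strictly between start and end, lost by the summary
    n_dropped   = n_distinct(date) - n_distinct(c(start_date, end_date)),
    .by = group
  )

ranges
```

With this data, groups 1 and 10 collapse to a single point, and group 33 loses its middle date (2019-01-04) because only the first and last dates survive the summary.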



Solution

  • I would probably manually dodge the co-occurring points, and join the points with geom_path. This allows a complete display of all your data.

    library(tidyverse)
    
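    data %>% 
      # treat group as discrete so each group gets its own row on the y axis
      mutate(group = factor(group)) %>%
      # small vertical offset for rows sharing the same group and date;
      # a row with no duplicate gets an offset of 0
      mutate(dodge = (row_number() - median(row_number())) / n() / 3.2, 
             .by = c(group, date)) %>%
      ggplot(aes(date, group)) +
      # gray line joining all dates within a group
      geom_path(linewidth = 3, color = "gray") +
      # one point per row, shifted by its offset and filled by h_sequ
      geom_point(aes(y = as.numeric(group) + dodge, fill = factor(h_sequ)), 
                 shape = 21, size = 5) +
      scale_fill_manual("h_sequ", values = c("orange", "deepskyblue4", "red4")) +
      theme_minimal(base_size = 16) 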
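
To make the manual dodge concrete, here is a small sketch that computes only the offsets, without the plot. It rebuilds the data from the table in the question so that it runs on its own; the name `offsets` is just for illustration.

```r
library(dplyr)

# Same values as the table in the question, rebuilt so the snippet is self-contained
data <- tibble(
  group  = c(1L, 1L, 10L, 10L, 12L, 12L, 31L, 31L, 31L, 33L, 33L, 33L, 33L, 35L, 35L),
  h_sequ = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L),
  date   = as.Date(c("2012-03-27", "2012-03-27", "2016-10-25", "2016-10-25",
                     "2021-06-25", "2022-05-18", "2019-11-28", "2019-11-28",
                     "2021-03-24", "2013-09-03", "2013-09-03", "2019-01-04",
                     "2020-07-28", "2015-10-21", "2017-06-28"))
)

# Within each (group, date) pair, centre the row numbers on their median
# and scale them down; a pair with a single row gets an offset of 0
offsets <- data %>%
  mutate(dodge = (row_number() - median(row_number())) / n() / 3.2,
         .by = c(group, date))

# Group 33: the duplicated first date is split into -0.078125 and 0.078125,
# the two later dates stay at 0
offsets %>% filter(group == 33)
```

The divisor 3.2 only controls how far apart the duplicated points sit on the y axis, so it can be tuned to taste.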
    
