Tags: r, ggplot2, plot, sas, survival-analysis

How to create a swimmer plot with two independent axes, the primary one representing follow-up length and the secondary one calendar time


I want to create a swimmer plot with two independent axes. The primary axis should display bars representing follow-up duration, aligned at x = 0, while the secondary axis should show lines or dots representing calendar time, which don't need to align at x = 0.

Is there an efficient way to plot this using R (using ggplot2) or SAS? I've tried asking ChatGPT and searching online, but most of the answers I found only address connected axes.

Below is sample data generated by ChatGPT.

library(dplyr)

# Example data
data <- data.frame(
  ID = 1:10,
  start_date = as.Date(c('1950-01-01', '1960-02-01', '1970-03-01', '1980-04-01', '1990-05-01',
                         '1955-06-01', '1965-07-01', '1975-08-01', '1985-09-01', '1995-10-01')),
  end_date = as.Date(c('2000-01-01', '2010-02-01', '2020-03-01', '2015-04-01', '2025-05-01',
                       '1995-06-01', '2005-07-01', '2015-08-01', '2020-09-01', '2030-10-01'))
)

# Calculate follow-up length in years
data <- data %>%
  mutate(follow_up_years = as.numeric(difftime(end_date, start_date, units = "days")) / 365.25)

I have googled it on the internet and asked ChatGPT for possible solutions.


Solution

  • A secondary axis is just an inert annotation on the side of your plot. It is up to the user to convert one of the data fields into an appropriately scaled variable in the same units as the primary axis, and to provide the reverse transformation to allow the appropriate labels to appear in the secondary axis.

    In your case, you would have to convert the dates into the number of years since the earliest start date, then tell sec_axis how to convert that number back into a date.
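
    The forward conversion and its inverse can be sketched in isolation. This is a minimal illustration, not part of the plotting code; the helper names date_to_years and years_to_date are made up for this example, and the origin is assumed to be the earliest start date in the sample data.

    ```r
    # Hypothetical helpers for illustration only (names are not from any package)
    origin <- as.Date("1950-01-01")  # assumed: earliest start date in the sample data

    # Forward: calendar date -> years since origin (same units as the follow-up axis)
    date_to_years <- function(d, origin) as.numeric(d - origin) / 365.25

    # Inverse: years since origin -> calendar date (what the secondary axis labels need)
    years_to_date <- function(y, origin) origin + round(y * 365.25)

    x <- date_to_years(as.Date("1960-02-01"), origin)
    back <- years_to_date(x, origin)  # round trip back to the original date
    ```

    Dividing by 365.25 is an approximation of the calendar year, so the inverse rounds to whole days before adding them to the origin.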

    For many longitudinal studies (though not your sample data) this would make the scale of the date axis much more compressed than the follow-up time axis, which would make for a very confusing plot.

    However, for your sample data, you would do something like this:

    library(tidyverse)
    
    data %>%
      mutate(follow_up = as.numeric((end_date - start_date)/365.25),
             start_date_trans = as.numeric((start_date - min(start_date))/365.25),
             end_date_trans = as.numeric((end_date - min(start_date))/365.25)) %>%
      ggplot(aes(y = factor(ID, ID))) +
      geom_col(aes(x = follow_up), width = 0.5, fill = 'gray70') +
      geom_linerange(aes(xmin = start_date_trans, xmax = end_date_trans),
                     color = 'red4', linewidth = 1) +
      scale_x_continuous('Follow up (years)',
                         sec.axis = sec_axis(~ .x * 365.25,
                                             breaks = seq(0, 365.25 * 80, 3652.5),
                                             labels = seq(1950, 2030, 10),
                                             name = "Actual time")) +
      labs(y = 'Participant ID') +
      theme_minimal(16)
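
    As a possible variant of the code above, the secondary axis labels can be computed from the breaks instead of being typed in by hand. This is only a sketch under stated assumptions: the three-row data frame df is made up, year_label is a hypothetical helper, and the secondary axis uses an identity transformation so that its breaks are expressed in years since the earliest start date.

    ```r
    library(ggplot2)

    # Made-up data in the same shape as the question's sample data
    df <- data.frame(
      ID = 1:3,
      start_date = as.Date(c("1950-01-01", "1960-02-01", "1970-03-01")),
      end_date   = as.Date(c("2000-01-01", "2010-02-01", "2020-03-01"))
    )

    origin <- min(df$start_date)  # zero point of the calendar scale

    # Express everything in years so both layers share the primary axis
    df$follow_up <- as.numeric(df$end_date - df$start_date) / 365.25
    df$start_tr  <- as.numeric(df$start_date - origin) / 365.25
    df$end_tr    <- as.numeric(df$end_date - origin) / 365.25

    # Hypothetical label helper: years since origin -> calendar year as text
    year_label <- function(y) format(origin + round(y * 365.25), "%Y")

    p <- ggplot(df, aes(y = factor(ID))) +
      geom_col(aes(x = follow_up), width = 0.5, fill = "gray70") +
      geom_linerange(aes(xmin = start_tr, xmax = end_tr),
                     color = "red4", linewidth = 1) +
      scale_x_continuous("Follow up (years)",
                         sec.axis = sec_axis(~ .x,
                                             breaks = seq(0, 70, 10),
                                             labels = year_label,
                                             name = "Actual time")) +
      labs(y = "Participant ID")
    ```

    Printing p draws the plot. Because the labels are derived from origin, changing the data only requires adjusting the breaks, not retyping the years.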
    


    From a data visualization perspective, a dual-axis plot like this is difficult to understand and likely to confuse your audience. Two separate plots seem a much better idea:

    data <- data %>%
      mutate(follow_up = as.numeric((end_date - start_date)/365.25),
             ID = fct_reorder(factor(ID), follow_up, .desc = TRUE))
    
    ggplot(data, aes(follow_up, ID)) +
      geom_col(width = 0.5, fill = 'gray70') +
      theme_minimal(16) +
      labs(x = 'Years of follow-up', y = 'Participant ID')
    
    ggplot(data, aes(start_date, ID)) +
      geom_linerange(aes(xmin = start_date, xmax = end_date), 
                     linewidth = 8, color = 'gray70') +
      theme_minimal(16) +
      labs(x = 'Date of participation', y = 'Participant ID')
    

    Figure: Follow-up time

    Figure: Participation timeline