rggplot2tidy

R Graph | Counting per Month by Start and End Dates


Having trouble developing code that will make a graph to count per month the number of employees in the company by the Original Hire Date and Termination Date:

structure(list(original_hire_date = c("7/8/2019", "7/15/2019", 
"7/29/2019", "8/5/2019", "8/12/2019", "8/19/2019", "8/26/2019", 
"8/26/2019", "8/26/2019", "9/3/2019", "9/9/2019", "9/9/2019", 
"10/8/2019", "9/30/2019", "9/30/2019", "9/30/2019", "9/30/2019", 
"9/30/2019", "9/30/2019", "9/30/2019", "9/30/2019", "9/30/2019", 
"9/30/2019", "10/14/2019", "10/28/2019"), termination_date = c(NA, 
NA, NA, "8/21/2020", NA, "6/30/2020", NA, "7/25/2020", NA, NA, 
NA, NA, "8/21/2020", "6/30/2020", NA, "6/30/2020", NA, "6/30/2020", 
"6/5/2020", "6/30/2020", "6/30/2020", NA, "3/2/2020", "8/27/2021", 
NA)), row.names = c(NA, -25L), class = c("tbl_df", "tbl", "data.frame"
))

The goal is to create a graph by Year and Month (ex. July 2019) and the count of employees during that time. Since some employees have not left, it is really subtracting and adding employees over time from when they start or left the company.

Example of the Graph: enter image description here


Solution

  • Here, I reshape long, count hires as +1 and terminations as -1, summarize by date, then calc the count as the cumulative sum of hires and terminations. (One possible variation might be to adjust the date on terminations to be one day later -- for example if someone worked a single day, we might think of them being +1 in the morning and -1 after the day, so they contributed +1 for the day, not zero. I have not made that adjustment here.)

    library(tidyverse)
    df1 |>
      pivot_longer(1:2) %>%
      mutate(change = if_else(name == "original_hire_date", 1, -1),
             date = lubridate::mdy(value)) %>%
      filter(!is.na(date)) |>
      arrange(date) |>
      count(date, wt = change, name = "change") |>
      complete(date = seq.Date(min(date), max(date), by = 1),  # to fill in all days,
               fill = list(change = 0)) |> # so count doesn't drift between observations
      mutate(count = cumsum(change)) |>
      ggplot(aes(date, count)) +
      geom_line()
    

    enter image description here