rdplyrdummy-variable

analyzing set of many dummy columns based on date ranges


I am analyzing staff turnover and need to create (1) a count of the number of staff hired and exited in a given year and (2) compute a cumulative "total staff" count across years. I have hire and exit dates like this:

ssh<-structure(list(HireDate = structure(c(1358, 4291, 5121, 6923, 9678, 12037, 16353, 17003, 18976, 19312, 19312, 19011), class = "Date"),     ExitDate = structure(c(15861, 15401, 17140, 17347, NA, NA,     16911, 18856, 19193, NA, NA, NA), class = "Date"), id = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12")), row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"))

ssh$hireyear<-lubridate::year(ssh$HireDate)
ssh$exityear<-lubridate::year(ssh$ExitDate)

ssh$group<-c("a","b","c","a","","b","a","","","b","b","c")

For the simple accounting of hires and exits, I'd like to have a dummy variable for EACH year. So for the data above, if staff was hired in 2014, create a new column hired2014 equal to 1, else 0, like this:

 ssh$hire1984<-ifelse(ssh$hireyear==1984,1,0)

or

 ssh$exit2012<-ifelse(ssh$exityear==2012,1,0)

My full dataset ranges between 1972 and 2023, so I'd like an efficient method to compute all possible variables for any date range. This would yield a dataframe with many columns -- one for each year.

Next, I'd like to format the resulting dataframe by year, something like this:

Year  NumberHired  NumberExited  NetChange  CurrentTotal
1972  4            0             4          4
1973  2            1             1          5
1974  3            4             -1         4
.
.
2010  25           11            14         541   ...etc

I experimented creating a lookup table of years that I might populate with an aggregation of the dummy variables but am hitting a wall. Also, other solutions I've found on stackoverflow tend to focus on creating only one dummy variable.

Ideas? Thx!


Solution

  • We can use tidyr::pivot_wider to make your dummies:

    library(tidyr)
    library(dplyr)
    
    ssh |>
      mutate(dummy = 1) |>
      pivot_wider(
        names_from = hireyear, 
        values_from = dummy,
        names_prefix = "hired",
        values_fill = 0
      ) |>
      mutate(dummy = 1) |> 
      pivot_wider(
        names_from = exityear, 
        values_from = dummy,
        names_prefix = "exit",
        values_fill = 0
      )
    # # A tibble: 12 × 20
    #    HireDate   ExitDate   id    hired1973 hired1981 hired1984 hired1988 hired1996 hired2002
    #    <date>     <date>     <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
    #  1 1973-09-20 2013-06-05 1             1         0         0         0         0         0
    #  2 1981-10-01 2012-03-02 2             0         1         0         0         0         0
    #  3 1984-01-09 2016-12-05 3             0         0         1         0         0         0
    #  4 1988-12-15 2017-06-30 4             0         0         0         1         0         0
    #  5 1996-07-01 NA         5             0         0         0         0         1         0
    #  6 2002-12-16 NA         6             0         0         0         0         0         1
    #  7 2014-10-10 2016-04-20 7             0         0         0         0         0         0
    #  8 2016-07-21 2021-08-17 8             0         0         0         0         0         0
    #  9 2021-12-15 2022-07-20 9             0         0         0         0         0         0
    # 10 2022-11-16 NA         10            0         0         0         0         0         0
    # 11 2022-11-16 NA         11            0         0         0         0         0         0
    # 12 2022-01-19 NA         12            0         0         0         0         0         0
    # # ℹ 11 more variables: hired2014 <dbl>, hired2016 <dbl>, hired2021 <dbl>, hired2022 <dbl>,
    # #   exit2013 <dbl>, exit2012 <dbl>, exit2016 <dbl>, exit2017 <dbl>, exitNA <dbl>,
    # #   exit2021 <dbl>, exit2022 <dbl>
    

    And the summary table is fairly easily computed from your original data:

    hires = count(ssh, hireyear, name = "NumberHired") |>
      rename(year = hireyear)
    exits = count(ssh, exityear, name = "NumberExited", ) |>
      rename(year = exityear) |>
      filter(!is.na(year))
    full_join(hires, exits, by = "year") %>%
      replace(is.na(.), 0) |>
      arrange(year) |>
      mutate(
        NetChange = NumberHired - NumberExited,
        CurrentTotal = cumsum(NetChange)
      )
    # # A tibble: 13 × 5
    #     year NumberHired NumberExited NetChange CurrentTotal
    #    <dbl>       <int>        <int>     <int>        <int>
    #  1  1973           1            0         1            1
    #  2  1981           1            0         1            2
    #  3  1984           1            0         1            3
    #  4  1988           1            0         1            4
    #  5  1996           1            0         1            5
    #  6  2002           1            0         1            6
    #  7  2012           0            1        -1            5
    #  8  2013           0            1        -1            4
    #  9  2014           1            0         1            5
    # 10  2016           1            2        -1            4
    # 11  2017           0            1        -1            3
    # 12  2021           1            1         0            3
    # 13  2022           3            1         2            5