I am analyzing staff turnover and need to create (1) a count of the number of staff hired and exited in a given year and (2) compute a cumulative "total staff" count across years. I have hire and exit dates like this:
ssh<-structure(list(HireDate = structure(c(1358, 4291, 5121, 6923, 9678, 12037, 16353, 17003, 18976, 19312, 19312, 19011), class = "Date"), ExitDate = structure(c(15861, 15401, 17140, 17347, NA, NA, 16911, 18856, 19193, NA, NA, NA), class = "Date"), id = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12")), row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"))
ssh$hireyear<-lubridate::year(ssh$HireDate)
ssh$exityear<-lubridate::year(ssh$ExitDate)
ssh$group<-c("a","b","c","a","","b","a","","","b","b","c")
For the simple accounting of hires and exits, I'd like to have a dummy variable for EACH year. So for the data above, if staff was hired in 2014, create a new column hired2014
equal to 1, else 0, like this:
ssh$hire1984<-ifelse(ssh$hireyear==1984,1,0)
or
ssh$exit2012<-ifelse(ssh$exityear==2012,1,0)
My full dataset ranges between 1972 and 2023, so I'd like an efficient method to compute all possible variables for any date range. This would yield a dataframe with many columns -- one for each year.
Next, I'd like to format the resulting dataframe by year, something like this:
Year NumberHired NumberExited NetChange CurrentTotal
1972 4 0 4 4
1973 2 1 1 5
1974 3 4 -1 4
.
.
2010 25 11 14 541 ...etc
I experimented creating a lookup table of years that I might populate with an aggregation of the dummy variables but am hitting a wall. Also, other solutions I've found on stackoverflow tend to focus on creating only one dummy variable.
Ideas? Thx!
We can use tidyr::pivot_wider
to make your dummies:
library(tidyr)
library(dplyr)
ssh |>
mutate(dummy = 1) |>
pivot_wider(
names_from = hireyear,
values_from = dummy,
names_prefix = "hired",
values_fill = 0
) |>
mutate(dummy = 1) |>
pivot_wider(
names_from = exityear,
values_from = dummy,
names_prefix = "exit",
values_fill = 0
)
# # A tibble: 12 × 20
# HireDate ExitDate id hired1973 hired1981 hired1984 hired1988 hired1996 hired2002
# <date> <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1973-09-20 2013-06-05 1 1 0 0 0 0 0
# 2 1981-10-01 2012-03-02 2 0 1 0 0 0 0
# 3 1984-01-09 2016-12-05 3 0 0 1 0 0 0
# 4 1988-12-15 2017-06-30 4 0 0 0 1 0 0
# 5 1996-07-01 NA 5 0 0 0 0 1 0
# 6 2002-12-16 NA 6 0 0 0 0 0 1
# 7 2014-10-10 2016-04-20 7 0 0 0 0 0 0
# 8 2016-07-21 2021-08-17 8 0 0 0 0 0 0
# 9 2021-12-15 2022-07-20 9 0 0 0 0 0 0
# 10 2022-11-16 NA 10 0 0 0 0 0 0
# 11 2022-11-16 NA 11 0 0 0 0 0 0
# 12 2022-01-19 NA 12 0 0 0 0 0 0
# # ℹ 11 more variables: hired2014 <dbl>, hired2016 <dbl>, hired2021 <dbl>, hired2022 <dbl>,
# # exit2013 <dbl>, exit2012 <dbl>, exit2016 <dbl>, exit2017 <dbl>, exitNA <dbl>,
# # exit2021 <dbl>, exit2022 <dbl>
And the summary table is fairly easily computed from your original data:
hires = count(ssh, hireyear, name = "NumberHired") |>
rename(year = hireyear)
exits = count(ssh, exityear, name = "NumberExited", ) |>
rename(year = exityear) |>
filter(!is.na(year))
full_join(hires, exits, by = "year") %>%
replace(is.na(.), 0) |>
arrange(year) |>
mutate(
NetChange = NumberHired - NumberExited,
CurrentTotal = cumsum(NetChange)
)
# # A tibble: 13 × 5
# year NumberHired NumberExited NetChange CurrentTotal
# <dbl> <int> <int> <int> <int>
# 1 1973 1 0 1 1
# 2 1981 1 0 1 2
# 3 1984 1 0 1 3
# 4 1988 1 0 1 4
# 5 1996 1 0 1 5
# 6 2002 1 0 1 6
# 7 2012 0 1 -1 5
# 8 2013 0 1 -1 4
# 9 2014 1 0 1 5
# 10 2016 1 2 -1 4
# 11 2017 0 1 -1 3
# 12 2021 1 1 0 3
# 13 2022 3 1 2 5