I have a dataset with ~ 330 000 rows. Each observation represents a period where an individual recieved a welfare benefit called "care allowance". The benefit is meant to replace income when the recipient has to be absent from work in order to care for their child full-time due to serious illness or to accompany them to a specialist healthcare institution.
There was a change in legislation regarding the welfare benefit in 2017, and one of my research questions concerns changes in the size and composition of the recipient population. My dataset contains information regarding regarding each case of benefit reception from Jan 1st 2016 to Dec 31 2021.
I want to portray the development in the amount of work days that have been compensated by the care allowance scheme over time. In many cases a period of care allowance reception can span years. I want to count the number of business days (e.g monday through friday) in the interval from the start date and end date of the reception period that falls within each of the years from 2016 to 2021.
I am only able to get the count of ordinary days for each year. I would be very appreciative of suggestions on how to modify my code so that df$bdays == df$days
and the vars(days16:days21)
count the number of business days instead.
Update
@Marcus' suggestion works well enough on a small dataset, but takes an unwieldy amount of time to execute on my larger dataset (over an hour and a half). I've come up with a solution using purrr::map2_dbl()
Original code:
library(bizdays)
library(lubridate)
library(dplyr)
id <- sort(sample(1:100, 1000, replace = T))
start_date <- sample(seq(ymd("2016-01-01"), ymd("2021-12-30"), by="day"), 1000)
end_date <- sample(seq(ymd("2016-01-01"), ymd("2021-12-31"), by="day"), 1000)
df <- data.frame(id, start_date, end_date) %>%
filter(end_date > start_date) %>%
mutate(interval = interval(start = start_date, end = end_date))
df <- df %>%
mutate(days16 = as.period(intersect(interval, interval(ymd("2016-01-01"), ymd("2016-12-31"))))%/%days(1),
days17 = as.period(intersect(interval, interval(ymd("2017-01-01"), ymd("2017-12-31"))))%/%days(1),
days18 = as.period(intersect(interval, interval(ymd("2018-01-01"), ymd("2018-12-31"))))%/%days(1),
days19 = as.period(intersect(interval, interval(ymd("2019-01-01"), ymd("2019-12-31"))))%/%days(1),
days20 = as.period(intersect(interval, interval(ymd("2020-01-01"), ymd("2020-12-31"))))%/%days(1),
days21 = as.period(intersect(interval, interval(ymd("2021-01-01"), ymd("2021-12-31"))))%/%days(1))
df[is.na(df)] <- 0
cal <- create.calendar(name = "mycal", weekdays=c("saturday", "sunday"))
df <- df %>%
mutate(days = days16 + days17 + days18 + days19 + days20 + days21) %>%
mutate(bdays = bizdays(start_date, end_date, cal)) %>%
arrange(id, start_date)
head(df, n = 10)
#> id start_date end_date interval days16 days17 days18
#> 1 1 2016-03-15 2017-04-20 2016-03-15 UTC--2017-04-20 UTC 289 110 0
#> 2 1 2016-07-10 2018-12-14 2016-07-10 UTC--2018-12-14 UTC 173 364 347
#> 3 1 2018-03-06 2021-01-11 2018-03-06 UTC--2021-01-11 UTC 0 0 298
#> 4 1 2018-09-01 2019-04-21 2018-09-01 UTC--2019-04-21 UTC 0 0 121
#> 5 2 2016-04-27 2019-04-28 2016-04-27 UTC--2019-04-28 UTC 247 364 364
#> 6 2 2016-08-13 2019-09-10 2016-08-13 UTC--2019-09-10 UTC 139 364 364
#> 7 2 2016-10-03 2017-10-05 2016-10-03 UTC--2017-10-05 UTC 88 277 0
#> 8 2 2018-05-12 2018-07-17 2018-05-12 UTC--2018-07-17 UTC 0 0 65
#> 9 2 2019-08-29 2021-10-11 2019-08-29 UTC--2021-10-11 UTC 0 0 0
#> 10 2 2019-10-08 2020-08-05 2019-10-08 UTC--2020-08-05 UTC 0 0 0
#> days19 days20 days21 days bdays
#> 1 0 0 0 399 287
#> 2 0 0 0 884 634
#> 3 364 364 10 1036 744
#> 4 111 0 0 232 164
#> 5 118 0 0 1093 782
#> 6 252 0 0 1119 801
#> 7 0 0 0 365 263
#> 8 0 0 0 65 46
#> 9 123 364 283 770 552
#> 10 83 217 0 300 216
Created on 2022-09-30 by the reprex package (v2.0.1)
I would apply the bizdays
function rowwise
to each entry (warning this might take a while to run). This allows you to take either the start/end date or the start/end of the year to define the arguments to bizdays
. Also move up your definition of calendar, setting it to financial = FALSE
. Otherwise, if the last day of the year falls on a workday (when tallying the years), it won't get counted.
cal <- create.calendar(name = "mycal", weekdays=c("saturday", "sunday"), financial = FALSE)
df <- df %>%
rowwise() |>
mutate(
days16 = bizdays(max(start_date, ymd("2016-01-01")), min(end_date, ymd("2016-12-31")), cal),
days17 = bizdays(max(start_date, ymd("2017-01-01")), min(end_date, ymd("2017-12-31")), cal),
days18 = bizdays(max(start_date, ymd("2018-01-01")), min(end_date, ymd("2018-12-31")), cal),
days19 = bizdays(max(start_date, ymd("2019-01-01")), min(end_date, ymd("2019-12-31")), cal),
days20 = bizdays(max(start_date, ymd("2020-01-01")), min(end_date, ymd("2020-12-31")), cal),
days21 = bizdays(max(start_date, ymd("2021-01-01")), min(end_date, ymd("2021-12-31")), cal)
)
df[is.na(df) | df < 0] <- 0
df <- df %>%
mutate(days = days16 + days17 + days18 + days19 + days20 + days21) %>%
mutate(bdays = bizdays(start_date, end_date, cal)) %>%
arrange(id, start_date)
df |>
as.data.frame() |>
head(n = 10)
#> id start_date end_date interval days16 days17 days18
#> 1 1 2017-02-06 2017-04-03 2017-02-06 UTC--2017-04-03 UTC 0 41 0
#> 2 1 2017-07-18 2018-05-27 2017-07-18 UTC--2018-05-27 UTC 0 119 105
#> 3 1 2019-02-06 2019-12-26 2019-02-06 UTC--2019-12-26 UTC 0 0 0
#> 4 1 2019-04-29 2020-02-15 2019-04-29 UTC--2020-02-15 UTC 0 0 0
#> 5 2 2016-01-07 2018-08-05 2016-01-07 UTC--2018-08-05 UTC 257 260 155
#> 6 2 2016-02-22 2016-11-17 2016-02-22 UTC--2016-11-17 UTC 194 0 0
#> 7 2 2016-12-04 2021-05-19 2016-12-04 UTC--2021-05-19 UTC 20 260 261
#> 8 2 2018-08-28 2020-09-26 2018-08-28 UTC--2020-09-26 UTC 0 0 90
#> 9 3 2016-10-21 2017-10-24 2016-10-21 UTC--2017-10-24 UTC 51 212 0
#> 10 3 2017-02-08 2021-07-04 2017-02-08 UTC--2021-07-04 UTC 0 233 261
#> days19 days20 days21 days bdays
#> 1 0 0 0 41 41
#> 2 0 0 0 224 224
#> 3 232 0 0 232 232
#> 4 177 33 0 210 210
#> 5 0 0 0 672 672
#> 6 0 0 0 194 194
#> 7 261 262 99 1163 1163
#> 8 261 193 0 544 544
#> 9 0 0 0 263 263
#> 10 261 262 131 1148 1148
all(df3$days == df3$bdays)
#> [1] TRUE