I have some events with specific dates and date ranges that should be used to add a new value to the events based on which date range they occurred. I know I can accomplish this with the code below, but it does not scale well. It seems like there should be a way to vectorize it.
Here's the representative data
events
#> event_name event_date
#> 1 event a 2012-05-23
#> 2 event b 2018-03-12
#> 3 event c 2019-08-17
#> 4 event d 2022-01-03
leader_tenure
#> leader start end
#> 1 Manning 2012-04-09 2017-12-30
#> 2 Hale 2017-12-30 2019-11-09
#> 3 Whitney 2019-11-10 2022-05-30
Here's the working but inefficent and difficult-to-scale code
library(dplyr)
library(lubridate)
events <- data.frame(
event_name = c("event a", "event b", "event c", "event d"),
event_date = ymd(c("2012-05-23", "2018-03-12", "2019-08-17", "2022-01-03"
))
)
leader_tenure <- data.frame(
leader = c("Manning", "Hale", "Whitney"),
start = ymd(c("2012-04-09", "2017-12-30", "2019-11-10")),
end = ymd(c("2017-12-30", "2019-11-09", "2022-05-30"))
)
events |>
mutate(
leader = case_when(
event_date %within% interval(ymd("2012-04-09"), ymd("2017-12-30")) ~ "Manning",
event_date %within% interval(ymd("2017-12-30"), ymd("2019-11-09")) ~ "Hale",
event_date %within% interval(ymd("2019-11-10"), ymd("2022-05-30")) ~ "Whitney"
)
)
#> event_name event_date leader
#> 1 event a 2012-05-23 Manning
#> 2 event b 2018-03-12 Hale
#> 3 event c 2019-08-17 Hale
#> 4 event d 2022-01-03 Whitney
This is a good use case for a join, which is an efficient way to relate data between two tables. Since v1.1.0 in early 2023, dplyr has supported a variety of non-equi joins, including "overlap joins" like this. https://dplyr.tidyverse.org/reference/join_by.html
You might want to think about what to do in case multiple leaders could be matched to a given event. The code below would add a row for each. But if that situation suggests a data quality problem, you could specify relationship = "one-to-one"
within left_join
to make sure it produces an error so you can fix it before you continue with other steps. Or if it's ok that multiple leaders could match, but you only want want to show one leader for each event, you could specify multiple = "first"
or multiple = "any"
to get only one for each event. (Or if there's a heuristic you want to apply to pick one, you could join to all of them, then use arrange
+filter
or slice_min
/slice_max
to pick the best.)
events |>
left_join(leader_tenure, join_by(event_date |> between(start, end)))
event_name event_date leader start end
1 event a 2012-05-23 Manning 2012-04-09 2017-12-30
2 event b 2018-03-12 Hale 2017-12-30 2019-11-09
3 event c 2019-08-17 Hale 2017-12-30 2019-11-09
4 event d 2022-01-03 Whitney 2019-11-10 2022-05-30