I'm trying to take data from a nearly-weekly snapshot [some weeks are missing due to holidays, server issues, etc.] and determine the number of service dates between one snapshot and the next. I have a calendar of all dates of service to reference, but I have no idea where to begin with finding the number of dates in the calendar between one data snapshot and the next. I also need to determine the number of days the service actually ran [some only run Monday-Wednesday, for example] so I can calculate the total mileage each route drove during the period of the study.
I have a minimum dataset below. The actual dataset is about half a million rows for this fiscal year.
### Minimum Working Example
## Required Packages
library(dplyr)
library(lubridate)
library(tidyr)
## Setup the calendar of dates where we need to track mileage
calendar_mwe <-
tibble(date = seq.Date(ymd(20240520), ymd(20240801), by = "day")) %>%
mutate(day = weekdays(date),
summer = case_when(date <= "2024-06-16" ~ 0, TRUE ~ 1)) %>%
filter(!day %in% c("Saturday", "Sunday"),
!date %in% c(
ymd(20240527), # Memorial Day
seq.Date(ymd(20240617), ymd(20240621), by = "day"), # Break week
ymd(20240704) # Independence Day
))
## Setup a working example of raw data
# Set rand seed
set.seed(12345)
# Create raw route data
runroute_mwe <-
tibble(
route_id = rep(100:119, 4),
# Dates the snapshots were taken, including some missing weeks
dates = rep(c(ymd(20240521), ymd(20240528), ymd(20240618), ymd(20240702)), each = 20),
# Distance the route traveled
route_dist = rep(runif(20, min = 0.5, max = 12), 4),
# Number of days that route was in service [assume 1 = Mon, 2 = Mon-Tue, 3 = Mon-Wed etc.]
route_days = rep(sample.int(5, size = 20, replace = TRUE, prob = c(.1, .1, .1, .1, .6)), 4)
) %>%
mutate( # Set a change to mileage based on dates to mimic actual data
route_dist = case_when(
dates == min(dates) ~ route_dist + runif(1, min = 1, max = 2),
dates == max(dates) ~ route_dist - runif(1, min = 1, max = 2),
TRUE ~ route_dist
)
)
> head(calendar_mwe)
# A tibble: 6 × 3
date day summer
<date> <chr> <dbl>
1 2024-05-20 Monday 0
2 2024-05-21 Tuesday 0
3 2024-05-22 Wednesday 0
4 2024-05-23 Thursday 0
5 2024-05-24 Friday 0
6 2024-05-28 Tuesday 0
> head(runroute_mwe)
# A tibble: 6 × 4
route_id dates route_dist route_days
<int> <date> <dbl> <int>
1 100 2024-05-21 10.6 5
2 101 2024-05-21 12.4 5
3 102 2024-05-21 11.0 1
4 103 2024-05-21 12.5 4
5 104 2024-05-21 7.53 3
6 105 2024-05-21 4.20 5
From here, I can determine the number of days between each snapshot as follows:
## Number of days between snapshots
runroute_daysbtwn <- runroute_mwe %>%
nest(.by = dates) %>%
mutate(days_between = difftime(lead(dates, 1), dates, units = "day"))
> head(runroute_daysbtwn)
# A tibble: 4 × 3
dates data days_between
<date> <list> <drtn>
1 2024-05-21 <tibble [20 × 3]> 7 days
2 2024-05-28 <tibble [20 × 3]> 21 days
3 2024-06-18 <tibble [20 × 3]> 14 days
4 2024-07-02 <tibble [20 × 3]> NA days
This is great-ish: It gives me the number of calendar days between each snapshot, minus the last snapshot [which is fine, since we don't have another snapshot after that one yet]. What it doesn't give me is the number of service days as defined by calendar_mwe
.
So, this morning, I thought to try something else:
## Snapshot start and end dates
runroute_datesbtwn <- runroute_mwe %>%
nest(.by = dates) %>%
mutate(end_date = as.Date(lead(dates) - days(1)),
end_date = as.Date(ifelse(is.na(end_date), dates, end_date)))
## Number of dates in calendar between the start and end dates
for (i in 1:nrow(runroute_datesbtwn)) {
runroute_calendar <- calendar_mwe %>%
filter(between(date, runroute_datesbtwn$dates[i], runroute_datesbtwn$end_date[i])) %>%
nrow()
runroute_datesbtwn$dates_between[i] <- runroute_calendar
}
> head(runroute_datesbtwn)
# A tibble: 4 × 4
dates data end_date dates_between
<date> <list> <date> <int>
1 2024-05-21 <tibble [20 × 3]> 2024-05-27 4
2 2024-05-28 <tibble [20 × 3]> 2024-06-17 14
3 2024-06-18 <tibble [20 × 3]> 2024-07-01 6
4 2024-07-02 <tibble [20 × 3]> 2024-07-02 1
This works well and gives me the number of service dates between each snapshot. The unnest()
ed version of runroute_datesbtwn
now looks like this:
> head(unnest(runroute_datesbtwn, everything()))
# A tibble: 6 × 6
dates route_id route_dist route_days end_date dates_between
<date> <int> <dbl> <int> <date> <int>
1 2024-05-21 100 10.6 5 2024-05-27 4
2 2024-05-21 101 12.4 5 2024-05-27 4
3 2024-05-21 102 11.0 1 2024-05-27 4
4 2024-05-21 103 12.5 4 2024-05-27 4
5 2024-05-21 104 7.53 3 2024-05-27 4
6 2024-05-21 105 4.20 5 2024-05-27 4
Okay, so now I have the run/route information, a calendar, the start and end dates for each snapshot, and the number of service dates between each snapshot. If every run/route ran Monday-Friday, this would be all I'd need to calculate total mileage driven for each run and route. The problem comes in with one of the initial columns in runroute_mwe
: route_days
.
Here, we assume that if route_days = 1
then the run only operated on Mondays, if route_days = 2
then the run operated Monday-Tuesday, if route_days = 3
then the run operated Monday-Wednesday, and so forth barring dates in the calendar where no runs occurred.
For the first snapshot, it's easy to say that I just subtract 1 from route_days
when I go to calculate a new column of total distance traveled [e.g. total_dist = (route_days - 1) * route_dist
] since the dates between the two snapshots are Tuesday-Friday. This won't work for any of the other snapshots, however, so I feel stuck.
I feel like I need to find some way to get the number of each weekday between the snapshots, but I'm not sure how to go about that. I'd imagine a df with additional <int>
columns for a count of each weekday [M-F] which would allow me to do some ridiculous case_when()
statement to make my final calculations, though that all seems rather cumbersome.
EDIT: Re-wrote question, MWE, toy data, and things I've tried to hopefully be more clear with what my problem is and where I need assistance. This morning I was able to make some headway on the initial problem, but it's still not solved.
Okay, so I managed to solve this question for myself. I did end up going the route I indicated in my last paragraph above. See code below for my solution. If anyone has a better solution for my problem, please let me know.
## Additional package requirement
library(fuzzyjoin)
## Snapshot start and end dates
runroute_datesbtwn_temp <- runroute_mwe %>%
nest(.by = dates) %>%
mutate(end_date = as.Date(lead(dates) - days(1)),
end_date = as.Date(ifelse(is.na(end_date), dates, end_date))) %>%
select(start_date = dates, end_date)
## Fuzzy join calendar_mwe to runroute_datesbtwn
runroute_calendar <- calendar_mwe %>%
# This gives me the calendar with identified start_date and end_date for each snapshot in runroute_mwe
fuzzy_left_join(
runroute_datesbtwn_temp,
by = c(
"date" = "start_date",
"date" = "end_date"
),
match_fun = list(`>=`, `<=`)
) %>%
# Filter out the null values
filter(!is.na(start_date)) %>%
# Create a column for each day
mutate(
mon = ifelse(day == "Monday", 1, 0),
tue = ifelse(day == "Tuesday", 1, 0),
wed = ifelse(day == "Wednesday", 1, 0),
thu = ifelse(day == "Thursday", 1, 0),
fri = ifelse(day == "Friday", 1, 0)
) %>%
group_by(start_date, end_date, summer) %>%
# Determine the number of that weekday occurring in each snapshot
summarise(
mon = sum(mon),
tue = sum(tue),
wed = sum(wed),
thu = sum(thu),
fri = sum(fri),
.groups = "drop"
)
> head(runroute_calendar)
# A tibble: 4 × 8
start_date end_date summer mon tue wed thu fri
<date> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2024-05-21 2024-05-27 0 0 1 1 1 1
2 2024-05-28 2024-06-17 0 2 3 3 3 3
3 2024-06-18 2024-07-01 1 2 1 1 1 1
4 2024-07-02 2024-07-02 1 0 1 0 0 0
From here, it was a simple case of a quick join, unnest, and mutate to get the data I actually needed:
runroute_datesbtwn <- runroute_datesbtwn %>%
# Join in the new calendar with the count of weekdays per snapshot
left_join(
runroute_calendar,
by = c(
"dates" = "start_date",
"end_date" = "end_date"
)
) %>%
unnest(everything()) %>%
mutate(
# Calculate the actual number of days the route ran
# This was the initial problem I was trying to solve here
route_days_tot = case_when(
route_days == 5 ~ select(., mon:fri) %>% rowSums(na.rm = TRUE),
route_days == 4 ~ select(., mon:thu) %>% rowSums(na.rm = TRUE),
route_days == 3 ~ select(., mon:wed) %>% rowSums(na.rm = TRUE),
route_days == 2 ~ select(., mon:tue) %>% rowSums(na.rm = TRUE),
route_days == 1 ~ select(., mon) %>% rowSums(na.rm = TRUE)
),
# Calculate the total mileage
total_mileage = route_days_tot * route_dist
) %>%
# Select columns for final output
select(
snapshot_date = dates,
route_id,
route_dist,
route_days,
total_mileage
)
> head(runroute_datesbtwn)
# A tibble: 6 × 6
snapshot_startdate snapshot_enddate route_id route_dist route_days total_mileage
<date> <date> <int> <dbl> <int> <dbl>
1 2024-05-21 2024-05-27 100 10.6 5 42.3
2 2024-05-21 2024-05-27 101 12.4 5 49.4
3 2024-05-21 2024-05-27 102 11.0 1 0
4 2024-05-21 2024-05-27 103 12.5 4 37.4
5 2024-05-21 2024-05-27 104 7.53 3 15.1
6 2024-05-21 2024-05-27 105 4.20 5 16.8