I have a dataset with fiscal year, customer ID, customer type, entrance dates, and release dates, and I'm trying to create a dataframe with daily summary info of counts of active customers for each customer type. The reason I'm working this is as an intermediary step prior to doing some time series analysis to predict future states.
#### SETUP ----
library(tidyverse)
#### Minimum Working Example DF ----
set.seed(789456)
mwe_2020 <-
# Create initial table with year, ID, customer type, and entrance date
tibble(
fy_year = rep(2020, 4000),
cus_id = sample(10:100000, 4000, replace = FALSE),
cus_gr = sample(0:12, 4000, replace = TRUE),
entrance_date = as_date("2019-07-01")
)
mwe_release <-
# Create a dataframe of dates between the entrance date and the end of the FY
tibble(
release_date = seq.Date(as_date("2019-07-01"), as_date("2020-06-30"), "day")
) %>%
mutate(
release_day = weekdays(release_date)
) %>%
filter(
# Remove weekends as no releases are processed on the weekend
!grepl("Sat|Sun", release_day)
)
mwe_release_sample <- mwe_2020 %>%
# Take a sample of 1000 records to apply a random release date to
slice_sample(n = 1000) %>%
mutate(
release_date = sample(mwe_release$release_date, 1000, replace = TRUE)
) %>%
select(
cus_id,
release_date
)
mwe_2020 <- mwe_2020 %>%
# Join the two dfs to get the final MWE data
left_join(mwe_release_sample, by = "cus_id")
mwe_2021 <-
# Create initial table with year, ID, customer type, and entrance date
tibble(
fy_year = rep(2021, 4000),
cus_id = sample(10:100000, 4000, replace = FALSE),
cus_gr = sample(0:12, 4000, replace = TRUE),
entrance_date = as_date("2020-07-01")
)
mwe_release <-
# Create a dataframe of dates between the entrance date and the end of the FY
tibble(
release_date = seq.Date(as_date("2020-07-01"), as_date("2021-06-30"), "day")
) %>%
mutate(
release_day = weekdays(release_date)
) %>%
filter(
# Remove weekends as no releases are processed on the weekend
!grepl("Sat|Sun", release_day)
)
mwe_release_sample <- mwe_2021 %>%
# Take a sample of 1000 records to apply a random release date to
slice_sample(n = 1000) %>%
mutate(
release_date = sample(mwe_release$release_date, 1000, replace = TRUE)
) %>%
select(
cus_id,
release_date
)
mwe_2021 <- mwe_2021 %>%
# Join the two dfs to get the final MWE data
left_join(mwe_release_sample, by = "cus_id")
rm(mwe_release, mwe_release_sample)
mwe <- bind_rows(mwe_2020, mwe_2021)
rm(mwe_2020, mwe_2021)
So, you can see with the above data, we have a total of 8,000 rows spanning two fiscal years. The actual data is something like 500,000 rows spanning five fiscal years.
I was able to get what I'm looking for to work for a single fiscal year with the following:
#### Output ----
mwe_final <-
tibble(
cus_gr = "",
key_date = "",
n_cus = ""
) %>%
mutate(
cus_gr = as.integer(cus_gr),
key_date = as_date(key_date),
n_cus = as.integer(n_cus)
)
for (i in seq.Date(as_date("2019-07-01"), as_date("2020-06-30"), "day")) {
mwe_active <- mwe %>%
filter(
fy_year == 2020,
is.na(release_date) | release_date >= as_date(i)
) %>%
mutate(key_date = as_date(i)) %>%
group_by(cus_gr, key_date) %>%
summarise(n_cus = n(), .groups = "drop")
mwe_final <- mwe_final %>%
bind_rows(mwe_active)
}
This gave me the expected output of one row per cus_gr
per day in the selected fiscal year:
> head(mwe_final)
# A tibble: 6 × 3
cus_gr key_date n_cus
<int> <date> <int>
1 NA NA NA
2 0 2019-09-03 303
3 1 2019-09-03 292
4 2 2019-09-03 295
5 3 2019-09-03 317
6 4 2019-09-03 291
> tail(mwe_final)
# A tibble: 6 × 3
cus_gr key_date n_cus
<int> <date> <int>
1 7 2020-06-30 244
2 8 2020-06-30 227
3 9 2020-06-30 225
4 10 2020-06-30 233
5 11 2020-06-30 235
6 12 2020-06-30 215
I'm not sure how to loop over each day in each fiscal year. My expected output should look like the above, but instead of dates 2019-07-01
through 2020-06-30
, I need the script to run to present day over each fiscal year in the dataset. I'm trying to make this script as hands-off as possible so it can be run anytime with loading in the new data and looping from the beginning of the dataset and earliest fiscal year through the end of the dataset to whatever the current day is or the end of the most recent fiscal year.
I'm guessing I'll need to create a new table with the fiscal years and dates something like the following:
# Fiscal years and dates
fy_dates <- tibble(
fy = 2020:2025,
start = c("2019-07-01", "2020-07-01", "2021-07-01", "2022-07-01", "2023-07-01", "2024-07-01")
end = c("2020-06-30", "2021-06-30", "2022-06-30", "2023-06-30", "2024-06-30", "2025-06-30")
But I'm not sure how to combine this new information into the for
loop itself to make sure it loops correctly. I'd love some pointers here if anyone can help me out.
Avoid calling bind_rows
inside a loop which can cause excessive copying. Instead, build a list of tibbles using lapply
inside a call to Map
(elementwise loop and wrapper of mapply
) using your fiscal years tibble. Then, call bind_rows
outside of loop. Of course, there are purrr
synonyms such as map
and pmap
.
aggregate_release <- function(fy, start, end) {
seq_dts <- seq.Date(start, end, "day")
tibbles <- lapply(seq_dts, \(dt) {
mwe %>%
filter(fy_year == fy, is.na(release_date) | release_date >= as_date(dt)) %>%
mutate(key_date = as_date(dt)) %>%
group_by(cus_gr, key_date) %>%
summarise(n_cus = n(), .groups = "drop")
})
bind_rows(tibbles)
}
release_fy_tibbles <- with(fy_dates, Map(aggregate_release, fy=fy, start=start, end=end))
release_final <- bind_rows(release_fy_tibbles)