I wrote a simple function using sapply() to assign the water year (Oct 1st - Sept 30th) to a vector of dates. It works, except that every 10-fold increase in the length of the input vector makes the function take roughly 100 times longer, which makes it prohibitive for large date vectors (I need to apply it to 300k dates). My understanding was that the apply family of functions is vectorized and should be an efficient way to work with large datasets. What am I missing here, and can I make it more efficient?
library(lubridate)
wateryear <- function(dates){
  y <- year(dates)
  m <- month(dates)
  sapply(m, function(x) {ifelse(x <= 9, paste0(y - 1, "-", y), paste0(y, "-", y + 1))})
}
d<-c("2020/01/01")
system.time(wateryear(rep(d,100))) # 0.008
system.time(wateryear(rep(d,1000))) # 0.651
system.time(wateryear(rep(d,10000))) # 63.854
sapply() is not vectorization; it is a loop over m. Worse, on every iteration of that loop ifelse() evaluates paste0(y - 1, "-", y) over the entire y vector just to pick out one element (see the snippet below), so the total work grows with the square of the input length. The fully vectorized version further down runs about 300x faster for n = 1000 and takes roughly 0.4 sec for n = 300k.
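One way to see where the quadratic cost comes from (n = 1000 below is only for illustration):
library(lubridate)
dates <- ymd(rep("2020/01/01", 1000))
y <- year(dates)
# Even though the test (x <= 9) is length 1, ifelse() fully evaluates its
# chosen branch over the whole y vector, so each of the 1000 sapply()
# iterations rebuilds a fresh length-1000 character vector:
length(paste0(y - 1, "-", y))  # 1000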
library(lubridate)
wateryear2 <- function(dates){
  d <- ymd(dates)
  # Dates in Oct-Dec belong to the water year ending the following September
  y <- year(d) + (month(d) > 9)
  paste(y - 1, y, sep = "-")
}
bench::mark(
  wateryear(rep(d, 1000)),
  wateryear2(rep(d, 1000))
)
# A tibble: 2 × 13
  expression                    min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory                 time       gc
  <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list>                 <list>     <list>
1 wateryear(rep(d, 1000))     968ms    968ms      1.03      32MB     0        1     0      968ms <chr [1,000]> <Rprofmem [4,296 × 3]> <bench_tm> <tibble>
2 wateryear2(rep(d, 1000))   3.04ms   3.09ms    322.      711KB     6.23    155     3      482ms <chr [1,000]> <Rprofmem [380 × 3]>   <bench_tm> <tibble>
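If you would rather avoid the lubridate dependency, the same idea works in base R and is also fully vectorized (wateryear3 and the "%Y/%m/%d" format string are just illustrative, assuming inputs like "2020/01/01"):
wateryear3 <- function(dates){
  d <- as.Date(dates, format = "%Y/%m/%d")
  # Oct-Dec dates roll into the water year that ends the following September
  y <- as.integer(format(d, "%Y")) + (as.integer(format(d, "%m")) > 9)
  paste(y - 1, y, sep = "-")
}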