Tags: r, function, performance, apply, sapply

Function has 100-fold increase in duration for each 10-fold increase in input data


I wrote a simple function using sapply() to assign the water year (Oct 1st - Sept 30th) to a vector of dates. It works well, except that for every 10-fold increase in the length of the supplied vector, the function takes 100 times longer, making it prohibitive for large date vectors (I need to apply it to 300k dates). My understanding was that the apply family of functions is vectorized and should be an efficient way to work with large datasets. What am I missing, and can I make it more efficient?

library(lubridate)  # year() and month() come from lubridate

# Label each date with its water year (Oct 1 - Sep 30)
wateryear <- function(dates){
  y <- year(dates)
  m <- month(dates)
  sapply(m, function(x) {ifelse(x <= 9, paste0(y-1,"-",y), paste0(y, "-", y+1))})
}

d <- c("2020/01/01")
system.time(wateryear(rep(d, 100)))    # 0.008
system.time(wateryear(rep(d, 1000)))   # 0.651
system.time(wateryear(rep(d, 10000)))  # 63.854

Solution

  • This vectorized approach runs about 300x faster at n = 1000, and it looks like it takes about 0.4 sec for n = 300k. The original is slow because the work inside each sapply() iteration is not constant: y is the full year vector, so paste0(y-1,"-",y) and paste0(y, "-", y+1) each build an n-length character vector on every call, making the total work quadratic in n (hence 100x the time for each 10-fold increase in length). See the checks and the element-wise sketch after the benchmark below.

    library(lubridate)
    wateryear2 <- function(dates){
      d <- ymd(dates)
      # Oct-Dec dates (month > 9) belong to the water year that ends the
      # following September, so their end year is bumped to year + 1
      y <- year(d) + (month(d) > 9)
      paste(y - 1, y, sep = "-")
    }
    
    bench::mark(
      wateryear(rep(d,1000)),
      wateryear2(rep(d,1000))
    )
    
    
    # A tibble: 2 × 13
      expression                    min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result        memory                 time       gc      
      <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>        <list>                 <list>     <list>  
    1 wateryear(rep(d, 1000))     968ms    968ms      1.03      32MB     0        1     0      968ms <chr [1,000]> <Rprofmem [4,296 × 3]> <bench_tm> <tibble>
    2 wateryear2(rep(d, 1000))   3.04ms   3.09ms    322.       711KB     6.23   155     3      482ms <chr [1,000]> <Rprofmem [380 × 3]>   <bench_tm> <tibble>
    
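  • As a rough, machine-dependent check of the "about 0.4 sec for n = 300k" figure above, reusing wateryear2() from the block above (the object name d300k is just illustrative):

    d300k <- rep("2020/01/01", 3e5)
    system.time(wateryear2(d300k))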

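  • To see where the original wateryear() loses its time, here is a sketch of an element-wise variant (the name wateryear_elementwise is illustrative, not from the original post): indexing y[i] and m[i] inside the anonymous function keeps each iteration constant-time, so it scales roughly linearly, although the fully vectorized wateryear2() remains much faster. Note also that because the original's ifelse() test has length 1, it returns only the first element of its branch vectors, so an input spanning multiple years would have every date labelled with the first date's year.

    library(lubridate)

    # Element-wise sketch: each iteration builds a single length-1 string,
    # so total work grows linearly with length(dates) instead of quadratically
    wateryear_elementwise <- function(dates){
      y <- year(dates)
      m <- month(dates)
      sapply(seq_along(m), function(i) {
        if (m[i] <= 9) paste0(y[i] - 1, "-", y[i]) else paste0(y[i], "-", y[i] + 1)
      })
    }

    # compare with the ~64 s reported for wateryear(rep(d, 10000)) above
    system.time(wateryear_elementwise(rep("2020/01/01", 10000)))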