Tags: r, rcurl, r-download.file

Faster way to download multiple files in R


I wrote a small downloader in R to download some log files from a remote server in one run:

file_remote <- fun_to_list_URLs()
file_local <- fun_to_gen_local_paths()
credentials <- "usr/pwd"

downloader <- function(file_remote, file_local, credentials) {
  data_bin <- RCurl::getBinaryURL(
    file_remote,
    userpwd = credentials,
    ftp.use.epsv = FALSE,
    forbid.reuse = TRUE
  )
  
  writeBin(data_bin, file_local)
}
  
purrr::walk2(
  file_remote,
  file_local,
  ~ downloader(
    file_remote = .x,
    file_local = .y,
    credentials = credentials
  )
)

This works, but slowly, especially compared to FTP clients like WinSCP: downloading 64 log files of 2 kB each takes minutes.

Is there a faster way to download a lot of files in R?


Solution

  • The curl package has a way to perform async requests, which means that downloads are performed simultaneously instead of one after another. Especially with smaller files this should give you a large boost in performance. Here is a barebones function that does that (since version 5.0.0, the curl package also has a native version of this function, likewise called multi_download; a minimal call sketch of it follows after the function below):

    # total_con: max total concurrent connections.
    # host_con: max concurrent connections per host.
    # print: print status of requests at the end.
    multi_download <- function(file_remote, 
                               file_local,
                               total_con = 1000L, 
                               host_con  = 1000L,
                               print = TRUE) {
      
      # check for duplication (deactivated for testing)
      # dups <- duplicated(file_remote) | duplicated(file_local)
      # file_remote <- file_remote[!dups]
      # file_local <- file_local[!dups]
      
      # create pool
      pool <- curl::new_pool(total_con = total_con,
                             host_con = host_con)
      
      # function performed on successful request
      save_download <- function(req) {
        writeBin(req$content, file_local[file_remote == req$url])
      }
      
      # setup async calls
      invisible(
        lapply(
          file_remote, function(f) 
            curl::curl_fetch_multi(f, done = save_download, pool = pool)
        )
      )
      
      # all created requests are performed here
      out <- curl::multi_run(pool = pool)
      
      if (print) print(out)
      
    }
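
    If you are on curl >= 5.0.0, the built-in version mentioned above can also be used directly. A minimal sketch of such a call, passing the URLs and destination paths positionally (whether credentials such as userpwd can be forwarded for your FTP case is an assumption I have not tested):

    # minimal sketch, assuming curl >= 5.0.0 is installed
    results <- curl::multi_download(file_remote, file_local)
    # `results` should contain one row per request with status information,
    # which you can inspect to see which downloads succeeded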
    

    Now we need some test files to compare it to your baseline approach. I use COVID-19 data from the Johns Hopkins University GitHub page, as it contains many small CSV files, which should be similar to your files.

    file_remote <- paste0(
      "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/",
      format(seq(as.Date("2020-03-03"), as.Date("2022-06-01"), by = "day"), "%m-%d-%Y"), # JHU daily reports are named MM-DD-YYYY.csv
      ".csv"
    )
    file_local <- paste0("/home/johannes/Downloads/test/", seq_along(file_remote), ".bin")
    

    We could also infer the file names from the URLs, but I assume that is not what you want.
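
    Purely as an aside, a base-R sketch of that idea (file_local_named is my own name for it, and it is not used in the benchmark below):

    # hypothetical alternative: keep the original date-based names instead of
    # numbering the files
    file_local_named <- file.path(
      "/home/johannes/Downloads/test",
      basename(file_remote)
    )

    So now let's compare the approaches for these 821 files: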

    res <- bench::mark(
      baseline(),
      multi_download(file_remote, 
                     file_local,
                     print = FALSE),
      check = FALSE
    )
    #> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
    summary(res)
    #> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
    #> # A tibble: 2 × 6
    #>   expression                                                min median `itr/sec`
    #>   <bch:expr>                                             <bch:> <bch:>     <dbl>
    #> 1 baseline()                                               2.8m   2.8m   0.00595
    #> 2 multi_download(file_remote, file_local, print = FALSE)  12.7s  12.7s   0.0789 
    #> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
    summary(res, relative = TRUE)
    #> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
    #> # A tibble: 2 × 6
    #>   expression                                               min median `itr/sec`
    #>   <bch:expr>                                             <dbl>  <dbl>     <dbl>
    #> 1 baseline()                                              13.3   13.3       1  
    #> 2 multi_download(file_remote, file_local, print = FALSE)   1      1        13.3
    #> # … with 2 more variables: mem_alloc <dbl>, `gc/sec` <dbl>
    

    The new approach is 13.3 times faster than the original one. I would assume that the difference gets bigger the more files you have. Note, though, that this benchmark is not perfect, as my internet speed fluctuates quite a bit.

    The function should also be improved in terms of error handling (currently you get a message about how many requests succeeded and how many failed, but no indication of which files actually exist). My understanding is also that multi_run keeps the downloaded content in memory until save_download writes it to disk. With small files this is fine, but it might become an issue with larger ones.
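
    As a starting point for the error handling, one option would be to make save_download inside multi_download stricter and collect everything that did not come back with a 2xx status. This is just a sketch with my own variable names; as far as I know, the response object passed to the done callback carries url, status_code and content:

    # sketch: only write files for successful responses and remember the rest
    failed <- character()
    save_download <- function(req) {
      if (req$status_code >= 200 && req$status_code < 300) {
        writeBin(req$content, file_local[file_remote == req$url])
      } else {
        failed <<- c(failed, paste(req$url, "->", req$status_code))
      }
    }

    The fail argument of curl::curl_fetch_multi() could be used in the same way to record connection-level failures.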

    baseline function

    baseline <- function() {
      credentials <- "usr/pwd"
      downloader <- function(file_remote, file_local, credentials) {
        data_bin <- RCurl::getBinaryURL(
          file_remote,
          userpwd = credentials,
          ftp.use.epsv = FALSE,
          forbid.reuse = TRUE
        )
        writeBin(data_bin, file_local)
      }
      
      purrr::walk2(
        file_remote,
        file_local,
        ~ downloader(
          file_remote = .x,
          file_local = .y,
          credentials = credentials
        )
      )
    }
    

    Created on 2022-06-05 by the reprex package (v2.0.1)