Tags: r, performance, load

Best file type for loading data into R (speed-wise)?


I'm running some analysis where I end up with quite a few datasets that are 2-3 GB each. Right now I'm saving these as .RData files and loading them later to continue working, which takes quite a while. My question is: would saving and then loading these files as .csv be faster? Is data.table the fastest package for reading in .csv files? I guess I'm looking for the optimum workflow in R.
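For context, my current workflow looks roughly like this (object and file names are just placeholders):

    # produce a dataset of roughly 2-3 GB (placeholder for the real analysis step)
    big_df <- some_expensive_processing()

    # end of session: persist the result
    save(big_df, file = "big_df.RData")

    # later session: reload it to continue working (this is the slow part)
    load("big_df.RData")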


Solution

  • Based on the comments and some of my own research, I put together a benchmark.

    library(bench)
    
    nr_of_rows <- 1e7
    set.seed(1)
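    # build a 10-million-row test set with logical, integer, real and factor columns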
    df <- data.frame(
      Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
      Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
      Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
      Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
    )
    
    baseRDS <- function() {
      saveRDS(df, "dataset.Rds")
      readRDS("dataset.Rds")
    }
    
    baseRDS_nocompress <- function() {
      saveRDS(df, "dataset.Rds", compress = FALSE)
      readRDS("dataset.Rds")
    }
    
    baseRData <- function() {
      save(list = "df", file = "dataset.Rdata")
      load("dataset.Rdata")
      df
    }
    
    data.table <- function() {
      data.table::fwrite(df, "dataset.csv")
      data.table::fread("dataset.csv")
    }
      
    feather <- function() {
      feather::write_feather(df, "dataset.feather")
      as.data.frame(feather::read_feather("dataset.feather"))
    }
    
    fst <- function() {
      fst::write.fst(df, "dataset.fst")
      fst::read.fst("dataset.fst")
    }
    
    # only works on Unix systems
    # fastSave <- function() {
    #   fastSave::save.pigz(df, file = "dataset.RData", n.cores = 4)
    #   fastSave::load.pigz("dataset.RData")
    # }
    
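    # each candidate writes df to disk and reads it back; check = FALSE so
    # bench::mark() doesn't require identical return values (data.table()
    # returns a data.table, the others a data.frame)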
    results <- mark(
      baseRDS(),
      baseRDS_nocompress(),
      baseRData(),
      data.table(),
      feather(),
      fst(),
      check = FALSE
    )
    

    Results

    summary(results)
    # A tibble: 6 x 13
      expression                min   median `itr/sec` mem_alloc
      <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
    1 baseRDS()              15.74s   15.74s    0.0635     191MB
    2 baseRDS_nocompress() 720.82ms 720.82ms    1.39       191MB
    3 baseRData()            18.14s   18.14s    0.0551     191MB
    4 data.table()            4.43s    4.43s    0.226      297MB
    5 feather()            794.13ms 794.13ms    1.26       191MB
    6 fst()                233.96ms 304.28ms    3.29       229MB
    # ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
    #   n_gc <dbl>, total_time <bch:tm>, result <list>,
    #   memory <list>, time <list>, gc <list>
    
    summary(results, relative = TRUE)
    # A tibble: 6 x 13
      expression             min median `itr/sec` mem_alloc
      <bch:expr>           <dbl>  <dbl>     <dbl>     <dbl>
    1 baseRDS()            67.3   51.7       1.15      1.00
    2 baseRDS_nocompress()  3.08   2.37     25.2       1.00
    3 baseRData()          77.5   59.6       1         1.00
    4 data.table()         18.9   14.5       4.10      1.56
    5 feather()             3.39   2.61     22.8       1   
    6 fst()                 1      1        59.6       1.20
    # ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
    #   n_gc <dbl>, total_time <bch:tm>, result <list>,
    #   memory <list>, time <list>, gc <list>
    

    Based on this, the fst package is the fastest. It's followed in second place by base R's saveRDS with compress = FALSE, though that produces large files. I wouldn't recommend saving anything as CSV unless you want to open it with a different program; in that case data.table would be your choice. Otherwise I'd recommend either saveRDS or fst.
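
    If you go with fst, it's roughly a drop-in replacement for the saveRDS()/readRDS() step. A minimal sketch (compress is optional, 0-100 with a default of 50; lower values are faster but give larger files):

    library(fst)

    # write the data frame to disk; adjust compress to trade file size for speed
    write.fst(df, "dataset.fst", compress = 50)

    # read it back as a plain data.frame
    df <- read.fst("dataset.fst")

    # fst can also read only a subset of columns, so you don't need to
    # load the whole file when you only want a few variables
    partial <- read.fst("dataset.fst", columns = c("Integer", "Real"))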