I have a simple operation where I read several CSVs, bind them, and then export the result, but vroom is performing much slower than the other methods. I must be doing something wrong, but I'm not sure what or why.
library(readr)
library(vroom)
library(data.table)
library(microbenchmark)
write_csv(mtcars, "test.csv")
microbenchmark(
  readr = {
    t <- read_csv("test.csv", col_types = cols())
    write_csv(t, "test.csv")
  },
  data.tabl = {
    t <- fread("test.csv")
    fwrite(t, "test.csv", sep = ",")
  },
  vroom = {
    t <- vroom("test.csv", delim = ",", show_col_types = FALSE)
    vroom_write(t, "test.csv", delim = ",")
  },
  times = 10
)
#> Unit: milliseconds
#>       expr       min        lq      mean    median        uq        max neval
#>      readr 12.636961 12.662955 15.865400 12.928211 13.503029  41.104583    10
#>  data.tabl  2.200815  2.275252  2.633456  2.342797  2.529283   4.830134    10
#>      vroom 57.376353 57.915135 64.280365 58.496847 58.966311 117.150837    10
Created on 2021-07-01 by the reprex package (v2.0.0)
To do a test with more data, I used the CSV from https://www.datosabiertos.gob.pe/dataset/vacunaci%C3%B3n-contra-covid-19-ministerio-de-salud-minsa, which contains 7.3+ million rows, and used a slight variation of your code:
library(readr)
library(vroom)
library(data.table)
library(microbenchmark)
csv_file <- "vacunas_covid.csv.gz"
microbenchmark(
  readr = {
    t <- read_csv(csv_file, col_types = cols())
    write_csv(t, csv_file)
  },
  data.table = {
    t <- fread(csv_file)
    fwrite(t, csv_file, sep = ",")
  },
  vroom = {
    t <- vroom(csv_file, delim = ",", show_col_types = FALSE)
    vroom_write(t, csv_file, delim = ",")
  },
  times = 5
)
The results were:
Unit: seconds
       expr       min        lq      mean    median        uq       max neval cld
      readr 101.72094 105.75384 109.16869 106.08111 108.06967 124.21788     5   c
 data.table  28.18751  30.32570  31.06592  30.44838  33.12746  33.24055     5 a
      vroom  48.65399  51.52445  55.78264  52.89823  53.83582  72.00071     5  b
From these results, vroom is about 2x faster than readr on a big dataset, and data.table is ~1.7x faster than vroom. Perhaps the issue with the original example is that the data is small, and the overhead of the indexing that vroom performs accounts for much of the difference.
Just in case, the code and results are at: https://gist.github.com/jmcastagnetto/fef3f3a2778028e7efb6836d6d8e3f8e
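One way to probe the small-file hypothesis is to time vroom's read and write steps separately, and to compare the default lazy (ALTREP) read against an eager one. The following is only a sketch under assumptions: the temporary files and the labels `read_lazy`, `read_eager`, and `write` are made up for illustration, and the input and output files are kept separate (unlike the benchmarks above) so that a lazily read file is never overwritten while it is still in use. No timings are shown because they will depend on the machine.

```r
library(vroom)
library(microbenchmark)

# Hypothetical temporary files, used only for this illustration
in_file  <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")

# Write the same small dataset used in the question
vroom_write(mtcars, in_file, delim = ",")

# Read once up front so the write step can be timed on its own
dat <- vroom(in_file, delim = ",", show_col_types = FALSE)

res <- microbenchmark(
  # Default behaviour: index the file, materialise values lazily via ALTREP
  read_lazy  = vroom(in_file, delim = ",", show_col_types = FALSE),
  # altrep = FALSE asks vroom to materialise all columns at read time
  read_eager = vroom(in_file, delim = ",", show_col_types = FALSE,
                     altrep = FALSE),
  # Writing only, to a different file than the one being read
  write      = vroom_write(dat, out_file, delim = ","),
  times = 10
)
print(res)
```

If the `read_*` and `write` timings on a tiny file are all dominated by a roughly constant cost, that would be consistent with per-call overhead, rather than throughput, explaining the original result.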