I have a csv.gz file that (from what I've been told) before compression was 70GB in size. My machine has 50GB of RAM, so anyway I will never be able to open it as a whole in R.
I can load for example the first 10m rows as follows:
library(vroom)
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7)
For what I have to do, it is fine to load 10m rows at the time, do my operations, and continue with the next 10m rows. I could do this in a loop.
I was therefore trying the skip
argument.
df <- vroom("HUGE.csv.gz", delim= ",", n_max = 10^7, skip = 10^7)
This results in an error:
Error: The size of the connection buffer (131072) was not large enough
to fit a complete line:
* Increase it by setting `Sys.setenv("VROOM_CONNECTION_SIZE")`
I increased this with Sys.setenv("VROOM_CONNECTION_SIZE" = 131072*1000)
, however, the error persists.
Is there a solution to this?
Edit: I found out that random access to a gzip compressed csv (csv.gz) is not possible. We have to start from top. Probably the easiest is to decompress and save, then skip should work.
I haven't been able to figure out vroom
solution for very large more-than-RAM (gzipped) csv files. However, the following approach has worked well for me and I'd be grateful to know about approaches with better querying speed while also saving disk space.
split
sub-command inxsv
from https://github.com/BurntSushi/xsv to split the large csv file into comfortably-within-RAM chunks of say, 10^5, lines and save them in a folder.data.table::fread
one-by-one (to avoid low-memory error) using a for
loop and save all of them into a folder as compressed parquet
files using arrow
package which saves space and prepares the large table for fast querying. For even faster operations, it is advisable to re-save the parquet
files partitioned by the fields by which you need to frequently filter.arrow::open_dataset
and query that multi-file parquet folder using dplyr
commands. It takes minimum disk space and gives the fastest results in my experience.I use data.table::fread
with explicit definition of column classes of each field for fastest and most reliable parsing of csv files. readr::read_csv
has also been accurate but slower. However, auto-assignment of column classes by read_csv
as well as the ways in which you can custom-define column classes by read_csv
is actually the best - so less human-time but more machine-time - which means that it may be faster overall depending on scenario. Other csv parsers have thrown errors for the kind of csv files that I work with and waste time.
You may now delete the folder containing chunked csv files to save space, unless you want to experiment loop-reading them with other csv parsers.
Other previously successfully approaches: Loop read all csv chunks as mentioned above and save them into:
disk.frame
package. Then that folder may be queried using dplyr
or data.table
commands explained in the documentation. It has facility to save in compressed fst
files which saves space, though not as much as parquet
files.DuckDB
database which allows querying with SQL
or dplyr
commands. Using database-tables approach won't save you disk space. But DuckDB
also allows querying partitioned/un-partitioned parquet files (which saves disk space) with SQL
commands.EDIT: - Improved Method Below
I experimented a little and found a much better way to do the above operations. Using the code below, the large (compressed) csv file will be chunked automatically within R environment (no need to use any external tool like xsv
) and all chunks will be written in parquet
format in a folder ready for querying.
library(readr)
library(arrow)
fyl <- "...path_to_big_data_file.csv.gz"
pqFolder <- "...path_to_folder_where_chunked_parquet_files_are_to_be_saved"
f <- function(x, pos){
write_parquet(x,
file.path(pqFolder, paste0(pos, ".parquet")),
compression = "gzip",
compression_level = 9)
}
read_csv_chunked(
fyl,
col_types = list(Column1="f", Column2="c", Column3="T", ...), # all column specifications
callback = SideEffectChunkCallback$new(f),
chunk_size = 10^6)
If, instead of parquet
, you want to use -
disk.frame
, the callback function may be used to create chunked compressed fst
files for dplyr
or data.table
style querying.DuckDB
, the callback function may be used to append
the chunks into a database table for SQL
or dplyr
style querying.By judiciously choosing the chunk_size
parameter of readr::read_csv_chunked
command, the computer should never run out of RAM while running queries.
PS: I use gzip
compression for parquet
files since they can then be previewed with ParquetViewer
from https://github.com/mukunku/ParquetViewer. Otherwise, zstd
(not currently supported by ParquetViewer
) decompresses faster and hence improves reading speed.
EDIT 2:
I got a csv file which was really big for my machine: 20 GB gzipped and expands to about 83 GB, whereas my home laptop has only 16 GB. Turns out that the read_csv_chunked
method I mentioned in earlier EDIT fails to complete. It always stops working after some time and does not create all parquet
chunks. Using my previous method of splitting the csv file with xsv
and then looping over them creating parquet
chunks worked. To be fair, I must mention it took multiple attempts this way too and I had programmed a check to create only additional parquet
chunks when running the program on successive attempts.
EDIT 3:
VROOM does have difficulty when dealing with huge files since it needs to store the index in memory as well as any data you read from the file. See development thread https://github.com/r-lib/vroom/issues/203
EDIT 4:
Additional tip: The chunked parquet files created by the above mentioned method may be very conveniently queried using SQL with DuckDB method mentioned at https://duckdb.org/docs/data/parquet and https://duckdb.org/2021/06/25/querying-parquet.html
DuckDB method is significant because R Arrow method currently suffers from a very serious limitation which is mentioned in the official documentation page https://arrow.apache.org/docs/r/articles/dataset.html.
Specifically, and I quote: "In the current release, arrow supports the dplyr
verbs mutate(), transmute(), select(), rename(), relocate(), filter()
, and arrange()
. Aggregation is not yet supported, so before you call summarise()
or other verbs with aggregate functions, use collect()
to pull the selected subset of the data into an in-memory R data frame."
The problem is that if you use collect()
on a very big dataset, the RAM usage spikes and the system crashes. Whereas, using SQL statements to do the same aggregation job on the same big-dataset with DuckDB does not cause RAM usage spikes and does not cause system crash. So until Arrow fixes itself for aggregation queries for big-data, SQL from DuckDB provides a nice solution to querying big datasets in chunked parquet format.