I have an 11GB .csv file which I would ultimately need as a big.matrix object. From what I have read, I think I need to create a filebacked big.matrix object, but I cannot figure out how to do this.
The file is too large for me to load directly into R and manipulate from there, as I have done with smaller datasets. How do I produce a big.matrix object from the .csv file?
See if this can be of help. I am posting it as an answer because it contains too much code for a comment.
The strategy is to read chunks of 10K rows at a time, coerce each chunk to a sparse matrix, and then rbind those sub-matrices together.
It uses data.table::fread for speed and fpeek::peek_count_lines to count the number of lines in the data file; that function is also fast.
library(data.table)
library(Matrix)

flname <- "your_filename"

# count the lines in the file without reading it into memory;
# the first line is the header, the rest are data rows
nlines <- fpeek::peek_count_lines(flname)
ndata <- nlines - 1L

chunk <- 10 * 1024
passes <- ndata %/% chunk
remaining <- ndata %% chunk

skip <- 1                 # skip the header line on the first pass
data_list <- vector("list", length = passes + (remaining > 0))

for (i in seq_len(passes)) {
  # read the next chunk of rows and coerce it to a sparse matrix
  tmp <- fread(flname, sep = ",", colClasses = "double",
               header = FALSE, skip = skip, nrows = chunk)
  data_list[[i]] <- Matrix(as.matrix(tmp), sparse = TRUE)
  skip <- skip + chunk
}

if (remaining > 0) {
  # last, shorter chunk with the remaining rows
  tmp <- fread(flname, sep = ",", colClasses = "double",
               header = FALSE, skip = skip)
  data_list[[passes + 1L]] <- Matrix(as.matrix(tmp), sparse = TRUE)
}

# stack the chunk matrices and free the intermediate list
sparse_mat <- do.call(rbind, data_list)
rm(data_list)
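On the original question: if what is ultimately needed is a filebacked big.matrix, note that bigmemory can also build one straight from the .csv without reading it into RAM. This is only a minimal sketch, assuming an all-numeric file with a header row; the backing and descriptor file names are just examples.
library(bigmemory)

# read the csv directly into a file-backed big.matrix;
# the backing and descriptor files are written to disk
big <- read.big.matrix("your_filename", sep = ",", header = TRUE,
                       type = "double",
                       backingfile = "big_example.bin",
                       descriptorfile = "big_example.desc")
dim(big)

# in a later session the matrix can be re-attached from the descriptor
# file without re-reading the csv
big <- attach.big.matrix("big_example.desc")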
With the following test data, the chunked read above went alright. I also tried it with a bigger matrix. The path is optional.
path <- "~/Temp"
flname <- file.path(path, "big_example.csv")

# test data: a sequential id column plus 10 sparse 0/1 columns
a <- matrix(1:(25*1024), ncol = 1)
b <- matrix(rbinom(25*1024*10, size = 1, prob = 0.01), ncol = 10)
a <- cbind(a, b)
dim(a)    # 25600 rows, 11 columns

write.csv(a, flname, row.names = FALSE)
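As a quick check, running the chunked read above with flname pointing at this file should reproduce the dimensions of a:
dim(sparse_mat)    # 25600 x 11, the same as dim(a)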