r, csv, bigdata, r-bigmemory

Loading an 11 GB .csv file as a big.matrix object


I have an 11 GB .csv file that I ultimately need as a big.matrix object. From what I have read, I think I need to create a filebacked big.matrix object, but I cannot figure out how to do this.

The file is too large for me to load directly into R and manipulate from there as I have done with smaller datasets. How do I produce a big.matrix object from the .csv file?
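
Is read.big.matrix with a backing file what I should be using? Something like the sketch below (the file names are just placeholders) is what I have in mind, but I am not sure whether these are the right arguments:

    library(bigmemory)
    
    # rough idea only: read the csv straight into a file-backed big.matrix
    # (file names below are placeholders)
    x <- read.big.matrix("my_big_file.csv", sep = ",", header = TRUE,
                         type = "double",
                         backingfile = "my_big_file.bin",
                         descriptorfile = "my_big_file.desc")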


Solution

  • See if this can be of help. I am posting it as an answer because it contains too much code for a comment.

    The strategy is to read the file in chunks of 10K rows at a time, coerce each chunk to a sparse matrix, and then rbind those sub-matrices together.
    It uses data.table::fread for speed and fpeek::peek_count_lines, which is also fast, to count the number of lines in the data file. The file is assumed to have a header line, which is skipped so that the row accounting stays correct.

    library(data.table)
    library(Matrix)
    
    flname <- "your_filename"
    nlines <- fpeek::peek_count_lines(flname)   # total lines, including the header
    chunk <- 10 * 1024
    
    nrows_data <- nlines - 1L                   # data rows only, excluding the header
    passes <- nrows_data %/% chunk
    remaining <- nrows_data %% chunk
    skip <- 1L                                  # start past the header line
    
    data_list <- vector("list", length = passes + (remaining > 0))
    for (i in seq_len(passes)) {
      # header = FALSE because the header line has already been skipped
      tmp <- fread(flname, sep = ",", colClasses = "double",
                   skip = skip, nrows = chunk, header = FALSE)
      data_list[[i]] <- Matrix(as.matrix(tmp), sparse = TRUE)
      skip <- skip + chunk
    }
    if (remaining > 0) {
      # last pass: read whatever rows are left after the full chunks
      tmp <- fread(flname, sep = ",", colClasses = "double",
                   skip = skip, header = FALSE)
      data_list[[passes + 1L]] <- Matrix(as.matrix(tmp), sparse = TRUE)
    }
    
    # bind all chunks into one sparse matrix and free the intermediate list
    sparse_mat <- do.call(rbind, data_list)
    rm(data_list)
    
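    If the filebacked big.matrix from the question is really what you need, the same chunked loop can fill one directly instead of building a sparse matrix. A rough sketch along those lines (the backing-file names are placeholders, and I have not tried this on an 11 GB file):

    library(bigmemory)
    library(data.table)
    
    flname <- "your_filename"
    nlines <- fpeek::peek_count_lines(flname)
    chunk <- 10 * 1024
    nrows_data <- nlines - 1L
    ncols <- ncol(fread(flname, nrows = 1L))
    
    # pre-allocate the file-backed matrix on disk (placeholder file names)
    bm <- filebacked.big.matrix(nrow = nrows_data, ncol = ncols, type = "double",
                                backingfile = "big_example.bin",
                                descriptorfile = "big_example.desc")
    
    skip <- 1L                                  # start past the header line
    filled <- 0L
    while (filled < nrows_data) {
      nr <- min(chunk, nrows_data - filled)
      tmp <- fread(flname, sep = ",", colClasses = "double",
                   skip = skip, nrows = nr, header = FALSE)
      # write the chunk into the corresponding rows of the big.matrix
      bm[(filled + 1L):(filled + nrow(tmp)), ] <- as.matrix(tmp)
      filled <- filled + nrow(tmp)
      skip <- skip + nrow(tmp)
    }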

    Test data

    With the following test data, everything went all right. I also tried it with a bigger matrix.

    The path is optional.

    path <- "~/Temp"
    flname <- file.path(path, "big_example.csv")
    a <- matrix(1:(25*1024), ncol = 1)                                   # one index column
    b <- matrix(rbinom(25*1024*10, size = 1, prob = 0.01), ncol = 10)    # 10 sparse 0/1 columns
    a <- cbind(a, b)
    dim(a)
    write.csv(a, flname, row.names = FALSE)
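
    With flname now pointing at big_example.csv, re-running the chunked reader above gives a quick check against the source matrix:

    # sanity check after re-running the chunked reader on the test file
    dim(sparse_mat)                    # should equal dim(a), i.e. 25600 x 11
    all(as.matrix(sparse_mat) == a)    # expected TRUE if the chunks line up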