r matrix subset submatrix r-bigmemory

Removing columns from a "big.matrix" gives error: "cannot allocate vector of size 37.6 Gb"


I have a large "big.matrix" and I need to remove a few columns from it. It was created from a CSV file (with 72 million rows) using

BigMat <- read.big.matrix("matrix.csv", type="double", header=TRUE,
                     backingfile="matrix.bin",
                     descriptorfile="matrix.desc")

This successfully loads the matrix into R but I do not have enough memory space to create a new object when trying to subset this matrix:

BigMatSub <- BigMat[, 5:71]

It gave me: Error: cannot allocate vector of size 37.6 Gb.

Is there any way of removing the columns without hitting the memory limit? I need to have it as a "big.matrix" object in the end to use in biglasso().

The matrix is sparse with many zero values.

Any help is much appreciated.


Solution

  • So you are using the bigmemory package. No wonder you could load the full matrix in the first place: with a backingfile, the data live in a memory-mapped file on disk rather than in RAM.

    I haven't used bigmemory before. But intuitively, if the subset we want to extract is still too large, we want a "big.matrix" after subsetting rather than a coercion to a regular dense matrix. The error message you got implies that the usual "[" does not preserve the "big.matrix" class and instead attempts to return a dense matrix of 37.6 Gb. Wow! At 8 bytes per double across the 67 requested columns, this implies that your "big.matrix" has roughly 75,322,188 rows!
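The row estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the "Gb" in R's error message means 2^30 bytes and that every element is an 8-byte double; the variable names are illustrative.

```r
# Assumption: the "Gb" in the error message means 2^30 bytes
size_bytes <- 37.6 * 2^30

n_cols <- length(5:71)      # 67 columns were requested
bytes_per_double <- 8       # type = "double" in read.big.matrix()

# Rows needed for a dense 67-column double matrix of that size
est_rows <- size_bytes / (n_cols * bytes_per_double)
round(est_rows)             # 75322188
```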

    Searching "subset" in the package's PDF manual, I find that you could try:

    BigMatSubset <- deepcopy(BigMat, cols = 5:71)
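
For a subset that is itself too large for RAM, deepcopy() can probably be given its own backing file so that the copy also lives on disk. The following is a minimal sketch on a small stand-in matrix; the file names "matrix_sub.bin" and "matrix_sub.desc" and the use of tempdir() are assumptions for illustration only.

```r
library(bigmemory)

# Small stand-in for the real 71-column matrix
BigMat <- big.matrix(nrow = 100, ncol = 71, type = "double", init = 0)
BigMat[, 5] <- seq_len(100)   # mark the first column we intend to keep

# Copy columns 5:71 into a new file-backed big.matrix
# (hypothetical file names, written to a temporary directory)
BigMatSubset <- deepcopy(BigMat, cols = 5:71,
                         backingfile = "matrix_sub.bin",
                         descriptorfile = "matrix_sub.desc",
                         backingpath = tempdir())

is.big.matrix(BigMatSubset)   # the result should still be a big.matrix
dim(BigMatSubset)
```

Note that this still writes a full copy of the selected columns, so it needs roughly as much free disk space as the subset occupies.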
    

    Interestingly, the manual also documents "[", but it does not explicitly state whether we lose the "big.matrix" class and get a regular matrix instead. For verification, you could extract a very small subset:

    what <- BigMat[1:10, 1:4]
    

    and see if "what" is a regular dense matrix.
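
One way to run that check, sketched here on a small stand-in matrix rather than the real data, is to inspect the class of the extracted object:

```r
library(bigmemory)

# Small stand-in for the real matrix
BigMat <- big.matrix(nrow = 20, ncol = 71, type = "double", init = 0)

what <- BigMat[1:10, 1:4]

class(what)           # class of the extracted object
is.matrix(what)       # TRUE would mean an ordinary in-memory matrix
is.big.matrix(what)   # FALSE would mean the big.matrix class was dropped
```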


    Update

    Searching "[r] deepcopy" gives only 7 posts (excluding this answer) so far. The most relevant one is:

    I also discovered the function sub.big.matrix when reading those posts. Searching "[r] sub.big.matrix" gives only 2 posts so far (excluding this answer), both answered by Charles Determan, an author of bigmemory:

    I am now convinced that sub.big.matrix is a better way to go.
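
A minimal sketch of the sub.big.matrix approach, again on a small stand-in matrix. As far as I understand the manual, the result is a window onto a contiguous block of the original data rather than a copy, which is why it suits a contiguous range such as columns 5 to 71. Whether biglasso() accepts such a sub-matrix is not something I have verified.

```r
library(bigmemory)

# Small stand-in for the real matrix
BigMat <- big.matrix(nrow = 100, ncol = 71, type = "double", init = 0)

# View of columns 5 to 71; rows default to the full range
BigMatSub <- sub.big.matrix(BigMat, firstCol = 5, lastCol = 71)

dim(BigMatSub)
is.sub.big.matrix(BigMatSub)

# The view shares storage with the parent, so a write through the
# sub-matrix should be visible in the original matrix
BigMatSub[1, 1] <- 42
BigMat[1, 5]
```

Because no data are copied, this avoids both the memory allocation and the extra disk space that a deep copy would need.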

    All these posts carry the same tag, so I will edit your question to include it, too.