I need to create a large file of 0's and 1's (approx 500K rows and 20K columns) for which I used the bigmemory package in R.
As this is new to me, I have not quite managed to find the answer to my query yet.
big1 = big.matrix(nrow=nrow(mm),ncol=nrow(cod),init=0,type="char",dimnames = list(as.character(mm$id),cod$coding),backingfile = "big1.bin", descriptorfile = "big1.desc")
is.filebacked(big1) #TRUE
big2 = filebacked.big.matrix(nrow=nrow(mm),ncol=nrow(cod),init=0,type="char",dimnames = list(as.character(mm$id),cod$coding),backingfile = "big2.bin", descriptorfile = "big2.desc")
## presently the for loop step takes about 2 hours
for (i in 1:nrow(big1)) {
  big1[i, match(some_columns, colnames(big1))] = 1
}
## eventually writing out the big.matrix to file using write.big.matrix also takes about 2 hours.
sessionInfo()
R version 3.3.0
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 6.9
What is the difference between these two? I would like to know what happens when assigning 1's to some cells in either big1 or big2. Are these saved to the backing and descriptor files in both cases when initialised this way, or does one have to do something else?
I had saved the session's .RData (using big1 without backing & descriptor files in the first instance), and when trying to load it into R it caused a fatal error and terminated the session. So I would like to know how to reload the data more efficiently, rather than wasting a few hours each time redoing everything.
Many Thanks.
First, you can use either big.matrix or filebacked.big.matrix. See the first lines of the function big.matrix:
if (!is.null(backingfile)) {
    if (!shared)
        warning("All filebacked objects are shared.")
    return(filebacked.big.matrix(nrow = nrow, ncol = ncol,
        type = type, init = init, dimnames = dimnames, separated = separated,
        backingfile = backingfile, backingpath = backingpath,
        descriptorfile = descriptorfile, binarydescriptor = binarydescriptor))
}
So, if you provide an argument backingfile, filebacked.big.matrix will be called.
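You can verify this delegation directly; a minimal check, with made-up file names and tempdir() so the example is self-contained:

```r
library(bigmemory)

# Calling big.matrix with a backingfile delegates to filebacked.big.matrix,
# so the result is a filebacked matrix
x <- big.matrix(nrow = 2, ncol = 2, init = 0, type = "char",
                backingfile = "x.bin", backingpath = tempdir(),
                descriptorfile = "x.desc")
is.filebacked(x)  # TRUE
```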
Secondly, like standard R matrices, big matrices are stored column-wise and should be accessed column-wise if you care about efficiency. Something like this:
big1[, some_column_indices] <- 1
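A minimal runnable sketch of the column-wise pattern (the small dimensions and column indices are made up for illustration):

```r
library(bigmemory)

# Toy in-memory big.matrix; dimensions are made up for illustration
bm <- big.matrix(nrow = 5, ncol = 4, init = 0, type = "char")

# One column-wise assignment fills whole columns in a single call,
# matching the column-major storage layout
some_column_indices <- c(2, 4)
bm[, some_column_indices] <- 1
```

This replaces the row-by-row loop with a single assignment per set of columns.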
Thirdly, for the last part of your question: you can't store a big.matrix object in .RData because it is an external pointer to a C++ object; when you load it back into memory, this pointer is null, and that makes your session crash. You need to use descriptors instead (for example, when using parallelism). There are at least 3 questions on SO about this.
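So instead of save()/load() on the big.matrix itself, keep the descriptor file and re-attach with attach.big.matrix. A sketch, assuming the backing and descriptor files sit in the same directory (file names are made up):

```r
library(bigmemory)

dir <- tempdir()  # stand-in for your working directory

# Create a filebacked matrix and write some values
big1 <- filebacked.big.matrix(nrow = 3, ncol = 2, init = 0, type = "char",
                              backingfile = "big1.bin", backingpath = dir,
                              descriptorfile = "big1.desc")
big1[, 1] <- 1

# Later (or in a fresh R session): re-attach via the descriptor file
# instead of loading a saved .RData
big1_again <- attach.big.matrix(file.path(dir, "big1.desc"))
```

Re-attaching this way is essentially instantaneous, since the data already live in the backing file on disk.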
Hope I answered your questions.