Tags: r, memory-management, disk, filehash

How can I save results in a list in a memory efficient way?


In my current project I have a calculation function that runs on one element of a vector A and returns a list element that I insert into a list B. The returned element contains a number of large, arbitrarily sized matrices derived from the corresponding input element.

As an example, let's take a function that takes a number n and generates a random n x n matrix.

vector.A <- sample(1:2000, 15000, replace = TRUE)

list.B <- as.list(rep(NA, length(vector.A)))

arbitraryMatrix <- function(n) {
    matrix(rnorm(n*n), ncol = n, nrow = n)
}

for (i in which(is.na(list.B))) {
    print(i)
    list.B[[i]] <- arbitraryMatrix(vector.A[i])
}

This loop gets slower the larger list.B grows (in fact I'm pretty sure it will crash R before it finishes). It occurred to me that no element of list.B is ever accessed again after it's created, so it could be written to disk rather than taking up memory in a way that slows down the calculations.

I could write a script that does this by saving chunks into .rda files, but I was hoping someone had a more elegant solution.
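
Roughly what I mean is something like the sketch below (untested, and the directory name is arbitrary). It uses saveRDS() rather than save()/.rda since single objects round-trip more cleanly, and it holds at most one matrix in memory at a time:

out.dir <- tempfile("listB_")   # arbitrary scratch directory
dir.create(out.dir)

for (i in seq_along(vector.A)) {
    saveRDS(arbitraryMatrix(vector.A[i]),
            file.path(out.dir, paste0("B_", i, ".rds")))
}

# fetch a single element later without loading the rest:
# readRDS(file.path(out.dir, "B_42.rds"))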

The ff package looked like an interesting possibility (http://cran.r-project.org/web/packages/ff/ff.pdf), but as far as I can tell it doesn't support list objects.


EDIT: I'm considering the mmap package, which maps R objects to temporary files, but I'm still trying to work out how to use it for this problem.
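
For example, each matrix could be backed by its own file, pre-allocated with writeBin() and then filled through the mapping. This is a rough, untested sketch and I'm not at all sure it's the intended usage of the package:

library(mmap)

n <- vector.A[1]
f <- tempfile()
writeBin(numeric(n * n), f)      # pre-allocate n*n doubles on disk
m <- mmap(f, mode = real64())    # map the file as a flat double vector
m[1:(n * n)] <- rnorm(n * n)     # write through the mapping, not RAM
munmap(m)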


Solution

  • Here's an answer using the filehash package. It's a good method because it has an impressively tiny memory footprint that barely increases as the function progresses. So that's one of your objectives fulfilled.

    However, it's a bad method because it has two substantial drawbacks. (1) It is incredibly slow: if you open a process monitor you can see the disk and memory swapping going on at a rather leisurely rate (on my machine, at least). In fact it's so slow I'm not sure whether it gets slower as it goes along. I haven't run it to completion, only just past the point where the in-memory version threw an error (about item 350 or so), to convince myself it was better than running in memory; at that point the disk object was 73 GB. Which brings us to (2): the disk object it creates is massive.

    So here's hoping someone else comes along with a better answer to your question (perhaps with mmap?); I'll be most interested to see it.

    # set up disk storage object
    library(filehash)
    dbCreate("myTestDB")
    db <- dbInit("myTestDB")
    
    # put data on disk
    db$A <- sample(1:2000, 15000, replace = TRUE)
    db$B <- as.list(rep(NA, length(db$A)))
    
    # function
    arbitraryMatrix <- function(n) {
      matrix(rnorm(n*n), ncol = n, nrow = n)
    }
    
    # run function by accessing disk objects
    # NB: each assignment below fetches the whole of B from disk,
    # modifies one element, then writes the whole list back, which is
    # a large part of why this approach is so slow
    for (i in which(is.na(db$B))) {
      print(i)
      db$B[[i]] <- arbitraryMatrix(db$A[i])
    }
    
    # run function by accessing disk objects, following
    # Jon's comment to treat db as a list: each matrix goes under its
    # own key ("1", "2", ...), so only one matrix is serialised per
    # iteration (note db$B itself is never updated, so the is.na()
    # check always returns every index)
    for (i in which(is.na(db$B))) {
      print(i)
      db[[as.character(i)]] <- arbitraryMatrix(db$A[i])
    }
    # use db[[as.character(1)]] etc to access the list items
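
    A final thought: since each db$B[[i]] <- ... assignment round-trips the entire list, a variant that drops the placeholder list B and drives the loop with filehash's dbExists(), dbInsert() and dbFetch() (all part of the documented filehash API) should avoid that cost entirely. A sketch, which I haven't timed:

    # sketch: one key per matrix, no big list to rewrite each iteration
    for (i in seq_along(db$A)) {
      key <- as.character(i)
      if (!dbExists(db, key)) {   # skip work already done; restartable
        dbInsert(db, key, arbitraryMatrix(db$A[i]))
      }
    }
    # dbFetch(db, "42") retrieves one matrix without touching the rest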