Tags: r, parallel-processing, lapply, revolution-r

R: Lapply equivalent for Revoscaler/Revolution Enterprise?


I have Revolution Enterprise. I want to run 2 simple but computationally intensive operations on each of 121k files in a directory, outputting to new files. I was hoping to use some RevoScaleR function that chunked/parallel-processed the data similarly to lapply. So I'd have lapply(list of files, function), but using a faster rx/xdf (RevoScaleR) function that might actually finish, since I suspect basic lapply would never complete.

So is there a RevoScaleR version of lapply? Will running it from Revolution Enterprise automatically chunk things?

I see parLapply and mclapply (http://www.inside-r.org/r-doc/parallel/clusterApply)... can I run these using cores on the same desktop? On AWS servers? Do I get anything out of running these packages in RevoScaleR if it's not a native rx/xdf function? I guess this is really a question about what I can use as a "cluster" in this situation.


Solution

  • There is rxExec, which behaves like lapply in the single-core scenario, and like parLapply in the multi-core/multi-process scenario. You would use it like this:

    # vector of file names to operate on
    files <- list.files()

    # func carries out the operations you want on each file
    func <- function(fname) {
        ...   # your per-file operations go here
    }

    rxSetComputeContext("localpar")
    rxExec(func, fname = rxElemArg(files))

    Here, func is the function that carries out the operations you want on the files, and you pass it to rxExec much like you would to lapply. The rxElemArg function tells rxExec to execute func on each of the different values of files. Setting the compute context to "localpar" starts up a local cluster of slave processes, so the operations will run in parallel. By default, the number of slaves is 4, but you can change this with rxOptions(numCoresToUse = n), as shown in the sketch below.
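    For example, a minimal end-to-end sketch might look like the following. The "input"/"output" directories, the CSV format, and the computation inside func are assumptions standing in for your own files and operations:

    # hypothetical example: adjust paths, file format and computation to your data
    rxOptions(numCoresToUse = 8)       # raise the worker count from the default of 4
    rxSetComputeContext("localpar")

    func <- function(fname) {
        dat <- read.csv(file.path("input", fname))     # read one file
        dat$result <- dat$x * dat$y                    # placeholder computation
        write.csv(dat, file.path("output", fname), row.names = FALSE)   # write a new file
        fname                                          # return value collected by rxExec
    }

    rxExec(func, fname = rxElemArg(list.files("input")))

    Like lapply, rxExec collects the return values into a list, one element per input file.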

    How much speedup can you expect to get? That depends on your data. If your files are small and most of the time is taken up by computations, then doing things in parallel can get you a big speedup. However, if your files are large, you may run into I/O bottlenecks, especially if all the files are on the same hard disk.
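    As for parLapply and mclapply: yes, they can use the cores of the same desktop by way of a local cluster, and they don't require RevoScaleR at all. A minimal sketch, assuming 4 local workers and a placeholder per-file function:

    library(parallel)

    files <- list.files()
    cl <- makeCluster(4)                      # 4 worker processes on the local machine
    res <- parLapply(cl, files, function(fname) {
        # placeholder for your per-file read/compute/write steps
        fname
    })
    stopCluster(cl)

    (mclapply does the same thing via forking, but only on Linux/Mac, not on Windows.)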