Tags: r, ropensci, drake-r-package

Faster way to slice a (raw) vector?


Problem

I am looking for a fast (ideally constant-time) way to take a large slice of a long raw vector in R. For example:

obj <- raw(2^32)
obj[seq_len(2^31 - 1)]

Even with ALTREP, base R takes too long.

system.time(obj[seq_len(2^31 - 1)])
#>    user  system elapsed 
#>  19.470  38.853 148.288 

Why?

Because I am trying to speed up storr in order to speed up drake. I want storr to save long raw vectors more quickly. writeBin() is super fast, but it still cannot handle vectors longer than 2^31 - 1 bytes. So I want to save the data in manageable chunks as described here. This almost works, but creating the chunks is too slow, and it duplicates too much data in memory.
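
For reference, a minimal sketch of that chunking approach might look like the following (write_chunks() and the default chunk size are illustrative, not storr's actual code). The slice taken inside the loop is exactly the step that is too slow and duplicates memory.

write_chunks <- function(x, path, chunk_size = 2^30) {
  con <- file(path, "wb")
  on.exit(close(con))
  from <- 1
  while (from <= length(x)) {
    to <- min(from + chunk_size - 1, length(x))
    chunk <- x[seq(from, to)] # the slow, memory-hungry slice
    writeBin(chunk, con)      # each chunk stays under the 2^31 - 1 byte limit
    from <- to + 1
  }
}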

Ideas

Let's create a function

slice_raw <- function(obj, from, to) {
  # ???
}

which is essentially equivalent to

obj[seq(from, to, by = 1L)]

and which is O(1) in both time and memory. In theory, all we should need to do is

  1. Pass obj to a C function.
  2. Create a new pointer to the first byte of obj.
  3. Increment the new pointer to the start of the slice.
  4. Create a RAWSXP at the new pointer with the appropriate length (less than 2^31 bytes).
  5. Return the RAWSXP.

I have a background in C, but I struggle to take full control of R's internals. I would like to access the C pointers inside SEXPs so I can do basic pointer arithmetic and create R vectors of known lengths from undecorated C pointers. The resources I found on R's C internals do not seem to explain how to wrap or unwrap pointers. Do we need Rcpp for this?

The following rough sketch gets at what I am trying to do.

library(inline)
sig <- c(
  x = "raw",         # Long raw vector with more than 2^31 - 1 bytes.
  start = "integer", # Should probably be R_xlen_t.
  bytes = "integer"  # <= 2^31 - 1. Ideally coercible to R_xlen_t.
)
body <- "
Rbyte* result;           // Just a reference. Want to avoid copying data.
result = RAW(x) + start; // Trying to do ordinary pointer arithmetic.
return asRaw(result);    // Want to return a raw vector of length `bytes`.
"
slice_raw <- cfunction(sig = sig, body = body)
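
For comparison, here is one way to make a sketch along these lines actually compile (my own variant, not an established recipe). It performs a single memcpy() into a freshly allocated RAWSXP, so it is O(bytes) in time and memory rather than O(1), but it avoids the element-by-element indexing that makes `[` slow. The offset and length are passed as doubles so the offset can exceed 2^31 - 1.

library(inline)
sig <- c(
  x = "raw",         # long raw vector
  start = "numeric", # 0-based byte offset; double, so it can exceed 2^31 - 1
  bytes = "numeric"  # number of bytes to copy (kept below 2^31 - 1)
)
body <- "
R_xlen_t offset = (R_xlen_t) asReal(start);
R_xlen_t n      = (R_xlen_t) asReal(bytes);
// No bounds checking here; the caller must ensure start + bytes <= length(x).
SEXP out = PROTECT(allocVector(RAWSXP, n));   // new vector owns its own memory
memcpy(RAW(out), RAW(x) + offset, (size_t) n); // one bulk copy, no per-element work
UNPROTECT(1);
return out;
"
slice_raw <- cfunction(sig = sig, body = body, includes = "#include <string.h>")

obj <- raw(2^32)
length(slice_raw(obj, 2^31, 2^20)) # 1 MiB slice starting at byte offset 2^31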

EDIT: some more potential workarounds

Thanks to Dirk for spurring my thinking on this one. For small enough data, we can use fst to save a single-column data frame, where the column is the raw vector we actually care about. This use of fst is faster than writeBin():

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#>    user  system elapsed 
#>   0.362   0.019   0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#>    user  system elapsed 
#>   0.314   1.340   1.689

Created on 2019-06-16 by the reprex package (v0.3.0)

Unfortunately, it is difficult to create data frames with 2^31 or more rows. One hack is to convert the raw vector into a matrix first: each dimension stays below the 2^31 - 1 limit, so we avoid the usual integer overflow, and a matrix can hold up to (2^31 - 1)^2 bytes, which is several exabytes.

library(fst)
x <- raw(2^32)
m <- matrix(x, nrow = 2^16, ncol = 2^16)
system.time(write_fst(as.data.frame(m), tempfile()))
#>    user  system elapsed 
#>   8.776   1.459   9.519

Created on 2019-06-16 by the reprex package (v0.3.0)

We still leave saveRDS() in the dust, but we no longer beat writeBin(). The conversion from a matrix to a data frame is slow, and I am not sure it would scale well.

library(fst)
x <- raw(2^30)
m <- matrix(x, nrow = 2^15, ncol = 2^15)
system.time(write_fst(as.data.frame(m), tempfile()))
#>    user  system elapsed 
#>   1.998   0.408   2.409
system.time(writeBin(as.raw(m), tempfile()))
#>    user  system elapsed 
#>   0.329   0.839   1.397

Created on 2019-06-16 by the reprex package (v0.3.0)

If, like Dirk suggested, we can use an R_xlen_t to index the rows of a data frame, we might be able to avoid converting anything.


Solution

  • Although data frames with long vector columns are currently not supported very well, you can still use fst to serialize long raw vectors:

    # method for writing a raw vector to disk
    write_raw <- function(x, path, compress = 50) {
    
      # create a list and add required attributes
      y <- list(X = x)
      attributes(y) <- c(attributes(y), class = "data.frame")
    
      # serialize and compress to disk
      fst::write_fst(y, path, compress)
    }
    
    # create raw vector of length >2^31
    x <- rep(as.raw(0:255), 2^23 + 10)
    
    # write raw vector
    write_raw(x, "raw_vector.fst", 100)
    

    With this scheme there is no need to split the vector into multiple parts (which, as you already indicate, slows down serialization significantly). The raw vector can be re-read without any copying or slicing:

    # method for reading a raw vector from disk
    read_raw <- function(path) {
    
      # read from disk
      z <- fst::read_fst(path)
    
      z$X
    }
    
    z <- read_raw("raw_vector.fst")
    
    fst::hash_fst(x) == fst::hash_fst(z)
    #> [1] TRUE TRUE
    

    (note that at the moment you need the fst development version for reading with long vector support)

    In your setup, you will always be serializing the complete raw vector to disk as a whole (just like saveRDS()). Because you do not need random access to the stored vector, the metadata stored in the fst file is a bit of overkill. You might also test a setup where you compress the raw vector using compress_fst() and then store the result using saveRDS(raw_vec, compress = FALSE).

    The advantage of such a setup would be that the compressor can use bigger chunks for compression, increasing the compression ratio (effect can be significant). Using larger chunks can also speed up compression.

    On the other hand, the disadvantage is that you are not compressing during the write to disk as you would with write_fst(), so serialization itself might be slower. And you lose random access, but you don't really need that anyway.

    If you implement a two-step process (first compressing the data and then serializing it), you can let the user choose among different compressors (for example, slower compressors with very high compression ratios for slow disks).
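
    A rough sketch of that two-step setup might look like this (the helper names, file name, and compressor settings are mine, purely for illustration):

    library(fst)
    
    # compress first, then serialize the compressed bytes without
    # a second round of compression inside saveRDS()
    save_compressed <- function(x, path, compressor = "ZSTD", compression = 50) {
      saveRDS(compress_fst(x, compressor, compression), path, compress = FALSE)
    }
    
    read_compressed <- function(path) {
      decompress_fst(readRDS(path))
    }
    
    save_compressed(x, "raw_vector_compressed.rds")
    z <- read_compressed("raw_vector_compressed.rds")
    identical(x, z) # should be TRUE if the round trip is lossless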