I am looking for a fast (ideally constant-time) way to take a large slice of a long raw vector in R. For example:
obj <- raw(2^32)
obj[seq_len(2^31 - 1)]
Even with ALTREP, base R takes too long.
system.time(obj[seq_len(2^31 - 1)])
#> user system elapsed
#> 19.470 38.853 148.288
I am trying to speed up storr in order to speed up drake. I want storr to save long raw vectors more quickly. writeBin() is super fast, but it still cannot handle vectors longer than 2^31 - 1 bytes. So I want to save the data in manageable chunks as described here. This almost works, but creating the chunks is too slow, and it duplicates too much data in memory.
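For reference, here is a minimal sketch of that chunked approach (my own illustration, not storr's actual code, and the helper name write_chunked is hypothetical): slice off pieces of fewer than 2^31 - 1 bytes and writeBin() each piece to the same connection. The subsetting inside the loop is exactly the slow, copy-heavy step this question is about.
write_chunked <- function(x, path, chunk_size = 2^30) {
  con <- file(path, "wb")
  on.exit(close(con))
  n <- length(x)
  from <- 1
  while (from <= n) {
    to <- min(from + chunk_size - 1, n)
    writeBin(x[from:to], con) # materializes a huge index vector and copies the chunk
    from <- to + 1
  }
}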
Let's create a function
slice_raw <- function(obj, from, to) {
  # ???
}
which is essentially equivalent to
obj[seq(from, to, by = 1L)]
and which is O(1) in both time and memory. In theory, all we should need to do is

1. Pass obj to a C function.
2. Add an offset to obj's data pointer.
3. Create a new RAWSXP at the offset pointer with the appropriate length (less than 2^31 bytes).
4. Return the RAWSXP.

I have a background in C, but I struggle to take full control of R's internals. I would like to access the C pointers inside SEXPs so I can do basic pointer arithmetic and create R vectors of known lengths from undecorated C pointers. The resources I found on R's C internals do not seem to explain how to wrap or unwrap pointers. Do we need Rcpp for this?
The following rough sketch gets at what I am trying to do.
library(inline)

sig <- c(
  x = "raw",         # Long raw vector with more than 2^31 - 1 bytes.
  start = "integer", # Should probably be R_xlen_t.
  bytes = "integer"  # <= 2^31 - 1. Ideally coercible to R_xlen_t.
)

body <- "
Rbyte* result;           // Just a reference. Want to avoid copying data.
result = RAW(x) + start; // Trying to do ordinary pointer arithmetic.
return asRaw(result);    // Want to return a raw vector of length `bytes`.
"

slice_raw <- cfunction(sig = sig, body = body)
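For comparison, here is a version of that sketch that at least compiles and runs (my own attempt, not a zero-copy solution): it gets the C pointer out of the SEXP, does the pointer arithmetic, and returns a fresh RAWSXP, but it has to memcpy the bytes into the new vector, so it is O(bytes) in time and memory rather than O(1). As far as I can tell, a true zero-copy slice would need something like ALTREP, because an ordinary RAWSXP owns its own buffer.
library(inline)

sig <- c(
  x = "raw",         # long raw vector
  start = "numeric", # 0-based byte offset; double so it can exceed 2^31 - 1
  bytes = "numeric"  # number of bytes to take; must stay below 2^31
)

body <- "
R_xlen_t from = (R_xlen_t) asReal(start);
R_xlen_t len  = (R_xlen_t) asReal(bytes);
SEXP out = PROTECT(allocVector(RAWSXP, len));
memcpy(RAW(out), RAW(x) + from, (size_t) len); // still copies: not O(1) memory
UNPROTECT(1);
return out;
"

slice_raw_copy <- cfunction(sig = sig, body = body,
                            includes = "#include <string.h>")

slice_raw_copy(obj, start = 2^31, bytes = 100) # hypothetical usage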
Thanks to Dirk for spurring my thinking on this one. For small enough data, we can use fst to save a single-column data frame, where the column is the raw vector we actually care about. This use of fst is faster than writeBin():
library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#> user system elapsed
#> 0.362 0.019 0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#> user system elapsed
#> 0.314 1.340 1.689
Created on 2019-06-16 by the reprex package (v0.3.0)
Unfortunately, it is difficult to create data frames with 2^31 or more rows. One hack is to convert the raw vector into a matrix first: the row and column counts each stay below 2^31, so we dodge the usual integer overflow even though the matrix can hold up to (2^31 - 1)^2 bytes (several exabytes).
library(fst)
x <- raw(2^32)
m <- matrix(x, nrow = 2^16, ncol = 2^16)
system.time(write_fst(as.data.frame(m), tempfile()))
#> user system elapsed
#> 8.776 1.459 9.519
Created on 2019-06-16 by the reprex package (v0.3.0)
We still leave saveRDS() in the dust, but we no longer beat writeBin(). The conversion from a matrix to a data frame is slow, and I am not sure it would scale well.
library(fst)
x <- raw(2^30)
m <- matrix(x, nrow = 2^15, ncol = 2^15)
system.time(write_fst(as.data.frame(m), tempfile()))
#> user system elapsed
#> 1.998 0.408 2.409
system.time(writeBin(as.raw(m), tempfile()))
#> user system elapsed
#> 0.329 0.839 1.397
Created on 2019-06-16 by the reprex package (v0.3.0)
If, as Dirk suggested, we could use an R_xlen_t to index the rows of a data frame, we might be able to avoid converting anything.
Although data.frames with long vector columns are currently not supported very well, you can still use fst to serialize long raw vectors:
# method for writing a raw vector to disk
write_raw <- function(x, path, compress = 50) {
  # create a list and add required attributes
  y <- list(X = x)
  attributes(y) <- c(attributes(y), class = "data.frame")
  # serialize and compress to disk
  fst::write_fst(y, path, compress)
}
# create raw vector of length >2^31
x <- rep(as.raw(0:255), 2^23 + 10)
# write raw vector
write_raw(x, "raw_vector.fst", 100)
With this scheme, there is no need to split the vector into multiple parts (which, as you already indicate, slows down serialization significantly). The raw vector can be re-read without any copying or slicing:
# method for reading a raw vector from disk
read_raw <- function(path) {
  # read from disk
  z <- fst::read_fst(path)
  z$X
}
z <- read_raw("raw_vector.fst")
fst::hash_fst(x) == fst::hash_fst(z)
#> [1] TRUE TRUE
(note that at the moment you need the fst development version for reading with long vector support)
In your setup, you will always be serializing the complete raw vector to disk as a whole (just like saveRDS() does). Because you do not need random access to the stored vector, the metadata stored in the fst file is a bit of an overkill. You might also test a setup where you compress the raw vector using compress_fst() and then store the result using saveRDS(raw_vec, compress = FALSE).
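A minimal sketch of that two-step setup (the helper names save_compressed() and read_compressed() are my own, hypothetical):
library(fst)

# compress in memory first, then serialize the compressed blob without recompressing
save_compressed <- function(x, path, compressor = "ZSTD", level = 50) {
  z <- compress_fst(x, compressor, level)
  saveRDS(z, path, compress = FALSE)
}

# read the compressed blob back and decompress to the original raw vector
read_compressed <- function(path) {
  decompress_fst(readRDS(path))
}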
The advantage of such a setup is that the compressor can use bigger chunks, which increases the compression ratio (the effect can be significant) and can also speed up compression.
On the other hand, the disadvantage is that you are no longer compressing during the write to disk as you are with write_fst(), which might slow down your serialization. And you no longer have random access, but you don't really need that anyway.
If you implement a two-step process (first compressing the data and then serializing it), you will be able to offer different compressors if the user opts for that (for example, slower compressors with very high compression ratios for slow disks).