r, data.table

Is there a faster way to populate this vector?


I am trying to populate a new vector based on values of an original vector.

For example:

# Key
let <- c("a", "b", "c")
num <- c("one", "two", "three") 

# Given the following: 
v1 <-  c("one", "two", "three", "two", "one")
# Create the following using the key above:
v2 <- c("a", "b", "c", "b", "a")

I have used data.table with reasonable success, but I'm wondering if there's a strategy I've overlooked. I want to be able to do this on vectors of length 1 billion or more, but I am running into memory issues.

# EXAMPLE
library(data.table)

# Create numbers and letters that correspond to each other:
data_key <- 1:100
letter_class <- sample(letters, 100, replace = TRUE)

# Create vector of numbers
v1 <- sample(data_key, 1e8, replace = TRUE)
v2 <- c() # Goal: fill v2 with the letter_class that corresponds to each number in v1


# Create data with data_key and letter_class
key_table <- data.table(
  data_key,
  letter_class
)

d1 <- data.table(data_key = v1)

# Subset-method
t1 <- Sys.time()
v2_sub <- key_table[d1, on = "data_key"][["letter_class"]]
Sys.time() - t1 
# Time difference of 3.457874 secs

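# Merge-Method
# Note: merge() sorts the result by data_key by default, so this output is
# not in the order of v1; passing sort = FALSE would keep the row order of d1.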
t2 <- Sys.time()
v2_merge <- merge(d1,
            key_table,
            by = "data_key", 
            all.x = TRUE)[["letter_class"]]
Sys.time() - t2
# Time difference of 7.833402 secs

I have 32 GB of RAM.


Solution

  • The fastmatch package provides fmatch, a faster version of match. Benchmarked on your sample data, it is roughly 2.4 times as fast as base match (about 824 ms vs. 1.98 s) and allocates less memory (1.12 GB vs. 1.49 GB).

    library(fastmatch)
    bench::mark(
      base = letter_class[match(v1, data_key)],
      fastmatch = letter_class[fmatch(v1, data_key)]
    )
    # A tibble: 2 × 13
    #   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time          
    #   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>        
    # 1 base          1.98s    1.98s     0.505    1.49GB    0.505     1     1      1.98s <chr>  <Rprofmem> <bench_tm [1]>
    # 2 fastmatch  823.87ms 823.87ms     1.21     1.12GB    1.21      1     1   823.87ms <chr>  <Rprofmem> <bench_tm [1]>
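
To make the lookup idiom concrete, here is a minimal sketch on the toy key from the question. It uses base match so that it runs without extra packages; fmatch takes the same arguments and can be swapped in. The helper lookup_chunked is hypothetical (it is not part of the benchmark above) and assumes a character values vector. Processing in chunks may lower peak memory on very long inputs because the intermediate integer index only ever covers one chunk, but that has not been measured here.

```r
# Toy key from the question: each entry of num maps to the entry of let
# at the same position.
let <- c("a", "b", "c")
num <- c("one", "two", "three")

v1 <- c("one", "two", "three", "two", "one")

# match() returns, for each element of v1, its position in num;
# indexing let by those positions performs the lookup.
v2 <- let[match(v1, num)]
v2
# [1] "a" "b" "c" "b" "a"

# Hypothetical helper (an assumption, not from the answer above): run the
# same lookup chunk by chunk so the integer index from match() is never
# built for the whole vector at once. Assumes `values` is character.
lookup_chunked <- function(x, keys, values, chunk_size = 1e7) {
  n <- length(x)
  if (n == 0) return(character(0))
  out <- character(n)
  for (s in seq(1, n, by = chunk_size)) {
    e <- min(s + chunk_size - 1, n)
    out[s:e] <- values[match(x[s:e], keys)]
  }
  out
}

# A small chunk size here just exercises the loop on the toy data.
v2_chunked <- lookup_chunked(v1, num, let, chunk_size = 2)
```

Elements of x with no entry in keys come back as NA, as with plain match. The result vector itself still has to fit in memory, so chunking only trims the temporary overhead.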