rfunctionoptimizationvectorization

How do you vectorize a function that assigns a value to an element based on a look up to a reference table?


I'm trying to match a category to each string in a vector based on it's adjacent value on a reference data.frame. Essentially this is the same thing as vlookup in Excel.

Here is an example of the vector I'm assigning values to:

x <- c("A", "B", "C", "A", "D", "F")

Here is an example of the data.frame called class_ref_table I am referencing:

code phase desc
A phase1 Desc 1
B phase2 Desc 2
C phase3 Desc 3
D phase4 Desc 4
E phase5 Desc 5
F phase6 Desc 6
class_ref_table <- structure(list(code = c("A", "B", "C", "D", "E", "F"), phase = c("phase1", 
"phase2", "phase3", "phase4", "phase5", "phase6"), desc = c("Desc 1", 
"Desc 2", "Desc 3", "Desc 4", "Desc 5", "Desc 6")), class = "data.frame", row.names = c(NA, 
6L))

Here is my current code:

class_reference <- function(vector, column) {
  temp_vector <- seq_along(vector)
  for (word in class_ref_table[[1]]) {
    z <- which(class_ref_table[[1]] == word)
    temp_vector[grep(word, vector)] <- class_ref_table[z, column]
  }
  temp_vector
}

Result example:

class_reference(x, 2)

[1] "phase1"
[2] "phase2"
[3] "phase3"
[4] "phase1"
[5] "phase4"
[6] "phase6"

Currently this code works fine, but it's a little slow. Currently it takes ~8.5 minutes to run this across ~3000 vectors with a total of ~100,000 elements. Is there a way to vectorize this function so it's really fast?

I vectorized a similar function before which reduced the run time from exponential time to linear time (10 minutes for 400,000 elements down to 0.2 seconds), however I have no idea how to do something similar here.

Any suggestions would be appreciated.


Solution

  • You could try indexing using match:

    class_ref_table[match(x, class_ref_table$code), "phase"]
    

    Output

    # [1] "phase1" "phase2" "phase3" "phase1" "phase4" "phase6"
    

    (If you want to return the whole data frame, just remove the “phase” from the above code)

    This seems to be about 12x faster than the original function:

    microbenchmark::microbenchmark(
      op_function = class_reference(vector = x, 2),
      match = class_ref_table[match(x, class_ref_table$code), "phase"]
    )
    
    #        expr     min       lq      mean   median       uq     max neval cld
    # op_function 180.030 190.2205 202.04175 197.2335 199.0600 562.735   100  a 
    # match        14.746  15.5965  17.16169  16.4825  17.8845  54.653   100   b
    

    This seems to hold true if you increase the size of the data frame (here, 1000x the size of the example data)

    big_df <- data.frame(sapply(df, rep.int, times = 1e3))
    
    # Unit: microseconds
    #        expr        min           lq         mean      median          uq         max neval cld
    # op_function 951551.544 1038882.8955 1196466.9788 1201759.472 1314753.790 1638215.506   100  a 
    # match          146.573     166.3865     189.6446     183.148     211.942     250.211   100   b