I'm trying to match a category to each string in a vector based on its adjacent value in a reference data.frame. Essentially this is the same thing as VLOOKUP in Excel.
Here is an example of the vector I'm assigning values to:
x <- c("A", "B", "C", "A", "D", "F")
Here is an example of the data.frame, called class_ref_table, that I am referencing:
| code | phase  | desc   |
|------|--------|--------|
| A    | phase1 | Desc 1 |
| B    | phase2 | Desc 2 |
| C    | phase3 | Desc 3 |
| D    | phase4 | Desc 4 |
| E    | phase5 | Desc 5 |
| F    | phase6 | Desc 6 |
class_ref_table <- structure(
  list(
    code  = c("A", "B", "C", "D", "E", "F"),
    phase = c("phase1", "phase2", "phase3", "phase4", "phase5", "phase6"),
    desc  = c("Desc 1", "Desc 2", "Desc 3", "Desc 4", "Desc 5", "Desc 6")
  ),
  class = "data.frame",
  row.names = c(NA, 6L)
)
Here is my current code:
class_reference <- function(vector, column) {
  # Pre-allocate a result vector the same length as the input
  temp_vector <- seq_along(vector)
  # Loop over every code in the first column of the reference table
  for (word in class_ref_table[[1]]) {
    # Row of the reference table holding this code
    z <- which(class_ref_table[[1]] == word)
    # Overwrite every matching position with the looked-up value
    temp_vector[grep(word, vector)] <- class_ref_table[z, column]
  }
  temp_vector
}
Result example:
class_reference(x, 2)
[1] "phase1"
[2] "phase2"
[3] "phase3"
[4] "phase1"
[5] "phase4"
[6] "phase6"
This code works correctly, but it's slow: it takes ~8.5 minutes to run across ~3,000 vectors with a total of ~100,000 elements. Is there a way to vectorize this function so it's really fast?
I vectorized a similar function before, which cut the run time from roughly 10 minutes for 400,000 elements down to about 0.2 seconds, but I have no idea how to do something similar here.
Any suggestions would be appreciated.
You could try indexing using `match()`:
class_ref_table[match(x, class_ref_table$code), "phase"]
Output
# [1] "phase1" "phase2" "phase3" "phase1" "phase4" "phase6"
(If you want to return the whole matched rows of the data frame rather than a single column, just drop the `"phase"` column index, i.e. `class_ref_table[match(x, class_ref_table$code), ]`.)
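If you want this as a drop-in replacement for the original function (keeping its vector-plus-column-index signature), here is a minimal sketch — `class_reference_vec` is just an illustrative name, and it assumes `class_ref_table` is in scope as in the question:

# Hypothetical vectorized drop-in for class_reference(), built on match()
class_reference_vec <- function(vector, column) {
  class_ref_table[match(vector, class_ref_table$code), column]
}

class_reference_vec(x, 2)
# [1] "phase1" "phase2" "phase3" "phase1" "phase4" "phase6"

One behavioral difference to be aware of: `match()` does exact matching, while the original `grep()` call does partial regex matching, so a code that is a substring of another code would behave differently. With single-letter codes like these, the results are identical.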
This seems to be about 12x faster than the original function:
microbenchmark::microbenchmark(
op_function = class_reference(vector = x, 2),
match = class_ref_table[match(x, class_ref_table$code), "phase"]
)
# Unit: microseconds
#        expr     min       lq      mean   median      uq     max neval cld
# op_function 180.030 190.2205 202.04175 197.2335 199.0600 562.735   100   a
#       match  14.746  15.5965  17.16169  16.4825  17.8845  54.653   100   b
This seems to hold true if you increase the size of the data frame (here, 1000x the size of the example data):
big_df <- data.frame(sapply(class_ref_table, rep.int, times = 1e3))
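The output below presumably came from a call along these lines (a sketch — the invocation isn't shown above; note that `class_reference()` reads the global `class_ref_table`, so it has to point at the enlarged table first):

class_ref_table <- big_df
microbenchmark::microbenchmark(
  op_function = class_reference(vector = x, 2),
  match = class_ref_table[match(x, class_ref_table$code), "phase"]
)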
# Unit: microseconds
#        expr        min           lq         mean      median          uq         max neval cld
# op_function 951551.544 1038882.8955 1196466.9788 1201759.472 1314753.790 1638215.506   100   a
#       match    146.573     166.3865     189.6446     183.148     211.942     250.211   100   b
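The speed-up comes from `match()` scanning the lookup codes in a single pass (internally it uses a hash table), so the cost grows with the sizes of the two vectors rather than with their product, as in the `grep()` loop. For completeness, a named-vector lookup is another common vectorized idiom in base R (a sketch over the same example data):

# Build a named lookup vector once, then index it by name
phase_lookup <- setNames(class_ref_table$phase, class_ref_table$code)
unname(phase_lookup[x])
# [1] "phase1" "phase2" "phase3" "phase1" "phase4" "phase6"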