I need to apply an optimization function to millions of lines of data that returns a number based on two input variables. Applying the function using mapply
takes forever. Since there are a limited range of values in the data with lots of repeats, I figured it would be faster to make a lookup table for the entire range of possible input combinations, and then just use a join from dplyr to fill in the data. My attempts have failed because of some strange behavior in my lookup table (actually a data frame). Below is a repeatable example. I'm trying to understand why I cannot reference some values in the lookup table, but others work OK.
## build a lookup table with three columns
lookup <- transform(expand.grid(0:180, seq(0, 35, by=0.1)), v=0)
colnames(lookup) <- c("d", "s", "v")
lookup$v <- 1:nrow(lookup)
## generate some fake data with random values
set.seed(42) ## for sake of reproducibility
data <- data.frame(d=sample(lookup$d, 1000, replace=TRUE),
s=round(runif(n=1000, min=0, max=35), 1))
## left join gives a lot of NAs
joined <- dplyr::left_join(x=data, y=lookup, by=c('d', 's'))
joined[1:20, ]
# d s v
# 1 53 5.3 NA
# 2 124 20.2 NA
# 3 172 21.5 39088
# 4 137 5.3 NA
# 5 52 18.8 34081
# 6 67 20.0 36268
# 7 87 21.1 38279
# 8 64 29.4 NA
# 9 97 18.5 33583
# 10 159 8.6 15726
# 11 40 10.4 18865
# 12 138 29.6 53715
# 13 33 15.7 NA
# 14 171 24.4 NA
# 15 126 33.2 60219
# 16 120 20.1 36502
# 17 15 17.3 31329
# 18 104 17.4 NA
# 19 34 3.3 NA
# 20 92 33.6 60909
## unable to reference certain values...
lookup$v[which(lookup$d == 177 & lookup$s == 8.2)]
#integer(0)
## ...even though they exist
lookup[15020, ]
# d s v
#15020 177 8.2 15020
## other values are OK
lookup$v[which(lookup$d == 177 & lookup$s == 8.3)]
#[1] 15201
You're dealing with floating point precision and thus not getting what you expect. Let's start with your second question.
Instead of
> lookup[15020, ]
d s v
15020 177 8.2 15020
>
> lookup$s[15020] == 8.2
[1] FALSE
try
> tol <- .Machine$double.eps^0.5
> lookup$s[15020] - 8.2 < tol
[1] TRUE
> lookup$v[which(lookup$d == 177 & abs(lookup$s - 8.2) < tol)]
[1] 15020
Read more about this in this answer.
This might also relate to your first question, that merge
ing has issues.
Instead using merge
(or dplyr::*join
) we could expand the tol
erance logic.
> f <- \(x, y, tol=.Machine$double.eps^0.5) {
+ which(
+ colSums(
+ abs(t(lookup[, c('d', 's')]) - c(x, y)) < rep_len(tol, 2),
+ ) == 2L
+ )
+ }
> res1 <- data |> transform(v=lookup$v[mapply(f, data$d, data$s)])
> res1 |> head(20)
d s v
1 53 5.3 9647
2 124 20.2 36687
3 172 21.5 39088
4 137 5.3 9731
5 52 18.8 34081
6 67 20.0 36268
7 87 21.1 38279
8 64 29.4 53279
9 97 18.5 33583
10 159 8.6 15726
11 40 10.4 18865
12 138 29.6 53715
13 33 15.7 28451
14 171 24.4 44336
15 126 33.2 60219
16 120 20.1 36502
17 15 17.3 31329
18 104 17.4 31599
19 34 3.3 6008
20 92 33.6 60909
Since this is slow we could employ Rcpp
.
> Rcpp::cppFunction('
+ IntegerVector match_tol(NumericMatrix lookup, NumericMatrix data) {
+ int n = data.nrow();
+ IntegerVector res(n);
+ double tol = sqrt(std::numeric_limits<double>::epsilon());
+ for (int i = 0; i < n; ++i) {
+ double d = data(i, 0); // first data column
+ double s = data(i, 1); // second
+ for (int j = 0; j < lookup.nrow(); ++j) {
+ if (lookup(j, 0) == d && std::abs(lookup(j, 1) - s) < tol) {
+ res[i] = lookup(j, 2);
+ break;
+ }
+ }
+ }
+ return res;
+ }
+ ')
> res2 <- data |> transform(v=match_tol(as.matrix(lookup), as.matrix(data)))
> res2 |> head(20)
d s v
1 53 5.3 9647
2 124 20.2 36687
3 172 21.5 39088
4 137 5.3 9731
5 52 18.8 34081
6 67 20.0 36268
7 87 21.1 38279
8 64 29.4 53279
9 97 18.5 33583
10 159 8.6 15726
11 40 10.4 18865
12 138 29.6 53715
13 33 15.7 28451
14 171 24.4 44336
15 126 33.2 60219
16 120 20.1 36502
17 15 17.3 31329
18 104 17.4 31599
19 34 3.3 6008
20 92 33.6 60909
Other solutions, including base::merge(., sort=FALSE)
on numeric or character format might still have issues*:
> joined1 <- merge(data, lookup, all.x=TRUE, sort=FALSE)
> joined1 |> head(20)
d s v
1 53 5.3 9647
2 124 20.2 36687
3 172 21.5 39088
4 137 5.3 9731
5 52 18.8 34081
6 67 20.0 36268
7 87 21.1 38279
8 64 29.4 53279
9 97 18.5 33583
10 159 8.6 15726
11 40 10.4 18865
12 138 29.6 53715
13 33 15.7 28451
14 171 24.4 44336
15 126 33.2 60219
16 120 20.1 36502
17 15 17.3 31329
18 104 17.4 31599 ## *
19 104 17.4 31599 ## *
20 34 3.3 6008
> joined2 <- merge(
+ sapply(data, as.character),
+ sapply(lookup, as.character),
+ all.x=TRUE, sort=FALSE) |>
+ sapply(as.numeric) |>
+ as.data.frame() ## optional
> joined2 |> head(20)
d s v
1 53 5.3 9647
2 124 20.2 36687
3 172 21.5 39088
4 137 5.3 9731
5 52 18.8 34081
6 67 20.0 36268
7 87 21.1 38279
8 64 29.4 53279
9 97 18.5 33583
10 159 8.6 15726
11 40 10.4 18865
12 138 29.6 53715
13 33 15.7 28451
14 171 24.4 44336
15 126 33.2 60219
16 120 20.1 36502
17 15 17.3 31329
18 104 17.4 31599 ## *
19 104 17.4 31599 ## *
20 34 3.3 6008
Compare:
> data |> head(20)
d s
1 53 5.3
2 124 20.2
3 172 21.5
4 137 5.3
5 52 18.8
6 67 20.0
7 87 21.1
8 64 29.4
9 97 18.5
10 159 8.6
11 40 10.4
12 138 29.6
13 33 15.7
14 171 24.4
15 126 33.2
16 120 20.1
17 15 17.3
18 104 17.4
19 34 3.3
20 92 33.6
Data:
> lookup <- expand.grid(d=0:180, s=seq(0, 35, by=0.1)) |>
+ transform(v=seq_len(35/.1 + 1))
> set.seed(42)
> data <- data.frame(d=sample(lookup$d, 1000, replace=TRUE),
+ s=round(runif(n=1000, min=0, max=35), 1))