Why does my left.join fail? Indexing issue?

I need to apply an optimization function to millions of lines of data that returns a number based on two input variables. Applying the function using mapply takes forever. Since there are a limited range of values in the data with lots of repeats, I figured it would be faster to make a lookup table for the entire range of possible input combinations, and then just use a join from dplyr to fill in the data. My attempts have failed because of some strange behavior in my lookup table (actually a data frame). Below is a repeatable example. I'm trying to understand why I cannot reference some values in the lookup table, but others work OK.

## build a lookup table with three columns
lookup <- transform(expand.grid(0:180, seq(0, 35, by=0.1)), v=0)
colnames(lookup) <- c("d", "s", "v")
lookup$v <- 1:nrow(lookup)

## generate some fake data with random values
set.seed(42)  ## for sake of reproducibility
data <- data.frame(d=sample(lookup$d, 1000, replace=TRUE), 
                   s=round(runif(n=1000, min=0, max=35), 1))

## left join gives a lot of NAs 
joined <- dplyr::left_join(x=data, y=lookup, by=c('d', 's'))
joined[1:20, ]
#      d    s     v
# 1   53  5.3    NA
# 2  124 20.2    NA
# 3  172 21.5 39088
# 4  137  5.3    NA
# 5   52 18.8 34081
# 6   67 20.0 36268
# 7   87 21.1 38279
# 8   64 29.4    NA
# 9   97 18.5 33583
# 10 159  8.6 15726
# 11  40 10.4 18865
# 12 138 29.6 53715
# 13  33 15.7    NA
# 14 171 24.4    NA
# 15 126 33.2 60219
# 16 120 20.1 36502
# 17  15 17.3 31329
# 18 104 17.4    NA
# 19  34  3.3    NA
# 20  92 33.6 60909

## unable to reference certain values...
lookup$v[which(lookup$d == 177 & lookup$s == 8.2)]
#integer(0)

## ...even though they exist
lookup[15020, ]
#      d   s   v
#15020 177 8.2 15020

## other values are OK
lookup$v[which(lookup$d == 177 & lookup$s == 8.3)]
#[1] 15201

Solution

You're dealing with floating point precision and thus not getting what you expect. Let's start with your second question.

Instead of

> lookup[15020, ]
        d   s     v
15020 177 8.2 15020
>
> lookup$s[15020] == 8.2
[1] FALSE

try

> tol <- .Machine$double.eps^0.5
> lookup$s[15020] - 8.2 < tol
[1] TRUE
> lookup$v[which(lookup$d == 177 & abs(lookup$s - 8.2) < tol)]
[1] 15020

Read more about this in this answer.

This might also relate to your first question, that mergeing has issues.

Instead using merge (or dplyr::*join) we could expand the tolerance logic.

> f <- \(x, y, tol=.Machine$double.eps^0.5) {
+   which(
+     colSums(
+       abs(t(lookup[, c('d', 's')]) - c(x, y)) < rep_len(tol, 2),
+     ) == 2L
+   )
+ }
> res1 <- data |> transform(v=lookup$v[mapply(f, data$d, data$s)])
> res1 |> head(20) 
     d    s     v
1   53  5.3  9647
2  124 20.2 36687
3  172 21.5 39088
4  137  5.3  9731
5   52 18.8 34081
6   67 20.0 36268
7   87 21.1 38279
8   64 29.4 53279
9   97 18.5 33583
10 159  8.6 15726
11  40 10.4 18865
12 138 29.6 53715
13  33 15.7 28451
14 171 24.4 44336
15 126 33.2 60219
16 120 20.1 36502
17  15 17.3 31329
18 104 17.4 31599
19  34  3.3  6008
20  92 33.6 60909

Since this is slow we could employ Rcpp.

> Rcpp::cppFunction('
+ IntegerVector match_tol(NumericMatrix lookup, NumericMatrix data) {
+   int n = data.nrow();
+   IntegerVector res(n);
+   double tol = sqrt(std::numeric_limits<double>::epsilon());
+   for (int i = 0; i < n; ++i) {
+     double d = data(i, 0);  // first data column
+     double s = data(i, 1);  // second
+     for (int j = 0; j < lookup.nrow(); ++j) {
+       if (lookup(j, 0) == d && std::abs(lookup(j, 1) - s) < tol) {
+         res[i] = lookup(j, 2);
+         break;
+       }
+     }
+   }
+   return res;
+ }
+ ')

> res2 <- data |> transform(v=match_tol(as.matrix(lookup), as.matrix(data)))
> res2 |> head(20)
     d    s     v
1   53  5.3  9647
2  124 20.2 36687
3  172 21.5 39088
4  137  5.3  9731
5   52 18.8 34081
6   67 20.0 36268
7   87 21.1 38279
8   64 29.4 53279
9   97 18.5 33583
10 159  8.6 15726
11  40 10.4 18865
12 138 29.6 53715
13  33 15.7 28451
14 171 24.4 44336
15 126 33.2 60219
16 120 20.1 36502
17  15 17.3 31329
18 104 17.4 31599
19  34  3.3  6008
20  92 33.6 60909

Other solutions, including base::merge(., sort=FALSE) on numeric or character format might still have issues*:

> joined1 <- merge(data, lookup, all.x=TRUE, sort=FALSE) 
> joined1 |> head(20)
     d    s     v
1   53  5.3  9647
2  124 20.2 36687
3  172 21.5 39088
4  137  5.3  9731
5   52 18.8 34081
6   67 20.0 36268
7   87 21.1 38279
8   64 29.4 53279
9   97 18.5 33583
10 159  8.6 15726
11  40 10.4 18865
12 138 29.6 53715
13  33 15.7 28451
14 171 24.4 44336
15 126 33.2 60219
16 120 20.1 36502
17  15 17.3 31329
18 104 17.4 31599  ## *
19 104 17.4 31599  ## *
20  34  3.3  6008

> joined2 <- merge(
+   sapply(data, as.character), 
+   sapply(lookup, as.character), 
+   all.x=TRUE, sort=FALSE) |> 
+   sapply(as.numeric) |> 
+   as.data.frame()  ## optional
> joined2 |> head(20)
     d    s     v
1   53  5.3  9647
2  124 20.2 36687
3  172 21.5 39088
4  137  5.3  9731
5   52 18.8 34081
6   67 20.0 36268
7   87 21.1 38279
8   64 29.4 53279
9   97 18.5 33583
10 159  8.6 15726
11  40 10.4 18865
12 138 29.6 53715
13  33 15.7 28451
14 171 24.4 44336
15 126 33.2 60219
16 120 20.1 36502
17  15 17.3 31329
18 104 17.4 31599  ## *
19 104 17.4 31599  ## *
20  34  3.3  6008

Compare:

> data |> head(20)
     d    s
1   53  5.3
2  124 20.2
3  172 21.5
4  137  5.3
5   52 18.8
6   67 20.0
7   87 21.1
8   64 29.4
9   97 18.5
10 159  8.6
11  40 10.4
12 138 29.6
13  33 15.7
14 171 24.4
15 126 33.2
16 120 20.1
17  15 17.3
18 104 17.4
19  34  3.3
20  92 33.6

Data:

> lookup <- expand.grid(d=0:180, s=seq(0, 35, by=0.1)) |>
+   transform(v=seq_len(35/.1 + 1))
> set.seed(42)
> data <- data.frame(d=sample(lookup$d, 1000, replace=TRUE), 
+                    s=round(runif(n=1000, min=0, max=35), 1))