rjoindplyrindexingleft-join

Why does my left.join fail? Indexing issue?


I need to apply an optimization function to millions of lines of data that returns a number based on two input variables. Applying the function using mapply takes forever. Since there are a limited range of values in the data with lots of repeats, I figured it would be faster to make a lookup table for the entire range of possible input combinations, and then just use a join from dplyr to fill in the data. My attempts have failed because of some strange behavior in my lookup table (actually a data frame). Below is a repeatable example. I'm trying to understand why I cannot reference some values in the lookup table, but others work OK.

## build a lookup table with three columns
lookup <- transform(expand.grid(0:180, seq(0, 35, by=0.1)), v=0)
colnames(lookup) <- c("d", "s", "v")
lookup$v <- 1:nrow(lookup)

## generate some fake data with random values
set.seed(42)  ## for sake of reproducibility
data <- data.frame(d=sample(lookup$d, 1000, replace=TRUE), 
                   s=round(runif(n=1000, min=0, max=35), 1))

## left join gives a lot of NAs 
joined <- dplyr::left_join(x=data, y=lookup, by=c('d', 's'))
joined[1:20, ]
#      d    s     v
# 1   53  5.3    NA
# 2  124 20.2    NA
# 3  172 21.5 39088
# 4  137  5.3    NA
# 5   52 18.8 34081
# 6   67 20.0 36268
# 7   87 21.1 38279
# 8   64 29.4    NA
# 9   97 18.5 33583
# 10 159  8.6 15726
# 11  40 10.4 18865
# 12 138 29.6 53715
# 13  33 15.7    NA
# 14 171 24.4    NA
# 15 126 33.2 60219
# 16 120 20.1 36502
# 17  15 17.3 31329
# 18 104 17.4    NA
# 19  34  3.3    NA
# 20  92 33.6 60909
## unable to reference certain values...
lookup$v[which(lookup$d == 177 & lookup$s == 8.2)]
#integer(0)

## ...even though they exist
lookup[15020, ]
#      d   s   v
#15020 177 8.2 15020

## other values are OK
lookup$v[which(lookup$d == 177 & lookup$s == 8.3)]
#[1] 15201

Solution

  • You're dealing with floating point precision and thus not getting what you expect. Let's start with your second question.

    Instead of

    > lookup[15020, ]
            d   s     v
    15020 177 8.2 15020
    >
    > lookup$s[15020] == 8.2
    [1] FALSE
    

    try

    > tol <- .Machine$double.eps^0.5
    > lookup$s[15020] - 8.2 < tol
    [1] TRUE
    > lookup$v[which(lookup$d == 177 & abs(lookup$s - 8.2) < tol)]
    [1] 15020
    

    Read more about this in this answer.

    This might also relate to your first question, that mergeing has issues.

    Instead using merge (or dplyr::*join) we could expand the tolerance logic.

    > f <- \(x, y, tol=.Machine$double.eps^0.5) {
    +   which(
    +     colSums(
    +       abs(t(lookup[, c('d', 's')]) - c(x, y)) < rep_len(tol, 2),
    +     ) == 2L
    +   )
    + }
    > res1 <- data |> transform(v=lookup$v[mapply(f, data$d, data$s)])
    > res1 |> head(20) 
         d    s     v
    1   53  5.3  9647
    2  124 20.2 36687
    3  172 21.5 39088
    4  137  5.3  9731
    5   52 18.8 34081
    6   67 20.0 36268
    7   87 21.1 38279
    8   64 29.4 53279
    9   97 18.5 33583
    10 159  8.6 15726
    11  40 10.4 18865
    12 138 29.6 53715
    13  33 15.7 28451
    14 171 24.4 44336
    15 126 33.2 60219
    16 120 20.1 36502
    17  15 17.3 31329
    18 104 17.4 31599
    19  34  3.3  6008
    20  92 33.6 60909
    

    Since this is slow we could employ Rcpp.

    > Rcpp::cppFunction('
    + IntegerVector match_tol(NumericMatrix lookup, NumericMatrix data) {
    +   int n = data.nrow();
    +   IntegerVector res(n);
    +   double tol = sqrt(std::numeric_limits<double>::epsilon());
    +   for (int i = 0; i < n; ++i) {
    +     double d = data(i, 0);  // first data column
    +     double s = data(i, 1);  // second
    +     for (int j = 0; j < lookup.nrow(); ++j) {
    +       if (lookup(j, 0) == d && std::abs(lookup(j, 1) - s) < tol) {
    +         res[i] = lookup(j, 2);
    +         break;
    +       }
    +     }
    +   }
    +   return res;
    + }
    + ')
    

    > res2 <- data |> transform(v=match_tol(as.matrix(lookup), as.matrix(data)))
    > res2 |> head(20)
         d    s     v
    1   53  5.3  9647
    2  124 20.2 36687
    3  172 21.5 39088
    4  137  5.3  9731
    5   52 18.8 34081
    6   67 20.0 36268
    7   87 21.1 38279
    8   64 29.4 53279
    9   97 18.5 33583
    10 159  8.6 15726
    11  40 10.4 18865
    12 138 29.6 53715
    13  33 15.7 28451
    14 171 24.4 44336
    15 126 33.2 60219
    16 120 20.1 36502
    17  15 17.3 31329
    18 104 17.4 31599
    19  34  3.3  6008
    20  92 33.6 60909
    

    Other solutions, including base::merge(., sort=FALSE) on numeric or character format might still have issues*:

    > joined1 <- merge(data, lookup, all.x=TRUE, sort=FALSE) 
    > joined1 |> head(20)
         d    s     v
    1   53  5.3  9647
    2  124 20.2 36687
    3  172 21.5 39088
    4  137  5.3  9731
    5   52 18.8 34081
    6   67 20.0 36268
    7   87 21.1 38279
    8   64 29.4 53279
    9   97 18.5 33583
    10 159  8.6 15726
    11  40 10.4 18865
    12 138 29.6 53715
    13  33 15.7 28451
    14 171 24.4 44336
    15 126 33.2 60219
    16 120 20.1 36502
    17  15 17.3 31329
    18 104 17.4 31599  ## *
    19 104 17.4 31599  ## *
    20  34  3.3  6008
    

    > joined2 <- merge(
    +   sapply(data, as.character), 
    +   sapply(lookup, as.character), 
    +   all.x=TRUE, sort=FALSE) |> 
    +   sapply(as.numeric) |> 
    +   as.data.frame()  ## optional
    > joined2 |> head(20)
         d    s     v
    1   53  5.3  9647
    2  124 20.2 36687
    3  172 21.5 39088
    4  137  5.3  9731
    5   52 18.8 34081
    6   67 20.0 36268
    7   87 21.1 38279
    8   64 29.4 53279
    9   97 18.5 33583
    10 159  8.6 15726
    11  40 10.4 18865
    12 138 29.6 53715
    13  33 15.7 28451
    14 171 24.4 44336
    15 126 33.2 60219
    16 120 20.1 36502
    17  15 17.3 31329
    18 104 17.4 31599  ## *
    19 104 17.4 31599  ## *
    20  34  3.3  6008
    

    Compare:

    > data |> head(20)
         d    s
    1   53  5.3
    2  124 20.2
    3  172 21.5
    4  137  5.3
    5   52 18.8
    6   67 20.0
    7   87 21.1
    8   64 29.4
    9   97 18.5
    10 159  8.6
    11  40 10.4
    12 138 29.6
    13  33 15.7
    14 171 24.4
    15 126 33.2
    16 120 20.1
    17  15 17.3
    18 104 17.4
    19  34  3.3
    20  92 33.6
    

    Data:

    > lookup <- expand.grid(d=0:180, s=seq(0, 35, by=0.1)) |>
    +   transform(v=seq_len(35/.1 + 1))
    > set.seed(42)
    > data <- data.frame(d=sample(lookup$d, 1000, replace=TRUE), 
    +                    s=round(runif(n=1000, min=0, max=35), 1))