r performance nearest-neighbor multivariate-time-series

Fast way to find k nearest neighbors from only past, in R?

Assume I have a numeric data with 10,000 rows and 8 columns. I want to obtain the first k neighbors for each row (skipping first 1,000 rows) using euclidean distance but the catch is for each row I am only interested in the previous rows. (e.g. for the 2001th row, I only search first 2000 rows).

Changing the reference for each row is too slow. The fastest function I could write was using RANN to get 2k (or 5k) closest neighbors then filtering out the future observations.

A slow example: (200 rows, 5 columns, 3 nearest neighbors)

data = matrix(rnorm(1000), nrow = 200, ncol = 5)
result <- list()
for (i in c(101:200)) {
    distances <- apply(data[1:(i-1),], 1, function(x) {
        dist(rbind(x, data[i, ]))
    })
    neighbors <- sort(distances, index.return = TRUE)$ix[1:3]
    result[[i - 100]] <- neighbors
}

The fastest approach was using RANN::nn2(data = data[1:200,], query = data[101:200,], k = 2*k), then filtering the future ones (and hope to have at least k values).

The filtering part and unneccessary computation of nearest neighbors increase the time complexity significantly.

I would be glad to hear any suggestions.

Solution

A solution using RANN::nn2. The idea is to break the data into chunks and process each separately before combining the results. It processes a 10k-by-8 matrix in a fraction of a second.

library(RANN) # for the nn2 function
library(Rfast) # for the rowOrder function
library(data.table) # for the rbindlist function

nr <- 1e4L
start <- 1001L
nc <- 8L
k <- 3L
data <- matrix(rnorm(nr*nc), nr, nc)

system.time({
  n <- 450L # chunk size
  n1 <- n - 1L
  
  nn1 <- rbindlist(
    lapply(
      # split the data into chunks
      as.data.frame(matrix(start:nr, n)),
      function(i) {
        # initialize a matrix with -Inf
        d <- matrix(Inf, n1, n1)
        # fill the lower diagonal of m with the negative of the distance matrix
        d[sequence(n1:1, seq(1, n1^2, n))] <- dist(data[i,])
        # get the nearest neighbor from previous chunks
        nn <- nn2(data[1:(i[1] - 1L),], data[i,], k)
        # bind the two distance matrices together
        out <- rowOrder(cbind(nn$nn.dists, rbind(Inf, d)))[,1:k]
        # which neighbors are from a previous chunk?
        iPrev <- which(out <= k)
        # indices of nearest neighbors from a previous chunk
        out[iPrev] <- nn$nn.idx[cbind(((iPrev - 1L) %% n) + 1L, out[iPrev])]
        # indices of nearest neighbors from current chunk
        out[-iPrev] <- out[-iPrev] - k + i[1] - 1L
        # convert to a data.table in order to use rbindlist
        as.data.table(out)
      }
    ), FALSE
  )
})
#>    user  system elapsed 
#>    0.46    0.00    0.47

Compare the result to a filtering approach. Note that the code below is not guaranteed to find the nearest previous neighbor, but it could be modified to iteratively increase k for rows that fail to do so.

system.time({
  nn2 <- with(
    # get the 100 nearest neighbors
    nn2(data, data[start:nr,], 100),
    # find the nearest from a previous row
    matrix(nn.idx[cbind(1:nrow(nn.idx), c(rowOrder(1 - (nn.idx < start:nr), TRUE)[,1:k]))], ncol = k)
  )
})
#>    user  system elapsed 
#>    1.31    0.00    1.31

identical(unlist(nn1, 0, 0), c(nn2))
#> [1] TRUE