Assume I have a numeric data with 10,000 rows and 8 columns. I want to obtain the first k neighbors for each row (skipping first 1,000 rows) using euclidean distance but the catch is for each row I am only interested in the previous rows. (e.g. for the 2001th row, I only search first 2000 rows).
Changing the reference for each row is too slow. The fastest function I could write was using RANN to get 2k (or 5k) closest neighbors then filtering out the future observations.
A slow example: (200 rows, 5 columns, 3 nearest neighbors)
data = matrix(rnorm(1000), nrow = 200, ncol = 5)
result <- list()
for (i in c(101:200)) {
distances <- apply(data[1:(i-1),], 1, function(x) {
dist(rbind(x, data[i, ]))
})
neighbors <- sort(distances, index.return = TRUE)$ix[1:3]
result[[i - 100]] <- neighbors
}
The fastest approach was using RANN::nn2(data = data[1:200,], query = data[101:200,], k = 2*k)
, then filtering the future ones (and hope to have at least k values).
The filtering part and unneccessary computation of nearest neighbors increase the time complexity significantly.
I would be glad to hear any suggestions.
A solution using RANN::nn2
. The idea is to break the data into chunks and process each separately before combining the results. It processes a 10k-by-8 matrix in a fraction of a second.
library(RANN) # for the nn2 function
library(Rfast) # for the rowOrder function
library(data.table) # for the rbindlist function
nr <- 1e4L
start <- 1001L
nc <- 8L
k <- 3L
data <- matrix(rnorm(nr*nc), nr, nc)
system.time({
n <- 450L # chunk size
n1 <- n - 1L
nn1 <- rbindlist(
lapply(
# split the data into chunks
as.data.frame(matrix(start:nr, n)),
function(i) {
# initialize a matrix with -Inf
d <- matrix(Inf, n1, n1)
# fill the lower diagonal of m with the negative of the distance matrix
d[sequence(n1:1, seq(1, n1^2, n))] <- dist(data[i,])
# get the nearest neighbor from previous chunks
nn <- nn2(data[1:(i[1] - 1L),], data[i,], k)
# bind the two distance matrices together
out <- rowOrder(cbind(nn$nn.dists, rbind(Inf, d)))[,1:k]
# which neighbors are from a previous chunk?
iPrev <- which(out <= k)
# indices of nearest neighbors from a previous chunk
out[iPrev] <- nn$nn.idx[cbind(((iPrev - 1L) %% n) + 1L, out[iPrev])]
# indices of nearest neighbors from current chunk
out[-iPrev] <- out[-iPrev] - k + i[1] - 1L
# convert to a data.table in order to use rbindlist
as.data.table(out)
}
), FALSE
)
})
#> user system elapsed
#> 0.46 0.00 0.47
Compare the result to a filtering approach. Note that the code below is not guaranteed to find the nearest previous neighbor, but it could be modified to iteratively increase k
for rows that fail to do so.
system.time({
nn2 <- with(
# get the 100 nearest neighbors
nn2(data, data[start:nr,], 100),
# find the nearest from a previous row
matrix(nn.idx[cbind(1:nrow(nn.idx), c(rowOrder(1 - (nn.idx < start:nr), TRUE)[,1:k]))], ncol = k)
)
})
#> user system elapsed
#> 1.31 0.00 1.31
identical(unlist(nn1, 0, 0), c(nn2))
#> [1] TRUE