c++data-structuresknnspace-partitioning

C++ Data Structure for k Nearest Neighbour Search in D dimension using only point cloud as query points


I have a point cloud of N points in D-dimensional space with periodic boundary conditions, where N can range from 500 to 10^8 and D can range from 1 to 20. The distribution of points varies wildly, from completely uniform to very clumped together. For each point in the point cloud I need to find the k nearest neighbours to that point. I also need to find how many points exist within a distance of each point, specifically the maxnorm distance. I don't need to know which points are within the radius, just how many, but it would be a nice addition.

I've tried kd-trees, but they don't handle the wrapping boundaries, and for the larger trees, duplication is not feasible. Additionally, it gets slow at higher dimensions.

I've just come across Vantage Point Trees, and tried out some code, but it is slower than the kd-tree. Although the code I found uses a recursive search method, with no batching. One the positive side, it can natively handle the wrapping conditions, and as such doesn't require duplication.

I'm about to see if I can squeeze some more performance out of the VP tree by converting to an iterative approach and seeing if I could batch search, but I had a thought. All these data structures work for finding nearest neighbours to arbitrary query points, while my query points are restricted to being points in the point cloud. I figure this restriction might allow for some more performant structure (maybe a nav-mesh of sorts?). I tried searching for structures that would handle this, but my google-fu is failing me. So just wondering if anyone knows of a data structure that can handle the following:

Thanks


Solution

  • I doubt that there is a complete and definite answer to your very complex problem, so I just share my thoughts. Your problem specification combines a number of things that don't work well together (high dimension, non-Euclidean metric, completely different types of queries). If an algorithm has to assume the generic case, it is necessarily slow.

    Let's first sort out the special cases where good data structures are known.

    If all this does not apply (if you have a practical application in mind, please share with us), you case is very generic.

    In addition to the algorithms you mentioned, you should also try out Geometric Near-Neighbor Access trees (GNAT). http://infolab.stanford.edu/~sergey/near.html They apply to generic metrics (including yours) and also handle non-uniform distributions.

    Also, I think your expectations are very high. You could compare to a good kd-tree implementation (for instance, https://github.com/mariusmuja/flann) that solves the problem with a Euclidean metric only. If that takes long, you should not expect more general metrics to solve faster.

    Admittedly, more generic method cannot use your constraint that queries are points in the cloud. I would be very interested in if there is any such solution.