c++mathcomputational-geometrynearest-neighborlocality-sensitive-hash

How to solve nearest neighbor through the R-nearest neighbor?


Citing the E2LSH manual (it's not important that's about this specific library, this quote should be true for NN problem in general):

E 2LSH can be also used to solve the nearest neighbor problem, where, given the query q, the data structure is required the report the point in P that is closest to q. This can be done by creating several R-near neighbor data structures, for R = R1, R2, . . . Rt , where Rt should be greater than the maximum distance from any query point to its nearest neighbor. The nearest neighbor can be then recovered by querying the data structures in the increasing order of the radiae, stopping whenever the first point is found

Could someone rephrase this please? I don't this procedure to find the nearest neighbor using the R-near neighbor approach.


Solution

  • I will provide an example, which should clear things up. Assume that our dataset consists of only one point, p and a query point arrives, q. Let's assume* that the distance of p and q is 3,9.

    Now, by using E2LSH$, I can create a data structure that solves the R-nearest neighbor problem, i.e. it will answer yes (and fetch me the point) that lies inside a radius R. If no such point exists, it will answer no.

    Let's say that I choose to build 5 data structures of that kind, starting from R = 1 to 5. In our mind, this is what we have done so far:

    enter image description here

    So now remember, that d(p, q) = 3,9, thus we expect to ask the data structure that is built with R = 4 and find for us the query point q.

    Now let's pretend that we do not know d(p, q), so we start searching from the smallest Radius we have picked, that's 1. So, we ask, is there anything in Radius (equal to 1) from our dataset? No!

    From R = 2? No! From R = 3? No! From R = 4? Yes, and that's q! So now we are done. 4 is the Rt you mention in your question.


    * That's a strong assumption and E2LSH suffers for having to put the user to input that parameter R, since usually we do not know what value should R have, too big and we will waste space and time, too small and we won't find our query!

    $ I heard there is something more modern than E2LSH now, in Ilya Razenshteyn's homepage.