Tags: apache-spark, pyspark, apache-spark-sql, nearest-neighbor, euclidean-distance

How to find the nearest neighbors of 1 Billion records with Spark?


Given 1 Billion records containing the following information:

    ID  x1  x2  x3  ... x100
    1   0.1  0.12  1.3  ... -2.00
    2   -1   1.2    2   ... 3
    ...

For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).
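
For reference, the metric here is the plain Euclidean distance between the feature vectors. A minimal NumPy sketch (the values are made up and truncated to 3 dimensions purely for illustration):

    import numpy as np

    # two example feature vectors (truncated to 3 dimensions for illustration)
    a = np.array([0.1, 0.12, 1.3])
    b = np.array([-1.0, 1.2, 2.0])

    dist = np.linalg.norm(a - b)  # sqrt(sum((a_i - b_i)**2))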

What's the best way to compute this?


Solution

  • Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of the k-Nearest Neighbors algorithm, such as the one provided by scikit-learn, then broadcast the resulting arrays of indices and distances and go from there.

    Steps in this case would be:

    1- vectorize the features as Bryce suggested, and let your vectorizing method return a list (or numpy array) of floats with as many elements as there are features
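
    As a rough sketch of what that vectorization might look like (this assumes the data lives in a Spark DataFrame named `df` with an ID column plus the feature columns, and that it is small enough to collect to the driver; the names here are just for illustration):

    import numpy as np

    feature_cols = [c for c in df.columns if c != 'ID']

    # pull the records down to the driver and stack the features into one row per ID
    rows = df.select('ID', *feature_cols).collect()
    ids = [r['ID'] for r in rows]
    vectorized_data = np.array([[float(r[c]) for c in feature_cols] for r in rows])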

    2- fit your scikit-learn NearestNeighbors model to your data:

    from sklearn.neighbors import NearestNeighbors

    nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)
    

    3- run the trained algorithm on your vectorized data (training and query data are the same in your case)

    # query the same data that was used for fitting; note that with identical
    # training and query sets, each point also shows up as its own nearest neighbor
    distances, indices = nbrs.kneighbors(vectorized_data)
    

    Steps 2 and 3 will run on your Spark driver node and are not parallelizable in this case. You will need to have enough memory on that node. In my case, with 1.5 million records and 4 features, it took a second or two.
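
    Once `distances` and `indices` are on the driver, one way to make them available to your executors is to broadcast them, as mentioned above. A rough sketch (assuming a SparkContext named `sc` and the `ids` list from the vectorization step; both names are illustrative):

    # the result arrays are small compared to the raw data, so broadcast them to all executors
    bc_ids = sc.broadcast(ids)
    bc_indices = sc.broadcast(indices)
    bc_distances = sc.broadcast(distances)

    # example: look up the neighbor IDs of the record stored at row position `row_idx`
    def neighbors_for(row_idx):
        return [bc_ids.value[j] for j in bc_indices.value[row_idx]]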

    Until we get a good implementation of nearest-neighbor search for Spark, I guess we will have to stick to these workarounds. If you'd rather try something new, then go for http://spark-packages.org/package/saurfang/spark-knn