algorithmazure-cosmosdbnearest-neighborkdtreelocality-sensitive-hash

Matching millions of people: k-d tree or locality-sensitive hashing?


I am looking for a performant algorithm to match a large number of people by location, gender and age according to this data structure:

For any person P the algorithm should return candidates C for which applies:

The algorithm should return the first 100 candidates C in order of distance (Lat/Long). The algorithm should be optimized for both search and updates because people may change their location often.

My current thinking is that k-d tree could be more suitable than locality-sensitive-hashing for these needs and that I should go into this direction.

What would be your advise for me? What should I look for? What risks do you see?

Thanks!

Update:


Solution

  • Here is some info by Microsoft how to use their spatial indexing ('spatial' is the keyword you want to search for).

    The query you are looking for is a k-nearest neighbor query (kNN Search) with k=100.

    If you want to serialize the index yourself, have a look at R+tree or R*trees, they are quite good for page based serialization. There are lots of open source example for these trees. Here is my own implementation in Java, unfortunately it does not support Serialization.

    About the other indexes: