algorithmgeolocationgeocodinggeogeohashing

Geohash: Using libgeohash to find neighbors


In my application I am storing Geohash of all users in a table and want to find neighbors of a user using those Geohashes.

As per info I gathered about Geohash on Wiki:

When used in a database, the structure of geohashed data has two advantages. First, data indexed by geohash will have all points for a given rectangular area in contiguous slices (the number of slices depends on the precision required and the presence of geohash "fault lines"). This is especially useful in database systems where queries on a single index are much easier or faster than multiple-index queries. Second, this index structure can be used for a quick-and-dirty proximity search - the closest points are often among the closest geohashes.

So for e.g. to find neighbors of "sj8101b085", I just had planned to search hashes by doing:

SELECT * FROM Users WHERE Geohash LIKE 'sj8101b085%'

And then by firing same query by reducing the hash length one by one i.e. "sj8101b08%", "sj8101b0%" and so on till I get desired number of neighbors. I was under impression that this is all I need to do.

But then I found this C library libgeohash referred at the bottom of the same article. The library has a function called GEOHASH_get_adjacent which gives us neighboring hashes of given hash. A geohash string represents a rectangle area on earth. And this function returns geohashes representing neighboring rectangles. It means I got to run this function in recursion (neighbors and then neighbors of neighbors and so on) till I get desired number of neighbors.

Now I am really confused how shall I write my search algorithm? Using first approach or using second?


Solution

  • A geohash is a bit-string, where the even bits represent longitude and the odd bits represent latitude. Each bit of the representation of the longitude, for example, selects half of the feasible area. The initial feasible area is [-180, 180], and if the first bit for longitude is 0, the next feasible area becomes [-180, 0], and if it is 1, it becomes [0, 180]. The first two bits, taken together, select the half of the earth above or below the equator as well as the half of the earth to the left or to the right of the prime meridian. You can think of this as a "rectangular area", as it is called in your Wikipedia link. The first four bits, taken together, select half of the northern or southern hemisphere, as well as half of the eastern or western hemisphere. Et cetera.

    The geohash, ezs42, presented in your link is base 32, so each character represents 5 bits of the geohash. The implication of the example hash being 5 characters, is that a geohash is 25 bits, 13 of which are for the longitude, and 12 of which are for the latitude. This means that the longitude is divided in half 13 times, while the latitude is divided in half 12 times, and the geohash selects both one out of the twelve latitudinal extents and one out of the thirteen longitudinal extents. Each character that you remove from the end of the hash eliminates 5 bits from the geohash; this amounts to either 3 divisions for longitude and 2 divisions for latitude, or vice versa. Said otherwise, it increases your longitudinal extent by a factor of 8, and your latitudinal extent by a factor of 4, or vice versa. Querying for that geohash gives all points that fall within the corresponding "rectangular" area.

    I am not familiar with libgeohash; however, from your description, it sounds as though you give it a geohash and it gives you back a collection of geohashes which represent the neighboring "rectangular" areas at the granularity implied by the input. Presumably, if you use this to find nearest neighbors, you'll need to keep track of those geohashes that you have visited and which you have not, and you'll have to repeatedly ask for neighbors, until you find the points for which you are looking. Visually, this would look like a fanning out from your initial geohash of "rectangles" that are the size of your original "rectangle". You'll need to be careful to not simply consider the first point you find in one of the neighboring areas, as another neighboring area might have a point that is closer to your query point; that is, you'll need to consider points from all neighbors before searching for the k that are nearest to your query point (this means, for example, that you will need to ask for and query points from the neighbors of all 8 neighbors of the original "rectangle" before looking for your k nearest on the second iteration of the neighbor approach).

    Considering the libgeohash neighbor approach, if your original "rectangle" is small (say, inches by inches), and your points are sparse enough, it could take an enormous amount of time until you cover enough of the earth by this fanning out technique until you find your points. With the prefix approach, on the other hand, it might be that your points are dense enough that increasing your extents by 4 and 8 times gives a large number of points to consider. In either case, if you're looking for k nearest neighbors, you'll still need to test all resulting points for distance to select k of them that are nearest. In the end, your choice will depend upon your data; however, I'd suggest starting with the prefix approach, since it is significantly simpler than the neighboring "rectangular" area approach.