Tags: elasticsearch, geolocation, gis, system-design, s2

Is This A Working, Performant Method for the Actual Matching of Elasticsearch Docs in an Index with a Google S2-based Location System?


Background

I’m building an app that requires users to search for other users based on location/geographic proximity, among several other search preferences. Some of these preferences filter search results, while others are used to rank the results that pass those filters. Location/geographic proximity is one of the preferences used to filter results. Search functionality is provided by Elasticsearch (technically, Elastic Cloud hosted on Google Cloud Platform), and the API that accesses it is written as Google Cloud Functions, which expose an HTTP API to the client and also handle ingestion of data into ES and the events that update the index. Note that until we start getting close to 10⁴–10⁵ user documents in the ES index, I have no plans to implement any kind of geosharding or multiple indices; that kind of optimization will be important later, and I will certainly implement it if we get that kind of traction, but as of now it’s premature IMO. So this question considers all queries as taking place within one global geoshard and a single index, and only concerns the practical aspects of filtering/matching documents based on Google s2-based cell coverings generated from user device location and a user-provided search radius preference.

As I understand it, using s2 with ES would work like this: a user provides or changes their device location and/or search radius. That location (latitude and longitude) is mapped via s2 to a single leaf cell (level 30, the finest level of subdivision in s2), which would be stored in the user DB for persistence (as well as data mining, etc.) and also passed on to be ingested into ES (at this point, if we were using geosharding, some worker process would have to move documents around if the user’s new location required it, but again, we’re not planning on doing this until later).

When a user requests search results (which in this app may also be, and often is, at the same time they change their location and/or search radius, or provide it initially during onboarding to pre-calculate search results), the user-provided request parameters (device latitude/longitude and search radius) are used by s2 to generate a covering of cells for the circle those parameters define. This cell covering consists of s2 cells of varying intermediate levels (and depending on which implementation of the s2 library one is using and how much of the API is exposed, this can be further parametrized/constrained, for example by min/max cell level and min/max number of cells in the covering). So, for example, a cell covering generated from some location (lat, lon) with radius n kilometers would produce a list of s2 cell IDs of (potentially) varying size, something like [ 'abcde', 'abcdf', 'abce', 'abcf', 'abcga' ], where, in this case, let’s say three of those cells are level 7 (i.e. abcde, abcdf, and abcga) and two are level 6 (i.e. abce and abcf, each ~4x the area of the level 7 cells).
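For illustration, using a Node.js port of S2 (here I use the @radarlabs/s2 port by way of example; the exact constructor and method names, and the radius unit, are my assumptions and vary between ports), those two steps would look roughly like this:

const s2 = require('@radarlabs/s2');

// Step 1: map the user's device location to a single S2 leaf cell, which is
// stored in the user DB and ingested into ES as their locationId.
const deviceLocation = new s2.LatLng(40.7265, -73.9815);
const leafCell = new s2.CellId(deviceLocation);
const locationId = leafCell.token(); // or leafCell.id() for the raw 64-bit ID

// Step 2: at search time, cover the circle defined by the user's location and
// search radius with cells of intermediate levels (radius unit assumed to be meters).
const radiusMeters = 25 * 1609; // e.g. a 25-mile search radius
const covering = s2.regionCoverer.getRadiusCoveringTokens(deviceLocation, radiusMeters);
// -> a list of cell IDs/tokens of (potentially) varying levels, analogous to
//    [ 'abcde', 'abcdf', 'abce', 'abcf', 'abcga' ] above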

The Questions

So, the first question I have is: to filter, in Elasticsearch, for other users whose locations fall within the circle defined by the searching user’s parameters (like the example above), do I merely need to perform a simple_query_string search using each of the cell IDs in the list, something like this (using the example list of cell IDs from the cell covering above):

GET /_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "locationId" ],
      "query": "abcde* | abcdf* | abce* | abcf* | abcga*"
    }
  }
}

Where example user documents in ES look more or less like this (this is a nonsense schema—except for the locationId field, which is what I want to show here):

{
  "field0": "value",
  "field1": [ "value0", "value1" ],
  "field2": 35,
  "field3": 45,
  "locationId": "abcdefghijklm...n", // i.e. s2 leaf cell ID
  ...
}

Since each user document ingested into ES has this locationId field whose value is an s2 leaf cell ID string, users whose leaf cells fall within any of the covering cells from the searching user’s query would have location values/leaf cell IDs whose first several characters match the query above (e.g. users with a location ID/s2 leaf cell ID of abcdefghijklm...n or abcedfghijklm...n would match the above search request, but a user with the ID abdefghijklm...n would not; the ellipsis here just indicates that the leaf cell IDs are long strings, since the underlying s2 cell IDs are 64-bit integers). But is this performant? What is the best way to search the ES index given a list of cell IDs from a cell covering like the one above? Or is it really that simple?

Another question I have is: can I just do the s2 calculations (generating the leaf cell for a user’s location, and the cell coverings for search requests) on the client and send the results to ES, without any intervening location microservice (since I’m foregoing geosharding at least initially anyway, I don’t need a service to move documents between indices and shards either)? Is there any benefit to doing these things on another remote service? I can see a security argument for making the exact methods by which users are bucketed and located more opaque by only sending lat/lon/radius to a location microservice that runs the s2 library, but that also adds latency, complexity, and cost. If it’s a fast computation that can be done on the client (and it doesn’t even have to be especially fast on the client; the user expects some “magic” to be happening at this point as their search results are generated), I feel like I would rather just do these things on the client for the time being. To a degree, the search requests will be protected because we will only accept those sent with a valid user token; this still allows users with valid tokens to abuse the system, but it also allows us to identify and penalize them. Or are the s2 calculations so trivial that I should just put them in my remote service, between the token authorization and the actual API call to ES?

Thank you for your help.

PS I am aware of this other thread, but it is definitely not the same question as this (it doesn’t even concern Elasticsearch), so I ask you to please not flag it as such.


Distal Background

(Note: not necessary to understand or answer the questions above—StackOverflow just asked me for this so I have provided it)

What I have currently does not use s2 at all, so this is irrelevant. My initial approach, meant to simplify things and avoid using s2 or a similar library at all, involved creating a predefined list of urban areas based on 2020 US Census-designated Urban Areas over a certain population threshold (500,000), simplifying and slightly buffering the shapefiles in GIS, which I would use to locate users on the client with a library like turf.js (either by the user providing their location or by manually selecting one of the urban locations). For users whose device location did not match an urban location, I initially just matched them to the centroid of the nearest urban location, but this would have been a poor UX, because it would conflate people of different social and geographic affinities. So I tried using the rest of the 2020 US Census Urban Area data (the UAs under the above population threshold and above 10,000 population) to generate a population dot density vector point layer in GIS, from which I would use a k-means clustering algorithm with equal clustering to group these points, and then generate a Voronoi tessellation from the centroids of these clusters, clipped to the boundaries of our service area (the territorial US), which I could then use to locate any "extra-urban" users. Each point in the dot density layer was designed to symbolize 10,000 people, and the k-means clustering with equal clusters was supposed to create clusters of 8 points, or ~80,000 people each. Then, when "extra-urban" users requested search results, I would query not only their own cell but also all of its immediate neighbors (which for a Voronoi tessellation is ~6), making each "extra-urban" search cover roughly 7 × 80,000 ≈ 560,000 people, approximately equal to the smaller "urban" locations in terms of the number of users that could be expected in them. This extra step of making the cluster-derived Voronoi polygons smaller and then querying one plus its neighbors was designed to avoid boundary issues, for example the kind that would arise if a user located in one polygon couldn’t receive results that included a user in a neighboring polygon, even if that other user were in fact geographically closer than some other users in the first user’s own polygon, while at the same time keeping the overall number of results within an expected range.

However, I had significant difficulty finding a built-in GIS k-means clustering algorithm that returned equal-sized clusters; even after scouring the internet for help on the issue I was unsuccessful, and I was likewise unsuccessful in developing my own implementation of equal clustering using a k-means-type algorithm. Overall, I felt that this entire endeavor was becoming far too complicated, and on top of that, this scheme was only ever intended as a preliminary or naïve solution, with the ultimate goal of eventually transitioning to something like Google's s2 library anyway.

So I decided to scrap this whole complicated two-layer, GIS-derived, proprietary scheme based on census data, and to just lean into Google’s s2 library from the outset...and so here I am.


Solution

  • I ended up figuring out some solutions for this on my own and I wanted to share my findings here for anyone who may read this question in the future.

    Note that in the implementations below I am using a port of Google's S2 library (originally written in C++) for Node.js by Radar Labs, which can be found here.

    Implementation

    While the @radarlabs/s2 port does not come with bindings for S2's built-in S2RegionTermIndexer class (which is more or less made for use cases like mine), it does expose two useful convenience methods that I was able to use: s2.regionCoverer.getRadiusCoveringIds() and s2.regionCoverer.getRadiusCoveringTokens(). Both build on S2's RegionCoverer class (specifically its GetCovering() method) combined with an S2Cap; the first simply returns a list of IDs for the resulting S2CellUnion covering, while the second converts them to a list of tokens before returning it.

    In my specific use case, users who share their device location can search for other users near them by specifying a query radius that may range from 1 to 50 miles. Both to anonymize/obfuscate exact user location data and to optimize search, I reasoned that, given this restriction on possible query size and an upper limit on the number of S2 cells used to construct any given covering, I could restrict the S2Cell levels used in queries to a band between a minimum and a maximum, based on the smallest and largest S2Cells that could ever be needed to cover an S2Cap with that range of radii. I determined the minimum and maximum levels partly by examining the average sizes of S2 cells given here, to get a general idea of which levels would work best, and then I tested different levels experimentally, using a fork of this lightweight region covering web app that uses S2 (the Go port) + Web Assembly + JavaScript + Leaflet, which I customized with extra UI components that let me get coverings (as lists of tokens) of specific S2Cap regions by providing latitude, longitude, and radius as either exact or interactive (on the map) input. The results of these experiments showed me that, based on my app's requirements, the smallest (finest) S2Cell level that ever needs to be used is 16, while the largest (coarsest) is 7 (note that S2 level numbers increase as cells get smaller, so in S2's own terms this is min level 7, max level 16), constraining the covering to ten of the 31 available levels, or roughly a third.
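    As a concrete sketch, the query-side covering call with these constraints looks roughly like the following. The option names and the radius unit here are assumptions on my part (check the port's README for the exact signature):

    const s2 = require('@radarlabs/s2');

    const center = new s2.LatLng(40.7265, -73.9815); // e.g. Tompkins Square Park
    const radiusMeters = 50 * 1000;                  // a 50 km cap, as in the example further down

    // Covering of the S2Cap constrained to levels 7-16 with at most 20 cells.
    // The options object is an assumed shape, not verbatim from the port's docs.
    const queryTokens = s2.regionCoverer.getRadiusCoveringTokens(center, radiusMeters, {
      min: 7,        // coarsest cell level
      max: 16,       // finest cell level
      max_cells: 20, // upper limit on the number of cells in the covering
    });
    // -> e.g. [ '89c15c', '89c17', '89c19', ... ]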

    I then reasoned that, rather than storing (and thus indexing and searching) more or less "exact" user locations as coordinates or even S2 leaf cells, I could index and search these locations in Elasticsearch by instead storing every user location as the list of its parent cells, from level 16 (the experimentally determined finest level) up through level 7 (the experimentally determined coarsest level), as tokens concatenated into a single whitespace-separated string. The location field holding this string of hierarchical tokens then only needs a mapping that defines the type as "text" and the analyzer as "whitespace".

    PUT /users_ctokenlist
    {
      "mappings": {
        "properties": {
          "location": {
            "type": "text",
            "analyzer": "whitespace"
          }
        }
      }
    }
    

    Creating the index in Elasticsearch.
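    For reference, this is roughly how the whitespace-separated location string can be built for each user document at ingestion time. The @radarlabs/s2 calls here (LatLng, CellId, parent(level), token()) are written from memory, so treat the exact signatures as assumptions and check the port's README:

    const s2 = require('@radarlabs/s2');

    // Build the "ctokenlist" location string for one user document: every parent
    // cell of the user's location from level 7 (coarsest) through level 16 (finest),
    // as tokens, joined with whitespace.
    function buildLocationString(lat, lng, coarsestLevel = 7, finestLevel = 16) {
      const leaf = new s2.CellId(new s2.LatLng(lat, lng)); // leaf cell of the device location
      const tokens = [];
      for (let level = coarsestLevel; level <= finestLevel; level++) {
        tokens.push(leaf.parent(level).token()); // parent(level) assumed, mirroring the C++ API
      }
      return tokens.join(' ');
    }

    // Example document shape (uid is a placeholder); the location value is a string
    // of 10 tokens, levels 7 through 16, like those visible in the hits below.
    const userDoc = {
      uid: 'some-user-uuid',
      location: buildLocationString(40.7265, -73.9815),
    };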

    With this simple mapping, a location string like those described above is automatically split into terms by Elasticsearch upon ingestion and structured into an inverted index that can be searched extremely quickly. The reason this works is that each user query generates, via the s2.regionCoverer.getRadiusCoveringTokens() method, a list of S2Cell tokens constrained to the same range of levels, 7-16; if a user in the ES index is inside the S2Cap region defined by the searching user's provided location and radius, at least one of those query tokens will exactly match one of the tokens in the string stored with that user's location.

    I tested this query in a live instance of ES using a set of 100,000 random user locations (the maximum number of documents that would belong to any [geo]shard+index) that I generated in GIS, and the results of the ES query with this method of indexing and searching location data proved identical to what I observed in GIS. I compared the results in GIS by using the "Export to GeoJSON" function built into the region coverer web tool I mentioned, with the same parameters for min/max level and max number of cells in the covering, and then overlaying the GeoJSON file and a circle of the same radius centered on the same point. I will include some screenshots below in the results section. Internally I am referring to this method of indexing/querying as "Constrained Hierarchical Token List" or, more succinctly, "ctokenlist".
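    Assembling the query body is then just a matter of joining the covering tokens with whitespace. A minimal sketch (the token values below are a short subset of the real covering used in the example query that follows):

    // Build the ES request body from a covering's token list. In practice the array
    // comes from s2.regionCoverer.getRadiusCoveringTokens() as described above.
    const coveringTokens = ['89c15c', '89c17', '89c19', '89c1f', '89c24'];

    const searchBody = {
      size: 10000,
      query: {
        match: {
          // The "whitespace" analyzer splits this back into individual terms, and the
          // match query's default OR semantics make any shared term a hit.
          location: coveringTokens.join(' '),
        },
      },
    };
    // POST searchBody to /users_ctokenlist/_search (e.g. from the Cloud Function).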

    A query looks something like this in Elasticsearch:

    GET /users_ctokenlist/_search
    {
      "size": 10000,
      "query": {
        "match": {
          "location": "89c15c 89c17 89c19 89c1f 89c24 89c2c 89c31 89c33 89c344 89c37 89c3c 89dd3 89dd5 89e81 89e83 89e85 89e86aac 89e9b 89e9d 89e9f"
        }
      }
    }
    

    Note: these S2 tokens represent a covering of an S2Cap with 50km radius centered on Tompkins Square Park in NYC's East Village.

    Results

    The results in Elasticsearch for the above query look like this:

    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 13,
          "relation": "eq"
        },
        "max_score": 8.8049755,
        "hits": [
          {
            "_index": "users_ctokenlist",
            "_id": "66749929-451c-4715-86c3-80d71615ece4",
            "_score": 8.8049755,
            "_source": {
              "uid": "66749929-451c-4715-86c3-80d71615ece4",
              "location": "89c34 89c31 89c30c 89c30b 89c30b4 89c30b5 89c30b5c 89c30b5d 89c30b5dc 89c30b5dd",
            }
          },
          // ...
          // +12 more hits like this
        ]
      }
    }
    

    For purposes of comparison, below is a list of the uid of each of the above 13 hits, so you can compare them to those from the test I performed in GIS, which I will also share below:

    01. 66749929-451c-4715-86c3-80d71615ece4
    02. 2b5bbb3c-0644-4819-83f9-faf34ec0167a
    03. 643e6521-60b8-45f7-892e-32353c600c46
    04. aea6b0f4-12a9-42cc-b312-3e53fe6a8be7
    05. 91929a78-83a3-48ae-babd-43844c7b0a1e
    06. 8f6a0f7c-69c1-46fb-9efd-3644cbadf1bb
    07. 0c46c5d4-a1de-494f-aae1-327a70586caa
    08. 6c76444a-1631-4704-abf7-206088800e2a
    09. fc3c352d-7898-4300-92de-5d462f5d6e71
    10. 822571a3-6dca-4446-9ec3-ca1a9cdd9416
    11. 143839b5-1dcb-4369-b6dc-92a51921f7a5
    12. 833b4149-327d-4f41-a80a-ec23b5bb6595
    13. a055cdec-4219-4ece-b450-46c503d2f500
    

    Below: the process of testing cell coverings using the S2 (Go) + Web Assembly + JS tool I mentioned above, using the same parameters used to generate the "ctokenlist" strings representing each data point's location in ES, as well as how these coverings were exported to GeoJSON for later importing into GIS:

    Image: generating the S2cap region covering and exporting as GeoJSON

    Below: the random dataset in GIS (grey dots), together with the same 50km radius S2Cap centered on Tompkins Square Park in NYC (translucent purple ellipse with thick purple border—the geometry seen here is circular, but the specific projection I was using in GIS makes it look compressed in the latitudinal direction), along with the cell covering generated using the given parameters (min_level = 16, max_level = 7, max_cells = 20), and the list of "hits"—those data points that intersect the cell covering (pink dotted line and translucent rhombs) are shown as red dots. The center of the query is symbolized using a purple "+".

    Image: the dataset in GIS, for comparison, with the S2Cap covering GeoJSON overlain

    The red dots (hits) are actually selected here in the GIS program, and their values/uids can be seen on the far right side of the image. If you compare this to the ES results above, you'll see that they are exactly the same—a successful result.

    Another Implementation

    I will also share another method I tried, which I have been referring to internally as "Truncated Binary String" (or "tbstring" for short). It worked nearly as well, but the "ctokenlist" method proved to have a slight edge (I will explain why below); nevertheless, I will share it here in case anyone is interested. This method takes advantage of the fact that every S2 (leaf) cell ID contains all of the information for all of its hierarchical parents, encoded in its binary representation. For example, given a cell ID with the binary representation 1000100111000010010110010111100001010110111111110100001100011001, the first 3 digits represent one of six level 0 "faces" (numbered 0-5, or rather 000-101), and following this, every pair of digits represents one of four quadrants at a given level on the Hilbert curve, down to level 30; finally, a trailing "1" is appended. Because of this, the binary representation of the level-16 parent of this leaf cell (using my experimentally determined range of levels 7-16, once again) is identical to the binary representation of the leaf cell up to 3 + (16 * 2) = 35 digits. And indeed, if you calculate the level-16 parent cell ID, take its binary representation, and compare it to the leaf cell, you will see this is the case: the parent ID is identical to the leaf cell ID up to the 35th digit, after which the leaf cell ID contains more information, while the level-16 parent cell ID just has a trailing "1" digit (as do all S2 cell IDs) followed by zeroes that fill out the 64 bits. The same is true for the level-7 parent of this leaf cell, and for any other parent of this leaf cell, so we have:

    Leaf Cell ID:     {100|01|00|11|10|00|01|00|10|11|00|10|11|11|00|00|10|10|11|01|11|11|11|10|10|00|01|10|00|11|00}1
    Lvl 16 Parent ID: {100|01|00|11|10|00|01|00|10|11|00|10|11|11|00|00|10}10|00|00|00|00|00|00|00|00|00|00|00|00|00|0
    Lvl 7 Parent ID:  {100|01|00|11|10|00|01|00}10|00|00|00|00|00|00|00|00|00|00|00|00|00|00|00|00|00|00|00|00|00|00|0
    

    Knowing this, we can simply truncate every cell ID after the 35th digit and store this truncated string in Elasticsearch instead of the entire string: because the finest cell level in this implementation is 16, any information after the 35th digit will never be used, so keeping it just wastes space in ES once you multiply it by 100,000 documents. The truncation additionally helps to partially obfuscate other users' exact locations in the event this string ever needs to be returned to the client, where it could be intercepted by anyone listening to network requests. To query these "truncated binary strings" in Elasticsearch, we get the list of cell IDs that cover the query region using the convenience method s2.regionCoverer.getRadiusCoveringIds() provided by Radar Labs' Node.js port of S2, convert each ID in the list to binary, and then replace the trailing "1" marker and the zeroes that follow it with an asterisk (the prefix operator in Elasticsearch) using a regular expression such as /([01]+)10*/. From there we execute a "simple_query_string"-type ES query that combines these prefix wildcards (the * character/operator) with Boolean OR logic (the | character/operator). The resulting query matches any user whose location, stored as a "truncated binary string", is either within or exactly equal to one of the cells in the covering that the region coverer produced for the S2Cap defined by the searching user's location and radius preference; in other words, a "hit", in ES parlance.
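    A rough sketch of both sides of this scheme (again in Node.js) follows. As before, the covering call's exact signature is an assumption, as is the idea that the port hands back the 64-bit cell IDs as BigInts (or as something BigInt() can consume):

    const s2 = require('@radarlabs/s2');

    // Pad to 64 characters: IDs on faces 0-3 have leading zero bits that toString(2) drops.
    function toBinary64(cellId) {
      return BigInt(cellId).toString(2).padStart(64, '0');
    }

    // --- Ingestion side: store the leaf cell ID truncated to its first 35 bits ---
    const leaf = new s2.CellId(new s2.LatLng(40.7265, -73.9815));
    const tbstring = toBinary64(leaf.id()).slice(0, 35); // bits beyond level 16 are never queried

    // --- Query side: turn each covering cell ID into a binary prefix term ---
    const center = new s2.LatLng(40.7265, -73.9815);
    const radiusMeters = 50 * 1000;
    const coveringIds = s2.regionCoverer.getRadiusCoveringIds(center, radiusMeters);

    const queryString = coveringIds
      .map((id) => toBinary64(id))
      // The greedy match captures the meaningful prefix bits and consumes the trailing
      // "1" marker plus the zero padding, which are replaced by the ES prefix operator.
      .map((bin) => bin.replace(/([01]+)10*/, '$1*'))
      .join(' | ');

    const searchBody = {
      size: 10000,
      query: {
        simple_query_string: {
          query: queryString, // e.g. "100010011100000101011* | 1000100111000001011* | ..."
          fields: ['location'],
        },
      },
    };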

    The whole query looks something like this in ES:

    GET /users_tbstring/_search
    {
      "size": 10000,
      "query": {
        "simple_query_string": {
          "query": "100010011100000101011* | 1000100111000001011* | 1000100111000001100* | 1000100111000001111* | 10001001110000100* | 10001001110000101* | 1000100111000011000* | 1000100111000011001* | 100010011100001101000* | 1000100111000011011* | 10001001110000111* | 1000100111011101001* | 1000100111011101010* | 1000100111101000000* | 1000100111101000001* | 1000100111101000010* | 10001001111010000110101010101* | 1000100111101001101* | 1000100111101001110* | 1000100111101001111*",
          "fields": [ "location" ]
        }
      }
    }
    

    The hits for this search are also spot on, but it runs slightly slower (by fractions of a millisecond). I believe the reason is that prefix/wildcard queries are generally more expensive in ES than simple term lookups in an inverted index.

    Anyway, if you made it this far, thank you for your time. I hope this post helps someone else who is wondering how to accomplish this without the S2RegionTermIndexer class available, and specifically in Elasticsearch.