amazon-dynamodb dynamodb-queries amazon-dynamodb-index

How to query multiple DynamoDB objects at the time

I have a table Tours with a Partition Key (PK) id, which is unique and a Global Secondary Index (GSI) geohash, which has repeated values.

I need to get all the geohashes in the array geohashNeighbors, and I am using the following query:

const geohashNeighbors = ['geo1', 'geo2', 'geo3']

const queryPromises = geohashNeighbors.map(geohash => 
    dynamodb.send(new QueryCommand({
        TableName: 'Tours',
        IndexName: 'geohash_index',
        KeyConditionExpression: 'geohash = :geohash',
        ExpressionAttributeValues: {
            ':geohash': geohash
        }
    }))
)

const results = await Promise.all(queryPromises)

As you can see, I am sending one request per item in the array, but I would like to send only one request and get all the items at once.

This is what I tried:

I could not find something like KeyConditionExpression: 'geohash IN geohashNeighbors', seems like DynamoDB does not provide search in expression like Postgres.
I tried GetBatchItem but it does not work for QueryCommand, only for GetCommand, which only works for the PK, not for the GSI.
I tried TransactGetItems but requires the PK for each item, and I do not have the id, only the geohash.
I tried begins_with(), but it does not work for the GSI, only the PK.
I do not want to run a Scan and filter the hashes, that will defy the purpose of adding an index and probably be slower.

I come from the relational DB world and have only a few weeks of experience with DynamoDB. I searched for hours and tried a few ideas, but either DynamoDB is lacking basic operations or I am missing something (probably the second).

Solution

Fundamentally, this is not how DynamoDB works. DynamoDB's "tables" and "indexes" have almost nothing in common with SQL's "tables" and "indexes". It can take a while to "unlearn" the relational mindset.

All data is stored in a partition, which can only be found by its Partition Key. There is no concept of "nearby" or "similar" partitions; you either know the Partition Key or not. (Imagine sorting your address book by the SHA-1 hash of people's names; there's no way to find "everyone named John", because the hashes will have nothing in common.)

This is true in both Tables and Secondary Indexes - the "index" is essentially a second copy of the table, with exactly the same structure. (Secondary Indexes are allowed to have duplicate items, but that doesn't change how you access them.)

If your Table or Index has only a Partition Key, there are exactly two ways of finding an item:

Query it by knowing its Partition Key
Scan all items, in no particular order, testing each one against a filter

If your Table or Index has a Partition Key plus a Sort Key, there are exactly three ways of finding an item:

Query it by knowing its Partition Key and Sort Key
Scan all items, in no particular order, testing each one against a filter
Query a single Partition, by knowing its Partition Key; then retrieve some or all items in that partition, in the order of their Sort Keys

You can run any of those in a loop, either locally, or using batch operations in the DynamoDB API; but every read or write ultimately needs to identify items using one of those methods.

AWS has quite a good section of the user manual about "Data modelling for DynamoDB". The key idea is that you design your Tables and Indexes around how you're going to use the data, not its abstract structure.

So, if your use case involves finding "neighbours" according to some algorithm, you need those neighbours to be together in a single partition. That probably involves calculating a Partition Key which groups them in some way, and a Sort Key which identifies sub-sets within that group. That can be either the Partition Key + Sort Key of the main Table, or of a Secondary Index.

I don't know much about geohashes, but a quick search suggests they're designed so that geographically close items are also lexically close, so you might be able to use:

Partition Key: First n characters of geohash
Sort Key: Full geohash

Once you have that, you can use a Query to:

Target one partition
Target multiple items within that partition; ideally, a continuous range based on Sort Key

You might still need to Query two partitions because the values you want are "on the border"; but you won't need a Query for every item.

For example, the data (in the table or GSI) could look like this:

GeoHashPrefix (PK)	Geohash (SK)	Something	Else
AA	AA123	blah	blah
AA	AA128	rhubarb	custard
AA	AA135	foo	bar
AA	AA200	left	right
AB	AB123	me	you

Then to find all geohashes starting with "AA" would just be a single Query with a KeyConditionExpression of:

GeoHashPrefix = "AA"

To find only those beginning with "AA1", you'd specify an extra condition on the Sort Key within that partition:

GeoHashPrefix = "AA"
GeoHash begins_with("AA1")

Or to find all geohashes between "AA125" and "AA135" (inclusive):

GeoHashPrefix = "AA"
GeoHash BETWEEN "AA125" AND "AA135"

A final note is that DynamoDB is not intended to be a universal data store. If the data you have is not suited for partitioning, or the access you need cannot be optimised on that basis, you should pick a different technology.

For instance, ElasticSearch / OpenSearch has optimised index for searching geospatial data based on various operators. There are also extensions for relational databases that add "GIS" functionality, such as PostGIS for PostgreSQL which is supported by AWS Aurora.