amazon-dynamodb amazon-dynamodb-streams

How to create GSI based on timestamp?


I'm working with a DynamoDB table which stores records for a maximum of 24 hours. We're talking about tens or even hundreds of millions of records. Most of the records are removed minutes after being added, some are removed tens of minutes or even hours later, and some expire after 24 hours. For the expired ones I must apply some business logic before removing them from the table, so I need to know their content. To satisfy DynamoDB's partition key requirements, the table's primary (partition) key is a GUID.

For detecting records which have been in the table for more than 24 hours, I see two options:

  1. Use TTL and then read the expired items using DynamoDB Streams - but will the stream give me the original record? Not OK, because TTL-expired records can take days to be removed, which is not acceptable business-wise.

  2. Do something manually, relying on a field which represents the insert timestamp. The problem is that I can't afford to scan the entire table to find 1k - 2k expired records out of hundreds of millions; I need something more efficient. My idea is to create a GSI with a composite key where the range key is the insert timestamp and the partition key is a value derived from the timestamp, like:

record1 - insert ts = 10:15:20.123 -> partition key base 10:15:20
record2 - insert ts = 10:15:21.234 -> partition key base 10:15:20
record3 - insert ts = 10:15:23.345 -> partition key base 10:15:20
record4 - insert ts = 10:15:25.456 -> partition key base 10:15:20

This way, I can run a job every 10 seconds, take the rounded timestamp (10:15:20 in this specific case), recreate the partition key from it, and use that key to fetch all the matching records ordered by the range key. In my example I used only the time, but in the real implementation the date + time would be used.
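To make the idea concrete, here is a minimal sketch of the bucketing scheme described above. The 10-second bucket size, the key format, and the function name are my own assumptions, not anything DynamoDB prescribes:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 10  # assumption: one GSI partition per 10-second window

def bucket_partition_key(ts: datetime) -> str:
    """Round a timestamp down to its bucket boundary and format it as
    the GSI partition key, e.g. '2024-05-01T10:15:20'."""
    epoch = int(ts.timestamp())
    rounded = epoch - (epoch % BUCKET_SECONDS)
    return datetime.fromtimestamp(rounded, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
```

With this, the inserts at 10:15:21.234, 10:15:23.345 and 10:15:25.456 all land under the same partition key as 10:15:20.123, and the sweeper job can reconstruct that key from the current clock.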

What do you think?


Solution

  • Option 1

    The best and most cost-effective solution. You can also distinguish a TTL delete from a manual/regular delete, which is useful if you don't want to apply business logic to items which were deleted for a reason other than expiry. The payload will have this unique identifier:

    "Records": [
        {
            ...
            "userIdentity": {
                "type": "Service",
                "principalId": "dynamodb.amazonaws.com"
            }
            ...
        }
    ]
    

    You will receive the entire item in the stream, provided your stream uses the NEW_AND_OLD_IMAGES view type.

    Be mindful that TTL deletes aren't instantaneous; in worst-case scenarios it can take days for items to be deleted.
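    A stream consumer for this option might look like the following sketch. The `userIdentity` check mirrors the payload shown above; `process_expired` is a hypothetical placeholder for your business logic, not a real API:

```python
def is_ttl_delete(record: dict) -> bool:
    """True only for REMOVE events initiated by the TTL service itself,
    so manually/regularly deleted items are skipped."""
    return (
        record.get("eventName") == "REMOVE"
        and record.get("userIdentity", {}).get("principalId") == "dynamodb.amazonaws.com"
    )

def process_expired(item: dict) -> None:
    """Hypothetical placeholder for the real business logic."""
    print("processing expired item:", item)

def handler(event: dict, context: object) -> None:
    # With the NEW_AND_OLD_IMAGES view type, OldImage carries the full
    # expired item, so the business logic sees its complete content.
    for record in event["Records"]:
        if is_ttl_delete(record):
            process_expired(record["dynamodb"]["OldImage"])
```

    Wired up as a Lambda trigger on the stream, this filters out every delete except the ones TTL performed.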

  • Option 2

    This is much trickier, especially at hundreds of millions of items per day. Each index partition can handle 1000 WCU per second, so using a timestamp as the partition key creates a rolling hot partition whenever you need more than 1000 WCU. A more elegant solution is to use a random number within a known range. For example, suppose you expect a peak of 50k WCU: you would use a value in the range 0-49 as your GSI partition key, assigned randomly on each insert, with the timestamp as your sort key.
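    The write side of this sharding scheme could be sketched as below. The attribute names `gsi_pk`/`gsi_sk` and the shard count are assumptions for illustration:

```python
import random

SHARD_COUNT = 50  # assumption: sized for a ~50k WCU peak at ~1000 WCU per partition

def gsi_attributes(insert_ts: str) -> dict:
    """Attributes to store on each new item so the GSI spreads writes
    across SHARD_COUNT partitions while keeping items time-ordered."""
    return {
        "gsi_pk": random.randrange(SHARD_COUNT),  # random shard, 0-49
        "gsi_sk": insert_ts,                      # insert timestamp as sort key
    }
```

    Each insert merges these attributes into the item before the `PutItem` call, so no single GSI partition ever sees more than roughly 1/50th of the write traffic.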

    Now you run 50 Query requests against the GSI to obtain only the items in your desired time range. It's still much more efficient than a Scan, but a little more resource-intensive on your host.
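    The read side then fans out one Query per shard. This sketch only builds the per-shard request parameters (the index name `expiry-index` and attribute names are my assumptions); each dict would be passed to a DynamoDB Query call, e.g. `table.query(**params)` with boto3:

```python
def shard_query_params(shard: int, cutoff_ts: str) -> dict:
    """Key condition for one shard: everything inserted at or before
    the expiry cutoff timestamp."""
    return {
        "IndexName": "expiry-index",  # hypothetical GSI name
        "KeyConditionExpression": "gsi_pk = :pk AND gsi_sk <= :cutoff",
        "ExpressionAttributeValues": {":pk": shard, ":cutoff": cutoff_ts},
    }

def all_shard_params(cutoff_ts: str, shard_count: int = 50) -> list:
    """One Query per shard - 50 in total - covering the whole index."""
    return [shard_query_params(s, cutoff_ts) for s in range(shard_count)]
```

    Issuing the 50 Queries in parallel keeps the sweep fast; the results simply need to be concatenated (and optionally re-sorted by `gsi_sk`) on the host.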