Tags: amazon-web-services, time-series, amazon-dynamodb, key-value, document-store

Selecting composite key for time-series data in DynamoDB


I have time-series data that I want to store in DynamoDB: events for developer tasks (the payload is in the Data column). I'm struggling to decide which partition key and sort key would best serve my access patterns. Which keys or GSIs should I choose for the access patterns described below?

Data:

TaskId   Date/Time  TeamId         Data
1            3/21/2018   teamA          Data
1            3/22/2018   teamA          Data
1            3/23/2019   teamA          Data
5            7/13/2019   teamA          Data
5            7/15/2019   teamA          Data
3            7/17/2019   teamA          Data
4            7/22/2019   teamC          Data
3            7/24/2019   teamA          Data
4            7/24/2019   teamC          Data
2            7/24/2019   teamB          Data
5            7/24/2019   teamA          Data
6            8/16/2019   teamA          Data
6            8/19/2019   teamA          Data
6            8/28/2019   teamA          Data

Storing:

Time-series.

Accessing:

  1. I would like to get all results within a timeframe for a specific TeamId. For example, querying for teamA between 7/16/2019 and 8/20/2019 would return:
3            7/17/2019   teamA          Data
3            7/24/2019   teamA          Data
5            7/24/2019   teamA          Data
6            8/16/2019   teamA          Data
6            8/19/2019   teamA          Data
  2. I would like to get the latest result within a timeframe for each TaskId of a specific TeamId. For example, querying for teamA between 3/1/2019 and 8/1/2019 would return:
1            3/23/2019   teamA          Data
3            7/24/2019   teamA          Data
5            7/24/2019   teamA          Data

Solution

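  • Your first query is easily solved by using TeamId as the hash (partition) key and Date/Time as the sort key, provided the date is stored in a format that sorts chronologically, such as ISO 8601. Because a table's primary key must be unique, the sort key also needs enough precision (or a TaskId suffix) to distinguish events that a team records at the same time. You could also create a GSI with those keys if there's a different key that you would prefer to use when writing to your table.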
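
    As a rough sketch of that key design, the Python below builds the parameters for a DynamoDB Query against such a table. The table name TaskEvents, the sort-key attribute name EventKey, and the "ISO date plus #TaskId" sort-key layout are assumptions made for illustration, not something from the question; the TaskId suffix is only there so that two events for the same team on the same day do not collide.

```python
from datetime import date


def sort_key(day: date, task_id: int) -> str:
    # Hypothetical sort-key layout: ISO 8601 date, then "#", then the task id.
    # ISO dates compare chronologically as plain strings.
    return f"{day.isoformat()}#{task_id}"


def team_range_query(team_id: str, start: date, end: date) -> dict:
    # Parameters for the low-level DynamoDB Query operation.
    # "TaskEvents" and "EventKey" are assumed names.
    return {
        "TableName": "TaskEvents",
        "KeyConditionExpression": "TeamId = :team AND EventKey BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":team": {"S": team_id},
            # A bare date sorts before every "date#task" key on that day.
            ":lo": {"S": start.isoformat()},
            # "~" sorts after ASCII digits and letters, so this bound covers
            # every alphanumeric task id on the end date.
            ":hi": {"S": end.isoformat() + "#~"},
        },
    }


params = team_range_query("teamA", date(2019, 7, 16), date(2019, 8, 20))
```

    With boto3 installed, the dictionary could be passed as boto3.client("dynamodb").query(**params). A single Query call returns at most 1 MB of data, so a real caller would keep requesting pages while the response contains LastEvaluatedKey.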

    Your second query cannot be run entirely in DynamoDB, because DynamoDB queries have no GROUP BY or aggregate functions. What you want is easily expressed in SQL as:

    SELECT taskId, MAX(dateTime), teamId, data FROM tasks WHERE teamId = 'teamA' AND dateTime > earlierDate AND dateTime < laterDate GROUP BY taskId
    

    In DynamoDB, it is possible to select the latest event in a date range for a single, specific task. It is also possible to select the latest event of all time for all tasks belonging to a specific team using a materialized aggregation of your table.
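
    The single-task case mentioned above could look roughly like the following sketch, which asks for items in descending sort-key order and stops after the first one. It assumes a GSI keyed on TaskId (partition key) and an ISO 8601 EventDate (sort key); the table name, index name, and attribute names are hypothetical.

```python
from datetime import date


def latest_event_params(task_id: int, start: date, end: date) -> dict:
    # Parameters for the low-level DynamoDB Query operation.
    # "TaskEvents", "TaskId-EventDate-index" and "EventDate" are assumed names.
    return {
        "TableName": "TaskEvents",
        "IndexName": "TaskId-EventDate-index",
        "KeyConditionExpression": "TaskId = :task AND EventDate BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":task": {"N": str(task_id)},
            ":lo": {"S": start.isoformat()},
            ":hi": {"S": end.isoformat()},
        },
        "ScanIndexForward": False,  # newest sort key first
        "Limit": 1,                 # stop after the newest matching item
    }


params = latest_event_params(5, date(2019, 3, 1), date(2019, 8, 1))
```

    This only answers the question for one TaskId per request, so covering a whole team this way would need the list of task ids up front and one Query per task.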

    You can use DynamoDB for this query, but to get the results you want you will need to query all task events for the team in the given time range and then, in your application, pick the latest event for each taskId. This works, but you read (and pay for) every event in the range rather than one item per task, so it gets slower and more expensive as the number of events grows.
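
    The application-side step could be as simple as the sketch below. It assumes the range query has already returned the events as plain dictionaries with TaskId and an ISO 8601 Date string (both attribute names are illustrative); the sample rows are the teamA events from the question that fall between 3/1/2019 and 8/1/2019.

```python
def latest_per_task(events):
    # Keep, for each TaskId, the event with the greatest date.
    # ISO 8601 date strings compare chronologically.
    latest = {}
    for event in events:
        task_id = event["TaskId"]
        if task_id not in latest or event["Date"] > latest[task_id]["Date"]:
            latest[task_id] = event
    return sorted(latest.values(), key=lambda e: e["TaskId"])


# teamA events between 2019-03-01 and 2019-08-01, taken from the question.
events = [
    {"TaskId": 1, "Date": "2019-03-23", "TeamId": "teamA"},
    {"TaskId": 5, "Date": "2019-07-13", "TeamId": "teamA"},
    {"TaskId": 5, "Date": "2019-07-15", "TeamId": "teamA"},
    {"TaskId": 3, "Date": "2019-07-17", "TeamId": "teamA"},
    {"TaskId": 3, "Date": "2019-07-24", "TeamId": "teamA"},
    {"TaskId": 5, "Date": "2019-07-24", "TeamId": "teamA"},
]

result = latest_per_task(events)
```

    The reduction itself is cheap; the cost is in reading every event in the range from DynamoDB before it can run.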

    You have two other options that I see. The first is to replicate your data to another database that supports analytics and/or search queries. You can do this in near-real time using DynamoDB Streams, "continuously" (according to the AWS docs) using AWS Database Migration Service, or at periodic intervals using AWS Data Pipeline. You can send your data to an analytics database like Amazon Redshift, a search database like Elasticsearch, or the sort-of-a-database solution that is Athena on S3.

    The other option is to use a different primary database. You may be interested in Amazon QLDB, which is a fully managed, serverless database that supports PartiQL (almost a superset of SQL) and Ion (a superset of JSON) documents. You might also consider Amazon Aurora or Amazon DocumentDB.