I have time-series data that I want to store in DynamoDB: events from developer tasks (the event payload is in Data). I'm struggling to decide which partition key and sort key would serve my access patterns best. Which keys and/or GSIs should I choose for the access needs explained below?
Data:
TaskId   Date/Time   TeamId   Data
1        3/21/2018   teamA    Data
1        3/22/2018   teamA    Data
1        3/23/2019   teamA    Data
5        7/13/2019   teamA    Data
5        7/15/2019   teamA    Data
3        7/17/2019   teamA    Data
4        7/22/2019   teamC    Data
3        7/24/2019   teamA    Data
4        7/24/2019   teamC    Data
2        7/24/2019   teamB    Data
5        7/24/2019   teamA    Data
6        8/16/2019   teamA    Data
6        8/19/2019   teamA    Data
6        8/28/2019   teamA    Data
Storing:
Time-series.
Accessing:
For teamA between 7/16/2019 and 8/20/2019, I would get:

3   7/17/2019   teamA   Data
3   7/24/2019   teamA   Data
5   7/24/2019   teamA   Data
6   8/16/2019   teamA   Data
6   8/19/2019   teamA   Data
For teamA between 3/1/2019 and 8/1/2019, I would get (only the latest event per TaskId in the range):

1   3/23/2019   teamA   Data
3   7/24/2019   teamA   Data
5   7/24/2019   teamA   Data
Your first query is easily solvable by using TeamId as the hash (partition) key and Date/Time as the sort key. You could also create a GSI with those keys if there's a different key that you would prefer to use when writing to your table. Note that for a range condition on the sort key to work, Date/Time should be stored in a lexicographically sortable format such as ISO 8601 (2019-07-16), not M/D/YYYY.
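As a minimal sketch, assuming a table named tasks with that key schema, attribute names TeamId and DateTime (hypothetical; a literal Date/Time attribute would need ExpressionAttributeNames because of the slash), and ISO 8601 date strings, the first query is a single Query call in boto3:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("tasks")  # hypothetical table name

# All of teamA's events between 7/16/2019 and 8/20/2019, inclusive.
response = table.query(
    KeyConditionExpression=Key("TeamId").eq("teamA")
    & Key("DateTime").between("2019-07-16", "2019-08-20")
)
for item in response["Items"]:
    print(item["TaskId"], item["DateTime"], item["Data"])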
Your second query is not possible to run entirely in DynamoDB. What you want is easily expressed in SQL (loosely; a strict engine would require a self-join or window function to return data alongside MAX(dateTime)) as:

SELECT taskId, MAX(dateTime), teamId, data
FROM tasks
WHERE teamId = 'teamA' AND dateTime > earlierDate AND dateTime < laterDate
GROUP BY taskId
In DynamoDB, it is possible to select the latest event in a date range for a single, specific task. It is also possible to select the latest event of all time for all tasks belonging to a specific team using a materialized aggregation of your table.
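The single-task case is one Query: a sketch assuming a GSI keyed on TaskId (partition) and DateTime (sort), with a hypothetical index name, that reads newest-first and stops after one item:

from boto3.dynamodb.conditions import Key

# Latest event for one specific task within a date range,
# assuming a GSI with TaskId as partition key and DateTime as sort key.
response = table.query(
    IndexName="TaskId-DateTime-index",  # hypothetical GSI name
    KeyConditionExpression=Key("TaskId").eq("5")
    & Key("DateTime").between("2019-03-01", "2019-08-01"),
    ScanIndexForward=False,  # newest first
    Limit=1,                 # stop after the latest event
)
latest = response["Items"][0] if response["Items"] else None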
You can use DynamoDB for this query, but to get the results you want you will need to select all task events for the team in the given time range, and then, in your application, keep only the latest event for each TaskId. This works, but it reads (and pays for) every event in the range just to discard most of them, so on large ranges it will be slow and expensive in read capacity.
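That client-side reduction might look like the following sketch (same assumed schema as above); because a Query returns items in ascending sort-key order, the last event seen for each task is its latest:

from boto3.dynamodb.conditions import Key

# Fetch every teamA event in the range, paginating, then keep
# only the latest event per TaskId in application code.
latest_by_task = {}
kwargs = {
    "KeyConditionExpression": Key("TeamId").eq("teamA")
    & Key("DateTime").between("2019-03-01", "2019-08-01")
}
while True:
    response = table.query(**kwargs)
    for item in response["Items"]:
        # Ascending DateTime order: later items overwrite earlier ones.
        latest_by_task[item["TaskId"]] = item
    if "LastEvaluatedKey" not in response:
        break
    kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

results = list(latest_by_task.values())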
You have two other options that I see. The first is to replicate your data to another database that can support analytics and/or search queries. You can do this in near real time using DynamoDB Streams, "continuously" (according to the AWS docs) using AWS Database Migration Service, or at periodic intervals using AWS Data Pipeline. You can send your data to an analytics database like Amazon Redshift, a search database like Elasticsearch, or the sort-of-a-database solution that is Athena on S3.
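The Streams route is essentially a Lambda function subscribed to the table's stream; a minimal sketch, assuming the stream is configured with NEW_IMAGE and a hypothetical index_into_downstream_store helper that writes to whichever target you pick:

# Lambda handler attached to the table's DynamoDB Stream.
def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]  # DynamoDB attribute-value map
            index_into_downstream_store(new_image)      # hypothetical downstream writer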
The other option is to use a different primary database. You may be interested in Amazon QLDB, a fully managed, serverless database that supports PartiQL (almost a superset of SQL) and Ion (a superset of JSON) documents. You might also consider Amazon Aurora or Amazon DocumentDB.