We have a service where a DynamoDB table ~50GB is our feature repository, which we use for real-time, online applications.
We want to create a data lake from this table for historical data, model training and analytics insights. We want to guarantee a 30-minute "freshness" of the data lake data with respect to the original table.
However, I'm confused about what a good architecture for this would be. My understanding of data lakes is that you use a storage service (e.g., S3) to store the raw data with no processing, then run ETL jobs (e.g., using Glue) that transform, process and filter the data before any application uses it.
But here is my doubt: does this mean that we have to dump the DynamoDB table into S3 every 30 minutes? This can easily be done, but it sounds weird (48 dumps a day of ~50 GB each would result in ~876 TB/year).
Am I missing something in the data lake pipeline?
You've hit a common problem, and it's one AWS is actively working on.
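If you want continuous syncing from DynamoDB to S3, it's possible with existing technology, including DynamoDB Streams. A stream captures only item-level changes, so you write those changes to S3 instead of re-dumping the whole table every 30 minutes. I suggest checking out this project in awslabs. Frankly, it's quite a bit of effort.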
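To give a rough idea of what the streams route involves, here is a minimal sketch of a Lambda handler that takes a batch of DynamoDB Streams records and writes them to S3 as one newline-delimited JSON object. It is not the awslabs project, just an illustration. Assumptions on my part: the bucket name `my-feature-lake`, the `changes/dt=.../` key layout, the optional `s3` and `bucket` parameters (added so the function can be tried without AWS), and a stream view type that includes the new image (`NEW_IMAGE` or `NEW_AND_OLD_IMAGES`). The `deserialize` helper only covers the common attribute types.

```python
import json
from datetime import datetime, timezone


def deserialize(attr):
    """Convert one DynamoDB attribute value, e.g. {"S": "x"}, to plain Python.

    Only the common type tags are handled; sets and binary are left out.
    """
    (tag, value), = attr.items()
    if tag == "S" or tag == "BOOL":
        return value
    if tag == "N":
        # Numbers arrive as strings in stream records.
        return float(value) if any(c in value for c in ".eE") else int(value)
    if tag == "NULL":
        return None
    if tag == "L":
        return [deserialize(v) for v in value]
    if tag == "M":
        return {k: deserialize(v) for k, v in value.items()}
    raise ValueError(f"unsupported attribute type: {tag}")


def records_to_lines(records):
    """Turn stream records into JSON lines, one per change event."""
    lines = []
    for record in records:
        ddb = record["dynamodb"]
        image = ddb.get("NewImage")  # absent for REMOVE events
        lines.append(json.dumps({
            "event": record["eventName"],  # INSERT, MODIFY or REMOVE
            "keys": {k: deserialize(v) for k, v in ddb["Keys"].items()},
            "item": None if image is None
                    else {k: deserialize(v) for k, v in image.items()},
            "ts": ddb.get("ApproximateCreationDateTime"),
        }, sort_keys=True))
    return lines


def handler(event, context, s3=None, bucket="my-feature-lake"):
    """Lambda entry point: write one S3 object per batch of changes."""
    if s3 is None:
        import boto3  # imported lazily so the sketch runs without AWS
        s3 = boto3.client("s3")
    records = event["Records"]
    first = records[0]
    day = datetime.fromtimestamp(
        first["dynamodb"]["ApproximateCreationDateTime"], tz=timezone.utc
    )
    # Hypothetical layout: date partition plus the first event ID in the batch.
    key = f"changes/dt={day:%Y-%m-%d}/{first['eventID']}.jsonl"
    body = "\n".join(records_to_lines(records)).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return key
```

This gives you a change log in S3 rather than a copy of the table, so a Glue job (or similar) still has to fold the changes into the latest state per key. The real effort is in that step plus error handling and retries, which the sketch leaves out.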
However, I believe AWS is about to release a product that will keep DynamoDB tables and S3 buckets in sync, without code, in a few clicks. It's called AWS Glue Elastic Views. The product is in preview. They announced it in December 2020, so I'm hoping it will be available soon. There is also a form you can fill in to join the trial, but there is no guarantee AWS will give you access.