I have to set up a data pipeline for an app I'm trying to create, but I'm not sure how to do it.
I have 2 entities in the database: A and B. Each B entity belongs to an A entity.
Every minute, I fetch many B entities, but one field is missing on each of them. So before saving these B entities, I need to compute this field on each one. Given a B entity and its corresponding A entity, I need the last 20 B entities already saved in the database (so none of them has the missing field) that belong to that A entity in order to compute the missing field.
Pseudo code every minute is:
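A minimal Python sketch of that per-minute loop, using an in-memory dict as a stand-in for the database (the helper names and the placeholder computation are assumptions, not my actual logic):

```python
from collections import defaultdict

# In-memory stand-in for the database (real code would query RDS instead).
saved_b_by_a = defaultdict(list)  # a_id -> saved B entities, oldest first

def compute_missing_field(b, last_20):
    # Placeholder computation: here, the mean of the field over the last
    # 20 saved B entities; the real formula is app-specific.
    if not last_20:
        return 0.0
    return sum(x["field"] for x in last_20) / len(last_20)

def process_minute(fetched_b_entities):
    """Per-minute loop: complete the missing field on each fetched B, then save."""
    for b in fetched_b_entities:
        last_20 = saved_b_by_a[b["a_id"]][-20:]  # last 20 saved B's for this A
        b["field"] = compute_missing_field(b, last_20)
        saved_b_by_a[b["a_id"]].append(b)        # "save" the completed entity

process_minute([{"a_id": 1, "field": None}])
process_minute([{"a_id": 1, "field": None}])
print(len(saved_b_by_a[1]))  # 2
```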
Order of magnitude: 20k A entities, 30 million B entities saved, and 1k new B entities every minute (these 1k B entities belong to around 300 A entities).
Instead of querying the database every minute to get the last 20 saved B entities for each A entity found in the list of fetched B entities, I thought I could implement a cache that stores the last 20 saved B entities for each A entity.
So my first idea was:
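To illustrate the cache idea, a per-A buffer of the last 20 saved B entities could look like this (a hypothetical sketch, not the actual design; `deque` with `maxlen` drops the oldest entry automatically):

```python
from collections import defaultdict, deque

class LastNCache:
    """Keeps the last N saved B entities per A entity in memory."""

    def __init__(self, n=20):
        self.n = n
        self._buffers = defaultdict(lambda: deque(maxlen=n))

    def warm(self, a_id, saved_b_entities):
        # On a cache miss, load the last N saved B's for this A from the DB.
        self._buffers[a_id].extend(saved_b_entities[-self.n:])

    def last(self, a_id):
        return list(self._buffers[a_id])

    def record_save(self, a_id, b):
        # Called after a new B entity is saved; the oldest entry falls out.
        self._buffers[a_id].append(b)

cache = LastNCache(n=20)
for i in range(25):
    cache.record_save("a1", {"id": i})
print(len(cache.last("a1")))      # 20
print(cache.last("a1")[0]["id"])  # 5 (oldest retained entry)
```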
Since this is the first time I've had to set up a data pipeline, I'm not sure my first idea is good, and I have multiple questions:
Thank you in advance for your feedback :)
I think querying the RDS with a limit and order by creation time would be easier and less of a hassle than having a Lambda for caching. If there is a lot of load, you can put a read replica in place to handle the reads.
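The per-A query suggested above can stay very simple. A runnable sketch using SQLite in place of RDS (table and column names `b_entities`, `a_id`, `created_at` are assumptions):

```python
import sqlite3

# Illustrative schema; in production this would be the RDS table,
# with an index on (a_id, created_at) to keep the query cheap.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b_entities (id INTEGER, a_id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO b_entities VALUES (?, ?, ?)",
    [(i, 1, f"2024-01-01T00:{i:02d}:00") for i in range(30)],
)

# Last 20 B entities for a given A, newest first.
rows = conn.execute(
    """
    SELECT id FROM b_entities
    WHERE a_id = ?
    ORDER BY created_at DESC
    LIMIT 20
    """,
    (1,),
).fetchall()
print(len(rows), rows[0][0])  # 20 rows, newest id is 29
```

With roughly 300 distinct A entities per minute, that is about 300 such queries per run, which an indexed table should absorb easily.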
About the cron job with Lambda: yeah, why not. Just make sure you know how long this Lambda runs on average. It might be cheaper and more efficient to have a dedicated container set up for it.
Not sure all of this really needs to be a full data pipeline, as this is a fairly simple setup.