Currently, I have AWS Batch configured to run ML training jobs. At the end of its flow, model artifacts are stored in an S3 bucket. Additionally, I have an AWS Application Load Balanced Fargate Service, which scales between 1 and 10 tasks and hosts a FastAPI application. This API has an /update_model/ endpoint which, given a model key, retrieves the most current model from S3.
Naively, I'd like Batch to terminate after sending an HTTP request to my Fargate Service's update_model method. However, the ALB will route this request to only one task, leading to consistency issues.
Elasticache + Polling
The first mitigation I'm considering is for Batch to publish the model key to Elasticache and for each task to asynchronously poll Redis for model key updates; if one exists, the task retrieves the artifact from S3.
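A minimal sketch of what that polling loop might look like, assuming a redis-py style client (anything exposing `get`). The Redis key name `current_model_key`, the `load_model` callback, and the 30-second interval are hypothetical placeholders, not part of my current setup:

```python
import time
from typing import Callable, Optional

# Hypothetical Redis key that the Batch job would overwrite with the new model key.
MODEL_KEY_NAME = "current_model_key"


def poll_once(client, current_key: Optional[str],
              load_model: Callable[[str], None]) -> Optional[str]:
    """Check Redis once and reload the model if the published key changed."""
    raw = client.get(MODEL_KEY_NAME)  # redis-py returns bytes, or None if unset
    if raw is None:
        return current_key
    latest = raw.decode() if isinstance(raw, bytes) else raw
    if latest != current_key:
        load_model(latest)  # e.g. download the artifact from S3 and swap it in
    return latest


def poll_forever(client, load_model: Callable[[str], None],
                 interval_seconds: float = 30.0) -> None:
    """Blocking loop; in a FastAPI app this would run in a background thread or task."""
    current_key: Optional[str] = None
    while True:
        current_key = poll_once(client, current_key, load_model)
        time.sleep(interval_seconds)
```

Each task would run its own loop, so every task converges on the new model within one polling interval.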
SNS Task-level Subscription
My second mitigation, which I'd prefer if possible, is for Batch to send a notification via SNS and for each task to independently subscribe to the topic. From my research, the ALB can subscribe to the topic, but I'd hit the same problem as the naive solution, where n-1 tasks serve outdated models.
My question is: Is it possible for each Fargate Task to independently subscribe to an SNS topic (enabling fan-out)? Or am I better off using the cache polling strategy?
My question is: Is it possible for each Fargate Task to independently subscribe to an SNS topic (enabling fan-out)?
Only if they are each accessible directly by a public IP address, in addition to being accessible through the load balancer. They could each subscribe to SNS with their IP address on startup.
I would worry about the security implications of this approach, however, as it exposes your Fargate containers directly to the Internet.
My first mitigation in consideration is for Batch to publish the model key to Elasticache and then each task asynchronously poll Redis for model key updates; if one exists, retrieve the artifact from S3.
They wouldn't even need to poll; they could use Redis pub/sub to have Redis push new messages to them as they arrive. This would work similarly to SNS, but the network traffic would be entirely contained within your VPC.
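As a rough sketch of the pub/sub approach, assuming a redis-py style client (its `publish`, `pubsub()`, `subscribe` and `listen` calls). The channel name `model-updates` and the `load_model` callback are hypothetical; substitute whatever your application uses to fetch the artifact from S3:

```python
from typing import Callable, Optional

# Hypothetical channel name shared by the Batch job and the Fargate tasks.
CHANNEL = "model-updates"


def publish_model_key(client, model_key: str) -> None:
    """Called by the Batch job once the artifact has been written to S3."""
    client.publish(CHANNEL, model_key)


def handle_message(message: Optional[dict],
                   load_model: Callable[[str], None]) -> Optional[str]:
    """Reload the model for real messages; ignore subscribe confirmations."""
    if not message or message.get("type") != "message":
        return None
    data = message["data"]
    key = data.decode() if isinstance(data, bytes) else data
    load_model(key)  # e.g. download the artifact from S3 and swap it in
    return key


def listen_for_updates(client, load_model: Callable[[str], None]) -> None:
    """Blocking listener; each Fargate task would run this in a background thread."""
    pubsub = client.pubsub()
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():  # yields dicts with "type", "channel", "data"
        handle_message(message, load_model)
```

One thing to keep in mind: Redis pub/sub is fire-and-forget, so a task that starts after the message was published never sees it. It may be worth also storing the latest model key under a regular Redis key and reading it once on startup.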