We use Amazon MWAA (managed Airflow). Occasionally a task is marked as "FAILED" but there are no logs at all, as if the container had been shut down without notifying us.
I have found this link: https://cloud.google.com/composer/docs/how-to/using/troubleshooting-dags#task_fails_without_emitting_logs which explains this as an out-of-memory condition on the machine. But our tasks use almost no CPU or RAM; they only make one HTTP call to an AWS API, so they are very light.
On CloudWatch, I can see that no other tasks are launched on the same container (the DAG run starts by printing the container IP, so I can search for this IP across all tasks).
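For reference, the IP logging I mention is something along these lines (the dag_id and task_id below are illustrative, not our real ones, and the import path assumes Airflow 2.x):

import socket
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def log_worker_ip():
    # Print the worker container's hostname and IP so a failed task can be
    # matched to a container in the CloudWatch logs.
    hostname = socket.gethostname()
    print(f"Worker container: {hostname} ({socket.gethostbyname(hostname)})")


with DAG(
    dag_id="example_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="log_worker_ip", python_callable=log_worker_ip)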
If someone has an idea, that would be great, thanks!
MWAA uses ECS as a backend, and ECS autoscales the number of workers according to the number of tasks running in the cluster. For a small environment, each worker can handle 5 tasks by default. If there are more than 5 tasks, it scales out another worker, and so on.
We don't do any heavy compute on Airflow (no batch or long-running jobs); our DAGs are mainly API requests to other services, so they run fast and are short-lived. From time to time, we spike to eight or more tasks for a very short period (a few seconds). In that case, autoscaling triggers a scale-out and adds worker(s) to the cluster. Then, since those tasks are only API requests, they execute very quickly and the number of tasks immediately drops back to 0, which triggers a scale-in (removing worker(s)). If at that exact moment another task is scheduled, Airflow may run it on a container that is being removed, and the task gets killed in the middle without any notice (a race condition). You usually see incomplete logs when this happens.
The first workaround is to effectively disable autoscaling by freezing the number of workers in the cluster. Set the min and max to the same number of workers, chosen based on your workload. Admittedly, you lose the elasticity of the service.
$ aws mwaa update-environment --name MyEnvironmentName --min-workers 2 --max-workers 2
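The same pinning can also be done from a script with boto3 (a sketch, assuming the same environment name as in the CLI example above):

import boto3

# Pin the worker count by setting min == max, which prevents scale-in events.
mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="MyEnvironmentName",
    MinWorkers=2,
    MaxWorkers=2,
)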
Another solution suggested by AWS is to always have one dummy task running (an infinite loop) so you never end up scaling in all your workers; see the sketch below.
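A minimal sketch of such a "keep-alive" DAG (the dag_id, task_id, and sleep interval are arbitrary choices of mine, and the import path assumes Airflow 2.x):

import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def keep_worker_alive():
    # Never return: the only purpose is to keep one task slot occupied so the
    # autoscaler never removes the last worker.
    while True:
        time.sleep(60)


with DAG(
    dag_id="mwaa_keep_alive",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@once",  # a single run whose task never completes
    catchup=False,
) as dag:
    PythonOperator(task_id="sleep_forever", python_callable=keep_worker_alive)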
AWS told us they are working on a solution to improve the executor.