microservices oozie airflow luigi azkaban

Definition of an Airflow DAG for a use case with variable dependencies


I would like to use Airflow for the following use case:

Note: each Airflow task described here is in fact a simple call to a remote microservice (a gRPC call).

The design I have in mind so far:

My questions:


Solution

  • Your initial idea was the one I would go with. Having 150 different workflows with 10K tasks each leads to a fully dynamic and unmanageable scenario. On the one hand, you say that each task is just a simple gRPC call, but on the other you mention that the page-level tasks are really complex to encapsulate behind a single task and that there are external dependencies that may cause flow bottlenecks measured in hours.

    If I were you, I'd redesign the solution and move the page-level reporting to a different layer. For example, creating a service that does all these complex calculations would be a better option than trying to implement them in Airflow. This way you could probably cut down the number of page-level tasks significantly.

    Regarding your specific questions:

    If I were you, I'd have a single workflow for all 150 sites. I'd create a SubDAG for each website (by the way, the word 'unstable' does not appear in the official docs) and try to offload the complex calculations to a different layer in order to cut down the number of page-level tasks as much as possible.
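To give a rough idea of what "cutting down the number of page-level tasks" could look like, here is a minimal, scheduler-agnostic sketch in plain Python. It only plans the task layout: one group per site, with pages bundled into batches so that each batch would become a single call to the offloaded service. The names `BatchTask`, `plan_site_batches` and `plan_workflow`, the `site_000` style identifiers and the batch size of 500 are all hypothetical, and I am assuming roughly 10K pages per site based on the numbers in the question. It does not touch the Airflow API at all.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BatchTask:
    """One coarse-grained task: a single service call covering a range of pages."""
    task_id: str
    site: str
    first_page: int  # inclusive
    last_page: int   # exclusive


def plan_site_batches(site: str, page_count: int, batch_size: int) -> List[BatchTask]:
    """Split one site's pages into contiguous batches of at most batch_size pages."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    for start in range(0, page_count, batch_size):
        end = min(start + batch_size, page_count)
        # Hypothetical naming scheme for the task id.
        batches.append(BatchTask(f"{site}.pages_{start}_{end}", site, start, end))
    return batches


def plan_workflow(page_counts: Dict[str, int], batch_size: int) -> Dict[str, List[BatchTask]]:
    """Build one group of batch tasks per site (the per-site group of a single workflow)."""
    return {
        site: plan_site_batches(site, count, batch_size)
        for site, count in page_counts.items()
    }


# Assumed figures: 150 sites with 10,000 pages each, batched 500 pages at a time.
page_counts = {f"site_{i:03d}": 10_000 for i in range(150)}
plan = plan_workflow(page_counts, batch_size=500)
total_tasks = sum(len(batches) for batches in plan.values())
# 3000 batch tasks instead of 1,500,000 page-level tasks
```

Each `BatchTask` would then map to one task inside the per-site group, and the service behind it would loop over the pages itself. The right batch size depends on how long a single page takes and how much retry granularity you want to keep, so treat 500 as a placeholder.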