airflow

How to configure a passive task in Airflow that updates its status via the API?


I'm working with Apache Airflow and need to configure a task that operates in a passive manner. Specifically, this task should use the Airflow API to update its own status, while the execution of the task itself is handled externally.

For example, the task could represent a human sign-off or an asynchronous operation where the external system performs the work. How can I set up such a task in Airflow so that it remains passive and only updates its status through the API?


Solution

  • Unfortunately, that's a rather unfitting use case for Airflow. Basically you have two options:

    1. Use sensors: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/sensors.html . Airflow gives you a waiting step that keeps checking whether the offloaded work is done, and completes or times out per your completion/timeout parameters (see the sensor sketch right after this list).
    2. Build a custom operator that handles polling and timing out for your particular use case (see the operator sketch further below). E.g. if you want to spin up a Hadoop cluster, offload a Spark job there, and then wait until it's done, this lets you handle the nuances: the job might fail but some data still went through, so you can trigger different post-processing.
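    For option 1, here is a minimal sketch of a sensor-based wait tied to the human sign-off example from the question. It assumes the external system can be queried for completion; the `signoff_complete` helper, DAG id, and intervals are hypothetical placeholders, not anything Airflow provides.

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor


def signoff_complete() -> bool:
    """Hypothetical check: return True once the external system (or a human)
    has signed off, e.g. by querying its API or a status table."""
    return False  # placeholder; replace with a real status lookup


with DAG(
    dag_id="external_signoff_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,        # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
):
    wait_for_signoff = PythonSensor(
        task_id="wait_for_signoff",
        python_callable=signoff_complete,
        poke_interval=300,       # re-check every 5 minutes
        timeout=24 * 60 * 60,    # fail the task after 24 hours without sign-off
        # default "poke" mode keeps a worker busy the whole time it waits
    )
```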

    The clear downside: both options keep a worker occupied while waiting, even though the job itself is offloaded.
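    For option 2, a rough sketch of a custom operator that submits the work, then polls until success, failure, or timeout. `submit_job`, `get_job_state`, and the state names are hypothetical stand-ins for whatever client your external system exposes.

```python
import time

from airflow.exceptions import AirflowException
from airflow.models.baseoperator import BaseOperator


def submit_job(conf: dict) -> str:
    """Hypothetical: submit the job to the external system, return a job id."""
    raise NotImplementedError


def get_job_state(job_id: str) -> str:
    """Hypothetical: query the external system for the job's current state."""
    raise NotImplementedError


class ExternalJobOperator(BaseOperator):
    """Offloads a job to an external system, then polls until it finishes."""

    def __init__(self, *, job_conf: dict, poll_interval: int = 60,
                 poll_timeout: int = 6 * 3600, **kwargs):
        super().__init__(**kwargs)
        self.job_conf = job_conf
        self.poll_interval = poll_interval
        self.poll_timeout = poll_timeout

    def execute(self, context):
        job_id = submit_job(self.job_conf)
        deadline = time.monotonic() + self.poll_timeout

        while time.monotonic() < deadline:
            state = get_job_state(job_id)
            if state == "SUCCEEDED":
                return job_id
            if state == "FAILED":
                # Nuanced handling goes here, e.g. check how much data went
                # through and trigger different post-processing.
                raise AirflowException(f"Job {job_id} failed")
            time.sleep(self.poll_interval)

        raise AirflowException(f"Timed out waiting for job {job_id}")
```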

    Alternatively, you can separate triggering the jobs from validating their status, e.g. one DAG triggers the jobs and a separate DAG checks their status around the expected completion time (sketched below). While this is the least explicit design, it will likely be the most efficient computationally.
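    A minimal sketch of that split, assuming the external system can be queried for status; the DAG ids, the `trigger_external_job` / `check_external_job` callables, and the cron schedules are illustrative only.

```python
from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowException
from airflow.operators.python import PythonOperator


def trigger_external_job():
    """Hypothetical: kick off the job on the external system and record its id."""
    ...


def check_external_job():
    """Hypothetical: query the external system; fail this task if the job isn't done."""
    done = False  # replace with a real status lookup
    if not done:
        raise AirflowException("External job not finished yet")


# DAG 1: fire-and-forget trigger, e.g. at 01:00.
with DAG("trigger_jobs", start_date=datetime(2024, 1, 1),
         schedule="0 1 * * *", catchup=False):
    PythonOperator(task_id="trigger", python_callable=trigger_external_job)

# DAG 2: status check around the expected completion time, e.g. at 03:00.
with DAG("check_jobs", start_date=datetime(2024, 1, 1),
         schedule="0 3 * * *", catchup=False):
    PythonOperator(task_id="check_status", python_callable=check_external_job)
```

    Since neither DAG waits, no worker sits idle between triggering and checking; the trade-off is that the link between a triggered job and its check is only implicit in the schedules.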