pythonairflowairflow-2.xairflow-taskflow

How to best mix-and-match returning/non-returning tasks on Airflow Taskflow API?


Note: All examples below seek a one-line serial execution.

Dependencies on Taskflow API can be easily made serial if they don't return data that is used afterwards:

t1() >> t2()

If the task T1 returning a value and task T2 using it, you can also link them like this:

return_value = t1()
t2(return_value)

However, if you have to mix and match returning statements, this is no longer clear:

t1() >> 
returned_value = t2()
t3(returned_value)

will fail due to syntax error (>> operator cannot be used before a returning-value task).

Also, this would work, but not generate the serial (t1 >> t2 >> t3) dag required:

t1()
returned_value = t2()
t3(returned_value)

since then t1 and t2/t3 will run in parallel.

A way to make this is to force t2 to use a returned value from t1, even if not needed:

returned_fake_t1 = t1()
returned_value_t2 = t2(returned_fake_t1)
t3(returned_value_t2)

But this is not elegant since it needs to change the logic. What is the idiomatic way in Taskflow API to express this? I was not able to find such a scenario on Airflow Documentation.


Solution

  • I've struggled with the same problem, in the end it was solved it as:

    t1_res = t1()
    t2_res = t2()
    t3_res = t3(t2_res)
    t1_res >> t2_res >> t3_res
    

    Good thing is that you don't really need to return anything from t1() or to add a fake parameter to f2(). Kudos to this article and it's author: https://khashtamov.com/ru/airflow-taskflow-api/