Tags: python, airflow, google-cloud-composer

The same DAG duplicated with different parameters, or the same tasks with different parameters within one DAG?


In our project we have several clients and an identical DAG for each of them, differing only in prefix and parameters. For instance, we have an mssql_to_bigquery DAG, but it exists separately for every client. This means the DAG is multiplied per client: a "factory" file generates one copy per client with a different prefix, e.g. Client1_mssql_to_bigquery, Client2_mssql_to_bigquery, etc.
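
For illustration, here is a minimal sketch of what such a factory file looks like; the client names and the transfer callable are simplified placeholders, not our real code:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    CLIENTS = ["Client1", "Client2", "Client3"]  # placeholder client list


    def transfer_mssql_to_bigquery(client: str, **context):
        """Placeholder for the real MSSQL -> BigQuery transfer logic."""
        print(f"Transferring data for {client}")


    for client in CLIENTS:
        dag_id = f"{client}_mssql_to_bigquery"
        with DAG(
            dag_id=dag_id,
            start_date=datetime(2023, 1, 1),
            schedule_interval="@daily",
            catchup=False,
        ) as dag:
            PythonOperator(
                task_id="mssql_to_bigquery",
                python_callable=transfer_mssql_to_bigquery,
                op_kwargs={"client": client},
            )
        # Each generated DAG must be exposed in the module's global scope
        # so that the Airflow DagBag can discover it.
        globals()[dag_id] = dag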

The question is whether this "factory" method affects DAG parsing time and the health of the environment as a whole. I wonder whether it would be better to keep everything in one DAG with multiple per-client tasks instead of the approach described above.
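
For comparison, a minimal sketch of that alternative, again with placeholder client names and callable: a single DAG containing one task group per client.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.utils.task_group import TaskGroup

    CLIENTS = ["Client1", "Client2", "Client3"]  # placeholder client list


    def transfer_mssql_to_bigquery(client: str, **context):
        """Placeholder for the real MSSQL -> BigQuery transfer logic."""
        print(f"Transferring data for {client}")


    with DAG(
        dag_id="mssql_to_bigquery",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        for client in CLIENTS:
            # Task IDs become e.g. "Client1.mssql_to_bigquery".
            with TaskGroup(group_id=client):
                PythonOperator(
                    task_id="mssql_to_bigquery",
                    python_callable=transfer_mssql_to_bigquery,
                    op_kwargs={"client": client},
                )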

We have already tried both approaches, but it is difficult to tell which one is better because our environment currently contains a mix of both.


Solution

  • It might be a matter of personal taste, but I would recommend against using a factory.

Here is the official documentation about Dynamic DAG Generation.

You might also read this advice from the Airflow FAQ on how to create DAGs dynamically:

    Even though Airflow supports multiple DAG definition per python file, dynamically generated or otherwise, it is not recommended as Airflow would like better isolation between DAGs from a fault and deployment perspective and multiple DAGs in the same file goes against that

    What you could do instead to increase your team's DAG-writing velocity is to create a file template with only the missing information left to fill in. For example, here's the one from IntelliJ IDEA or PyCharm (a minimal per-client file sketch is shown at the end of this answer).

    Regarding performance, you should read this article about optimizations. I believe that having multiple DAGs in a single file might impact performance during DAG runs (the whole file is parsed even though only one DAG gets executed). If you want to know more, you can read an epilogue article that led to an official optimization trick that is still marked "experimental".
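
    As a rough illustration of the template idea above, here is a minimal per-client file sketch (the client name and transfer callable are placeholders); each client gets its own copy of such a file, so every Python file defines exactly one DAG:

        from datetime import datetime

        from airflow import DAG
        from airflow.operators.python import PythonOperator

        CLIENT = "Client1"  # the only value that changes between generated files


        def transfer_mssql_to_bigquery(client: str, **context):
            """Placeholder for the real MSSQL -> BigQuery transfer logic."""
            print(f"Transferring data for {client}")


        with DAG(
            dag_id=f"{CLIENT}_mssql_to_bigquery",
            start_date=datetime(2023, 1, 1),
            schedule_interval="@daily",
            catchup=False,
        ) as dag:
            PythonOperator(
                task_id="mssql_to_bigquery",
                python_callable=transfer_mssql_to_bigquery,
                op_kwargs={"client": CLIENT},
            )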