python, google-cloud-platform, airflow, google-cloud-composer

How to control the parallelism or concurrency of an Airflow installation?


In some of my Apache Airflow installations, DAGs or tasks that are scheduled to run do not run even when the scheduler doesn't appear to be fully loaded. How can I increase the number of DAGs or tasks that can run concurrently?

Similarly, if my installation is under high load and I want to limit how quickly my Airflow workers pull queued tasks (for example, to reduce resource consumption), what can I adjust to reduce the average load?


Solution

  • Here's an expanded list of configuration options that are available since Airflow v1.10.2. Some can be set on a per-DAG or per-operator basis, and fall back to the setup-wide defaults when they are not specified; a sketch of where those defaults live follows below.
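
    As context for the options below, the setup-wide defaults that they fall back to live in airflow.cfg (each key can also be set via the matching AIRFLOW__SECTION__KEY environment variable). A minimal sketch of the relevant keys, with illustrative values:

    # airflow.cfg -- setup-wide fallbacks (values shown are illustrative)
    [core]
    # Maximum task instances running at once across the entire installation
    parallelism = 32
    # Default cap on running task instances per DAG (per-DAG 'concurrency'
    # overrides this)
    dag_concurrency = 16
    # Default cap on active runs per DAG (per-DAG 'max_active_runs'
    # overrides this)
    max_active_runs_per_dag = 16

    [celery]
    # Task slots per Celery worker; lower this to make each worker pull
    # queued tasks more slowly
    worker_concurrency = 16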


    Options that can be specified on a per-DAG basis:

    Examples:

    from airflow import DAG

    # Only allow one run of this DAG to be active at any given time
    dag = DAG('my_dag_id', max_active_runs=1)

    # Allow at most 10 running task instances in total across at most 2
    # active runs of this DAG ('concurrency' was renamed 'max_active_tasks'
    # in Airflow 2.2; the old name still works but is deprecated)
    dag = DAG('example2', concurrency=10, max_active_runs=2)
    

    Options that can be specified on a per-operator basis:

    Example:

    from airflow.models.baseoperator import BaseOperator

    # These arguments are defined on BaseOperator, so any operator accepts
    # them: draw this task's slots from a named pool, and allow at most 12
    # concurrently running instances of the task across all runs of the DAG.
    # ('max_active_tis_per_dag' was named 'task_concurrency' before Airflow 2.6.)
    t1 = BaseOperator(task_id='t1', pool='my_custom_pool', max_active_tis_per_dag=12)
    
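    The pool referenced above must already exist, or the scheduler will not schedule the task into it. One way to create it is through the Airflow CLI; the pool name and slot count here are just illustrative:

    # Create or update a pool with 8 task slots (Airflow 2 syntax;
    # on 1.10.x the equivalent is `airflow pool -s ...`)
    airflow pools set my_custom_pool 8 "Rate-limited tasks"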

    Options that are specified across an entire Airflow setup: