I work on a multi-tenant data pipeline platform with about 5 tenants at the moment. We use AWS MWAA (Apache Airflow) as our orchestration tool. Each tenant has a separate DAG, and the DAGs run serially to avoid overloading MWAA. We now need to enable parallel execution of the DAGs in Airflow so that the pipeline runs complete in bounded time.
Each data pipeline has more than 10 tasks, one of which is the AWS Neptune bulk loader module.
Since we have already enabled queue_request = True, I ran the Airflow DAGs in parallel, but this did not work as expected.
While the 1st DAG's Neptune load is running, the 2nd DAG's Neptune load gets the status LOAD_IN_QUEUE. Airflow treats this status as a failure and waits for a retry. Once the 1st DAG's Neptune load completes, the 2nd DAG's load only starts when the next retry fires.
I'm expecting the 2nd DAG to wait until the 1st DAG's Neptune load has completed, without retrying. Currently, when the status is LOAD_IN_QUEUE, Airflow treats it as an error and triggers a retry.
Kindly share your inputs on this.
Neptune's bulk loader can only run one load job at a time. If you're using queue_request = TRUE, each subsequent request is placed into the loader queue and the jobs execute serially.
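For reference, queueing is controlled by the queueRequest parameter in the bulk load request body. A minimal sketch of such a request payload (the endpoint URL, S3 source, and IAM role ARN below are placeholders, not values from your setup):

```python
import json

# Placeholder endpoint -- substitute your own Neptune cluster endpoint.
NEPTUNE_LOADER_URL = "https://your-neptune-endpoint:8182/loader"

# Placeholder S3 source and IAM role ARN.
load_request = {
    "source": "s3://your-bucket/tenant-1/",
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadRole",
    "region": "us-east-1",
    # Queue this job instead of rejecting it while another load is running.
    "queueRequest": "TRUE",
}

# This JSON body would be sent as a POST to NEPTUNE_LOADER_URL.
payload = json.dumps(load_request)
```

With queueRequest set to "TRUE", overlapping load requests are accepted and queued rather than rejected, which is why the second DAG sees LOAD_IN_QUEUE instead of an error.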
If a DAG receives a response of LOAD_IN_QUEUE, it needs to poll until the status changes to LOAD_COMPLETED or LOAD_FAILED (or one of the other possible status codes). A full list of possible status codes can be found here: https://docs.aws.amazon.com/neptune/latest/userguide/loader-message.html
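One way to do this is to treat LOAD_IN_QUEUE (and LOAD_IN_PROGRESS) as wait states rather than failures, for example inside an Airflow sensor or a plain polling task. A minimal sketch under assumptions: the get_status callable (which you would wire to GET /loader/&lt;loadId&gt; on your cluster), the set of wait statuses, and the interval/timeout values are all illustrative, not part of your pipeline:

```python
import time
from typing import Callable

# Statuses that mean the load is still pending or running; the task should
# keep waiting instead of failing (assumed subset of Neptune loader codes).
WAIT_STATUSES = {"LOAD_NOT_STARTED", "LOAD_IN_QUEUE", "LOAD_IN_PROGRESS"}
SUCCESS_STATUS = "LOAD_COMPLETED"


def wait_for_neptune_load(
    get_status: Callable[[], str],
    poke_interval: float = 30.0,
    timeout: float = 3600.0,
) -> str:
    """Poll the loader status until it leaves the queued/running states.

    get_status is a caller-supplied function that calls the Neptune loader
    status endpoint for the load job and returns its overall status string.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == SUCCESS_STATUS:
            return status
        if status not in WAIT_STATUSES:
            # LOAD_FAILED or another terminal error status: fail for real.
            raise RuntimeError(f"Neptune load ended with status {status}")
        time.sleep(poke_interval)
    raise TimeoutError("Neptune load did not finish within the timeout")
```

Wrapping this in the task (or in a PythonSensor in reschedule mode, to free the worker slot between polls) means a queued load simply waits its turn instead of burning Airflow retries.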