I have dockerized my Airflow DAG, which basically copies one CSV file to GCS, then loads it into BigQuery and does a simple transformation. When I run docker-compose run, my DAG is executed twice, and I can't work out which part of the code is causing this. When I trigger the DAG manually from the UI it runs once.
My DAG:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

from scripts import extract_and_gcpload, load_to_BQ

default_args = {
    'owner': 'shweta',
    'start_date': datetime(2025, 4, 24),
    'retries': 0
}

with DAG(
    'spacex_etl_dag',
    default_args=default_args,
    schedule=None,
    catchup=False  # prevents Airflow from running missed periods
) as dag:
    extract_and_upload = PythonOperator(
        task_id="extract_and_upload_to_gcs",
        python_callable=extract_and_gcpload.load_to_gcp_pipeline,
    )

    load_to_bq = PythonOperator(
        task_id="load_to_BQ",
        python_callable=load_to_BQ.load_csv_to_bigquery
    )

    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="cd '/opt/airflow/dbt/my_dbt' && dbt run --profiles-dir /opt/airflow/dbt"
    )

    extract_and_upload >> load_to_bq >> run_dbt
My entrypoint file startscript.sh:
#!/bin/bash
set -euo pipefail

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

# Optional: Run DB init & parse DAGs
log "Initializing Airflow DB..."
airflow db upgrade

log "Parsing DAGs..."
airflow scheduler --num-runs 1

DAG_ID="spacex_etl_dag"
log "Unpausing DAG: $DAG_ID"
airflow dags unpause "$DAG_ID" || true

log "Triggering DAG: $DAG_ID"
airflow dags trigger "$DAG_ID" || true

log "Creating admin user (if not exists)..."
airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com \
    --password admin || true

if [[ "$1" == "webserver" || "$1" == "scheduler" ]]; then
    log "Starting Airflow: $1"
    exec airflow "$@"
else
    log "Executing: $@"
    exec "$@"
fi
My docker-compose.yaml file:
services:
  airflow-webserver:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: airflow-webserver
    env_file: .env
    restart: always
    environment:
      AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'false'
      AIRFLOW__LOGGING__REMOTE_LOGGING: 'False'
      AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      GOOGLE_APPLICATION_CREDENTIALS: /opt/airflow/secrets/llms-395417-c18ea70a3f54.json
    volumes:
      - ./dags:/opt/airflow/dags
      - ./scripts:/opt/airflow/scripts
      - ./dbt:/opt/airflow/dbt
      - ./secrets:/opt/airflow/secrets
    ports:
      - 8080:8080
    command: webserver

  airflow-scheduler:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: airflow-scheduler
    env_file: .env
    restart: always
    environment:
      AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'false'
      AIRFLOW__LOGGING__REMOTE_LOGGING: 'False'
      AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      GOOGLE_APPLICATION_CREDENTIALS: /opt/airflow/secrets/llms-395417-c18ea70a3f54.json
    volumes:
      - ./dags:/opt/airflow/dags
      - ./dbt:/opt/airflow/dbt
      - ./secrets:/opt/airflow/secrets
      - ./scripts:/opt/airflow/scripts
    depends_on:
      - postgres
    command: scheduler

  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data

volumes:
  postgres-db-volume:
You have two containers, each built from the same Dockerfile and running the same entrypoint script. The script is identical in both; only the final command differs (airflow scheduler vs. airflow webserver). All of the preceding commands are therefore run in both containers, including the airflow dags trigger call, which is why the DAG is triggered twice at startup but only once when you trigger it from the UI:
# Optional: Run DB init & parse DAGs
log "Initializing Airflow DB..."
airflow db upgrade

log "Parsing DAGs..."
airflow scheduler --num-runs 1

DAG_ID="spacex_etl_dag"
log "Unpausing DAG: $DAG_ID"
airflow dags unpause "$DAG_ID" || true

log "Triggering DAG: $DAG_ID"
airflow dags trigger "$DAG_ID" || true

log "Creating admin user (if not exists)..."
airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com \
    --password admin || true
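If you want to confirm that the extra run comes from the entrypoint rather than from the scheduler itself, you can count the trigger log lines and inspect the run list after startup. A small sketch, assuming the Airflow 2.x CLI and the container names from your compose file:

# How many times the entrypoint's trigger step ran across all containers
docker-compose logs | grep -c "Triggering DAG: spacex_etl_dag"

# List the runs Airflow created; two rows whose run_id starts with
# "manual__" point at two CLI/UI triggers rather than at scheduled runs
docker exec airflow-scheduler airflow dags list-runs -d spacex_etl_dag

A count of 2 and two manual runs would match one trigger per container.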
You should avoid repeating these one-time commands in both containers if possible. That can be done with multiple entrypoints (a docker-compose variant of that idea is sketched after the script below), or with a check in the entrypoint so that only the relevant container runs them. For example, to run them only in the scheduler container (I'm not familiar enough with Airflow to say whether that is the correct design), you could structure the script like this:
#!/bin/bash
set -euo pipefail

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

if [[ "$1" == "scheduler" ]]; then
    # Optional: Run DB init & parse DAGs
    log "Initializing Airflow DB..."
    airflow db upgrade

    log "Parsing DAGs..."
    airflow scheduler --num-runs 1

    DAG_ID="spacex_etl_dag"
    log "Unpausing DAG: $DAG_ID"
    airflow dags unpause "$DAG_ID" || true

    log "Triggering DAG: $DAG_ID"
    airflow dags trigger "$DAG_ID" || true
fi

log "Creating admin user (if not exists)..."
airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com \
    --password admin || true

if [[ "$1" == "webserver" || "$1" == "scheduler" ]]; then
    log "Starting Airflow: $1"
    exec airflow "$@"
else
    log "Executing: $@"
    exec "$@"
fi
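The other option mentioned above, separate entrypoints, maps naturally onto a one-off init service in docker-compose, similar to the airflow-init service in the official Airflow docker-compose.yaml. A rough sketch, assuming you move the one-time steps (db upgrade, user creation, unpause/trigger) out of the entrypoint and into a hypothetical /opt/airflow/init.sh, reuse the same image, and add this under services: alongside the existing ones:

  airflow-init:
    build:
      context: .
      dockerfile: Dockerfile
    env_file: .env
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    depends_on:
      - postgres
    # runs the one-time setup once, then exits
    command: bash /opt/airflow/init.sh

  airflow-webserver:
    # ...rest of the service unchanged...
    depends_on:
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    # ...rest of the service unchanged...
    depends_on:
      airflow-init:
        condition: service_completed_successfully

With that layout the db upgrade, the admin user creation and the single DAG trigger run exactly once, no matter how many webserver or scheduler containers you start (the service_completed_successfully condition needs a reasonably recent docker compose).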