I am running a proof of concept to connect airflow and oracle database to store airflow metadata. From the documentation (https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html#database-uri) I have seen that mostly PostgreSQL, MySQL and MSSQL are supported with PorsgreSQL being the most preferable. However, since SQLAlchemy supports Oracle connections I configured the airflow docker image to initiliaze with an oracle db.
First of all, I would like to clarify that I am using Docker on a linux Ubuntu 20.4 LTS version and everything is deployed there.
I will start with the docker-compose.yml file which is similar to the one from the official airflow documentation (https://airflow.apache.org/docs/apache-airflow/2.6.1/docker-compose.yaml):
version: '3.8'
x-airflow-common:
&airflow-common
#image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.6.1-python3.8}
build: .
env_file: .env
user: "${AIRFLOW_UID:-50000}:0"
environment:
&airflow-common-env
AIRFLOW__CORE__EXECUTOR: LocalExecutor
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: oracle+cx_oracle://NIKOS:Nikos_123@oracle-db:1521/?service_name=ORCLPDB1
AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
AIRFLOW__CORE__LOAD_EXAMPLES: 'False'
AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
_PIP_ADDITIONAL_REQUIREMENTS: ''
volumes:
- ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
- ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
- ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
- ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
depends_on:
&airflow-common-depends-on
oracle-db:
condition: service_healthy
services:
oracle-db:
image: container-registry.oracle.com/database/enterprise:21.3.0.0
container_name: oracle-db-airflow
environment:
ORACLE_SID: ORCLCDB
ORACLE_PDB: ORCLPDB1
ORACLE_PWD: Oracle_123
#ORACLE_EDITION: standard
ports:
- 1521:1521
volumes:
- oracle-data:/opt/oracle/oradata
- oracle-backup:/opt/oracle/backup
- ./create_oracle_user.sql:/create_oracle_user.sql
healthcheck:
test: [ "CMD", "/opt/oracle/checkDBStatus.sh"]
interval: 30s
timeout: 10s
retries: 2
restart: unless-stopped
airflow-webserver:
<<: *airflow-common
command: webserver
ports:
- "8080:8080"
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 5
start_period: 30s
restart: unless-stopped
depends_on:
<<: *airflow-common-depends-on
airflow-init:
condition: service_completed_successfully
airflow-scheduler:
<<: *airflow-common
command: scheduler
healthcheck:
test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
interval: 30s
timeout: 10s
retries: 5
start_period: 30s
restart: always
depends_on:
<<: *airflow-common-depends-on
airflow-init:
condition: service_completed_successfully
airflow-init:
<<: *airflow-common
env_file:
- .env
user: root
entrypoint:
- /bin/bash
- -c
- |
mkdir -p /opt/oracle
apt-get update
apt-get install -y unzip
unzip /opt/instantclient-basic-linux.x64-21.10.0.0.0dbru.zip -d /opt/oracle/
apt-get install -y libaio1
sh -c "echo /opt/oracle/instantclient_21_10 > /etc/ld.so.conf.d/oracle-instantclient.conf"
ldconfig
ls /opt/oracle/instantclient_21_10 | grep "libclntsh.so"
/usr/bin/dumb-init -- /entrypoint version
volumes:
- ${AIRFLOW_PROJ_DIR:-.}/instantclient-basic-linux.x64-21.10.0.0.0dbru.zip:/opt/instantclient-basic-linux.x64-21.10.0.0.0dbru.zip
environment:
<<: *airflow-common-env
_AIRFLOW_DB_UPGRADE: 'true'
_AIRFLOW_WWW_USER_CREATE: 'true'
_AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
_AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
_PIP_ADDITIONAL_REQUIREMENTS: ''
LD_LIBRARY_PATH: /opt/oracle/instantclient_21_10
volumes:
oracle-data:
external: false
name: oracla_store
oracle-backup:
external: false
name: oracla_backup
networks:
default:
driver: bridge
name: network_airflow_poc
The oracle-db service is an official oracle database taken from the container.registry of Oracle . The database is healthy and running. I can check the connection to the database along with the user credentials with Oracle SQL Developer free app. Also the user NIKOS
is valid and has the following privileges GRANT CONNECT, CREATE SESSION, CREATE TABLE, CREATE VIEW TO NIKOS;
My problem is found when the service airflow-init
is executed. As you will notice I have added some extra bash commands under this service in the docker-compose file. This is because cx_Oracle
package and the Oracle Instant Client are not installed by default. Thus, I had to do the following steps:
Step 1: Create a Dockerfile
and a requirements.txt
with the following configurations respectively (documentation):
FROM apache/airflow:2.6.1-python3.8
ADD requirements.txt .
RUN pip install -r requirements.txt
sqlalchemy==1.4.48
cx_Oracle==8.3.0
And rebuild the docker airflow image by uncommenting the build argument (build: .
) of the docker service.
Step 2: Install Oracle Instant Client inside the airflow container. Thus following the steps of the official documentation_1, documentation_2 I managed to overcome the following error
DPI-1047: Cannot locate a 64-bit Oracle Client library
However, the whole solution fails to complete due to the following log:
full error log here
The error basically says that the table log
, which I assume is an airflow metadata
table is not found found. Because is never created.
The odd thing is that when I execute the whole docker-compose file with a PostgreSQL
database as metadata backend and checking the results in PGAdmin, I can clearly see the initialized tables like log
and others.
Thus, my question is how can I overcome the current issue I am facing in the error logs? Even though Airflow has a valid SQLAlchemy Connection string and the necessary dependency libraries installed, it cannot initiate properly its metadata tables in the oracle db.
I have found the solution to this problem. Based on the full stack trace the problem was under the db.py
file, located at /home/airflow/.local/lib/python3.8/site-packages/airflow/utils/
In this file the function check()
was called before starting to build the metadata tables. The check
function actually checks whether the database is alive. The following command was executed when the function was called.
session.execute("SELECT 1 AS IS_ALIVE")
The SQL statement does not follow the Oracle SQL systax. Thus, I changed the command to
session.execute("SELECT 1 AS IS_ALIVE FROM DUAL;")
That was the solution to the problem I posted.