oracle-databasedockerdocker-composedatabase-metadata

Configuring oracle database as airflow metadata backend db does not create airflow tables when deployed from docker


I am running a proof of concept to connect airflow and oracle database to store airflow metadata. From the documentation (https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html#database-uri) I have seen that mostly PostgreSQL, MySQL and MSSQL are supported with PorsgreSQL being the most preferable. However, since SQLAlchemy supports Oracle connections I configured the airflow docker image to initiliaze with an oracle db.

First of all, I would like to clarify that I am using Docker on a linux Ubuntu 20.4 LTS version and everything is deployed there.

I will start with the docker-compose.yml file which is similar to the one from the official airflow documentation (https://airflow.apache.org/docs/apache-airflow/2.6.1/docker-compose.yaml):

version: '3.8'
x-airflow-common:
  &airflow-common
  #image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.6.1-python3.8}
  build: .
  env_file: .env
  user: "${AIRFLOW_UID:-50000}:0"
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: oracle+cx_oracle://NIKOS:Nikos_123@oracle-db:1521/?service_name=ORCLPDB1
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'False'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
    AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
    _PIP_ADDITIONAL_REQUIREMENTS: ''
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
  depends_on:
    &airflow-common-depends-on
    oracle-db:
      condition: service_healthy

services:
  oracle-db:
    image: container-registry.oracle.com/database/enterprise:21.3.0.0
    container_name: oracle-db-airflow
    environment:
      ORACLE_SID: ORCLCDB
      ORACLE_PDB: ORCLPDB1
      ORACLE_PWD: Oracle_123
      #ORACLE_EDITION: standard
    ports:
      - 1521:1521
    volumes:
      - oracle-data:/opt/oracle/oradata
      - oracle-backup:/opt/oracle/backup
      - ./create_oracle_user.sql:/create_oracle_user.sql
    healthcheck:
      test: [ "CMD", "/opt/oracle/checkDBStatus.sh"]
      interval: 30s
      timeout: 10s
      retries: 2
    restart: unless-stopped

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: unless-stopped
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    env_file:
      - .env
    user: root
    entrypoint:
      - /bin/bash
      - -c
      - |
        mkdir -p /opt/oracle
        apt-get update
        apt-get install -y unzip
        unzip /opt/instantclient-basic-linux.x64-21.10.0.0.0dbru.zip -d /opt/oracle/
        apt-get install -y libaio1
        sh -c "echo /opt/oracle/instantclient_21_10 > /etc/ld.so.conf.d/oracle-instantclient.conf"
        ldconfig
        ls /opt/oracle/instantclient_21_10 | grep "libclntsh.so"
        /usr/bin/dumb-init -- /entrypoint version
    volumes:
      - ${AIRFLOW_PROJ_DIR:-.}/instantclient-basic-linux.x64-21.10.0.0.0dbru.zip:/opt/instantclient-basic-linux.x64-21.10.0.0.0dbru.zip
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
      LD_LIBRARY_PATH: /opt/oracle/instantclient_21_10

volumes:
  oracle-data:
    external: false
    name: oracla_store
  oracle-backup:
    external: false
    name: oracla_backup

networks:
  default:
    driver: bridge
    name: network_airflow_poc

The oracle-db service is an official oracle database taken from the container.registry of Oracle . The database is healthy and running. I can check the connection to the database along with the user credentials with Oracle SQL Developer free app. Also the user NIKOS is valid and has the following privileges GRANT CONNECT, CREATE SESSION, CREATE TABLE, CREATE VIEW TO NIKOS;

My problem is found when the service airflow-init is executed. As you will notice I have added some extra bash commands under this service in the docker-compose file. This is because cx_Oracle package and the Oracle Instant Client are not installed by default. Thus, I had to do the following steps:

Step 1: Create a Dockerfile and a requirements.txt with the following configurations respectively (documentation):

FROM apache/airflow:2.6.1-python3.8
ADD requirements.txt . 
RUN pip install -r requirements.txt
sqlalchemy==1.4.48
cx_Oracle==8.3.0

And rebuild the docker airflow image by uncommenting the build argument (build: .) of the docker service.

Step 2: Install Oracle Instant Client inside the airflow container. Thus following the steps of the official documentation_1, documentation_2 I managed to overcome the following error

DPI-1047: Cannot locate a 64-bit Oracle Client library

However, the whole solution fails to complete due to the following log:

enter image description here

full error log here

The error basically says that the table log, which I assume is an airflow metadata table is not found found. Because is never created.

The odd thing is that when I execute the whole docker-compose file with a PostgreSQL database as metadata backend and checking the results in PGAdmin, I can clearly see the initialized tables like log and others.

enter image description here

Thus, my question is how can I overcome the current issue I am facing in the error logs? Even though Airflow has a valid SQLAlchemy Connection string and the necessary dependency libraries installed, it cannot initiate properly its metadata tables in the oracle db.


Solution

  • I have found the solution to this problem. Based on the full stack trace the problem was under the db.py file, located at /home/airflow/.local/lib/python3.8/site-packages/airflow/utils/

    In this file the function check() was called before starting to build the metadata tables. The check function actually checks whether the database is alive. The following command was executed when the function was called.

    session.execute("SELECT 1 AS IS_ALIVE")

    The SQL statement does not follow the Oracle SQL systax. Thus, I changed the command to

    session.execute("SELECT 1 AS IS_ALIVE FROM DUAL;")

    That was the solution to the problem I posted.