Tags: python, airflow, pyodbc, impala

pyodbc in Airflow works in CLI but fails when running DAG from UI


I have a simple DAG that runs kinit to obtain a Kerberos ticket and then uses pyodbc to connect to a database engine (Impala) and run a SELECT count(*) query.

import pyodbc
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

CONN_ARGS = {
    'Driver': LINKTODRIVER,
    'HOST': HOST,
    'PORT': PORT,
    'AuthMech': '1',
    'KrbFQDN': HOST,
    'KrbRealm': 'SOME.REALM',
    'KrbServiceName': 'servicename',
    'SSL': 1,
    'autocommit': True,
}


def run_test_two():
    conn = pyodbc.connect(**CONN_ARGS)
    statement = 'SELECT count(*) AS result FROM some.table'
    crsr = conn.cursor()
    crsr.execute(statement)
    print(crsr.fetchall())
    conn.close()


with DAG(
    dag_id='test_dag',
    schedule_interval=None,  # only for manual test runs
    start_date=datetime(2022, 1, 1),
    catchup=False,
    description='This is a test dag',
    dagrun_timeout=timedelta(minutes=60),
) as dag:

    task_test_task_one = BashOperator(
        task_id='test_task_one',
        bash_command=KINIT_TASK_COMMAND,
    )

    task_test_task_two = PythonOperator(
        task_id='test_task_two',
        python_callable=run_test_two,
    )

    task_test_task_one >> task_test_task_two

When I run the DAG from the CLI everything works, but when I trigger it from the UI the connection fails:

pyodbc.Error: ('HY000', '[HY000] [Cloudera][DriverSupport] (1170) Unexpected response received from server. Please ensure the server host and port specified for the connection are correct. (1170) (SQLDriverConnect)')

What is the Airflow UI doing differently from the CLI to cause this issue?


Solution

  • Figured it out: although the kinit task succeeds, Airflow launches each task in a separate worker pod, so the Kerberos ticket created by kinit no longer exists when the subsequent task runs. From the CLI it works because both tasks effectively share the same session, so the ticket is still there; from the UI that session is closed before the next task executes.

    Adding the kinit into the same task as the pyodbc connection fixed the problem (see the sketch below).
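
A minimal sketch of the combined task, assuming KINIT_TASK_COMMAND is the same shell string the BashOperator ran and CONN_ARGS is the dict from the question; the function and task names here are illustrative:

import subprocess

def run_kinit_and_query():
    # Run kinit in this worker's own process so the ticket cache it
    # creates is visible to the pyodbc/Impala driver in the same task.
    subprocess.run(KINIT_TASK_COMMAND, shell=True, check=True)

    conn = pyodbc.connect(**CONN_ARGS)
    crsr = conn.cursor()
    crsr.execute('SELECT count(*) AS result FROM some.table')
    print(crsr.fetchall())
    conn.close()

with DAG(
    dag_id='test_dag',
    schedule_interval=None,
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    # A single task now performs both steps, so the ticket and the
    # connection share one worker.
    task_kinit_and_query = PythonOperator(
        task_id='kinit_and_query',
        python_callable=run_kinit_and_query,
    )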