On Cloud Composer I have long-running DAG tasks, each running for 4 to 6 hours. The tasks end with an error from the Kubernetes API: 401 Unauthorized.
The error message:
kubernetes.client.rest.ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'e1a37278-0693-4f36-8b04-0a7ce0b7f7a0', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Fri, 07 Jul 2023 08:10:15 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
The Kubernetes API token expires after 1 hour, and Composer does not renew it before it expires. This issue never happened with Composer 1; it only started after I migrated from Composer 1 to Composer 2.
Additional details: GKEStartPodOperator has an option called is_delete_operator_pod, which I have set to true. This option deletes the pod from the cluster once the job is done. So, after the task completes in about 4 hours, Composer tries to delete the pod, and at that point the 401 Unauthorized error is raised.
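To make the suspected failure mode concrete, here is a minimal, self-contained simulation. It does not use Airflow or the Kubernetes client; the names Token, fetch_token and SimulatedClient are hypothetical and exist only for illustration. It assumes the 1-hour token lifetime described above and shows why a delete call made 4 hours after the token was obtained would be rejected unless the token is refreshed first.

```python
from dataclasses import dataclass

TOKEN_LIFETIME_S = 3600  # 1 hour, as described in the question


@dataclass
class Token:
    value: str
    expires_at: float  # seconds, on the same clock as `now`

    def is_valid(self, now: float, margin_s: float = 60.0) -> bool:
        # Treat the token as expired slightly early to avoid racing the deadline.
        return now < self.expires_at - margin_s


def fetch_token(now: float) -> Token:
    # Hypothetical stand-in for a real credential fetch.
    return Token(value=f"token-issued-at-{int(now)}", expires_at=now + TOKEN_LIFETIME_S)


class SimulatedClient:
    """Hypothetical client; optionally re-fetches its token before each call."""

    def __init__(self, now: float, refresh: bool) -> None:
        self.refresh = refresh
        self.token = fetch_token(now)  # token obtained when the task starts

    def delete_pod(self, now: float) -> int:
        if self.refresh and not self.token.is_valid(now):
            self.token = fetch_token(now)
        # Simulated API server: 401 for an expired token, 200 otherwise.
        return 200 if now < self.token.expires_at else 401


start = 0.0
four_hours_later = start + 4 * 3600

stale = SimulatedClient(start, refresh=False)
fresh = SimulatedClient(start, refresh=True)

print(stale.delete_pod(four_hours_later))  # 401
print(fresh.delete_pod(four_hours_later))  # 200
```

This is only a model of the symptom: the client that keeps its start-of-task token fails at cleanup time, while the one that refreshes just before the call succeeds.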
I have checked some Airflow configs, such as kubernetes.enable_tcp_keepalive, which enables the TCP keepalive mechanism for Kubernetes clusters, but it doesn't resolve the problem.
What can be done to prevent this error?
After experiencing the same issue, I found a fix in the latest version of the Google provider for Airflow, which is not yet available in Cloud Composer. However, you can override the bundled version manually by adding the release candidate package to your Cloud Composer instance.
You can use the release candidate for version 10.5.0 of the apache-airflow-providers-google Python package. It can be found here.
The override can be done either by manually adding a PyPI package in the Cloud Composer environment's settings or by adding the package to the Terraform resource. The update takes about 15-30 minutes.
I tested this and can confirm it works. Tasks can again run longer than 1h.
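If you want to confirm that the override actually took effect, one option is to print the installed provider version from inside the environment, for example in a small PythonOperator task. The sketch below uses only the standard library. The helper names are hypothetical, and the version parsing is deliberately simplistic: it is assumed that a release candidate such as 10.5.0rc1 should count as 10.5.0.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

PACKAGE = "apache-airflow-providers-google"  # distribution name from the answer above
MIN_VERSION = (10, 5, 0)  # version mentioned in the answer above


def installed_version(package: str = PACKAGE) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release part, e.g. '10.5.0rc1' -> (10, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_fix(version: Optional[str]) -> bool:
    # Simplification (assumption): a release candidate counts as its final release.
    return version is not None and release_tuple(version) >= MIN_VERSION


version = installed_version()
print(f"{PACKAGE}: {version or 'not installed'} (fix expected: {has_fix(version)})")
```

Run outside a Composer environment, this simply reports that the package is not installed.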