
Dataflow with python flex template - launcher timeout


I'm trying to run my Python Dataflow job with a flex template. The job works fine locally when I run it with the direct runner (without a flex template). However, when I try to run it with a flex template, the job stays in "Queued" status for a while and then fails with a timeout.

Here are some of the logs I found in the GCE console:

INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/local/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/dataflow/template/requirements.txt', '--exists-action', 'i', '--no-binary', ':all:'

Shutting down the GCE instance, launcher-202011121540156428385273524285797, used for launching.

Timeout in polling result file: gs://my_bucket/staging/template_launches/2020-11-12_15_40_15-6428385273524285797/operation_result.
Possible causes are:
1. Your launch takes too long time to finish. Please check the logs on stackdriver.
2. Service my_service_account@developer.gserviceaccount.com may not have enough permissions to pull container image gcr.io/indigo-computer-272415/samples/dataflow/streaming-beam-py:latest or create new objects in gs://my_bucket/staging/template_launches/2020-11-12_15_40_15-6428385273524285797/operation_result.
3. Transient errors occurred, please try again.

For 1, I see no useful logs. For 2, the service account is the default service account, so it should have all the required permissions.

How can I debug this further?

Here is my Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

ADD localdeps localdeps
COPY requirements.txt .
COPY main.py .
COPY setup.py .
COPY bq_field_pb2.py .
COPY bq_table_pb2.py .
COPY core_pb2.py .

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

RUN pip install -U  --no-cache-dir -r ./requirements.txt

I'm following this guide: https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates


Solution

  • A possible cause of this issue is the requirements.txt file. If apache-beam is listed in the requirements file, the flex template launch shows exactly the symptoms you describe: the job stays in the Queued state for some time and finally fails with "Timeout in polling result file".

    The reason is a known issue that affects only flex templates. The same jobs run properly locally or with Standard Templates.

    The solution is to remove apache-beam from requirements.txt and install it separately in the Dockerfile:

    RUN pip install -U apache-beam==<your desired version>
    RUN pip install -U -r ./requirements.txt
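
    To confirm whether this applies to your project, you can check whether requirements.txt names apache-beam at all, including variants such as apache-beam[gcp] or apache_beam. The following is a minimal sketch, not part of Beam or the Dataflow tooling; the function name find_beam_requirements is hypothetical, and it only inspects the text it is given, so it does not follow "-r other.txt" includes.

```python
import re


# Hypothetical helper for illustration; not part of Beam or Dataflow tooling.
def find_beam_requirements(requirements_text):
    """Return the requirement lines that name the apache-beam package."""
    hits = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        # The package name ends at the first extras, version, marker or URL separator.
        name = re.split(r"[\s\[<>=!~;@]", line, maxsplit=1)[0]
        # Compare names case-insensitively, treating "-", "_" and "." as equivalent.
        if re.sub(r"[-_.]+", "-", name).lower() == "apache-beam":
            hits.append(line)
    return hits
```

    Calling it on the contents of requirements.txt, for example find_beam_requirements(open("requirements.txt").read()), returns the offending lines. If the list is non-empty, remove those lines from requirements.txt and keep the separate pip install of apache-beam in the Dockerfile as shown above.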