google-cloud-dataflow google-genomics

Why are Google Pipeline VM instances hanging indefinitely?


I am using Dockerflow to run parallel tasks through the Google Pipelines API on Google Cloud Platform. I started a single-step task running 1389 VMs in parallel and found that 233 of the VMs were apparently doing nothing and hanging indefinitely.
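A minimal sketch of how the still-running operations can be counted (assuming the google-api-python-client and that the v1alpha2 Genomics discovery document is still available; the project ID and filter string are placeholder assumptions, not taken from the actual setup):

# Minimal sketch: count Pipelines API operations that are still running.
# Assumes Application Default Credentials; "my-project" and the filter
# string are placeholder assumptions.
from googleapiclient.discovery import build

service = build("genomics", "v1alpha2")

running = []
request = service.operations().list(
    name="operations",
    filter="projectId = my-project AND status = RUNNING",
    pageSize=256,
)
while request is not None:
    response = request.execute()
    running.extend(response.get("operations", []))
    request = service.operations().list_next(request, response)

print("operations still running:", len(running))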

I did a spot check of the serial console output and repeatedly saw the VMs running into "Getting controller config failed" errors.
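A minimal sketch of that kind of spot check via the Compute Engine API (the project, zone, and instance names below are placeholders):

# Minimal sketch: spot-check serial console output for controller-config errors.
# "my-project", "us-east1-b", and the instance names are placeholders.
from googleapiclient.discovery import build

compute = build("compute", "v1")

for instance in ("ggp-instance-1", "ggp-instance-2", "ggp-instance-3"):
    output = compute.instances().getSerialPortOutput(
        project="my-project", zone="us-east1-b", instance=instance
    ).execute()
    if "Getting controller config failed" in output.get("contents", ""):
        print(instance, "cannot reach the controller config endpoint")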

When I tried logging into the VMs I received the error: "Connection Failed. We are unable to connect to the VM on port 22".

I am wondering why my VM instances are hanging, and if there is something I can do to avoid running into these issues.

I've included a snippet of the serial console output below:

startupscript: +++ readlink -f /usr/share/google-genomics/startup.sh
startupscript: ++ dirname /usr/share/google-genomics/startup.sh
startupscript: + cd /usr/share/google-genomics
startupscript: + ./controller --operation_id <id> --validation_token <token> --base_path https://genomics.googleapis.com
create controller[2905]: Getting controller config
create controller[2905]: Getting controller config failed, will retry: Get <link>: Get <service_account_token_link>: net/http: timeout awaiting response headers
create controller[2905]: Getting controller config failed, will retry: Get <link>: dial tcp 74.125.26.95:443: i/o timeout
collectd[2342]: write_gcm: Asking metadata server for auth token
collectd[2342]: write_gcm: curl_easy_perform() failed: Couldn't connect to server
collectd[2342]: write_gcm: Error -1 from wg_curl_get_or_post
collectd[2342]: write_gcm: wg_transmit_unique_segment failed.
collectd[2342]: write_gcm: wg_transmit_unique_segments failed. Flushing.

Solution

  • There was a temporary networking issue in us-east1-b; the three VMs referenced above were all in us-east1-b. Minor incidents like this do not appear on https://status.cloud.google.com/

    Serial console output for a successful run looks like:

    Feb 21 19:05:06 ggp-5629907348021283130 startupscript: + ./controller --operation_id --validation_token --base_path https://autopush-genomics.sandbox.googleapis.com
    Feb 21 19:05:06 ggp-5629907348021283130 create controller[2689]: Getting controller config
    Feb 21 19:05:36 ggp-5629907348021283130 create controller[2689]: Getting controller config failed, will retry: Get https://genomics.googleapis.com/v1alpha2/pipelines:getControllerConfig?alt=json&operationId=&validationToken=: dial tcp 173.194.212.81:443: i/o timeout
    Feb 21 19:05:43 ggp-5629907348021283130 controller[2689]: Switching to status: pulling-image
    Feb 21 19:05:43 ggp-5629907348021283130 controller[2689]: Calling SetOperationStatus(pulling-image)
    Feb 21 19:05:44 ggp-5629907348021283130 controller[2689]: SetOperationStatus(pulling-image) succeeded

    The "Getting controller config failed, will retry" is fine. It succeeded upon retry. The "SetOperationStatus(pulling-image) succeeded" indicates networking is working.

    In theory, you can submit any number of jobs to the Pipelines API and it will take care of queueing.

    If these temporary networking hiccups become common, we may consider changing the Pipelines API to detect and retry them automatically. In the meantime, stuck operations can be caught on the client side, as in the sketch below.
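    A minimal sketch of such a client-side check, assuming the v1alpha2 operations endpoint and its metadata layout (the operation names and the 60-minute cutoff are placeholder assumptions):

    # Minimal sketch: cancel operations that show no progress after a cutoff so
    # they can be resubmitted. Operation names and the cutoff are placeholders;
    # the metadata fields read here follow the v1alpha2 operation format.
    import datetime

    from googleapiclient.discovery import build

    service = build("genomics", "v1alpha2")
    CUTOFF = datetime.timedelta(minutes=60)

    def is_stuck(op):
        """Treat a running operation with no recorded events past the cutoff as stuck."""
        meta = op.get("metadata", {})
        if op.get("done") or meta.get("events"):
            return False
        created = datetime.datetime.strptime(meta["createTime"][:19], "%Y-%m-%dT%H:%M:%S")
        return datetime.datetime.utcnow() - created > CUTOFF

    for name in ("operations/EXAMPLE-ID-1", "operations/EXAMPLE-ID-2"):
        op = service.operations().get(name=name).execute()
        if is_stuck(op):
            service.operations().cancel(name=name, body={}).execute()
            print("cancelled", name, "- resubmit it")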