Tags: google-cloud-platform, google-api, google-cloud-run, orchestration, google-workflows

Intermittent ConnectionFailedError Between Google Cloud Workflows and Cloud Run


Issue Description:

We have a workflow in Google Cloud Workflows that occasionally hits a "ConnectionFailedError" during an HTTP POST call to our Cloud Run service. The error occurs sporadically and causes the workflow to fail before the configured timeout (1800 seconds) has elapsed.

Steps Taken:

  1. We have reviewed our network connectivity and confirmed that there are no firewall rules or networking configurations blocking communication between Google Cloud Workflows and Cloud Run.

  2. We have verified the availability and correct URL of our Cloud Run service to ensure it is deployed and accessible.

  3. Error handling has been implemented in our workflows to catch the "ConnectionFailedError" and attempt retries with a suitable retry strategy.

  4. Timeout settings have been properly configured on both the Google Cloud Workflows and Cloud Run side.

  5. We have checked for any concurrency limits, rate limits, and authentication/authorization issues that might be contributing to the connection failures.

  6. Logging and monitoring have been implemented to capture relevant information about the connection failures.

Request for Assistance:

Despite our efforts, we have been unable to pinpoint the root cause of these intermittent "ConnectionFailedError" instances.

It would be great to hear from experienced users and the product team on this.


Solution

  • Workflows compiled after May convert some ConnectionError failures into ConnectionFailedError. The new tag exists to expand Cloud Workflows' HTTP retry coverage: ConnectionFailedError is retried by both default_retry_predicate and default_retry_predicate_non_idempotent, whereas ConnectionError is retried only by default_retry_predicate. The Workflows team also acknowledged that the public docs have not yet been updated to include the ConnectionFailedError tag and said this will be done soon. Our code currently handles only the "ConnectionError" tag, as documented on the reference page for http.post (cloud.google.com/workflows/docs/reference/stdlib/http/post). We need to update it to handle the new tag as well, and we are waiting for @google to update the public docs.
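
In the meantime, a custom retry predicate that accepts either tag may be enough. The following is only a minimal sketch, not the product team's recommendation: the Cloud Run URL, the step and subworkflow names, the request body, and the backoff values are all hypothetical placeholders, and it assumes the caught error is a map with a `tags` list, as it is for HTTP call failures.

```yaml
main:
  steps:
    - call_service:
        try:
          call: http.post
          args:
            # Hypothetical Cloud Run URL; replace with your own service.
            url: https://my-service-example-uc.a.run.app/process
            auth:
              type: OIDC
            timeout: 1800   # seconds, matching the timeout in the question
            body:
              job: example  # placeholder payload
          result: response
        retry:
          # Custom predicate defined below; retries on either connection tag.
          predicate: ${retry_on_connection_error}
          max_retries: 5            # illustrative value
          backoff:
            initial_delay: 2        # illustrative values, in seconds
            max_delay: 60
            multiplier: 2
    - done:
        return: ${response.body}

# Returns true when the error carries the old or the new connection tag.
retry_on_connection_error:
  params: [e]
  steps:
    - check_tags:
        switch:
          - condition: ${"ConnectionError" in e.tags or "ConnectionFailedError" in e.tags}
            return: true
    - otherwise:
        return: false
```

Matching on both tags keeps the workflow behaving the same whether or not it was compiled after the change. Note that retrying a POST is only safe if the Cloud Run endpoint is idempotent or can tolerate duplicate requests.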