python error-handling timeout google-cloud-run indefinite

How do I make my Cloud Run Job last indefinitely?


I have a Job in Cloud Run that ingests data from an external source, and writes that data into Firebase Firestore. I want this job to run indefinitely - 365 days per year, 24 hours per day.

As I understand it, these Cloud Run jobs have a timeout. The timeout is the reason I migrated from Firebase Functions to Cloud Run. My job fails due to the timeout, and then retries with this error:

"Terminating task because it has reached the maximum timeout of 600 seconds. To change this limit, see https://cloud.google.com/run/docs/configuring/task-timeout"

After retrying, the job reconnects to the external source, and starts populating Firestore again. This means, aside from the short interruption, it behaves exactly as I want - as long as it doesn't run out of retries. I can increase the timeout and retries, but this seems like a pretty ugly hack. Also, having the big red error makes me sad 😞

What is the correct way to run a Cloud Run job indefinitely?


Solution

  • Cloud Run has a timeout of 1h and Cloud Run Jobs has a timeout of 24h (soon more, stay tuned). In absolute terms, whatever the service and whatever the cloud provider, something ALWAYS ends up cutting a process short: an outage, a machine that needs patching, a network connectivity issue, and so on. In addition, keep in mind that the serverless timeout is only the maximum for the happy path (no outage of the underlying infrastructure during the run), and that is not 100% guaranteed!

    I'm saying that because you must design your application to be robust to failure. Failure is physically normal and you can't avoid it.

    Then, once you have that design in place, you can rely on its failure-safe behavior every hour, every day, or whenever the service goes down. @Evorlor's example is the right one: set a timeout of 12h and run the job with Cloud Scheduler every 12h.

    The principle is to be able to restart on failure without losing any data. Not to have a super-hero application!
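
To make the "restart without losing data" idea concrete, here is a minimal sketch of one way to structure the ingestion loop. It is only an illustration under assumptions, not the asker's actual code: `run_until_deadline`, `fetch_batch`, `write_record`, `load_checkpoint` and `save_checkpoint` are hypothetical names, and it assumes the external source can be read from a position (a cursor). The loop stops on its own once a time budget is spent, so the task can exit normally before the task timeout, and it saves a checkpoint after every write so the next execution resumes where this one stopped.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def run_until_deadline(
    fetch_batch: Callable[[Optional[int]], Iterable[Tuple[int, dict]]],
    write_record: Callable[[int, dict], None],
    load_checkpoint: Callable[[], Optional[int]],
    save_checkpoint: Callable[[int], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
    idle_sleep: float = 1.0,
) -> int:
    """Ingest records until the time budget is spent, then return normally.

    All callables are hypothetical hooks supplied by the caller:
    fetch_batch(cursor) yields (position, record) pairs after `cursor`,
    write_record stores one record, and the checkpoint pair persists the
    last position that was written.
    """
    deadline = clock() + budget_seconds
    cursor = load_checkpoint()  # None on the very first run
    written = 0
    while clock() < deadline:
        got_any = False
        for position, record in fetch_batch(cursor):
            write_record(position, record)
            # Checkpoint after each write so a crash replays at most one record.
            save_checkpoint(position)
            cursor = position
            written += 1
            got_any = True
            if clock() >= deadline:
                break
        if not got_any:
            # Nothing new from the source: wait briefly before polling again.
            time.sleep(idle_sleep)
    return written
```

With a budget somewhat shorter than the configured task timeout (for example a little under 12h for a 12h timeout), the task finishes successfully instead of being terminated, and the next scheduled execution picks up from the saved checkpoint. If `write_record` uses the position as the document ID when writing to Firestore, replaying a record after a crash overwrites the same document rather than creating a duplicate.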