[SOLVED] Spinnaker webhook stage does not allow timeouts longer than 5 minutes

I have set up a pipeline that has a health check stage. It is a webhook stage that hits an endpoint on a custom server. This server checks the health of the deployed application and returns a 200 if it's ready, or a 500 if not.

This stage currently has a timeout of 10 minutes, set through the timeout configuration in the stage's execution options. In the stage JSON it looks like this:

  "overrideTimeout": true,
  "stageTimeoutMs": 600000, // 10 minutes

But the stage still fails at 5 minutes (plus 1 to 9 seconds, which I believe is the time it takes to retry).

If I lower stageTimeoutMs to less than 300000 (less than 5 minutes), it works as expected, but increasing it has no effect. Is there anything else that needs to be configured globally to allow more than 5 minutes for webhook stages?

BTW, the pipeline itself can continue after this stage (so it's not a pipeline timeout, but a timeout on this stage specifically), and manual judgement stages (and some other types) can happily run longer than 5 minutes. Only webhook stages fail.

I'm currently working with Spinnaker 1.19.13

Solution

It looks like a 500 response from the webhook endpoint causes the stage to get stuck in the "Create Webhook" task, which really does seem to have its timeout hardcoded to 5 minutes. By changing my endpoint to return a 202 with a meaningful string at $.status (the statusJsonPath), I was able to get past the "Create Webhook" task and into the "Monitor Webhook" task, whose timeout can be overridden.
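For anyone who wants a starting point, here is a minimal sketch of what such an endpoint could look like, using only the Python standard library. It is not my actual server. The function check_application_ready, the status strings "READY" and "PENDING", and port 8080 are all placeholders I made up for illustration. The sketch also assumes the stage polls the same URL for status and that any 2xx response is accepted there, so verify both against your own setup.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_application_ready() -> bool:
    """Hypothetical placeholder: replace with the real health check."""
    return False


def build_health_response(ready: bool) -> tuple[int, bytes]:
    """Always answer 202 and report readiness in the body, not the status code."""
    # "READY" and "PENDING" are illustrative values, not Spinnaker constants.
    body = {"status": "READY" if ready else "PENDING"}
    return 202, json.dumps(body).encode("utf-8")


class HealthHandler(BaseHTTPRequestHandler):
    def _respond(self) -> None:
        code, payload = build_health_response(check_application_ready())
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self) -> None:
        # Assumed to be used for status polling.
        self._respond()

    def do_POST(self) -> None:
        # Drain any request body before replying.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self._respond()


if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for this sketch.
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

With something like this, the stage's statusJsonPath would be $.status, and the success status configured on the stage would need to match whatever string the endpoint reports when the application is ready ("READY" in this sketch).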

I think this points to a gap in the documentation and some unintuitive behavior on Spinnaker's side, but the solution doesn't feel hacky. If anyone else is facing a similar issue, hopefully this helps.