I'm using Flink 1.19. When one of the operators throws an exception (e.g. a sink is unavailable for some reason), the job goes into a restart loop. I'd like to be able to cancel the job with a savepoint in such situations. Normally, to cancel a job with a savepoint, I call the Flink REST API:
curl -s -XPOST localhost:8081/jobs/2f23bde95c740a0f8f83d00ce6dfdacc/savepoints -d "{\"cancel-job\": true, \"target-directory\": \"s3://bucket-name/savepoints\"}"
This works well when the job is in the RUNNING state, but the command is ignored when the job is in the RESTARTING state. I know that the following command cancels (terminates) the job abruptly, even in the RESTARTING state:
curl -X PATCH localhost:8081/jobs/1db48d5f38b44a0736d3e15d09f5d013
The question is: is it possible to create a savepoint while a job is restarting? And a related question: should I make my sinks exception-safe or fail-fast?
I don't believe it's possible to take a savepoint unless the job is successfully running.
What you may want to do is adjust your restart strategy so that the job eventually fails. Then, once the problem has been corrected, you can manually restart it from the latest checkpoint.
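As a rough sketch of what a bounded restart strategy could look like in the Flink configuration file, assuming a fixed-delay strategy suits your job (the attempt count and delay below are illustrative values, not recommendations):

```yaml
# Give up after a limited number of restarts instead of looping forever.
# The values 3 and "10 s" are placeholders; tune them for your sink's typical outage.
restart-strategy.type: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

Once the attempts are exhausted, the job should transition to FAILED rather than stay in RESTARTING, and you can resubmit it from the last retained checkpoint after fixing the sink.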
You might also want to change your setting for externalized checkpoint retention to RETAIN_ON_CANCELLATION, so that the latest checkpoint is kept even if you end up cancelling the job.
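A minimal sketch of the corresponding configuration entries, assuming checkpoints are written to the same bucket as your savepoints (the checkpoint directory and interval are hypothetical, so substitute your own):

```yaml
# Keep the externalized checkpoint when the job is cancelled.
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
# Assumed values for illustration only.
execution.checkpointing.interval: 1 min
state.checkpoints.dir: s3://bucket-name/checkpoints
```

With retention enabled, the checkpoint directory is not deleted on cancellation, so it is up to you to clean up old checkpoints once they are no longer needed.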