How should we decide on the optimal heartbeat.timeout configuration for Flink jobs? I am using Flink 1.10.3 and my service fails with a heartbeat timeout exception. Currently the default value of 50 seconds is in use.
In my Flink job I tried increasing heartbeat.timeout from 50 s to 5 min, but it did not help and the exception kept occurring. In my case the heartbeat timeouts were only a symptom: the task managers were crashing because heap memory was exhausted, and a crashed task manager stops sending heartbeats no matter how long the timeout is. So I changed taskmanager.memory.managed.fraction from 0.4 to 0.05, which leaves more of the task manager's memory for the heap. Since then, heartbeat failures are less frequent and the pipeline is also able to restart after failures.
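
To make the effect of the fraction change concrete, here is a rough back-of-the-envelope sketch of how the task heap is derived from the total Flink memory. It assumes the Flink 1.10 defaults as I understand them (128 MB framework heap, 128 MB framework off-heap, no task off-heap, network fraction 0.1 clamped between 64 MB and 1 GB). The 4096 MB total and the function name `estimate_task_heap_mb` are hypothetical and only for illustration, so check the numbers against your own setup.

```python
def estimate_task_heap_mb(total_flink_mb, managed_fraction,
                          network_fraction=0.1,
                          network_min_mb=64, network_max_mb=1024,
                          framework_heap_mb=128,
                          framework_off_heap_mb=128,
                          task_off_heap_mb=0):
    """Estimate the task heap (MB) left after the other memory components.

    The default values are assumptions based on the Flink 1.10 defaults;
    override them if your flink-conf.yaml sets them explicitly.
    """
    # Managed memory is a fraction of the total Flink memory.
    managed_mb = total_flink_mb * managed_fraction
    # Network memory is also a fraction, clamped to a [min, max] range.
    network_mb = min(max(total_flink_mb * network_fraction, network_min_mb),
                     network_max_mb)
    # Whatever remains after the fixed and derived components is task heap.
    return (total_flink_mb - managed_mb - network_mb
            - framework_heap_mb - framework_off_heap_mb - task_off_heap_mb)


TOTAL_FLINK_MB = 4096  # hypothetical total Flink memory, not from my actual job

before = estimate_task_heap_mb(TOTAL_FLINK_MB, managed_fraction=0.4)
after = estimate_task_heap_mb(TOTAL_FLINK_MB, managed_fraction=0.05)

print(f"task heap with managed fraction 0.40: {before:.1f} MB")
print(f"task heap with managed fraction 0.05: {after:.1f} MB")
print(f"extra heap gained: {after - before:.1f} MB")
```

Under these assumptions, lowering the fraction moves 35% of the total Flink memory from managed memory to the heap. Keep in mind that managed memory is used by the RocksDB state backend and by batch operators, so a value as low as 0.05 is probably only reasonable when the job does not depend on them.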