In our production clusters we see a pattern: over 14 days, Flink CPU usage climbs until the container is killed.
Flink heap usage also grows, as shown in the graph below. Our initial theory is that the extra CPU is caused by this heap growth (more GC activity, object allocation and deallocation).
Please suggest efficient ways to narrow down and resolve the issue.
If this is caused by application code, what are efficient tools for pinpointing exactly where the issue is?
We are not using any checkpointing features.
Thanks a lot!
We used GCViewer to compare day 1 and day 10 and observed that GC activity has increased.
If you can, use a Java profiler (like YourKit) to profile CPU activity, so that you actually know what is causing the load, versus guessing that it's GC activity.
If you can't do that, it's often possible to run the workflow locally and profile it to determine likely causes of CPU load.
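If neither option is available, a rough first split between GC time and application CPU can be taken from the JVM's standard management beans. The sketch below is only an illustration: `CpuVsGcSampler` and its methods are hypothetical names (not part of Flink), and it assumes the code runs inside the JVM you want to inspect, for example from a background thread started by the job. It samples accumulated GC time and per-thread CPU time at the start and end of a window and reports the difference.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink: estimates how much of a sampling
// window went to GC and which threads used the most CPU in that window.
class CpuVsGcSampler {

    // Accumulated GC time in milliseconds, summed over all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) total += t;
        }
        return total;
    }

    // CPU time in nanoseconds for each live thread id (empty if unsupported).
    static Map<Long, Long> threadCpuNanos() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Long, Long> cpu = new HashMap<>();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) return cpu;
        for (long id : threads.getAllThreadIds()) {
            long ns = threads.getThreadCpuTime(id); // -1 if the thread has died
            if (ns >= 0) cpu.put(id, ns);
        }
        return cpu;
    }

    // Approximate fraction of the window spent in GC. With concurrent or
    // parallel collectors this is an estimate, not exact pause time.
    static double gcFraction(long gcMillisBefore, long gcMillisAfter, long windowMillis) {
        if (windowMillis <= 0) return 0.0;
        return (double) (gcMillisAfter - gcMillisBefore) / windowMillis;
    }

    // Samples for roughly windowMillis and returns a short text report
    // listing the GC share and the topN threads by CPU used in the window.
    static String report(long windowMillis, int topN) {
        long start = System.nanoTime();
        long gcBefore = totalGcMillis();
        Map<Long, Long> cpuBefore = threadCpuNanos();
        try {
            Thread.sleep(windowMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag, report what we have
        }
        long gcAfter = totalGcMillis();
        Map<Long, Long> cpuAfter = threadCpuNanos();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;

        // CPU consumed per thread during the window. Threads that started
        // inside the window are counted from zero.
        List<Map.Entry<Long, Long>> deltas = new ArrayList<>();
        for (Map.Entry<Long, Long> e : cpuAfter.entrySet()) {
            long before = cpuBefore.getOrDefault(e.getKey(), 0L);
            deltas.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue() - before));
        }
        deltas.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        out.append(String.format("GC share of window: %.1f%%%n",
                100 * gcFraction(gcBefore, gcAfter, elapsedMillis)));
        for (int i = 0; i < Math.min(topN, deltas.size()); i++) {
            ThreadInfo info = threads.getThreadInfo(deltas.get(i).getKey()); // null if exited
            String name = (info == null) ? "<exited>" : info.getThreadName();
            out.append(String.format("%-40s %d ms CPU%n",
                    name, deltas.get(i).getValue() / 1_000_000));
        }
        return out.toString();
    }
}
```

Calling something like `CpuVsGcSampler.report(10_000, 5)` on day 1 and again on day 10 and comparing the output may indicate whether the extra load is mostly GC or is concentrated in particular application threads. It is a coarse check and not a substitute for a real profiler.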