[SOLVED] Flink optimal configuration for minimum Latency

For a Flink streaming job or a Flink Stateful Functions job, it is known that setting setBufferTimeout to a small value (e.g., 5 ms) gives the 'best' latency. What other configuration values should one set, reset, or modify when optimizing for latency in Flink streaming or Stateful Functions jobs?

Solution

End-to-end latency is affected by many factors. Ignoring latency accrued before events are ingested by Flink, that leaves these issues to consider:

network buffer timeout
serialization
object reuse
watermarking delay (for accommodating out-of-order events)
auto-watermarking interval
state access (state backend dependent)
garbage collection
timers
aggregation (e.g., windowing)
transactional sinks
checkpointing
backpressure

Take advantage of operator chains. Avoid unnecessary use of keyBy and changes of parallelism. Use reinterpretAsKeyedStream where appropriate.

The points above will help avoid unnecessary serialization, but you should also take care to optimize the serialization that remains. Using a slow serializer can have an enormous impact, as can using complex, deeply nested collection types where something simpler would do.

You should always enable object reuse. By default, Flink defensively makes copies of objects being passed down operator chains. When enabling object reuse, keep in mind that it is not safe to

remember input object references across function calls or
modify input objects

If you avoid those two points, you may

modify an output object and emit it again
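
To make the first rule concrete, here is a plain-Java sketch (no Flink involved) of what goes wrong when a function remembers an input reference while the caller keeps reusing the same instance. The class and method names are hypothetical and only mimic the reuse behaviour; in an actual job the switch itself is env.getConfig().enableObjectReuse().

```java
import java.util.ArrayList;
import java.util.List;

final class ObjectReuseDemo {
    /** Mutable record standing in for an event type whose instance is reused between calls. */
    static final class Event {
        long value;
        Event(long value) { this.value = value; }
    }

    /** Unsafe: remembers the input reference across calls. */
    static List<Event> rememberReferences(long[] values) {
        Event reused = new Event(0);
        List<Event> remembered = new ArrayList<>();
        for (long v : values) {
            reused.value = v;        // the caller overwrites the same instance
            remembered.add(reused);  // every entry aliases that one instance
        }
        return remembered;
    }

    /** Safe: copies the fields it needs before the call returns. */
    static List<Event> rememberCopies(long[] values) {
        Event reused = new Event(0);
        List<Event> remembered = new ArrayList<>();
        for (long v : values) {
            reused.value = v;
            remembered.add(new Event(reused.value));
        }
        return remembered;
    }

    public static void main(String[] args) {
        long[] input = {1, 2, 3};
        System.out.println(rememberReferences(input).get(0).value); // prints 3
        System.out.println(rememberCopies(input).get(0).value);     // prints 1
    }
}
```

The unsafe variant ends up holding three references to one object, so the first remembered value silently becomes the last one seen.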

If you are using event time processing, the optimal situation would be to be able to rely on having ascending timestamps, and to generate watermarks accordingly (with zero delay). If you are doing windowing, doing pre-aggregation will avoid load spikes as windows are closed, and configuring a short auto-watermarking interval will help minimize latency.
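
The cost of a watermarking delay can be worked out with simple arithmetic. The sketch below is plain Java, not Flink code. It assumes the convention used by Flink's bounded-out-of-orderness watermarks (watermark = highest timestamp seen, minus the delay, minus 1) and that an event-time window fires once the watermark reaches the window's last timestamp. The class and method names are hypothetical.

```java
final class WatermarkDelayDemo {
    /** Watermark after seeing maxTimestampSeen, allowing events to be up to allowedDelayMs late. */
    static long watermark(long maxTimestampSeen, long allowedDelayMs) {
        return maxTimestampSeen - allowedDelayMs - 1;
    }

    /** True once a window covering [start, windowEnd) is allowed to fire. */
    static boolean windowCanFire(long windowEnd, long maxTimestampSeen, long allowedDelayMs) {
        // The window's last valid timestamp is windowEnd - 1.
        return watermark(maxTimestampSeen, allowedDelayMs) >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long windowEnd = 1000;
        // Ascending timestamps (zero delay): an event at t=1000 closes the window.
        System.out.println(windowCanFire(windowEnd, 1000, 0));   // prints true
        // With a 200 ms delay, the same window waits for an event at t=1200.
        System.out.println(windowCanFire(windowEnd, 1199, 200)); // prints false
        System.out.println(windowCanFire(windowEnd, 1200, 200)); // prints true
    }
}
```

On top of this event-time delay, watermarks are only emitted periodically, so the auto-watermarking interval adds up to one interval of additional waiting.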

The FsStateBackend (superseded by the HashMapStateBackend since Flink 1.13) maintains state as objects on the heap, which are then subject to garbage collection. This state backend has the best average latency, but you will want to tune your garbage collector carefully to avoid GC stalls. While much slower overall, the RocksDB state backend may have better worst-case latency, especially if you need to run with many task slots per task manager. With the FsStateBackend, one slot per task manager keeps the scope for GC smaller, which helps reduce latency.

Avoid having many timers that fire simultaneously. Arrange for windows for different keys to fire at different times.
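
One way to spread timers out is to shift each key's firing grid by a deterministic per-key offset. The following plain-Java sketch shows only the arithmetic. The class and method names are hypothetical, and using the key's hash code as the offset source is an assumption made for illustration. In a Flink job the computed timestamp would be handed to something like ctx.timerService().registerProcessingTimeTimer(...).

```java
final class TimerStaggerDemo {
    /** Deterministic per-key offset in the range [0, intervalMs). */
    static long offsetFor(String key, long intervalMs) {
        return Math.floorMod((long) key.hashCode(), intervalMs);
    }

    /** Next firing time strictly after now, on a grid of period intervalMs shifted by the key's offset. */
    static long nextFiringTime(String key, long now, long intervalMs) {
        long offset = offsetFor(key, intervalMs);
        long shifted = now - offset;
        // Start of the current period on the shifted grid.
        long periodStart = shifted - Math.floorMod(shifted, intervalMs);
        return periodStart + offset + intervalMs;
    }

    public static void main(String[] args) {
        // Without staggering, both keys would fire at t=6000.
        System.out.println(nextFiringTime("a", 5000, 1000)); // prints 5097
        System.out.println(nextFiringTime("b", 5000, 1000)); // prints 5098
    }
}
```

Each key still fires once per interval, but the firings are distributed across the interval instead of arriving as a single burst.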

Keep in mind that downstream consumers of transactional sinks will experience latency that is governed by the checkpointing interval.

If you don't need exactly-once guarantees, disable checkpoint barrier alignment by configuring checkpointing to use CheckpointingMode.AT_LEAST_ONCE.

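If you do need exactly-once guarantees, unaligned checkpoints can, in some cases, be very helpful, since they avoid blocking input channels while barriers are being aligned.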
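
Several of the settings mentioned in this answer are applied on the execution environment. The following is a minimal configuration sketch, assuming the Flink 1.x DataStream API (roughly 1.11 and later, where CheckpointingMode lives in org.apache.flink.streaming.api). The class name and all numeric values are illustrative assumptions, not recommendations, and should be tuned against your own workload.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LowLatencyJobConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Flush network buffers after at most 5 ms, even if they are not full.
        env.setBufferTimeout(5);

        // Skip defensive copies between chained operators (see the safety rules above).
        env.getConfig().enableObjectReuse();

        // Emit periodic watermarks every 50 ms (illustrative value).
        env.getConfig().setAutoWatermarkInterval(50);

        // At-least-once checkpointing: no barrier alignment. The 10 s interval is an assumption.
        env.enableCheckpointing(10_000, CheckpointingMode.AT_LEAST_ONCE);

        // Alternative if exactly-once is required: keep EXACTLY_ONCE and try unaligned checkpoints.
        // env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);
        // env.getCheckpointConfig().enableUnalignedCheckpoints();

        // Define sources, operators and sinks here, then call env.execute(...).
    }
}
```

A very small buffer timeout trades throughput for latency, so it is worth measuring both before settling on a value.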

And finally, do whatever you can to avoid backpressure. Give the job more-than-adequate resources. Don't do any blocking I/O in your user functions. Try to avoid data skew (hot keys).