apache-kafkaapache-flinkflink-streamingexactly-once

End-to-end Exactly-once processing in Apache Flink


Apache Flink guarantee exactly once processing upon failure and recovery by resuming the job from a checkpoint, with the checkpoint being a consistent snapshot of the distributed data stream and operator state (Chandy-Lamport algorithm for distributed snapshots). This guarantee exactly once upon failover.

In case of normal cluster operation, how does Flink guarantee exactly once processing, for instance given a Flink source that reads from external source (say Kafka), how does Flink guarantee that the event is read one time from the source? are there any kind of application level acking between the event source and Flink source? Also, how does Flink guarantee that events are propogated exactly once from upstream operators to downstream operators? Does that require any kind of acking for received events as well?


Solution

  • Flink does not guarantee that every event is read once from the sources. Instead, it guarantees that every event affects the managed state exactly once.

    Checkpoints include the source offsets, and during a checkpoint restore, the sources are rewound and some events may be replayed. That's fine, because the checkpoint included the state throughout the job that had resulted from reading everything up to the offsets that were stored in the checkpoint, and nothing beyond those offsets.

    Thus Flink's exactly once guarantee requires replayable sources. Exactly once messaging between operators depends on tcp.

    Guaranteeing that the sinks don't receive duplicated results further requires transactional sinks. Flink commits transactions as part of checkpointing.