We have a Flink setup whose Kafka producer currently uses at-least-once semantics. We are considering switching the producer to exactly-once semantics, as this would bring benefits further down the pipeline. According to the documentation, though, this seems to introduce a real data loss risk that we do not currently have: if there is a long downtime because either Flink cannot recover or the Kafka brokers are down, a Kafka transaction can expire and its data will be lost.
If the time between Flink application crash and completed restart is larger than Kafka's transaction timeout, there will be data loss (Kafka will automatically abort transactions that exceeded the timeout).
This seems like a brand-new risk that is not present with at-least-once semantics and cannot be fully mitigated: however large a transaction timeout we set, there can be a real-world outage that exceeds it. It seems to me the best approach would be a very short checkpointing interval, so that transactions are committed quickly, combined with a very large transaction timeout (in the hours) to reduce the chance of data loss. Is my understanding correct?
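For concreteness, this is roughly the configuration I have in mind. It is only a minimal sketch using the KafkaSink API; the broker address, topic, transactional id prefix, and the exact interval and timeout values are placeholders, not our real settings.

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.DeliveryGuarantee;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ExactlyOnceSinkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Short checkpoint interval: transactions are only committed when a
            // checkpoint completes, so this bounds how long one normally stays open.
            env.enableCheckpointing(10_000L); // 10 s, illustrative value

            Properties producerProps = new Properties();
            // Large producer-side transaction timeout (here 1 h) to survive longer
            // recoveries; it must not exceed the broker's transaction.max.timeout.ms,
            // which defaults to 15 min and usually has to be raised accordingly.
            producerProps.setProperty("transaction.timeout.ms", String.valueOf(60 * 60 * 1000));

            KafkaSink<String> sink = KafkaSink.<String>builder()
                    .setBootstrapServers("broker-1:9092")                 // placeholder
                    .setKafkaProducerConfig(producerProps)
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("output-topic")                     // placeholder
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                    // Required with EXACTLY_ONCE so that transactional ids of
                    // different applications do not clash.
                    .setTransactionalIdPrefix("my-app")
                    .build();

            env.fromElements("a", "b", "c").sinkTo(sink);
            env.execute("exactly-once sink sketch");
        }
    }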
Your understanding is correct.
FWIW: this only applies to unplanned downtime. When you upgrade your application, or when you want to shut it down for a longer period, you should always use the "stop" command [1], which commits all external transactions on shutdown.
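For example, assuming you use the standalone CLI, a graceful stop that also takes a savepoint looks roughly like this (the job id and savepoint path are placeholders):

    # Gracefully stop the job with a savepoint; pending Kafka transactions
    # are committed as part of the shutdown.
    ./bin/flink stop --savepointPath /tmp/flink-savepoints <jobId>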