apache-kafka, apache-kafka-streams

How long is the data in a KTable stored?


As a reference, consider a stream of profile updates stored in a KTable object.

  1. How long will this data be stored in the KTable object?
  2. Say we run multiple instances of the application, and one instance crashes. What happens to the KTable data belonging to that instance? Will it be "recovered" by another instance?

I am thinking about storing updates for data that is rarely updated. So if an instance crashes and another instance has to rebuild that data from scratch, it is possible it will never see those records again, because they are streamed again only very rarely, if ever.
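For context, here is a minimal sketch of the setup described above, assuming a hypothetical topic named profile-updates with string keys and values (the topic, store, and application names are illustrative, not from the question):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;

    import java.util.Properties;

    public class ProfileTableExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "profile-table-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            StreamsBuilder builder = new StreamsBuilder();

            // Each record in "profile-updates" overwrites the previous value
            // for its key, so the table always holds the latest profile per key.
            KTable<String, String> profiles = builder.table(
                    "profile-updates",
                    Consumed.with(Serdes.String(), Serdes.String()),
                    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("profiles-store"));

            new KafkaStreams(builder.build(), props).start();
        }
    }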


Solution

  • The KTable is backed by a topic, so how long its data is stored depends on that topic's retention and cleanup policies.

    If the cleanup policy is compact, then the latest value for each unique key is stored "forever", or until the broker runs out of space, whichever is sooner (see the first sketch after this answer).

    If you run multiple instances, then each KTable holds only a subset of the data, from the partitions its instance consumed; no single table has all the data.

    If any instance crashes or moves without persistent storage configured, it will need to read all data from the beginning of its changelog topic, but you can configure standby replicas to account for that scenario (see the second sketch below).

    More info at https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
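To illustrate the compaction point, here is a sketch that creates the hypothetical profile-updates source topic with cleanup.policy=compact via the Kafka AdminClient, so the latest update per key is retained indefinitely rather than expiring under time-based retention (partition and replication counts are illustrative):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class CreateCompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // With cleanup.policy=compact, the broker keeps at least the
                // most recent record for each key instead of deleting by age.
                NewTopic topic = new NewTopic("profile-updates", 3, (short) 1)
                        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                        TopicConfig.CLEANUP_POLICY_COMPACT));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }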
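And for the crash scenario, a sketch of the relevant Kafka Streams settings: num.standby.replicas keeps passive copies of each state store on other instances so a failed instance's tables can be taken over without replaying the entire changelog, and a persistent state.dir lets a restarted instance reuse its local state (all values here are illustrative):

    import org.apache.kafka.streams.StreamsConfig;

    import java.util.Properties;

    public class StandbyReplicaConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "profile-table-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            // Maintain one passive copy of each state store on another
            // instance; on failover, that instance promotes its standby
            // instead of replaying the whole changelog topic.
            props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

            // A persistent state directory lets an instance restarted on the
            // same host reuse its local RocksDB state instead of rebuilding
            // it from scratch.
            props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");

            // Pass these props to the KafkaStreams constructor as usual.
        }
    }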