apache-kafkaapache-kafka-streamsrocksdb

RocksDB data loss


I have a stateful Kafka Streams application with logging disabled. The reason I had to disable it is because the aggregated state for some records can get very big, and sending an update to the changelog topic on every change is not feasible.

Therefore a StatefulSet is being used, where the rocksdb files are preserved upon restart. I was reading here that Kafka Streams disable the rocksdb transaction log by default since it uses the changelog topic for fault tolerance, but in my case I guess it should be enabled. How can this be configured?

After reading this:

Rocksdb is not highly available and does not have a failover scheme. That doesn’t mean you lose your state store data when you store it in RocksDB using Kafka Streams, because it’s Kafka Streams that makes RocksDB fault tolerant by replicating the state store data to a Kafka topic.

I got a little worried, since I am disabling the changelog topic. Are there any other precautions that need to be taken besides enabling transaction log, since I am relying solely on rocksdb files for fault tolerance?


Solution

  • It's disabled in Kafka Streams's default RocksDB options because the WAL is supposed to be Kafka itself:

    wOptions = new WriteOptions();
    wOptions.setDisableWAL(true);
    

    Not relying on changelog topics makes many things tricky as RocksDB is not meant to be a source of truth (Kafka is) but just as a helpful tool to do key-value store queries.