Tags: mongodb, apache-kafka, architecture, apache-kafka-streams

Kafka Streams state store vs MongoDB for state management


I am working on a distributed system that uses Kafka Streams for communication between components. One of the components (call it BRAIN for simplicity) manages a sequence of messages to the other components (A, B, C, D, E, F, G).

The flow looks like this:

BRAIN sends a message to component A and waits for its completion feedback. Once A responds, BRAIN sends messages to B, C, and D, because they can work in parallel, and waits for feedback from all three. After receiving all three responses, BRAIN sends messages to E, F, and G sequentially (each message is sent only after the completion feedback from the previous component arrives).

To track the completion status of these components and determine when to send the next message, BRAIN requires a state store.
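Independent of where the state is stored, the orchestration itself can be modeled as a small state machine. Here is a minimal sketch of that idea (the `Workflow` class, the `Phase` names, and the `onComplete` method are my own illustration, not something from the question):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of BRAIN's orchestration as a state machine.
// All names here are illustrative, not from any real framework.
class Workflow {
    enum Phase { WAIT_A, WAIT_BCD, WAIT_E, WAIT_F, WAIT_G, DONE }

    Phase phase = Phase.WAIT_A;
    final Set<String> pending = new HashSet<>(Set.of("A"));

    // Called when a completion feedback arrives; returns the next messages to send.
    List<String> onComplete(String component) {
        // Ignore duplicate/unknown feedback, or keep waiting if others are still pending.
        if (!pending.remove(component) || !pending.isEmpty()) {
            return List.of();
        }
        switch (phase) {
            case WAIT_A:
                phase = Phase.WAIT_BCD;
                pending.addAll(Set.of("B", "C", "D"));
                return List.of("B", "C", "D"); // parallel fan-out
            case WAIT_BCD:
                phase = Phase.WAIT_E; pending.add("E"); return List.of("E");
            case WAIT_E:
                phase = Phase.WAIT_F; pending.add("F"); return List.of("F");
            case WAIT_F:
                phase = Phase.WAIT_G; pending.add("G"); return List.of("G");
            case WAIT_G:
                phase = Phase.DONE; return List.of();
            default:
                return List.of();
        }
    }
}
```

Whatever backs the state (a Kafka Streams store or MongoDB), this is the data that has to survive a crash of BRAIN: the current phase plus the set of components whose feedback is still pending.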

Kafka Streams naturally supports stateful processing through internal state stores and changelog topics. However, we find Kafka Streams hard to manage and debug: almost all of our errors are related to the stateful components (possibly because they were not configured or developed correctly), and we often end up cleaning all the topics so we can restart from scratch. While this may be acceptable in a development environment, it would be a BIG problem in production, and our project manager is understandably worried about it. Last but not least, Kafka Streams has a steep learning curve, especially for junior developers, so when errors happen, usually only the seniors and architects know how to fix them.

So I was thinking: why not use MongoDB to store the state instead of Kafka Streams' state store?

Kafka would continue to be used for communication between components (producing and consuming messages), while BRAIN writes and reads state information (e.g., the completion status of A, B, C, etc.) directly to/from MongoDB.

What are the trade-offs compared to using Kafka Streams’ native state management, and what considerations should I keep in mind when designing a system where MongoDB handles state for a Kafka-based workflow?

The only issue I can think of is receiving completion feedback from parallel components at the same instant. Potentially this could leave the workflow permanently idle, although it seems like a remote possibility.
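For what it's worth, that particular race disappears if "mark this component complete and check whether all are done" is a single atomic operation rather than a read followed by a write. With MongoDB that would typically be one `findOneAndUpdate` on the workflow document; the sketch below shows the same pattern in plain Java, using an `AtomicInteger` as a stand-in for the database (the `CompletionTracker` class and its method names are illustrative, not from any library):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for the workflow document in MongoDB: the decrement and the
// "did it reach zero?" check happen as one atomic operation, so exactly one
// of several simultaneous feedbacks observes completion. With the real
// database this would correspond to a single findOneAndUpdate that updates
// the pending state and returns the updated document.
class CompletionTracker {
    private final AtomicInteger remaining;

    CompletionTracker(int componentCount) {
        this.remaining = new AtomicInteger(componentCount);
    }

    // Returns true exactly once: for the feedback that completes the last
    // component (assuming each component reports exactly once).
    boolean completeAndCheckAllDone() {
        return remaining.decrementAndGet() == 0;
    }
}
```

The crucial design point is that a naive "read the status document, then write it back" from two consumers at once can lose an update or have both (or neither) of them trigger the next step; pushing the check-and-act into one atomic database operation avoids the permanent-idle scenario.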


Solution

  • Kafka Streams isn't a state store; it's a processing framework.

    The default store is RocksDB. Yes, there's a learning curve. Yes, you need to tune it (the defaults are less than ideal). Yes, you may spend hours troubleshooting... but why wouldn't MongoDB involve the same effort (assuming you start with zero experience of either)?

    In any case, the StateStoreSupplier interface can be implemented, so you can write state wherever you want; there are examples on GitHub for Solr, Neo4j, Redis, etc. What you lose, however, are the precise transactional guarantees.