apache-kafkaapache-kafka-streamsktable

Kafka repartitioning ( for group by based on key)


When we apply a group by function on a stream based on some key, how does kafka calculates this as same key may be present in different partitions ? I was seeing through() function which basically repartitions the data, but i don't understand what does it mean. Will it move all the messages with same key into a single partition? Also how frequently we can call through() method ? Can we call it after receiving each message if there is a requirement ? Please suggest. Thanks


Solution

  • Data in Kafka is (by default) always partitioned by key. If you call groupBy() the grouping attribute is set as message key and thus, when the data is written into the repartition topic, all records with the same key are written into the same partition. Thus, when data is read back, the aggregation can be computed correctly in the aggregate() function.

    Note, that Kafka Streams performs this repartitioning automatically (including the creation of the required topic). Calling repartition() (or through()) would achieve the same, but it's not necessary.

    Also note that a Kafka Streams program is a dataflow program. When using the DSL, you only specify the dataflow program itself, but nothing is processed yet. Only when you call KafkaStreams#start() the dataflow program will be executed.