apache-spark apache-kafka spark-structured-streaming

Significance of the Kafka consumer group in Spark Structured Streaming


I am planning to build a Spark Structured Streaming application that reads JSON data from a Kafka topic, parses it, and writes it to a storage layer.

// Read the topic as a streaming DataFrame; kafka.group.id pins the
// consumer group instead of letting Spark generate a unique one.
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .option("kafka.group.id", "myConsumerGroup")
  .load()

As per the Spark docs, the consumer group id is internally generated by Spark, i.e. by default each query generates a unique group id for reading data, or we can specify one using kafka.group.id. I understand that within a consumer group, a Kafka partition can only be consumed by a single consumer at any time. If all we have to do is read, parse, and write, what is the significance of kafka.group.id? Do we need to set it explicitly?


Solution

  • I would avoid using it. From the manuals, there is a specific use-case (a short sketch of the options follows at the end of this answer):

    The Kafka group id to use in the Kafka consumer while reading from Kafka. Use this with caution. By default, each query generates a unique group id for reading data. This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics. In some scenarios (for example, Kafka group-based authorization), you may want to use a specific authorized group id to read data. You can optionally set the group id. However, do this with extreme caution as it can cause unexpected behavior. Concurrently running queries (both batch and streaming) or sources with the same group id are likely to interfere with each other, causing each query to read only part of the data. This may also occur when queries are started/restarted in quick succession. To minimize such issues, set the Kafka consumer session timeout (by setting option "kafka.session.timeout.ms") to be very small. When this is set, option "groupIdPrefix" will be ignored.

    I have been filling data lakes for a while now, and on the last project we simply ran multiple queries, one per app (one app for RAW ingestion, one for REF ingestion, and one for BUS data-zone processing and ingestion), and it worked fine. With the 3.x upgrade we looked at what you are asking about, but left it as is.

    Spark Structured Streaming 'sits on top' of the Kafka consumer API and allocates resources according to what you have defined, so performance should be fine for most workloads; you can always add more resources if needed.
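
    For reference, a minimal sketch of the three ways the group id can be handled. The topic name, servers, group names, and the session timeout value here are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kafka-group-id-sketch")
      .getOrCreate()

    // 1. Default: omit group-id options entirely and Spark generates a
    //    unique group id per query, so queries never interfere with one
    //    another. This is the recommended path for read-parse-write jobs.
    val dfDefault = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "topic1")
      .load()

    // 2. groupIdPrefix: keep Spark's unique-id behaviour but control the
    //    prefix, e.g. to make the generated groups recognizable in Kafka
    //    monitoring tools.
    val dfPrefixed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "topic1")
      .option("groupIdPrefix", "myApp")
      .load()

    // 3. kafka.group.id: only for cases like Kafka group-based
    //    authorization. Per the quoted docs, pair it with a small session
    //    timeout so stale members release partitions quickly on restart.
    //    Setting this makes "groupIdPrefix" ignored.
    val dfFixedGroup = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "topic1")
      .option("kafka.group.id", "myAuthorizedGroup")
      .option("kafka.session.timeout.ms", "10000")
      .load()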