apache-spark apache-kafka spark-kafka-integration

I have more data in a Kafka topic, but when I extract data using my PySpark application, I am getting only 1 row extracted. How do I fix this?


I have more data in a Kafka topic, but when I extract data using my PySpark application (which I use to extract from different Kafka topics), I am getting only 1 row extracted. Previously, I extracted data from the same topic using the same PySpark application/code without any issues.

One thing I want to highlight: I have tried extracting data from the topic multiple times, both from the same Databricks notebook and from different Databricks notebooks. My doubt is that I may have extracted data from the same topic from two different notebooks at the same time on the same Databricks instance, and that this caused the issue I am now facing. How do I troubleshoot and fix this?

I am new to Kafka and PySpark.


Solution

  • Previously, I extracted data from the same topic using the same PySpark application/code without any issues.

    If you're reusing the same kafka.group.id, the consumed offsets are tracked under that consumer group, so you'll only consume new data produced after the offsets that were previously consumed and committed. To re-read the whole topic, you'll need to reset the consumer group offsets using the Kafka tools.
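
    As an alternative to resetting the group, a batch read can bypass committed group offsets by not setting kafka.group.id and pinning the offset range explicitly. Below is a minimal PySpark sketch of such a read; the broker address and topic name are placeholders, and it assumes the spark-sql-kafka connector is available on the cluster (Databricks runtimes ship with it):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("kafka-batch-read").getOrCreate()

        # Read the full topic as a batch, independent of any committed
        # consumer-group offsets. Broker and topic names below are placeholders.
        df = (
            spark.read
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder
            .option("subscribe", "my_topic")                     # placeholder
            .option("startingOffsets", "earliest")
            .option("endingOffsets", "latest")
            .load()
        )

        # Kafka keys/values arrive as binary; cast to string to inspect the rows.
        df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(truncate=False)

    If this read returns the expected number of rows, the earlier job was skipping data because of previously committed offsets, not because the data is missing from the topic.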