Kafka enable.auto.commit is set to false and the Spark version is 2.4.
If we use the latest offset, do we need to find the last offset details manually and pass them to createDirectStream() in the Spark application, or will it pick up the latest offset automatically? In either case, do we ever need to look up the last offset details ourselves?
Is there any difference between using SparkSession.readStream.format("kafka")... and KafkaUtils.createDirectStream()?
When using the earliest offset option, will it pick up the offset automatically?
Here is my attempt to answer your questions:
enable.auto.commit is a Kafka parameter, and when it is set to false you have to commit (i.e. update) your offsets manually, in this case to the checkpoint directory. If your application restarts, it looks into the checkpoint directory and starts reading from the last committed offset + 1. The same is described by Jacek Laskowski here: https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-properties-enable-auto-commit.html. There is no need to specify the offset anywhere in your Spark application; all you need is the checkpoint directory. Also remember that offsets are maintained per topic partition per consumer, so it would be unreasonable for Spark to expect developers/users to provide that by hand.
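
To make that concrete, here is a minimal sketch of the DStream path from your question. All the names are assumptions for illustration (a broker at localhost:9092, a topic demo-topic, a group demo-group, and paths under /tmp). No offsets are passed to createDirectStream(); auto.offset.reset is only honoured on the very first run, and after a restart the offsets recovered from the checkpoint directory take over.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaCheckpointDemo {
  val checkpointDir = "/tmp/checkpoints/kafka-demo" // hypothetical path

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-offset-demo"), Seconds(10))
    ssc.checkpoint(checkpointDir) // offsets for every topic partition are tracked here

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",            // hypothetical broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "demo-group",
      "auto.offset.reset"  -> "latest",                     // used only when no saved offsets exist
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // No offsets are specified explicitly; the stream starts from auto.offset.reset
    // on the first run and from the checkpointed offsets after a restart.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

    stream.map(record => record.value).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover the context (and its offsets) from the checkpoint if one exists.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

For completeness, the spark.readStream.format("kafka") variant from your second question behaves the same way with respect to offsets: startingOffsets only matters for a brand-new query, while a restarted query resumes from the offsets recorded under checkpointLocation. Again, the broker, topic, and paths below are assumptions, not something from your setup.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-structured-demo").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
  .option("subscribe", "demo-topic")
  .option("startingOffsets", "latest")                 // honoured only when no checkpoint exists yet
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "/tmp/demo-output")                      // hypothetical sink path
  .option("checkpointLocation", "/tmp/checkpoints/query")  // offsets are tracked and recovered here
  .start()
  .awaitTermination()
```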