I’m using Kafka with a topic that has 4 partitions. The retention period (TTL) for messages in Kafka is set to the default of 7 days. I’m running a non-streaming batch job that processes data from Kafka, and I manually store the Kafka offsets after each processing run.
Here’s an example of the saved offsets after a few days of processing:
Day 1 (Offsets saved):
{
"0": 100,
"1": 110,
"2": 90,
"3": 123
}
Day 6 (Offsets saved):
{
"0": 20000,
"1": 21000,
"2": 11000,
"3": 17003
}
By Day 7, Kafka’s retention policy will kick in, and all messages older than 7 days will be automatically deleted.
My Concern:
When new data is produced to Kafka after Day 7, and the old messages have been deleted, I’m wondering what happens with the offsets.
The last processed offset I have stored is around 20000, and I want to make sure that starting to read from offset 20001 the next day will allow me to correctly read the newly produced messages, without encountering any issues (like Kafka reusing old offsets).
Kafka does not reuse offsets, even after the records they referred to have been deleted.
New records are always assigned the next sequential offset on their partition. So if the last record on a partition is at offset 20000, the next record produced to that partition will get offset 20001, regardless of retention or log compaction.
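To make this concrete, here is a minimal sketch of how a batch job could pick its starting position for each partition. It assumes the saved value is the last processed offset (so reading resumes at saved + 1), and that you have already fetched the earliest retained offset and the log end offset per partition from your client (with kafka-python, for example, these would come from beginning_offsets and end_offsets). The function name resume_positions and the beginning/end numbers below are hypothetical, chosen only for illustration.

```python
def resume_positions(saved, beginning, end):
    """Return {partition: offset to start reading from} for the next batch run.

    saved     -- last processed offset per partition, as stored by the job
    beginning -- earliest offset still retained per partition
    end       -- offset the next produced record will receive per partition
    """
    positions = {}
    for partition, last_processed in saved.items():
        next_offset = last_processed + 1

        # If retention has already deleted records past the saved position,
        # the earliest readable record is at the beginning offset.
        if next_offset < beginning[partition]:
            next_offset = beginning[partition]

        # Offsets are never reused, so a saved position beyond the log end
        # suggests the saved offsets belong to a different topic or cluster.
        if next_offset > end[partition]:
            raise ValueError(
                f"partition {partition}: saved offset {last_processed} "
                f"is beyond log end {end[partition]}"
            )

        positions[partition] = next_offset
    return positions


# Saved offsets from Day 6 in the question.
saved = {"0": 20000, "1": 21000, "2": 11000, "3": 17003}

# Hypothetical broker state at the time of the next run.
beginning = {"0": 15000, "1": 21500, "2": 0, "3": 17004}
end = {"0": 20500, "1": 23000, "2": 11001, "3": 17004}

positions = resume_positions(saved, beginning, end)
print(positions)
```

With these made-up numbers, partitions "0", "2" and "3" resume at saved + 1. Partition "1" resumes at 21500, because in this scenario retention removed offsets 21001 to 21499 before the job ran again. That is the one case to watch for: offsets are never reused, but if the job falls behind the retention window, the records between the saved offset and the beginning offset are gone.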