I am using Kafka sink connector to write data from Kafka to s3. The output data is partitioned into hourly buckets - year=yyyy/month=MM/day=dd/hour=hh
. This data is used by a batch job downstream. So, before starting the downstream job, I need to be sure that no additional data will arrive in a given partition once the processing for that partition has started.
What is the best way to design this? How can I mark a partition as complete? i.e. no additional data will be written to it once marked as complete.
EDIT: I am using RecordField as timestamp.extractor. My kafka messages are guaranteed to be sorted within partitions by the partition field
Depends on which Timestamp Extractor you are using in the Sink config.
You would have to guarantee the no records can have a timestamp earlier than the time you consume it.
AFAIK, the only way that's possible is using the WallClock Timestamp Extractor. Otherwise, you are consuming a Kafka Record timestamp, or some timestamp within each message. Both of which can be overwritten on the Producer end to some event in the past