apache-spark · apache-kafka · batch-processing · hadoop-partitioning · system-design

Kafka S3 Sink Connector - how to mark a partition as complete


I am using the Kafka Connect S3 sink connector to write data from Kafka to S3. The output data is partitioned into hourly buckets: year=yyyy/month=MM/day=dd/hour=hh. A batch job downstream consumes this data, so before starting that job I need to be sure that no additional data will arrive in a given partition once processing for that partition has started.
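
For reference, the partitioning part of my sink config looks roughly like this (connector class, partitioner class, and path format follow the Confluent S3 sink connector; the bucket name is a placeholder):

```json
{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "s3.bucket.name": "my-bucket",
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
  "partition.duration.ms": "3600000",
  "locale": "en-US",
  "timezone": "UTC"
}
```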

What is the best way to design this? How can I mark a partition as complete, i.e. guarantee that no additional data will be written to it once it is marked complete?

EDIT: I am using RecordField as the timestamp.extractor. My Kafka messages are guaranteed to be sorted within partitions by the field used for partitioning.
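
The relevant extractor settings are the following two keys (event_time is a placeholder for my actual timestamp field):

```json
{
  "timestamp.extractor": "RecordField",
  "timestamp.field": "event_time"
}
```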


Solution

  • It depends on which timestamp extractor you are using in the sink config.

    You would have to guarantee that no record can have a timestamp earlier than the time at which you consume it.

    AFAIK, the only way that's possible is with the Wallclock timestamp extractor (see the sketch below). Otherwise, you are consuming the Kafka record timestamp or some timestamp within each message, both of which can be set on the producer end to some event in the past.
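
    As a minimal sketch, switching to wall-clock time would look like this; with these settings, once an hour boundary has passed (plus the connector's flush/rotation delay), no more records can land in that hour's partition. The rotation interval shown is an example value; rotate.schedule.interval.ms forces open files to be closed on a wall-clock schedule, so the partition is complete shortly after the hour ends:

    ```json
    {
      "timestamp.extractor": "Wallclock",
      "partition.duration.ms": "3600000",
      "rotate.schedule.interval.ms": "600000"
    }
    ```

    The trade-off is that with Wallclock you partition by consumption time, not event time, so late-arriving records end up in the hour they were consumed rather than the hour they occurred.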