apache-kafka apache-nifi kafka-producer-api hortonworks-dataflow

NiFi as a producer to Kafka: data is not sequential when reading from Kafka


I am publishing FlowFiles from NiFi to Kafka using the PublishKafka_0_10 processor. When I read the data from Kafka through code, the sequence of the data is not maintained (it should be sorted by timestamp). My data set is like: timestamp, channel,value.

Just to debug, I am publishing the same FlowFiles to Phoenix using PutSQL, and I can see that in the Phoenix table the data is sequential (sorted by time). It would be great if someone could explain why I am not able to read the data from Kafka sequentially. There is only one partition in the Kafka topic. Thanks in advance.


Solution

  • Kafka only guarantees order within a partition. Since you say the topic has a single partition, that part is fine.

    My data set is like: timestamp, channel,value.

    Message timestamps are simply record metadata (NiFi does not pass your own timestamps into the Kafka ProducerRecord class). Timestamps also have no effect on ordering. In other words, if a message with a later timestamp is committed before one with an earlier timestamp, then yes, the data is chronologically out of order, but Kafka only sees that the offsets have moved.

    why am I not able to read data from kafka sequentially

    You are, but in the order the messages were committed to Kafka.

    Your consumer code should extract the record timestamps and reorder the records accordingly. For example, Kafka Connect has a Record Timestamp extractor, which can write data into partitioned directories based on this time. I assume your PutSQL processor is reading the sequentially ordered FlowFiles directly (which have their own timestamps, not the timestamps in your data, unless you ran an UpdateAttribute processor), rather than going through the ConsumeKafka processor?
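
A minimal sketch of that reordering step, under a few assumptions not stated in the question: each record value is a plain `timestamp,channel,value` line, the timestamp is an integer (for example epoch milliseconds), and the batch is small enough to sort in memory. The names `Reading`, `parse_reading`, and `reorder_by_payload_time` are hypothetical, and the list of strings stands in for the values a consumer would return in offset order; no Kafka client is used here.

```python
from typing import List, NamedTuple


class Reading(NamedTuple):
    timestamp: int   # assumed: integer timestamp taken from the payload
    channel: str
    value: float


def parse_reading(line: str) -> Reading:
    """Parse one 'timestamp,channel,value' line (assumed payload layout)."""
    ts, channel, value = (field.strip() for field in line.split(","))
    return Reading(int(ts), channel, float(value))


def reorder_by_payload_time(values: List[str]) -> List[Reading]:
    """Sort a batch of record values by the timestamp inside each payload.

    sorted() is stable, so records with equal timestamps keep their
    original (offset) order.
    """
    return sorted((parse_reading(v) for v in values), key=lambda r: r.timestamp)


# Record values in the order Kafka would hand them back (offset order).
batch = [
    "1000,ch1,0.5",
    "3000,ch1,0.7",   # committed before the "earlier" record below
    "2000,ch2,0.6",
]

ordered = reorder_by_payload_time(batch)
print([r.timestamp for r in ordered])
```

Sorting a whole batch only works when the batch is bounded; for a continuous stream you would need to buffer records for some window before emitting them, and the size of that window is a trade-off between latency and how late a record may arrive.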