apache-flinksequencekafka-topic

Apache Flink with multiple Kafka sources. Ensure one topic is fully read before consuming data on the other topic


Working with Kafka Streams by creating a GlobalKTable I know per definition that the table will be fully populated before the streaming of other sources will start.

I'm looking for a similar functionality in Apache Flink. Topic one holds configuration data which is almost static. I want Flink to fully consume this topic before even starting to read from topic two. Topic one contains ~5 Mio records with a total size of around 600MB

Is there a way to achieve this or would I need to buffer the data from topic two until I have matching data from topic one?


Solution

  • If you use the Hybrid Source approach, then you won't get updates for the "almost static" configuration data. The same thing is true if you load that topic's data into state using the State Processor API.

    Another option that I've used in the past is to support a --coldstart parameter. When this is set, you configure your workflow with a fake (empty, never terminates) source instead of the second Kafka topic. Then you run the workflow until you see no more new data being loaded from the first topic, stop with a savepoint, and restart from the savepoint but this time without the --coldstart parameter.