apache-kafka, kafka-consumer-api, flume

Kafka HA + Flume: how can I use a Kafka HA configuration with Flume?


Problem

Currently, our architecture uses Flume with a Kafka channel (no source) and a sink to HDFS.

In the future, we are going to build a Kafka HA setup of two clusters using Kafka MirrorMaker.
The goal is that even if one cluster goes down, consumers can connect to the other cluster and keep working without interruption.

To do this, I think we need Flume to subscribe to topics with a regex pattern.

Assume that cluster A and cluster B exist, and both clusters have a topic called ex.
MirrorMaker replicates the topic in each direction, so cluster A has the topics ex and b.ex, and cluster B has the topics ex and a.ex.
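
For reference, a minimal MirrorMaker 2 sketch that would produce this topic layout (the cluster aliases a and b and the bootstrap addresses are assumptions), run with bin/connect-mirror-maker.sh mm2.properties:

# mm2.properties - bidirectional replication between the two clusters
clusters = a, b
a.bootstrap.servers = cluster-a-host1:9092
b.bootstrap.servers = cluster-b-host1:9092

# replicate ex in both directions; MM2's default replication policy
# prefixes remote topics with the source-cluster alias (a.ex, b.ex)
a->b.enabled = true
a->b.topics = ex
b->a.enabled = true
b->a.topics = ex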

For example, while reading ex and b.ex from cluster A, if there is a failure, Flume should fail over to the opposite cluster and read ex and a.ex there.

Something like the snippet below:

test.channels = c1 c2
test.channels.c1.kafka.topics.regex = .*ex   (impossible: the Kafka channel has no such property)
...

test.sources.s1.kafka.topics.regex = .*ex   (possible: the Kafka source supports this)

The Flume Kafka source has a property for subscribing to topics by regex pattern, but no such property exists for the Kafka channel.
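
For comparison, a source-based pipeline that does support a regex subscription might look like the sketch below (the agent name, hosts, and HDFS path are assumptions):

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Kafka source: regex subscription is supported here
a1.sources.s1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.s1.kafka.bootstrap.servers = cluster-a-host1:9092
a1.sources.s1.kafka.topics.regex = .*ex
a1.sources.s1.channels = c1

a1.channels.c1.type = memory

# HDFS sink; the Kafka source puts a "topic" header on each event,
# so events land in a per-topic directory
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/%{topic}
a1.sinks.k1.hdfs.fileType = DataStream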

Is there a good way to do this?
I'd appreciate any suggestions for a better approach. Thank you.


Solution

  • Sure, using a regex (or simply a list of both topics) would be preferable, but you then end up with data split across different directories based on the topic name, leaving HDFS clients to merge the data back together.

    A Kafka channel includes a producer as well as a consumer, which is why a regex isn't possible: the channel must know one concrete topic to write to.
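
    Compare the channel's configuration, which takes exactly one concrete topic (the agent name and host are assumptions):

      a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
      a1.channels.c1.kafka.bootstrap.servers = cluster-a-host1:9092
      # a single exact topic - the channel's consumer and producer both use it
      a1.channels.c1.kafka.topic = ex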

    "By going to the opposite cluster"

    There's no way Flume will do that automatically unless you modify its bootstrap-servers config and restart it. The same applies to any Kafka client, really... This isn't exactly what I'd call "highly available", because all clients pointing at the downed cluster will experience downtime.
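
    So the only manual fallback is editing the bootstrap servers and restarting the agent, along these lines (hostnames are assumptions):

      # normal operation: the agent reads from cluster A
      a1.sources.s1.kafka.bootstrap.servers = cluster-a-host1:9092
      # if cluster A dies: point at cluster B, then restart the Flume agent
      # a1.sources.s1.kafka.bootstrap.servers = cluster-b-host1:9092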

    Instead, you should be running a Flume pipeline (or Kafka Connect) against each cluster, as sketched below. That being said, MirrorMaker would then only be making extra copies of your data, or letting clients consume data from the other cluster for their own purposes, rather than acting as a backup/fallback.
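
    A sketch of that layout, with one agent per cluster, each consuming only its local topic (agent names and hosts are assumptions):

      # agentA runs against cluster A
      agentA.sources.s1.kafka.bootstrap.servers = cluster-a-host1:9092
      agentA.sources.s1.kafka.topics = ex

      # agentB runs against cluster B
      agentB.sources.s1.kafka.bootstrap.servers = cluster-b-host1:9092
      agentB.sources.s1.kafka.topics = ex

    With this layout, each pipeline keeps delivering its own cluster's copy of the data even if the other cluster goes down.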

    Aside: it's unclear from the question, but also make sure you are using MirrorMaker 2, which would imply you're already running Kafka Connect and can therefore install the HDFS sink connector there rather than needing Flume at all.
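
    For example, a Confluent HDFS sink connector can subscribe by regex directly (assuming that connector is installed; the connector name, URL, and sizes are assumptions):

      name = hdfs-sink-ex
      connector.class = io.confluent.connect.hdfs.HdfsSinkConnector
      tasks.max = 1
      # regex subscriptions are built into the Kafka Connect sink framework
      topics.regex = .*ex
      hdfs.url = hdfs://namenode:8020
      flush.size = 1000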