apache-kafka, apache-kafka-mirrormaker

Remote consumer running slow for Kafka


Background: We were trying to run MirrorMaker to replicate some data from one of our datacenters to AWS. While doing this we ran into what looked like a bottleneck in throughput per partition. The problem showed up on some of our topics that have large message sizes and decent throughput. We were running MirrorMaker on the AWS side, alongside our backup Kafka cluster there (on different machines).

How did we conclude we were bottlenecking on throughput per partition?

When we increased the number of partitions in the topic, throughput increased. Also, when producing to a single partition of a topic with many partitions, throughput stayed low.

To debug further, we ran kafka-consumer-perf-test on the AWS side to figure out whether the consumer might be the bottleneck (it should really be either the consumer or the producer there).
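
For reference, the perf test was invoked along these lines (broker address, topic, and message count here are placeholders; consumer overrides like the ones listed below can be passed in via --consumer.config):

    bin/kafka-consumer-perf-test.sh \
      --bootstrap-server aws-broker:9092 \
      --topic large-message-topic \
      --messages 500000 \
      --consumer.config consumer.properties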

We saw a drastic drop in consumer throughput on AWS compared to running the same consumer in our datacenter. Another observation: tweaking a bunch of properties does not seem to impact the consumer's batch size at all on the AWS side.

Properties we tried out, for reference:

    fetch.min.bytes=4000000
    max.partition.fetch.bytes=33554432
    max.poll.records=20000

So I'm just wondering if there is some configuration I might be missing here that could lead to this drop?

Edit 1:

OK, so tuning the socket buffer on the broker does improve the consumer perf test quite a bit. But it seems like MirrorMaker is still bottlenecking somewhere. One peculiar thing I noticed is:

consumer-fetch-manager-metrics:fetch-latency-avg:{client-id=consumer-perf-consumer-4331-1} : 145.250

This metric is around 145 ms for the performance test script. But when checking the same metric via JMX for MirrorMaker, for the same topic it is around 1300 ms, which is very weird considering the two should be roughly similar.
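
To be concrete, by "tuning the socket buffer on the broker" I mean the broker-side socket settings in server.properties, roughly as sketched below (the 1 MB value is what we eventually settled on, per the solution; bumping the receive buffer as well is my own symmetric assumption):

    # server.properties on the source brokers (sketch)
    socket.send.buffer.bytes=1048576      # ~1 MB, up from the 102400 (~100 KB) default
    socket.receive.buffer.bytes=1048576   # assumed symmetric bump, not something measured here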

Edit 2:

So I saw something very weird. It seems like MirrorMaker is not picking up my overridden config for the consumers it creates to replicate data.

org.apache.kafka.common.config.AbstractConfig [2022-03-19 00:06:34,709] INFO ConsumerConfig values: 
    allow.auto.create.topics = true
    auto.commit.interval.ms = 5000
    auto.offset.reset = earliest
    bootstrap.servers = [hidden.ip:9092]
    check.crcs = true
    client.dns.lookup = use_all_dns_ips
    client.id = consumer-null-6
    client.rack = 
    connections.max.idle.ms = 540000
    default.api.timeout.ms = 60000
    enable.auto.commit = false
    exclude.internal.topics = true
    fetch.max.bytes = 52428800
    fetch.max.wait.ms = 500
    fetch.min.bytes = 1
    group.id = null
    group.instance.id = null
    heartbeat.interval.ms = 3000
    interceptor.classes = []
    internal.leave.group.on.close = true
    internal.throw.on.fetch.stable.offset.unsupported = false
    isolation.level = read_uncommitted
    key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
    max.partition.fetch.bytes = 1048576
    max.poll.interval.ms = 300000
    max.poll.records = 500
    metadata.max.age.ms = 300000
    metric.reporters = []
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
    **receive.buffer.bytes = 65536**
    reconnect.backoff.max.ms = 1000
    reconnect.backoff.ms = 50
    request.timeout.ms = 30000
    retry.backoff.ms = 100
    sasl.client.callback.handler.class = null
    sasl.jaas.config = null
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.min.time.before.relogin = 60000
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    sasl.kerberos.ticket.renew.window.factor = 0.8
    sasl.login.callback.handler.class = null
    sasl.login.class = null
    sasl.login.refresh.buffer.seconds = 300
    sasl.login.refresh.min.period.seconds = 60
    sasl.login.refresh.window.factor = 0.8
    sasl.login.refresh.window.jitter = 0.05
    sasl.mechanism = GSSAPI

From the JMX stats under the consumer section, what I understand is that this should be the named client id that is actually doing the heavy lifting and moving the data. [screenshot of JMX consumer metrics]
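
(For anyone trying to reproduce this: the metric in question is the fetch-latency-avg attribute on an MBean along the lines of the one below; the client-id is whatever MirrorMaker assigns to its internal consumer.)

    kafka.consumer:type=consumer-fetch-manager-metrics,client-id=<client-id>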

I have set the properties as described in the MirrorMaker 2 KIP at https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0 . According to the logs they are taking effect on the producer side of things, but not for my consumers at all, it seems. I am using MirrorMaker 2 from the 2.7.1 distribution.
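
For context, my overrides are set roughly like the mm2.properties sketch below. The cluster aliases and broker addresses are placeholders, the client-override prefix is the {cluster}.consumer.* form described in KIP-382, and the values just reuse the ones from the perf-test experiments above (receive.buffer.bytes is an extra assumed bump, given the highlighted 64 KB default):

    clusters = dc, aws
    dc.bootstrap.servers = dc-broker:9092
    aws.bootstrap.servers = aws-broker:9092

    dc->aws.enabled = true
    dc->aws.topics = .*

    # overrides intended for the consumers MirrorMaker creates against the "dc" cluster
    dc.consumer.fetch.min.bytes = 4000000
    dc.consumer.max.partition.fetch.bytes = 33554432
    dc.consumer.max.poll.records = 20000
    dc.consumer.receive.buffer.bytes = 4194304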


Solution

  • OK, so I got it. In the Kafka 2.7 distribution there seems to be a bug where the consumer property overrides are not picked up by the core consumer that actually polls data from the source cluster. Weirdly, they are picked up on the producer side.

    I upgraded to 3.1; sharing the improvements for reference. Data being replicated: a single-partition topic with an average message size of ~100 KB.

    When replicating from a cluster that has the default send buffer bytes, throughput increased to about 3x the original (using the recommended higher settings for buffers, poll records and so on).

    On a cluster with a tuned send.buffer.bytes (1 MB instead of the 100 KB default), my throughput increased 30x.

    The latency between data centers was ~70ms for reference.
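
    As a rough sanity check on why the buffers matter so much over a ~70 ms link: a single TCP connection is limited to roughly its window (buffer) size divided by the round-trip time, so

        65536 bytes   / 0.070 s ≈ 0.9 MB/s per connection (64 KB default receive.buffer.bytes)
        1048576 bytes / 0.070 s ≈ 15 MB/s  per connection (1 MB buffers)

    which is in line with the order-of-magnitude jumps above (assuming the OS actually grants the requested buffer sizes).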