Background: We were trying to run MirrorMaker to replicate data from one of our datacenters to AWS. Doing this, we ran into what looked like a per-partition throughput bottleneck on some of our topics that have large message sizes and decent throughput. MirrorMaker was running on the AWS side, along with our backup Kafka cluster there [on different machines].
How did we conclude we were bottlenecked on throughput per partition?
When we increased the number of partitions in the topic, throughput increased. Also, when producing to a single partition of a topic with many partitions, throughput was low.
To debug further, we ran consumer-perf-test on the AWS side to figure out whether the consumer might be the bottleneck [considering it should really be either the consumer or the producer there].
We saw a drastic drop in consumer throughput on AWS compared to running the same consumer in our datacenter. Another observation: tweaking a bunch of properties did not seem to affect the consumer's batch size at all on the AWS side.
Properties we tried, for reference:
fetch.min.bytes=4000000
max.partition.fetch.bytes=33554432
max.poll.records=20000
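For context, a rough sketch of how such a perf run can be invoked, assuming the overrides above live in a file called perf-consumer.properties (the topic name and message count here are placeholders):

bin/kafka-consumer-perf-test.sh \
  --bootstrap-server hidden.ip:9092 \
  --topic my-large-message-topic \
  --messages 500000 \
  --consumer.config perf-consumer.properties \
  --print-metrics

The --print-metrics flag makes the tool dump the consumer's internal metrics (including fetch-latency-avg) at the end of the run.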
So I am just wondering if there is some configuration I might be missing here that could explain this drop?
Edit 1:
OK, so tuning the socket buffer on the broker does improve the consumer perf test numbers quite a bit. But MirrorMaker still seems to be bottlenecking somewhere. One peculiar thing I noticed is:
consumer-fetch-manager-metrics:fetch-latency-avg:{client-id=consumer-perf-consumer-4331-1} : 145.250
This metric is around 145 ms in the performance test script. But checking the same stat via JMX for MirrorMaker, for the same topic, it is around 1300 ms, which is very strange considering the two should be roughly similar.
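For reference, a sketch of what the broker-side socket buffer tuning mentioned above can look like in server.properties (1 MB shown here against the 100 KB default; the exact value depends on the link, and the receive-side line is included only for symmetry):

# server.properties on the source brokers (illustrative values)
socket.send.buffer.bytes=1048576
# receive side mostly matters for the produce path; optional here
socket.receive.buffer.bytes=1048576

The client-side counterpart is the consumer's receive.buffer.bytes, which defaults to 64 KB.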
Edit 2:
So I saw something very weird. It seems like MirrorMaker is not picking up my overridden config for the consumers it creates to replicate the data.
org.apache.kafka.common.config.AbstractConfig [2022-03-19 00:06:34,709] INFO ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = earliest
bootstrap.servers = [hidden.ip:9092]
check.crcs = true
client.dns.lookup = use_all_dns_ips
client.id = consumer-null-6
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = false
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = null
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
internal.throw.on.fetch.stable.offset.unsupported = false
isolation.level = read_uncommitted
key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
**receive.buffer.bytes = 65536**
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = null
sasl.jaas.config = null
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = GSSAPI
From the JMX stats under the consumer section, my understanding is that this is the client id actually doing the heavy lifting and porting the data.
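For anyone digging into the same thing: with JConsole or any JMX client attached to the MirrorMaker process, the metric in question should live on the consumer's fetch-manager MBean, roughly as below (the client id is whatever MirrorMaker assigned):

kafka.consumer:type=consumer-fetch-manager-metrics,client-id=<mirrormaker consumer client id>
    attribute: fetch-latency-avg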
I have set the properties as described in the MirrorMaker 2 KIP at https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0. According to the logs, they are reflected on the producer side but not for my consumers at all. I am using MirrorMaker 2 from the 2.7.1 distribution.
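For reference, the overrides follow the per-cluster prefix convention from the KIP; a sketch of roughly what this looks like in the mm2 properties file (cluster aliases and addresses are placeholders, and the values are the ones discussed earlier in this post):

clusters = dc1, aws
dc1.bootstrap.servers = hidden.ip:9092
aws.bootstrap.servers = <aws broker list>
dc1->aws.enabled = true
dc1->aws.topics = .*

# per-cluster client overrides, prefixed with the source cluster alias
dc1.consumer.fetch.min.bytes = 4000000
dc1.consumer.max.partition.fetch.bytes = 33554432
dc1.consumer.max.poll.records = 20000
# the receive buffer is the value highlighted in the config dump above
dc1.consumer.receive.buffer.bytes = 1048576

This was launched with bin/connect-mirror-maker.sh pointed at that properties file.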
OK, so I got it. In the Kafka 2.7.1 distribution there appears to be a bug wherein the consumer properties are not picked up by the core consumer that actually polls data from the source cluster. Oddly, they are picked up for the producer side.
I upgraded to 3.1 and am sharing the improvements for reference. The data being replicated was a single-partition topic with an average message size of ~100 KB.
When replicating from a cluster that has the default send buffer bytes, throughput increased to about 3x the original [using the recommended higher settings for buffers, poll records and so on].
On a cluster with a tuned send.buffer.bytes [1 MB instead of the 100 KB default], throughput increased about 30x.
The latency between the data centers was ~70 ms, for reference.
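Those numbers line up with a back-of-the-envelope bandwidth-delay calculation: with the socket buffer capped, a single connection can only keep roughly one buffer's worth of data in flight per round trip, so (very roughly, ignoring protocol overheads):

max throughput per connection ≈ socket buffer size / round-trip time
~100 KB / 0.070 s ≈ 1.4 MB/s   (default buffers)
~1 MB  / 0.070 s ≈ 14 MB/s     (tuned buffers)

which is the right order of magnitude for the gains we saw once the overrides actually took effect.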