apache-kafka-connectapache-kafka-mirrormaker

Distributed Mode in Dedicated MirrorMaker 2 cluster - not balancing tasks evenly


I've been trying to set a cluster in Kubernetes with nodes based on MirrorMaker, using Kafka version 3.5.1.

The setting I used is based on what is described in this KPI, and relates to the usage of dedicated.mode.enable.internal.rest, which has been delivered as part of JIRA #10586 - meaning it should be available since Kafka 3.5.0.

The issues I get are not related to the topics/offsets replication, this has been working fine - but basically:

I can leave it running for hours and the result remains the same.

Following is the mirror maker configuration I have tried:

clusters = source, destination

source.bootstrap.servers = <SOURCE_BOOTSTRAP_SERVERS>
destination.bootstrap.servers = <TARGET_BOOTSTRAP_SERVERS>

source->destination.enabled = true
source->destination.topics = <TARGET_TOPIC_WHITELIST>
groups=.*
topics.blacklist = .*[\-\.]internal
emit.heartbeats.enabled = true
source->destination.sync.group.offsets.enabled = true

# Auth config
security.protocol=SASL_SSL
source.security.protocol=SASL_SSL
destination.security.protocol=SASL_SSL
source.sasl.mechanism=SCRAM-SHA-512
destination.sasl.mechanism=SCRAM-SHA-512
source.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="<SOURCE_USERNAME>" password="<SOURCE_PASSWORD>";
destination.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="<TARGET_USERNAME>" password="<TARGET_PASSWORD>";

replication.policy.class=org.apache.kafka.connect.mirror.IdentityReplicationPolicy

# rest API - new since 3.5.0
dedicated.mode.enable.internal.rest=true

which includes the flag set for enabling the REST API.

As the nodes are launched (via connect-mirror-maker.sh) I can see in one of our replicas the following information:

[2023-07-27 12:36:10,825] INFO Advertised URI: http://<advertised_IP>:8083/ (org.apache.kafka.connect.runtime.rest.RestServer:394)
[2023-07-27 12:36:10,826] INFO REST server listening at http://<advertised_IP>:8083/, advertising URL http://<advertised_IP>:8083/ (org.apache.kafka.connect.runtime.rest.RestServer:202)

followed by the group join indication (the lines below can be verified in all replicas):

2023-07-27 12:36:34,719] INFO [Worker clientId=source->destination, groupId=source-mm2] Joined group at generation 44 with protocol version 2 and got assignment: Assignment{error=0, leader='source->destination-224a6033-361f-426b-9840-fc078eb87333', leaderUrl='http://<advertised_IP>:8083/', offset=705, connectorIds=[MirrorHeartbeatConnector], taskIds=[MirrorHeartbeatConnector-0], revokedConnectorIds=[], revokedTaskIds=[], delay=0} with rebalance delay: 0 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2394)

If I try running curl targeting the leaderUrl above (on the root or at /connectors) from any node, using the advertised IP (or localhost in the leader) I end up with {"error_code":404,"message":"HTTP 404 Not Found"}

I can see the busy pod CPU consumption going high as it replicates all messages, while other pods are free.


Solution

  • The maximum number of tasks did the desired effect, we started testing with tasks.max=10 and it already made a huge difference (we are now tuning it based on our needs). Some related context can be found on this answer.