apache-kafka aws-msk

Risk of Increasing Replication Factor for __consumer_offsets Topic on a Live Kafka 3.5.1 Cluster


We are currently operating an Amazon MSK cluster with 6 brokers running Kafka version 3.5.1. This cluster is actively used by our production services.

We are planning to increase the replication factor of the __consumer_offsets topic from its current value of 2 to 3. The minimum in-sync replicas setting (min.insync.replicas) is currently 2, and we intend to keep this value unchanged.

We plan to use the kafka-reassign-partitions.sh script to perform this partition reassignment.

My questions are as follows:

Any additional tips or precautions for safely performing this operation would be highly valued. Thank you in advance for your answers.


Solution

  • Increasing the RF to 3 means every one of the 50 partitions (assuming the default offsets.topic.num.partitions) gets a new replica that must be copied in full to another broker in the cluster. The risk is that this replication traffic saturates the network and storage-volume throughput limits of the brokers involved. Copying all 50 partitions at once also adds unnecessary load on the controller, which can increase latency.

    Recommendations

    1. Use the --throttle option of kafka-reassign-partitions.sh so that the fetches copying the new replicas consume only a small share of that throughput, e.g. 10 MB/s.

    2. Increase num.replica.fetchers (to between 2 and 4) to give the replica fetchers on each broker additional resources to complete this task, assuming you have enough spare CPU and I/O capacity.

    3. Split the reassignment into batches of 5-10 partitions and raise only one batch to RF=3 at a time. Do not reassign all 50 partitions at once; this limits the load on the controller.

    4. Make sure the cluster is otherwise healthy before performing this action.
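
The batching in recommendation 3 can be sketched in Python. This is only an illustration under stated assumptions: the broker ids 1-6, the starting replica layout, and the helper name plan_batches are all hypothetical, and the real current assignment should be taken from kafka-topics.sh --describe on your cluster. The script keeps the existing replicas of each partition in their original order (so the preferred leader does not change), appends the least-loaded broker that does not already host the partition, and groups the result into reassignment plans in the JSON layout that kafka-reassign-partitions.sh reads.

```python
import json
from collections import Counter

TOPIC = "__consumer_offsets"


def plan_batches(current, brokers, target_rf=3, batch_size=10):
    """Build batched reassignment plans that raise each partition to target_rf.

    current: dict of partition id -> list of broker ids (existing replicas)
    brokers: list of all broker ids in the cluster
    Returns a list of dicts, one per batch, in the reassignment JSON layout.
    """
    # Count how many replicas each broker already hosts for this topic.
    load = Counter(b for replicas in current.values() for b in replicas)
    entries = []
    for partition in sorted(current):
        # Copy the existing replicas, preserving order (first = preferred leader).
        replicas = list(current[partition])
        while len(replicas) < target_rf:
            candidates = [b for b in brokers if b not in replicas]
            if not candidates:
                raise ValueError("not enough brokers for the target RF")
            # Pick the broker with the fewest replicas; ties go to the lowest id.
            new_broker = min(candidates, key=lambda b: (load[b], b))
            replicas.append(new_broker)
            load[new_broker] += 1
        entries.append(
            {"topic": TOPIC, "partition": partition, "replicas": replicas}
        )
    # Slice the full plan into batches of batch_size partitions.
    return [
        {"version": 1, "partitions": entries[i:i + batch_size]}
        for i in range(0, len(entries), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical 6-broker cluster and RF=2 starting layout (assumption).
    brokers = [1, 2, 3, 4, 5, 6]
    current = {p: [brokers[p % 6], brokers[(p + 1) % 6]] for p in range(50)}
    plans = plan_batches(current, brokers)
    print(len(plans), "batches")
    print(json.dumps(plans[0]["partitions"][0]))
```

Each plan can be saved to a file with json.dump and applied one batch at a time with kafka-reassign-partitions.sh, passing --reassignment-json-file, --execute and --throttle 10000000 (the throttle is given in bytes per second). Running the same file with --verify reports when the batch has finished and removes the throttle, so run it before starting the next batch.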