I think I’ve found a way to guide the yb-master load balancer into a stuck state:
At some point the load balancer suddenly stops moving leaders off the blacklisted nodes, and soon yb-admin get_is_load_balancer_idle
reports Idle = 1, although :7000/tablet-servers
still shows leader-blacklisted nodes with remaining tablet leaders.
yb-admin get_leader_blacklist_completion
never completes, and restarting the yb-masters has no effect.
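For reference, these are the checks I was running; the master address list below is just a placeholder for my environment:

MASTERS=yb-master-0.yb-masters.$NAMESPACE.svc.cluster.local:7100,yb-master-1.yb-masters.$NAMESPACE.svc.cluster.local:7100,yb-master-2.yb-masters.$NAMESPACE.svc.cluster.local:7100

# Reports Idle = 1 even though leaders are still sitting on leader-blacklisted nodes
yb-admin --master_addresses $MASTERS get_is_load_balancer_idle

# Progress of the leader blacklist; in the stuck state this never reaches completion
yb-admin --master_addresses $MASTERS get_leader_blacklist_completion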
I am using version 2.25.1.0-b381 with RF=3.
The deployment is based on https://github.com/yugabyte/yugabyte-db/blob/master/cloud/kubernetes/yugabyte-statefulset.yaml, with only minimal modifications to the yb-master arguments:
- "/home/yugabyte/bin/yb-master"
- "--fs_data_dirs=/mnt/data0"
- "--rpc_bind_addresses=$(POD_NAME).yb-masters.$(NAMESPACE).svc.cluster.local:7100"
- "--server_broadcast_addresses=$(POD_NAME).yb-masters.$(NAMESPACE).svc.cluster.local:7100"
- "--use_private_ip=never"
- "--master_addresses=$(YB_MASTER_ADDRS)"
- "--enable_ysql=true"
- "--replication_factor=3"
- "--logtostderr"
- "--webserver_interface=0.0.0.0"
- "--load_balancer_max_concurrent_moves=10" # default 2
- "--load_balancer_max_concurrent_moves_per_table=2" # default 1
- "--load_balancer_max_over_replicated_tablets=10" # default 1
- "--load_balancer_max_concurrent_removals=10" # default 1
- "--load_balancer_max_concurrent_adds=10" # default 1
- "--load_balancer_max_inbound_remote_bootstraps_per_tserver=2" # default 4
- "--load_balancer_max_concurrent_tablet_remote_bootstraps_per_table=1" # default 2
I just reproduced the problem: fresh cluster with 3 tservers, scale up to 6, add data, wait until the load balancer is idle, then leader-blacklist 3 tservers and, while the tablet leaders are still moving, deploy another new tserver.
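A rough sketch of the blacklist and scale-up steps (tserver and StatefulSet names are placeholders based on the standard yugabyte-statefulset.yaml; $MASTERS as above):

# Leader-blacklist 3 of the 6 tservers
yb-admin --master_addresses $MASTERS change_leader_blacklist ADD \
  yb-tserver-0.yb-tservers.$NAMESPACE.svc.cluster.local:9100 \
  yb-tserver-1.yb-tservers.$NAMESPACE.svc.cluster.local:9100 \
  yb-tserver-2.yb-tservers.$NAMESPACE.svc.cluster.local:9100

# While the leader moves are still running, add another tserver
kubectl -n $NAMESPACE scale statefulset yb-tserver --replicas=7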
All tservers that host a replica of the stuck tablet(s) are leader blacklisted, but at the same time there are 3 or more non-blacklisted tservers available to which the leaders could be moved (example screenshot attached; the rebalance was stuck while the UI showed "cluster is load balanced").
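To check which tservers host the replicas of a stuck tablet, something like this works (table name and tablet ID are placeholders):

# List tablets and their current leaders for the affected table
yb-admin --master_addresses $MASTERS list_tablets ysql.yugabyte my_table

# Show the tservers hosting the replicas of one stuck tablet
yb-admin --master_addresses $MASTERS list_tablet_servers <tablet_id>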
Removing the leader blacklist from 1 of the 3 blacklisted tservers allowed the rebalance to resume as normal.
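Concretely, something like this was enough to unblock it (placeholder address):

# Take one of the three tservers back out of the leader blacklist
yb-admin --master_addresses $MASTERS change_leader_blacklist REMOVE \
  yb-tserver-2.yb-tservers.$NAMESPACE.svc.cluster.local:9100

# The remaining leader moves then complete
yb-admin --master_addresses $MASTERS get_leader_blacklist_completion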
Did anyone else run into this issue?
This is caused by all of the tservers that host a replica of that tablet being leader blacklisted (so we can't move the leaders anywhere).
The cluster balancer currently handles leader blacklisting without considering data moves, since users normally expect leader blacklists to take effect quickly. In this case, we would have to move a tablet replica off the leader-blacklisted set to another node and then move the leader onto it, which we don't currently support.
We probably won't add support for this in the near future because the usual use case for leader blacklisting is temporarily taking down a node or set of nodes in the same region/zone, in which case the nodes in other regions are able to take the leaders. You could use a data blacklist to move the actual data off the nodes.
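A rough sketch of that (addresses are placeholders; $MASTERS is the comma-separated master address list):

# Data-blacklist the nodes instead; this moves the tablet replicas (and with them the leaders) off
yb-admin --master_addresses $MASTERS change_blacklist ADD \
  yb-tserver-0.yb-tservers.$NAMESPACE.svc.cluster.local:9100 \
  yb-tserver-1.yb-tservers.$NAMESPACE.svc.cluster.local:9100 \
  yb-tserver-2.yb-tservers.$NAMESPACE.svc.cluster.local:9100

# Track the data move rather than the leader blacklist
yb-admin --master_addresses $MASTERS get_load_move_completion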