I have a Citus cluster with 3 worker nodes. I recently added a new worker and then started a rebalance on a table. Everything was fine up to that point, but the rebalance never finishes. When I look at
get_rebalance_progress()
the target shard size is always 10% of the source shard size. I waited for 18 hours and saw no progress, and no error either.
The table has 32 shards and 1200 partitions. 25k rows are inserted every minute, with no deletes or updates. As of today (2022-08-10) it holds only 10 days' worth of data. After I start the rebalance I see some disk and network activity on the new node, but it drops after a few minutes and I see no significant activity after that. What am I doing wrong? How should I rebalance this table?
Try a write-blocking rebalance, if you can tolerate writes to the shards being blocked while they are moved:
SELECT rebalance_table_shards('dist_table', shard_transfer_mode:='block_writes');
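Before starting the move, it can be useful to preview what the rebalancer intends to do. A minimal sketch, assuming the table is named dist_table as in the example above and that your Citus version provides get_rebalance_table_shards_plan():

```sql
-- Run on the coordinator. This only lists the planned shard moves;
-- it does not move any data.
SELECT table_name,
       shardid,
       pg_size_pretty(shard_size) AS size,
       sourcename,
       targetname
FROM get_rebalance_table_shards_plan('dist_table');
```

If the plan is much larger than expected (for example because of the number of partitions), that alone may explain a long-running rebalance.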
If the operation is still stuck, look at the logs of the workers and the coordinator for any rebalance-related errors or info messages; they might help you understand what is causing the problem.
Alternatively, consider updating to the latest Citus version if you haven't already, since there have been improvements to the rebalancing operation.
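Alongside the logs, the state of the move can also be inspected from SQL. The queries below are a rough sketch, not a definitive diagnosis. They assume a Citus version whose get_rebalance_progress() exposes source_shard_size and target_shard_size (as the question implies), and that the stuck rebalance was running in the default non-blocking mode, which relies on PostgreSQL logical replication:

```sql
-- On the coordinator: one row per shard move, with sizes on both sides.
-- progress: 0 = waiting, 1 = moving, 2 = complete.
SELECT table_name,
       shardid,
       sourcename,
       targetname,
       progress,
       pg_size_pretty(source_shard_size) AS source_size,
       pg_size_pretty(target_shard_size) AS target_size
FROM get_rebalance_progress();

-- On the source worker: replication slots used for the move.
-- An inactive slot, or an LSN that never advances between runs,
-- suggests the replication stream has stalled.
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;

-- On any node: long-running sessions and what they are waiting on.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;
```

Running the first query a few minutes apart shows whether target_size is changing at all or is truly frozen.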
You should be able to update by following this: https://www.citusdata.com/blog/2022/06/17/citus-11-goes-fully-open-source/
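To see which version is currently running before deciding whether to upgrade, something like the following can be run on the coordinator. This is only a sketch: the ALTER EXTENSION step assumes the newer Citus packages are already installed on every node and PostgreSQL has been restarted, so follow the official upgrade instructions for the exact procedure.

```sql
-- Version of the Citus library loaded by the server.
SELECT citus_version();

-- Version of the extension as registered in this database.
SELECT extversion FROM pg_extension WHERE extname = 'citus';

-- After the new packages are installed and the server restarted,
-- bring the extension's SQL objects up to the installed version.
ALTER EXTENSION citus UPDATE;
```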