I have a write-heavy table that will get write errors if all of its data is assigned to one range, so when I create the table I want to split it into multiple ranges assigned to different nodes in the cluster. But for some reason CockroachDB assigns the new splits to the exact same nodes, which defeats the purpose. How can I tell it to assign the new splits to a different set of nodes so that I won't have a hot spot when I open the nozzle on the heavy writes? Once I turn on the writes it does move the ranges around after ten to twenty minutes, but that is too late, after I've already had many failed writes.
Here's an example:
CREATE TABLE write_heavy_table (
sequence BIGINT NOT NULL,
id TEXT NOT NULL,
count BIGINT NOT NULL,
PRIMARY KEY (sequence, id) USING HASH WITH BUCKET_COUNT = 5);
SHOW RANGES FROM TABLE write_heavy_table;
start_key | end_key | index_name | range_id | range_size_mb | lease_holder | lease_holder_locality | replicas | replica_localities | voting_replicas | non_voting_replicas
------------+---------+------------+----------+---------------+--------------+-----------------------+----------+--------------------+-----------------+----------------------
NULL | NULL | | 2589 | 0 | 2 | | {2,7,10} | {"","",""} | {7,2,10} | {}
(1 row)
-- Now I split on different hash bucket values
ALTER TABLE write_heavy_table SPLIT AT VALUES (1);
ALTER TABLE write_heavy_table SPLIT AT VALUES (2);
ALTER TABLE write_heavy_table SPLIT AT VALUES (3);
ALTER TABLE write_heavy_table SPLIT AT VALUES (4);
SHOW RANGES FROM TABLE write_heavy_table;
start_key | end_key | index_name | range_id | range_size_mb | lease_holder | lease_holder_locality | replicas | replica_localities | voting_replicas | non_voting_replicas
------------+---------+------------+----------+---------------+--------------+-----------------------+----------+--------------------+-----------------+----------------------
NULL | /1 | | 2589 | 0 | 2 | | {2,7,10} | {"","",""} | {7,2,10} | {}
/1 | /2 | | 2590 | 0 | 2 | | {2,7,10} | {"","",""} | {7,2,10} | {}
/2 | /3 | | 2591 | 0 | 2 | | {2,7,10} | {"","",""} | {7,2,10} | {}
/3 | /4 | | 2592 | 0 | 2 | | {2,7,10} | {"","",""} | {7,2,10} | {}
/4 | NULL | | 2593 | 0 | 2 | | {2,7,10} | {"","",""} | {7,2,10} | {}
(5 rows)
So it created the new data ranges based on the hash buckets, but they all have the same lease_holder and replicas! I need the writes spread out across different replicas since I have a ten-node cluster, but CockroachDB seems to go out of its way to create a hot spot: when I turn on the writes, the single lease holder gets hammered. Is there some way to get it to use the whole cluster for this table right from the get-go?
There doesn't appear to be an automatic way of doing this, but you can do it programmatically.
After splitting the table using the hash bucket prefixes, you can then relocate the ranges one by one to other database nodes with a command like this:
ALTER TABLE write_heavy_table RELOCATE SELECT ARRAY[1, 8, 10], 1;
The values in the array are the IDs of the nodes that will hold the three replicas of the data, and the first node in the array becomes the lease holder. The number after the array is the hash bucket value. So to distribute the table's data across your cluster, you could round-robin the lease holder and randomly pick two other node IDs for the remaining replicas, once for each hash bucket value, as sketched below. You can find out how many nodes are in the cluster by querying the crdb_internal.gossip_nodes table.
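For example, here is a minimal sketch for the five hash buckets (0 through 4) created above, assuming a ten-node cluster with node IDs 1 through 10; the specific node assignments are illustrative, not prescriptive:

SELECT node_id FROM crdb_internal.gossip_nodes;

-- Round-robin the lease holder (first array element) across the cluster,
-- one RELOCATE per hash bucket value:
ALTER TABLE write_heavy_table RELOCATE SELECT ARRAY[1, 4, 7], 0;
ALTER TABLE write_heavy_table RELOCATE SELECT ARRAY[2, 5, 8], 1;
ALTER TABLE write_heavy_table RELOCATE SELECT ARRAY[3, 6, 9], 2;
ALTER TABLE write_heavy_table RELOCATE SELECT ARRAY[4, 7, 10], 3;
ALTER TABLE write_heavy_table RELOCATE SELECT ARRAY[5, 8, 1], 4;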
After relocating all the data ranges, do some dummy writes to the table so that it is not completely empty when you open the spigot on the heavy writes. Inserting some dummy rows with random data pre-warms the ranges and keeps the first real writes from hitting high contention on empty ranges.
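A minimal warm-up could look like this (the row values are throwaway placeholders; generate_series is supported by CockroachDB):

-- Seed the table with throwaway rows; the hash-sharded primary key
-- spreads them across all five buckets automatically.
INSERT INTO write_heavy_table (sequence, id, count)
SELECT g, 'warmup-' || g::STRING, 0
FROM generate_series(1, 1000) AS g;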
And when you create the splits, include an expiration time for the manual splits using the "WITH EXPIRATION" clause. You want CRDB to adjust the splits automatically over time based on the actual write pattern, so your manual splits are only needed for the first several minutes of the heavy writes; after that they can expire and CRDB is free to modify things as needed.
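For example (the 30-minute window here is a guess; tune it to how long your initial write burst lasts):

-- Manual splits that expire on their own, after which CRDB can merge
-- or re-split the ranges based on the observed load:
ALTER TABLE write_heavy_table
    SPLIT AT VALUES (1), (2), (3), (4)
    WITH EXPIRATION now() + INTERVAL '30 minutes';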