In a past job I worked with financial tick data: a time series of events where each event/record is about 100 bytes and has 15-20 columns. A full time series is on the order of 1e12 events, quickly growing to 1e13. The mostly uncompressed data required about 300 TB with replication and was distributed over 10 hefty bare-metal nodes.
I recently learned of Citus, and as of version 11 it supports load-balanced queries across all workers. It's open source, straightforward to deploy, and offers very high throughput for both online and analytical queries. It looks like a great fit for serving many low-latency queries, much better than the solution we ended up with.
Except for one big issue: loading all the data. Citus' Query Performance Tuning page says writes must go through the single coordinator node and that "up to" 2M records/second is possible for bulk loading. For 1e13 records that would take about 60 days, which is far too long. Even if a more powerful server allowed 10x the write throughput, a full load would still take nearly a week - not completely game-breaking, but reducing it to a day or an overnight run would change the game entirely.
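For concreteness, a bulk load in this scenario would look something like the sketch below; the table and file names are made up, and the 60-day figure is just 1e13 rows divided by the quoted 2M rows/second ceiling.

```sql
-- Hypothetical bulk load through the coordinator (table/file names made up).
-- Every row passes through the single coordinator, which routes it to the
-- worker that owns the target shard.
\copy ticks FROM 'ticks_2019_03_04.csv' WITH (FORMAT csv)

-- At the quoted ceiling of 2M rows/s:
--   1e13 rows / 2e6 rows/s = 5e6 s ≈ 58 days ≈ 2 months
```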
But another page, Cluster Management, has a tantalizing suggestion:
However, in some write heavy use cases where the coordinator becomes a performance bottleneck, users can add another coordinator. As the metadata tables are small (typically a few MBs in size), it is possible to copy over the metadata onto another node and sync it regularly. Once this is done, users can send their queries to any coordinator and scale out performance. If your setup requires you to use multiple coordinators, please contact us.
In a few hours of searching I couldn't find anything more about this. Are there any guides or suggestions for how to add and use multiple coordinators to increase write throughput? In this case a single table with row-based sharding would be fine, so the metadata should be mostly static.
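To illustrate what I mean by "mostly static metadata", here's a rough sketch of the kind of schema involved. The table and column names are made up; `create_distributed_table` and the `pg_dist_*` catalogs are real Citus names.

```sql
-- One hash-distributed table ("row-based sharding" above), so the pg_dist_*
-- metadata is written once at creation time and then stays essentially static.
CREATE TABLE ticks (
    symbol     text        NOT NULL,
    event_time timestamptz NOT NULL,
    price      numeric,
    size       bigint
);
SELECT create_distributed_table('ticks', 'symbol');

-- The metadata the docs promise is "a few MBs": small catalog tables on the
-- coordinator describing nodes, distributed tables, and shards.
SELECT c.relname, pg_size_pretty(pg_total_relation_size(c.oid::regclass)) AS size
FROM pg_class c
WHERE c.relname IN ('pg_dist_node', 'pg_dist_partition',
                    'pg_dist_shard', 'pg_dist_placement');
```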
My current best guess is to 1. create a "leader coordinator," for lack of a better term, 2. stream its WAL to several "follower coordinators," and then 3. load-balance ordinary INSERT, UPDATE, and DELETE statements across all coordinators. Any operation that changes the metadata, such as adding or removing columns or workers, would still have to go through the "leader coordinator."
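A minimal sketch of what step 2 might look like, assuming plain PostgreSQL physical replication; the slot names are made up and this only illustrates the guess, not something verified to work with Citus:

```sql
-- On the would-be "leader coordinator": reserve a physical replication slot
-- per "follower coordinator" so WAL is retained until each follower has it.
SELECT pg_create_physical_replication_slot('follower_coord_1');
SELECT pg_create_physical_replication_slot('follower_coord_2');

-- Each follower would then be an ordinary hot standby (standby.signal plus
-- primary_conninfo pointing at the leader), replaying the Citus metadata
-- along with everything else in the WAL.

-- Back on the leader, confirm the followers are attached and streaming:
SELECT application_name, state, sync_state FROM pg_stat_replication;
```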
** note: that link for "contact us" includes Stack Overflow... so consider yourself contacted!
Since Stack Overflow endorses answering your own question: having since deployed a cluster with horizontally scaled writes myself, I have the answers. There weren't any good how-to guides on this, so here goes.
Brief preamble: Citus deprecated shard replication, what they also call "statement-based replication" (not to be confused with statement-based replication in general), years ago. HA and fault tolerance are now achieved by 1. having a coordinator cluster and 1+ worker clusters, and 2. bringing your own cluster admin tooling such as Patroni. This migration solved subtle but serious deadlock scenarios that plagued the statement-based replication strategy.
Terminology:
Citus' documentation often uses "the coordinator" or "a worker" to refer to the coordinator leader or to a worker group's leader. I'll try to avoid that ambiguity below.
Citus mostly deals only with the leader of each group. The main exception to this is the citus.use_secondary_nodes GUC. Another exception is that Citus keeps a metadata table in which every node is tagged with its leader or follower status. This table is used to direct DDL and DML statements to the correct node within each group/cluster. Your bring-your-own HA solution, such as Patroni, is responsible for keeping this table up to date.
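For example (hostnames made up; pg_dist_node, noderole, and citus_update_node() are real Citus catalog/function names), here is roughly what that metadata looks like and how an HA tool would repoint it after a failover:

```sql
-- Inspect the node metadata: each node belongs to a group and is tagged
-- 'primary' (leader) or 'secondary' (follower).
SELECT nodeid, groupid, nodename, nodeport, noderole, nodecluster
FROM pg_dist_node
ORDER BY groupid, noderole;

-- After, say, worker group 2's leader fails over, Patroni (or your own
-- tooling) must update the entry so Citus routes writes to the new leader:
SELECT citus_update_node(nodeid, 'worker2-replica.example.com', 5432)
FROM pg_dist_node
WHERE groupid = 2 AND noderole = 'primary';
```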
Concise Guide:
- citus.use_secondary_nodes = never and add more worker clusters; "never" means all queries are sent to the leader of each worker cluster, so scaling requires adding worker clusters.
- citus.use_secondary_nodes = always and add followers to all worker clusters; "always" means queries are only sent to replicas within each group.

Adding worker clusters to scale writes likely seems counterintuitive. There are two reasons for this: