In a past job I worked with financial tick data: a time series of events where each event/record is about 100 bytes and has 15-20 columns. A full time series is on the order of 1e12 events, quickly growing to 1e13. The mostly uncompressed data required about 300 TB with replication and was distributed over 10 hefty bare-metal nodes.
I recently learned of Citus, and as of version 11 it supports load-balanced queries across all workers. It's open source, straightforward to deploy, and offers very high throughput for both online and analytical queries. It looks like a great fit for serving many low-latency queries, much better than the solution we ended up with.
Except for one big issue: loading all the data. Citus' Query Performance Tuning page says writes must go through the single coordinator node and that "up to" 2M records/second is possible for bulk loading. For 1e13 records that would take about 60 days, which is far too long. Even if a more powerful server allowed 10x the write throughput, a full load would still take nearly a week - not completely game-breaking, but reducing it to a day or an overnight run would change the game entirely.
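For concreteness, a bulk load in this scenario would look something like the sketch below; the table and file names are made up, and the 60-day figure is just 1e13 rows divided by the quoted 2M rows/second ceiling.

```sql
-- Hypothetical bulk load through the coordinator (table/file names made up).
-- Every row passes through the single coordinator, which routes it to the
-- worker that owns the target shard.
\copy ticks FROM 'ticks_2019_03_04.csv' WITH (FORMAT csv)

-- At the quoted ceiling of 2M rows/s:
--   1e13 rows / 2e6 rows/s = 5e6 s ≈ 58 days ≈ 2 months
```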
But another page, Cluster Management, has a tantalizing suggestion:
However, in some write heavy use cases where the coordinator becomes a performance bottleneck, users can add another coordinator. As the metadata tables are small (typically a few MBs in size), it is possible to copy over the metadata onto another node and sync it regularly. Once this is done, users can send their queries to any coordinator and scale out performance. If your setup requires you to use multiple coordinators, please contact us.
In a few hours of searching I couldn't find anything more about this. Are there any guides or suggestions for how to add and use multiple coordinators to increase write throughput? In this case a single table with row-based sharding would be fine, so the metadata should be mostly static.
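To illustrate what I mean by "mostly static metadata", here's a rough sketch of the kind of schema involved. The table and column names are made up; `create_distributed_table` and the `pg_dist_*` catalogs are real Citus names.

```sql
-- One hash-distributed table ("row-based sharding" above), so the pg_dist_*
-- metadata is written once at creation time and then stays essentially static.
CREATE TABLE ticks (
    symbol     text        NOT NULL,
    event_time timestamptz NOT NULL,
    price      numeric,
    size       bigint
);
SELECT create_distributed_table('ticks', 'symbol');

-- The metadata the docs promise is "a few MBs": small catalog tables on the
-- coordinator describing nodes, distributed tables, and shards.
SELECT c.relname, pg_size_pretty(pg_total_relation_size(c.oid::regclass)) AS size
FROM pg_class c
WHERE c.relname IN ('pg_dist_node', 'pg_dist_partition',
                    'pg_dist_shard', 'pg_dist_placement');
```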
My current best guess is to 1. create a "leader coordinator," for lack of a better term, 2. stream its WAL to several "follower coordinators," and then 3. load-balance ordinary INSERT, UPDATE, and DELETE statements across all coordinators. Any operation that changes the metadata, such as adding or removing columns or workers, would still have to go through the "leader coordinator."
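A minimal sketch of what step 2 might look like, assuming plain PostgreSQL physical replication; the slot names are made up and this only illustrates the guess, not something verified to work with Citus:

```sql
-- On the would-be "leader coordinator": reserve a physical replication slot
-- per "follower coordinator" so WAL is retained until each follower has it.
SELECT pg_create_physical_replication_slot('follower_coord_1');
SELECT pg_create_physical_replication_slot('follower_coord_2');

-- Each follower would then be an ordinary hot standby (standby.signal plus
-- primary_conninfo pointing at the leader), replaying the Citus metadata
-- along with everything else in the WAL.

-- Back on the leader, confirm the followers are attached and streaming:
SELECT application_name, state, sync_state FROM pg_stat_replication;
```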
** note: that link for "contact us" includes Stack Overflow... so consider yourself contacted!
Since Stack Overflow endorses answering your own question: having since deployed a cluster with horizontally scaled writes myself, I have the answers. There weren't any good how-to guides on this, so here goes.
Brief preamble: Citus deprecated shard replication, what they also call "statement-based replication" (not to be confused with statement-based replication in general), years ago. HA and fault tolerance are now achieved by 1. having a coordinator cluster and 1+ worker clusters, and 2. bringing your own cluster admin tooling such as Patroni. This migration solved subtle but serious deadlock scenarios that plagued the statement-based replication strategy.
Terminology:
Citus' documentation often uses "the coordinator" or "a worker" to refer to the coordinator leader or to a worker group's leader. I'll try to avoid that ambiguity below.
Citus mostly deals only with the leader of each group. The main exception to this is the citus.use_secondary_nodes GUC. Another exception is that Citus keeps a metadata table in which every node is tagged with its leader or follower status. This table is used to direct DDL and DML statements to the correct node within each group/cluster. Your bring-your-own HA solution, such as Patroni, is responsible for keeping this table up to date.
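For example (hostnames made up; pg_dist_node, noderole, and citus_update_node() are real Citus catalog/function names), here is roughly what that metadata looks like and how an HA tool would repoint it after a failover:

```sql
-- Inspect the node metadata: each node belongs to a group and is tagged
-- 'primary' (leader) or 'secondary' (follower).
SELECT nodeid, groupid, nodename, nodeport, noderole, nodecluster
FROM pg_dist_node
ORDER BY groupid, noderole;

-- After, say, worker group 2's leader fails over, Patroni (or your own
-- tooling) must update the entry so Citus routes writes to the new leader:
SELECT citus_update_node(nodeid, 'worker2-replica.example.com', 5432)
FROM pg_dist_node
WHERE groupid = 2 AND noderole = 'primary';
```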
Concise Guide:
- citus.use_secondary_nodes = never and add more worker clusters; "never" means all queries are sent to the leader of each worker cluster, so scaling requires adding worker clusters.
- citus.use_secondary_nodes = always and add followers to all worker clusters; "always" means queries are only sent to replicas within each group.

Adding worker clusters to scale writes likely seems counterintuitive. There are two reasons for this: