cassandra, datastax, datastax-enterprise, datastax-java-driver

How to get backpressure for Cassandra writes in a multi-app scenario?


I have multiple applications that write to Cassandra.

Each individual app has a backpressure mechanism configured, e.g. throughputMBPerSec=10.

Problems arise when multiple applications run at the same time, because the backpressure settings that were chosen and successfully tested for each app individually are no longer correct.

In a client-side backpressure scenario, how can I implement a mechanism that chooses a good backpressure value based on the overall load on the cluster, without losing too much performance?

How are these kinds of problems solved at large companies?


Solution

  • There are two approaches commonly used by large companies to mitigate write backpressure on Cassandra. I've suggested that application teams use both of these. And I'll suggest a third, as well:

    1. Send each write as its own thread with a "listenable future." Once a certain number of them have been initiated (say 50...this will change per application), block to ensure they have all completed. Once complete, kick off another batch of threads. The main "tuneable" here is to raise or lower the active thread count (see the sketch after this list).

    2. Send each write as a message to an Event Processor/Broker like Apache Pulsar or Apache Kafka. Build a consumer that processes the messages. The main tuneable here is adjusting the size of the consumer's receiver queue. I think the default for Pulsar is 1000 messages.

    3. Build a Cassandra cluster for each application. Unfortunately, Cassandra doesn't do a great job of handling vastly different access patterns. At Target, we finally had enough of the applications with heavy write traffic creating a bottleneck for everyone else and built a cluster for every single new application.

    Smaller applications only needed 3 to 6 nodes, while others required more than 200 nodes. When you look at disparities like that in required resources, it really doesn't make sense to co-locate those applications. If you're deploying in the cloud (public or private) this is MUCH easier to do, as opposed to deploying on bare metal.
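    To make the first approach concrete, here is a minimal sketch of the bounded "listenable future" pattern. It assumes the 3.x DataStax Java driver (where executeAsync returns a Guava ListenableFuture); the contact point, keyspace, table, and the in-flight limit of 50 are placeholders to tune per application.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;
    import com.google.common.util.concurrent.Futures;

    public class BatchedAsyncWriter {

        // The main tuneable: how many async writes are in flight before we block.
        private static final int MAX_IN_FLIGHT = 50;

        public static void main(String[] args) throws Exception {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("my_keyspace")) {

                PreparedStatement insert =
                        session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)");

                List<ResultSetFuture> inFlight = new ArrayList<>(MAX_IN_FLIGHT);

                for (int i = 0; i < 100_000; i++) {
                    // executeAsync returns a ResultSetFuture, which is a ListenableFuture.
                    inFlight.add(session.executeAsync(insert.bind("id-" + i, "payload-" + i)));

                    // Once the window is full, block until every write in it has completed,
                    // then kick off the next batch.
                    if (inFlight.size() >= MAX_IN_FLIGHT) {
                        Futures.allAsList(inFlight).get();
                        inFlight.clear();
                    }
                }

                // Drain whatever is left over from the last partial batch.
                if (!inFlight.isEmpty()) {
                    Futures.allAsList(inFlight).get();
                }
            }
        }
    }
    ```

    Raising MAX_IN_FLIGHT trades more pressure on the cluster for more throughput; lowering it is the per-application hand brake.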

    Edits

    It seems simple when you are inside a single ETL (jar/app), since you control all of its threads (and can block some of them). It seems much harder to globally throttle threads across multiple ETLs (jars/apps) running at the same time, possibly on different physical machines.

    So yes, this is assuming that the thread control is happening in each individual ETL job. This might not be easy to do in your case.

    How does adjusting the queue size affect backpressure on the Cassandra side?

    I'm not exactly sure on this one, but I think that adjusting the queue size limits how many messages the consumer can pull off of the topic at any given time. So if that queue is smaller, it has to make more trips to the broker, effectively slowing down how fast the messages can be written into Cassandra.
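    For illustration, here is a minimal sketch of the consumer side of approach 2, assuming the Pulsar Java client and the 3.x DataStax Java driver; the service URL, contact point, topic, subscription name, keyspace, table, and the receiverQueueSize of 200 are all hypothetical values to tune.

    ```java
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    import org.apache.pulsar.client.api.Consumer;
    import org.apache.pulsar.client.api.Message;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.Schema;

    public class ThrottledCassandraConsumer {

        public static void main(String[] args) throws Exception {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("my_keyspace");
                 PulsarClient client = PulsarClient.builder()
                         .serviceUrl("pulsar://localhost:6650")
                         .build();
                 Consumer<String> consumer = client.newConsumer(Schema.STRING)
                         .topic("cassandra-writes")            // hypothetical topic
                         .subscriptionName("cassandra-writer") // hypothetical subscription
                         // The main tuneable: a smaller receiver queue prefetches fewer
                         // messages from the broker, slowing down how fast they can be
                         // written into Cassandra.
                         .receiverQueueSize(200)
                         .subscribe()) {

                PreparedStatement insert =
                        session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)");

                while (true) {
                    Message<String> msg = consumer.receive(); // blocks until a message arrives
                    try {
                        session.execute(insert.bind(msg.getMessageId().toString(), msg.getValue()));
                        consumer.acknowledge(msg);
                    } catch (Exception e) {
                        // Ask the broker to redeliver later rather than losing the write.
                        consumer.negativeAcknowledge(msg);
                    }
                }
            }
        }
    }
    ```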

    But still, on the consumer side you need some way to avoid putting too much pressure on Cassandra, so from the backpressure point of view this solution seems to me to just move the problem.

    You're not wrong on this one. It's a tradeoff for sure, but the idea is that some of the problem is solved by the message brokers effectively acting as a hand brake on write throughput. Of course, the overflow is handled by either the messaging servers or the consumers.