amazon-web-services cassandra amazon-keyspaces

Amazon Keyspaces - reading data at a high rate and getting PerConnectionRequestRateExceeded

We have an Amazon Keyspaces table with a few GBs of data. The table structure does not have any performance issues and works perfectly on our on-premise deployment of a Cassandra cluster.

However, when migrating to Amazon Keyspaces, we get PerConnectionRequestRateExceeded errors for this table in CloudWatch when we try to increase the load of the service to our production workload. The service which reads and writes from/to the database is a standard Java 17 Spring application, without any custom settings. We just basically bombard the service with 2-3k/sec HTTP GET requests that translate to reading from the database, and from this specific table in particular.

The issue we have is with the reads and the SELECT statements in particular. The error is reported to the client application as read timeout, which is pretty standard.

What we tried so far to mitigate it:

disabling hostname validation
increasing connection heartbeat in the Cassandra driver
increasing connection pool size
increasing instances of the client application (8 instances x 550 connections - 4400 connections, 3 instances per 1500 connections = 4500 connections)

We are talking about a peak of 2-3k HTTP req/sec. The only thing that is perhaps special in this case is that in most of those queries, we have IN statements. I read in the Keyspaces documentation that IN statements are basically translated like this:

SELECT * FROM table_name WHERE field IN ("A", "B")

this is translated to

SELECT * FROM table_name WHERE field = "A"
SELECT * FROM table_name WHERE field = "B"

In some cases we have around 100 elements in the IN statements, which I assume means that for each SELECT query, we have 100 queries to Keyspaces.

So for a load of 3000 req/sec, we should have about 300k CQL queries. For the amount of connections we have to Keyspaces (connection pool = 4400/4500), we should support a few million of CQL queries per second.

EDIT: A bit more info regarding the use case/table.

The table in question stores customer transactions, and each transaction has customer ID and transaction type (along with some additional unrelated data). We tried two approaches:

to partition the table by customerID, so each partition has transactions for the same customer
to partition the table by customerID and transaction type, so each partition has info about specific transactions types for a given customer

The problematic queries are in the case where we want to retrieve info about a certain customer, and a specific set of their transaction types. This is where the IN clause comes:

SELECT * FROM transactions_by_customer_id_and_type
WHERE customer_id = X AND type IN (k1, .., k100)

SELECT * FROM transactions_by_customer_id
WHERE customer_id = X AND type IN (k1, .., k100)

I'm trying to figure out what's missing here, or at least what else to try.

Thanks in advance.

Solution

Amazon Keyspaces can handle up to 3K CQL requests per second for each client connection. Once this limit is reached, the service returns PerConnectionRequestExceeded to the client.

Note that if you have a multi-region setup, replication counts towards the 3K rps limit.

Amazon recommends over-subscribing by increasing the number of client connections such that each connection only does 500 rps.

As a side note, you are correct that queries which use the IN() operator fans out to N requests where N is the number of keys in IN(). Our general recommendation is to only have 2 to 3 keys in a SELECT statement or it will otherwise put a lot of pressure on the coordinator having to fire off so many individual requests.

Cassandra is a designed for solving internet scale problems where application need to retrieve data at an extremely fast rate so reads are optimised for OLTP workloads with single-partition reads. Multi-partition reads indicative of (a) data not modelled correctly, or (b) an analytics workloads (or analytics-like) which should be run through Spark since it optimises data retrieval by breaking up queries into small sub-range requests that are distributed across multiple workers/executors.

For more info, see how client connections work in Amazon Keyspaces. Cheers!

UPDATE - Based on the additional info you provided, it is perfectly fine to use the IN() operator when the query is restricted to a single partition (SELECT .. FROM ... WHERE pk = ? AND ck IN (...)). So now, you just have to deal with the original problem which is that your all is hitting the max requests per connection of 3K requests/s. You need to modify your app and configure the driver so it uses more connections. Cheers!