[SOLVED] Is a CQL COUNT() on a single partition also an expensive operation?

Is a CQL COUNT() on a single partition also an expensive operation?

I know Cassandra count() is an expensive operation as it needs a complete table scan. https://www.datastax.com/blog/running-count-expensive-cassandra

But let's say, we have a table hotel with hotel_type as the partition key and we run query

select count(*) from hotel where hotel_type= 'luxury';

Will this be expensive too? Actually, I need to run 1 million queries like this to get the count of different hotel_types. Will finding count can impact the prod Cassandra cluster?

Update:

I saw we can use dsbulk for the counting. How dsbulk count is different from cql count()?

Solution

You are correct that an unbounded COUNT() is an expensive operation in Cassandra for the reasons I've explained in this post -- https://dba.stackexchange.com/questions/314567/.

But counting the rows of a single partition (i.e. SELECT query is restricted with WHERE pk = ?) is a valid use case for the CQL COUNT() function.

Additionally, it should be fine executing a million read operations asynchronously provided you throttle the requests to a throughput your cluster can handle.

The count command in the DataStax Bulk Loader (DSBulk) tool is one the recommended ways for running a full table count because DSBulk optimises the operation by breaking it up into small token segments then fires off multiple distributed requests instead of a single full table scan.

If you're interested, here are some references with examples for using DSBulk:

Cheers!