countcassandradsbulk

Is a CQL COUNT() on a single partition also an expensive operation?


I know Cassandra count() is an expensive operation as it needs a complete table scan. https://www.datastax.com/blog/running-count-expensive-cassandra

But let's say, we have a table hotel with hotel_type as the partition key and we run query

select count(*) from hotel where hotel_type= 'luxury';

Will this be expensive too? Actually, I need to run 1 million queries like this to get the count of different hotel_types. Will finding count can impact the prod Cassandra cluster?

Update:

I saw we can use dsbulk for the counting. How dsbulk count is different from cql count()?


Solution

  • You are correct that an unbounded COUNT() is an expensive operation in Cassandra for the reasons I've explained in this post -- https://dba.stackexchange.com/questions/314567/.

    But counting the rows of a single partition (i.e. SELECT query is restricted with WHERE pk = ?) is a valid use case for the CQL COUNT() function.

    Additionally, it should be fine executing a million read operations asynchronously provided you throttle the requests to a throughput your cluster can handle.

    The count command in the DataStax Bulk Loader (DSBulk) tool is one the recommended ways for running a full table count because DSBulk optimises the operation by breaking it up into small token segments then fires off multiple distributed requests instead of a single full table scan.

    If you're interested, here are some references with examples for using DSBulk:

    Cheers!