I know that Cassandra's `count()` is an expensive operation, as it needs a complete table scan: https://www.datastax.com/blog/running-count-expensive-cassandra
But let's say we have a table `hotel` with `hotel_type` as the partition key, and we run this query:

```
select count(*) from hotel where hotel_type = 'luxury';
```

Will this be expensive too? I actually need to run 1 million queries like this to get the counts for the different hotel types. Will computing these counts impact the production Cassandra cluster?
Update:
I saw that we can use `dsbulk` for the counting. How is `dsbulk count` different from the CQL `count()`?
You are correct that an unbounded `COUNT()` is an expensive operation in Cassandra, for the reasons I've explained in this post -- https://dba.stackexchange.com/questions/314567/.

But counting the rows of a single partition (i.e. a `SELECT` query restricted with `WHERE pk = ?`) is a valid use case for the CQL `COUNT()` function.
Additionally, it should be fine to execute a million read operations asynchronously, provided you throttle the requests to a throughput your cluster can handle.
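As a minimal sketch of client-side throttling, you can cap the number of in-flight requests with a bounded worker pool. Here `fetch_count` is a hypothetical placeholder for the real driver call (e.g. a `SELECT count(*) ... WHERE hotel_type = ?` executed via the DataStax Python driver); only the throttling pattern is the point:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real driver call, e.g.
#   session.execute("SELECT count(*) FROM hotel WHERE hotel_type = %s", [ht])
# Here it just returns a dummy number so the sketch is self-contained.
def fetch_count(hotel_type):
    return len(hotel_type)  # placeholder result for illustration

hotel_types = ["luxury", "budget", "boutique"]

# max_workers caps concurrency, so the cluster never sees more than
# 32 of these partition-count reads in flight at once.
with ThreadPoolExecutor(max_workers=32) as pool:
    counts = dict(zip(hotel_types, pool.map(fetch_count, hotel_types)))

print(counts)
```

With the real driver you would typically use `execute_async()` and apply the same cap (e.g. with a semaphore) so a million queries drip-feed into the cluster instead of arriving all at once.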
The `count` command in the DataStax Bulk Loader (DSBulk) tool is one of the recommended ways of running a full table count, because DSBulk optimises the operation by breaking it up into small token-range segments and firing off multiple distributed requests instead of a single full table scan.
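For example, a full-table count with DSBulk looks something like this (the keyspace name and contact point below are placeholders for your own values):

```shell
# Counts rows by scanning small token ranges in parallel,
# rather than issuing one coordinator-driven full scan.
dsbulk count -k mykeyspace -t hotel -h '10.0.0.1'
```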
If you're interested, here are some references with examples for using DSBulk:
Cheers!