Tags: cassandra, scylla

system.size_estimates estimate seems much lower than the real table size


I have created a simple schema in Scylla with just one table, chunks(id text, size bigint), and filled it with a bulk upload of about 30M rows. Using dsbulk count I get the right row count:

    dsbulk-1.11.0/bin/dsbulk count -k uzzstore -t chunks
    Operation directory: /Downloads/logs/COUNT_20230927-100252-630941
         total | failed | rows/s |  p50ms |  p99ms | p999ms
    30,593,909 |      0 | 53,207 | 363.11 | 562.04 | 746.59

If instead I run the query

select sum(partitions_count) from system.size_estimates where keyspace_name='keyspace' and table_name='chunks';

which I understand should return an estimate of the table's size, refreshed every 5 minutes, I get only 9069368...

Is that the expected behavior? If not, any hints on how to fix it?


Solution

  • You didn't say how many nodes your cluster has. The first thing you should know is that the partitions_count estimate is local to one Scylla node: it only estimates the number of partitions held by that node. Moreover, it only covers the token ranges for which this node is the primary owner. If you have N Scylla nodes, each node is the primary owner of roughly 1/N of the data. So if you have 30 million partitions and 3 nodes, you'd expect each node to report an estimate of about 10 million, close to what you saw.

    The second thing you should know is that the partition count estimator is very naive, in both Scylla and Cassandra. In particular, if the same partition appears in several sstables it is counted once per sstable, so overwrite-heavy workloads get estimates higher than reality. Partition deletions are another source of over-estimates: a deleted partition is still counted as a partition.

    Finally, in the simple case (each write goes to a unique partition and existing partitions are never modified later), Cassandra does provide very accurate estimates for large tables, but unfortunately Scylla does not. This is tracked in the Scylla bug tracker as https://github.com/scylladb/scylladb/issues/9083, and may explain why you saw an estimate of 9 million rather than 10 million partitions.

    Note: unfortunately, I can't find any official documentation of the above, in either Cassandra or Scylla.
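A small Python model can make the per-node arithmetic and the over-counting effects above concrete. It is only a sketch under stated assumptions: a 3-node cluster (the question doesn't give the node count), one partition per row counted by dsbulk, and a toy in-memory sstable model. `ToySSTable` and all function names are hypothetical, and this is not how Scylla or Cassandra actually implement the estimator. It also does not model the underestimate tracked in issue 9083.

```python
from dataclasses import dataclass, field


def expected_per_node_estimate(total_partitions: int, num_nodes: int) -> float:
    # Each node reports only the token ranges it is primary owner of,
    # roughly 1/N of the ring, so it should see about total / N partitions.
    return total_partitions / num_nodes


def relative_shortfall(observed: int, total_partitions: int, num_nodes: int) -> float:
    # Fraction by which one node's reported estimate falls short of total / N.
    expected = expected_per_node_estimate(total_partitions, num_nodes)
    return (expected - observed) / expected


@dataclass
class ToySSTable:
    # Hypothetical toy model, not Scylla's data structure:
    # partition key -> True if this sstable holds a partition deletion for it.
    partitions: dict[str, bool] = field(default_factory=dict)


def naive_partition_estimate(sstables: list[ToySSTable]) -> int:
    # Sum of per-sstable partition counts: a key present in several sstables
    # is counted once per sstable, and deleted partitions still count.
    return sum(len(s.partitions) for s in sstables)


def live_partition_count(sstables: list[ToySSTable]) -> int:
    # sstables ordered oldest to newest; the newest entry for each key wins
    # (a simplification of real timestamp-based reconciliation).
    latest: dict[str, bool] = {}
    for s in sstables:
        latest.update(s.partitions)
    return sum(1 for deleted in latest.values() if not deleted)


if __name__ == "__main__":
    # Numbers from the question, ASSUMING 3 nodes and one partition per row.
    print(round(expected_per_node_estimate(30_593_909, 3)))        # 10197970
    print(f"{relative_shortfall(9_069_368, 30_593_909, 3):.1%}")   # 11.1%

    old = ToySSTable({"a": False, "b": False, "c": False})
    new = ToySSTable({"a": False, "b": True})  # overwrite "a", delete "b"
    print(naive_partition_estimate([old, new]))  # 5
    print(live_partition_count([old, new]))      # 2
```

Under the 3-node assumption, the reported 9069368 is about 11% below the ~10.2 million a perfect per-node estimate would give, while the toy sstables show how overwrites and deletions push the naive sum in the opposite direction.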