I have a 5-node Cassandra cluster, and I have set the replication factor to 3.
My understanding is that with a replication factor of 3, the data is spread across 3 nodes depending on the coordinator node's availability. When I check individual nodes, the row counts differ. I loaded about 100k rows from a CSV file into Cassandra. Does this mean I have to add up the row counts from all nodes to get the result? I am using dsbulk to check the row count.
Am I missing something here?
With 5 nodes, an RF of 3, and 100k rows of raw data loaded, and assuming no dropped mutations, there is a grand total of 300k row replicas spread across the 5 nodes (RF of 3 x 100k rows).
You mention that the data is spread based on the coordinator node's availability, but which nodes hold the replicas is actually determined by the consistent hash of the row's partition key.
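As a rough illustration of both points (replica placement driven by the partition key hash, and 300k replicas in total), here is a simplified Python sketch. It is not how Cassandra is implemented: it assumes one token per node rather than vnodes, places replicas on the next nodes clockwise around the ring, and uses MD5 as a stand-in for the default Murmur3 partitioner. The node names and the `key-N` partition keys are made up.

```python
import bisect
import hashlib
from collections import Counter

NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical node names
RF = 3
ROWS = 100_000


def token(value):
    # Stand-in hash for illustration; Cassandra's default partitioner is Murmur3.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


# One token per node for simplicity (real clusters usually use many vnodes).
ring = sorted((token(name), name) for name in NODES)
ring_tokens = [t for t, _ in ring]


def replicas(partition_key):
    # Find the first node whose token is >= the key's token (wrapping around),
    # then take the next RF - 1 nodes clockwise on the ring.
    start = bisect.bisect_left(ring_tokens, token(partition_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(RF)]


def rows_per_node(rows):
    counts = Counter()
    for i in range(rows):
        counts.update(replicas(f"key-{i}"))  # hypothetical partition keys
    return counts


counts = rows_per_node(ROWS)
for node in NODES:
    print(node, counts[node])
print("total replicas:", sum(counts.values()))
```

In this sketch the per-node counts differ from each other, because each node owns a different slice of the ring, but they always add up to RF x rows = 300k. The coordinator plays no part in choosing which nodes store a row.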
Most likely, DSBulk is using its default consistency level of LOCAL_ONE (https://docs.datastax.com/en/dsbulk/docs/reference/driver-options.html#datastaxJavaDriverBasicRequestConsistency) and some mutations were dropped during the load. Change the consistency level to at least LOCAL_QUORUM and/or repair the cluster to bring it back to a consistent state.
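To see why the consistency level matters here, the usual rule of thumb is that a read is guaranteed to touch at least one replica that acknowledged the write only when (write replicas + read replicas) > RF. The small Python sketch below just does that arithmetic for RF = 3. It assumes a single datacenter, so LOCAL_ONE and LOCAL_QUORUM behave like ONE and QUORUM, and the function names are only illustrative.

```python
def quorum(rf):
    # A quorum is a majority of the replicas.
    return rf // 2 + 1


def overlap_guaranteed(write_replicas, read_replicas, rf):
    # The read set and write set must share a replica when their sizes exceed RF.
    return write_replicas + read_replicas > rf


RF = 3
levels = {"ONE": 1, "QUORUM": quorum(RF), "ALL": RF}

for w_name, w in levels.items():
    for r_name, r in levels.items():
        ok = overlap_guaranteed(w, r, RF)
        print(f"write={w_name:<6} read={r_name:<6} overlap guaranteed: {ok}")
```

Writing and reading at ONE gives 1 + 1 = 2, which is not greater than 3, so a count can miss rows that were dropped on the replica being read. Writing and reading at QUORUM gives 2 + 2 = 4, so every read overlaps with a replica that received the write.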