Tags: cassandra, scylla

Scylla - Two nodes with RF 2 not having the same data?


I have two nodes and I created a keyspace like this:

DESCRIBE uzzstore

CREATE KEYSPACE uzzstore WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'}  AND durable_writes = true;

CREATE TABLE uzzstore.chunks (
    id blob PRIMARY KEY,
    size bigint
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

Currently only node 1 receives queries (reads/writes), and from what I understand from the documentation, all writes should be replicated; therefore I assume both nodes will have the same data. I added the second node later and have flushed and repaired the nodes multiple times. However, I see node 1 has about 213,435,988 rows and node 2 only 206,916,617 rows.

nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns    Host ID                               Rack
UN  192.168.1.51   17.5 GB    256          ?       15450683-e34b-475d-a393-ad25611398d8  rack1
UN  192.168.1.100  17.92 GB   256          ?       6cad2ba2-b22e-4947-a952-dc65c616a08f  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

Is this expected behaviour? Is my understanding of replicas incorrect? (Note that I gave the cluster plenty of time to catch up.)


Solution

  • You're right: with a two-node cluster and a keyspace with replication_factor 2, every piece of data will be on both nodes, and every write will "eventually" be replicated to both. If you write with CL=ALL, you can be sure this has happened by the time the write completes; even with CL=ONE the write will still reach the second node eventually - usually very quickly. And after a repair (which you said you did), you can be sure the same data appears on both nodes, so both nodes should have exactly the same number of rows.
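
    For example, the consistency level can be set per session in cqlsh; a minimal sketch (the blob values below are made up, not from the original post):

        -- require acknowledgement from both replicas before the write returns
        CONSISTENCY ALL;
        INSERT INTO uzzstore.chunks (id, size) VALUES (0x0102, 42);

        -- return after a single replica acknowledges; the second replica is
        -- written asynchronously and reconciled by repair if it was missed
        CONSISTENCY ONE;
        INSERT INTO uzzstore.chunks (id, size) VALUES (0x0304, 43);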

    Yet, you said "I see node 1 has about 213,435,988 rows and node 2 only 206,916,617 rows." How sure are you about these numbers, and how did you get them? Did you really scan the table (and if so, how did you limit the scan to just one node?), or did you use some sort of "size estimate" feature? If it's the latter, you should be aware that in both Cassandra and Scylla this is only an estimate. It turns out that this estimate is even less accurate and trustworthy in ScyllaDB than in Cassandra (see https://github.com/scylladb/scylladb/issues/9083), but in both of them, whether or not you did a major compaction (nodetool compact) affects the estimate. You said that you "flushed and repaired" the table, but not that you compacted it.
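
    If the numbers came from an estimate, you can check what each node reports and then trigger a major compaction; a sketch using standard nodetool commands, run on each node (the exact wording of the stats line may vary by version, and cfstats is also known as tablestats in newer releases):

        # estimated partition count as seen by this node (an estimate, not an exact count)
        nodetool cfstats uzzstore.chunks | grep -i 'Number of partitions'

        # major compaction of the table, which typically tightens that estimate
        nodetool compact uzzstore chunks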

    In any case I want to emphasize again that even though compaction affects the estimate of the number of partitions, it doesn't have any affect on the correctness or the data or the accurate number of rows you see if you'll scan the entire table with SELECT * FROM table or count them with SELECT COUNT(*) FROM table. A repair might be needed if hinted handoff wasn't enabled and your cluster had connectivity problems during the write - but since you did say you did repair, you should be good.