When Cassandra does its data integrity check, it runs a validation compaction, but what does this mean exactly? My understanding is that it creates a single SSTable that is stored temporarily (until the repair finishes), and then generates the Merkle trees from that single created SSTable. If any of the Merkle tree's leaves fails validation, then the partitions used to create that leaf (from the SSTable created during the validation compaction) will be streamed to the other node. However, a friend told me that the Merkle trees are generated from each (previously existing) SSTable.
So, how many Merkle trees are generated, one or as many as SSTables?
The validation compaction iterates over all the sstables that are included in the range to build the merkle tree. It doesn't actually write a new sstable, but the compaction interfaces perform the same type of task (iterating over data), so they are reused. The compaction manager is also used for cleanup, secondary index rebuilds, MV building, scrubbing, and the verify process.
A single merkle tree is generated. Each node of it represents a hash of all the data in a token range, and each child of a node covers half of its parent's range. The depth of the tree is dynamic: ideally each leaf represents a single partition, but a leaf can end up representing far more if the root covers a wide range containing many partitions. Since the depth of the merkle tree is capped at 20 (otherwise it would be too large and cause issues when transferred between nodes), you generally don't want to repair a range that has much more than 2^20, or about one million, partitions in it. You can use getsplits or the system.size_estimates table to determine this when choosing how to subdivide the range for a subrange repair.
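To make the mechanism concrete, here is a minimal sketch (not Cassandra's actual implementation) of a depth-capped merkle tree over a token range, plus the comparison step that decides which ranges to stream. All names are hypothetical; `partitions` stands in for the data the validation compaction iterates over:

```python
import hashlib

MAX_DEPTH = 20  # Cassandra caps tree depth so the serialized tree stays small


def merkle_tree(partitions, lo, hi, depth=0):
    """Sketch of a merkle tree over token range [lo, hi).

    `partitions` maps token -> partition bytes. Each node hashes all data
    in its range; each child covers half the parent's range. Splitting
    stops at a single partition or at the depth cap, so one leaf may
    cover many partitions on dense ranges.
    Returns (hex_hash, left_subtree, right_subtree); leaves have None children.
    """
    in_range = {t: d for t, d in partitions.items() if lo <= t < hi}
    if depth == MAX_DEPTH or len(in_range) <= 1:
        h = hashlib.sha256()
        for t in sorted(in_range):
            h.update(str(t).encode() + in_range[t])
        return (h.hexdigest(), None, None)
    mid = (lo + hi) // 2
    left = merkle_tree(partitions, lo, mid, depth + 1)
    right = merkle_tree(partitions, mid, hi, depth + 1)
    combined = hashlib.sha256((left[0] + right[0]).encode()).hexdigest()
    return (combined, left, right)


def diff_ranges(a, b, lo, hi):
    """Compare two trees; return the token ranges whose hashes differ.

    These are the ranges whose partitions would be streamed during repair.
    """
    if a[0] == b[0]:
        return []
    if a[1] is None or b[1] is None:  # leaf mismatch: stream this whole range
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return diff_ranges(a[1], b[1], lo, mid) + diff_ranges(a[2], b[2], mid, hi)
```

For example, two replicas that agree on token 5 but disagree on token 42 produce trees that match everywhere except the narrow leaf range containing 42, so only that range is streamed:

```python
tree_a = merkle_tree({5: b"row5", 42: b"row42"}, 0, 128)
tree_b = merkle_tree({5: b"row5", 42: b"row42-stale"}, 0, 128)
mismatched = diff_ranges(tree_a, tree_b, 0, 128)  # range containing token 42
```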
Worth noting that a repair can kick off many sub repairs; each will have its own validation compaction, merkle tree, and streaming session.
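As a rough illustration of picking subranges for a subrange repair, here is a hedged sketch that splits a token range so each piece holds at most ~2^20 partitions, given a partition-count estimate (e.g. from system.size_estimates). It assumes tokens are roughly uniformly distributed across the range, which is a reasonable approximation with Murmur3Partitioner; the function name is hypothetical:

```python
import math


def subdivide(range_start, range_end, est_partitions, max_partitions=2**20):
    """Split token range [range_start, range_end) into equal subranges so
    each is estimated to hold <= max_partitions partitions.

    Assumes partitions are uniformly spread over the token range, so an
    equal token split gives a roughly equal partition split.
    Returns a list of (start, end) subrange bounds covering the input range.
    """
    n = max(1, math.ceil(est_partitions / max_partitions))
    step = (range_end - range_start) / n
    bounds = [round(range_start + i * step) for i in range(n)] + [range_end]
    return list(zip(bounds[:-1], bounds[1:]))
```

A range estimated at 5 million partitions, for instance, would be split into 5 subranges, each of which could then be repaired with its own `nodetool repair -st <start> -et <end>` invocation.

```python
subranges = subdivide(0, 3_000_000_000, est_partitions=5_000_000)
```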