cassandra nodetool

How does nodetool garbagecollect remove data shadowed by tombstones from SSTables?


I have gone through the nodetool garbagecollect documentation, but it does not explain how this command removes data shadowed by tombstones when the data and the tombstones are spread across multiple, overlapping SSTables.


Solution

  • Behind the scenes, nodetool garbagecollect runs a single-SSTable compaction, which means that it does not merge multiple SSTable files. The compaction process runs on one SSTable file at a time, removing any eligible tombstones and rewriting the file.

    But, as per issues.apache.org/jira/browse/CASSANDRA-7019, overlapping sstables are used as a source of tombstones to filter out deleted content, and will not appear in the resulting single sstable compaction. Say the shadowed data is in sstable1 and its tombstone is in sstable2: how can nodetool garbagecollect remove the shadowed data in sstable1 without knowing about the tombstone in sstable2?

    In re-reading through that ticket's comments, it sounds like that scenario is indeed covered.

    will not appear in the resulting single sstable compaction.

    I don't think that's an entirely correct statement. From my understanding:

    Edit 20230306

    So I asked Branimir (the developer who wrote CASSANDRA-7019) to take a look at this discussion. He mentioned that the SSTables are processed in order of lowest timestamp first, so I've adjusted my points above to reflect this. He also offered the following notes to help clarify the process:

    Additional sstables are used in two ways:

    1. older sstables to see if a tombstone can be dropped (this is done for all compactions, not just garbagecollect).

    2. newer sstables to see if anything deletes data in the current one (this is garbagecollect-specific).

    The order is as follows:

    SStable1, having a lower min timestamp, is processed first. The single-sstable compaction should find, by (2), the tombstone in SStable2 and drop the covered data.

    SStable2 is processed next. Because we already processed SStable1, there should be no data found in (1) when the tombstone is processed, and the tombstone can be dropped.

    If we do normal compactions, there is no (2), so the data remains and (1) finds it when processing SStable2; i.e. neither the data nor the tombstone can be dropped.
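
    The ordering described above can be illustrated with a toy model. This is only a sketch in Python, not Cassandra's implementation: SSTable, compact_single and run are hypothetical names, each sstable is reduced to a pair of dicts keyed by partition key, and gc_grace is assumed to have already expired.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Toy sstable: key -> write timestamp (data), key -> deletion timestamp (tombstones)."""
    name: str
    data: dict = field(default_factory=dict)
    tombstones: dict = field(default_factory=dict)

    def min_timestamp(self):
        stamps = list(self.data.values()) + list(self.tombstones.values())
        return min(stamps) if stamps else float("inf")


def compact_single(target, others, use_newer_tombstones):
    """Rewrite one toy sstable in place, consulting the other sstables."""
    # (2) garbagecollect only: drop data deleted by a newer tombstone elsewhere.
    if use_newer_tombstones:
        for key in list(target.data):
            for other in others:
                deleted_at = other.tombstones.get(key)
                if deleted_at is not None and deleted_at > target.data[key]:
                    del target.data[key]
                    break
    # (1) all compactions: keep a tombstone while another sstable still holds data for its key.
    for key in list(target.tombstones):
        if not any(key in other.data for other in others):
            del target.tombstones[key]


def run(use_newer_tombstones):
    s1 = SSTable("sstable1", data={"k": 10})        # shadowed data
    s2 = SSTable("sstable2", tombstones={"k": 20})  # tombstone that deletes it
    # Process the sstable with the lowest min timestamp first.
    tables = sorted([s2, s1], key=SSTable.min_timestamp)
    for table in tables:
        others = [o for o in tables if o is not table]
        compact_single(table, others, use_newer_tombstones)
    return s1, s2


s1, s2 = run(use_newer_tombstones=True)   # garbagecollect-like behaviour
print(s1.data, s2.tombstones)             # {} {}

s1, s2 = run(use_newer_tombstones=False)  # normal compaction, no (2)
print(s1.data, s2.tombstones)             # {'k': 10} {'k': 20}
```

    With (2) enabled, the data goes first and the tombstone then has nothing left to cover; without it, both survive.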

    Note: (2) can also be turned on by setting provide_overlapping_tombstones: row or cell in the compaction options. It may slow down compaction considerably (esp. with STCS), so it's to be used with caution.
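
    As a rough illustration of where that option goes, the Python helper below renders an ALTER TABLE statement that sets it. The helper name, the keyspace and table names (my_ks, my_table) and the choice of SizeTieredCompactionStrategy are assumptions made for the example; only the option name and the values row and cell come from the note above.

```python
# The two values named in the note above.
VALID_TOMBSTONE_OPTIONS = {"row", "cell"}


def alter_compaction_cql(keyspace, table, tombstone_option,
                         strategy="SizeTieredCompactionStrategy"):
    """Build a CQL statement enabling provide_overlapping_tombstones (hypothetical helper)."""
    if tombstone_option not in VALID_TOMBSTONE_OPTIONS:
        raise ValueError(f"expected 'row' or 'cell', got {tombstone_option!r}")
    return (
        f"ALTER TABLE {keyspace}.{table} WITH compaction = "
        f"{{'class': '{strategy}', "
        f"'provide_overlapping_tombstones': '{tombstone_option}'}};"
    )


# my_ks and my_table are placeholder names.
print(alter_compaction_cql("my_ks", "my_table", "row"))
```

    The statement sets the compaction map as a whole, so any other compaction options the table already uses would presumably need to be restated alongside it.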

    Independently, any tombstone that is still within gc_grace won't be dropped, regardless of whether any data is found. Still, if garbagecollect managed to remove the data, a later normal compaction (after the grace period expires) should be able to drop the tombstone.
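
    The two conditions in that paragraph can be written as a single predicate. This is a simplified sketch in Python with hypothetical names; it assumes times are plain seconds and treats the result of check (1) as a boolean input rather than modelling how Cassandra computes it.

```python
def tombstone_droppable(deletion_time, gc_grace_seconds, now, shadowed_data_found):
    """Return True only if the grace period has passed and check (1) found no covered data."""
    past_grace = deletion_time < now - gc_grace_seconds
    return past_grace and not shadowed_data_found


GC_GRACE = 864000  # ten days in seconds, used here as an example value

# Still within gc_grace: kept even though no covered data was found.
print(tombstone_droppable(0, GC_GRACE, now=100, shadowed_data_found=False))

# After gc_grace, with the covered data already removed by garbagecollect.
print(tombstone_droppable(0, GC_GRACE, now=GC_GRACE + 1, shadowed_data_found=False))
```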

    Another note: (1) only uses the bloom filter to check if data is covered by a tombstone; i.e. if anything in the partition remains live, no tombstone can be dropped.