cassandra, cassandra-2.0, cassandra-3.0, cassandra-2.1

What are the differences between backing up data using nodetool and the cqlsh COPY command?


Currently we have two options for backing up the tables in a Cassandra keyspace. We can either use nodetool commands or use the COPY command from the cqlsh terminal.

1) What are the differences between these commands ?

2) Which one is most appropriate ?

3) Also, if we are using nodetool to take a backup, we would generally flush the data from memtables to SSTables before we issue the nodetool snapshot command. So my question is: should we employ the same technique of flushing the data if we use the cqlsh COPY command?

Any help is appreciated.

Thanks very much.


Solution

  • GREAT question!

    1) What are the differences between these commands ?

    Running a nodetool snapshot creates hard links to the SSTable files of the requested keyspace. For each file, it's the same as running this from the (Linux) command line:

    ln {source} {link}
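
    To make the hard-link behaviour concrete, here is a minimal sketch that uses throwaway files rather than real SSTables. The file names are hypothetical stand-ins; the point is only that the linked name keeps the data reachable after the original name is removed, which is why a snapshot remains usable after Cassandra later deletes an SSTable it no longer needs.

    ```shell
    # Demonstrate hard-link behaviour with throwaway files (no Cassandra needed).
    workdir=$(mktemp -d)
    sstable="$workdir/example-Data.db"      # hypothetical stand-in for an SSTable file
    snapshot="$workdir/snapshot-Data.db"    # hypothetical stand-in for the snapshot entry

    printf 'row-data\n' > "$sstable"
    ln "$sstable" "$snapshot"               # same operation as: ln {source} {link}

    # -ef is true when both names refer to the same device and inode.
    if [ "$sstable" -ef "$snapshot" ]; then same_inode=yes; else same_inode=no; fi

    # Remove the original name; the data stays reachable through the link.
    rm "$sstable"
    survived=$(cat "$snapshot")

    echo "same_inode=$same_inode survived=$survived"
    ```

    Because a hard link is just a second name for the same data, creating it copies nothing, which is consistent with the snapshot being close to instantaneous.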
    

    A cqlsh COPY is essentially the same as doing a SELECT * FROM on a table. It'll create a text file with the table's data in whichever format you have specified.

    From a backup perspective, the main difference is that a file created using cqlsh COPY will contain data from all nodes, whereas nodetool snapshot needs to be run on each node in the cluster. In clusters where the number of nodes is greater than the replication factor, each snapshot will only be valid for the node on which it was taken.
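
    Since a snapshot has to be taken on every node, one way to sketch that is a loop over the cluster's hosts. The keyspace name, snapshot tag, and host list below are hypothetical placeholders, and using nodetool's -h option assumes remote JMX access is enabled on each node; otherwise the same command would be run locally on each node (for example over ssh). The sketch defaults to a dry run that only prints the commands.

    ```shell
    # Hypothetical values: replace with your own keyspace, tag and node addresses.
    keyspace="my_keyspace"
    tag="backup_demo"
    hosts="10.0.0.1 10.0.0.2 10.0.0.3"

    # Dry run by default: print each command instead of executing it.
    DRY_RUN=${DRY_RUN:-1}
    run() {
      if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
    }

    # One snapshot per node, all sharing the same tag so they can be collected together.
    plan=$(for h in $hosts; do
      run nodetool -h "$h" snapshot -t "$tag" "$keyspace"
    done)

    printf '%s\n' "$plan"
    ```

    Setting DRY_RUN=0 would execute the commands instead of printing them.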

    2) Which one is most appropriate ?

    It depends on what you're trying to do. If you simply need backups for a node/cluster, then nodetool snapshot is the way to go. If you're trying to export/import data into a new table or cluster, then COPY is the better approach.

    Also worth noting, cqlsh COPY takes a while to run (depending on the amount of data in a table), and can be subject to timeouts if not properly configured. nodetool snapshot is nigh instantaneous; although the process of compressing and SCPing snapshot files to an off-cluster instance will take some time.
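
    As a rough illustration of configuring those timeouts, the sketch below builds a cqlsh export command with a longer client request timeout and a longer COPY page timeout. The keyspace, table, output path, and timeout values are hypothetical, and the --request-timeout flag and PAGETIMEOUT option are assumed to be available in your cqlsh version (older 2.0/2.1 releases may not have them). It defaults to a dry run that prints the command rather than contacting a cluster.

    ```shell
    # Hypothetical values: replace with your own keyspace, table and output path.
    keyspace="my_keyspace"
    table="my_table"
    out="/tmp/my_table.csv"

    # COPY statement with a header row and a longer per-page timeout (seconds).
    cql="COPY ${keyspace}.${table} TO '${out}' WITH HEADER = TRUE AND PAGETIMEOUT = 60"

    # Dry run by default: print the command instead of executing it.
    DRY_RUN=${DRY_RUN:-1}
    run() {
      if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
    }

    # --request-timeout raises the client-side timeout for the cqlsh session.
    export_cmd=$(run cqlsh --request-timeout=120 -e "$cql")

    printf '%s\n' "$export_cmd"
    ```

    The resulting CSV could then be loaded into another table or cluster with COPY ... FROM, which fits the export/import use case described above.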

    3) Should we employ the same technique of flushing the data if we use the cqlsh copy command ?

    No, that's not necessary. As cqlsh COPY works just like a SELECT, it will follow the normal Cassandra read path, which will check structures both in RAM and on-disk.