Currently we have two options for backing up the tables in a Cassandra keyspace: we can use either nodetool commands or the COPY command from the cqlsh terminal.
1) What are the differences between these commands?
2) Which one is most appropriate?
3) When we use nodetool to take a backup, we generally flush the data from memtables to SSTables before issuing the nodetool snapshot command. Should we employ the same technique of flushing the data if we use the cqlsh COPY command?
Any help is appreciated.
Thanks very much.
GREAT question!
1) What are the differences between these commands?
Running nodetool snapshot creates hard links to the SSTable files of the requested keyspace. For each file, it's essentially the same as running this from the (Linux) command line:
ln {source} {link}
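To make the hard-link behavior concrete, here is a small sketch that runs in a scratch directory and needs no Cassandra at all. The file names are made up for illustration; data.db simply stands in for an SSTable file.

```shell
# Scratch directory so nothing real is touched.
workdir=$(mktemp -d)

# Stand-in for an SSTable file (hypothetical name and content).
echo "sstable-bytes" > "$workdir/data.db"

# Hard link: a second directory entry for the same data; nothing is copied.
ln "$workdir/data.db" "$workdir/snapshot.db"

# Remove the original name, as would happen once compaction discards an old SSTable.
rm "$workdir/data.db"

# The data is still reachable through the link.
cat "$workdir/snapshot.db"
```

This is why a snapshot is fast and initially takes almost no extra disk space: the space is only "used" by the snapshot once the live SSTable it points to has been removed.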
A cqlsh COPY is essentially the same as doing a SELECT * FROM on a table. It'll create a delimited text file (CSV by default) with the table's data, using whichever options you have specified.
In terms of their difference from a backup context, a file created using cqlsh COPY will contain data from all nodes, whereas nodetool snapshot needs to be run on each node in the cluster. In clusters where the number of nodes is greater than the replication factor, no single node holds all of the data, so each snapshot is only valid for the node on which it was taken.
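As a rough sketch of the per-node procedure, the script below flushes and then snapshots one keyspace. It would have to be run on every node. The keyspace name and tag are hypothetical, and the script defaults to a dry run that only prints the commands, so it is safe to try on a machine without Cassandra.

```shell
# Hypothetical names, adjust for your cluster.
KEYSPACE="my_keyspace"
TAG="backup_$(date +%Y%m%d)"

# DRY_RUN=1 (the default here) prints commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Write memtables to SSTables so the snapshot includes recent writes.
run nodetool flush "$KEYSPACE"

# Hard-link the keyspace's SSTables under a named snapshot tag.
run nodetool snapshot -t "$TAG" "$KEYSPACE"
```

With a default install, the snapshot files typically end up under the data directory in a path like /var/lib/cassandra/data/my_keyspace/(table directory)/snapshots/(tag); the exact data directory depends on your configuration.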
2) Which one is most appropriate?
It depends on what you're trying to do. If you simply need backups for a node/cluster, then nodetool snapshot is the way to go. If you're trying to export/import data into a new table or cluster, then COPY is the better approach.
Also worth noting, cqlsh COPY takes a while to run (depending on the amount of data in a table), and can be subject to timeouts if not properly configured. nodetool snapshot is nigh instantaneous, although the process of compressing and SCPing snapshot files to an off-cluster instance will take some time.
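For the export case, a COPY run with a raised client-side timeout might look like the sketch below. The keyspace, table, and output path are hypothetical, the 3600-second value is only an example, and the --request-timeout option assumes a reasonably recent cqlsh. As written it is a dry run that just prints the command.

```shell
# Hypothetical names, adjust for your schema.
KEYSPACE="my_keyspace"
TABLE="my_table"
OUT_FILE="/tmp/${TABLE}.csv"

# DRY_RUN=1 (the default here) prints the command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Export the table to CSV with a header row; the request timeout is in seconds.
run cqlsh --request-timeout=3600 \
  -e "COPY ${KEYSPACE}.${TABLE} TO '${OUT_FILE}' WITH HEADER = TRUE"
```

The matching import on the target cluster would use COPY ... FROM with the same file and HEADER = TRUE, after the table has been created there.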
3) Should we employ the same technique of flushing the data if we use the cqlsh COPY command?
No, that's not necessary. As cqlsh COPY works just like a SELECT, it follows the normal Cassandra read path, which checks structures both in RAM (memtables) and on disk (SSTables), so data that has not been flushed yet is still included.