Riak Cluster Backup

We have a five-node Riak cluster (n_val is 3) running on Amazon EC2, spread across multiple availability zones. Since we don't have the Enterprise edition, we do not have the luxury of multi-datacenter replication and a full sync to a different zone/region.

Our current backup strategy is this:

1. SSH to each node in the cluster, one node at a time
2. Stop the Riak service using riak stop (because we are using the LevelDB backend)
3. Issue an EBS snapshot for the data volume that holds the Riak data
4. Start the Riak service using riak start
5. Move on to the next node and repeat the above steps
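
The steps above can be sketched roughly as follows. This is only an illustration: the host names and volume IDs are placeholders, it assumes passwordless SSH to each node and an AWS CLI that is already configured on the machine running the script, and the function names (backup_node, backup_cluster) are made up for this example. The command runner is passed in so the sequence can be dry-run before it touches a real cluster.

```python
from typing import Callable, Iterable, List, Tuple

# A runner takes one command (as an argument list) and executes it.
Runner = Callable[[List[str]], None]


def backup_node(host: str, volume_id: str, run: Runner) -> None:
    """Stop Riak on one node, request an EBS snapshot, then start Riak again."""
    run(["ssh", host, "riak", "stop"])
    try:
        run([
            "aws", "ec2", "create-snapshot",
            "--volume-id", volume_id,
            "--description", f"riak backup {host}",
        ])
    finally:
        # Bring the node back even if the snapshot request failed.
        run(["ssh", host, "riak", "start"])


def backup_cluster(nodes: Iterable[Tuple[str, str]], run: Runner) -> None:
    """Back up (host, volume_id) pairs strictly one node at a time."""
    for host, volume_id in nodes:
        backup_node(host, volume_id, run)
```

For a real run, the runner could be something like lambda cmd: subprocess.run(cmd, check=True); for a dry run, print is enough.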

I have tested this approach on a three-node test cluster that doesn't have much live activity, and recovered from the snapshots without an issue. I would like to understand from the experts here whether this approach is valid for a production cluster with heavy activity. Will we run into any issues related to handoffs when shutting a node down and starting it again? Is there something else I am unaware of at the moment that might hamper our chances of recovery when a disaster occurs?

Thanks in advance!

Solution

The backup documentation states that:

Riak backups can be performed using OS features or filesystems that support snapshots, such as LVM or ZFS, or by using tools like rsync or tar

I've never used EBS snapshots, but I'm pretty sure they can be considered a "filesystem that supports snapshots".

So, as long as you shut down each node before backing it up, you should be good.

About handoffs: I'd recommend that after you have backed up node A, and before backing up the next node B, you wait until all the hinted handoffs created while A was down have been transferred back to A.
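
One way to wait for handoffs is to poll riak-admin transfers between nodes. The sketch below is untested against a real cluster and rests on an assumption: that the command's output contains the phrase "waiting to handoff" while partitions are still pending. Check that against the output of your Riak version. The function names and the polling defaults are made up for illustration.

```python
import time
from typing import Callable


def handoffs_pending(transfers_output: str) -> bool:
    """Return True if the text still reports partitions waiting to be handed off.

    Assumption: `riak-admin transfers` prints lines such as
    "'riak@10.0.0.2' waiting to handoff 3 partitions" while handoffs are pending.
    """
    return "waiting to handoff" in transfers_output


def wait_for_handoffs(
    get_output: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no handoffs are pending; return False if max_polls is exhausted."""
    for _ in range(max_polls):
        if not handoffs_pending(get_output()):
            return True
        sleep(poll_seconds)
    return False
```

Here get_output could wrap subprocess.run(["ssh", host, "riak-admin", "transfers"], capture_output=True, text=True, check=True).stdout; move on to node B only when the function returns True.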

Be careful not to treat the backups of the individual nodes as a "backup of the entire cluster". Each node is backed up individually. If your cluster is under heavy write load, and you wait for handoffs to be transferred between backups, then you can't consider your node backups to have been taken at the same point in time.

It's not a big deal: when you restore a node from a backup, you can trigger read-repairs, or wait for AAE to fix the data for you. You might want to configure AAE to be more aggressive when you've restored nodes from backup.
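
As an illustration of triggering read-repair, reading an object through Riak's HTTP interface (GET /buckets/<bucket>/keys/<key>) gives Riak a chance to repair stale replicas of that object. The sketch below simply reads a list of keys. It assumes the caller already knows which keys to touch (listing all keys in a bucket is expensive) and that the HTTP interface is reachable at the given base URL; the function names are hypothetical.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen


def object_url(base_url: str, bucket: str, key: str) -> str:
    """Build the HTTP URL for one object, percent-encoding bucket and key."""
    return (
        f"{base_url.rstrip('/')}/buckets/{quote(bucket, safe='')}"
        f"/keys/{quote(key, safe='')}"
    )


def http_get_status(url: str) -> int:
    """GET the URL and return the HTTP status code (e.g. 200, 300 or 404)."""
    try:
        with urlopen(url) as response:
            response.read()
            return response.status
    except HTTPError as err:
        return err.code


def touch_keys(
    base_url: str,
    bucket: str,
    keys: Iterable[str],
    fetch: Callable[[str], int] = http_get_status,
) -> Dict[str, int]:
    """Read each key once and return the status code seen for each key."""
    return {key: fetch(object_url(base_url, bucket, key)) for key in keys}
```

For example, touch_keys("http://127.0.0.1:8098", "users", known_keys) would read every key in known_keys from the restored node's HTTP port; the address and bucket name here are placeholders.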