
RedShift Node Failover


I have a RedShift cluster of 4 nodes.

  1. When one of the nodes goes down, will the entire cluster become unavailable?
  2. If yes - for how long?
  3. When the cluster comes back, is it returned to exactly the same point it was at before the failure, or may the data be rolled back to an S3 snapshot from a few hours ago?
  4. How can I simulate this situation to check this scenario by myself?

Thanks a lot!


Solution

  • If a single node fails, Amazon starts a replacement node and streams the data to it from the other nodes (in a multi-node cluster, each block is written to two different nodes). In that case you can expect:

    1. Downtime of the entire cluster until the new node has started and been filled with the database data. This should take about 3-4 minutes.
    2. After those 3-4 minutes, the cluster returns to exactly the point it was at before it went down, and is available for both reads and writes.
    3. Some slowdown will be experienced due to data redistribution in the cluster.

    If more than one node fails, Redshift restores itself from the latest S3 backup. S3 backups are taken on the following occasions:

    1. If it's been 8 hours since the last backup
    2. If RedShift was filled with more than 5 GB of data since the last backup
    3. Manually
    4. You have the option of a final snapshot when you choose to terminate your cluster
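
To make the rules above easier to reason about, here is a minimal Python sketch of them. It is only a model of what this answer describes, not Redshift's actual logic and not an AWS API: the function names, the constants, and the returned labels are all made up for illustration, and the 8 hour / 5 GB thresholds are simply copied from the list above.

```python
# Thresholds as stated in the answer above; treat them as assumptions,
# not as values read from AWS.
SNAPSHOT_INTERVAL_HOURS = 8
SNAPSHOT_DATA_THRESHOLD_GB = 5


def recovery_path(failed_nodes: int) -> str:
    """Return the recovery route described above for a number of failed nodes.

    The labels are hypothetical names used only in this sketch.
    """
    if failed_nodes < 0:
        raise ValueError("failed_nodes must be non-negative")
    if failed_nodes == 0:
        return "none"
    if failed_nodes == 1:
        # Replacement node is filled from the other nodes; no data rollback.
        return "replace-node"
    # More than one node lost: fall back to the latest S3 backup.
    return "restore-from-snapshot"


def automatic_snapshot_due(hours_since_last: float, gb_written_since_last: float) -> bool:
    """True if either automatic backup condition from the list above is met."""
    return (
        hours_since_last >= SNAPSHOT_INTERVAL_HOURS
        or gb_written_since_last > SNAPSHOT_DATA_THRESHOLD_GB
    )


if __name__ == "__main__":
    print(recovery_path(1))
    print(recovery_path(2))
    print(automatic_snapshot_due(hours_since_last=3, gb_written_since_last=6))
```

Under this model, the worst case for a multi-node failure is losing whatever was written since the last backup, which by the conditions above would be at most roughly 8 hours or 5 GB of changes, whichever limit is reached first.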