storage google-compute-engine reliability solid-state-drive durability

What happens to Local SSD data if the entire zone loses power?


What happens to data on Local SSD if the entire Google data center suffers a cataclysmic loss of power? When the Compute Engine instance eventually comes back online, will it still have the data on the Local SSD? Planned downtime seems to be handled just fine:

No planned downtime: Local SSD data will not be lost when Google does datacenter maintenance, even without replication or redundancy. We will use our live migration technology to move your VMs along with their local SSD to a new machine in advance of any planned maintenance, so your applications are not disrupted and your data is not lost.

But I'm concerned about unplanned downtime. Disk failure is an ever-present risk, but if you combine Local SSD with replication, you can protect against that. However, I'm trying to guard against correlated failure, where e.g. the whole region goes dark. Then the in-memory replicated data is lost, but does the data fsynced to the local SSD likely survive when the instances come back up? If it does not, then fsyncing data to local SSD doesn't buy any more safety than keeping it in RAM, e.g. when running a local database instance.


Solution

  • As an aside, please note that Google data center power supplies are redundant and have backup power generators in case of correlated power supply failures:

    Powering our data centers

    To keep things running 24/7 and ensure uninterrupted services, Google’s data centers feature redundant power systems and environmental controls. Every critical component has a primary and alternate power source, each with equal power. Diesel engine backup generators can provide enough emergency electrical power to run each data center at full capacity. Cooling systems maintain a constant operating temperature for servers and other hardware, reducing the risk of service outages. Fire detection and suppression equipment helps prevent damage to hardware. Heat, fire, and smoke detectors trigger audible and visible alarms in the affected zone, at security operations consoles, and at remote monitoring desks.

    Back to your questions. You asked:

    Then the in-memory replicated data is lost, but does the data fsynced to the local SSD likely survive when the instances come back up?

    Per the local SSD documentation (emphasis in the original):

    [...] local SSD storage is not automatically replicated and all data can be lost in the event of an instance reset, host error, or user configuration error that makes the disk unreachable. Users must take extra precautions to back up their data.

    If all of the above protections fail, a power outage would be equivalent to an instance reset, which may render local SSD volumes inaccessible: the VM is very likely to restart on a different physical machine, and if it does, the data is effectively lost, because it is unreachable and gets wiped.

    Thus, you should treat local SSD data as being just as transient as RAM.
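One way to act on this is to stop treating a local fsync as the point at which a write becomes durable. The following is only a minimal Python sketch under stated assumptions: `durable_append`, `write_local_fsync`, and the `replicas` callables are hypothetical names invented for illustration and are not part of any Google API. Each replica is assumed to be a function that ships the payload to a machine in another failure domain and returns True on acknowledgement.

```python
import os
import tempfile


def write_local_fsync(path: str, payload: bytes) -> None:
    """Append payload to a local file and fsync it (fast, but not durable on Local SSD)."""
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def durable_append(path, payload, replicas, min_acks=1) -> bool:
    """Report a write as durable only after min_acks remote replicas confirm it.

    `replicas` is a list of hypothetical callables; each takes the payload and
    returns True once a remote copy has been stored.
    """
    write_local_fsync(path, payload)
    acks = 0
    for send in replicas:
        try:
            if send(payload):
                acks += 1
        except OSError:
            pass  # an unreachable replica simply does not count as an ack
    return acks >= min_acks


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
    # With no replicas, the local fsync alone is never reported as durable.
    print(durable_append(log_path, b"record-1\n", replicas=[]))
```

The local write still happens, so reads stay fast, but the caller only hears "durable" once a copy exists somewhere that an instance reset cannot wipe.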


    You also asked:

    However, I'm trying to guard against correlated failure, where e.g. the whole region goes dark.

    If you want to protect against a zone outage, replicate across multiple zones in a region. If you want to protect against an entire region outage, replicate to other regions. If you want to protect against correlated region failures, replicate to even more regions.
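The rule above can be sketched as a quick placement check. This is an illustrative Python sketch, not an official tool: the helper names are hypothetical, and it assumes Compute Engine zone names of the form `<region>-<letter>`, such as `us-central1-a`, so that the region is everything before the last hyphen.

```python
def region_of(zone: str) -> str:
    """Derive the region from a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, sep, suffix = zone.rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def survives_zone_outage(replica_zones) -> bool:
    """True if replicas sit in at least two distinct zones."""
    return len(set(replica_zones)) >= 2


def survives_region_outage(replica_zones) -> bool:
    """True if replicas sit in at least two distinct regions."""
    return len({region_of(z) for z in replica_zones}) >= 2


if __name__ == "__main__":
    placement = ["us-central1-a", "us-central1-b"]
    print(survives_zone_outage(placement), survives_region_outage(placement))
```

A placement that spans two zones of one region passes the first check but fails the second, which is exactly the gap the question about a whole region going dark is probing.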

    You can also store snapshots of your data in Google Cloud Storage, which provides a high level of durability:

    Google Cloud Storage is designed for 99.999999999% durability; multiple copies, multiple locations with checksums and cross region striping of data.
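To get a feel for what eleven nines means, here is a rough back-of-the-envelope calculation in Python. It rests on assumptions the quote does not state: that the figure applies per object, per year, and that losses are independent. The function name `expected_losses` is made up for this illustration.

```python
from fractions import Fraction

# 99.999999999% expressed exactly; ASSUMED to be per object, per year.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_losses(num_objects: int, years: int = 1) -> Fraction:
    """Expected number of objects lost over `years`, assuming independent losses."""
    survival = DURABILITY ** years
    return num_objects * (1 - survival)


if __name__ == "__main__":
    # One billion stored objects over one year.
    print(float(expected_losses(10**9)))
```

Under those assumptions, a billion objects would see an expected loss of about one hundredth of an object per year, which is a very different risk profile from a local SSD that is wiped on any instance reset.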