[SOLVED] How to effectively simulate a zonal outage for a regional MIG in GCP so that the attached VM is rebuilt in the remaining zone during the outage?

Our team has recently been trying to run a DR proving exercise on the VMs attached to a regional MIG in a GCP project.

We followed Google's documentation (https://cloud.google.com/compute/docs/instance-groups/regional-mig-simulate-zonal-outage) for simulating a zonal loss for a regional MIG using a failure script, in which we delete the instance every time it tries to rebuild after boot.

While we simulate the zonal outage on the VM attached to the regional MIG, the MIG tries to rebuild the VM in the primary (impacted) zone instead of the remaining zone. Ideally, that would not be the case during an actual outage.

The VMs were created from an instance template. Autoscaling and autohealing are not configured on the MIG. The target distribution shape is EVEN.

Our regional MIG is deployed in two zones (europe-west2-b and europe-west2-a), with europe-west2-b as the primary zone, so during a zonal outage the VM should fail over to europe-west2-a. However, that is not happening here.

Are there any other recommendations for DR proving exercises on regional MIGs?

Solution

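There are currently two issues:

It is not really possible to simulate a zone failure properly. You can simulate failing VMs to check whether your application can handle the loss, but Compute Engine treats those failures as customer activity rather than as a zonal outage, which is why the MIG recreates the VM in the same zone.
Currently, only autoscaled MIGs have built-in functionality to recover VMs in a different zone. Your MIG has no autoscaler configured, so it does not do this.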

The approach taken in https://cloud.google.com/compute/docs/instance-groups/regional-mig-simulate-zonal-outage is to check whether you are already overprovisioned today: if all the VMs in one of the zones stop serving correctly, does your workload still have enough capacity in the remaining zones?
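
To make that overprovisioning check concrete, here is a minimal sketch in Python. It does not call any GCP API. The function name `survives_zone_loss`, the per-zone VM counts, and the `required_vms` threshold are all hypothetical and only illustrate the arithmetic; you would plug in your own numbers.

```python
from typing import Dict


def survives_zone_loss(vms_per_zone: Dict[str, int], required_vms: int) -> Dict[str, bool]:
    """For each zone, report whether the VMs left in the other zones
    still meet the required count if that zone is lost entirely.

    Hypothetical helper: the counts and the threshold are assumptions
    supplied by the caller, not values read from Compute Engine.
    """
    total = sum(vms_per_zone.values())
    return {
        zone: (total - count) >= required_vms
        for zone, count in vms_per_zone.items()
    }


if __name__ == "__main__":
    # Assumed example: two VMs per zone, and the workload needs two serving VMs.
    group = {"europe-west2-a": 2, "europe-west2-b": 2}
    for zone, ok in survives_zone_loss(group, required_vms=2).items():
        print(f"loss of {zone}: {'enough capacity' if ok else 'NOT enough capacity'}")
```

With one VM per zone and a requirement of two serving VMs, the same check reports that neither zone can be lost. Since a non-autoscaled MIG will not recreate VMs in the surviving zone, the group has to be sized for this up front.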