Tags: terraform, failover, redundancy

Use Terraform to create identical parallel infrastructure for failover purposes


I have a requirement to use Terraform to provision identical copies of the same infrastructure in different places for failover purposes. For example, I have two Kubernetes clusters, A and B, and I want to use Terraform to provision them both to the identical state. It would be as if there were one Terraform plan and two parallel applies, each targeting a different "destination".

Using provider aliases comes to mind, but that would require duplicating code for everything. Workspaces aren't a good fit either, because each set of infrastructure is a first-class citizen that should stay in sync with the other.
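
For illustration, the alias approach would look something like the following sketch, with every resource declared once per destination (the variable and resource names here are hypothetical):

variable "cluster_a_context" { type = string }
variable "cluster_b_context" { type = string }

provider "kubernetes" {
  alias          = "a"
  config_context = var.cluster_a_context
}

provider "kubernetes" {
  alias          = "b"
  config_context = var.cluster_b_context
}

# Each resource has to be written twice, once per provider alias.
resource "kubernetes_namespace" "app_a" {
  provider = kubernetes.a
  metadata {
    name = "app"
  }
}

resource "kubernetes_namespace" "app_b" {
  provider = kubernetes.b
  metadata {
    name = "app"
  }
}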

The best I've come up with is to use partial configuration for the backend (https://www.terraform.io/language/settings/backends/configuration#partial-configuration) together with variables in the provider block, like so:

provider "kubernetes" {
  cluster = var.foo
}
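
The backend block itself is left incomplete so the remaining settings can be supplied at init time; for illustration (assuming an S3 backend, though any backend that supports partial configuration works the same way):

terraform {
  backend "s3" {
    # bucket, key, region, etc. are supplied via -backend-config at init time
  }
}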

And run terraform using:

terraform init -backend-config="baz=bat"
terraform plan -var "foo=bar"
terraform apply -var "foo=bar"

Using this approach, there's a separate backend state for each copy of the infrastructure, and the provider is pointed at the right destination via command-line variables.

The above would work, but would require a separate init, plan, and apply for each distinct copy of the infrastructure being provisioned. Is that the best that I can hope for, or is there a better approach to combine everything into one workflow?

EDIT: Adding more context based on a comment below. The scenario is that cluster A is in a less expensive, less reliable datacenter and cluster B is in a more expensive, more reliable datacenter. To save costs, we want to run primarily in the less expensive datacenter but have fully provisioned infrastructure ready to go if there is an outage in the primary datacenter. We'd keep cluster B artificially small (to achieve the cost savings) until we lose cluster A, at which point we'd scale cluster B out to handle the full workload.


Solution

  • The situation you are describing sounds like a variation on the typical idea of "environments", where you have two independent production environments rather than, e.g., separate staging and production stages.

    The good news is that you can therefore employ mostly the same strategy that's typical for multiple deployment stages: factor out your common infrastructure into a shared module and write two different configurations that refer to it with some different settings.

    Each of your configurations will presumably consist of just a backend configuration, a provider configuration, and a call to the shared module, like this:

    terraform {
      backend "example" {
        # ...
      }
    
      required_providers {
        kubernetes = {
          source = "hashicorp/kubernetes"
        }
      }
    }
    
    provider "kubernetes" {
      cluster = "whichever-cluster-is-appropriate-here"
    }
    
    module "main" {
      source = "../modules/main"
    
      # (whatever settings make sense for this environment)
    }
    

    This structure keeps all of the per-environment settings together in a single configuration, so you can just switch into this directory and run the normal Terraform commands (with no unusual extra options) to update that particular environment.
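
    As a rough sketch, the shared module's interface might expose just the settings that differ between the two copies (the variable names here are hypothetical, not prescribed):

    # modules/main/variables.tf (illustrative only)
    variable "environment_name" {
      type        = string
      description = "Label distinguishing this copy, e.g. \"primary\" or \"backup\"."
    }

    variable "node_count" {
      type        = number
      description = "How many worker nodes this copy of the infrastructure should run."
    }

    The primary environment's configuration would then call the module with full-scale settings, while the backup deliberately passes a small node_count; that asymmetry is what makes scaling up during an outage a small, mechanical change.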


    From your description it seems like a key requirement here is that each of your environments is a separate failure domain, which is one of the typical reasons to split infrastructure into two separate configurations. Doing so helps ensure that an outage of the underlying platform in one environment cannot prevent you from using Terraform to manage the other.

    If you intend to build automation around your Terraform runs (which I'd recommend), I'd suggest configuring it so that any change to the shared module automatically triggers a run for both of your environments. That way both copies are routinely kept up to date, and you won't end up in the awkward situation where you try to fail over and discover that the backup environment is "stale" and needs significant updates before you can fail over into it.

    Of course, you'd need to make sure that a failure of one of those runs cannot block applying the other, because otherwise you will have combined the two failure domains and could prevent yourself from failing over during an outage. The way I would imagine it working (in principle) is that, if there is an outage:

    1. You change the configuration of the backup environment to increase its scale (a sketch of such a change follows this list).
    2. That triggers a run only for the backup environment, because the shared module hasn't changed. You can apply that to scale up the backup environment.
    3. You change some setting outside of the scope of both of these environments to redirect incoming requests into the backup environment until the outage is over.
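
    Concretely, step 1 might amount to a one-line diff in the backup environment's call to the shared module (node_count being the same hypothetical variable sketched earlier):

    module "main" {
      source = "../modules/main"

      # Scaled up from the deliberately small standby size for the duration
      # of the outage; the exact numbers are illustrative.
      node_count = 20 # normally e.g. 2
    }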

    In the event that you do need to change the shared module during an outage, the flow is similar, except that step 2 would trigger a run for each environment and the primary environment's run would fail. You can ignore that failure for the moment and just apply the backup environment's changes. Once the outage is over, re-run the primary environment's run to "catch up" with the changes made in the backup environment before you flip back to the primary environment, and then scale the backup environment back down.

    The key theme here is that Terraform is a building block of the solution, not the entire solution itself: Terraform can help you make the changes you need to make, but you will need to build your own workflow (automated or not) around it so that Terraform runs in the appropriate context at the appropriate time to respond to an outage.