Tags: kubernetes, terraform, terraform-provider-aws, amazon-eks, terraform-aws-modules

AWS EKS cluster setup via Terraform inaccessible from bastion


Background and Context

I am working on a Terraform project that has an end goal of an EKS cluster with the following properties:

  1. Private to the outside internet
  2. Accessible via a bastion host
  3. Uses worker groups
  4. Resources (deployments, cron jobs, etc) configurable via the Terraform Kubernetes module

To accomplish this, I've modified the Terraform EKS example slightly (code at the bottom of the question). The problem I am encountering is that after SSH-ing into the bastion, I cannot ping the cluster, and commands like kubectl get pods time out after about 60 seconds.

Here are the facts/things I know to be true:

  1. I have (for the time being) switched the cluster to a public cluster for testing purposes. Previously, with cluster_endpoint_public_access set to false, the terraform apply command would not even complete because it could not reach the /healthz endpoint on the cluster.
  2. The bastion configuration works in the sense that the user data runs successfully and installs kubectl and the kubeconfig file.
  3. I am able to SSH into the bastion via my static IP (that's the var.company_vpn_ips in the code)
  4. It's entirely possible this is fully a networking problem and not an EKS/Terraform problem as my understanding of how the VPC and its security groups fit into this picture is not entirely mature.

Code

Here is the VPC configuration:

locals {
  vpc_name            = "my-vpc"
  vpc_cidr            = "10.0.0.0/16"
  public_subnet_cidr  = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
  private_subnet_cidr = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

# The definition of the VPC to create

module "vpc" {

  source  = "terraform-aws-modules/vpc/aws"
  version = "3.2.0"

  name                 = local.vpc_name
  cidr                 = local.vpc_cidr
  azs                  = data.aws_availability_zones.available.names
  private_subnets      = local.private_subnet_cidr
  public_subnets       = local.public_subnet_cidr
  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }

  public_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                    = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"           = "1"
  }
}

data "aws_availability_zones" "available" {}

Then the security groups I create for the cluster:

resource "aws_security_group" "ssh_sg" {
  name_prefix = "ssh-sg"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port = 22
    to_port   = 22
    protocol  = "tcp"

    cidr_blocks = [
      "10.0.0.0/8",
    ]
  }
}

resource "aws_security_group" "all_worker_mgmt" {
  name_prefix = "all_worker_management"
  vpc_id      = module.vpc.vpc_id

  ingress {
    from_port = 22
    to_port   = 22
    protocol  = "tcp"

    cidr_blocks = [
      "10.0.0.0/8",
      "172.16.0.0/12",
      "192.168.0.0/16",
    ]
  }
}

Here's the cluster configuration:

locals {
  cluster_version = "1.21"
}

# Create the EKS resource that will setup the EKS cluster
module "eks_cluster" {
  source = "terraform-aws-modules/eks/aws"

  # The name of the cluster to create
  cluster_name = var.cluster_name

  # Public access to the cluster API endpoint (temporarily enabled for testing)
  cluster_endpoint_public_access = true

  # Enable private access to the cluster API endpoint
  cluster_endpoint_private_access = true

  # The version of the cluster to create
  cluster_version = local.cluster_version

  # The VPC ID to create the cluster in
  vpc_id = var.vpc_id

  # The subnets to add the cluster to
  subnets = var.private_subnets

  # Default information on the workers
  workers_group_defaults = {
    root_volume_type = "gp2"
  }

  worker_additional_security_group_ids = [var.all_worker_mgmt_id]

  # Specify the worker groups
  worker_groups = [
    {
      # The name of this worker group
      name = "default-workers"
      # The instance type for this worker group
      instance_type = var.eks_worker_instance_type
      # The number of instances to raise up
      asg_desired_capacity = var.eks_num_workers
      asg_max_size         = var.eks_num_workers
      asg_min_size         = var.eks_num_workers
      # The security group IDs for these instances
      additional_security_group_ids = [var.ssh_sg_id]
    }
  ]
}

data "aws_eks_cluster" "cluster" {
  name = module.eks_cluster.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks_cluster.cluster_id
}

output "worker_iam_role_name" {
  value = module.eks_cluster.worker_iam_role_name
}

And finally, the bastion:

locals {
  ami           = "ami-0f19d220602031aed" # Amazon Linux 2 AMI (us-east-2)
  instance_type = "t3.small"
  key_name      = "bastion-kp"
}

resource "aws_iam_instance_profile" "bastion" {
  name = "bastion"
  role = var.role_name
}

resource "aws_instance" "bastion" {
  ami           = local.ami
  instance_type = local.instance_type

  key_name                    = local.key_name
  associate_public_ip_address = true
  subnet_id                   = var.public_subnet
  iam_instance_profile        = aws_iam_instance_profile.bastion.name

  # Instances launched into a VPC subnet take security group IDs via vpc_security_group_ids
  vpc_security_group_ids = [aws_security_group.bastion-sg.id]

  tags = {
    Name = "K8s Bastion"
  }

  lifecycle {
    ignore_changes = all
  }

  user_data = <<EOF
      #! /bin/bash

      # Install Kubectl
      curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
      install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
      kubectl version --client

      # Install Helm
      curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
      chmod 700 get_helm.sh
      ./get_helm.sh
      helm version

      # Install AWS
      curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      unzip awscliv2.zip
      ./aws/install
      aws --version

      # Install aws-iam-authenticator
      curl -o aws-iam-authenticator https://amazon-eks.s3.us-west-2.amazonaws.com/1.21.2/2021-07-05/bin/linux/amd64/aws-iam-authenticator
      chmod +x ./aws-iam-authenticator
      mkdir -p $HOME/bin && cp ./aws-iam-authenticator $HOME/bin/aws-iam-authenticator && export PATH=$PATH:$HOME/bin
      echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc
      aws-iam-authenticator help

      # Add the kube config file 
      mkdir ~/.kube
      echo "${var.kubectl_config}" >> ~/.kube/config
  EOF
}

resource "aws_security_group" "bastion-sg" {
  name   = "bastion-sg"
  vpc_id = var.vpc_id
}

resource "aws_security_group_rule" "sg-rule-ssh" {
  security_group_id = aws_security_group.bastion-sg.id
  from_port         = 22
  protocol          = "tcp"
  to_port           = 22
  type              = "ingress"
  cidr_blocks       = var.company_vpn_ips
  depends_on        = [aws_security_group.bastion-sg]
}

resource "aws_security_group_rule" "sg-rule-egress" {
  security_group_id = aws_security_group.bastion-sg.id
  type              = "egress"
  from_port         = 0
  protocol          = "all"
  to_port           = 0
  cidr_blocks       = ["0.0.0.0/0"]
  ipv6_cidr_blocks  = ["::/0"]
  depends_on        = [aws_security_group.bastion-sg]
}

Ask

The most pressing issue for me is finding a way to interact with the cluster via the bastion so that the other part of the Terraform code (the resources to spin up in the cluster itself) can run. I am also hoping to understand how to set up a private cluster when it ends up being inaccessible to the terraform apply command. Thank you in advance for any help you can provide!


Solution

  • Look at how your node group communicates with the control plane: the bastion host needs the same cluster security group attached in order to reach the control plane. You can find the security group ID in the EKS console, on the cluster's Networking tab.
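
As a rough Terraform sketch of the same idea (an untested assumption, not a confirmed fix for this setup): rather than attaching the cluster security group to the bastion itself, the rule below lets the bastion's own security group reach the API server port (443) on the cluster security group, which should have a similar effect. The variable cluster_security_group_id is hypothetical. It stands for the ID shown on the EKS console's Networking tab, or for whichever security group output your version of the EKS module exposes, passed into the bastion module from the root module.

```hcl
# Hypothetical input: ID of the security group attached to the EKS control plane.
variable "cluster_security_group_id" {
  description = "Security group ID of the EKS cluster (see the EKS console, Networking tab)"
  type        = string
}

# Allow HTTPS from the bastion's security group to the cluster API endpoint.
resource "aws_security_group_rule" "bastion_to_cluster_api" {
  description              = "Allow the bastion to reach the EKS API server"
  type                     = "ingress"
  protocol                 = "tcp"
  from_port                = 443
  to_port                  = 443
  security_group_id        = var.cluster_security_group_id
  source_security_group_id = aws_security_group.bastion-sg.id
}
```

With private endpoint access enabled, the cluster endpoint resolves to private addresses from inside the VPC, so requests from the bastion are filtered by the cluster security group even while public access is switched on. If the rule applies cleanly, kubectl get pods from the bastion should no longer hang, provided the bastion's IAM role is also authorized in the cluster.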