
Bad latency in GKE between Pods


We are seeing very strange behavior with unacceptably high latency for communication within a Kubernetes cluster (GKE). The latency jumps between 600 ms and 1 s for an endpoint that performs a Memorystore get/store and a CloudSQL query. The same setup running locally in our dev environment (although without k8s) does not show this kind of latency.

About our architecture: we are running a k8s cluster on GKE, created with Terraform plus service/deployment (YAML) files (both added below). We run two Node APIs (koa.js 2.5). One API is exposed to the public through an Ingress, which connects via a NodePort service to the API pod.

The other API pod is private and reachable through an internal load balancer from Google. This API is connected to all the resources we need (CloudSQL, Cloud Storage).

Both APIs are also connected to a Memorystore (Redis).
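For context, the private API reaches CloudSQL through the cloudsql-proxy sidecar on localhost (see the deployment below) and Memorystore via its private IP in the VPC. Roughly along these lines (library choices, env var names, and values are illustrative, not our exact code):

  // Illustrative connection setup (mysql2 and ioredis assumed; env var names are placeholders)
  const mysql = require('mysql2/promise');
  const Redis = require('ioredis');

  // CloudSQL is reached through the cloudsql-proxy sidecar listening on localhost
  const corePool = mysql.createPool({
    host: '127.0.0.1',
    port: 3306,                      // dev-sql-core, see the proxy command in the deployment below
    user: process.env.SQL_USER,
    password: process.env.SQL_PASSWORD,
    database: 'project_core',
  });

  // Memorystore (Redis) is reached via its private IP inside the authorized VPC
  const redis = new Redis({
    host: process.env.REDIS_HOST,    // Memorystore private IP
    port: 6379,
  });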

The communication between those pods is secured with self-signed server/client certificates (this isn't the problem; we already removed them temporarily to test).

We checked the logs and saw that a request from the public API takes about 200 ms just to reach the private one. The response back to the public API takes about 600 ms (measured from the point when the private API's business logic has finished until we receive the response back at the public API).
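For reference, these timings come from our request logs; a minimal koa 2.x timing middleware along these lines (illustrative, not our exact code) is enough to reproduce the measurements:

  // Simple per-request timing middleware for koa 2.x (illustrative)
  const Koa = require('koa');
  const app = new Koa();

  app.use(async (ctx, next) => {
    const start = Date.now();
    await next();
    const ms = Date.now() - start;
    // The logged duration covers routing + business logic of this API only,
    // not the network hop back to the caller
    console.log(`${ctx.method} ${ctx.url} - ${ms}ms`);
    ctx.set('X-Response-Time', `${ms}ms`);
  });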

We're really running out of things to try... We already connected all the Google Cloud resources to our local environment, which did not show that kind of bad latency.

In a completely local setup the latency is only about 1/5 to 1/10 of what we see in the cloud setup. We also tried to ping the private pod from the public one, which came in around 0.100 ms.

Do you have any ideas where we can investigate further? Here is the Terraform script for our Google Cloud setup:

  // Configure the Google Cloud provider
  provider "google" {
    project = "${var.project}"
    region  = "${var.region}"
  }
  data "google_compute_zones" "available" {}
  # Ensuring relevant service APIs are enabled in your project. Alternatively visit and enable the needed services
  resource "google_project_service" "serviceapi" {
    service            = "serviceusage.googleapis.com"
    disable_on_destroy = false
  }
  resource "google_project_service" "sqlapi" {
    service            = "sqladmin.googleapis.com"
    disable_on_destroy = false
    depends_on         = ["google_project_service.serviceapi"]
  }
  resource "google_project_service" "redisapi" {
    service            = "redis.googleapis.com"
    disable_on_destroy = false
    depends_on         = ["google_project_service.serviceapi"]
  }
  # Create a VPC and a subnetwork in our region
  resource "google_compute_network" "appnetwork" {
    name                    = "${var.environment}-vpn"
    auto_create_subnetworks = "false"
  }
  resource "google_compute_subnetwork" "network-with-private-secondary-ip-ranges" {
    name          = "${var.environment}-vpn-subnet"
    ip_cidr_range = "10.2.0.0/16"
    region        = "europe-west1"
    network       = "${google_compute_network.appnetwork.self_link}"
    secondary_ip_range {
      range_name    = "kubernetes-secondary-range-pods"
      ip_cidr_range = "10.60.0.0/16"
    }
    secondary_ip_range {
      range_name    = "kubernetes-secondary-range-services"
      ip_cidr_range = "10.70.0.0/16"
    }
  }
  # GKE cluster setup
  resource "google_container_cluster" "primary" {
    name               = "${var.environment}-cluster"
    zone               = "${data.google_compute_zones.available.names[1]}"
    initial_node_count = 1
    description        = "Kubernetes Cluster"
    network            = "${google_compute_network.appnetwork.self_link}"
    subnetwork         = "${google_compute_subnetwork.network-with-private-secondary-ip-ranges.self_link}"
    depends_on         = ["google_project_service.serviceapi"]
    additional_zones = [
      "${data.google_compute_zones.available.names[0]}",
      "${data.google_compute_zones.available.names[2]}",
    ]
    master_auth {
      username = "xxxxxxx"
      password = "xxxxxxx"
    }
    ip_allocation_policy {
      cluster_secondary_range_name  = "kubernetes-secondary-range-pods"
      services_secondary_range_name = "kubernetes-secondary-range-services"
    }
    node_config {
      oauth_scopes = [
        "https://www.googleapis.com/auth/compute",
        "https://www.googleapis.com/auth/devstorage.read_only",
        "https://www.googleapis.com/auth/logging.write",
        "https://www.googleapis.com/auth/monitoring",
        "https://www.googleapis.com/auth/trace.append"
      ]
      tags = ["kubernetes", "${var.environment}"]
    }
  }
  ##################
  # MySQL DATABASES 
  ##################
  resource "google_sql_database_instance" "core" {
    name             = "${var.environment}-sql-core"
    database_version = "MYSQL_5_7"
    region           = "${var.region}"
    depends_on       = ["google_project_service.sqlapi"]
    settings {
      # Second-generation instance tiers are based on the machine
      # type. See argument reference below.
      tier = "db-n1-standard-1"
    }
  }
  resource "google_sql_database_instance" "tenant1" {
    name             = "${var.environment}-sql-tenant1"
    database_version = "MYSQL_5_7"
    region           = "${var.region}"
    depends_on       = ["google_project_service.sqlapi"]
    settings {
      # Second-generation instance tiers are based on the machine
      # type. See argument reference below.
      tier = "db-n1-standard-1"
    }
  }
  resource "google_sql_database_instance" "tenant2" {
    name             = "${var.environment}-sql-tenant2"
    database_version = "MYSQL_5_7"
    region           = "${var.region}"
    depends_on       = ["google_project_service.sqlapi"]
    settings {
      # Second-generation instance tiers are based on the machine
      # type. See argument reference below.
      tier = "db-n1-standard-1"
    }
  }
  resource "google_sql_database" "core" {
    name     = "project_core"
    instance = "${google_sql_database_instance.core.name}"
  }
  resource "google_sql_database" "tenant1" {
    name     = "project_tenant_1"
    instance = "${google_sql_database_instance.tenant1.name}"
  }
  resource "google_sql_database" "tenant2" {
    name     = "project_tenant_2"
    instance = "${google_sql_database_instance.tenant2.name}"
  }
  ##################
  # MySQL USERS
  ##################
  resource "google_sql_user" "core-user" {
    name     = "${var.sqluser}"
    instance = "${google_sql_database_instance.core.name}"
    host     = "cloudsqlproxy~%"
    password = "${var.sqlpassword}"
  }
  resource "google_sql_user" "tenant1-user" {
    name     = "${var.sqluser}"
    instance = "${google_sql_database_instance.tenant1.name}"
    host     = "cloudsqlproxy~%"
    password = "${var.sqlpassword}"
  }
  resource "google_sql_user" "tenant2-user" {
    name     = "${var.sqluser}"
    instance = "${google_sql_database_instance.tenant2.name}"
    host     = "cloudsqlproxy~%"
    password = "${var.sqlpassword}"
  }
  ##################
  # REDIS
  ##################
  resource "google_redis_instance" "redis" {
    name               = "${var.environment}-redis"
    tier               = "BASIC"
    memory_size_gb     = 1
    depends_on         = ["google_project_service.redisapi"]
    authorized_network = "${google_compute_network.appnetwork.self_link}"
    region             = "${var.region}"
    location_id        = "${data.google_compute_zones.available.names[0]}"
    redis_version = "REDIS_3_2"
    display_name  = "Redis Instance"
  }
  # The following outputs allow authentication and connectivity to the GKE Cluster.
  output "client_certificate" {
    value = "${google_container_cluster.primary.master_auth.0.client_certificate}"
  }
  output "client_key" {
    value = "${google_container_cluster.primary.master_auth.0.client_key}"
  }
  output "cluster_ca_certificate" {
    value = "${google_container_cluster.primary.master_auth.0.cluster_ca_certificate}"
  }

The service and deployment of the private API

  # START CRUD POD
  apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    name: crud-pod
    labels:
      app: crud
  spec:
    template:
      metadata:
        labels:
          app: crud
      spec:
        containers:
          - name: crud
            image: eu.gcr.io/dev-xxxxx/crud:latest-unstable
            ports:
              - containerPort: 3333
            env:
            - name: NODE_ENV
              value: develop
            volumeMounts:
            - [..MountedConfigFiles..]
          # [START proxy_container]
          - name: cloudsql-proxy
            image: gcr.io/cloudsql-docker/gce-proxy:1.11
            command: ["/cloud_sql_proxy",
                      "-instances=dev-xxxx:europe-west1:dev-sql-core=tcp:3306,dev-xxxx:europe-west1:dev-sql-tenant1=tcp:3307,dev-xxxx:europe-west1:dev-sql-tenant2=tcp:3308",
                      "-credential_file=xxxx"]
            volumeMounts:
              - name: cloudsql-instance-credentials
                mountPath: /secrets/cloudsql
                readOnly: true
          # [END proxy_container]
        # [START volumes]
        volumes:
          - name: cloudsql-instance-credentials
            secret:
              secretName: cloudsql-instance-credentials
          - [..ConfigFilesVolumes..]
        # [END volumes]
  # END CRUD POD
  -------
  # START CRUD SERVICE
  apiVersion: v1
  kind: Service
  metadata:
    name: crud
    annotations:
      cloud.google.com/load-balancer-type: "Internal"
  spec:
    type: LoadBalancer
    loadBalancerSourceRanges: 
      - 10.60.0.0/16
    ports:
    - name: crud-port
      port: 3333
      protocol: TCP # default; can also specify UDP
    selector:
      app: crud # label selector for Pods to target
  # END CRUD SERVICE

And the public one (including ingress)

  # START SAPI POD
  apiVersion: extensions/v1beta1
  kind: Deployment
  metadata:
    name: sapi-pod
    labels:
      app: sapi
  spec:
    template:
      metadata:
        labels:
          app: sapi
      spec:
        containers:
          - name: sapi
            image: eu.gcr.io/dev-xxx/sapi:latest-unstable
            ports:
              - containerPort: 8080
            env:
              - name: NODE_ENV
                value: develop
            volumeMounts:
              - [..MountedConfigFiles..]
        volumes:
          - [..ConfigFilesVolumes..]
  # END SAPI POD
  -------------
  # START SAPI SERVICE
  kind: Service
  apiVersion: v1
  metadata:
    name: sapi # Service name
  spec:
    selector:
      app:  sapi
    ports:
    - port: 8080
      targetPort: 8080
    type: NodePort
  # END SAPI SERVICE
  --------------
  apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    name: dev-ingress
    annotations:
      kubernetes.io/ingress.global-static-ip-name: api-dev-static-ip
    labels:
      app: sapi-ingress
  spec:
    backend:
      serviceName: sapi
      servicePort: 8080
    tls:
    - hosts:
      - xxxxx
      secretName: xxxxx

Solution

  • We fixed the issue by removing @google-cloud/logging-winston from our log transports. For some reason it blocked our traffic, which caused the bad latency. A sketch of the change is below.
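Roughly, the change was to drop the Stackdriver transport from the Winston logger and keep only the console transport (a minimal sketch, assuming winston 3.x; the exact options differ in our code):

  const winston = require('winston');
  const { LoggingWinston } = require('@google-cloud/logging-winston');

  // Before: every log line also went to Stackdriver via this transport,
  // which blocked our traffic and produced the latency described above
  // const logger = winston.createLogger({
  //   level: 'info',
  //   transports: [new winston.transports.Console(), new LoggingWinston()],
  // });

  // After: console transport only, latency back to normal
  const logger = winston.createLogger({
    level: 'info',
    transports: [new winston.transports.Console()],
  });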