Tags: google-kubernetes-engine, autoscaling, kubernetes-statefulset, autopilot

GKE Autopilot Stateful set - not scaling


I have created a GKE Autopilot cluster; however, when I create a StatefulSet with 3 replicas, I get the following errors:

FailedScheduling 77s (x3 over 11m) gke.io/optimize-utilization-scheduler 0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.

FailedScaleUp 4m32s cluster-autoscaler Node scale up in zones us-central1-b associated with this pod failed: IP space exhausted. Pod is at risk of not being scheduled.

Of the 3 replicas, two are fully up and running and the third is stuck Pending with the following error:

kubectl get pods                        
NAME                  READY   STATUS    RESTARTS   AGE
nginx-statefulset-0   2/2     Running   0          25m
nginx-statefulset-1   2/2     Running   0          24m
nginx-statefulset-2   0/2     Pending   0          10m

kubectl describe pod nginx-statefulset-2
Name:             nginx-statefulset-2
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=nginx
                  apps.kubernetes.io/pod-index=2
                  autopilot.gke.io/allow-net-admin=true
                  controller-revision-hash=nginx-statefulset-6d59ffdd85
                  security.istio.io/tlsMode=istio
                  service.istio.io/canonical-name=nginx
                  service.istio.io/canonical-revision=latest
                  statefulset.kubernetes.io/pod-name=nginx-statefulset-2
Annotations:      autopilot.gke.io/resource-adjustment:
                    {"input":{"initContainers":[{"limits":{"cpu":"2","memory":"1Gi"},"requests":{"cpu":"100m","memory":"128Mi"},"name":"istio-init"}],"contain...       
                  autopilot.gke.io/warden-version: 2.9.52
                  istio.io/rev: default
                  kubectl.kubernetes.io/default-container: nginx
                  kubectl.kubernetes.io/default-logs-container: nginx
                  prometheus.io/path: /stats/prometheus
                  prometheus.io/port: 15020
                  prometheus.io/scrape: true
                  sidecar.istio.io/status:
                    {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-env...       
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:
IPs:              <none>
Controlled By:    StatefulSet/nginx-statefulset
Init Containers:
  istio-init:
    Image:      docker.io/istio/proxyv2:1.23.0
    Port:       <none>
    Host Port:  <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
      --log_output_level=default:info
    Limits:
      cpu:                100m
      ephemeral-storage:  2Gi
      memory:             128Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  2Gi
      memory:             128Mi
    Environment:          <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-62ch8 (ro)
Containers:
  nginx:
    Image:      nginx:1.21
    Port:       80/TCP
    Host Port:  0/TCP
    Limits:
      cpu:                650m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Requests:
      cpu:                650m
      ephemeral-storage:  1Gi
      memory:             2Gi
    Environment:          <none>
    Mounts:
      /usr/share/nginx/html from www (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-62ch8 (ro)
  istio-proxy:
    Image:      docker.io/istio/proxyv2:1.23.0
    Port:       15090/TCP
    Host Port:  0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
    Limits:
      cpu:                100m
      ephemeral-storage:  1Gi
      memory:             128Mi
    Requests:
      cpu:                100m
      ephemeral-storage:  1Gi
      memory:             128Mi
    Readiness:            http-get http://:15021/healthz/ready delay=0s timeout=3s period=15s #success=1 #failure=4
    Startup:              http-get http://:15021/healthz/ready delay=0s timeout=3s period=1s #success=1 #failure=600
    Environment:
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      nginx-statefulset-2 (v1:metadata.name)
      POD_NAMESPACE:                 default (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      ISTIO_CPU_LIMIT:               1 (limits.cpu)
      PROXY_CONFIG:                  {}

      ISTIO_META_POD_PORTS:          [
                                         {"name":"web","containerPort":80,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     nginx
      GOMEMLIMIT:                    134217728 (limits.memory)
      GOMAXPROCS:                    1 (limits.cpu)
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_NODE_NAME:           (v1:spec.nodeName)
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      nginx-statefulset
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/default/statefulsets/nginx-statefulset
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/credential-uds from credential-socket (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-62ch8 (ro)
      /var/run/secrets/tokens from istio-token (rw)
      /var/run/secrets/workload-spiffe-credentials from workload-certs (rw)
      /var/run/secrets/workload-spiffe-uds from workload-socket (rw)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  workload-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  credential-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  workload-certs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  www:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  www-nginx-statefulset-2
    ReadOnly:   false
  kube-api-access-62ch8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 kubernetes.io/arch=amd64:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From                                   Message
  ----     ------            ----               ----                                   -------
  Normal   TriggeredScaleUp  10m                cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/meghdo-4567/zones/us-central1-c/instanceGroups/gk3-meghdo-cluster-nap-1qk1o6u2-36641132-grp 0->1 (max: 1000)} {https://www.googleapis.com/compute/v1/projects/meghdo-4567/zones/us-central1-b/instanceGroups/gk3-meghdo-cluster-nap-1qk1o6u2-d7e76a9c-grp 0->1 (max: 1000)}]
  Warning  FailedScaleUp     9m32s              cluster-autoscaler                     Node scale up in zones us-central1-c, us-central1-b associated with this pod failed: IP space exhausted. Pod is at risk of not being scheduled.
  Warning  FailedScaleUp     4m32s              cluster-autoscaler                     Node scale up in zones us-central1-b associated with this pod failed: IP space exhausted. Pod is at risk of not being scheduled.
  Normal   TriggeredScaleUp  3m44s              cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/meghdo-4567/zones/us-central1-c/instanceGroups/gk3-meghdo-cluster-nap-su76u4nk-9d90d6ab-grp 0->1 (max: 1000)} {https://www.googleapis.com/compute/v1/projects/meghdo-4567/zones/us-central1-f/instanceGroups/gk3-meghdo-cluster-nap-su76u4nk-f792d666-grp 0->1 (max: 1000)}]
  Warning  FailedScheduling  77s (x3 over 11m)  gke.io/optimize-utilization-scheduler  0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod.

My nginx StatefulSet YAML:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: nginx-statefulset
  namespace: default
  labels:
    app: nginx
spec:
  serviceName: "nginx"
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
          ports:
            - containerPort: 80
              name: web
          volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: www
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi

I checked the quotas in IAM & Admin and none of them are being exceeded; in fact, usage is well under 1% everywhere.
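
For reference, regional quota usage can also be checked from the CLI; something along these lines, using the project ID and region visible in the events above:

gcloud compute regions describe us-central1 --project meghdo-4567 \
  --flatten="quotas[]" --format="table(quotas.metric,quotas.usage,quotas.limit)"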

I expected the Autopilot GKE cluster to scale up automatically.


Solution

  • You won't find it in the quotas page (it is not a quota issue). When you create a cluster, you specify an IPv4 range for Pods, and the error above says that this range has been exhausted.

    To find this range, open the cluster in the Cloud Console and look for Cluster Pod IPv4 range (default).

    You can create additional ranges by adding new secondary ranges to the subnet used by the cluster, adding them to Cluster Pod IPv4 ranges (additional), and then either creating a new node pool or enabling node auto-provisioning. See https://cloud.google.com/kubernetes-engine/docs/how-to/multi-pod-cidr. A rough command sketch follows below.
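
    A minimal sketch of the gcloud steps, assuming the cluster name meghdo-cluster and region us-central1 from the events above, a placeholder subnet name SUBNET_NAME, and a hypothetical new secondary range pods-extra with an unused CIDR (adjust all of these to your environment):

    # Inspect the Pod range (and its secondary range name) currently used by the cluster
    gcloud container clusters describe meghdo-cluster --region us-central1 \
      --format="value(ipAllocationPolicy.clusterIpv4CidrBlock, ipAllocationPolicy.clusterSecondaryRangeName)"

    # Add a new secondary range to the subnet the cluster uses (name and CIDR are placeholders)
    gcloud compute networks subnets update SUBNET_NAME --region us-central1 \
      --add-secondary-ranges=pods-extra=10.100.0.0/16

    # Register it with the cluster as an additional Pod IPv4 range
    gcloud container clusters update meghdo-cluster --region us-central1 \
      --additional-pod-ipv4-ranges=pods-extra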