I'm running an EKS cluster with several managed node groups of Spot instances, and I'm trying to achieve graceful shutdown for the workloads on those nodes. An ALB balances the incoming traffic, and my deployments already have the usual graceful-shutdown attributes such as terminationGracePeriodSeconds, preStop, and readinessProbe:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-{{ .Release.Namespace }}
  namespace: {{ .Release.Namespace }}
  labels:
    app: {{ .Release.Name }}-{{ .Release.Namespace }}
    type: instance
spec:
  selector:
    matchLabels:
      app: {{ .Release.Name }}-{{ .Release.Namespace }}
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 10%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-{{ .Release.Namespace }}
    spec:
      serviceAccountName: {{ .Release.Name }}-sa-{{ .Release.Namespace }}
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: type
                    operator: In
                    values:
                      - instance
              topologyKey: node
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "eks.amazonaws.com/nodegroup"
                    operator: In
                    values:
                      - {{ .Values.nodegroup }}
      containers:
        - name: ai-server
          lifecycle:
            preStop:
              exec:
                command: [ "sh", "-c", "sleep 20 && echo 1" ]
          image: {{ .Values.registry }}:{{ .Values.image }}
          command: [ "java" ]
          args:
            - -jar
            - app.jar
          readinessProbe:
            httpGet:
              path: /api/health
              port: 8080
            successThreshold: 1
            periodSeconds: 10
            initialDelaySeconds: 60
            failureThreshold: 2
            timeoutSeconds: 10
          env:
            - name: REDIS_HOST
              value: redis-redis-cluster.{{ .Release.Namespace }}
            - name: REDIS_PORT
              value: "6379"
            - name: REDIS_USER
              value: default
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-redis-cluster
                  key: redis-password
            - name: REDIS_TTL
              value: {{ .Values.redis.ttl }}
          resources:
            requests:
              memory: {{ .Values.resources.requests.memory }}
              cpu: {{ .Values.resources.requests.cpu }}
            limits:
              memory: {{ .Values.resources.limits.memory }}
              cpu: {{ .Values.resources.limits.cpu }}
          ports:
            - name: http
              containerPort: 8080
          imagePullPolicy: Always
      terminationGracePeriodSeconds: 120
That approach gives me zero-downtime updates and scaling up and down without any problems and without any errors on the client side.
Unfortunately, when a Spot node serving pods of the deployment goes down for any reason, such as a rebalance, clients get the error below:
502 Bad Gateway
It happens because, even when the node is already in the NotReady state and the cluster has received the event about it,
Warning NodeNotReady pod/workload-f554999c9-7xkbk Node is not ready
the pod still stays in the READY state for some period of time,
workload-f554999c9-7xkbk 1/1 Running 0 64m
and the ALB keeps forwarding requests to that pod, which no longer exists, until the pod finally disappears.
I would appreciate any ideas that help!
The main challenge with this issue was that EKS itself doesn't handle the SpotInterruptionWarning event, the notification that is sent when Amazon decides to reclaim the instance for a customer paying a better rate. This event has to be handled by external components.
For example, Karpenter can process these notifications from Amazon. However, even Karpenter handles this rather crudely: when Amazon decides to reclaim a node, Karpenter simply removes the node from the cluster without waiting for replacement pods to be scheduled on a new node, even though there are two minutes available to handle the event.
Unfortunately, the problem was ultimately solved by writing our own component to process the SpotInterruptionWarning. It works as follows: upon receiving a SpotInterruptionWarning, we mark the affected node as unschedulable and immediately reschedule all the necessary pods on new nodes.
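Below is a minimal sketch of that idea in Python, using the boto3 and kubernetes client libraries. It assumes the warnings are delivered to an SQS queue as raw EventBridge events; the queue URL and the helper names are placeholders for illustration, not the exact code of our component.

import json
import boto3
from kubernetes import client, config

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/spot-interruptions"  # placeholder

config.load_incluster_config()   # use config.load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()
sqs = boto3.client("sqs")

def node_name_for_instance(instance_id):
    # EKS nodes carry the EC2 instance ID in .spec.providerID (aws:///<az>/<instance-id>).
    for node in v1.list_node().items:
        if (node.spec.provider_id or "").endswith(instance_id):
            return node.metadata.name
    return None

def drain(node_name):
    # 1. Cordon the node so no new pods are scheduled onto it.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})
    # 2. Evict every non-DaemonSet pod; the Deployment then reschedules replicas on other nodes.
    pods = v1.list_pod_for_all_namespaces(field_selector="spec.nodeName=" + node_name)
    for pod in pods.items:
        if any(o.kind == "DaemonSet" for o in (pod.metadata.owner_references or [])):
            continue
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(name=pod.metadata.name, namespace=pod.metadata.namespace)
        )
        v1.create_namespaced_pod_eviction(pod.metadata.name, pod.metadata.namespace, eviction)

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        if event.get("detail-type") == "EC2 Spot Instance Interruption Warning":
            node = node_name_for_instance(event["detail"]["instance-id"])
            if node:
                drain(node)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

Because the pods are evicted rather than force-deleted, the preStop sleep and terminationGracePeriodSeconds from the deployment above still apply, so the ALB gets the same drain window as during a normal rolling update.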
We have 2 minutes to complete the migration. That's enough time for me, because adding a new node to the EKS cluster takes about 70 seconds, scheduling a pod and pulling its Docker image takes another 10 seconds, and I’ll allow 10 seconds for the delay in receiving the SpotInterruptionWarning from SQS.
In the end, there's still 30 seconds left for starting the application and switching the traffic. This setup allows us to handle production loads and replace spot nodes without downtime—or more accurately, with a minimal chance of downtime, which is offset by the cost savings of using spot instances.
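For reference, the delivery path assumed above can be wired up with a plain EventBridge rule that forwards the EC2 Spot interruption events into the SQS queue. The sketch below uses boto3; the queue and rule names are made up, and the SQS access policy is only mentioned in a comment.

import json
import boto3

events = boto3.client("events")
sqs = boto3.client("sqs")

QUEUE_NAME = "spot-interruptions"          # placeholder
RULE_NAME = "spot-interruption-warnings"   # placeholder

queue_url = sqs.create_queue(QueueName=QUEUE_NAME)["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Match the event EC2 emits roughly two minutes before reclaiming a Spot instance.
events.put_rule(
    Name=RULE_NAME,
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"],
    }),
)
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "sqs-target", "Arn": queue_arn}])

# Note: the queue's access policy must also allow events.amazonaws.com to send
# messages to it; that policy is intentionally left out of this sketch.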