I'm running an EKS cluster with several managed node groups of Spot instances, and I'm trying to achieve graceful shutdown for the workloads on those nodes. An ALB balances the incoming traffic, and my deployments already have the usual graceful-shutdown attributes such as terminationGracePeriodSeconds, preStop, and readinessProbe:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-{{ .Release.Namespace }}
  namespace: {{ .Release.Namespace }}
  labels:
    app: {{ .Release.Name }}-{{ .Release.Namespace }}
    type: instance
spec:
  selector:
    matchLabels:
      app: {{ .Release.Name }}-{{ .Release.Namespace }}
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 10%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-{{ .Release.Namespace }}
    spec:
      serviceAccountName: {{ .Release.Name }}-sa-{{ .Release.Namespace }}
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: type
                    operator: In
                    values:
                      - instance
              topologyKey: node
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "eks.amazonaws.com/nodegroup"
                    operator: In
                    values:
                      - {{ .Values.nodegroup }}
      containers:
        - name: ai-server
          lifecycle:
            preStop:
              exec:
                command: [ "sh", "-c", "sleep 20 && echo 1" ]
          image: {{ .Values.registry }}:{{ .Values.image }}
          command: [ "java" ]
          args:
            - -jar
            - app.jar
          readinessProbe:
            httpGet:
              path: /api/health
              port: 8080
            successThreshold: 1
            periodSeconds: 10
            initialDelaySeconds: 60
            failureThreshold: 2
            timeoutSeconds: 10
          env:
            - name: REDIS_HOST
              value: redis-redis-cluster.{{ .Release.Namespace }}
            - name: REDIS_PORT
              value: "6379"
            - name: REDIS_USER
              value: default
            - name: REDIS_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: redis-redis-cluster
                  key: redis-password
            - name: REDIS_TTL
              value: {{ .Values.redis.ttl }}
          resources:
            requests:
              memory: {{ .Values.resources.requests.memory }}
              cpu: {{ .Values.resources.requests.cpu }}
            limits:
              memory: {{ .Values.resources.limits.memory }}
              cpu: {{ .Values.resources.limits.cpu }}
          ports:
            - name: http
              containerPort: 8080
          imagePullPolicy: Always
      terminationGracePeriodSeconds: 120
That approach gives me zero-downtime updates and scaling up and down without any problems and without any errors on the client side.
Unfortunately, when a Spot node serving pods of the deployment goes down for any reason, such as a rebalance, clients get the error below:
502 Bad Gateway
It happens because, even when the node is already in the NotReady state and the cluster has received the event about it,
Warning NodeNotReady pod/workload-f554999c9-7xkbk Node is not ready
the pod still stays in the READY state for some period of time,
workload-f554999c9-7xkbk 1/1 Running 0 64m
and the ALB keeps forwarding requests to that pod, which no longer exists, until the pod finally disappears.
I would appreciate any ideas that help!
The main challenge with this issue was that EKS itself doesn't handle the SpotInterruptionWarning event, the notification that is sent when Amazon decides to reclaim the instance for a customer paying a better rate. This event has to be handled by external components.
For example, Karpenter can process these notifications from Amazon. However, even Karpenter handles this rather crudely: when Amazon decides to reclaim a node, Karpenter simply removes the node from the cluster without waiting for replacement pods to be scheduled on a new node, even though there are two minutes available to handle the event.
Unfortunately, the problem was ultimately solved by writing our own component to process the SpotInterruptionWarning. It works as follows: upon receiving a SpotInterruptionWarning, we mark the affected node as unschedulable and immediately reschedule all the necessary pods on new nodes.
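Below is a minimal sketch of that idea in Python, using the boto3 and kubernetes client libraries. It assumes the warnings are delivered to an SQS queue as raw EventBridge events; the queue URL and the helper names are placeholders for illustration, not the exact code of our component.

import json
import boto3
from kubernetes import client, config

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/spot-interruptions"  # placeholder

config.load_incluster_config()   # use config.load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()
sqs = boto3.client("sqs")

def node_name_for_instance(instance_id):
    # EKS nodes carry the EC2 instance ID in .spec.providerID (aws:///<az>/<instance-id>).
    for node in v1.list_node().items:
        if (node.spec.provider_id or "").endswith(instance_id):
            return node.metadata.name
    return None

def drain(node_name):
    # 1. Cordon the node so no new pods are scheduled onto it.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})
    # 2. Evict every non-DaemonSet pod; the Deployment then reschedules replicas on other nodes.
    pods = v1.list_pod_for_all_namespaces(field_selector="spec.nodeName=" + node_name)
    for pod in pods.items:
        if any(o.kind == "DaemonSet" for o in (pod.metadata.owner_references or [])):
            continue
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(name=pod.metadata.name, namespace=pod.metadata.namespace)
        )
        v1.create_namespaced_pod_eviction(pod.metadata.name, pod.metadata.namespace, eviction)

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        event = json.loads(msg["Body"])
        if event.get("detail-type") == "EC2 Spot Instance Interruption Warning":
            node = node_name_for_instance(event["detail"]["instance-id"])
            if node:
                drain(node)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

Because the pods are evicted rather than force-deleted, the preStop sleep and terminationGracePeriodSeconds from the deployment above still apply, so the ALB gets the same drain window as during a normal rolling update.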
We have 2 minutes to complete the migration. That's enough time for me, because adding a new node to the EKS cluster takes about 70 seconds, scheduling a pod and pulling its Docker image takes another 10 seconds, and I’ll allow 10 seconds for the delay in receiving the SpotInterruptionWarning from SQS.
In the end, there's still 30 seconds left for starting the application and switching the traffic. This setup allows us to handle production loads and replace spot nodes without downtime—or more accurately, with a minimal chance of downtime, which is offset by the cost savings of using spot instances.
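For reference, the delivery path assumed above can be wired up with a plain EventBridge rule that forwards the EC2 Spot interruption events into the SQS queue. The sketch below uses boto3; the queue and rule names are made up, and the SQS access policy is only mentioned in a comment.

import json
import boto3

events = boto3.client("events")
sqs = boto3.client("sqs")

QUEUE_NAME = "spot-interruptions"          # placeholder
RULE_NAME = "spot-interruption-warnings"   # placeholder

queue_url = sqs.create_queue(QueueName=QUEUE_NAME)["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Match the event EC2 emits roughly two minutes before reclaiming a Spot instance.
events.put_rule(
    Name=RULE_NAME,
    State="ENABLED",
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"],
    }),
)
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "sqs-target", "Arn": queue_arn}])

# Note: the queue's access policy must also allow events.amazonaws.com to send
# messages to it; that policy is intentionally left out of this sketch.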