We have the following configuration for our service deployed to EKS, but every deployment causes roughly 120 seconds of downtime.
I can successfully make requests to the new pod when I port-forward to it directly, so the pod itself seems fine. It looks like either the AWS NLB is not routing traffic to it or something network-related is going on, but I'm not sure, and I don't know where to debug further.
I tried a few things to no avail: added a readinessProbe; increased initialDelaySeconds to 120; switched the ELB target type from instance to ip; and tried reducing the NLB's health-check interval, but that setting is not actually applied and stays at 30s.
Any help would be greatly appreciated!
---
# Autoscaler for the frontend
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: my-frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-frontend
  minReplicas: 3
  maxReplicas: 8
  targetCPUUtilizationPercentage: 60
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-frontend
  labels:
    app: my-frontend
spec:
  replicas: 3
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: my-frontend
  template:
    metadata:
      labels:
        app: my-frontend
    spec:
      containers:
        - name: my-frontend
          image: ${DOCKER_IMAGE}
          ports:
            - containerPort: 3001
              name: web
          resources:
            requests:
              cpu: "300m"
              memory: "256Mi"
          livenessProbe:
            httpGet:
              scheme: HTTP
              path: /v1/ping
              port: 3001
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 10
          readinessProbe:
            httpGet:
              scheme: HTTP
              path: /v1/ping
              port: 3001
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 10
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: ${SSL_CERTIFICATE_ARN}
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
  name: my-frontend
  labels:
    service: my-frontend
spec:
  ports:
    - name: http
      port: 80
      targetPort: 3001
    - name: https
      port: 443
      targetPort: 3001
  externalTrafficPolicy: Local
  selector:
    app: my-frontend
  type: LoadBalancer
This is most likely caused by the NLB not reacting quickly enough to target changes during the rollout, which is directly related to your externalTrafficPolicy setting. With externalTrafficPolicy: Local, only nodes that are running a ready pod pass the NLB health checks, so a rolling update changes which nodes are healthy, and the NLB needs several health-check intervals to register that.
If your application does not rely on the client IP, you can set externalTrafficPolicy to Cluster, or simply remove the field, since Cluster is the default.
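A minimal sketch of that change on your Service (only the relevant fields shown; keep your existing annotations as they are):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-frontend
spec:
  type: LoadBalancer
  # "Cluster" (the default) lets every node accept traffic and forward it to
  # any ready pod, so NLB node health no longer tracks pod placement.
  # Trade-off: the original client IP is lost and there may be an extra hop.
  externalTrafficPolicy: Cluster
  selector:
    app: my-frontend
  ports:
    - name: http
      port: 80
      targetPort: 3001
    - name: https
      port: 443
      targetPort: 3001
```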
If your application does need to preserve the client IP, you can use the workaround discussed in this GitHub issue, which in short requires a blue-green deployment.
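As a rough sketch of the blue-green approach (the `color` label and the `-green` name are hypothetical, not from the issue): deploy the new version alongside the old one, wait until its pods are Ready, then repoint the Service selector.

```yaml
# Hypothetical blue/green sketch: the new version runs next to the old one,
# so the NLB's healthy nodes never change mid-rollout.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-frontend-green        # new version; the old Deployment keeps serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-frontend
      color: green               # extra label distinguishes the two versions
  template:
    metadata:
      labels:
        app: my-frontend
        color: green
    spec:
      containers:
        - name: my-frontend
          image: ${DOCKER_IMAGE}
          ports:
            - containerPort: 3001
              name: web
```

Once the green pods are Ready, switch traffic in a single step by patching the Service selector, e.g. `kubectl patch service my-frontend -p '{"spec":{"selector":{"app":"my-frontend","color":"green"}}}'`, and delete the old Deployment afterwards.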