I'm seeing downtime on my app running on GKE when I deploy it with a rolling update:

rollingUpdate:
  maxSurge: 25%
  maxUnavailable: 0
type: RollingUpdate
I've checked the events on my pod and the last event is this one:
NEG is not attached to any Backend Service with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.
On my pod I have a livenessProbe and a startupProbe configured like this:

livenessProbe:
  failureThreshold: 1
  httpGet:
    path: /healthz
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
startupProbe:
  failureThreshold: 30
  httpGet:
    path: /healthz
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
I checked my LB logs and found this:
{
  httpRequest: {
    latency: "0.002246s"
    remoteIp: "myIP"
    requestMethod: "GET"
    requestSize: "37"
    requestUrl: "https://www.myurl/"
    responseSize: "447"
    status: 502
    userAgent: "curl/7.77.0"
  }
  insertId: "1mk"
  jsonPayload: {3}
  logName: "myproject/logs/requests"
  receiveTimestamp: "2022-02-15T15:30:52.085256523Z"
  resource: {
    labels: {6}
    type: "http_load_balancer"
  }
  severity: "WARNING"
  spanId: "b75e2f583a0e9e25"
  timestamp: "2022-02-15T15:30:51.270776Z"
  trace: "myproject/traces/32c488f48a392ac42358be0f"
}
And this is my deployment spec, as requested:
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: app
      app.kubernetes.io/name: myname
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum/config: 4920135cd08336150d3184cc1af
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: webapp-server
        app.kubernetes.io/part-of: webapp
        helm.sh/chart: myapp-1.0.0
    spec:
      containers:
        - env:
            - name: ENV VAR
              value: Hello
          envFrom:
            - configMapRef:
                name: myapp
            - secretRef:
                name: myapp-credentials
          image: imagelink
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 1
            httpGet:
              path: /healthz
              port: http
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: node
          ports:
            - containerPort: 3000
              name: http
              protocol: TCP
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 250m
              memory: 256Mi
          startupProbe:
            failureThreshold: 30
            httpGet:
              path: /healthz
              port: http
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
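One thing that stands out in this spec: the container has no readinessProbe. With container-native load balancing, GKE's NEG controller relies on pod readiness (via the cloud.google.com/load-balancer-neg-ready readiness gate) to decide when a new pod should receive traffic, so adding one may also help. A minimal sketch, assuming the same /healthz endpoint used by the other probes is appropriate for readiness:

```yaml
readinessProbe:
  httpGet:
    path: /healthz   # assumption: reusing the endpoint from the liveness/startup probes
    port: http
    scheme: HTTP
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 1
  timeoutSeconds: 1
```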
What can I change to avoid this downtime when performing a rolling update?
I fixed this by adding a preStop hook:

lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - sleep 60
which basically gives the pod 60 seconds to handle the SIGTERM and finish its in-flight requests while the new pod is up and handling new traffic.
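One caveat with this approach: the default terminationGracePeriodSeconds is 30, and the grace period includes the time the preStop hook runs, so a 60-second sleep would be cut short and the container killed mid-drain. A sketch of the relevant pod spec fields, with the grace period raised above the sleep duration (the value 90 is an illustrative choice, not from the original config):

```yaml
spec:
  terminationGracePeriodSeconds: 90  # must exceed the preStop sleep; default is 30
  containers:
    - name: node
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 60"]
```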