google-cloud-platform google-kubernetes-engine kubernetes-health-check

NEG is not attached to any BackendService with health checking


My app running on GKE has downtime when I deploy it with a rolling update. The update strategy is:

rollingUpdate:
  maxSurge: 25%
  maxUnavailable: 0
type: RollingUpdate

I've checked the events on my pod and the last event is this one:

NEG is not attached to any Backend Service with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

My pod has a livenessProbe and a startupProbe like this:

livenessProbe:
  failureThreshold: 1
  httpGet:
    path: /healthz
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

startupProbe:
  failureThreshold: 30
  httpGet:
    path: /healthz
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

I checked my load balancer logs and found this 502 response:

{
  httpRequest: {
    latency: "0.002246s"
    remoteIp: "myIP"
    requestMethod: "GET"
    requestSize: "37"
    requestUrl: "https://www.myurl/"
    responseSize: "447"
    status: 502
    userAgent: "curl/7.77.0"
  }
  insertId: "1mk"
  jsonPayload: {3}
  logName: "myproject/logs/requests"
  receiveTimestamp: "2022-02-15T15:30:52.085256523Z"
  resource: {
    labels: {6}
    type: "http_load_balancer"
  }
  severity: "WARNING"
  spanId: "b75e2f583a0e9e25"
  timestamp: "2022-02-15T15:30:51.270776Z"
  trace: "myproject/traces/32c488f48a392ac42358be0f"
}

Here is my deployment spec, as requested:

spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: app
      app.kubernetes.io/name: myname
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum/config: 4920135cd08336150d3184cc1af
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: webapp-server
        app.kubernetes.io/part-of: webapp
        helm.sh/chart: myapp-1.0.0
    spec:
      containers:
      - env:
        - name: ENV VAR
          value: Hello
        envFrom:
        - configMapRef:
            name: myapp
        - secretRef:
            name: myapp-credentials
        image: imagelink
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 1
          httpGet:
            path: /healthz
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: node
        ports:
        - containerPort: 3000
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 250m
            memory: 256Mi
        startupProbe:
          failureThreshold: 30
          httpGet:
            path: /healthz
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst

What can I change to avoid this downtime when performing a rollingUpdate?


Solution

  • Adding the following preStop hook to the container fixed it:

    lifecycle:
       preStop:
          exec:
            command:
            - /bin/sh
            - -c
            - sleep 60
    

    The preStop hook runs before the container receives SIGTERM, so the sleep delays shutdown (by up to 60 seconds here). During that window the old pod keeps serving its in-flight requests while the new pod is up and handles the new ones.
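
    One caveat worth checking: the preStop hook runs inside the pod's termination grace period, and terminationGracePeriodSeconds defaults to 30 seconds when it is not set (the spec in the question does not show it). With the default, the kubelet would likely cut the 60-second sleep short. Below is a minimal sketch of the same hook with a longer grace period. The value 75 is an illustrative assumption, not something taken from the original setup, and the hook assumes the image ships /bin/sh and sleep:

    spec:
      template:
        spec:
          # Assumption: any value comfortably above the preStop sleep (default is 30)
          terminationGracePeriodSeconds: 75
          containers:
          - name: node
            lifecycle:
              preStop:
                exec:
                  # Keep the container serving while the load balancer stops sending it traffic
                  command:
                  - /bin/sh
                  - -c
                  - sleep 60

    The sleep only needs to be long enough for the load balancer to stop routing to the old pod, so a shorter value may work as well; that depends on the setup and is best confirmed by testing a rollout.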