We run a Kubernetes cluster provisioned with Kubespray and discovered that whenever a faulty node goes down (which happened to us recently due to a hardware issue), the pods running on that node get stuck in the Terminating state indefinitely. Even after many hours the pods are not redeployed on healthy nodes, so our entire application malfunctions and users are affected for a prolonged period of time.
How can Kubernetes be configured to perform failover in situations like this?
Below is our StatefulSet manifest.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: project-stock
  name: ps-ra
spec:
  selector:
    matchLabels:
      infrastructure: ps
      application: report-api
      environment: staging
  serviceName: hl-ps-report-api
  replicas: 1
  template:
    metadata:
      namespace: project-stock
      labels:
        infrastructure: ps
        application: report-api
        environment: staging
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: ps-report-api
          image: localhost:5000/ps/nodejs-chrome-application:latest
          ports:
            - containerPort: 3000
              protocol: TCP
              name: nodejs-rest-api
          volumeMounts:
          resources:
            limits:
              cpu: 1000m
              memory: 8192Mi
            requests:
              cpu: 333m
              memory: 8192Mi
          livenessProbe:
            httpGet:
              path: /health/
              port: 3000
            initialDelaySeconds: 180
            periodSeconds: 10
            failureThreshold: 12
            timeoutSeconds: 10
```
Posted community wiki for better visibility. Feel free to expand it.
In my opinion, the behaviour on your Kubespray cluster (pods staying in the Terminating state) is fully intentional. Based on the Kubernetes documentation:
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node.
The same documentation introduces the ways in which a Pod in the Terminating state can be removed, along with some recommended best practices:
The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
- The Node object is deleted (either by you, or by the Node Controller).
- The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
- Force deletion of the Pod by the user.
The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver. Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
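The third option (force deletion by the user) would look like the following; the pod name `ps-ra-0` is an assumption based on the StatefulSet naming convention (`<statefulset-name>-<ordinal>`) applied to your manifest:

```shell
# Force-delete the stuck pod, bypassing the graceful-termination wait.
# Only do this after confirming the node is really dead: a StatefulSet
# guarantees at-most-one pod per identity, and force deletion breaks
# that guarantee if the old pod is still running on the lost node.
kubectl delete pod ps-ra-0 --grace-period=0 --force --namespace project-stock
```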
You can implement Graceful Node Shutdown if your node is shut down in one of the following ways:
On Linux, your system can shut down in many different situations. For example:
- A user or script running shutdown -h now or systemctl poweroff or systemctl reboot.
- Physically pressing a power button on the machine.
- Stopping a VM instance on a cloud provider, e.g. gcloud compute instances stop on GCP.
- A Preemptible VM or Spot Instance that your cloud provider can terminate unexpectedly, but with a brief warning.
Keep in mind this feature is supported from version 1.20 onwards (alpha in 1.20, beta as of 1.21).
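Graceful Node Shutdown is configured on the kubelet. A minimal sketch of the relevant KubeletConfiguration fields follows; the duration values are illustrative and should be tuned to how long your pods need to shut down:

```yaml
# Kubelet configuration file, e.g. /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Required while the feature is in alpha/beta.
  GracefulNodeShutdown: true
# Total time the node delays its shutdown to let pods terminate.
shutdownGracePeriod: 30s
# Portion of shutdownGracePeriod reserved for critical pods.
shutdownGracePeriodCriticalPods: 10s
```

Note that this only helps for orderly shutdowns (the list above); it cannot help with a sudden hardware failure where the kubelet never gets a chance to react.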
The other solution mentioned in the documentation is to manually delete the node, for example using kubectl delete node <your-node-name>:
If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object.
The pod will then be re-scheduled on another node.
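In practice that amounts to something like the following (keep the placeholder from the documentation; substitute your actual node name):

```shell
# Confirm which node is NotReady before deleting anything.
kubectl get nodes
# Delete the Node object; the node controller then removes the pods
# bound to it, and the StatefulSet controller re-creates them elsewhere.
kubectl delete node <your-node-name>
```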
The last workaround is to set terminationGracePeriodSeconds to 0, but this is strongly discouraged:
For the above to lead to graceful termination, the Pod must not specify a pod.Spec.TerminationGracePeriodSeconds of 0. The practice of setting a pod.Spec.TerminationGracePeriodSeconds of 0 seconds is unsafe and strongly discouraged for StatefulSet Pods. Graceful deletion is safe and will ensure that the Pod shuts down gracefully before the kubelet deletes the name from the apiserver.
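For completeness, this discouraged workaround would be a one-line change in the pod template of your manifest (shown in isolation below; again, unsafe for StatefulSet pods):

```yaml
spec:
  template:
    spec:
      # Unsafe for StatefulSets: the pod is deleted from the apiserver
      # immediately, so a replacement can be created while the old
      # instance may still be running on the unreachable node.
      terminationGracePeriodSeconds: 0
```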