We run a Kubernetes cluster provisioned with Kubespray and discovered that whenever a faulty node goes down (which happened to us recently due to a hardware issue), the pods running on that node get stuck in the Terminating state indefinitely. Even after many hours the pods are not redeployed on healthy nodes, so our entire application malfunctions and users are affected for a prolonged period of time.
How can Kubernetes be configured to perform failover in situations like this?
Below is our StatefulSet manifest.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: project-stock
  name: ps-ra
spec:
  selector:
    matchLabels:
      infrastructure: ps
      application: report-api
      environment: staging
  serviceName: hl-ps-report-api
  replicas: 1
  template:
    metadata:
      namespace: project-stock
      labels:
        infrastructure: ps
        application: report-api
        environment: staging
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: ps-report-api
          image: localhost:5000/ps/nodejs-chrome-application:latest
          ports:
            - containerPort: 3000
              protocol: TCP
              name: nodejs-rest-api
          volumeMounts:
          resources:
            limits:
              cpu: 1000m
              memory: 8192Mi
            requests:
              cpu: 333m
              memory: 8192Mi
          livenessProbe:
            httpGet:
              path: /health/
              port: 3000
            initialDelaySeconds: 180
            periodSeconds: 10
            failureThreshold: 12
            timeoutSeconds: 10
```
Posted community wiki for better visibility. Feel free to expand it.
In my opinion, the behaviour on your Kubespray cluster (pods staying in the Terminating state) is fully intentional. Based on the Kubernetes documentation:
A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node.
The same documentation introduces the ways in which a Pod in the Terminating state can be removed, along with some recommended best practices:
The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
- The Node object is deleted (either by you, or by the Node Controller).
- The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
- Force deletion of the Pod by the user.
The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver. Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
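The third option (force deletion by the user) would look like the following; the pod name `ps-ra-0` is an assumption based on the StatefulSet naming convention (`<statefulset-name>-<ordinal>`) applied to your manifest:

```shell
# Force-delete the stuck pod, bypassing the graceful-termination wait.
# Only do this after confirming the node is really dead: a StatefulSet
# guarantees at-most-one pod per identity, and force deletion breaks
# that guarantee if the old pod is still running on the lost node.
kubectl delete pod ps-ra-0 --grace-period=0 --force --namespace project-stock
```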
You can implement Graceful Node Shutdown if your node is shut down in one of the following ways:
On Linux, your system can shut down in many different situations. For example:
- A user or script running shutdown -h now or systemctl poweroff or systemctl reboot.
- Physically pressing a power button on the machine.
- Stopping a VM instance on a cloud provider, e.g. gcloud compute instances stop on GCP.
- A Preemptible VM or Spot Instance that your cloud provider can terminate unexpectedly, but with a brief warning.
Keep in mind this feature is supported from version 1.20 onwards (alpha in 1.20, beta as of 1.21).
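Graceful Node Shutdown is configured on the kubelet. A minimal sketch of the relevant KubeletConfiguration fields follows; the duration values are illustrative and should be tuned to how long your pods need to shut down:

```yaml
# Kubelet configuration file, e.g. /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Required while the feature is in alpha/beta.
  GracefulNodeShutdown: true
# Total time the node delays its shutdown to let pods terminate.
shutdownGracePeriod: 30s
# Portion of shutdownGracePeriod reserved for critical pods.
shutdownGracePeriodCriticalPods: 10s
```

Note that this only helps for orderly shutdowns (the list above); it cannot help with a sudden hardware failure where the kubelet never gets a chance to react.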
The other solution mentioned in the documentation is to manually delete the node, for example using kubectl delete node <your-node-name>:
If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object.
The pod will then be re-scheduled on another node.
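In practice that amounts to something like the following (keep the placeholder from the documentation; substitute your actual node name):

```shell
# Confirm which node is NotReady before deleting anything.
kubectl get nodes
# Delete the Node object; the node controller then removes the pods
# bound to it, and the StatefulSet controller re-creates them elsewhere.
kubectl delete node <your-node-name>
```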
The last workaround is to set terminationGracePeriodSeconds to 0, but this is strongly discouraged:
For the above to lead to graceful termination, the Pod must not specify a pod.Spec.TerminationGracePeriodSeconds of 0. The practice of setting a pod.Spec.TerminationGracePeriodSeconds of 0 seconds is unsafe and strongly discouraged for StatefulSet Pods. Graceful deletion is safe and will ensure that the Pod shuts down gracefully before the kubelet deletes the name from the apiserver.
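For completeness, this discouraged workaround would be a one-line change in the pod template of your manifest (shown in isolation below; again, unsafe for StatefulSet pods):

```yaml
spec:
  template:
    spec:
      # Unsafe for StatefulSets: the pod is deleted from the apiserver
      # immediately, so a replacement can be created while the old
      # instance may still be running on the unreachable node.
      terminationGracePeriodSeconds: 0
```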