Tags: kubernetes, liveness-probe

Are Kubernetes liveness probe failures voluntary or involuntary disruptions?


I have an application deployed to Kubernetes that depends on an outside application. Sometimes the connection between the two gets into an invalid state, and that can only be fixed by restarting my application.

To do automatic restarts, I have configured a liveness probe that will verify the connection.
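
For context, the probe is along these lines (a simplified sketch; the endpoint path, port, and timings are placeholders rather than my exact configuration):

    # Sketch of the liveness probe: an HTTP endpoint that verifies the
    # connection to the outside application (path, port, and timings are
    # placeholders).
    livenessProbe:
      httpGet:
        path: /healthz/connection
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 3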

This has been working great, however, I'm afraid that if that outside application goes down (such that the connection error isn't just due to an invalid pod state), all of my pods will immediately restart, and my application will become completely unavailable. I want it to remain running so that functionality not depending on the bad service can continue.

I'm wondering if a pod disruption budget would prevent this scenario, as it limits the number of pods down due to a "voluntary" disruption. However, the K8s docs don't state whether liveness probe failures are a voluntary disruption. Are they?


Solution

  • I would say, according to the documentation:

    Voluntary and involuntary disruptions

    Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.

    We call these unavoidable cases involuntary disruptions to an application. Examples are:

    • a hardware failure of the physical machine backing the node
    • cluster administrator deletes VM (instance) by mistake
    • cloud provider or hypervisor failure makes VM disappear
    • a kernel panic
    • the node disappears from the cluster due to cluster network partition
    • eviction of a pod due to the node being out-of-resources.

    Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.

    We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:

    • deleting the deployment or other controller that manages the pod
    • updating a deployment's pod template causing a restart
    • directly deleting a pod (e.g. by accident)

    Cluster administrator actions include:

    • Draining a node for repair or upgrade.
    • Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling).
    • Removing a pod from a node to permit something else to fit on that node.

    -- Kubernetes.io: Docs: Concepts: Workloads: Pods: Disruptions

    So your example is quite different, and to my knowledge it is neither a voluntary nor an involuntary disruption.


    Also, taking a look at another part of the Kubernetes documentation:

    Pod lifetime

    Like individual application containers, Pods are considered to be relatively ephemeral (rather than durable) entities. Pods are created, assigned a unique ID (UID), and scheduled to nodes where they remain until termination (according to restart policy) or deletion. If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period.

    Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node maintenance. Kubernetes uses a higher-level abstraction, called a controller, that handles the work of managing the relatively disposable Pod instances.

    -- Kubernetes.io: Docs: Concepts: Workloads: Pods: Pod lifecycle: Pod lifetime

    Container probes

    The kubelet can optionally perform and react to three kinds of probes on running containers (focusing on a livenessProbe):

    • livenessProbe: Indicates whether the container is running. If the liveness probe fails, the kubelet kills the container, and the container is subjected to its restart policy. If a Container does not provide a liveness probe, the default state is Success.

    -- Kubernetes.io: Docs: Concepts: Workloads: Pods: Pod lifecycle: Container probes

    When should you use a liveness probe?

    If the process in your container is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a liveness probe; the kubelet will automatically perform the correct action in accordance with the Pod's restartPolicy.

    If you'd like your container to be killed and restarted if a probe fails, then specify a liveness probe, and specify a restartPolicy of Always or OnFailure.

    -- Kubernetes.io: Docs: Concepts: Workloads: Pods: Pod lifecycle: When should you use a liveness probe

    Based on this information, it would be better to create a custom liveness probe that distinguishes between the internal health of your process and the health of the external dependency. In the first case (the pod's own connection state is broken and a restart would fix it), the probe should fail so the container is terminated and restarted; in the second case (only the external dependency is down), the probe should keep passing so your container stays running. A sketch of this idea follows.
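
    A minimal sketch, assuming an HTTP liveness endpoint (the /healthz/internal path, port, image, and names are hypothetical, for illustration only):

      # The endpoint behind /healthz/internal should return non-2xx only when
      # the pod's own state is invalid and a restart would actually fix it.
      # If merely the outside application is unreachable, it should still
      # return 2xx, so the kubelet does not restart the container.
      containers:
        - name: app                                # placeholder container name
          image: registry.example.com/app:latest   # placeholder image
          livenessProbe:
            httpGet:
              path: /healthz/internal
              port: 8080
            periodSeconds: 10
            failureThreshold: 3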

    Answering the following question:

    I'm wondering if a pod disruption budget would prevent this scenario.

    In this particular scenario a PDB will not help, because a PDB only limits voluntary disruptions that go through the eviction API (for example a node drain); a container restart triggered by a failed liveness probe is performed by the kubelet and is not an eviction. A sketch of a PDB follows for comparison.
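
    A minimal PodDisruptionBudget sketch (the name, selector, and minAvailable value are placeholders):

      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: my-app-pdb            # placeholder name
      spec:
        minAvailable: 2             # placeholder value
        selector:
          matchLabels:
            app: my-app             # placeholder label
      # This only constrains evictions (e.g. kubectl drain); it does not stop
      # the kubelet from restarting containers on liveness probe failures.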


    To give more visibility to the comment I made, here are additional resources on the matter that could prove useful to other community members: