kubernetes, deployment, downtime, statefulset, rolling-updates

Kubernetes RollingUpdate differences between Deployment and StatefulSet


I have a basic question regarding a Kubernetes cluster. Let's assume you have a pod and the pod needs to be updated. During this time there should be no downtime. Of course state or sessions may be lost, but I'm more interested in the pod update process itself.

When a pod is controlled by a Deployment configured with a rolling update, a new pod is created first, and once the new pod is healthy the old one is terminated and deleted. This happens even with replicas set to 1; the actual number of replicas temporarily becomes 2.
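For illustration, this is roughly the Deployment I mean (the name and image are just placeholders). With maxSurge: 1 and maxUnavailable: 0 the controller briefly runs two pods even though replicas is 1:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web                # placeholder name
    spec:
      replicas: 1
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1          # allow one extra pod during the update
          maxUnavailable: 0    # never drop below the desired replica count
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
            - name: web
              image: nginx:1.25        # placeholder image
              readinessProbe:          # old pod is removed only after this passes
                httpGet:
                  path: /
                  port: 80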

When a pod is controlled by a StatefulSet configured with a rolling update, the pod is deleted immediately and only then recreated, so there is downtime. This only changes if replicas is set to >= 2, but then there are always two pods running.
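For comparison, a StatefulSet with a rolling update looks roughly like this (again, the name, headless Service, and image are placeholders I'm assuming). There is no maxSurge equivalent here, so each pod is deleted first and then recreated:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: db                 # placeholder name
    spec:
      replicas: 1
      serviceName: db          # headless Service, assumed to exist
      updateStrategy:
        type: RollingUpdate    # replaces pods one at a time, highest ordinal first
      selector:
        matchLabels:
          app: db
      template:
        metadata:
          labels:
            app: db
        spec:
          containers:
            - name: db
              image: postgres:16   # placeholder image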

Can someone explain this, or is there a way to fix this behavior?


Solution

  • The key design point of a StatefulSet is that it's, well, stateful: each replica has its own corresponding PersistentVolumeClaim. Many servers can't have multiple processes accessing the same local files at the same time (most database servers will refuse to start up if a lock file exists). Furthermore, the kinds of PersistentVolume that are easier to get tend to allow only ReadWriteOnce access; they can't be attached to multiple nodes. (A short manifest sketch at the end of this answer illustrates this.)

    Thus, for workloads that run in a StatefulSet, you normally can't be running an old and a new Pod at the same time.

    If you need a zero-downtime update in a StatefulSet, most of the heavy lifting needs to be implemented inside the process. A multi-node replicated database is a good example here (think MongoDB or Elasticsearch). Any particular datum will normally exist on at least two of the replicas (configurable). If one replica temporarily goes down, another can take over its responsibility, and its data still exists; when it comes back, it can rejoin the cluster and get the updates it missed.

    The other corollary to this is that you can't have a zero-downtime update for a single-replica StatefulSet. Again, consider a single-node non-replicated database (MySQL or PostgreSQL): even if you could start two Pods with the same PersistentVolume, the database process in the second Pod wouldn't start as long as the lock file exists, and it will exist until the first Pod exits. A Deployment would wait to terminate the old Pod until the new Pod's liveness and readiness probes passed, but if the database process is waiting to claim the lock file, this will never happen.
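
    To make the storage point concrete, here is a minimal sketch (the name, image, and sizes are made up): the volumeClaimTemplates section gives each replica its own PersistentVolumeClaim (data-db-0, data-db-1, ...), and with ReadWriteOnce access an old and a new Pod generally can't both mount that volume, which is why the controller deletes a Pod before recreating it.

        apiVersion: apps/v1
        kind: StatefulSet
        metadata:
          name: db
        spec:
          replicas: 3
          serviceName: db              # headless Service, assumed to exist
          selector:
            matchLabels:
              app: db
          template:
            metadata:
              labels:
                app: db
            spec:
              containers:
                - name: db
                  image: postgres:16   # placeholder image
                  volumeMounts:
                    - name: data
                      mountPath: /var/lib/postgresql/data
          volumeClaimTemplates:        # one PVC per replica: data-db-0, data-db-1, data-db-2
            - metadata:
                name: data
              spec:
                accessModes: ["ReadWriteOnce"]   # attachable to a single node at a time
                resources:
                  requests:
                    storage: 10Gi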