docker | kubernetes | amazon-eks | keda

Kubernetes AutoScaler or changing Desired Nodes in AWS prematurely terminates Docker Pods


I built a service that uses Kubernetes pods (Docker containers) to process data. Processing time varies from as little as 15 minutes to as much as 1 hour.

My application captures SIGTERM to ensure a graceful shutdown takes place when demand drops and Pods and Nodes are decommissioned.

In each Docker image I placed code that reports back whether the container shut down because it completed its work, or because a SIGTERM was received and it finished its in-flight processing before terminating.

My system is deployed in AWS using EKS. EKS manages node deployment: nodes are added when demand goes up and spun down when demand drops. I use KEDA to manage Pod deployment, which in turn triggers whether additional nodes are needed or not. In KEDA I have the cooldownPeriod set to 2 hours, the maximum I would ever allow a pod to take, even though the most it actually takes is 1 hour.
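
For reference, the relevant piece of my KEDA configuration is the cooldownPeriod on the ScaledObject. The sketch below is illustrative only: the object name, replica counts, and the SQS trigger are placeholders, not my actual setup.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test-mypod-scaler            # illustrative name
spec:
  scaleTargetRef:
    name: test-mypod                 # the Deployment running the workers
  cooldownPeriod: 7200               # 2 hours
  minReplicaCount: 0
  maxReplicaCount: 10                # illustrative
  triggers:
    - type: aws-sqs-queue            # assumed trigger type; yours may differ
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/work-queue
        queueLength: "5"
        awsRegion: us-east-1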

In AWS EKS, I have also defined terminationGracePeriodSeconds as 2 hours (7200 seconds).
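
Concretely, that value lives in the Pod template of the Deployment. A minimal sketch, with the label and image as placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-mypod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-mypod                        # assumed label
  template:
    metadata:
      labels:
        app: test-mypod
    spec:
      terminationGracePeriodSeconds: 7200    # 2 hours
      containers:
        - name: mypod
          image: ...                         # image name omitted, as above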

I isolated the issue to node scale-down: when nodes are being terminated, the terminationGracePeriodSeconds is not honored and my Pods are shut down within ~30 minutes. Because the Pods are removed abruptly, I am unable to look at their logs to see what happened.

I tried to simulate this issue by draining a Kubernetes node while keeping my pod running:

kubectl drain <MY NODE>

I saw the SIGTERM come through, and I also noticed that the pod was only terminated after 2 hours and not before.

So for a brief minute I thought maybe I had not configured terminationGracePeriodSeconds properly, so I checked:

kubectl get deployment test-mypod -o yaml | grep terminationGracePeriodSeconds
  terminationGracePeriodSeconds: 7200

I even redeployed the config but that made no difference.

However, I was able to reproduce the issue by modifying the desiredSize of the node group. I can reproduce it programmatically in Python, using the boto3 EKS client, like this:

        resp = self.eks_client.update_nodegroup_config(clusterName=EKS_CLUSTER_NAME,
                                                       nodegroupName=EKS_NODE_GROUP_NAME,
                                                       scalingConfig={'desiredSize': configured_desired_size})

or by simply going to the AWS console and modifying the desiredSize there.

I then see EKS choose a node, and if that node happens to be running a pod whose processing will take about an hour, the pod is sometimes terminated prematurely.

I logged on to the node being scaled down and found no evidence of the prematurely terminated Pod in its logs.

I was able to capture this information once:

kubectl get events | grep test-mypod-b8dfc4665-zp87t
54m         Normal    Pulling    pod/test-mypod-b8dfc4665-zp87t         Pulling image ...
54m         Normal    Pulled     pod/test-mypod-b8dfc4665-zp87t         Successfully pulled image ...
54m         Normal    Created    pod/test-mypod-b8dfc4665-zp87t         Created container mypod
54m         Normal    Started    pod/test-mypod-b8dfc4665-zp87t         Started container mypod
23m         Normal    ScaleDown  pod/test-mypod-b8dfc4665-zp87t         deleting pod for node scale down
23m         Normal    Killing    pod/test-mypod-b8dfc4665-zp87t         Stopping container mypod
13m         Warning   FailedKillPod   pod/test-mypod-b8dfc4665-zp87t         error killing pod: failed to "KillContainer" for "mypod" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

I also once saw a pod removed for no apparent reason: scale-down was disabled, yet the autoscaler still decided to remove my pod:

kubectl get events | grep test-mypod-b8dfc4665-vxqhv
45m         Normal    Pulling    pod/test-mypod-b8dfc4665-vxqhv Pulling image ...
45m         Normal    Pulled     pod/test-mypod-b8dfc4665-vxqhv Successfully pulled image ...
45m         Normal    Created    pod/test-mypod-b8dfc4665-vxqhv Created container mypod
45m         Normal    Started    pod/test-mypod-b8dfc4665-vxqhv Started container mypod
40m         Normal    Killing    pod/test-mypod-b8dfc4665-vxqhv Stopping container mypod

This is the Kubernetes version I have:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.20-eks-8c49e2", GitCommit:"8c49e2efc3cfbb7788a58025e679787daed22018", GitTreeState:"clean", BuildDate:"2021-10-17T05:13:46Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

To minimize this issue, I deploy a Pod Disruption Budget during peak hours to block scale-down, and in the evening, during low demand, I remove the PDB, which allows the scale-down to proceed. However, that is not the right solution, and even during low demand there are still pods that get stopped prematurely.
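
For completeness, the PDB used for that workaround looks roughly like this; the name and label selector are placeholders for my actual values:

apiVersion: policy/v1beta1           # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: test-mypod-pdb               # illustrative name
spec:
  maxUnavailable: 0                  # blocks voluntary evictions, including autoscaler scale-down
  selector:
    matchLabels:
      app: test-mypod                # assumed label on the worker pods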


Solution

  • We worked with Amazon support to solve this issue. The final resolution was not far from @lub0v's answer, but there was still a missing component.

    Our EKS cluster had only one node group spanning multiple Availability Zones. Instead, I deployed one node group per Availability Zone. Once we did that, the terminationGracePeriodSeconds was honored, presumably because a node group that spans multiple AZs is backed by an Auto Scaling group that may rebalance instances across zones, terminating them without regard for the pods running on them.
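
    We manage the node groups through the EKS API, but to illustrate the layout: if you define node groups with eksctl, one group per Availability Zone would look roughly like this (names, region, and sizes are illustrative):

        apiVersion: eksctl.io/v1alpha5
        kind: ClusterConfig
        metadata:
          name: <MY CLUSTER>
          region: us-east-1                    # illustrative region
        managedNodeGroups:
          - name: workers-us-east-1a
            availabilityZones: ["us-east-1a"]
            minSize: 0
            maxSize: 10
          - name: workers-us-east-1b
            availabilityZones: ["us-east-1b"]
            minSize: 0
            maxSize: 10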

    Also, as mentioned in my earlier answers, make sure your Pod template carries the annotation cluster-autoscaler.kubernetes.io/safe-to-evict set to "false".
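
    The annotation goes on the Pod template; as a fragment of the Deployment spec:

        spec:
          template:
            metadata:
              annotations:
                cluster-autoscaler.kubernetes.io/safe-to-evict: "false"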

    Finally, use --balance-similar-node-groups in your Cluster Autoscaler command-line parameters if you prefer to have the same number of nodes deployed across node groups during upscaling. Currently this parameter is not honored during downscaling.
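
    As a sketch, the flag goes on the cluster-autoscaler container command in its Deployment (the image tag and auto-discovery tags below are illustrative):

        spec:
          containers:
            - name: cluster-autoscaler
              image: k8s.gcr.io/autoscaling/cluster-autoscaler:<VERSION>
              command:
                - ./cluster-autoscaler
                - --cloud-provider=aws
                - --balance-similar-node-groups
                - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<MY CLUSTER>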

    Reference on autoscaling: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md