I built a service that uses Kubernetes Pods (Docker containers) to process data. Processing time varies from as little as 15 minutes to as much as 1 hour.
My application captures SIGTERM so it can shut down gracefully when demand drops and Pods and Nodes are decommissioned.
Each Docker image includes code that reports back whether the container shut down because it finished its work, or because it received a SIGTERM, wrapped up its processing, and terminated.
My system is deployed in AWS using EKS. EKS manages node deployment: it adds nodes when demand goes up and spins them down when demand drops. KEDA manages Pod deployment, which in turn is what triggers whether additional nodes are needed. In KEDA I set the cooldownPeriod to 2 hours, even though the longest a Pod should ever take is 1 hour.
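For reference, the relevant part of my KEDA ScaledObject looks roughly like this (a sketch; the name, replica counts, and the SQS trigger are placeholders, not my actual configuration):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test-mypod-scaler          # placeholder name
spec:
  scaleTargetRef:
    name: test-mypod               # the Deployment KEDA scales
  cooldownPeriod: 7200             # 2 hours, double the longest expected job
  minReplicaCount: 0
  maxReplicaCount: 10              # illustrative
  triggers:
    - type: aws-sqs-queue          # example trigger only
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
        queueLength: "5"
        awsRegion: "us-east-1"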
In AWS EKS, I have also set terminationGracePeriodSeconds to 2 hours.
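Concretely, that value sits in the Pod template of the Deployment; a minimal sketch, with the image and labels as placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-mypod
spec:
  selector:
    matchLabels:
      app: test-mypod
  template:
    metadata:
      labels:
        app: test-mypod
    spec:
      terminationGracePeriodSeconds: 7200   # 2 hours
      containers:
        - name: mypod
          image: <MY IMAGE>                 # placeholder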
I isolated the issue to node scale-down: when nodes are being terminated, terminationGracePeriodSeconds is not honored and my Pods are shut down within ~30 minutes. Because the Pods are removed abruptly, I am unable to look at their logs to see what happened.
I tried to simulate the issue by draining a node while keeping my Pod running:
kubectl drain <MY NODE>
I saw the SIGTERM come through, and the Pod was terminated only after the full 2 hours, not before.
For a brief moment I thought I might not have configured terminationGracePeriodSeconds properly, so I checked:
kubectl get deployment test-mypod -o yaml | grep terminationGracePeriodSeconds
terminationGracePeriodSeconds: 7200
I even redeployed the config but that made no difference.
However, I was able to reproduce the issue by modifying the desiredSize of the node group. I can trigger it programmatically in Python like this:
resp = self.eks_client.update_nodegroup_config(
    clusterName=EKS_CLUSTER_NAME,
    nodegroupName=EKS_NODE_GROUP_NAME,
    scalingConfig={'desiredSize': configured_desired_size})
or simply by going to the AWS console and modifying the desiredSize there.
EKS then picks a node to remove, and if that node happens to be running a Pod that is processing data for about an hour, the Pod is sometimes terminated prematurely.
I have logged on to the node being scaled down and found no evidence of the prematurely terminated Pod in its logs.
I was able to capture this information once:
kubectl get events | grep test-mypod-b8dfc4665-zp87t
54m Normal Pulling pod/test-mypod-b8dfc4665-zp87t Pulling image ...
54m Normal Pulled pod/test-mypod-b8dfc4665-zp87t Successfully pulled image ...
54m Normal Created pod/test-mypod-b8dfc4665-zp87t Created container mypod
54m Normal Started pod/test-mypod-b8dfc4665-zp87t Started container mypod
23m Normal ScaleDown pod/test-mypod-b8dfc4665-zp87t deleting pod for node scale down
23m Normal Killing pod/test-mypod-b8dfc4665-zp87t Stopping container mypod
13m Warning FailedKillPod pod/test-mypod-b8dfc4665-zp87t error killing pod: failed to "KillContainer" for "mypod" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Once, even though scale-down was disabled, a Pod was still removed for no apparent reason:
kubectl get events | grep test-mypod-b8dfc4665-vxqhv
45m Normal Pulling pod/test-mypod-b8dfc4665-vxqhv Pulling image ...
45m Normal Pulled pod/test-mypod-b8dfc4665-vxqhv Successfully pulled image ...
45m Normal Created pod/test-mypod-b8dfc4665-vxqhv Created container mypod
45m Normal Started pod/test-mypod-b8dfc4665-vxqhv Started container mypod
40m Normal Killing pod/test-mypod-b8dfc4665-vxqhv Stopping container mypod
This is the Kubernetes version I have:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.20-eks-8c49e2", GitCommit:"8c49e2efc3cfbb7788a58025e679787daed22018", GitTreeState:"clean", BuildDate:"2021-10-17T05:13:46Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
To minimize the issue, I deploy a PodDisruptionBudget during peak hours to block scale-down, and in the evening, during low demand, I remove the PDB, which lets scale-down proceed. However, that is not the right solution, and even during off-peak hours some Pods still get stopped prematurely.
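The PDB I use for this is essentially the following sketch (name and selector are placeholders; maxUnavailable: 0 is what blocks voluntary evictions such as scale-down):

apiVersion: policy/v1beta1          # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: test-mypod-pdb
spec:
  maxUnavailable: 0                 # no voluntary disruptions allowed
  selector:
    matchLabels:
      app: test-mypod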
We worked with Amazon support to solve this issue. The final resolution was not far from @lub0v's answer, but there was still a missing component.
Our EKS cluster had only one node group, spanning multiple Availability Zones. I replaced it with one node group per Availability Zone. Once we did that, the terminationGracePeriodSeconds was honored.
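If you manage node groups with eksctl (an assumption on my part; the same layout can be built with Terraform or the console), the per-AZ setup looks roughly like this, with names and zones as placeholders:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster                      # placeholder
  region: us-east-1                     # placeholder
managedNodeGroups:
  - name: workers-1a
    availabilityZones: ["us-east-1a"]   # each node group pinned to a single AZ
    desiredCapacity: 2
  - name: workers-1b
    availabilityZones: ["us-east-1b"]
    desiredCapacity: 2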
Also, as noted in prior answers, make sure your Pod carries the cluster-autoscaler.kubernetes.io/safe-to-evict annotation set to "false".
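The annotation goes on the Pod template, for example:

spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"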
Finally, add --balance-similar-node-groups to your Cluster Autoscaler command-line parameters if you want the same number of nodes deployed across the node groups during upscaling. Currently this parameter is not honored during downscaling.
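In the cluster-autoscaler Deployment it is just one more command-line argument; a sketch, where the image tag and the other flags shown are illustrative:

containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.18.3   # match your cluster version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --balance-similar-node-groups
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster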
Reference on autoscaling: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md