amazon-web-services kubernetes

'Kubelet stopped posting node status' and node inaccessible


I am having some issues with a fairly new cluster where a couple of nodes (it always seems to happen in pairs, but that may just be a coincidence) become NotReady, and kubectl describe shows that the kubelet stopped posting node status for the MemoryPressure, DiskPressure, PIDPressure and Ready conditions.
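For reference, this is roughly how I am looking at the node conditions (the node name is a placeholder):

    # List nodes; the affected ones show up as NotReady
    kubectl get nodes

    # Inspect the conditions of a suspect node; when the kubelet stops reporting,
    # the conditions flip to Unknown with "Kubelet stopped posting node status."
    kubectl describe node <node-name>

    # The same conditions in a compact, machine-readable form
    kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'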

All of the running pods are stuck in Terminating (I can connect to the cluster with k9s and see this), and the only workaround I have found is to cordon and drain the nodes. After a few hours they get deleted and new ones are created; alternatively I can delete them with kubectl.
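The workaround looks roughly like this (the node name is a placeholder):

    # Stop new pods being scheduled onto the broken node, then evict what is running there
    kubectl cordon <node-name>
    kubectl drain <node-name> --ignore-daemonsets --delete-local-data --force

    # If the node never comes back, remove it so the autoscaling group replaces it
    kubectl delete node <node-name>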

The nodes are completely inaccessible via SSH (the connection times out), but AWS reports the EC2 instances as having no issues.
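By "no issues" I mean the EC2 status checks pass; something like this (the instance ID is a placeholder) shows both the system and instance checks as ok even while SSH is timing out:

    aws ec2 describe-instance-status \
        --instance-ids <instance-id> \
        --query 'InstanceStatuses[].[InstanceId,SystemStatus.Status,InstanceStatus.Status]' \
        --output table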

This has now happened three times in the past week. Everything recovers fine, but there is clearly an underlying issue and I would like to get to the bottom of it.

How would I go about finding out what has gone wrong if I cannot get onto the boxes at all? (It just occurred to me that I could take a snapshot of the volume and mount it on another instance, so I will try that if it happens again, but any other suggestions are welcome.)
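The snapshot idea would look roughly like this, assuming a healthy instance in the same AZ to attach the copy to (all IDs, the AZ and the device/partition names are placeholders):

    # Snapshot the root EBS volume of the dead node
    aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "debug NotReady node"

    # Create a volume from the snapshot in the same AZ as the debug instance
    aws ec2 create-volume --snapshot-id snap-xxxxxxxx --availability-zone eu-west-1a

    # Attach it to the debug instance (it may show up as /dev/xvdf or /dev/nvme1n1)
    aws ec2 attach-volume --volume-id vol-yyyyyyyy --instance-id i-zzzzzzzz --device /dev/sdf

    # On the debug instance, mount it read-only and dig through the logs
    sudo mkdir -p /mnt/deadnode
    sudo mount -o ro /dev/xvdf1 /mnt/deadnode
    less /mnt/deadnode/var/log/syslog
    journalctl -D /mnt/deadnode/var/log/journal -u kubelet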

Running Kubernetes v1.18.8.


Solution

  • The answer turned out to be an IOPS issue caused by du commands coming from (I think) cAdvisor. I have moved to io1 volumes and have been stable since then, so I am going to mark this as closed, with that change as the resolution; roughly what it involved is sketched below.

    Thanks for the help!
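In case it helps anyone hitting the same thing, this is roughly the sort of thing involved (the volume ID, time window and IOPS value are placeholders, and the BurstBalance check assumes gp2 burst credits were the limiting factor):

    # Check whether the volume was running out of burst credits around the time
    # the node went NotReady (BurstBalance is the CloudWatch metric for gp2 credits)
    aws cloudwatch get-metric-statistics \
        --namespace AWS/EBS --metric-name BurstBalance \
        --dimensions Name=VolumeId,Value=vol-xxxxxxxx \
        --start-time 2020-09-01T00:00:00Z --end-time 2020-09-02T00:00:00Z \
        --period 300 --statistics Minimum

    # Switch the volume to provisioned-IOPS (io1); the IOPS figure has to respect
    # the io1 size-to-IOPS limits for the volume
    aws ec2 modify-volume --volume-id vol-xxxxxxxx --volume-type io1 --iops 1000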