[SOLVED] Kubernetes NodeLost/NotReady / High IO Disks

Kubernetes NodeLost/NotReady / High IO Disks

I am experiencing a very complicated issue with Kubernetes in my production environments losing all their Agent Nodes, they change from Ready to NotReady, all the pods change from Running to NodeLost state. I have discovered that Kubernetes is making intensive usage of disks:

My cluster is deployed using acs-engine 0.17.0 (and I tested previous versions too and the same happened).

On the other hand, we decided to deploy the Standard_DS2_VX VM series which contains Premium disks and we incresed the IOPS to 2000 (It was previously under 500 IOPS) and same thing happened. I am going to try with a higher number now.

Any help on this will be appreaciated.

Solution

It was a microservice exhauting resources and then Kubernetes just halt the nodes. We have worked on establishing resources/limits based so we can avoid the entire cluster disruption.