We're upgrading an EKS cluster running a couple of applications from 1.24 to 1.25.
When provisioning the new worker nodes, right off the bat they came up in NotReady status, with the aws-node pod reporting the error:
failed to connect service :50051
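In case it's useful, the symptoms were visible with the usual checks (a sketch; the label selector assumes the stock aws-node DaemonSet labels):

```bash
# Nodes stuck in NotReady after joining the upgraded cluster
kubectl get nodes

# The error above shows up in the aws-node (VPC CNI) pods
kubectl -n kube-system logs -l k8s-app=aws-node --tail=50
```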
Digging a bit, it seemed to be related to the VPC CNI plugin for EKS (which we didn't have installed at that point). We installed it as per the docs and the nodes came up as expected.
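For context, the install looked roughly like this (a sketch; my-cluster is a placeholder, and it assumes the managed vpc-cni add-on rather than applying the manifest yourself):

```bash
# Check whether the VPC CNI add-on is already installed on the cluster
aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni

# Install it as a managed add-on if it's missing
# (omit --addon-version to get the default version for the cluster)
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni

# Confirm the aws-node pods become Ready afterwards
kubectl -n kube-system get pods -l k8s-app=aws-node
```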
After a while we noticed the nodes were restarting; the first one to go was the node holding filebeat, which was failing with the following error:
instance/beat.go:1015 Exiting: error in autodiscover provider settings:
error setting up kubernetes autodiscover provider:
couldn't discover kubernetes node due to error kubernetes:
Node could not be discovered with any known method. Consider setting env var NODE_NAME
Even though the error suggests the fix, which seems to be mentioned here, the discussion on the GitHub issue got me thinking that maybe it is something related to the CNI itself blocking filebeat's communication with the Kubernetes API server.
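For reference, the workaround the error message points at would look something like this (a sketch assuming filebeat runs as a DaemonSet named filebeat in kube-system with an existing env list on its first container; adjust name, namespace and container index to your setup):

```bash
# Inject NODE_NAME from the downward API so autodiscover can find the node
kubectl -n kube-system patch daemonset filebeat --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/env/-",
    "value": {
      "name": "NODE_NAME",
      "valueFrom": {"fieldRef": {"fieldPath": "spec.nodeName"}}
    }
  }
]'
```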
Another fact that backs this belief is that we have another cluster running on a version above 1.25 whose filebeat deployment has the same settings as the one on the cluster being migrated to 1.25.
Any light on this discovery issue is much appreciated.
The issue was actually hinted at during one of the updates, where the newly launched aws-node pod wasn't able to connect to the API server. It turns out that when upgrading a cluster, EKS won't auto-update the add-ons for you, so we upgraded the kube-proxy add-on and things went back to operational.
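For anyone hitting the same thing, the add-on upgrade was essentially this (cluster name and version are placeholders; pick a version from the describe-addon-versions output for your cluster):

```bash
# List kube-proxy add-on versions compatible with Kubernetes 1.25
aws eks describe-addon-versions \
  --addon-name kube-proxy \
  --kubernetes-version 1.25

# Upgrade the managed kube-proxy add-on to a 1.25-compatible version
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name kube-proxy \
  --addon-version v1.25.x-eksbuild.y \
  --resolve-conflicts OVERWRITE
```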