I upgraded AKS from 1.23.5 to 1.24.9 using the Azure portal. This part finished properly (or so I assumed) based on the status shown in the Azure portal.
I then continued from 1.24.9 to 1.25.5. This time it only partly worked: the Azure portal shows 1.25.5 for the node pool with provisioning state "Failed", while the nodes are still at 1.24.9.
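For reference, the node-side version can be confirmed with kubectl (the VERSION column shows the kubelet version each node is actually running):
# Check which version each node is actually on
kubectl get nodes -o wide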
I found that some nodes were having issues connecting to the network, both external (e.g. github.com) and internal services. For some reason the issue is intermittent: on the same node it sometimes works and sometimes does not. (I had Python pods running on each node to test with.)
Each node has the cluster DNS IP in resolv.conf.
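For reference, the DNS config can be checked from inside a pod like this (pod name and namespace are placeholders):
# nameserver should be the kube-dns service IP (10.0.0.10 by default on AKS, unless a custom dns-service-ip was set)
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/resolv.conf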
One of the questions on SO had a hint about ingress-nginx compatibility. I found that I had an incompatible version, so I upgraded it to 1.6.4, which is compatible with both 1.24 and 1.25.
But the network issue still persists. I am not sure if this is because of the AKS provisioning state of "Failed". The connectivity check for this cluster in the Azure portal is Success; the only issue reported in Azure portal diagnostics is the node pool provisioning state.
Is there anything I need to do after the ingress-nginx upgrade for all nodes/pods to pick up the new config?
Or is there a way to re-trigger this upgrade? I am not sure why it would help, but I am assuming it might reset the configs on all nodes and make things work.
OK, posting the solution and the journey to it here, in case someone comes across a similar issue.
There was a network issue in the cluster after the upgrade, which is why all pods had DNS problems. Because of these network issues metrics-server was not in a running state, its PDB kept allowed disruptions at 0, and that caused PodDrainFailure errors while upgrading the nodes.
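The blocking PDB is easy to spot from kubectl:
# List PodDisruptionBudgets in kube-system and check the ALLOWED DISRUPTIONS column
kubectl get pdb -n kube-system
# If metrics-server shows ALLOWED DISRUPTIONS = 0, node drains will be blocked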
I was able to force all nodes to upgrade to 1.25.5 by running:
az aks nodepool upgrade -n agentpool -g rg_name --cluster-name aks_name --node-image-only
However, after executing this, I had to keep deleting the PDB to get all the nodes to upgrade.
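Concretely, that meant deleting the metrics-server PDB each time it blocked a drain; roughly like this (the exact PDB name is whatever kubectl get pdb -n kube-system reports in your cluster):
# Delete the blocking PDB so the node drain can evict metrics-server
kubectl delete pdb metrics-server -n kube-system
# AKS add-on reconciliation may recreate it, so this may need to be repeated during the upgrade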
This helped get the control plane and all nodes to version 1.25.5; however, the overall status still remained in the Failed (Running) state. This was solved by triggering another upgrade with the --control-plane-only flag:
az aks upgrade \
--resource-group <ResourceGroupName> --name <AKSClusterName> \
--control-plane-only \
--kubernetes-version <KubernetesVersion>
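Afterwards, the provisioning state can be re-checked from the CLI as well; a quick sketch (placeholders as above, node pool name assumed to be agentpool):
# Cluster-level version and provisioning state
az aks show --resource-group <ResourceGroupName> --name <AKSClusterName> \
  --query "{kubernetesVersion:kubernetesVersion, provisioningState:provisioningState}" -o table
# Node-pool-level provisioning state
az aks nodepool show --resource-group <ResourceGroupName> --cluster-name <AKSClusterName> \
  -n agentpool --query provisioningState -o tsv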
However, this did not solve the core networking issue: from metrics-server to the application pods, everything was failing to resolve hostnames. The interesting thing was that internal services were not reachable at all, while the outside network (e.g. github.com, microsoft.com) would work intermittently.
Based on AKS issue 2903 and the related ingress-nginx issue 8501, I found that starting with k8s 1.24 ingress-nginx needs a special annotation to keep the Azure load balancer health probes working properly. I had to update the Helm release with the command below:
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
--reuse-values \
--namespace <NAMESPACE> \
--set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz
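To double-check that the annotation actually landed on the controller Service (the service name below assumes the chart default, ingress-nginx-controller):
kubectl get svc ingress-nginx-controller -n <NAMESPACE> -o yaml | grep health-probe-request-path
# Expected: service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz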
This did get the Azure AKS health dashboard and metrics-server back to a running state, but it did not solve the underlying network issue. I went through all the scenarios and commands in the MS troubleshooting guide for outbound connections and narrowed the issue down to the kube-dns service and the CoreDNS pods: DNS resolution worked when pointing the nameserver directly at a CoreDNS pod IP (this has to be run on the same node) or at a public DNS server, but it failed when using the kube-dns service IP configured in resolv.conf.
jovyan@elyra-web-59f899c447-xw5x2:~$ host -a microsoft.com
Trying "microsoft.com.elyra-airflow.svc.cluster.local"
;; connection timed out; no servers could be reached
jovyan@elyra-web-59f899c447-xw5x2:~$ nslookup microsoft.com
;; connection timed out; no servers could be reached
jovyan@elyra-web-59f899c447-xw5x2:~$ nslookup microsoft.com 1.1.1.1
Server: 1.1.1.1
Address: 1.1.1.1#53
Non-authoritative answer:
Name: microsoft.com
Address: 20.103.85.33
Name: microsoft.com
Address: 20.112.52.29
Name: microsoft.com
Address: 20.81.111.85
Name: microsoft.com
Address: 20.84.181.62
Name: microsoft.com
Address: 20.53.203.50
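For reference, this is roughly how the direct-to-CoreDNS-pod comparison can be done (the label selector is the standard k8s-app=kube-dns label AKS puts on the CoreDNS pods; <coredns-pod-ip> is a placeholder):
# Find the CoreDNS pod IPs and the nodes they run on
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# From a pod on the same node, query a CoreDNS pod directly, bypassing the kube-dns service IP
nslookup microsoft.com <coredns-pod-ip>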
I restarted CoreDNS, konnectivity-agent and so on, but it did not help.
In the end I found a hint in AKS issue 1320 which helped solve the problem. Even though that issue relates to k8s version 1.13, suggesting this is not a version-specific problem, it pointed me in the right direction: I deleted ALL pods from the kube-system namespace at once. As soon as those pods were back up and running, the DNS issue was gone and everything worked as before.
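For completeness, the fix boiled down to something like this (note that it briefly disrupts kube-system add-ons while their pods are recreated):
# Delete every pod in kube-system; their deployments/daemonsets recreate them immediately
kubectl delete pods --all -n kube-system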
Phew, this was quite a journey of 5 days to get it solved. Looking forward to the next upgrade in March now!