I upgraded AKS from 1.23.5 to 1.24.9 using the Azure portal. This part finished properly (or so I assumed) based on the status shown in the Azure portal.
I then continued from 1.24.9 to 1.25.5. This time it only partly worked: the Azure portal shows 1.25.5 for the node pool with provisioning state "Failed", while the nodes are still at 1.24.9.
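For reference, the node-side version can be confirmed with kubectl (the VERSION column shows the kubelet version each node is actually running):
# Check which version each node is actually on
kubectl get nodes -o wide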
I found that some nodes were having issues connecting to the network, both external (e.g. github.com) and internal services. For some reason the issue is intermittent: on the same node it sometimes works and sometimes does not. (I had Python pods running on each node to test with.)
Each node has the cluster DNS IP in resolv.conf.
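For reference, the DNS config can be checked from inside a pod like this (pod name and namespace are placeholders):
# nameserver should be the kube-dns service IP (10.0.0.10 by default on AKS, unless a custom dns-service-ip was set)
kubectl exec -it <pod-name> -n <namespace> -- cat /etc/resolv.conf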
One of the questions on SO had a hint about ingress-nginx compatibility. I found that I had an incompatible version, so I upgraded it to 1.6.4, which is compatible with both 1.24 and 1.25.
But the network issue still persists. I am not sure if this is because of the AKS provisioning state of "Failed". The connectivity check for this cluster in the Azure portal is Success; the only issue reported in Azure portal diagnostics is the node pool provisioning state.
Is there anything I need to do after the ingress-nginx upgrade for all nodes/pods to pick up the new config?
Or is there a way to re-trigger this upgrade? I am not sure why it would help, but I am assuming it might reset the configs on all nodes and make things work.
OK, posting the solution and the journey to it here, in case someone comes across a similar issue.
There was a network issue in the cluster after the upgrade, which is why all pods had DNS problems. Because of these network issues metrics-server was not in a running state, its PDB kept allowed disruptions at 0, and that caused PodDrainFailure errors while upgrading the nodes.
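The blocking PDB is easy to spot from kubectl:
# List PodDisruptionBudgets in kube-system and check the ALLOWED DISRUPTIONS column
kubectl get pdb -n kube-system
# If metrics-server shows ALLOWED DISRUPTIONS = 0, node drains will be blocked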
I was able to force all nodes to upgrade to 1.25.5 by running:
az aks nodepool upgrade -n agentpool -g rg_name --cluster-name aks_name --node-image-only
However, after executing this, I had to keep deleting the PDB to get all the nodes to upgrade.
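Concretely, that meant deleting the metrics-server PDB each time it blocked a drain; roughly like this (the exact PDB name is whatever kubectl get pdb -n kube-system reports in your cluster):
# Delete the blocking PDB so the node drain can evict metrics-server
kubectl delete pdb metrics-server -n kube-system
# AKS add-on reconciliation may recreate it, so this may need to be repeated during the upgrade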
This helped get the control plane and all nodes to version 1.25.5; however, the overall status still remained in the Failed (Running) state. This was solved by triggering another upgrade with the --control-plane-only flag:
az aks upgrade \
--resource-group <ResourceGroupName> --name <AKSClusterName> \
--control-plane-only \
--kubernetes-version <KubernetesVersion>
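Afterwards, the provisioning state can be re-checked from the CLI as well; a quick sketch (placeholders as above, node pool name assumed to be agentpool):
# Cluster-level version and provisioning state
az aks show --resource-group <ResourceGroupName> --name <AKSClusterName> \
  --query "{kubernetesVersion:kubernetesVersion, provisioningState:provisioningState}" -o table
# Node-pool-level provisioning state
az aks nodepool show --resource-group <ResourceGroupName> --cluster-name <AKSClusterName> \
  -n agentpool --query provisioningState -o tsv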
However, this did not solve the core networking issue: from metrics-server to the application pods, everything was failing to resolve hostnames. The interesting thing was that internal services were not reachable at all, while the outside network (e.g. github.com, microsoft.com) would work intermittently.
Based on AKS issue 2903 and the related ingress-nginx issue 8501, I found that starting with k8s 1.24 ingress-nginx needs a special annotation to keep the Azure load balancer health probes working properly. I had to update the Helm release with the command below:
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
--reuse-values \
--namespace <NAMESPACE> \
--set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz
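To double-check that the annotation actually landed on the controller Service (the service name below assumes the chart default, ingress-nginx-controller):
kubectl get svc ingress-nginx-controller -n <NAMESPACE> -o yaml | grep health-probe-request-path
# Expected: service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz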
This did get the Azure AKS health dashboard and metrics-server back to a running state, but it did not solve the underlying network issue. I went through all the scenarios and commands in the MS troubleshooting guide for outbound connections and narrowed the issue down to the kube-dns service and the CoreDNS pods: DNS resolution worked when pointing the nameserver directly at a CoreDNS pod IP (this has to be run on the same node) or at a public DNS server, but it failed when using the kube-dns service IP configured in resolv.conf.
jovyan@elyra-web-59f899c447-xw5x2:~$ host -a microsoft.com
Trying "microsoft.com.elyra-airflow.svc.cluster.local"
;; connection timed out; no servers could be reached
jovyan@elyra-web-59f899c447-xw5x2:~$ nslookup microsoft.com
;; connection timed out; no servers could be reached
jovyan@elyra-web-59f899c447-xw5x2:~$ nslookup microsoft.com 1.1.1.1
Server: 1.1.1.1
Address: 1.1.1.1#53
Non-authoritative answer:
Name: microsoft.com
Address: 20.103.85.33
Name: microsoft.com
Address: 20.112.52.29
Name: microsoft.com
Address: 20.81.111.85
Name: microsoft.com
Address: 20.84.181.62
Name: microsoft.com
Address: 20.53.203.50
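For reference, this is roughly how the direct-to-CoreDNS-pod comparison can be done (the label selector is the standard k8s-app=kube-dns label AKS puts on the CoreDNS pods; <coredns-pod-ip> is a placeholder):
# Find the CoreDNS pod IPs and the nodes they run on
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# From a pod on the same node, query a CoreDNS pod directly, bypassing the kube-dns service IP
nslookup microsoft.com <coredns-pod-ip>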
I restarted CoreDNS, konnectivity-agent and so on, but it did not help.
In the end I found a hint in AKS issue 1320 which helped solve the problem. Even though that issue relates to k8s version 1.13, suggesting this is not a version-specific problem, it pointed me in the right direction: I deleted ALL pods from the kube-system namespace at once. As soon as those pods were back up and running, the DNS issue was gone and everything worked as before.
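For completeness, the fix boiled down to something like this (note that it briefly disrupts kube-system add-ons while their pods are recreated):
# Delete every pod in kube-system; their deployments/daemonsets recreate them immediately
kubectl delete pods --all -n kube-system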
Phew, this was quite a journey of 5 days to get it solved. Looking forward to the next upgrade in March now!