jenkins, kubernetes, calico, cni

Kubernetes - Calico-Nodes 0/1 Ready


We are deploying Jenkins on a Kubernetes environment with 1 master and 4 worker nodes, using the Calico network plugin. Pods are created at the time a job runs in Jenkins, but hostnames don't resolve and there are no error logs in Jenkins. On checking the pods, the calico-node pod on the master node is not ready (0/1); I am not sure whether this is the cause of the problem above.

[root@kmaster-1 ~]#  kubectl get pod calico-node-lvvx4 -n kube-system -o wide
NAME                READY   STATUS    RESTARTS   AGE   IP             NODE                                  NOMINATED NODE   READINESS GATES
calico-node-lvvx4   0/1     Running   9          9d    x0.x1.x5.x6   kmaster-1.b.x.x.com   <none>           <none>



Events:
  Type     Reason     Age                       From                                          Message
  ----     ------     ----                      ----                                          -------
  Warning  Unhealthy  107s (x34333 over 3d23h)  kubelet, kmaster-1.b.x.x.com  (combined from similar events): Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 10.x1.2x.x23,10.x1.x7.x53,10.x1.1x.1x5,10.x1.2x.1x2
2020-04-12 08:40:48.567 [INFO][27813] health.go 156: Number of node(s) with BGP peering established = 0

10.x1.2x.x23, 10.x1.x7.x53, 10.x1.1x.1x5, 10.x1.2x.1x2 are the IPs of the calico pods on the worker nodes. They peer among themselves (netstat shows BGP established), but not with the master. Port 179 is open on the master, so I am not sure why BGP peering doesn't get established. Kindly advise.
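
For reference, the BGP peering state can also be inspected directly on a node; the calicoctl step below assumes the calicoctl binary is installed there, which may not be the case on every setup:

# established TCP sessions on the BGP port
ss -tnp | grep ':179'

# BIRD's view of each BGP peer (Established / Active / Connect)
calicoctl node status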


Solution

  • What Sanjay M. P. shared worked for me; however, I want to clarify what caused the problem, and why the solution works, in some more detail.

    First of all, I am running an Ubuntu environment, so what Piknik shared does not apply: firewalld exists only on CentOS / RHEL systems. Even so, ufw was disabled on all nodes.
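
    In case it helps, a quick sketch of how the firewall state can be double-checked on each family of distro (plain standard commands, nothing specific to this cluster):

    # Ubuntu / Debian
    sudo ufw status

    # CentOS / RHEL
    sudo firewall-cmd --state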

    I was able to narrow down the exact error causing this problem by running kubectl describe pod calico-node-*****. What I found was that the Calico BIRD service could not connect to its peers. The output also showed the IP addresses calico-node was trying to use for its BGP peers: it was using the wrong interface, and therefore the wrong IPs.
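
    A quick way to confirm which interface and address calico-node actually picked is to check its startup logs (the exact log wording may differ between Calico versions, and the pod name below is a placeholder):

    # substitute your real calico-node pod name
    kubectl logs -n kube-system calico-node-xxxxx | grep -i autodetect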

    To define the problem for myself: all of my node host VMs have multiple interfaces. If you don't explicitly specify which interface to use, Calico picks one "automatically" (the default autodetection method is first-found, which simply takes the first valid address on the first valid interface), whether you want that interface or not.

    The solution was to specify the exact interface when you build your Calico overlay network in the calico.yaml file. Sanjay M. P. uses a regex, which MAY work if your interfaces have distinct names; however, as I am running Ubuntu Server, every interface name starts with "ens", so the same problem happens.

    I have stripped out most of the calico.yaml file to show the exact location where this setting should be added (around line 675). I also left CALICO_IPV4POOL_CIDR in, as it needs to be set to the same subnet range specified at kubeadm initialization:

    spec:
      template:
        spec:
          containers:
            - name: calico-node
              image: calico/node:v3.14.2
              env:
                - name: CALICO_IPV4POOL_CIDR
                  value: "192.168.0.0/22"
                - name: IP_AUTODETECTION_METHOD
                  value: "interface=ens224"
    

    Unfortunately I did not find a way to roll back the old configuration, so I just rebuilt the whole cluster and redeployed the Calico overlay (thank god for VM snapshots).

    kubeadm init your cluster. Then run kubectl create -f calico.yaml with the setting added to build out the overlay network.
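
    For completeness, a minimal sketch of those two steps, assuming the same 192.168.0.0/22 pod CIDR used in the calico.yaml above (adjust to your own subnet):

    # initialize the control plane with a pod CIDR that matches CALICO_IPV4POOL_CIDR
    kubeadm init --pod-network-cidr=192.168.0.0/22

    # deploy the Calico overlay with the IP_AUTODETECTION_METHOD setting added
    kubectl create -f calico.yaml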

    Confirm the overlay network is working.
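
    A couple of checks that can be used for that; the label selector comes from the stock calico.yaml, and the busybox DNS test is just a common convention, not something specific to this cluster:

    # every calico-node pod should now report READY 1/1
    kubectl get pods -n kube-system -l k8s-app=calico-node -o wide

    # quick DNS / hostname resolution test from a throwaway pod
    kubectl run -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default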

    You can read more about IP_AUTODETECTION_METHOD in the Calico reference documentation.
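
    For reference, interface= is not the only value IP_AUTODETECTION_METHOD accepts. A sketch of two alternatives (the destination IP and the interface name below are just placeholders):

    - name: IP_AUTODETECTION_METHOD
      value: "can-reach=8.8.8.8"        # use the interface that can route to this destination
    # or
    - name: IP_AUTODETECTION_METHOD
      value: "skip-interface=ens192"    # autodetect, but never use interfaces matching this regex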