
Readiness fails in the Eclipse Hono pods of the Cloud2Edge package


I am a bit desperate and hope someone can help me. A few months ago I installed the Eclipse Cloud2Edge package on a Kubernetes cluster by following the installation instructions: I created a PersistentVolume and ran the helm install command with these options.

helm install -n $NS --wait --timeout 15m $RELEASE eclipse-iot/cloud2edge --set hono.prometheus.createInstance=false --set hono.grafana.enabled=false --dependency-update --debug

The YAML of the PersistentVolume is the following, and I create it in the same namespace in which I install the package.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-device-registry
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Mi
  hostPath:
    path: /mnt/
    type: Directory

Everything worked perfectly, with all pods ready and running, until the other day, when the cluster crashed and some pods stopped working.

The kubectl get pods -n $NS output is as follows:

NAME                                          READY   STATUS    RESTARTS   AGE
ditto-mongodb-7b78b468fb-8kshj                1/1     Running   0          50m
dt-adapter-amqp-vertx-6699ccf495-fc8nx        0/1     Running   0          50m
dt-adapter-http-vertx-545564ff9f-gx5fp        0/1     Running   0          50m
dt-adapter-mqtt-vertx-58c8975678-k5n49        0/1     Running   0          50m
dt-artemis-6759fb6cb8-5rq8p                   1/1     Running   1          50m
dt-dispatch-router-5bc7586f76-57dwb           1/1     Running   0          50m
dt-ditto-concierge-f6d5f6f9c-pfmcw            1/1     Running   0          50m
dt-ditto-connectivity-f556db698-q89bw         1/1     Running   0          50m
dt-ditto-gateway-589d8f5596-59c5b             1/1     Running   0          50m
dt-ditto-nginx-897b5bc76-cx2dr                1/1     Running   0          50m
dt-ditto-policies-75cb5c6557-j5zdg            1/1     Running   0          50m
dt-ditto-swaggerui-6f6f989ccd-jkhsk           1/1     Running   0          50m
dt-ditto-things-79ff869bc9-l9lct              1/1     Running   0          50m
dt-ditto-thingssearch-58c5578bb9-pwd9k        1/1     Running   0          50m
dt-service-auth-698d4cdfff-ch5wp              1/1     Running   0          50m
dt-service-command-router-59d6556b5f-4nfcj    0/1     Running   0          50m
dt-service-device-registry-7cf75d794f-pk9ct   0/1     Running   0          50m

The pods that fail all have the same error when running kubectl describe pod POD_NAME -n $NS.

Events:
Type     Reason     Age                    From               Message
----     ------     ----                   ----               -------
Normal   Scheduled  53m                    default-scheduler  Successfully assigned digitaltwins/dt-service-command-router-59d6556b5f-4nfcj to node1
Normal   Pulled     53m                    kubelet            Container image "index.docker.io/eclipse/hono-service-command-router:1.8.0" already present on machine
Normal   Created    53m                    kubelet            Created container service-command-router
Normal   Started    53m                    kubelet            Started container service-command-router
Warning  Unhealthy  52m                    kubelet            Readiness probe failed: Get "https://10.244.1.89:8088/readiness": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  2m58s (x295 over 51m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

According to this, the readinessProbe fails. In the YAML definition of the affected deployments, the readinessProbe is defined as follows:

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /readiness
    port: health
    scheme: HTTPS
  initialDelaySeconds: 45
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

I have tried increasing these values, raising initialDelaySeconds to 600 and timeoutSeconds to 10. I have also tried uninstalling the package and installing it again, but nothing changes: the installation fails because the pods never become ready and the helm timeout expires. I have also exposed port 8088 (health) and called /readiness with wget, and the result is still 503. The livenessProbe, on the other hand, works fine when I test it. Finally, I have tried resetting the cluster. First I manually deleted everything in it and then used the following commands:

sudo kubeadm reset
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X
sudo systemctl stop kubelet
sudo systemctl stop docker
sudo rm -rf /var/lib/cni/
sudo rm -rf /var/lib/kubelet/*
sudo rm -rf /etc/cni/
sudo ifconfig cni0 down
sudo ifconfig flannel.1 down
sudo ifconfig docker0 down
sudo ip link set cni0 down
sudo brctl delbr cni0  
sudo systemctl start docker
sudo kubeadm init --apiserver-advertise-address=192.168.44.11 --pod-network-cidr=10.244.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl --kubeconfig $HOME/.kube/config apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

The cluster itself seems to work fine, because the Eclipse Ditto part has no problems; only the Eclipse Hono part is affected. Here is some more information in case it is useful.

The kubectl logs dt-service-command-router-b654c8dcb-s2g6t -n $NS output:

12:30:06.340 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.101:44142 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
12:30:06.756 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.100:46550 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
12:30:07.876 [vert.x-eventloop-thread-1] ERROR io.vertx.core.net.impl.NetServerImpl - Client from origin /10.244.1.102:40706 failed to connect over ssl: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.client.impl.HonoConnectionImpl - starting attempt [#258] to connect to server [dt-service-device-registry:5671, role: Device Registration]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - OpenSSL [available: false, supports KeyManagerFactory: false]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - using JDK's default SSL engine
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.3]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.2]
12:30:08.315 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - connecting to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Device Registration]
12:30:08.339 [vert.x-eventloop-thread-1] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - can't connect to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Device Registration]: Failed to create SSL connection
12:30:08.339 [vert.x-eventloop-thread-1] WARN  o.e.h.client.impl.HonoConnectionImpl - attempt [#258] to connect to server [dt-service-device-registry:5671, role: Device Registration] failed
javax.net.ssl.SSLHandshakeException: Failed to create SSL connection

The kubectl logs dt-adapter-amqp-vertx-74d69cbc44-7kmdq -n $NS output:

12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.client.impl.HonoConnectionImpl - starting attempt [#19] to connect to server [dt-service-device-registry:5671, role: Credentials]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - OpenSSL [available: false, supports KeyManagerFactory: false]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - using JDK's default SSL engine
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.3]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - enabling secure protocol [TLSv1.2]
12:19:36.686 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - connecting to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Credentials]
12:19:36.711 [vert.x-eventloop-thread-0] DEBUG o.e.h.c.impl.ConnectionFactoryImpl - can't connect to AMQP 1.0 container [amqps://dt-service-device-registry:5671, role: Credentials]: Failed to create SSL connection
12:19:36.712 [vert.x-eventloop-thread-0] WARN  o.e.h.client.impl.HonoConnectionImpl - attempt [#19] to connect to server [dt-service-device-registry:5671, role: Credentials] failed
javax.net.ssl.SSLHandshakeException: Failed to create SSL connection

The kubectl version output is as follows:

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.16", GitCommit:"e37e4ab4cc8dcda84f1344dda47a97bb1927d074", GitTreeState:"clean", BuildDate:"2021-10-27T16:20:18Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Thanks in advance!


Solution

  • Based on the iconic "Failed to create SSL connection" output in the logs, I assume that you have run into the dreaded "the demo certificates included in the Hono chart have expired" problem.

    The Cloud2Edge package chart is currently being updated (https://github.com/eclipse/packages/pull/337) with the most recent versions of the Ditto and Hono charts (which include fresh certificates that are valid for two more years). As soon as that PR is merged and the Eclipse Packages chart repository has been rebuilt, you should be able to do a helm repo update and then (hopefully) successfully install the c2e package.
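
    If you want to confirm that diagnosis before reinstalling, you can check whether the certificate presented by the device registry has passed its notAfter date. The following is only a sketch under a few assumptions: cert_status is a helper name made up for this example, /tmp/registry.pem is an arbitrary path, openssl must be installed on your workstation, and kubectl port-forward must be able to reach the dt-service-device-registry service on port 5671 (the host and port taken from your logs).

```shell
#!/bin/sh
# cert_status (hypothetical helper): report whether the PEM certificate
# in file $1 is still within its validity period.
# Prints "valid" or "expired"; returns 2 if the file is not a certificate.
cert_status() {
    pem_file="$1"
    # Bail out if openssl cannot parse the file as an X.509 certificate.
    openssl x509 -noout -in "$pem_file" >/dev/null 2>&1 || return 2
    # -checkend 0 exits non-zero when the certificate has already expired.
    if openssl x509 -noout -checkend 0 -in "$pem_file" >/dev/null 2>&1; then
        echo "valid"
    else
        echo "expired"
    fi
}

# Assumed usage against the cluster from the question (not run here):
#   kubectl port-forward svc/dt-service-device-registry 5671:5671 -n "$NS" &
#   openssl s_client -connect localhost:5671 </dev/null 2>/dev/null \
#     | openssl x509 -outform PEM > /tmp/registry.pem
#   cert_status /tmp/registry.pem
```

    If this prints "expired", the SSLHandshakeException entries in the adapter and command router logs are most likely caused by the expired demo certificates, and tuning the readinessProbe settings will not help.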