I have set up an SAP Vora 2.1 installation on AWS using kops. It is a 4-node cluster with 1 master and 3 worker nodes. The persistent volume for vsystem-vrep is provided by AWS EFS, and the volumes for the other stateful components by AWS EBS. The installation goes through fine and runs for a few days, but after 3-4 days the following 5 Vora pods start showing issues: vora-catalog, vora-relational, vora-timeseries, vora-tx-coordinator, vora-disk.
Each of these pods has 2 containers, and both should be up and running. However, after 3-4 days one of the containers goes down on its own, although the Kubernetes cluster itself is up and running. I tried various ways to bring these pods back up with all required containers, but they do not come up.
I have captured the events for vora-disk as a sample; all of the affected pods show the same trace:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 7m 21 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning Unhealthy Liveness probe failed: dial tcp 100.96.7.21:10002: getsockopt: connection refused
1h 2m 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Killing Killing container with id docker://disk:pod "vora-disk-0_vora(2f5ea6df-545b-11e8-90fd-029979a0ef92)" container "disk" is unhealthy, it will be killed and re-created.
1h 58s 51 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal Warning FailedSync Error syncing pod
1h 58s 41 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning BackOff Back-off restarting failed container
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Started Started container
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Pulled Container image "ip-172-31-13-236.ap-southeast-2.compute.internal:5000/vora/dqp:2.1.32.19-vora-2.1" already present on machine
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Created Created container
1h 1s 988 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning Unhealthy Readiness probe failed: HTTP probe failed with statuscode: 503
I would appreciate any pointers to resolve this issue.
Thanks, Frank, for your suggestion and pointer. It has definitely helped to overcome a few of the issues, but not all.
We have specifically observed Vora services going down for no apparent reason. We understand that there must be some cause, but a recovery procedure is not available in the admin guide or anywhere else on the internet. The Vora services created by vora-operator are the ones going down. Each of these pods contains one security container and one service-specific container; the service-specific container goes down and does not come back up. We tried various options, such as restarting all Vora pods or restarting only the pods related to the Vora deployment operator, but these pods do not come up. In such cases we are re-deploying Vora, which essentially means all previous work is lost. Is there any command or procedure that brings the Vora pods back up with all of their containers?
This issue is described in SAP Note 2631736 (Liveness and Readiness issue in Vora 2.x). The note suggests increasing the health check interval.
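
As a rough illustration of what "increasing the health check interval" could look like on the Kubernetes side, here is a minimal Python sketch that builds a strategic-merge patch for the probe timings and prints the matching kubectl command without running it. Several things here are assumptions and not taken from the SAP Note: the values 60/10/5 are purely illustrative, the StatefulSet name vora-disk is inferred from the pod name vora-disk-0 in the events above, and the helper names build_probe_patch and kubectl_patch_command are hypothetical. Only the container name disk and the namespace vora come from the posted events. Please check the note for the actual recommended settings, and be aware that resources managed by the Vora deployment operator may get a manual patch overwritten.

```python
import json
import shlex


def build_probe_patch(container, period_seconds, timeout_seconds, failure_threshold):
    """Build a strategic-merge patch that relaxes both probes of one container.

    Only timing fields are set, so the existing probe handlers
    (tcpSocket / httpGet) are left as they are when the patch is merged.
    """
    timing = {
        "periodSeconds": period_seconds,        # how often the kubelet probes
        "timeoutSeconds": timeout_seconds,      # how long one probe may take
        "failureThreshold": failure_threshold,  # failures tolerated before acting
    }
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,
                            "livenessProbe": dict(timing),
                            "readinessProbe": dict(timing),
                        }
                    ]
                }
            }
        }
    }


def kubectl_patch_command(namespace, kind, name, patch):
    """Render (but do not execute) the kubectl command that would apply the patch."""
    args = [
        "kubectl", "-n", namespace, "patch", kind, name,
        "--type", "strategic", "-p", json.dumps(patch),
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Illustrative values only; they are NOT the values from SAP Note 2631736.
    patch = build_probe_patch("disk", period_seconds=60, timeout_seconds=10, failure_threshold=5)
    # "vora-disk" as the StatefulSet name is an assumption based on pod vora-disk-0.
    print(kubectl_patch_command("vora", "statefulset", "vora-disk", patch))
```

The script only prints the command, so it can be reviewed before anything is changed on the cluster. The same patch could be generated for the other affected pods by passing their container and resource names.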