
SAP Vora 2.1 on AWS intermittently goes down


I have set up an SAP Vora 2.1 installation on AWS using kops. It is a 4-node cluster with 1 master and 3 worker nodes. The persistent volume requirement for vsystem-vrep is provided using AWS EFS, and for the other stateful components using AWS EBS. The installation goes through fine and runs for a few days, but after 3-4 days the following 5 Vora pods start showing issues: vora-catalog, vora-relational, vora-timeseries, vora-tx-coordinator, vora-disk.

Each of these pods has 2 containers, and both should be up and running. However, after 3-4 days one of the containers goes down on its own, even though the Kubernetes cluster itself stays up and running. I have tried various ways to bring these pods back up with all the required containers, but they do not come up.

I have captured the events for vora-disk as a sample, but all of the pods show the same trace:

Events:
  FirstSeen     LastSeen        Count   From                                                            SubObjectPath           Type            Reason          Message
  ---------     --------        -----   ----                                                            -------------           --------        ------          -------
  1h            7m              21      kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal        spec.containers{disk}   Warning         Unhealthy       Liveness probe failed: dial tcp 100.96.7.21:10002: getsockopt: connection refused
  1h            2m              11      kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal        spec.containers{disk}   Normal          Killing         Killing container with id docker://disk:pod "vora-disk-0_vora(2f5ea6df-545b-11e8-90fd-029979a0ef92)" container "disk" is unhealthy, it will be killed and re-created.
  1h            58s             51      kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal                                Warning         FailedSync      Error syncing pod
  1h            58s             41      kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal        spec.containers{disk}   Warning         BackOff         Back-off restarting failed container
  1h            46s             11      kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal        spec.containers{disk}   Normal          Started         Started container
  1h            46s             11      kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal        spec.containers{disk}   Normal          Pulled          Container image "ip-172-31-13-236.ap-southeast-2.compute.internal:5000/vora/dqp:2.1.32.19-vora-2.1" already present on machine
  1h            46s             11      kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal        spec.containers{disk}   Normal          Created         Created container
  1h            1s              988     kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal        spec.containers{disk}   Warning         Unhealthy       Readiness probe failed: HTTP probe failed with statuscode: 503
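
For anyone debugging the same thing, here is a minimal sketch of kubectl commands that surface this kind of information (the namespace vora and the container name disk are inferred from the events above; adjust them if your setup differs):

  kubectl get pods -n vora -o wide                     # shows which pods are not fully Ready (e.g. 1/2)
  kubectl describe pod vora-disk-0 -n vora             # the Events section above comes from this output
  kubectl logs vora-disk-0 -c disk -n vora --previous  # log of the last killed "disk" container
  kubectl get events -n vora --sort-by=.metadata.creationTimestamp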

I would appreciate any pointers to resolve this issue.


Thanks Frank for your suggestion and pointers. They have definitely helped to overcome a few issues, but not all of them.

We have specifically observed issues with Vora services going down for no apparent reason. While we understand there may be some reason why Vora goes down, the recovery procedure is not documented in the admin guide or anywhere else on the internet. We have seen the Vora services created by the vora-deployment-operator go down (each of these pods contains one security container and one service-specific container; the service-specific container goes down and does not come back up). We have tried various options, such as restarting all Vora pods or only the pods related to the Vora deployment operator (a sketch of these attempts is below), but the pods do not come back up with all containers. We end up re-deploying Vora in such cases, but that essentially means all previous work is lost. Is there any command or procedure that brings the Vora pods back up with all of their containers?
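
A rough sketch of the restart attempts described above (the namespace and the operator pod label are assumptions and may need to be adapted):

  kubectl delete pod vora-disk-0 -n vora                      # let the controller recreate the pod
  kubectl delete pod -l app=vora-deployment-operator -n vora  # restart the operator pods (label is an assumption)
  kubectl get pods -n vora -w                                 # watch whether both containers reach 2/2 Ready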


Solution

  • This issue is described in SAP Note 2631736 (Liveness and Readiness issue in Vora 2.x); the suggested fix is to increase the health check (liveness/readiness probe) interval, along the lines of the sketch below.
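
A minimal sketch of what increasing the probe interval could look like for the vora-disk pod, assuming it is managed by a StatefulSet named vora-disk in namespace vora and that the disk container is the first container in the pod spec (index 0). The exact values and the supported procedure are described in the SAP Note, and manual edits may be reverted by the vora-deployment-operator:

  kubectl patch statefulset vora-disk -n vora --type=json -p '[
    {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds",    "value": 60},
    {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds",   "value": 30},
    {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 10},
    {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/periodSeconds",   "value": 60}
  ]'

The same settings can also be applied by editing the probe sections directly with kubectl edit statefulset vora-disk -n vora.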