I'm developing a service using Spring and deploying it on OpenShift. Currently I'm using the Spring Actuator health endpoint to serve as both the liveness and readiness probe for Kubernetes.
However, I'm going to add a call to another service in the Actuator health endpoint, and it looks to me like in that case I need to implement a separate liveness probe for my service. If I don't, a failure in the second service will cause my liveness probe to fail, and Kubernetes will restart my service without any real need.
Is it OK, for a liveness probe, to implement a simple REST controller which always returns HTTP status 200? If it responds, the service can always be considered alive. Or is there a better way to do it?
Include only those checks which, if they fail, will be healed by a pod restart. There is nothing wrong with having a new endpoint that always returns HTTP 200 to serve as the liveness probe endpoint, provided you have independent monitoring and alerting in place for the other services your first service depends on.
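Such an endpoint really is as trivial as it sounds. A minimal sketch, assuming Spring Web (the /liveness path is just an example; use whatever path you map in the livenessProbe below):

import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Liveness endpoint: returns 200 as long as the servlet container
// can still dispatch requests to the application.
@RestController
public class LivenessController {

    @GetMapping("/liveness")
    public ResponseEntity<Void> liveness() {
        return ResponseEntity.ok().build();
    }
}

Because it touches no dependencies, it fails only when the JVM or the request-handling machinery itself is wedged, which is exactly the situation a restart fixes.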
Where does a simple HTTP 200 liveness probe help?
Well, let's consider these examples.
If your application is a one-thread-per-request application (a servlet-based application, such as one running on Tomcat, which is Spring Boot 1.x's default), it may become unresponsive under heavy load. A pod restart will help here.
If you don't configure memory limits when you start your application, then under heavy load it may outgrow the pod's allocated memory and become unresponsive. A pod restart will help here too.
There are 2 aspects to it.
1) Let's say the URL of the second service is hard-coded in the first service, and you screwed up that URL in a subsequent release of the first service. If the second service's health is also included in the first service's health check, that check will stop the buggy version of the deployment from going live: your old version keeps running because the new version never makes it past the health check (see the sketch after this list).
2) On the other hand, let's assume your first service has numerous other functionalities, and the second service being down for a few hours will not affect any significant functionality the first service offers. Then, by all means, leave the second service out of the first service's health check.
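For aspect 1), the idiomatic way to fold the second service into the Actuator health endpoint is a custom HealthIndicator. A sketch, assuming Spring Boot Actuator and a plain RestTemplate; the second-service.url property name is hypothetical:

import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

// Contributes the second service's reachability to the aggregated
// Actuator health status, so a release of the first service with a
// broken URL never passes the readiness check.
@Component
public class SecondServiceHealthIndicator implements HealthIndicator {

    private final RestTemplate restTemplate = new RestTemplate();

    @Value("${second-service.url}") // hypothetical property holding the second service's URL
    private String secondServiceUrl;

    @Override
    public Health health() {
        try {
            restTemplate.getForEntity(secondServiceUrl, String.class);
            return Health.up().build();
        } catch (Exception e) {
            // Any connection error or non-2xx response marks the
            // aggregate health DOWN, which fails the readiness probe.
            return Health.down(e).build();
        }
    }
}

Note this indicator belongs in the readiness/health endpoint only; keep it out of the always-200 liveness endpoint, or you are back to restarting pods for someone else's outage.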
Either way, you need to set up proper alerting and monitoring for both services. That will help you decide when humans should intervene.
What I would do is this (ignoring other irrelevant details):
readinessProbe:
  httpGet:
    path: </Actuator-healthcheck-endpoint>
    port: 8080
  initialDelaySeconds: 120
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: </my-custom-endpoint-which-always-returns200>
    port: 8080
  initialDelaySeconds: 130
  timeoutSeconds: 10
  failureThreshold: 10