kubernetes, google-kubernetes-engine, readinessprobe, livenessprobe

k8s readiness and liveness probes failing even though the endpoints are working


I've got a Next.js app which has 2 simple readiness and liveness endpoints with the following implementation:

return res.status(200).send('OK');

I've created the endpoints as per the API routes docs. Also, I've got a /stats basePath as per the docs here. So, the probe endpoints are at /stats/api/readiness and /stats/api/liveness.
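
For reference, a minimal sketch of that setup (assuming the standard pages/ directory layout for API routes):

// next.config.js - serves the whole app under the /stats basePath
module.exports = {
  basePath: '/stats',
};

// pages/api/readiness.js - reachable at /stats/api/readiness
export default function handler(req, res) {
  return res.status(200).send('OK');
}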

When I build and run the app in a Docker container locally, the probe endpoints are accessible and return 200 OK.

When I deploy the app to my k8s cluster, though, the probes fail. There's plenty of initialDelaySeconds time, so that's not the cause.

I connect to the pod's service through port-forward, and when the pod has just started, before it fails, I can hit the endpoint and it returns 200 OK. A bit later it starts failing as usual.

I also tried accessing the failing pod through a healthy pod:

k exec -t [healthy pod name] -- curl -l 10.133.2.35:8080/stats/api/readiness

And it's the same situation: in the beginning, while the pod hasn't failed yet, I get 200 OK from the curl command. A bit later, it starts failing.

The error on the probes that I get is:

Readiness probe failed: Get http://10.133.2.35:8080/stats/api/readiness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Funny experiment: I tried putting a random, non-existent endpoint for the probes, and I got the same error. Which leads me to think that the probes fail because they cannot reach the proper endpoints at all?

But then again, the endpoints are accessible for a period of time before the probes start failing. So, I have literally no idea why this is happening.

Here is my k8s deployment config for the probes:

      livenessProbe:
        httpGet:
          path: /stats/api/liveness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 10
        timeoutSeconds: 3
        periodSeconds: 3
        successThreshold: 1
        failureThreshold: 5
      readinessProbe:
        httpGet:
          path: /stats/api/readiness
          port: 8080
          scheme: HTTP
        initialDelaySeconds: 10
        timeoutSeconds: 3
        periodSeconds: 3
        successThreshold: 1
        failureThreshold: 3

Update

I used curl -v as requested in the comments. The result is:

*   Trying 10.133.0.12:8080...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 10.133.0.12 (10.133.0.12) port 8080 (#0)
> GET /stats/api/healthz HTTP/1.1
> Host: 10.133.0.12:8080
> User-Agent: curl/7.76.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< ETag: "2-nOO9QiTIwXgNtWtBJezz8kv3SLc"
< Content-Length: 2
< Date: Wed, 16 Jun 2021 18:42:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
{ [2 bytes data]
100     2  100     2    0     0    666      0 --:--:-- --:--:-- --:--:--   666
* Connection #0 to host 10.133.0.12 left intact
OK%

Then, of course, once it starts failing, the result is:

*   Trying 10.133.0.12:8080...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* connect to 10.133.0.12 port 8080 failed: Connection refused
* Failed to connect to 10.133.0.12 port 8080: Connection refused
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
curl: (7) Failed to connect to 10.133.0.12 port 8080: Connection refused
command terminated with exit code 7

Solution

  • The error tells you: Client.Timeout exceeded while awaiting headers. Meaning the TCP connection is established (not refused, and not timing out at the connect stage); the app simply isn't answering within the probe's timeout. (The Connection refused in your update is consistent with the container having already been restarted after repeated liveness failures.)

    Your liveness/readiness probe timeout is too low. Your application doesn't have enough time to respond.

    Could be due to CPU or memory allocations being smaller than on your laptop, higher concurrency, or maybe a LimitRange that sets defaults you did not set yourself.
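
    If the cause is a LimitRange or missing resource requests (kubectl describe limitrange will show any namespace defaults), one remedy is to set them explicitly on the container; a rough sketch, with placeholder values rather than recommendations:

      resources:
        requests:
          cpu: 250m          # placeholders; size these to what the app actually needs
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi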

    Check with:

    time kubectl exec -t [healthy pod name] -- curl -l 127.0.0.1:8080/stats/api/readiness
    

    If you can't allocate more CPU, double that time, round it up, and fix your probes:

      livenessProbe:
        ...
        timeoutSeconds: 10
    
      readinessProbe:
        ...
        timeoutSeconds: 10
    

    Alternatively, though probably less in the spirit of health checking, you could replace those httpGet checks with tcpSocket ones. They would be faster, though they may miss actual issues.
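
    For example, a minimal tcpSocket sketch for the readiness probe (same idea for liveness); note it only verifies that something is listening on the port, not that the app can actually serve a request:

      readinessProbe:
        tcpSocket:
          port: 8080          # succeeds as soon as the TCP connection opens
        initialDelaySeconds: 10
        timeoutSeconds: 3
        periodSeconds: 3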