docker docker-compose docker-healthcheck

Docker Compose healthcheck: service never becomes unhealthy

I have a compose file with three services (database, backend and frontend). Backend depends on database being healthy, and frontend depends on backend being healthy.

The database (Postgres) checks its own health using pg_isready, and the backend (FastAPI) checks its health via the endpoint http://localhost:8080/healthcheck.

Compose file:

version: '3'
services:
  
  database:
    image: postgres:14-alpine
    healthcheck:
      test: pg_isready -U postgres
      interval: 1s
      timeout: 5s
      retries: 5
      start_period: 10s

  backend:
    depends_on:
      database:
        condition: service_healthy

    image: backend-api-image
    build: 
      context: backend
      dockerfile: Dockerfile

    ports:
      - "8080:8080"
    volumes:
      - './backend:/backend'

    healthcheck:
      test: wget --no-verbose --tries=1 --spider http://localhost:8080/healthcheck || exit 1
      interval: 1s
      timeout: 5s

  frontend:
    image: my-frontend
    depends_on:
      backend:
        condition: service_healthy
    build:
      context: ./frontend
      dockerfile: Dockerfile

FastAPI app:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get('/healthcheck')
def get_healthcheck():
    return 'OK'

So far this all works as expected. If, for example, I have a typo in the healthcheck route in my app, startup fails like so:

database  | 2023-06-01 23:01:44.410 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
database  | 2023-06-01 23:01:44.410 UTC [1] LOG:  listening on IPv6 address "::", port 5432
database  | 2023-06-01 23:01:44.411 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
database  | 2023-06-01 23:01:44.414 UTC [22] LOG:  database system was shut down at 2023-06-01 22:51:10 UTC
database  | 2023-06-01 23:01:44.417 UTC [1] LOG:  database system is ready to accept connections
backend   | INFO:     Will watch for changes in these directories: ['/backend']
backend   | INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
backend   | INFO:     Started reloader process [1] using StatReload
backend   | INFO:     Started server process [8]
backend   | INFO:     Waiting for application startup.
backend   | INFO:     Application startup complete.
backend   | INFO:     127.0.0.1:41294 - "GET /healthcheck HTTP/1.1" 404 Not Found
backend   | INFO:     127.0.0.1:41296 - "GET /healthcheck HTTP/1.1" 404 Not Found
backend   | INFO:     127.0.0.1:41298 - "GET /healthcheck HTTP/1.1" 404 Not Found
dependency failed to start: container backend is unhealthy

What confuses me is that after a successful startup, if I change the app so that the backend healthcheck fails, the reloaded app returns a 404 for the check (as expected), but the container never becomes unhealthy:

database  | 2023-06-01 23:06:37.396 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
database  | 2023-06-01 23:06:37.396 UTC [1] LOG:  listening on IPv6 address "::", port 5432
database  | 2023-06-01 23:06:37.397 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
database  | 2023-06-01 23:06:37.400 UTC [22] LOG:  database system was shut down at 2023-06-01 23:06:34 UTC
database  | 2023-06-01 23:06:37.403 UTC [1] LOG:  database system is ready to accept connections
backend   | INFO:     Will watch for changes in these directories: ['/backend']
backend   | INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
backend   | INFO:     Started reloader process [1] using StatReload
backend   | INFO:     Started server process [9]
backend   | INFO:     Waiting for application startup.
backend   | INFO:     Application startup complete.
backend   | INFO:     127.0.0.1:49450 - "GET /healthcheck HTTP/1.1" 200 OK
frontend  | 
frontend  | > frontend@0.0.0 dev
frontend  | > vite --host
frontend  | 
frontend  | Forced re-optimization of dependencies
frontend  | 
frontend  |   VITE v4.3.1  ready in 285 ms
frontend  | 
frontend  |   ➜  Local:   http://localhost:5173/
frontend  |   ➜  Network: http://172.26.0.4:5173/
backend   | INFO:     127.0.0.1:57966 - "GET /healthcheck HTTP/1.1" 200 OK
backend   | INFO:     127.0.0.1:57968 - "GET /healthcheck HTTP/1.1" 200 OK
backend   | INFO:     127.0.0.1:57982 - "GET /healthcheck HTTP/1.1" 200 OK
backend   | INFO:     127.0.0.1:57992 - "GET /healthcheck HTTP/1.1" 200 OK
backend   | INFO:     127.0.0.1:58002 - "GET /healthcheck HTTP/1.1" 200 OK
backend   | INFO:     127.0.0.1:58012 - "GET /healthcheck HTTP/1.1" 200 OK
backend   | INFO:     127.0.0.1:58018 - "GET /healthcheck HTTP/1.1" 200 OK
backend   | WARNING:  StatReload detected changes in 'src/main.py'. Reloading...
backend   | INFO:     Shutting down
backend   | INFO:     Waiting for application shutdown.
backend   | INFO:     Application shutdown complete.
backend   | INFO:     Finished server process [9]
backend   | INFO:     Started server process [76]
backend   | INFO:     Waiting for application startup.
backend   | INFO:     Application startup complete.
backend   | INFO:     127.0.0.1:58028 - "GET /healthcheck HTTP/1.1" 404 Not Found
backend   | INFO:     127.0.0.1:58040 - "GET /healthcheck HTTP/1.1" 404 Not Found
backend   | INFO:     127.0.0.1:35092 - "GET /healthcheck HTTP/1.1" 404 Not Found
backend   | INFO:     127.0.0.1:35098 - "GET /healthcheck HTTP/1.1" 404 Not Found
backend   | INFO:     127.0.0.1:35102 - "GET /healthcheck HTTP/1.1" 404 Not Found
backend   | INFO:     127.0.0.1:35116 - "GET /healthcheck HTTP/1.1" 404 Not Found
backend   | INFO:     127.0.0.1:35126 - "GET /healthcheck HTTP/1.1" 404 Not Found
backend   | INFO:     127.0.0.1:35134 - "GET /healthcheck HTTP/1.1" 404 Not Found

What I expected:

While running after a successful startup, upon changing the backend code in such a way that its healthcheck would fail, I expected frontend to exit or become degraded somehow, as its health dependency has failed.

What happened:

Everything kept running as if nothing happened, even though the backend healthcheck returned a failing value.

My questions:

Is the healthcheck only valid during startup to wait for a container to be "ready"? Documentation seems to suggest so.
If so, then why keep checking for health after successful startup?
If not, why is the backend container not being marked as unhealthy when changes cause its healthcheck to fail while running?
Is there a way to degrade a container to unhealthy while running after a successful startup?
I'm aware that I can use kill 1 instead of exit 1, which would cause the backend container to stop, but that doesn't seem very clean.

Solution

In trying to reproduce the behavior you've described, the first problem I ran into is that the standard version of wget will make HEAD requests when using the --spider option, so that your healthcheck results in:

"HEAD /healthcheck HTTP/1.1" 405 Method Not Allowed

This is using wget version 1.21 as installed in the python:3.11 image. I modified the healthcheck to look like this (and dropped the irrelevant parts of your docker-compose.yaml):

version: '3'
services:

  backend:
    image: backend-api-image
    build:
      context: backend
      dockerfile: Dockerfile

    ports:
      - "8080:8080"
    volumes:
      - './backend:/backend'

    healthcheck:
      test: wget --no-verbose -O /dev/null --tries=1 http://localhost:8080/healthcheck || exit 1
      interval: 1s
      timeout: 5s

I have your example FastAPI code in backend/backend.py, and my backend/Dockerfile looks like:

FROM python:3.11

WORKDIR /app
RUN python3 -m venv .venv
ENV PATH=/app/.venv/bin:/usr/local/bin:/usr/bin:/bin
COPY requirements.txt ./
RUN . .venv/bin/activate && pip install -r requirements.txt
COPY . ./

CMD ["uvicorn", "--reload", "--host", "0.0.0.0", "--port", "8080", "backend:app"]

When I run docker-compose up, I see:

backend_1  | INFO:     127.0.0.1:44856 - "GET /healthcheck HTTP/1.1" 200 OK
backend_1  | INFO:     127.0.0.1:44884 - "GET /healthcheck HTTP/1.1" 200 OK

...and the container enters the "healthy" state:

NAME                  IMAGE               COMMAND                  SERVICE             CREATED             STATUS                    PORTS
webserver_backend_1   backend-api-image   "uvicorn --reload --…"   backend             24 seconds ago      Up 23 seconds (healthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp

If I docker exec into the container and modify the FastAPI application so that the healthcheck endpoint returns an error, the server reloads and the healthcheck requests start failing:

backend_1  | WARNING:  StatReload detected changes in 'backend.py'. Reloading...
backend_1  | INFO:     Shutting down
backend_1  | INFO:     Waiting for application shutdown.
backend_1  | INFO:     Application shutdown complete.
backend_1  | INFO:     Finished server process [8]
backend_1  | INFO:     Started server process [1050]
backend_1  | INFO:     Waiting for application startup.
backend_1  | INFO:     Application startup complete.
backend_1  | INFO:     127.0.0.1:44618 - "GET /healthcheck HTTP/1.1" 400 Bad Request
backend_1  | INFO:     127.0.0.1:48912 - "GET /healthcheck HTTP/1.1" 400 Bad Request

And the container enters the "unhealthy" state:

NAME                  IMAGE               COMMAND                  SERVICE             CREATED             STATUS                     PORTS
webserver_backend_1   backend-api-image   "uvicorn --reload --…"   backend             2 minutes ago       Up 2 minutes (unhealthy)   0.0.0.0:8080->8080/tcp, :::8080->8080/tcp

That all seems to work as expected: the container health status changes as the response from the FastAPI service changes.
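
To see what Docker itself recorded for each check, rather than inferring it from the application log, the health state can be read with docker inspect. The helper below is a minimal sketch that shells out to the Docker CLI and condenses the result. The function names are made up for illustration, and the field names (Status, FailingStreak, Log, ExitCode) are what I understand the .State.Health object to contain:

```python
import json
import subprocess


def summarize_health(inspect_json: str) -> str:
    """Condense the JSON printed by `docker inspect --format '{{json .State.Health}}'`."""
    health = json.loads(inspect_json)
    if not health:
        # Containers without a healthcheck print `null` here.
        return "no healthcheck configured"
    log = health.get("Log") or []
    # The last log entry holds the exit code of the most recent check command.
    last_exit = log[-1]["ExitCode"] if log else None
    return (
        f"{health['Status']} "
        f"(failing streak {health['FailingStreak']}, last exit code {last_exit})"
    )


def container_health(name: str) -> str:
    """Ask the Docker CLI for the health state of the named container."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{json .State.Health}}", name],
        capture_output=True,
        text=True,
        check=True,
    )
    return summarize_health(result.stdout)
```

For example, print(container_health("webserver_backend_1")) uses the container name from the listing above. If the last exit code stays 0 while the application logs show 404 responses, the check command is not reporting the failure.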

Here are some questions to help further diagnose things on your end:

What does the Dockerfile for your FastAPI service look like? In particular, what's the base image?
Have you verified that the wget command in that image returns an error code as expected for a non-200 response from the server?
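
If the wget in your image turns out not to return a non-zero exit code, one option is to replace it with a small probe written in Python, since the backend image already has an interpreter. This is a sketch under assumptions: the function names, the default timeout, and the idea of running it as the healthcheck command are mine, not something taken from your setup.

```python
import sys
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 3.0) -> int:
    """Return 0 if a GET to `url` answers with a 2xx status, else 1."""
    try:
        # urlopen sends a GET by default and raises HTTPError for 4xx/5xx.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except (urllib.error.URLError, OSError):
        # HTTPError is a URLError subclass; refused connections and
        # timeouts also end up here.
        return 1


def main(argv: list[str]) -> int:
    """Probe the URL given as the first argument, or the compose file's default."""
    target = argv[1] if len(argv) > 1 else "http://localhost:8080/healthcheck"
    return probe(target)
```

Saved as a hypothetical /backend/healthcheck.py with a final line sys.exit(main(sys.argv)), it could be used as the healthcheck test command in place of wget. Because it always issues a GET, it also avoids the HEAD request problem described above.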