docker compose: hundreds of health check processes not terminated.
services:
  tomcat:
    ...
    healthcheck:
      test:
        - CMD-SHELL
        - curl --fail http://localhost:8080 || exit 1
      interval: 5s
      timeout: 5s
      retries: 30
ps aux | grep curl
tomcat 939765 0.0 0.0 0 0 ? Z 08:46 0:00 [curl] <defunct>
tomcat 939824 0.0 0.0 0 0 ? Z 08:46 0:00 [curl] <defunct>
tomcat 939904 0.0 0.0 0 0 ? Z 08:47 0:00 [curl] <defunct>
tomcat 939962 0.0 0.0 0 0 ? Z 08:47 0:00 [curl] <defunct>
tomcat 940038 0.0 0.0 0 0 ? Z 08:47 0:00 [curl] <defunct>
tomcat 940094 0.0 0.0 0 0 ? Z 08:47 0:00 [curl] <defunct>
tomcat 940321 0.0 0.0 0 0 ? Z 08:48 0:00 [curl] <defunct>
tomcat 940380 0.0 0.0 0 0 ? Z 08:48 0:00 [curl] <defunct>
tomcat 940460 0.0 0.0 0 0 ? Z 08:48 0:00 [curl] <defunct>
tomcat 940516 0.0 0.0 0 0 ? Z 08:48 0:00 [curl] <defunct>
tomcat 940600 0.0 0.0 0 0 ? Z 08:49 0:00 [curl] <defunct>
tomcat 940657 0.0 0.0 0 0 ? Z 08:49 0:00 [curl] <defunct>
tomcat 940734 0.0 0.0 0 0 ? Z 08:49 0:00 [curl] <defunct>
tomcat 940875 0.0 0.0 0 0 ? Z 08:49 0:00 [curl] <defunct>
tomcat 940955 0.0 0.0 0 0 ? Z 08:50 0:00 [curl] <defunct>
tomcat 941013 0.0 0.0 0 0 ? Z 08:50 0:00 [curl] <defunct>
tomcat 941102 0.0 0.0 0 0 ? Z 08:50 0:00 [curl] <defunct>
tomcat 941162 0.0 0.0 0 0 ? Z 08:50 0:00 [curl] <defunct>
tomcat 941244 0.0 0.0 0 0 ? Z 08:51 0:00 [curl] <defunct>
tomcat 941332 0.0 0.0 0 0 ? Z 08:51 0:00 [curl] <defunct>
tomcat 941392 0.0 0.0 0 0 ? Z 08:51 0:00 [curl] <defunct>
tomcat 941474 0.0 0.0 0 0 ? Z 08:51 0:00 [curl] <defunct>
tomcat 941532 0.0 0.0 0 0 ? Z 08:52 0:00 [curl] <defunct>
tomcat 941609 0.0 0.0 0 0 ? Z 08:52 0:00 [curl] <defunct>
tomcat 941671 0.0 0.0 0 0 ? Z 08:52 0:00 [curl] <defunct>
tomcat 941749 0.0 0.0 0 0 ? Z 08:52 0:00 [curl] <defunct>
tomcat 941810 0.0 0.0 0 0 ? Z 08:53 0:00 [curl] <defunct>
....
tomcat 941895 0.0 0.2 22364 8436 ? S 08:53 0:00 curl --fail http://localhost:8080
tomcat 941954 0.0 0.2 22364 8512 ? S 08:53 0:00 curl --fail http://localhost:8080
tomcat 942032 0.0 0.2 22364 8384 ? S 08:53 0:00 curl --fail http://localhost:8080
tomcat 942238 0.0 0.2 22364 8528 ? S 08:54 0:00 curl --fail http://localhost:8080
tomcat 942316 0.0 0.2 22364 8552 ? S 08:54 0:00 curl --fail http://localhost:8080
tomcat 942377 0.0 0.2 22364 8496 ? S 08:55 0:00 curl --fail http://localhost:8080
tomcat 942452 0.0 0.2 22364 8360 ? S 08:55 0:00 curl --fail http://localhost:8080
...
Will the health check continue to run periodically even after the container has been reported healthy?
What is the reason that "curl" processes are not terminated?
Looking at this output, I read this as the curl command not completing within 5 seconds and getting killed, and the main container process isn't set up to handle a case where it gains responsibility for a child process it didn't start itself.
I suspect that there are two things you can do to fix this, one of which is setting init: true on the Compose service.
What I think is going on depends on some very specific details of how Linux (Unix) processes work. The CMD-SHELL health check is injected by Docker as an additional process in the container process namespace, in the same way as docker exec. But more specifically, there are two processes: a wrapper sh process running the command pipeline, and the curl command as a subprocess.
/bin/sh -c 'curl --fail http://localhost:8080 || exit 1'
+-- curl --fail http://localhost:8080
When you reach the timeout, Docker goes to terminate the process. It doesn't specifically know about the subprocess, though, so it sends a signal to the sh process. If it still doesn't terminate, Docker sends it SIGKILL, the signal that kill -9 sends, and the shell process ceases to exist.
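If the sh wrapper is what leaves curl orphaned, one thing to try is an exec-form health check, so that curl is the process Docker launches and signals directly. This is a minimal sketch based on the service in the question, not something I've tested against your setup. It drops the `|| exit 1`, so curl's own non-zero exit codes reach Docker unchanged; I'm assuming those are still treated as a failed check.

```yaml
services:
  tomcat:
    # ... rest of the service definition unchanged
    healthcheck:
      # Exec form ("CMD"): no /bin/sh wrapper, so curl itself is the
      # health check process rather than a child of a shell.
      test: ["CMD", "curl", "--fail", "http://localhost:8080"]
      interval: 5s
      timeout: 5s
      retries: 30
```

With this form there is no intermediate shell to be killed out from under curl, though it does not by itself make PID 1 reap anything.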
What happens to the curl process? Its parent used to be the sh process, but that's gone. The standard Unix rules here are that it gets moved to be a child of the "init" process, with process ID 1. In a Docker context, the main container process (your ENTRYPOINT if you have one, your CMD if not) is that process.
Eventually the curl process will complete, maybe with its own timeout. The standard Unix rules here are that most of the process's resources are released when it exits, but its process table entry stays around so that its parent process can wait(2) for it and find out its status code. A process that's exited but hasn't been waited for is a "zombie" process; that's your long listing of process entries that have Z in the status column and <defunct> at the end of the line. It's worth noting that these aren't using memory or file handles or other resources, only process table entries.
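Since curl finishing on its own is what lets the sh wrapper collect it normally, it may help to give curl an explicit timeout that is shorter than the health check's. A sketch, assuming 3 seconds is an acceptable bound for your server (that value is my guess, not something from the question); `--max-time` is curl's option for limiting the total time of a transfer.

```yaml
services:
  tomcat:
    # ... rest of the service definition unchanged
    healthcheck:
      test:
        - CMD-SHELL
        # --max-time 3 is an assumed value; keep it below "timeout" so
        # curl gives up before Docker has to kill the shell wrapper.
        - curl --fail --max-time 3 http://localhost:8080 || exit 1
      interval: 5s
      timeout: 5s
      retries: 30
```

If curl exits first, the shell is still alive to wait for it, so no orphan and no zombie should result from that check.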
The combination of these things adds up to: the main process in a container is process ID 1; process ID 1 is expected to be the "init" process; process ID 1 can sometimes unexpectedly get children attached to it that it didn't start itself, and it needs to clean up after them ("reap zombies"). If your main process is the Tomcat server, and it's not watching for additional children (or the SIGCHLD signal), then you'll leak processes in the way you're seeing.
The Compose init: true option wraps the main container command in a lightweight init process, by default Tini. If process 1 needs to reap zombies, Tini does that, and it handles some cases around signals. That's pretty much all it does, but it's an important function.
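In the Compose file from the question, that would look roughly like the following. The placement next to healthcheck is just for illustration; init is a service-level option, so it sits at the same level as healthcheck.

```yaml
services:
  tomcat:
    # ... rest of the service definition unchanged
    # Run a small init as PID 1; Tomcat becomes its child, and any
    # orphaned health check processes get reaped by the init.
    init: true
    healthcheck:
      test:
        - CMD-SHELL
        - curl --fail http://localhost:8080 || exit 1
      interval: 5s
      timeout: 5s
      retries: 30
```

After recreating the container, the same ps listing should show the init process as PID 1, and the defunct curl entries should stop accumulating.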