I'm trying to build a fully dockerized deployment of slurm using docker stacks, but jobs don't complete consistently. Does anyone have any idea why this might be?
Other than this problem, the system works: All the nodes come up, I can submit jobs, and they run. The problem I am having is that some jobs don't complete properly. Right now it's running on a single-node swarm.
I can submit a bunch of them with:
salloc -t 1 srun sleep 10
and watch them with squeue. Some of them complete after 10 seconds as expected, but most keep running until they hit the 1-minute timeout from -t 1.
The system consists of five docker services:
slurm-stack_mysql
slurm-stack_slurmdbd
slurm-stack_slurmctld
slurm-stack_c1
slurm-stack_c2
c1 and c2 are the worker nodes. All five services run the same Docker image (Dockerfile below) and are configured with the docker-compose.yml linked below.
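For a quick picture without opening the gists, the stack file declares the services roughly like this (a trimmed sketch -- the image names, commands, and everything omitted here are assumptions; the real docker-compose.yml is linked below):

version: "3.8"
services:
  mysql:
    image: mariadb:10.10                   # placeholder; see the linked docker-compose.yml for the real image
  slurmdbd:
    image: slurm-docker-cluster:latest     # assumed tag; all the Slurm services share one image
    command: ["slurmdbd"]
  slurmctld:
    image: slurm-docker-cluster:latest
    command: ["slurmctld"]
  c1:
    image: slurm-docker-cluster:latest
    command: ["slurmd"]
    hostname: c1                           # hostname entries like these come up in the answer below
  c2:
    image: slurm-docker-cluster:latest
    command: ["slurmd"]
    hostname: c2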
Here are some things I've noticed and tried:
I based the Dockerfile and docker-compose.yml on a docker-compose-based version (i.e., without stacks or swarm). That version works just fine -- jobs complete as usual. So it seems like something in the transition to Docker Stacks is causing trouble. The original is here: https://github.com/giovtorres/slurm-docker-cluster
I noticed in the logs that slurmdbd was getting "Error connecting slurm stream socket at 10.0.2.6:6817: Connection refused" errors when connecting to an IP address that corresponded to the swarm load balancer. I managed to get rid of these by declaring all the services as global deployments in docker-compose.yml (see the sketch after this list). Other than eliminating the connection failures, it didn't seem to change anything. EDIT: @chris-becke pointed out that I was misusing global, so I've turned it off. No help, and the "connection refused" errors returned.
When I run host c2, host c1, or host <service> for any of the services in my system from inside one of the containers, I always get back two IP addresses. One of them corresponds to what I see in the containers section of docker network inspect slurm-stack_default. The other is one lower (e.g., 10.0.38.12 and 10.0.38.11). If I run ip addr in one of the containers, the IP address it reports matches what's listed for that host in the output of docker network inspect.
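For reference, the "global deployments" experiment from the second item above amounted to roughly this per service in docker-compose.yml (sketch only; the exact file is linked below):

services:
  slurmdbd:
    deploy:
      mode: global        # one task on every swarm node
      # mode: replicated  # the default, which I've switched back to (see the EDIT above)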
Here are all the configuration files for the system:
Dockerfile: https://gist.github.com/stevenjswanson/b819ab3a68cc7d9aea72099263ef10bd
docker-compose.yml: https://gist.github.com/stevenjswanson/4b50e085385a0ffcb0d6ffed9186ed02
slurm.conf: https://gist.github.com/stevenjswanson/d8c48fcd6b19b504fda3a32c34227878
slurmdb.conf: https://gist.github.com/stevenjswanson/84b31b5ae793379f16eff16678f75b47
install_slurm.sh: https://gist.github.com/stevenjswanson/bcd04828dbc69eb25acd48c3d4c8ef31
docker-entrypoint.sh: https://gist.github.com/stevenjswanson/0b3650a123fd93f54a1fd9b973ed2e65
I start it with docker stack deploy -c docker-compose.yml slurm-stack.
These are representative logs from a run where jobs don't finish consistently. In this case, jobs 2 (running on c2) and 3 (running on c1) don't complete correctly, but job 1 (running on c1) does.
slurmctld logs: https://gist.github.com/stevenjswanson/67ca4c76bc00200d52b2d05ab7bfb422
slurmdbd logs: https://gist.github.com/stevenjswanson/b49d9571dbf6b9160555db3a0867410f
c1 logs: https://gist.github.com/stevenjswanson/fab9ce8510804919fafe36804fd417f6
c2 logs: https://gist.github.com/stevenjswanson/dd03f5bdf77851115086801691410099
mysql logs: https://gist.github.com/stevenjswanson/d7cfb82adde9c260ea4673e2037363d1
Slurm version info:
$ sinfo -V
slurm-wlm 21.08.5
Docker version information:
$ docker version
Client:
Version: 20.10.12
API version: 1.41
Go version: go1.17.3
Git commit: 20.10.12-0ubuntu4
Built: Mon Mar 7 17:10:06 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.22
API version: 1.41 (minimum version 1.12)
Go version: go1.18.9
Git commit: 42c8b31
Built: Thu Dec 15 22:25:49 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.14
GitCommit: 9ba4b250366a5ddde94bb7c9d1def331423aa323
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Linux version:
$ uname -a
Linux slurmctld 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Edit: For good measure, I rebuilt everything on a brand new cloud instance with the latest docker (24.0.5) and kernel (5.15.0-78). The results are the same.
Docker creates a VIP, or virtual IP, for each service. When multiple tasks exist, this VIP load-balances between the healthy tasks. It also ensures that consumers are not affected by IP changes when tasks restart.
Each task container gets its own IP. Normally consumers are insulated from this duality: the service name is associated with the VIP, and tasks.<service>
is the DNS round-robin (dnsrr) entry that returns the zero or more IPs of the individual task containers.
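In stack-file terms, those two behaviours correspond to endpoint_mode under deploy (a sketch; slurmctld is just an example service):

services:
  slurmctld:
    deploy:
      endpoint_mode: vip      # default: "slurmctld" resolves to the stable virtual IP
      # endpoint_mode: dnsrr  # alternative: "slurmctld" resolves straight to the task IPs, no VIP
    # tasks.slurmctld is the DNS name that lists the individual task IPs, as described above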
However, Docker also registers each container's hostname in its internal DNS, and here steps in a frequent antipattern that refuses to die: lots of compose files, for no reason at all, declare a hostname that is the same as the service name.
This, as you have found, can have weird unintended side effects: now the hostname AND the service name both resolve, so a lookup returns both the VIP and the task IP, when really you want just one response.
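Concretely, the pattern in question looks like this (sketch; c1 stands in for any of the services):

services:
  c1:
    image: slurm-docker-cluster:latest   # illustrative image name
    hostname: c1   # same string as the service name, so "c1" resolves to both the VIP and the task IP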
In dnsrr mode, and in plain compose deployments, Docker does not create VIPs but simply registers each task's IP. So the service name and the hostname get registered with the same IP, and consequently c1 and c2 always resolve to a single address.
Nonetheless, the fix should be to remove the hostname: entries from the services and switch back to VIP mode, as it will probably confuse the controller if a worker starts work on one IP, gets restarted for some reason, and finishes on a different IP.
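Under the same assumptions as the sketches above, the suggested shape is:

services:
  c1:
    image: slurm-docker-cluster:latest
    # no hostname: entry -- let "c1" resolve only through the service VIP
    deploy:
      endpoint_mode: vip   # the default; spelled out only to be explicit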