docker ssh dockerfile distributed-system

How to run an MPI program across multiple docker containers without manually ssh'ing


I'm a bit new to docker and trying to simulate a cluster environment with it. I have defined a custom docker network that the containers share, and assign each container to a different port to simulate different network cards.

Currently, I have a working Dockerfile that copies over the needed ssh keys and I automatically have it start the ssh server with ENTRYPOINT service ssh start && bash.

Right now, my containers work, but the inconvenience is that when the containers start I have to manually run eval `ssh-agent` && ssh-add /.ssh/docker_id_rsa, then manually ssh into all the other containers, and only then am I able to run my MPI program. If I don't do these steps first, I can't run the program across the containers.

So what I'd like to do is when I attach to one of the containers, I want to either (1) immediately run my MPI program across all of the containers without having to run all the steps I mentioned above, or (2) even just immediately ssh into the other containers, and then run my program.

Here is an example of my current Dockerfile:

FROM img_base AS img

COPY /keys/ /root/.ssh
COPY /keys/docker_id_rsa.pub /root/.ssh/authorized_keys
RUN sed -i 's/#PermitRootLogin no/PermitRootLogin yes/g' /etc/ssh/sshd_config
RUN sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/g' /etc/ssh/sshd_config
RUN sed -i "s+StrictHostKeyChecking .*+StrictHostKeyChecking allow-new+" /etc/ssh/sshd_config

RUN echo "localhost" >> hostfile
RUN echo "root@container2" >> hostfile
RUN echo "root@container3" >> hostfile
RUN echo "root@container4" >> hostfile

EXPOSE 22

ENTRYPOINT service ssh start && bash && eval `ssh-agent` && ssh-add /root/.ssh/docker_id_rsa

I start my containers with the following bash script:

#!/bin/bash
docker run --rm -dit --name container1 --network=my-net --ip=172.18.0.2 -p 4022:22 --add-host container2:172.18.0.3 --add-host container3:172.18.0.4 --add-host container4:172.18.0.5 img 

docker run --rm -dit --name container2 --network=my-net --ip=172.18.0.3 -p 3022:22 --add-host container1:172.18.0.2 --add-host container3:172.18.0.4 --add-host container4:172.18.0.5 img

docker run --rm -dit --name container3 --network=my-net --ip=172.18.0.4 -p 5022:22 --add-host container2:172.18.0.3 --add-host container1:172.18.0.2 --add-host container4:172.18.0.5 img

docker run --rm -dit --name container4 --network=my-net --ip=172.18.0.5 -p 6022:22 --add-host container2:172.18.0.3 --add-host container3:172.18.0.4 --add-host container1:172.18.0.2 img

docker attach container1

I have tried adding the eval and ssh-add commands in the ENTRYPOINT command.

I've also tried adding these commands to the docker run commands in the bash script.

I've also tried to do this with a docker-compose file, but I still don't really understand how to use docker-compose.

Any advice or references on the proper way to do this would be greatly appreciated.


Solution

  • I'm not sure what your img_base looks like, but I'll assume that it's an Ubuntu image (or a derivative).

    You are setting up SSH access to the containers as the root user. This is not ideal, but it's fine for getting things up and running. Consider switching to a non-privileged user later.

    🗎 Dockerfile

    FROM ubuntu:22.04 AS img
    
    ENV DEBIAN_FRONTEND=noninteractive
    
    RUN apt-get update && \
        apt-get install -y openssh-server && \
        mkdir /var/run/sshd
    
    RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
    
    COPY keys/docker_id_rsa.pub /root/.ssh/authorized_keys
    RUN chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys
    
    CMD ["/usr/sbin/sshd", "-D"]
    

    Test the image, publishing the container's SSH port on host port 2022 to avoid a conflict with the SSH daemon already running on the host.

    SSH connection confirmed. ✅
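
A minimal sketch of such a test, assuming it is run from the directory that holds the Dockerfile and the keys/ folder. The image name, container name and the `smoke_test` function are placeholders invented for this example, not anything from the original setup.

```shell
#!/bin/bash
# Smoke test for the SSH image: build it, start it, and log in once.
# IMAGE and CONTAINER are hypothetical names chosen for this sketch.
IMAGE="img-ssh-test"
CONTAINER="ssh-test"
HOST_PORT=2022
KEY="keys/docker_id_rsa"

smoke_test() {
    docker build -t "$IMAGE" . || return 1
    # Publish the container's SSH port 22 on host port 2022.
    docker run -d --rm --name "$CONTAINER" -p "${HOST_PORT}:22" "$IMAGE" || return 1
    sleep 2  # give sshd a moment to start listening

    # Print the container's hostname over SSH. Host key checks are skipped
    # because this is a throwaway container.
    ssh -i "$KEY" -p "$HOST_PORT" \
        -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
        root@localhost hostname
    local status=$?

    # Stop the container (it is removed automatically because of --rm).
    docker stop "$CONTAINER" > /dev/null
    return "$status"
}
```

Calling `smoke_test` should print the container's hostname if the key pair and sshd configuration are right. Note that ssh refuses a private key that other users can read, so `keys/docker_id_rsa` may need `chmod 600` on the host first.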

    Now let's get this working with Docker Compose.

    🗎 Dockerfile

    FROM ubuntu:22.04 AS img
    
    ENV DEBIAN_FRONTEND=noninteractive
    
    RUN apt-get update && \
        apt-get install -y openssh-server && \
        mkdir /var/run/sshd
    
    # SSH server configuration.
    RUN sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
    # SSH client configuration.
    RUN echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config
    RUN echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config
    
    COPY keys /root/.ssh/
    COPY keys/docker_id_rsa.pub /root/.ssh/authorized_keys
    # ssh refuses to use a private key that other users can read.
    RUN chmod 700 /root/.ssh && \
        chmod 600 /root/.ssh/authorized_keys /root/.ssh/docker_id_rsa
    
    COPY setup.sh .
    RUN chmod +x /setup.sh
    
    CMD ["/usr/sbin/sshd", "-D"]
    

    🗎 docker-compose.yml

    version: '3.7'
    
    x-common-service: &common-service-template
      build:
        context: .
        dockerfile: Dockerfile
      networks:
          - my-net
    
    services:
      container1:
        <<: *common-service-template
        container_name: container1
        ports:
          - "4022:22"
        command: /bin/bash -c "/setup.sh"
    
      container2:
        <<: *common-service-template
        container_name: container2
        ports:
          - "3022:22"
    
      container3:
        <<: *common-service-template
        container_name: container3
        ports:
          - "5022:22"
    
      container4:
        <<: *common-service-template
        container_name: container4
        ports:
          - "6022:22"
    
    networks:
      my-net:
    

    The container1 service is slightly different because it runs the setup.sh script. This script (see below) runs commands on the other three containers via SSH, so you can use it to set up all of the containers. For the moment, though, it just prints a message on each of them and then starts sshd in the foreground so that container1 keeps running.

    🗎 setup.sh

    #!/bin/bash
    
    echo "* Setting up cluster."
    
    ssh -i ~/.ssh/docker_id_rsa root@container2 'echo "- Running code on container2! $(hostname)"'
    ssh -i ~/.ssh/docker_id_rsa root@container3 'echo "- Running code on container3! $(hostname)"'
    ssh -i ~/.ssh/docker_id_rsa root@container4 'echo "- Running code on container4! $(hostname)"'
    
    echo "* Done!"
    
    /usr/sbin/sshd -D
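
One thing to watch for: Compose starts the four containers at roughly the same time, so sshd on container2 to container4 may not be accepting connections yet when setup.sh fires its first ssh command. A retry loop is one way to handle that. The sketch below is an assumption about how this could be done, and the helper names `retry` and `wait_for_ssh` are made up for this example.

```shell
#!/bin/bash
# retry ATTEMPTS DELAY COMMAND...: run COMMAND until it succeeds,
# at most ATTEMPTS times, sleeping DELAY seconds between tries.
retry() {
    local attempts=$1 delay=$2
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        # No need to sleep after the final failed attempt.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# wait_for_ssh HOST: block until HOST accepts a key-based root login,
# giving up after 30 tries. BatchMode stops ssh from prompting for input.
wait_for_ssh() {
    retry 30 1 ssh -i ~/.ssh/docker_id_rsa \
        -o BatchMode=yes -o ConnectTimeout=2 "root@$1" true
}
```

In setup.sh, a call such as `wait_for_ssh container2` placed before the first ssh command to that host would then hold the script until the host is reachable.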
    

    Launch.

    docker-compose build && docker-compose up
    

    So container1 is effectively acting as the master and setting things up on the other containers.
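
To go from printing messages to actually starting the MPI job, container1 could write a hostfile and call mpirun itself. The following is only a sketch under several assumptions that are not part of the setup above: Open MPI is installed in the image, the program exists at the same path in every container, and an `IdentityFile /root/.ssh/docker_id_rsa` line has been added to the SSH client configuration so that mpirun's own ssh calls find the key without `-i` or ssh-agent. The function names are invented for this example.

```shell
#!/bin/bash
# write_hostfile FILE HOST...: write one "HOST slots=1" line per host,
# which is the hostfile format read by Open MPI's mpirun.
write_hostfile() {
    local file=$1
    shift
    local host
    : > "$file"    # create the file, or empty it if it already exists
    for host in "$@"; do
        echo "$host slots=1" >> "$file"
    done
}

# launch_mpi HOSTFILE PROGRAM [ARGS...]: start one rank per hostfile line.
# --allow-run-as-root is needed because everything here runs as root.
launch_mpi() {
    local hostfile=$1
    shift
    local ranks
    ranks=$(grep -c '' "$hostfile")    # number of lines in the hostfile
    mpirun --allow-run-as-root --hostfile "$hostfile" -np "$ranks" "$@"
}
```

In setup.sh, before the final `/usr/sbin/sshd -D`, this could be used as `write_hostfile /root/hostfile localhost container2 container3 container4` followed by `launch_mpi /root/hostfile /app/my_mpi_program`, where `/app/my_mpi_program` is a placeholder path.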
