Tags: docker, kubernetes, gpu, nvidia-docker, docker-in-docker

"docker:19.03-dind" could not select device driver "nvidia" with capabilities: [[gpu]]


I ran into an issue running Docker-in-Docker (DinD) on Kubernetes.

Full error:

http://localhost:2375/v1.40/containers/long-hash-string/start: Internal Server Error ("could not select device driver "nvidia" with capabilities: [[gpu]]")

When I exec into the DinD container inside the K8s pod, nvidia-smi is not available.

After some debugging, it seems the DinD image is missing the NVIDIA Docker tooling. I got the same error when I ran the same job directly with Docker on my local laptop, and I fixed it there by installing nvidia-docker2 (sudo apt-get install -y nvidia-docker2).

I'm thinking of installing nvidia-docker2 into the DinD 19.03 image (docker:19.03-dind), but I'm not sure how to do it. With a multi-stage Docker build?

Thank you very much!


update:

pod spec:

spec:
    containers:
      - name: dind-daemon
        image: docker:19.03-dind

Solution

  • I got it working myself.

    Referring to an earlier post that describes the approach:

    First, I modified the ubuntu-dind image (https://github.com/billyteves/ubuntu-dind) to install nvidia-docker (i.e. added the instructions in the nvidia-docker site to the Dockerfile) and changed it to be based on nvidia/cuda:9.2-runtime-ubuntu16.04.

    Then I created a pod with two containers: a frontend ubuntu container and a privileged Docker daemon container as a sidecar. The sidecar's image is the modified one mentioned above.
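
    As a rough sketch of that two-container layout, the snippet below writes a hypothetical pod manifest to a scratch file. The pod name gpu-dind-example, the frontend image tag and sleep command, the DOCKER_HOST value (taken from the localhost:2375 address in the error message), and the nvidia.com/gpu limit (which assumes the NVIDIA device plugin is installed on the cluster) are assumptions rather than details from the post. The sidecar image is the one published at the end of this answer, and whether its daemon listens on port 2375 depends on the dockerd-entrypoint.sh it ships.

    ```shell
    # Scratch location for the manifest (hypothetical file name).
    manifest="${TMPDIR:-/tmp}/gpu-dind-pod.yaml"

    # Write the manifest, one YAML line per printf argument.
    printf '%s\n' \
      'apiVersion: v1' \
      'kind: Pod' \
      'metadata:' \
      '  name: gpu-dind-example            # hypothetical pod name' \
      'spec:' \
      '  containers:' \
      '    - name: frontend                # runs the job and talks to the sidecar daemon' \
      '      image: ubuntu:20.04           # assumed tag; a Docker client must be installed in it' \
      '      command: ["sleep", "infinity"]' \
      '      env:' \
      '        - name: DOCKER_HOST' \
      '          value: tcp://localhost:2375' \
      '    - name: dind-daemon             # sidecar running dockerd' \
      '      image: brandsight/dind:nvidia-docker' \
      '      securityContext:' \
      '        privileged: true' \
      '      resources:' \
      '        limits:' \
      '          nvidia.com/gpu: 1         # assumes the NVIDIA device plugin is installed' \
      > "$manifest"

    echo "wrote $manifest"
    # Next step on a real cluster: kubectl apply -f "$manifest"
    ```

    Containers in one pod share a network namespace, which is why the frontend can reach the sidecar's daemon on localhost.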

    Since that post is about three years old, I spent quite some time matching up dependency versions, following repository migrations, and so on.

    My modified version of the Dockerfile used to build the image:

    ARG CUDA_IMAGE=nvidia/cuda:11.0.3-runtime-ubuntu20.04
    FROM ${CUDA_IMAGE}
    
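    # NOTE: DOCKER_CE_VERSION is never referenced below (and names a xenial build,
    # while the base image is Ubuntu 20.04); the install step that follows pulls
    # whatever docker-ce version the repository currently offers.
    ARG DOCKER_CE_VERSION=5:18.09.1~3-0~ubuntu-xenial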
    
    
    RUN apt-get update -q && \
        apt-get install -yq \
            apt-transport-https \
            ca-certificates \
            curl \
            gnupg-agent \
            software-properties-common && \
        curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - && \
        add-apt-repository \
           "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
           $(lsb_release -cs) \
           stable"  && \
        apt-get update -q && apt-get install -yq docker-ce docker-ce-cli containerd.io
    
    # https://github.com/docker/docker/blob/master/project/PACKAGERS.md#runtime-dependencies
    RUN set -eux; \
        apt-get update -q && \
        apt-get install -yq \
            btrfs-progs \
            e2fsprogs \
            iptables \
            xfsprogs \
            xz-utils \
    # pigz: https://github.com/moby/moby/pull/35697 (faster gzip implementation)
            pigz \
    #        zfs \
            wget
    
    
    # set up subuid/subgid so that "--userns-remap=default" works out-of-the-box
    RUN set -x \
        && addgroup --system dockremap \
        && adduser --system --ingroup dockremap dockremap \
        && echo 'dockremap:165536:65536' >> /etc/subuid \
        && echo 'dockremap:165536:65536' >> /etc/subgid
    
    # https://github.com/docker/docker/tree/master/hack/dind
    ENV DIND_COMMIT 37498f009d8bf25fbb6199e8ccd34bed84f2874b
    
    RUN set -eux; \
        wget -O /usr/local/bin/dind "https://raw.githubusercontent.com/docker/docker/${DIND_COMMIT}/hack/dind"; \
        chmod +x /usr/local/bin/dind
    
    
    ##### Install nvidia docker #####
    # Add the package repositories
    RUN curl -fsSL https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add --no-tty -
    
    RUN distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && \
        echo $distribution &&  \
        curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
          tee /etc/apt/sources.list.d/nvidia-docker.list
    
    RUN apt-get update -qq --fix-missing
    
    RUN apt-get install -yq nvidia-docker2
    
    RUN sed -i '2i \ \ \ \ "default-runtime": "nvidia",' /etc/docker/daemon.json
    
    RUN mkdir -p /usr/local/bin/
    COPY dockerd-entrypoint.sh /usr/local/bin/
    # 755 is enough to make the entrypoint executable; 777 would also make it world-writable
    RUN chmod 755 /usr/local/bin/dockerd-entrypoint.sh
    RUN ln -s /usr/local/bin/dockerd-entrypoint.sh /
    
    VOLUME /var/lib/docker
    EXPOSE 2375
    
    ENTRYPOINT ["dockerd-entrypoint.sh"]
    #ENTRYPOINT ["/bin/sh", "/shared/dockerd-entrypoint.sh"]
    CMD []
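
    The sed line in the Dockerfile is easy to misread, so here is a small stand-alone sketch of what it does, run against a scratch file instead of /etc/docker/daemon.json. The starting JSON is an assumption about what the nvidia-docker2 package installs and may differ between versions. The sed expression is copied from the Dockerfile and relies on GNU sed, as shipped in the Ubuntu base image.

    ```shell
    # Scratch copy standing in for /etc/docker/daemon.json (content is assumed).
    tmp="$(mktemp)"
    printf '%s\n' \
      '{' \
      '    "runtimes": {' \
      '        "nvidia": {' \
      '            "path": "nvidia-container-runtime",' \
      '            "runtimeArgs": []' \
      '        }' \
      '    }' \
      '}' > "$tmp"

    # Same expression as in the Dockerfile: insert a new line before line 2.
    sed -i '2i \ \ \ \ "default-runtime": "nvidia",' "$tmp"

    # Show the result; the new key should now be the second line.
    cat "$tmp"
    ```

    The new key lands ahead of "runtimes", which makes nvidia the default runtime for containers started by this daemon. The amount of leading whitespace on the inserted line may vary between sed versions, which does not matter for JSON.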
    
    

    When I exec into the Docker-in-Docker container, I can now run nvidia-smi successfully. Previously it returned a not-found error, and no docker run that needed GPU resources would work.
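
    A quick way to repeat that check from outside the pod might look like the following sketch. The pod name gpu-dind-example is hypothetical, the container name dind-daemon comes from the pod spec in the question, and the CUDA image tag is the base image used in the Dockerfile above. The commands are skipped when kubectl is not on the PATH.

    ```shell
    # check_gpu POD CONTAINER: run nvidia-smi in the DinD container, then in a nested container.
    check_gpu() {
      pod="$1"
      container="$2"
      # 1) The driver utilities should be visible inside the DinD container itself.
      kubectl exec "$pod" -c "$container" -- nvidia-smi || return 1
      # 2) A container started by the inner daemon should see the GPU as well.
      kubectl exec "$pod" -c "$container" -- \
        docker run --rm --gpus all nvidia/cuda:11.0.3-runtime-ubuntu20.04 nvidia-smi
    }

    if command -v kubectl >/dev/null 2>&1; then
      check_gpu gpu-dind-example dind-daemon   # hypothetical pod name
    else
      echo "kubectl not found; skipping GPU check"
    fi
    ```

    If the first command works but the second fails with the "could not select device driver" error, the inner daemon is most likely still missing the NVIDIA runtime.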

    You are welcome to pull my image at brandsight/dind:nvidia-docker.