GPU LXC passthrough

How do I enable GPU passthrough on CentOS/RHEL/OL8 using snapd's LXD/LXC containers?


The guide I have for deploying LXC on CentOS says to install LXD via snapd: https://www.cyberciti.biz/faq/set-up-use-lxd-on-centos-rhel-8-x/

snapd is a service that lets you install snap packages from the Debian/Ubuntu ecosystem, the logic being that LXD is most up to date on that platform.
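
For reference, the install sequence from that guide is roughly the following (on OL8 the EPEL release package is oracle-epel-release-el8 rather than epel-release, if I recall correctly):

    # enable EPEL, install snapd, then install LXD as a snap
    dnf install -y epel-release
    dnf install -y snapd
    systemctl enable --now snapd.socket
    ln -s /var/lib/snapd/snap /snap   # expose the /snap path
    snap install lxd                  # may need a re-login first so the snap bin dir is on PATH
    lxd init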

That said, I'm open to installing an alternative version if it makes enabling GPU passthrough easier.

Ultimately I'm trying to build a container environment where I can run the latest versions of Python and Jupyter with GPU support.

I have some guides on how to enable GPU passthrough:

https://theorangeone.net/posts/lxc-nvidia-gpu-passthrough/
https://www.reddit.com/r/Proxmox/comments/glog5j/lxc_gpu_passthrough/

I've added the following kernel modules on my OL8 host:

/etc/modules-load.d/vfio-pci.conf
    # Nvidia modules
    nvidia
    nvidia_uvm

# Noticed snapd has a modules file I can't edit:

/var/lib/snapd/snap/core18/1988/etc/modules-load.d/modules.conf
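
To double-check that the modules actually load, I can run something like this on the host after a reboot:

    # confirm the NVIDIA modules are present in the running kernel
    lsmod | grep -i nvidia
    # or load them right away without rebooting
    modprobe nvidia
    modprobe nvidia_uvm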
            

Then I modified GRUB:

nano /etc/default/grub 
    #https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough
    # appended to GRUB_CMDLINE_LINUX:
    #iommu=on amd_iommu=on
    iommu=pt amd_iommu=pt
            
grub2-mkconfig -o /boot/grub2/grub.cfg
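
After rebooting, I can sanity-check that the IOMMU is actually enabled on the host with something like:

    # AMD hosts log AMD-Vi / IOMMU lines at boot when it is enabled
    dmesg | grep -i -e iommu -e amd-vi
    # IOMMU groups should also be populated
    ls /sys/kernel/iommu_groups/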

Then I added udev rules:

    nano /etc/udev/rules.d/70-nvidia.rules
    KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
    KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"

#reboot

Then I added the GPU to lxc.conf:

    # check the device major numbers for the cgroup rules below
    ls -l /dev/nvidia*

    nano /var/snap/lxd/common/lxd/logs/nvidia-test/lxc.conf

    # Allow cgroup access
    lxc.cgroup.devices.allow: c 195:* rwm
    lxc.cgroup.devices.allow: c 243:* rwm

    # Pass through device files
    lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
    lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
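
I realize this lxc.conf under logs/ appears to be regenerated by LXD each time the container starts, so hand edits may be lost; a possibly better route with the snap LXD is the raw.lxc config key (untested on my side, and note it uses key = value syntax):

    lxc config set nvidia-test raw.lxc "
    lxc.cgroup.devices.allow = c 195:* rwm
    lxc.cgroup.devices.allow = c 243:* rwm
    lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
    lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
    lxc.mount.entry = /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
    "
    lxc restart nvidia-test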

Inside the LXC container I started (OL8):

# installed the NVIDIA driver package that comes with nvidia-smi
    nvidia-driver-cuda-3:460.32.03-1.el8.x86_64

# installed CUDA
    cuda-11-2-11.2.2-1.x86_64
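
Roughly, those came from NVIDIA's CUDA repository for RHEL 8; the exact repo/module setup may differ, but it was along these lines:

    # inside the container: add NVIDIA's CUDA repo for RHEL 8, then install
    dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    dnf install -y nvidia-driver-cuda cuda-11-2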

When I go to run nvidia-smi:

[root@nvidia-test ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Because I couldn't edit the snapd module file, I thought I'd manually copy the NVIDIA kernel module files into the container and insmod them there (paths determined using modprobe --show-depends):

[root@nvidia-test ~]# insmod nvidia.ko.xz NVreg_DynamicPowerManagement=0x02
insmod: ERROR: could not insert module nvidia.ko.xz: Function not implemented

Some diagnostic information from inside my container:

[root@nvidia-test ~]# find /sys | grep dmar
find: '/sys/kernel/debug': Permission denied
find: '/sys/fs/pstore': Permission denied
find: '/sys/fs/fuse/connections/59': Permission denied
[root@nvidia-test ~]# lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P1000] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)

So... is there something else I should do? Should I remove the snapd LXD and go with the default LXC provided by OL8?


Solution

  • You can pass a GPU through to a LXD container by creating a LXD gpu device. This gpu device takes care of all the necessary tasks to expose the GPU to the container, including the configuration you made explicitly above.

    Here is the documentation with all the extra parameters (for example, how to select a specific GPU when there is more than one): https://linuxcontainers.org/lxd/docs/master/instances#type-gpu

    In its simplest form, you can run the following against an existing container to add the default GPU to it.

    lxc config device add mycontainer mynvidia gpu
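
    To confirm the device was attached, you can inspect the container's devices, for example:

    lxc config device show mycontainer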
    

    When you add a GPU to an NVidia container, you also need to add the corresponding NVidia runtime to the container (so that it matches the kernel driver version on the host!). In containers we do not need to (and cannot) load kernel drivers, but we do need the runtime (libraries, utilities, and other software). LXD takes care of this: it downloads the appropriate version of the NVidia container runtime for you and attaches it to the container. Here is a full example that creates a container with the NVidia runtime enabled, and then adds the NVidia GPU device to that container.

    $ lxc launch ubuntu: mycontainer -c nvidia.runtime=true -c nvidia.driver.capabilities=all
    Creating mycontainer
    Starting mycontainer
    $ lxc config device add mycontainer mynvidia gpu
    Device mynvidia added to mycontainer
    $ lxc shell mycontainer
    root@mycontainer:~# nvidia-smi 
    Mon Mar 15 13:37:24 2021       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
    ...
    $ 
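
    If the host has more than one GPU, the gpu device accepts extra parameters (see the documentation linked above) to pin a specific card, for example by PCI address (using the 05:00.0 address from your lspci output):

    lxc config device add mycontainer mynvidia gpu pci=0000:05:00.0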
    

    If you often create such GPU containers, you can create a LXD profile with the GPU configuration. Then, whenever you want a GPU container, you can either launch it with the nvidia profile, or apply the nvidia profile to existing containers and thus make them GPU containers (an example of the latter is shown after the profile listing below)!

    $ cat mynvidiaLXDprofile.txt
    config:
      nvidia.driver.capabilities: all
      nvidia.runtime: "true"
    description: ""
    devices:
      mygpu:
        type: gpu
    name: nvidia
    used_by: []
    $ lxc profile create nvidia
    Profile nvidia created
    $ lxc profile edit nvidia < mynvidiaLXDprofile.txt
    $ lxc launch ubuntu:20.04 mycontainer --profile default --profile nvidia
    Creating mycontainer
    Starting mycontainer
    $ 
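
    For example, to apply the profile to an existing container (restart it afterwards so the NVidia runtime gets attached):

    lxc profile add mycontainer nvidia
    lxc restart mycontainer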
    

    We have been using the snap package of LXD for all the above instructions.