I'm trying to run a Docker container (Immich machine learning). When I start it with docker-compose, the images are pulled, but then it hangs while creating the container: the container is created but never actually starts. There are no error messages in docker logs or anywhere else I can find through Docker.
However, journalctl shows the following error every time I try to start the container:
May 23 08:36:26 cyan systemd[1]: Started libcontainer container a05a88a4ac884107e5b0898421910d8380e15906fa17d3f8d5217204c76717b3.
May 23 08:36:27 cyan systemd-coredump[4959]: Process 4941 (nvidia-containe) of user 0 terminated abnormally with signal 11/SEGV, processing...
May 23 08:36:27 cyan kernel: nvidia-containe[4941]: segfault at 0 ip 000077a9e05c3398 sp 00007ffdff95bd80 error 4 in libnvidia-container.so.1.17.7[a398,77a9e05bd000+15000] likely on CPU 2 (core 2, socket 0)
May 23 08:36:27 cyan kernel: Code: 00 8b 00 89 85 38 fe ff ff f6 c4 80 0f 85 d8 01 00 00 48 8d 85 50 fe ff ff 48 89 85 18 fe ff ff 48 8b 85 18 fe ff ff 4c 8b 00 <41> 80 38 40 0f 84 ea 01 00 00 ba f6 01 00 00 bf 49 00 00 00 48 8b
May 23 08:36:27 cyan systemd[1]: Created slice Slice /system/systemd-coredump.
May 23 08:36:27 cyan systemd[1]: Started Process Core Dump (PID 4959/UID 0).
May 23 08:36:27 cyan systemd-coredump[4960]: [🡕] Process 4941 (nvidia-containe) of user 0 dumped core.
Stack trace of thread 4941:
#0 0x000077a9e05c3398 nvc_ldcache_update (libnvidia-container.so.1 + 0xa398)
#1 0x0000622b6506b451 n/a (/usr/bin/nvidia-container-cli + 0x6451)
#2 0x0000622b65067353 n/a (/usr/bin/nvidia-container-cli + 0x2353)
#3 0x000077a9e03e46b5 n/a (libc.so.6 + 0x276b5)
#4 0x000077a9e03e4769 __libc_start_main (libc.so.6 + 0x27769)
#5 0x0000622b65067655 n/a (/usr/bin/nvidia-container-cli + 0x2655)
ELF object binary architecture: AMD x86-64
May 23 08:36:27 cyan systemd[1]: systemd-coredump@0-4959-0.service: Deactivated successfully.
I'm running EndeavourOS, fully updated. Searching for this error, most results suggest a problem with nvidia-container-toolkit, but I already have the latest version installed through pacman. I've tried several different Immich images, including ones I know have worked for me in the past, with the same result, so I don't think it's an Immich issue. I've rebooted a few times; earlier crashes were reported on CPU 1 (core 1), while this one is on CPU 2, so I don't think it's an actual hardware issue, but I can't rule it out.
I'm not sure how to debug this further.
I experienced the same issue on Arch Linux. It turned out to be caused by nvidia-container-toolkit 1.17.7-1. Downgrading to libnvidia-container 1.17.6-1 and nvidia-container-toolkit 1.17.6-1 solved it. See also this issue on GitHub: link
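If you want to confirm that your crash comes from the same library version, the kernel segfault line in the question already names it. Here is a minimal sketch that pulls the version out of such a line; the function name nvc_version_from_log is my own, and the sample is the kernel line from the question.

```shell
#!/bin/sh
# Hypothetical helper: print the libnvidia-container version named in a
# kernel segfault line, or nothing if the line does not contain one.
nvc_version_from_log() {
    printf '%s\n' "$1" |
        sed -n 's/.*libnvidia-container\.so\.\([0-9][0-9.]*[0-9]\).*/\1/p'
}

# Sample input: the kernel line from the question.
sample='May 23 08:36:27 cyan kernel: nvidia-containe[4941]: segfault at 0 ip 000077a9e05c3398 sp 00007ffdff95bd80 error 4 in libnvidia-container.so.1.17.7[a398,77a9e05bd000+15000] likely on CPU 2 (core 2, socket 0)'

nvc_version_from_log "$sample"   # prints 1.17.7
```

On a live system you could pass it a line taken from the kernel log of the current boot, for example the output of journalctl -k -b filtered with grep for libnvidia-container.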
I downgraded these packages with the downgrade tool (installed with yay -S downgrade): sudo downgrade nvidia-container-toolkit libnvidia-container.
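Put together, the check and the downgrade might look roughly like the sketch below. It assumes an Arch-based system with downgrade already installed. The is_affected helper is hypothetical and only matches the 1.17.7 release reported above. The script only prints the downgrade command, so nothing is changed until you run that command yourself.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a pacman version string (e.g. "1.17.7-1")
# belongs to the 1.17.7 release reported as broken above.
is_affected() {
    case "$1" in
        1.17.7-*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Only query pacman when it is actually available.
if command -v pacman >/dev/null 2>&1; then
    for pkg in libnvidia-container nvidia-container-toolkit; do
        # "pacman -Q <pkg>" prints "<name> <version>"; keep the version field.
        ver=$(pacman -Q "$pkg" 2>/dev/null | awk '{print $2}')
        if is_affected "$ver"; then
            echo "$pkg $ver is affected; downgrade with:"
            echo "  sudo downgrade nvidia-container-toolkit libnvidia-container"
        fi
    done
fi
```

The next full system upgrade would pull 1.17.7 back in, so until a fixed release is out you will probably want to hold both packages, for example with an IgnorePkg = nvidia-container-toolkit libnvidia-container line in /etc/pacman.conf. As far as I remember, downgrade offers to add that entry for you. Afterwards, a GPU smoke test such as docker run --rm --gpus all ubuntu nvidia-smi (my suggestion, not from the linked issue) should start without the segfault.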