I have installed Nvidia's GPU operator, and my GPU-enabled node gets automatically labelled (this is the label I treat as important; a long list of other labels is there as well):
`nvidia.com/gpu.count=1`
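(For reference, the GPU labels and, more importantly, the GPU resource actually advertised to the scheduler can be checked with something like the following; the node name is a placeholder.)

```bash
# GPU-related labels applied by the operator's feature discovery
kubectl get node <gpu-node> --show-labels | tr ',' '\n' | grep nvidia.com

# The extended resource the scheduler actually counts; labels alone don't create it
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
```

As far as I understand, scheduling of `nvidia.com/gpu` requests is based on this allocatable resource published by the device plugin, not on the labels.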
The node is seemingly schedulable:
```
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 10 Sep 2024 15:05:17 +0000   Tue, 10 Sep 2024 15:05:17 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletReady                 kubelet is posting ready status
```
The node also reports as Ready in `kubectl get nodes`. However, when I look at the demo workload, I see:
`Warning FailedScheduling 11s (x17 over 79m) default-scheduler 0/6 nodes are available: 3 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.`
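For context, the demo workload is essentially just a pod requesting one GPU, along these lines (a sketch rather than my exact manifest; the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-sample
    image: <cuda-sample-image>   # placeholder; the CUDA sample from the Nvidia guide
    resources:
      limits:
        nvidia.com/gpu: 1        # the resource the scheduler reports as insufficient
```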
I have even tried to manually label the node with `nvidia.com/gpu=1`, but no luck so far.
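(The manual labelling was roughly the following; the node name is a placeholder.)

```bash
kubectl label node <gpu-node> nvidia.com/gpu=1
```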
I have followed the guide from Nvidia: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html.
The only deviation from the automatic deployment is that I installed the driver (v550) manually, as Nvidia hasn't published driver images for Ubuntu 24.
nvidia-smi produces output on the host, so the driver essentially works, which is consistent with the node being labelled by the operator.
Kubernetes v1.31.0
Is there anything else I am missing?
I tried manually labelling the node and re-creating the pod; the expectation is to see the pod scheduled.
Well, it's embarrassing that I somehow overlooked the failing nvidia-operator-validator pod. Would anybody believe "I bet it was running"?
Anyway, looking at the pod logs or description does not give any useful information. But going onto the worker node where the container is scheduled (the one with the GPU) and running `sudo crictl ps -a` shows a `driver-validation` container with an increasing failure counter. Its logs are actually useful: besides showing nvidia-smi executing successfully (in my case), they give the answer:
` time="2024-09-12T15:33:34Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices" time="2024-09-12T15:33:34Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 1; output=modprobe: ERROR: ../libkmod/libkmod-module.c:968 kmod_module_insert_module() could not find module by name='nvidia_current_updates'\nmodprobe: ERROR: could not insert 'nvidia_current_updates': Unknown symbol in module, or unknown parameter (see dmesg)\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n validator:\n driver:\n env:\n - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n value: \"true\"" `
I wasn't savvy enough to figure out where to put that ClusterPolicy setting, but reinstalling the gpu-operator with

```
helm install --wait gpu-operator-1 -n gpu-operator --create-namespace nvidia/gpu-operator \
  --set driver.enabled=false \
  --set validator.driver.env[0].name=DISABLE_DEV_CHAR_SYMLINK_CREATION \
  --set-string validator.driver.env[0].value=true
```

saved the day.
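If you would rather set the variable on the existing ClusterPolicy instead of reinstalling, something like the following should also work (a sketch; it assumes the default resource name cluster-policy created by the Helm chart, and note that a merge patch replaces the whole env list):

```bash
kubectl patch clusterpolicy/cluster-policy --type merge -p \
  '{"spec":{"validator":{"driver":{"env":[{"name":"DISABLE_DEV_CHAR_SYMLINK_CREATION","value":"true"}]}}}}'
```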
Update 1. I was right that everything had been working, until the GPU-enabled worker was rebooted. After the reboot the host was not seeing the Nvidia driver, and even after reinstalling it, the feature-discovery, container-toolkit and device-plugin pods were stuck in a back-off state. A quick fix was to reinstall the GPU operator, but that's definitely not a proper fix.
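If anyone hits the same post-reboot state, a quick way to see whether the driver's kernel modules came back (which is what the validator complained about above) is something like:

```bash
# Are the NVIDIA kernel modules loaded on the host?
lsmod | grep nvidia

# Does the driver itself still respond?
nvidia-smi

# Which operator pods are failing?
kubectl get pods -n gpu-operator
```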