I'm trying to set up a Google Kubernetes Engine cluster with GPUs in the nodes, loosely following these instructions, because I'm deploying programmatically with the Python client.
I can create a cluster with a NodePool that contains GPUs, but for some reason the nodes in the NodePool don't have access to those GPUs.
I've already installed the NVIDIA DaemonSet with this yaml file: https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
You can see that it's there in this image:
For some reason those two lines always stay in status "ContainerCreating" and "PodInitializing"; they never turn green with status "Running". How can I get the GPUs in the NodePool to become available in the node(s)?
Based on the comments, I ran the following command on the two NVIDIA pods: kubectl describe pod POD_NAME --namespace kube-system
To do this, I opened the kubectl command terminal from the UI for the node. Then I ran the following command:
gcloud container clusters get-credentials CLUSTER-NAME --zone ZONE --project PROJECT-NAME
Then, I called kubectl describe pod nvidia-gpu-device-plugin-UID --namespace kube-system
and got this output:
Name: nvidia-gpu-device-plugin-UID
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: gke-mycluster-clust-default-pool-26403abb-zqz6/X.X.X.X
Start Time: Wed, 02 Mar 2022 20:19:49 +0000
Labels: controller-revision-hash=79765599fc
k8s-app=nvidia-gpu-device-plugin
pod-template-generation=1
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: DaemonSet/nvidia-gpu-device-plugin
Containers:
nvidia-gpu-device-plugin:
Container ID:
Image: gcr.io/gke-release/nvidia-gpu-device-plugin@sha256:aa80c85c274a8e8f78110cae33cc92240d2f9b7efb3f53212f1cefd03de3c317
Image ID:
Port: 2112/TCP
Host Port: 0/TCP
Command:
/usr/bin/nvidia-gpu-device-plugin
-logtostderr
--enable-container-gpu-metrics
--enable-health-monitoring
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 50m
memory: 50Mi
Requests:
cpu: 50m
memory: 20Mi
Environment:
LD_LIBRARY_PATH: /usr/local/nvidia/lib64
Mounts:
/dev from dev (rw)
/device-plugin from device-plugin (rw)
/etc/nvidia from nvidia-config (rw)
/proc from proc (rw)
/usr/local/nvidia from nvidia (rw)
/var/lib/kubelet/pod-resources from pod-resources (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
nvidia:
Type: HostPath (bare host directory volume)
Path: /home/kubernetes/bin/nvidia
HostPathType: Directory
pod-resources:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/pod-resources
HostPathType:
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType:
nvidia-config:
Type: HostPath (bare host directory volume)
Path: /etc/nvidia
HostPathType:
default-token-qnxjr:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qnxjr
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
:NoSchedule op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m55s default-scheduler Successfully assigned kube-system/nvidia-gpu-device-plugin-hxdwx to gke-opcode-trainer-clust-default-pool-26403abb-zqz6
Warning FailedMount 6m42s kubelet Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[nvidia-config default-token-qnxjr device-plugin dev nvidia pod-resources proc]: timed out waiting for the condition
Warning FailedMount 4m25s kubelet Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[proc nvidia-config default-token-qnxjr device-plugin dev nvidia pod-resources]: timed out waiting for the condition
Warning FailedMount 2m11s kubelet Unable to attach or mount volumes: unmounted volumes=[nvidia], unattached volumes=[pod-resources proc nvidia-config default-token-qnxjr device-plugin dev nvidia]: timed out waiting for the condition
Warning FailedMount 31s (x12 over 8m45s) kubelet MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory
Then, I called kubectl describe pod nvidia-driver-installer-UID --namespace kube-system
and got this output:
Name: nvidia-driver-installer-UID
Namespace: kube-system
Priority: 0
Node: gke-mycluster-clust-default-pool-26403abb-zqz6/X.X.X.X
Start Time: Wed, 02 Mar 2022 20:20:06 +0000
Labels: controller-revision-hash=6bbfc44f6d
k8s-app=nvidia-driver-installer
name=nvidia-driver-installer
pod-template-generation=1
Annotations: <none>
Status: Pending
IP: 10.56.0.9
IPs:
IP: 10.56.0.9
Controlled By: DaemonSet/nvidia-driver-installer
Init Containers:
nvidia-driver-installer:
Container ID:
Image: gke-nvidia-installer:fixed
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Requests:
cpu: 150m
Environment: <none>
Mounts:
/boot from boot (rw)
/dev from dev (rw)
/root from root-mount (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Containers:
pause:
Container ID:
Image: gcr.io/google-containers/pause:2.0
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-qnxjr (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
dev:
Type: HostPath (bare host directory volume)
Path: /dev
HostPathType:
boot:
Type: HostPath (bare host directory volume)
Path: /boot
HostPathType:
root-mount:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
default-token-qnxjr:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-qnxjr
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m20s default-scheduler Successfully assigned kube-system/nvidia-driver-installer-tzw42 to gke-opcode-trainer-clust-default-pool-26403abb-zqz6
Normal Pulling 2m36s (x4 over 4m19s) kubelet Pulling image "gke-nvidia-installer:fixed"
Warning Failed 2m34s (x4 over 4m10s) kubelet Failed to pull image "gke-nvidia-installer:fixed": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/gke-nvidia-installer:fixed": failed to resolve reference "docker.io/library/gke-nvidia-installer:fixed": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
Warning Failed 2m34s (x4 over 4m10s) kubelet Error: ErrImagePull
Warning Failed 2m22s (x6 over 4m9s) kubelet Error: ImagePullBackOff
Normal BackOff 2m7s (x7 over 4m9s) kubelet Back-off pulling image "gke-nvidia-installer:fixed"
Judging by the Docker image that the container is trying to pull (gke-nvidia-installer:fixed), it looks like you're trying to use the Ubuntu daemonset instead of the cos one.
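As a side note on the pull error itself: the kubelet event shows the bare name gke-nvidia-installer:fixed being looked up as docker.io/library/gke-nvidia-installer:fixed. The sketch below is a simplified illustration of that default expansion of short image names, not the actual resolver code. The function name normalize_image_ref is hypothetical, and tags and digests are passed through untouched.

```python
def normalize_image_ref(ref: str) -> str:
    """Expand a short image reference using the common Docker-style defaults.

    Simplified sketch: it only decides which registry/namespace prefix to add.
    """
    # Look at the first path component to see whether it names a registry.
    first, sep, _ = ref.partition("/")
    names_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    if names_registry:
        # Already qualified, e.g. gcr.io/google-containers/pause:2.0
        return ref
    if not sep:
        # Bare "name:tag": default registry plus the "library" namespace.
        return f"docker.io/library/{ref}"
    # "namespace/name:tag": only the default registry is missing.
    return f"docker.io/{ref}"


if __name__ == "__main__":
    print(normalize_image_ref("gke-nvidia-installer:fixed"))
    print(normalize_image_ref("gcr.io/google-containers/pause:2.0"))
```

This is why an image name without a registry prefix ends in "pull access denied, repository does not exist" unless the image is already present on the node.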
You should run kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
This will apply the right daemonset for your cos node pool, as stated here.
In addition, please verify that your node pool has the https://www.googleapis.com/auth/devstorage.read_only scope, which is needed to pull the image. You should see it on your node pool page in the GCP Console, under Security -> Access scopes (the relevant service is Storage).
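Since you are deploying from Python, the scope check can also be done in code. This is a minimal sketch, assuming you already have the node pool's OAuth scopes as a list of full scope URLs. The helper name has_storage_read_scope is hypothetical, and treating the broader scopes listed below as sufficient is an assumption rather than something stated above.

```python
STORAGE_READ_ONLY = "https://www.googleapis.com/auth/devstorage.read_only"

# Assumption: these broader scopes also grant read access to Cloud Storage.
BROADER_SCOPES = frozenset({
    "https://www.googleapis.com/auth/devstorage.read_write",
    "https://www.googleapis.com/auth/devstorage.full_control",
    "https://www.googleapis.com/auth/cloud-platform",
})


def has_storage_read_scope(oauth_scopes) -> bool:
    """Return True if the given scopes allow reading images from Storage."""
    scopes = set(oauth_scopes)
    return STORAGE_READ_ONLY in scopes or bool(scopes & BROADER_SCOPES)


if __name__ == "__main__":
    # Example scope list made up for illustration.
    example = ["https://www.googleapis.com/auth/logging.write"]
    print(has_storage_read_scope(example))
    print(has_storage_read_scope(example + [STORAGE_READ_ONLY]))
```

With the google-cloud-container client, the list would most likely come from the node pool's config.oauth_scopes field. If the check fails, the scope has to be included when the node pool is created.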