azure virtual-machine nvidia ubuntu-20.04 tesla

NvidiaGpuDriverLinux fails to install on NC6 instance


Pretty much what the title says. The VM is a "Standard NC6s v3" running Linux (Ubuntu 20.04), a size that comes with an NVIDIA Tesla V100 GPU. I added the NVIDIA GPU Driver Extension when I provisioned this machine.

The extension deployment is stuck in the "Transitioning" state.


I'm able to connect to the VM and can confirm that there's a background apt-get task running:

> ps -aux | grep 2736
0:01 apt-get -o Dpkg::Options::=--force-overwrite --no-install-recommends install -y cuda-drivers
0:00 /usr/bin/perl -w /usr/share/debconf/frontend /usr/lib/dkms/common.postinst nvidia 530.30.02 /usr/share/nvidia x86_64

It's been more than 40 mins. How long should this take to complete (if it would complete at all)?


Solution

  • The issue with the NvidiaGpuDriverLinux extension getting stuck in the "Transitioning" state seems to be intermittent. I tried provisioning a Linux VM with the same extension and configuration in my environment.

    The first attempt failed, but when I tried again with the same configuration, it succeeded.

    Regarding your question, "It's been more than 40 mins. How long should this take to complete (if it would complete at all)?":

    The deployment usually takes 10-15 minutes, sometimes up to 30 minutes. However, if the extension remains in the transitioning state for more than 30 minutes, it is possible that the deployment of the extension has failed.
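To tell a slow install from a failed one, you can check the extension state and the driver itself. The following is a minimal sketch, not an official procedure. `myResourceGroup` and `myVM` are hypothetical placeholders, and the helper function names are made up for illustration. `extension_state` is meant to run wherever the Azure CLI is logged in. `process_elapsed` and `driver_ready` are meant to run on the VM.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own resource group and VM name.
RESOURCE_GROUP="${RESOURCE_GROUP:-myResourceGroup}"
VM_NAME="${VM_NAME:-myVM}"

# Print the provisioning state Azure reports for the GPU driver extension.
extension_state() {
  az vm extension show \
    --resource-group "$RESOURCE_GROUP" \
    --vm-name "$VM_NAME" \
    --name NvidiaGpuDriverLinux \
    --query provisioningState \
    --output tsv
}

# On the VM: print how long a process (e.g. the apt-get PID) has been running.
process_elapsed() {
  ps -o etime= -p "$1"
}

# On the VM: succeeds only once nvidia-smi exists and can talk to the driver.
driver_ready() {
  command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1
}
```

For example, `process_elapsed 2736` shows how long the apt-get task from the question has been running, and `driver_ready && echo "driver loaded"` confirms the install finished. As far as I recall, the extension documentation also lists /var/log/azure/nvidia-vmext-status as the extension's log file on the VM, which is worth reading if the state never leaves "Transitioning".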

    You can redeploy the extension from the 'Extensions + applications' tab of the VM by following the steps below, or create a new VM:

    1. Delete the failed extension.

    2. Install the extension again.
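If you prefer the command line to the portal, the two steps above can be sketched with the Azure CLI. This is an illustration under assumptions, not the only way to do it. `myResourceGroup` and `myVM` are hypothetical placeholders, `redeploy_gpu_extension` is a made-up helper name, and the version `1.6` is an assumption taken from the example in the extension documentation, so check the current version before using it.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own resource group and VM name.
RESOURCE_GROUP="${RESOURCE_GROUP:-myResourceGroup}"
VM_NAME="${VM_NAME:-myVM}"
# Assumed extension version; verify against the current documentation.
EXTENSION_VERSION="${EXTENSION_VERSION:-1.6}"

redeploy_gpu_extension() {
  # Step 1: delete the failed extension from the VM.
  az vm extension delete \
    --resource-group "$RESOURCE_GROUP" \
    --vm-name "$VM_NAME" \
    --name NvidiaGpuDriverLinux || return 1

  # Step 2: install the extension again.
  az vm extension set \
    --resource-group "$RESOURCE_GROUP" \
    --vm-name "$VM_NAME" \
    --name NvidiaGpuDriverLinux \
    --publisher Microsoft.HpcCompute \
    --version "$EXTENSION_VERSION"
}
```

After sourcing the script, run `RESOURCE_GROUP=my-rg VM_NAME=my-gpu-vm redeploy_gpu_extension`. The function stops after step 1 if the delete fails, so a half-removed extension is not reinstalled on top of itself.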


    References: NVIDIA GPU Driver Extension for Linux | Microsoft Documentation