tensorflow, google-cloud-platform, nvidia, google-dl-platform

Stopping and starting a Deep Learning Google Cloud VM instance causes TensorFlow to stop recognizing the GPU


I am using one of the pre-built Deep Learning VM instances offered by Google Cloud, with an NVIDIA Tesla K80 GPU attached. I chose to have TensorFlow 2.5 and CUDA 11.0 installed automatically. When I start the instance, everything works great - I can run:

import tensorflow as tf
tf.config.list_physical_devices()

and the call returns the CPU, the accelerated CPU, and the GPU. Similarly, if I run tf.test.is_gpu_available(), it returns True.

However, if I log out, stop the instance, and then restart it, the exact same code lists only the CPU, and tf.test.is_gpu_available() returns False. I get an error that suggests the driver initialization is failing:

 E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

Running nvidia-smi shows that the machine still sees the GPU, but TensorFlow can’t.
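
Roughly, the check I am doing after a restart looks like this (a sketch - the exact interpreter path depends on the image):

# driver-level view: the Tesla K80 is still listed
nvidia-smi

# TensorFlow-level view: prints an empty list after a stop/start
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"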

Does anyone know what could be causing this? I don’t want to have to reinstall everything every time I restart the instance.


Solution

  • Some people (sadly not me) are able to resolve this by setting the following at the beginning of their script/main:

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set this before TensorFlow initializes CUDA (ideally before importing tensorflow)
    

    In my case, I had to reinstall the CUDA drivers, and from then on the GPU was recognized even after restarting the instance. On NVIDIA's website you can select your system configuration and it gives you the exact commands to run to install CUDA (roughly like the sketch below). The installer also asks whether you want to uninstall the previous CUDA version (say yes!). Luckily, this is also fairly fast.
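
    As a rough illustration only (copy the commands NVIDIA's download page generates for your OS and CUDA version; the runfile below is the CUDA 11.0 one as an example and may not be what you need):

    # sketch only: use the exact URL/commands from NVIDIA's download page
    wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda_11.0.3_450.51.06_linux.run
    sudo sh cuda_11.0.3_450.51.06_linux.run  # the installer asks whether to remove the previous CUDA version - answer yes
    nvidia-smi                               # afterwards, confirm the driver sees the GPU again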