Tags: gpu, nvidia, tensorflow, cudnn, theano-cuda

GPU is lost during execution of either TensorFlow or Theano code


When training either of two different neural networks, one with TensorFlow and the other with Theano, the execution sometimes freezes after a random amount of time (anywhere from a few minutes to a few hours, usually a few hours), and running "nvidia-smi" gives this message:

"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU"

I tried to monitor the GPU performance over a 13-hour run, and everything seemed stable: [GPU monitoring plot]
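
A log like the one above can be produced by periodically polling nvidia-smi. Below is a minimal sketch of such a polling loop; the 30-second interval, the gpu_log.csv file name, and the exact query fields are arbitrary choices, and it assumes nvidia-smi is on the PATH:

    import csv
    import subprocess
    import time

    # Hypothetical helper: poll nvidia-smi and append one sample per interval,
    # so the last readings before the GPU is lost are preserved on disk.
    QUERY = "timestamp,utilization.gpu,memory.used,temperature.gpu,power.draw"

    def poll_gpu(out_path="gpu_log.csv", interval_s=30):
        with open(out_path, "a", newline="") as f:
            writer = csv.writer(f)
            while True:
                result = subprocess.run(
                    ["nvidia-smi",
                     "--query-gpu=" + QUERY,
                     "--format=csv,noheader,nounits"],
                    capture_output=True, text=True)
                if result.returncode != 0:
                    # nvidia-smi fails once the GPU is lost -- record that too.
                    writer.writerow([time.ctime(), "nvidia-smi error",
                                     result.stderr.strip()])
                    f.flush()
                    break
                # One output line per GPU; write each as its own CSV row.
                for line in result.stdout.strip().splitlines():
                    writer.writerow([c.strip() for c in line.split(",")])
                f.flush()
                time.sleep(interval_s)

    if __name__ == "__main__":
        poll_gpu()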

I'm working with:

I'm not sure how to approach this problem. Can anyone suggest what might cause this and how to diagnose or fix it?


Solution

  • I posted this question a while ago, but after a few weeks of investigation back then, we managed to find the problem (and a solution). I don't remember all the details now, but I'm posting our main conclusion in case someone finds it useful.

    The bottom line is that the hardware we had was not strong enough to support high-load GPU-CPU communication. We observed these issues on a rack server with 1 CPU and 4 GPU devices; the PCI bus was simply overloaded. The problem was solved by adding a second CPU to the rack server.
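
    If you suspect a similar bottleneck, one quick sanity check is to ask nvidia-smi for the current versus maximum PCIe generation and link width of each GPU. The snippet below is only an illustrative sketch (the helper name is mine, and it assumes nvidia-smi is available); a link persistently running below its maximum under load can hint at a constrained CPU-GPU bus:

        import subprocess

        def pcie_link_status():
            # Report current vs. maximum PCIe generation and link width per GPU;
            # a link running below its maximum can hint at a CPU-GPU bus bottleneck.
            fields = ("index,name,pcie.link.gen.current,pcie.link.gen.max,"
                      "pcie.link.width.current,pcie.link.width.max")
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=" + fields, "--format=csv,noheader"],
                capture_output=True, text=True, check=True).stdout
            for line in out.strip().splitlines():
                idx, name, gen_cur, gen_max, w_cur, w_max = \
                    [c.strip() for c in line.split(",")]
                print("GPU %s (%s): PCIe gen %s/%s, width x%s/x%s"
                      % (idx, name, gen_cur, gen_max, w_cur, w_max))

        if __name__ == "__main__":
            pcie_link_status()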