When training either of two different neural networks, one with TensorFlow and the other with Theano, execution sometimes freezes after a random amount of time (anywhere from a few minutes to a few hours, but usually a few hours). Running "nvidia-smi" then gives this message:
"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU"
I monitored the GPU during a 13-hour run, and everything looked stable.
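For reference, this kind of monitoring can be sketched roughly as below. This is a minimal sketch, not the exact tool I used: it assumes `nvidia-smi` is on the PATH, and the chosen query fields, the helper names (`parse_row`, `query_gpus`, `monitor`), the log file name, and the polling interval are all illustrative assumptions.

```python
import subprocess
import time

# Fields requested via `nvidia-smi --query-gpu`; this selection is an assumption.
FIELDS = ["timestamp", "index", "temperature.gpu",
          "power.draw", "utilization.gpu", "memory.used"]


def parse_row(line):
    """Split one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("unexpected nvidia-smi output: %r" % line)
    return dict(zip(FIELDS, values))


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU it reports."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.check_output(cmd, universal_newlines=True)
    return [parse_row(l) for l in out.splitlines() if l.strip()]


def monitor(log_path="gpu_log.csv", interval_s=10):
    """Append GPU readings to log_path until nvidia-smi exits with an error."""
    with open(log_path, "a") as f:
        while True:
            try:
                rows = query_gpus()
            except subprocess.CalledProcessError as e:
                # Record the failure so the last good reading before it is easy to find.
                f.write("nvidia-smi failed with exit code %d\n" % e.returncode)
                f.flush()
                break
            for row in rows:
                f.write(",".join(row[k] for k in FIELDS) + "\n")
            f.flush()
            time.sleep(interval_s)
```

Calling `monitor()` in a separate process before starting training leaves a log whose last lines show the temperature, power draw, and utilization just before the freeze.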
I'm working with:
- Ubuntu 14.04.5 LTS
- GPUs: NVIDIA Titan Xp (the same behavior occurs on another GPU in the same machine)
- CUDA 8.0
- cuDNN 5.1
- TensorFlow 1.3
- Theano 0.8.2
I'm not sure how to approach this problem. Can anyone suggest what might cause it and how to diagnose or fix it?