GPU is lost during execution of either Tensorflow or Theano code

When I train either of two different neural networks, one with Tensorflow and the other with Theano, execution sometimes freezes after a random amount of time (anywhere from a few minutes to a few hours, but mostly a few hours). Running "nvidia-smi" then prints this message:

"Unable to determine the device handle for GPU 0000:02:00.0: GPU is lost. Reboot the system to recover this GPU"

I monitored the GPU over a 13-hour run, and everything seemed stable: [image: GPU monitoring plot]
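Monitoring of the kind described above could be scripted roughly as follows. This is only a sketch, not the setup used in the question: the function names (`build_command`, `parse_rows`, `sample`) and the choice of fields are assumptions for illustration. The field names are ones that `nvidia-smi --help-query-gpu` lists on recent drivers, so check that list against your own driver version.

```python
import csv
import io
import subprocess

# Assumed set of fields to record; verify against `nvidia-smi --help-query-gpu`.
FIELDS = [
    "timestamp",
    "index",
    "temperature.gpu",
    "utilization.gpu",
    "power.draw",
    "pcie.link.gen.current",
    "pcie.link.width.current",
]


def build_command(fields=FIELDS):
    """Build the nvidia-smi command line for one CSV sample."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]


def parse_rows(text, fields=FIELDS):
    """Turn nvidia-smi CSV output into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(fields, (cell.strip() for cell in row)))
        for row in reader
        if row
    ]


def sample(fields=FIELDS, timeout=10):
    """Take one sample from all GPUs.

    Raises subprocess.CalledProcessError if nvidia-smi exits with an error
    and subprocess.TimeoutExpired if it does not return within `timeout`
    seconds, so the caller can record when the GPU stopped responding.
    """
    result = subprocess.run(
        build_command(fields),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return parse_rows(result.stdout, fields)
```

Calling `sample()` in a loop with `time.sleep(...)` and appending the rows to a file gives a record of the last readings before a failure. The timeout matters because a stuck `nvidia-smi` call would otherwise block the logger itself.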

I'm working with:

  • Ubuntu 14.04.5 LTS
  • GPUs are Nvidia Titan Xp (this behavior repeats on another GPU on the same machine)
  • CUDA 8.0
  • CuDNN 5.1
  • Tensorflow 1.3
  • Theano 0.8.2

I'm not sure how to approach this problem. Can anyone suggest what might cause this and how to diagnose or fix it?

Jordanson answered 26/8, 2017 at 4:29 Comment(2)
Did you find a solution/answer? Bautzen
Yup, added an answer, I hope this helps. Jordanson

I posted this question a while ago. After an investigation that took a few weeks, we found the problem (and a solution). I don't remember all the details now, but I'm posting our main conclusion in case someone finds it useful.

The bottom line: our hardware was not strong enough to support high-load GPU-CPU communication. We observed these issues on a rack server with 1 CPU and 4 GPUs; the PCI bus was simply overloaded. Adding a second CPU to the rack server solved the problem.

Jordanson answered 19/2, 2019 at 17:6 Comment(3)
Thank you for the answer! Do you remember how you determined that this was due to an overload on the PCI bus? Bautzen
We tried to characterize when these failures happened in terms of the code we were running. We found they occurred either when we used 3-4 GPUs in parallel or when running code that caused a lot of CPU-GPU traffic. Then we compared our server spec to commonly used specs and saw that they usually have two CPUs, while we had just one. So we bought another one, and the problem was solved. Jordanson
I also remember we looked a lot at the server's system logs and saw many warnings/errors from the PCI bus. Sorry for the lack of details, I didn't document our investigation process. Jordanson
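For anyone repeating the log check mentioned above, here is a minimal sketch of scanning kernel log text for PCI-related messages. The thread does not say which messages were actually seen, so the patterns below are assumptions: they are strings the Linux kernel and the NVIDIA driver are known to emit for PCIe and GPU faults, and `scan_log` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed patterns; the original investigation did not record the exact messages.
PATTERNS = {
    "pcie_bus_error": re.compile(r"PCIe Bus Error", re.IGNORECASE),
    "aer": re.compile(r"\bAER\b"),
    "nvidia_xid": re.compile(r"NVRM: Xid"),
    "gpu_off_bus": re.compile(r"fallen off the bus", re.IGNORECASE),
}


def scan_log(lines):
    """Count pattern matches and collect every line that matched at least once."""
    counts = Counter()
    hits = []
    for line in lines:
        matched = [name for name, pattern in PATTERNS.items() if pattern.search(line)]
        if matched:
            counts.update(matched)
            hits.append(line.rstrip("\n"))
    return counts, hits


# Example use, assuming the kernel log lives at /var/log/kern.log:
#
#     with open("/var/log/kern.log", errors="replace") as log_file:
#         counts, hits = scan_log(log_file)
#     print(counts.most_common())
```

Comparing the timestamps of the matching lines with the times the training jobs froze would show whether the bus errors line up with the failures.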
