GPU utilization mostly 0% during training

(GTX 1080, TensorFlow 1.0.0)

During training, the nvidia-smi output (below) shows GPU utilization at 0% most of the time, even though the GPU memory is in use. Given how long training has already been running, that seems to be the norm. Once in a while utilization spikes to 100% or so, but only for about a second.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
    | 33%   35C    P2    49W / 190W |   7982MiB /  8110MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0      1093    G   /usr/lib/xorg/Xorg                             175MiB |
    |    0      1915    G   compiz                                          90MiB |
    |    0      4383    C   python                                        7712MiB |
    +-----------------------------------------------------------------------------+

The situation is the same as I described in this issue. The problem can be reproduced either with the code from that GitHub repository or by following this simple retraining example from TensorFlow's website and restricting per_process_gpu_memory_fraction (to less than 1.0) in the session, like this:

# Limit this process to 40% of the GPU's memory
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)

Question 1: How can I actually utilize the GPU during training while allocating less than 1.0 of its memory?

Question 2: How can I use the full GPU (without setting the fraction below 1.0) with my graphics card?

Help & hints appreciated!

Pity answered 28/2, 2017 at 7:55

When you create a graph that is bigger than the GPU's memory, TensorFlow falls back to the CPU, so it uses RAM and the CPU instead of the GPU. So just remove the per_process_gpu_memory_fraction option and decrease the batch size. Most probably the example uses a big batch size because it was trained on more than one GPU or on a CPU with more than 32 GB of RAM, which is not your case. It can also be the optimizer algorithm you chose: SGD uses less memory than other algorithms, so try it first. With 8 GB on the GPU you can try a batch size of 16 with SGD; that should work. Then you can increase the batch size or switch to other algorithms like RMSProp.
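
For illustration, here is a minimal sketch of that setup in TensorFlow 1.x. The toy one-layer model, shapes, and variable names are hypothetical stand-ins for your real network; the point is plain SGD, a batch size of 16, and a session created without any gpu_options so TensorFlow can use the whole GPU.

import numpy as np
import tensorflow as tf

BATCH_SIZE = 16  # small enough to fit comfortably in 8 GB, as suggested above

# Hypothetical toy model: only the optimizer/session setup matters here.
images = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])

weights = tf.Variable(tf.truncated_normal([784, 10], stddev=0.1))
biases = tf.Variable(tf.zeros([10]))
logits = tf.matmul(images, weights) + biases

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Plain SGD needs less memory than adaptive optimizers such as RMSProp or Adam.
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# No per_process_gpu_memory_fraction: let TensorFlow take the whole GPU.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch_x = np.random.rand(BATCH_SIZE, 784).astype(np.float32)
    batch_y = np.random.randint(0, 10, size=BATCH_SIZE)
    sess.run(train_op, feed_dict={images: batch_x, labels: batch_y})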

If it is still not working, you are probably doing something else wrong, for example saving a checkpoint on every iteration. Saving a checkpoint is done on the CPU and probably takes much longer than a single iteration on the GPU. That could be why you only see short spikes in GPU usage.
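
If frequent checkpointing is the culprit, one option is to save only every N steps. A rough sketch follows; the interval, checkpoint path, and dummy variable are hypothetical, and your real train_op would run inside the loop.

import tensorflow as tf

SAVE_EVERY = 1000  # hypothetical interval; tune it to your own training speed

# Dummy variable so the sketch is self-contained; in practice tf.train.Saver
# picks up all of your real model's variables automatically.
global_step = tf.Variable(0, trainable=False, name='global_step')
increment_step = tf.assign_add(global_step, 1)

saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(10000):
        sess.run(increment_step)  # your real train_op would run here instead
        if step % SAVE_EVERY == 0:
            # Checkpointing happens on the CPU, so keep it infrequent.
            saver.save(sess, '/tmp/model.ckpt', global_step=global_step)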

Bebop answered 15/3, 2017 at 2:56
From the nvidia-smi result the total GPU memory is 8 GB. Is 8 GB not good enough? Do I need to buy a 32 GB GPU card? My current card has 4 GB and I get OOM during training. Sophi
