GPU utilization 0% during TensorFlow retraining for poets

I am following the instructions for TensorFlow Retraining for Poets. GPU utilization seemed low, so I instrumented the retrain.py script per the instructions in Using GPU. The log verifies that the TF graph is being built on the GPU. I am retraining for a large number of classes and images. Please help me tweak the parameters in TF and the retraining script to utilize the GPU.

I am aware of this question, which suggests that I should decrease the batch size. It is not obvious what constitutes the "batch size" for this script. I have 60 classes and 1MM training images. It starts by making 1MM bottleneck files. That part runs on the CPU and is slow, and I understand that. Then it trains for 4,000 steps, taking 100 images per step. Is this the batch? Will GPU utilization go up if I reduce the number of images per step?

Your help would be really appreciated!

Lialiabilities answered 3/6, 2018 at 19:6 Comment(4)
Also, any pointers on making the training script run faster would be great. Currently my training run takes 2 weeks... (1500k steps) Quinte
I think the link changed to a new tutorial; it doesn't point to TensorFlow Retraining for Poets anymore. Do you have the original link? Shelah
This originally came from codelabs.developers.google.com/codelabs/tensorflow-for-poets Quinte
If you are using a GPU, there will be a maximum batch size due to GPU memory limits. If you need, let's schedule a Zoom meeting. Nahtanha

I usually do the things below.

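  1. Check whether TensorFlow can see the GPU.

    # Deprecated in TensorFlow 2.x; tf.config.list_physical_devices('GPU') is the current equivalent.
    tf.test.is_gpu_available()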
    
  2. Monitor GPU usage.

    watch -n 0.1 nvidia-smi
    
  3. If your CPU usage is low, add a prefetch call after the line that builds the batches.

    train_batches = train.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
    
    train_batches = train_batches.prefetch(1) #  This will prefetch one batch
    
  4. If your GPU usage is still low, try a larger batch size, for example:

    batch_size = 128
    
  5. If your GPU usage is still low, the cause may be:

    • Your graph is too simple to keep the GPU busy.
    • A bug in your code or in a package.
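
As a complement to watching `nvidia-smi` by hand in step 2, here is a minimal sketch that reads the same numbers from Python so they can be logged during a training run. The helper names `parse_gpu_stats` and `read_gpu_stats` are made up for illustration, and the sketch assumes every queried field comes back as a plain integer (some GPUs report `[N/A]` for certain fields, which this does not handle).

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn nvidia-smi CSV output into one dict per GPU (hypothetical helper)."""
    stats = []
    for line in csv_text.strip().splitlines():
        # Assumes all three fields are integers, e.g. "97, 7900, 8192".
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append(
            {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` every few training steps and printing the result gives a rough utilization trace without a second terminal.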
Hazelhazelnut answered 17/6, 2020 at 8:31 Comment(0)

Let's go one by one with your questions:

  1. Batch size is the number of images processed at a time during training, testing, or validation, so the 100 images per step you describe is the training batch. You can find the respective parameters and their default values defined in the script:
  parser.add_argument(
      '--train_batch_size',
      type=int,
      default=100,
      help='How many images to train on at a time.'
  )
  parser.add_argument(
      '--test_batch_size',
      type=int,
      default=-1,
      help="""\
      How many images to test on. This test set is only used once, to evaluate
      the final accuracy of the model after training completes.
      A value of -1 causes the entire test set to be used, which leads to more
      stable results across runs.\
      """
  )
  parser.add_argument(
      '--validation_batch_size',
      type=int,
      default=100,
      help="""\
      How many images to use in an evaluation batch. This validation set is
      used much more often than the test set, and is an early indicator of how
      accurate the model is during training.
      A value of -1 causes the entire validation set to be used, which leads to
      more stable results across training iterations, but may be slower on large
      training sets.\
      """
  )

So if you want to decrease the training batch size, run the script with this parameter, among others:

python -m retrain --train_batch_size=16
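
To see how the flag overrides the default, here is a minimal standalone sketch that reuses only the `--train_batch_size` definition quoted above. The `build_parser` wrapper is made up for illustration and is not how retrain.py itself is organized.

```python
import argparse


def build_parser():
    """Parser with only the training batch size flag (illustrative wrapper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--train_batch_size',
        type=int,
        default=100,
        help='How many images to train on at a time.'
    )
    return parser


# No flag given: the default applies.
default_flags = build_parser().parse_args([])
# Flag given on the command line: the default is overridden.
custom_flags = build_parser().parse_args(['--train_batch_size=16'])

print(default_flags.train_batch_size)
print(custom_flags.train_batch_size)
```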

I also recommend choosing a batch size that is a power of 2 (16, 32, 64, 128, ...). The right value depends on the GPU you are using: the less memory the GPU has, the smaller the batch size you should use. With 8 GB of GPU memory, you can try a batch size of 16.
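
Keep in mind that changing the batch size also changes how much data the run sees if the step count stays fixed. The sketch below does that arithmetic for the numbers in the question (1MM images, 4,000 steps). It assumes each step draws exactly one batch of training images, and the function names are made up for illustration.

```python
import math


def epochs_covered(num_images, steps, batch_size):
    """Passes over the data, assuming one batch of images per step."""
    return steps * batch_size / num_images


def steps_for_epochs(num_images, epochs, batch_size):
    """Steps needed to draw `epochs` passes worth of images."""
    return math.ceil(epochs * num_images / batch_size)


NUM_IMAGES = 1_000_000  # from the question
STEPS = 4_000           # from the question

for batch_size in (16, 100, 128):
    covered = epochs_covered(NUM_IMAGES, STEPS, batch_size)
    needed = steps_for_epochs(NUM_IMAGES, 1, batch_size)
    print(f"batch={batch_size}: {covered:.3f} epochs in {STEPS} steps, "
          f"{needed} steps for one epoch")
```

With a batch of 100, the 4,000 steps draw 400,000 images, which is less than half of the training set; with a batch of 16, the same step count draws only 64,000.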

  2. To discover whether you are using GPUs at all, you can follow the steps in the TensorFlow documentation you mentioned. Just put

    tf.debugging.set_log_device_placement(True)

as the first statement of your script.

Device placement logging causes any tensor allocations or operations to be printed.

Calipash answered 16/6, 2020 at 21:28 Comment(0)
