MirroredStrategy doesn't use GPUs

I wanted to use tf.contrib.distribute.MirroredStrategy() on my multi-GPU system, but it doesn't use the GPUs for training (see the output below). I am running tensorflow-gpu 1.12.

I also tried specifying the GPUs directly in MirroredStrategy, but the same problem appeared.

import tensorflow as tf
from tensorflow.keras import models

# Build and compile the Keras model (input, y_output, lossFunc and
# LEARNING_RATE are defined earlier in the script).
model = models.Model(inputs=input, outputs=y_output)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
model.compile(loss=lossFunc, optimizer=optimizer)

# Wrap the model in an Estimator and distribute training across 2 GPUs.
NUM_GPUS = 2
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.keras.estimator.model_to_estimator(model, config=config)

These are the results I am getting:

INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:1
WARNING:tensorflow:Not all devices in DistributionStrategy are visible to TensorFlow session.

The expected result would obviously be to run the training on a multi-GPU system. Is this a known issue?
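
As a sanity check (a minimal sketch, assuming TF 1.x APIs), the devices TensorFlow can actually see can be listed like this:

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device visible to this TensorFlow build; if the GPUs are missing
# here, the CUDA/driver setup is at fault rather than MirroredStrategy.
print(device_lib.list_local_devices())

# True if TensorFlow can use at least one GPU.
print(tf.test.is_gpu_available())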

Endicott answered 19/2, 2019 at 12:41 Comment(2)
Maybe this will help: tensorflow.org/guide/using_gpu#using_multiple_gpus – Jamesy
Or this: https://mcmap.net/q/1164060/-keras-multi_gpu_model-causes-system-to-crash – Jamesy

I've been facing a similar issue, with MirroredStrategy failing on TensorFlow 1.13.1 with 2x RTX 2080 cards running an Estimator.

The failure seems to be in the NCCL all_reduce method (error message: no OpKernel registered for NCCL AllReduce).

I got it to run by changing from NCCL to hierarchical_copy, which meant using the contrib cross_device_ops methods as follows:

Failed command:

mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"])

Successful command:

mirrored_strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(
        all_reduce_alg="hierarchical_copy"))
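
For completeness, here is a sketch of how that strategy could then be handed to the Estimator setup from the question (same TF 1.x contrib APIs; the tiny Sequential model is only a stand-in for illustration):

import tensorflow as tf

# Stand-in model purely for illustration; any compiled tf.keras model works.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(loss="mse", optimizer=tf.train.AdamOptimizer(1e-3))

# MirroredStrategy with hierarchical_copy all-reduce instead of NCCL.
strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(
        all_reduce_alg="hierarchical_copy"))

# Pass the strategy to the Estimator via RunConfig, as in the question.
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.keras.estimator.model_to_estimator(keras_model=model, config=config)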
Conjugated answered 10/4, 2019 at 9:37 Comment(2)
Oh cool! Gonna check it out if I get the chance and come back to you :) – Endicott
@Endicott so what is the result? – Corkhill

In newer versions of TensorFlow, AllReduceCrossDeviceOps no longer exists. You can use tf.distribute.HierarchicalCopyAllReduce() instead:

mirrored_strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
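
In TF 2.x the model then needs to be built and compiled inside strategy.scope() so its variables are mirrored onto both GPUs; a minimal usage sketch (the toy model and shapes are made up for illustration):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

# Variables created inside the scope are replicated across the listed GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(loss="mse", optimizer="adam")

# model.fit(...) then splits each batch across /gpu:0 and /gpu:1.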
Torchier answered 21/4, 2021 at 17:28 Comment(2)
This works for tensorflow 2.9.1 (tensorflow 2.9.1 cuda112py310he87a039_0 conda-forge) – Phebe
Thanks @SadVaseb. This works with tensorflow==2.11.1 as well. – Monique
