TensorFlow MirroredStrategy and Horovod Distribution Strategy

I am trying to understand the basic differences between TensorFlow's MirroredStrategy and Horovod's distribution strategy.

From the documentation and a source code investigation I found that Horovod (https://github.com/horovod/horovod) uses the Message Passing Interface (MPI) to communicate between multiple nodes. Specifically, it uses MPI's allreduce and allgather operations.
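
For reference, a minimal Horovod training sketch looks like this (TF 1.x-style horovod.tensorflow API; the optimizer wrapper is where the MPI-style allreduce of gradients happens):

    import tensorflow as tf
    import horovod.tensorflow as hvd

    # Initialize Horovod (sets up the MPI/Gloo communication layer)
    hvd.init()

    # Pin each process to a single GPU based on its local rank
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Scale the learning rate by the number of workers
    opt = tf.train.AdamOptimizer(0.001 * hvd.size())

    # DistributedOptimizer averages gradients across all workers
    # with allreduce before they are applied
    opt = hvd.DistributedOptimizer(opt)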

From my observation (I may be wrong), MirroredStrategy also uses an all-reduce algorithm (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).
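
For comparison, a minimal MirroredStrategy sketch (using the newer tf.distribute API rather than the contrib package linked above; variables created under the scope are replicated onto each GPU and gradients are combined with all-reduce):

    import tensorflow as tf

    # One replica per visible GPU; gradients are aggregated
    # across replicas with an all-reduce on every step
    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy")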

Both of them use a data-parallel, synchronous training approach, so I am a bit confused about how they differ. Is the difference only in implementation, or are there other (theoretical) differences?
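
Conceptually, the per-step update both libraries perform is the same; here is a plain-NumPy sketch of the synchronous data-parallel all-reduce step (not either library's actual code):

    import numpy as np

    # Each of N replicas computes a gradient on its own data shard
    local_grads = [np.array([1.0, 2.0]),   # replica 0
                   np.array([3.0, 4.0])]   # replica 1

    # all_reduce: every replica ends up with the same averaged gradient
    avg_grad = sum(local_grads) / len(local_grads)

    # Every replica applies the identical update, so weights stay in sync
    weights = np.zeros(2)
    weights -= 0.1 * avg_grad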

And how does the performance of MirroredStrategy compare to Horovod's?

Lysozyme answered 5/3, 2019 at 17:15 Comment(1)
Take a look at logicalclocks.com/… – Porta

MirroredStrategy has its own all-reduce algorithm, which uses remote procedure calls (gRPC) under the hood.
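
A sketch of how this surfaces in the API (assuming TF 2.x; the multi-worker variant is where the gRPC-based collective implementation comes in):

    import tensorflow as tf

    # Single node, multi-GPU: the all-reduce implementation
    # can be chosen explicitly (NCCL is the usual choice on GPUs)
    strategy = tf.distribute.MirroredStrategy(
        cross_device_ops=tf.distribute.NcclAllReduce())

    # Multi-node: collectives run over TensorFlow's gRPC-based runtime
    multi_worker = tf.distribute.experimental.MultiWorkerMirroredStrategy(
        communication=tf.distribute.experimental.CollectiveCommunication.RING)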

As you mentioned, Horovod uses MPI or Gloo to communicate between multiple processes.
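
The backend is selected at launch time; for example (flags per the horovodrun docs, with train.py standing in for your training script):

    # Run 4 worker processes with the MPI backend
    horovodrun --mpi -np 4 python train.py

    # Or with the Gloo backend, which needs no MPI installation
    horovodrun --gloo -np 4 python train.py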

Baud answered 6/10, 2020 at 0:06 Comment(0)

Regarding performance, one of my colleagues previously ran experiments on 4 Tesla V100 GPUs using the code from here. The results suggested that three settings worked best: replicated with all_reduce_spec=nccl, collective_all_reduce with a properly tuned allreduce_merge_scope (e.g. 32), and horovod. I did not see significant differences among these three.
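
For reference, those setting names match the tf_cnn_benchmarks flags, so the runs presumably looked something like this (a sketch; the exact script and values are my assumptions):

    # Setting 1: replicated variables with NCCL all-reduce
    python tf_cnn_benchmarks.py --num_gpus=4 \
        --variable_update=replicated --all_reduce_spec=nccl

    # Setting 2: collective all-reduce with a tuned merge scope
    python tf_cnn_benchmarks.py --num_gpus=4 \
        --variable_update=collective_all_reduce --allreduce_merge_scope=32

    # Setting 3: Horovod
    horovodrun -np 4 python tf_cnn_benchmarks.py --variable_update=horovod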

Bit answered 6/10, 2020 at 4:15 Comment(0)
