Implications of using MPI with TensorFlow

Asked 18/9, 2017 at 15:8 Answered 15/11, 2017 at 12:50

I come from a sort of HPC background and I am just starting to learn about machine learning in general and TensorFlow in particular. I was initially surprised to find out that distributed TensorFlow is designed to communicate over TCP/IP by default, though it makes sense in hindsight given what Google is and the kind of hardware it most commonly uses.

I am interested in experimenting with running TensorFlow in parallel with MPI on a cluster. From my perspective, this should be advantageous because MPI implementations can use Remote Direct Memory Access (RDMA) between machines that do not share memory, which should give much lower latency.
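A rough way to frame the latency argument is the common latency-bandwidth model, in which sending a message costs a fixed latency plus its size divided by the bandwidth. The sketch below only illustrates that model; the function names and all numbers in it are hypothetical placeholders, not measurements of any real network or of TensorFlow.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time for one message: fixed latency plus size / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def latency_fraction(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Share of the estimated transfer time that is due to latency alone."""
    return latency_s / transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s)


# Hypothetical placeholder values, chosen only to show the shape of the model.
BANDWIDTH = 1.25e9           # bytes per second, assumed equal for both cases
LATENCIES = {
    "higher-latency link": 50e-6,   # seconds (assumption)
    "lower-latency link": 2e-6,     # seconds (assumption)
}
MESSAGE_SIZES = {
    "small control message": 4 * 1024,          # 4 KiB
    "large tensor": 100 * 1024 * 1024,          # 100 MiB
}

for link, latency in LATENCIES.items():
    for label, size in MESSAGE_SIZES.items():
        share = latency_fraction(size, latency, BANDWIDTH)
        print(f"{link}, {label}: latency is {share:.2%} of transfer time")
```

Under this model, latency matters most for small messages; for large tensors the size / bandwidth term dominates, so a lower-latency transport alone changes the total much less.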

So my question is: why doesn't this approach seem to be more common, given the increasing popularity of TensorFlow and machine learning? Isn't latency a bottleneck? Is there something about the typical problems being solved that makes this sort of solution impractical? Are there likely to be any meaningful differences between calling TensorFlow functions in a parallel way and implementing MPI calls inside the TensorFlow library?

Thanks

Klaraklarika answered 18/9, 2017 at 15:8 Comment(2)

CNTK is built on MPI, might be worth looking into. – Principal 18/9, 2017 at 15:52

Horovod from Uber uses MPI, too. – Fusionism 8/5, 2018 at 3:31

It seems TensorFlow already supports MPI, as stated at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/mpi. MPI support for TensorFlow was also discussed at https://arxiv.org/abs/1603.02339.

Generally speaking, keep in mind that MPI is best at sending and receiving messages, but not so great at sending notifications and acting upon events. Last but not least, MPI support for multi-threaded applications (e.g. MPI_THREAD_MULTIPLE) has not always been production-ready across MPI implementations. These are two general statements, and I honestly do not know whether they are relevant for TensorFlow.
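The thread level an MPI library actually grants can be checked at run time. Here is a minimal sketch, assuming the mpi4py binding is installed (it is not mentioned above and is just one convenient way to query this); the helper name `thread_support_name` is made up for illustration.

```python
try:
    # Assumption: mpi4py and an underlying MPI library are installed.
    # Importing MPI initializes the library and requests a thread level.
    from mpi4py import MPI
except ImportError:
    MPI = None


def thread_support_name():
    """Return the name of the granted MPI thread level, or None without mpi4py."""
    if MPI is None:
        return None
    names = {
        MPI.THREAD_SINGLE: "MPI_THREAD_SINGLE",
        MPI.THREAD_FUNNELED: "MPI_THREAD_FUNNELED",
        MPI.THREAD_SERIALIZED: "MPI_THREAD_SERIALIZED",
        MPI.THREAD_MULTIPLE: "MPI_THREAD_MULTIPLE",
    }
    # Query_thread reports the level the library provided, which may be
    # lower than the level that was requested at initialization.
    return names[MPI.Query_thread()]


print("Granted thread level:", thread_support_name())
```

If the result is anything other than MPI_THREAD_MULTIPLE, concurrent MPI calls from several threads are not safe with that installation.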

Mame answered 18/9, 2017 at 15:40 Comment(0)

According to the documentation in the TensorFlow Git repository, TensorFlow actually uses the gRPC library by default, which is based on the HTTP/2 protocol (itself carried over TCP/IP) rather than on raw TCP/IP sockets. This paper should give you some insight. I hope this information is useful.

Ashworth answered 15/11, 2017 at 12:50 Comment(0)
