With Horovod, you basically run N independent instances (so it is a form of between-graph replication), and they communicate only via special Horovod ops (essentially broadcast and allreduce).
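For reference, here is a minimal sketch of that pattern as I understand it (assuming TF2 eager mode and the `horovod.tensorflow` API, launched with something like `horovodrun -np N python train.py`):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # every rank runs this same script as an independent process
print("I am rank", hvd.rank(), "of", hvd.size())

# Broadcast: copy a value from rank 0 to all other ranks
# (used e.g. for the initial model variables).
value = tf.constant(float(hvd.rank()))
value = hvd.broadcast(value, root_rank=0)  # now 0.0 on every rank

# Allreduce: average a value over all ranks
# (used e.g. for the gradients in each training step).
grad = tf.constant(float(hvd.rank()))
grad_avg = hvd.allreduce(grad)  # mean over all ranks
```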
Now let's say that either instance 0, or some other external instance, loads your data (via tf.data.Dataset). How would you distribute the output of iterator.get_next() to each instance? Using Horovod broadcast would be inefficient, as you would copy all the data to all instances.
Having the dataset in every instance, doing all the loading there, and then using shard on the dataset would also be inefficient, as you would load all the data everywhere and then throw away (N-1)/N of it.
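For concreteness, a sketch of that shard() variant (file names are placeholders); the shard happens at the record level, so every rank still reads and parses everything before dropping its share:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

dataset = (
    tf.data.Dataset.list_files("data/*.tfrecord", shuffle=False)  # placeholder files
    .interleave(tf.data.TFRecordDataset)
    # Every rank reads all records, then keeps only every size()-th one:
    .shard(num_shards=hvd.size(), index=hvd.rank())
    .shuffle(10_000)
    .batch(32)
)
```

(Sharding the file list before the interleave would avoid reading everything, but that only helps when the data is spread over many evenly-sized files.)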
So you would not want sharding either, and instead have the dataset loading only in a single (producer / dataset-worker) instance, which then distributes the batches to all the train workers.
I guess the TF MultiDeviceIterator provides some similar functionality (or maybe exactly that), but I'm not sure whether it works together with Horovod, and how you would set it up?
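This is how I understand the MultiDeviceIterator API (a sketch only, untested with Horovod; the device list is a placeholder). Note that it splits batches across the devices of a single process, which is not obviously the same as splitting across independent Horovod ranks:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(100).batch(8)

# One host-side dataset, several consumer devices within this one process.
devices = ["/gpu:0", "/gpu:1"]  # placeholder device list
mdi = tf.data.experimental.MultiDeviceIterator(dataset, devices)

# Each call hands out one *distinct* batch per device (returned as a list).
batch_gpu0, batch_gpu1 = mdi.get_next()
```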
Or maybe you can do the distribution somehow via TF workers (guide)? (Maybe that is also how you would configure the MultiDeviceIterator?)
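As a sketch of what I mean by that (TF1-style graph mode; the cluster addresses and job names are made up, and I am not sure the iterator resource would actually end up shared between the workers' graphs):

```python
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

# Hypothetical cluster: one dedicated dataset/producer job, N train workers.
cluster = tf.train.ClusterSpec({
    "dataset": ["host0:2222"],
    "worker": ["host1:2222", "host2:2222"],
})

# Place the input pipeline on the producer job, so that get_next() executes
# there and only the resulting batch is shipped over gRPC to whichever
# worker evaluates it.
with tf.device("/job:dataset/task:0"):
    dataset = tf.data.Dataset.range(1_000_000).batch(32)
    iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
    next_batch = iterator.get_next()

# On train worker 0 (each worker would use its own task_index):
server = tf.distribute.Server(cluster, job_name="worker", task_index=0)
with tf.compat.v1.Session(server.target) as sess:
    batch = sess.run(next_batch)  # produced remotely on /job:dataset/task:0
```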
If possible, this should be via TensorFlow operations / functions (there are many related functions which might already give me this, but I might not know about them, or might have misunderstood how they work). Or maybe the answer is that TensorFlow does not provide any such functionality yet? (This would still be useful to know. Then I would implement my own solution in C++, wrapped as a TensorFlow op. But before doing so, it would be good to know whether this is really necessary.)
(Related is also this Horovod issue.)
(This question is actually a bit more generic than just Horovod, although Horovod might be a good example. Wouldn't you always have this problem with distributed TensorFlow?)
(I collected an overview of all the distributed TensorFlow terminology and aspects here, mostly for clarification.)
(Maybe also related are these questions: this, this, this, this or this.)
There is the tf.data service for distributed training. I'm not sure if it solves your problem in particular, but it might be worth keeping an eye on it. – Harts
It seems the tf.data service does this (when the workers share the same job_name). Or at least almost, because the data will not be distributed evenly, but on a first-come first-served basis (which is maybe ok). Interestingly, when the job is not shared, that corresponds to another solution I have already implemented: having each worker just use a different random seed for the dataset shuffling. – Isola
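To make that concrete, here is a rough sketch of the tf.data service setup as I understand it (assuming a recent TF 2.x; addresses, ports and the job_name are placeholders):

```python
import tensorflow as tf

# --- On the single producer / dataset-worker machine (one process each): ---
dispatcher = tf.data.experimental.service.DispatchServer(
    tf.data.experimental.service.DispatcherConfig(port=5050))
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(dispatcher_address="localhost:5050"))

# --- On every train rank: register the dataset and consume from the service: ---
dataset = tf.data.Dataset.range(1_000_000).shuffle(10_000).batch(32)
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="distributed_epoch",  # split one epoch across consumers
        service="grpc://localhost:5050",
        job_name="shared_training_job",       # same name on all ranks => shared job
    ))

for batch in dataset.take(2):
    print(batch.shape)
```

With a shared job_name the consumers split the data first-come first-served, as described above; without it, every consumer would get its own full copy of the dataset.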