Maximize TensorFlow multi-GPU performance

I was wondering if anybody could advise on how to get peak performance out of TensorFlow in a 4-GPU setting.

As a test I built the same network twice: an 18-ish layer residual network with small filter banks (ranging from 16 to 128) on 32x32 inputs, batch size 512 (128 per GPU). One version is in MXNet and one I have modelled off of the Inception multi-GPU example; a condensed sketch of that pattern follows.
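In this sketch, build_tower is a trivial stand-in for the real residual net, and the optimizer settings are placeholders:

import tensorflow as tf

NUM_GPUS = 4

def build_tower(images, labels):
    # Trivial stand-in for the ~18-layer residual net (labels unused here);
    # returns a scalar loss.
    w = tf.get_variable('conv_w', [3, 3, 3, 16],
                        initializer=tf.truncated_normal_initializer(stddev=0.01))
    net = tf.nn.conv2d(images, w, strides=[1, 1, 1, 1], padding='SAME')
    return tf.reduce_mean(tf.square(net))

def average_gradients(tower_grads):
    # Average each variable's gradient across the towers.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        grad = tf.reduce_mean(tf.pack(grads), 0)  # tf.pack is tf.stack in later TF
        averaged.append((grad, grads_and_vars[0][1]))
    return averaged

images = [tf.placeholder(tf.float32, [128, 32, 32, 3]) for _ in range(NUM_GPUS)]
labels = [tf.placeholder(tf.int64, [128]) for _ in range(NUM_GPUS)]

opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            loss = build_tower(images[i], labels[i])
            tf.get_variable_scope().reuse_variables()  # share weights across towers
            tower_grads.append(opt.compute_gradients(loss))

# Average the tower gradients and apply them once per step.
train_op = opt.apply_gradients(average_gradients(tower_grads))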

My MXNet network can train at around 7k examples a second, whereas TensorFlow is only capable of 4.2k with dummy data and 3.7k with real data.

(When running on 1 GPU the numbers are 1.2k examples a second vs 2.1k.)

From my experiments, I have a few questions in hopes of speeding things up:

  1. GPU utilization seems quite low during training. I noticed that the TensorFlow white paper mentions support for running multiple streams on the same GPU. Is this possible in the public release?

  2. Is there any way to perform multiple train operations in one execution of session.run(), or to have async execution? This would allow weight updates to happen at the same time as the next batch's forward pass. I have tried using 2 threads (both system threads and QueueRunners, roughly as in the sketch after this list), but this only resulted in a slowdown. MXNet is able to increase speed by running weight updates on the CPU so that the GPUs can be used for the next batch.

  3. Will the new distributed runtime get around some of these issues by letting me run more than one worker on a single machine?

  4. Is there something else that can be done?
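For reference on question 2, the QueueRunner attempt looked roughly like this minimal sketch (read_batch is a placeholder for the real input pipeline):

import tensorflow as tf

def read_batch():
    # Placeholder for the real data loader; returns one (images, labels) batch.
    images = tf.random_uniform([128, 32, 32, 3])
    labels = tf.random_uniform([128], maxval=10, dtype=tf.int64)
    return images, labels

images, labels = read_batch()
queue = tf.FIFOQueue(capacity=8, dtypes=[tf.float32, tf.int64],
                     shapes=[[128, 32, 32, 3], [128]])
enqueue_op = queue.enqueue([images, labels])

# Two threads try to keep the queue full while the towers consume from it.
tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op] * 2))
batch_images, batch_labels = queue.dequeue()

sess = tf.Session()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
# ... training loop consuming batch_images / batch_labels ...
coord.request_stop()
coord.join(threads)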

I know there are a number of similar questions here on Stack Overflow, but through my searching I couldn't find a solution to my problem that I have not already tried.

Edit:

I did a little bit of CUDA profiling to see what the expensive kernels were. According to my run, 21.4% of the time is spent inside:

void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
Eigen::TensorPaddingOp<Eigen::array<std::pair<int, int>,
unsigned long=4> const, Eigen::TensorMap<Eigen::Tensor<float const,
int=4, int=1, long>, int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)

and 20.0% of the time was spent in

void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
Eigen::TensorBroadcastingOp<Eigen::array<int, unsigned long=4>
const, Eigen::TensorMap<Eigen::Tensor<float const, int=4, int=1, long>,
int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)

From the signatures alone I am not exactly sure what these kernels are doing. Do these make sense?

In addition, the analysis reports low kernel concurrency (0%, as expected) and low compute utilization (34.9%, though this includes start-up time and a little bit of Python in the train loop: around 32 seconds out of 91 total, which comes out to around 50% utilization inside TensorFlow itself).
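For anyone reproducing this, recent TensorFlow builds can also emit a per-step Chrome trace as an alternative to nvprof; a minimal sketch, assuming an existing sess and train_op:

import tensorflow as tf
from tensorflow.python.client import timeline

# Trace a single training step and dump it in Chrome's trace-viewer format
# (open chrome://tracing and load the JSON to see per-kernel timings).
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())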

Edit 2:

I have attached a copy of the trimmed-down source code. In general, though, I am more concerned about questions 1-3 and don't want to take up too much of everybody's time.

In addition, I am running TensorFlow built from commit f07234db2f7b316b08f7df25417245274b63342a.

Edit 3:

Updated to the most recent TensorFlow (commit 63409bd23facad471973b110df998782c0e19c06), same code, default data format (NHWC), and that seemed to speed things up a lot. On fake data: 6.7k-6.8k examples a second on 4 GPUs (the variation looks thermal to me) and 2.0k examples a second on 1 GPU. On real data: around 4.9k examples a second on 4 GPUs and 1.7k examples a second on 1 GPU.

Edit 4:

In addition I tried switching the data format to NCHW, modelling the conversion off of Soumith's benchmarks. The convolution parts were indeed faster, but batch norm appears to be messing everything up. With a naive implementation (fixing the axes, and making the weights [1,C,1,1] instead of [C]) I am only able to get 1.2k examples a second on 4 GPUs (fake data), whereas with a transpose before and after the batch norm op I am able to get 6.2k examples a second (fake data). That is still slower than the NHWC data_format. The transpose workaround is sketched below.
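For concreteness, here is a rough sketch of that transpose workaround (a naive version without the scale/offset weights; conv_bn_nchw is just an illustrative name):

import tensorflow as tf

def conv_bn_nchw(x, w, epsilon=1e-3):
    # x is [N, C, H, W]; w is a standard [kH, kW, inC, outC] filter.
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME',
                     data_format='NCHW')
    y = tf.transpose(y, [0, 2, 3, 1])             # NCHW -> NHWC for the norm
    mean, var = tf.nn.moments(y, axes=[0, 1, 2])  # per-channel statistics
    y = (y - mean) / tf.sqrt(var + epsilon)       # naive batch norm
    return tf.transpose(y, [0, 3, 1, 2])          # back to NCHW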

Steeple answered 16/3, 2016 at 22:12 Comment(0)

It's a bit hard to diagnose your program's performance problem without seeing the code. Is it possible for us to read your test code somehow?

TensorPadding showing up at the top of the profile is a bit strange; I'd expect cuDNN calls to be at the top. Anyway, showing us the test code would be helpful.

Palpate answered 18/3, 2016 at 6:8 Comment(3)
I attached a gist of the source. Thank you for the help. Is it safe to assume that the second template argument of TensorMap is the kernel being applied? How do you know that it is TensorPadding and not TensorAssign, for example?Steeple
A few suggestions: 1) Try recloning from HEAD -- there have been several improvements to padding in Eigen since March that should help with speed. 2) Convolutions are currently faster when using the layout best supported by cuDNN: NCHW is currently the best tensor layout. See github.com/soumith/convnet-benchmarks/blob/master/tensorflow/… for an example of how you can specify the data format for convolutions, max pooling, etc.Commensurate
@Commensurate Recloning from HEAD results in a significant improvement in performance. Thanks! As for data_format, I have updated my original post. I am looking into the slowdown (fairly sure it's the reduce over different dimensions).Steeple
