Maximize TensorFlow multi-GPU performance

I was wondering if anybody could advise on how to get peak performance out of TensorFlow in a 4-GPU setting.

As a test I built the same network twice: an 18-ish layer residual network with small filter banks (ranging from 16 to 128) on 32x32 inputs, batch size 512 (128 per GPU). One version is in MXNet and one I have modelled off of the Inception multi-GPU example; a condensed sketch of that pattern follows.
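In this sketch, build_tower is a trivial stand-in for the real residual net, and the optimizer settings are placeholders:

import tensorflow as tf

NUM_GPUS = 4

def build_tower(images, labels):
    # Trivial stand-in for the ~18-layer residual net (labels unused here);
    # returns a scalar loss.
    w = tf.get_variable('conv_w', [3, 3, 3, 16],
                        initializer=tf.truncated_normal_initializer(stddev=0.01))
    net = tf.nn.conv2d(images, w, strides=[1, 1, 1, 1], padding='SAME')
    return tf.reduce_mean(tf.square(net))

def average_gradients(tower_grads):
    # Average each variable's gradient across the towers.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = [g for g, _ in grads_and_vars]
        grad = tf.reduce_mean(tf.pack(grads), 0)  # tf.pack is tf.stack in later TF
        averaged.append((grad, grads_and_vars[0][1]))
    return averaged

images = [tf.placeholder(tf.float32, [128, 32, 32, 3]) for _ in range(NUM_GPUS)]
labels = [tf.placeholder(tf.int64, [128]) for _ in range(NUM_GPUS)]

opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            loss = build_tower(images[i], labels[i])
            tf.get_variable_scope().reuse_variables()  # share weights across towers
            tower_grads.append(opt.compute_gradients(loss))

# Average the tower gradients and apply them once per step.
train_op = opt.apply_gradients(average_gradients(tower_grads))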

My MXNet network can train at around 7k examples a second, whereas TensorFlow is only capable of 4.2k with dummy data and 3.7k with real data.

(When running on 1 GPU the numbers are 1.2k examples a second vs 2.1k.)

From my experiments, I have a few questions in hopes of speeding things up:

  1. GPU utilization seems quite low during training. I noticed that the TensorFlow white paper mentions support for running multiple streams on the same GPU. Is this possible in the public release?

  2. Is there any way to perform multiple train operations in one execution of session.run(), or to have async execution? This would allow weight updates to happen at the same time as the next batch's forward pass. I have tried using 2 threads (both system threads and QueueRunners, roughly as in the sketch after this list), but this only resulted in a slowdown. MXNet is able to increase speed by running weight updates on the CPU so that the GPUs can be used for the next batch.

  3. Will the new distributed runtime get around some of these issues by letting me run more than one worker on a single machine?

  4. Is there something else that can be done?
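For reference on question 2, the QueueRunner attempt looked roughly like this minimal sketch (read_batch is a placeholder for the real input pipeline):

import tensorflow as tf

def read_batch():
    # Placeholder for the real data loader; returns one (images, labels) batch.
    images = tf.random_uniform([128, 32, 32, 3])
    labels = tf.random_uniform([128], maxval=10, dtype=tf.int64)
    return images, labels

images, labels = read_batch()
queue = tf.FIFOQueue(capacity=8, dtypes=[tf.float32, tf.int64],
                     shapes=[[128, 32, 32, 3], [128]])
enqueue_op = queue.enqueue([images, labels])

# Two threads try to keep the queue full while the towers consume from it.
tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op] * 2))
batch_images, batch_labels = queue.dequeue()

sess = tf.Session()
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
# ... training loop consuming batch_images / batch_labels ...
coord.request_stop()
coord.join(threads)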

I know there are a number of similar questions here on Stack Overflow, but through my searching I couldn't find a solution to my problem that I have not already tried.

Edit:

I did a little bit of CUDA profiling to see what the expensive kernels were. According to my run, 21.4% of the time is spent inside:

void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
Eigen::TensorPaddingOp<Eigen::array<std::pair<int, int>,
unsigned long=4> const, Eigen::TensorMap<Eigen::Tensor<float const,
int=4, int=1, long>, int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)

and 20.0% of the time was spent in

void Eigen::internal::EigenMetaKernel_NonVectorizable<Eigen::TensorEvaluator
<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=4, int=1, long>, int=16>,
Eigen::TensorBroadcastingOp<Eigen::array<int, unsigned long=4>
const, Eigen::TensorMap<Eigen::Tensor<float const, int=4, int=1, long>,
int=16> const > const > const, Eigen::GpuDevice>, long>(float, int=4)

From the signatures alone I am not exactly sure what these kernels are doing. Do these make sense?

In addition, the analysis reports low kernel concurrency (0%, as expected) and low compute utilization (34.9%, though this includes start-up time and a little bit of Python in the train loop: around 32 seconds out of 91 total, which comes out to around 50% utilization inside TensorFlow itself).
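For anyone reproducing this, recent TensorFlow builds can also emit a per-step Chrome trace as an alternative to nvprof; a minimal sketch, assuming an existing sess and train_op:

import tensorflow as tf
from tensorflow.python.client import timeline

# Trace a single training step and dump it in Chrome's trace-viewer format
# (open chrome://tracing and load the JSON to see per-kernel timings).
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(train_op, options=run_options, run_metadata=run_metadata)

trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())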

Edit 2:

I have attached a copy of the trimmed-down source code. In general, though, I am more concerned about questions 1-3 and don't want to take up too much of everybody's time.

In addition, I am running TensorFlow built from commit f07234db2f7b316b08f7df25417245274b63342a.

Edit 3:

Updated to the most recent TensorFlow (commit 63409bd23facad471973b110df998782c0e19c06), same code, default data format (NHWC), and that seemed to speed things up a lot. On fake data: 6.7k-6.8k examples a second on 4 GPUs (the variation looks thermal to me) and 2.0k examples a second on 1 GPU. On real data: around 4.9k examples a second on 4 GPUs and 1.7k examples a second on 1 GPU.

Edit 4:

In addition I tried switching the data format to NCHW, modelling the conversion off of Soumith's benchmarks. The convolution parts were indeed faster, but batch norm appears to be messing everything up. With a naive implementation (fixing the axes, and making the weights [1,C,1,1] instead of [C]) I am only able to get 1.2k examples a second on 4 GPUs (fake data), whereas with a transpose before and after the batch norm op I am able to get 6.2k examples a second (fake data). That is still slower than the NHWC data_format. The transpose workaround is sketched below.
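For concreteness, here is a rough sketch of that transpose workaround (a naive version without the scale/offset weights; conv_bn_nchw is just an illustrative name):

import tensorflow as tf

def conv_bn_nchw(x, w, epsilon=1e-3):
    # x is [N, C, H, W]; w is a standard [kH, kW, inC, outC] filter.
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME',
                     data_format='NCHW')
    y = tf.transpose(y, [0, 2, 3, 1])             # NCHW -> NHWC for the norm
    mean, var = tf.nn.moments(y, axes=[0, 1, 2])  # per-channel statistics
    y = (y - mean) / tf.sqrt(var + epsilon)       # naive batch norm
    return tf.transpose(y, [0, 3, 1, 2])          # back to NCHW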

Steeple answered 16/3, 2016 at 22:12 Comment(0)

It's a bit hard to diagnose your program's performance problem without seeing the code. Is it possible for us to read your test code somehow?

TensorPadding showing up at the top of the profile is a bit strange; I'd expect cuDNN calls to be at the top. Anyway, showing us the test code would be helpful.

Palpate answered 18/3, 2016 at 6:8 Comment(3)
I attached a gist of the source. Thank you for the help. Is it safe to assume that the second template argument of TensorMap is the kernel being applied? How do you know that it is TensorPadding and not TensorAssign, for example?Steeple
A few suggestions: 1) Try recloning from HEAD -- there have been several improvements to padding in Eigen since March that should help with speed. 2) Convolutions are currently faster when using the layout best supported by cuDNN: NCHW is currently the best tensor layout. See github.com/soumith/convnet-benchmarks/blob/master/tensorflow/… for an example of how you can specify the data format for convolutions, max pooling, etc.Commensurate
@Commensurate Recloning from HEAD results in a significant improvement in performance. Thanks! As for data_format, I have updated my original post. I am looking into the slowdown (fairly sure it's the reduce over different dimensions).Steeple
