Is using batch size as 'powers of 2' faster on tensorflow?

I read somewhere that if you choose a batch size that is a power of 2, training will be faster. What is this rule? Does it apply to other applications? Can you provide a reference paper?

Strophe answered 11/6, 2017 at 11:17 Comment(1)
If you use GPU computation and TensorFlow uses the batch size as the global work size, then it makes sense.Undersell

Algorithmically speaking, larger mini-batches reduce the variance of your stochastic gradient updates (by averaging the gradients in the mini-batch), which in turn lets you take bigger step sizes, so the optimization algorithm makes progress faster.

However, the amount of work (in terms of the number of gradient computations) needed to reach a given accuracy in the objective will be the same: with a mini-batch size of n, the variance of the update direction is reduced by a factor of n, so in theory you can take step sizes that are n times larger, and a single step will take you roughly to the same accuracy as n steps of SGD with a mini-batch size of 1.
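
To make the variance argument concrete, here is a minimal NumPy sketch (synthetic per-example gradients, purely illustrative) showing that averaging n per-example gradients shrinks the variance of the update by roughly a factor of n:

    import numpy as np

    rng = np.random.default_rng(0)

    def update_variance(batch_size, trials=10_000, noise_std=1.0):
        # Simulate per-example gradients as true_gradient + Gaussian noise
        # (a stand-in for real gradients, just to illustrate the scaling).
        true_gradient = 1.0
        per_example = true_gradient + rng.normal(0.0, noise_std, size=(trials, batch_size))
        batch_gradients = per_example.mean(axis=1)  # average over the mini-batch
        return batch_gradients.var()

    for n in (1, 10, 100):
        print(n, update_variance(n))  # variance shrinks roughly like noise_std**2 / n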

As for TensorFlow, I found no evidence for your claim, and it is a question that has been closed on GitHub: https://github.com/tensorflow/tensorflow/issues/4132

Note that resizing images to powers of two makes sense (because pooling is generally done in 2x2 windows), but that's a different thing altogether.
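
As a quick illustration of the image-size point (illustrative only), repeated 2x2 pooling halves each spatial dimension, so a power-of-two input size keeps dividing evenly:

    import tensorflow as tf

    x = tf.random.normal([1, 64, 64, 3])          # batch, height, width, channels
    pool = tf.keras.layers.MaxPooling2D(pool_size=2)

    for _ in range(4):
        x = pool(x)
        print(x.shape)  # (1, 32, 32, 3), (1, 16, 16, 3), (1, 8, 8, 3), (1, 4, 4, 3)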

Emirate answered 11/6, 2017 at 12:43 Comment(3)
what do you mean by "by taking the average of the gradients in the mini-batch" when using larger mini-batch sizes?Strophe
not the best wording indeed. Each element in your minibatch gives you a gradient, and you average them.Emirate
Basically, if you have a mini-batch size of 10, then the gradients are averaged over those ten examples and the weights are updated in a single shot.Erminois

The notion comes from aligning computations (C) onto the physical processors (PP) of the GPU.

Since the number of PP is often a power of 2, using a number of C different from a power of 2 leads to poor performance.

You can see the mapping of the C onto the PP as a pile of slices, each slice the size of the number of PP. Say you have 16 PP. You can map 16 C onto them: 1 C is mapped onto 1 PP. You can map 32 C onto them: 2 slices of 16 C, and each PP is responsible for 2 C.

This is due to the SIMD paradigm used by GPUs. This is often called data parallelism: all the PP do the same thing at the same time, but on different data.
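
A toy calculation (pure Python; the 16-PP figure from the example above is just an assumption) of how many slices a given batch needs and what fraction of PP slots actually does useful work:

    import math

    def pp_utilization(batch_size, num_pp=16):
        # Number of 'slices' needed, and the fraction of PP slots doing useful work.
        slices = math.ceil(batch_size / num_pp)
        used = batch_size / (slices * num_pp)
        return slices, used

    for n in (16, 17, 32, 33):
        slices, used = pp_utilization(n)
        print(f"batch={n}: {slices} slice(s), {used:.0%} of PP slots used")

With 16 PP, a batch of 17 needs a second slice that is almost entirely idle, while 16 or 32 fill every slot.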

Phototypy answered 1/2, 2019 at 18:47 Comment(0)

I've heard this, too. Here's a white paper about training on CIFAR-10 where some Intel researchers make the claim:

In general, the performance of processors is better if the batch size is a power of 2.

(See: https://software.intel.com/en-us/articles/cifar-10-classification-using-intel-optimization-for-tensorflow.)

However, it's unclear how big the advantage is, because the authors don't provide any training-duration data :/
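
If you want to measure it yourself, a rough Keras micro-benchmark (illustrative only; the model, data, and batch sizes here are arbitrary choices, not from the paper) could time one epoch at batch sizes just below, at, and just above a power of 2:

    import time
    import numpy as np
    import tensorflow as tf

    # Synthetic data and a small model: just enough to compare epoch times.
    x = np.random.rand(4096, 32, 32, 3).astype("float32")
    y = np.random.randint(0, 10, size=(4096,))

    def make_model():
        return tf.keras.Sequential([
            tf.keras.Input(shape=(32, 32, 3)),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(10),
        ])

    for batch_size in (63, 64, 65, 127, 128, 129):
        model = make_model()
        model.compile(optimizer="adam",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
        model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)  # warm-up
        start = time.perf_counter()
        model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
        print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f} s/epoch")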

Goatherd answered 28/1, 2018 at 22:49 Comment(0)
