How to calculate optimal batch size?

Sometimes I run into a problem:

OOM when allocating tensor with shape

e.g.

OOM when allocating tensor with shape (1024, 100, 160)

Where 1024 is my batch size and I don't know what the rest is. If I reduce the batch size or the number of neurons in the model, it runs fine.

Is there a generic way to calculate optimal batch size based on model and GPU memory, so the program doesn't crash?

In short: I want the largest batch size possible in terms of my model, which will fit into my GPU memory and won't crash the program.

Bigener answered 9/10, 2017 at 20:25 Comment(1)
Honestly, from what you've posted, just try with 512. If that doesn't work, halve it again. You're limited to powers of 2, so keep reducing until it works. It isn't so much an 'optimal' batch size as it is 'what fits in memory'.Warnerwarning

You can estimate the largest batch size using:

Max batch size = available GPU memory bytes / 4 / (size of tensors + trainable parameters)
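As a minimal sketch of that estimate in code (my own helper, not an exact accounting: it assumes 32-bit values, i.e. 4 bytes each, and you supply the element counts yourself; as the comments below note, it is only a loose upper bound):

def estimate_max_batch_size(available_gpu_bytes, tensor_elements, trainable_params):
    # Literal reading of the formula above: 4 bytes per float32 value.
    # tensor_elements  = total number of elements in the per-sample tensors
    # trainable_params = e.g. model.count_params() in Keras
    return available_gpu_bytes / 4 / (tensor_elements + trainable_params)

# e.g. a hypothetical 6 GB card, 200x200 RGB input, 62M parameters:
print(int(estimate_max_batch_size(6e9, 3 * 200 * 200, 62_000_000)))  # -> 24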

Aires answered 9/10, 2017 at 23:33 Comment(12)
How do I get the size of tensors and the number trainable parameters? Aren't you missing the model size in the equation?Bigener
@gisek the model size is actually the no. of trainable parameters, which in Keras you get with model.summary()Flatboat
@Flatboat I'm not sure if you're right. If I create a large network and feed it with batch_size=1, I also get the same error.Bigener
Of course - it can certainly happen that the combination of your model size (trainable parameters) and input data size exhaust your memory even with batch_size = 1, especially if you have a small GPU...Flatboat
@Flatboat hehe, I didn't get that "no" stands for "number". Now it makes sense :)Bigener
What is "size of tensors"? I am still confused about that part.Homeric
@Homeric Each layer has its own tensor + one or more weight matrices (usually referred to as trainable parameters). For example: if you're feeding your network 200x200 RGB images, then the size of your input tensor (in bytes) is [batch size] * 3 * 200 * 200 * 4 (4 bytes per value for 32-bit floats, 8 for 64-bit)Aires
@Aires Theoretically your formula makes sense. Have you ever tested it empirically? I am observing the following: for AlexNet with 62 million parameters, an image size of 224x224x3, and a 6 GB graphics card, I should be able to fit (6 GB - (62 million * 4 bytes)) / (224 * 224 * 3 * 4 bytes) = 9553 as max_batch_size. In practice I am not able to run training with more than batch_size = 512; with 1024 it already crashes. Second example: ResNet-50 has only 25 million parameters, so I should get an even higher max_batch_size. In practice training crashes with batch_size=128. Please advise.Vuong
@Vuong You should take into account all the tensors, not just the inputAires
@Aires Could you please give an example what tensors you mean? I thought with all the trainable parameters I do take that into consideration? Please correct me if I am wrong.Vuong
@Vuong For each layer your model has to store an input placeholder, one or more weight matrices (trainable or otherwise) and an output placeholder (which may also be the next layer's input).Aires
Is it possible to include a reference to the paper this formula comes from?Ernaernald

From the recent Deep Learning book by Goodfellow et al., chapter 8:

Minibatch sizes are generally driven by the following factors:

  • Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
  • Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
  • If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
  • Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
  • Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

Which in practice usually means "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory".

You might also want to consult several good posts on the subject here on Stack Exchange.

Just keep in mind that the paper by Keskar et al., 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has received some objections from other respected researchers in the deep learning community.

UPDATE (Dec 2017):

There is a new paper by Yoshua Bengio & team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading in the sense that it reports new theoretical & experimental results on the interplay between learning rate and batch size.

UPDATE (Mar 2021):

Of interest here is also another paper from 2018, Revisiting Small Batch Training for Deep Neural Networks (h/t to Nicolas Gervais), which runs contrary to the "larger is better" advice; quoting from the abstract:

The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.

Flatboat answered 9/10, 2017 at 22:27 Comment(10)
It doesn't really answer my question. I want the largest batch size possible in terms of my model, which will fit into my GPU memory.Bigener
Understood. In practice, especially if you use a GPU, the powers of 2 requirement is so limiting that, even if you get an 'optimal' size of, say, 800, you never use it; what you do is start with an n (power of 2) and, if you get an OOM, try with n/2, then with n/4 etc (if not, you try 2*n) - see 4th bullet aboveFlatboat
Going down with the size if an error occurs is a big nuisance when you're experimenting with hyperparameters and topologies. A generic formula would be great, even if the result were rounded to a power of 2.Bigener
I don't see how your excerpts led you to the conclusion that larger is better. Maybe you could pinpoint the exact source that made you conclude this?Neolamarckism
@NicolasGervais what about the very first bullet, "Larger batches provide a more accurate estimate of the gradient"??Flatboat
That might not be as meaningful as you seem to think. Especially in light of evidence that is more recent than any of your sources, which strongly argues against batch size over 32.Neolamarckism
@NicolasGervais That's another matter (answer hasn't been updated since 2017), and not what you asked in the first place. Based on what has been quoted here, I cannot see any inconsistency, as you seem to imply.Flatboat
@NicolasGervais that paper on small batch sizes has a lot of weaknesses. Besides the fact that it is not published in any peer reviewed venue, it does not cover much recent work on learning rate schedules. In particular it does not reference any of the work by Leslie N. Smith on one-shot training schedules with very high learning rates, the Super-Convergence paper in particular. Tuning the learning rate is essential to training performance, but the authors have punted in favor of a naive linear scaling as batch size increases.Bromide
Don't get me wrong, it's an interesting theoretical tack to take. But it seems like a very narrow view to take in practice.Bromide
On a practical side, I'm [re]training a shallow DNN on a machine with a single GPU. If the batch size is 2048, it takes ~20 min per epoch (~12 epochs to converge). If I set the batch size to 32, the estimated time to converge is 188 hours. On a CPU it's similarly unrealistic time-wise.Tool

Use the summaries provided by torchsummary (pip install torchsummary) or Keras (the built-in model.summary()).

E.g.

from torchsummary import summary
summary(model, input_size=input_shape)  # torchsummary needs the per-sample input shape, e.g. (channels, H, W)
.....
.....
================================================================
Total params: 1,127,495
Trainable params: 1,127,495
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 13.93
Params size (MB): 4.30
Estimated Total Size (MB): 18.25
----------------------------------------------------------------

Each instance you put in the batch requires its own forward/backward pass in memory; the model itself you only need once. People seem to prefer batch sizes that are powers of two, probably because of automatic layout optimization on the GPU.

Don't forget to linearly increase your learning rate when increasing the batch size.
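A one-line sketch of that rule (the base values here are placeholders for whatever you tuned originally):

base_lr, base_batch_size = 0.1, 256               # learning rate tuned at a reference batch size
new_batch_size = 1024
lr = base_lr * new_batch_size / base_batch_size   # linear scaling -> 0.4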

Let's assume we have a Tesla P100 at hand with 16 GB memory.

(16000 - model_size) / (forward_back_ward_size)
(16000 - 4.3) / 13.93 = 1148.29
rounded to powers of 2 results in batch size 1024
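A small sketch of that calculation, using the numbers torchsummary printed above (the helper name and the rounding to a power of 2 are my own additions; it treats the forward/backward size as a per-sample figure, as this answer does):

import math

def estimate_batch_size(gpu_mem_mb, params_mb, fwd_bwd_mb_per_sample):
    # How many forward/backward passes fit next to the model weights,
    # rounded down to the nearest power of 2
    raw = (gpu_mem_mb - params_mb) / fwd_bwd_mb_per_sample
    return 2 ** int(math.log2(raw))

print(estimate_batch_size(16000, 4.30, 13.93))   # -> 1024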
Centrifuge answered 26/1, 2020 at 23:13 Comment(1)
summary() missing 1 required positional argument: 'input_size'Martymartyn

Here is a heuristic function to pick a batch size for training a model:

def FindBatchSize(model):
    """model: Keras model architecture that is yet to be trained. Returns a heuristic batch size."""
    import os
    import gc
    import psutil
    from keras import backend as K

    BatchFound = 16

    try:
        total_params = int(model.count_params())
        GCPU = "CPU"
        # Find out whether a GPU is available
        try:
            GCPU = "CPU" if K.tensorflow_backend._get_available_gpus() == [] else "GPU"
        except Exception:
            # Older setups without K.tensorflow_backend: query the TF device list directly
            from tensorflow.python.client import device_lib

            def get_available_gpus():
                local_device_protos = device_lib.list_local_devices()
                return [x.name for x in local_device_protos if x.device_type == 'GPU']

            GCPU = "GPU" if get_available_gpus() else "CPU"

        # Decide the batch size based on GPU availability and model complexity
        if GCPU == "GPU" and os.cpu_count() > 15 and total_params < 1000000:
            BatchFound = 64
        if os.cpu_count() < 16 and total_params < 500000:
            BatchFound = 64
        if GCPU == "GPU" and os.cpu_count() > 15 and 1000000 <= total_params < 2000000:
            BatchFound = 32
        if GCPU == "GPU" and os.cpu_count() > 15 and 2000000 <= total_params < 10000000:
            BatchFound = 16
        if GCPU == "GPU" and os.cpu_count() > 15 and total_params >= 10000000:
            BatchFound = 8
        if os.cpu_count() < 16 and total_params > 5000000:
            BatchFound = 8
        if total_params > 100000000:
            BatchFound = 1
    except Exception:
        pass

    try:
        # Shrink the batch further if system memory is already under pressure
        memoryused = psutil.virtual_memory().percent
        if memoryused > 75.0:
            BatchFound = 8
        if memoryused > 85.0:
            BatchFound = 4
        if memoryused > 90.0:
            BatchFound = 2
        if total_params > 100000000:
            BatchFound = 1
        print("Batch Size:  " + str(BatchFound))
        gc.collect()
    except Exception:
        pass

    return BatchFound
Gainless answered 4/4, 2019 at 7:2 Comment(1)
Can you please explain the code and why the if conditions point to a specific batch size? Does your code deal with the memory size of each sample?Unqualified

I ran into a similar GPU memory error, which was solved by configuring the TensorFlow session with the following:

# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

see: google colaboratory `ResourceExhaustedError` with GPU
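If you are on TF 2.x (where, as a comment below notes, ConfigProto has moved to tensorflow.compat.v1), the equivalent is the per-device memory-growth setting; a sketch, assuming TF 2.1 or later:

import tensorflow as tf

# Allocate GPU memory incrementally instead of grabbing it all up front
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)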

Radix answered 31/1, 2018 at 2:14 Comment(3)
Unfortunately, it changes nothing for a large network :(Bigener
Yes. In my case colaboratory launches with 12GB but with the option enabled it can grow to 52GBRadix
in tf2.0, you should from tensorflow.compat.v1 import ConfigProto firstPrimitive

Finding the maximum batch size is a cumbersome and often time-consuming process, and the other answers here only propose approximate estimates. I came up with a method that finds the maximum batch size my GPU can handle iteratively: keep decreasing the batch size until a run completes without running out of memory; the last size that works is the maximum. If you want to use this method and you have a large dataset whose preprocessing takes a while, I recommend working on a small subset of the dataset and skipping the preprocessing steps. A comprehensive explanation of this process is available on the page "Calculate the maximum batch size".
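A minimal sketch of that idea in PyTorch (my own illustration, not the code from the linked page; the function name, the starting size, and the dummy input shape are placeholders):

import torch

def find_max_batch_size(model, input_shape, start=1024, device="cuda"):
    """Halve the batch size until one forward/backward pass fits in GPU memory."""
    model = model.to(device)
    batch_size = start
    while batch_size >= 1:
        try:
            x = torch.randn(batch_size, *input_shape, device=device)
            model(x).sum().backward()     # one forward/backward pass at this size
            return batch_size             # it fit: this is the maximum
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise                     # a genuine error, not an OOM
            batch_size //= 2              # OOM: try half the size
            x = None                      # drop tensors from the failed attempt
            torch.cuda.empty_cache()
    raise RuntimeError("Even batch_size=1 does not fit on this GPU")

# e.g. for a hypothetical image model taking 3x224x224 inputs:
# max_bs = find_max_batch_size(my_model, (3, 224, 224))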

Studio answered 9/1, 2024 at 11:3 Comment(0)

Following up on @Ario's answer:

Use from torchinfo import summary instead of from torchsummary import summary. The torchsummary package has the following bug (at least for sequential data processing): the number of trainable parameters reported by summary differs from sum(p.numel() for p in model.parameters() if p.requires_grad).
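A quick usage sketch (the input shape is just an example; install with pip install torchinfo):

from torchinfo import summary

# torchinfo reports per-layer output shapes plus input/params/forward-backward sizes,
# which plug directly into the memory estimates above
summary(model, input_size=(1, 3, 224, 224))   # (batch, channels, height, width) -- example shape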

Nuncle answered 21/6, 2024 at 8:56 Comment(0)
