How to calculate optimal batch size?

Sometimes I run into a problem:

OOM when allocating tensor with shape

e.g.

OOM when allocating tensor with shape (1024, 100, 160)

Where 1024 is my batch size and I don't know what the rest is. If I reduce the batch size or the number of neurons in the model, it runs fine.

Is there a generic way to calculate optimal batch size based on model and GPU memory, so the program doesn't crash?

In short: I want the largest batch size possible in terms of my model, which will fit into my GPU memory and won't crash the program.

Bigener answered 9/10, 2017 at 20:25 Comment(1)
Honestly, from what you've posted, just try with 512. If that doesn't work, halve it again. You're limited to powers of 2, so keep reducing until it works. It isn't so much an 'optimal' batch size as it is 'what fits in memory'.Warnerwarning

You can estimate the largest batch size using:

Max batch size = available GPU memory bytes / 4 / (size of tensors + trainable parameters)
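As a minimal sketch of that estimate in code (my own helper, not an exact accounting: it assumes 32-bit values, i.e. 4 bytes each, and you supply the element counts yourself; as the comments below note, it is only a loose upper bound):

def estimate_max_batch_size(available_gpu_bytes, tensor_elements, trainable_params):
    # Literal reading of the formula above: 4 bytes per float32 value.
    # tensor_elements  = total number of elements in the per-sample tensors
    # trainable_params = e.g. model.count_params() in Keras
    return available_gpu_bytes / 4 / (tensor_elements + trainable_params)

# e.g. a hypothetical 6 GB card, 200x200 RGB input, 62M parameters:
print(int(estimate_max_batch_size(6e9, 3 * 200 * 200, 62_000_000)))  # -> 24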

Aires answered 9/10, 2017 at 23:33 Comment(12)
How do I get the size of tensors and the number trainable parameters? Aren't you missing the model size in the equation?Bigener
@gisek the model size is actually the no. of trainable parameters, which in Keras you get with model.summary()Flatboat
@Flatboat I'm not sure if you're right. If I create a large network and feed it with batch_size=1, I also get the same error.Bigener
Of course - it can certainly happen that the combination of your model size (trainable parameters) and input data size exhaust your memory even with batch_size = 1, especially if you have a small GPU...Flatboat
@Flatboat hehe, I didn't get that "no" stands for "number". Now it makes sense :)Bigener
What is "size of tensors"? I am still confused about that part.Homeric
@Homeric Each layer has its own tensor + one or more weight matrices (usually referred to as trainable parameters). For example: if you're feeding your network 200x200 RGB images, then the size of your input tensor (in bytes) is [batch size] * 3 * 200 * 200 * 4 (4 bytes per value for 32-bit floats, 8 for 64-bit)Aires
@Aires Theoretically your formula makes sense. Have you ever tested it empirically? I am observing the following: for AlexNet with 62 million parameters, an image size of 224x224x3, and a 6 GB graphics card, I should be able to fit (6 GB - (62 million * 4 bytes)) / (224 * 224 * 3 * 4 bytes) = 9553 as max_batch_size. In practice I am not able to run training with more than batch_size = 512; with 1024 it already crashes. Second example: ResNet-50 has only 25 million parameters, so I should get an even higher max_batch_size. In practice training crashes with batch_size=128. Please advise.Vuong
@Vuong You should take into account all the tensors, not just the inputAires
@Aires Could you please give an example what tensors you mean? I thought with all the trainable parameters I do take that into consideration? Please correct me if I am wrong.Vuong
@Vuong For each layer your model has to store an input placeholder, one or more weight matrices (trainable or otherwise) and an output placeholder (which may also be the next layer's input).Aires
Is it possible to include a reference to the paper this formula comes from?Ernaernald

From the recent Deep Learning book by Goodfellow et al., chapter 8:

Minibatch sizes are generally driven by the following factors:

  • Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
  • Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
  • If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
  • Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
  • Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

Which in practice usually means "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory".

You might also want to consult several good posts on the subject here on Stack Exchange.

Just keep in mind that the paper by Keskar et al., 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has received some objections from other respected researchers in the deep learning community.

UPDATE (Dec 2017):

There is a new paper by Yoshua Bengio & team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading in the sense that it reports new theoretical & experimental results on the interplay between learning rate and batch size.

UPDATE (Mar 2021):

Of interest here is also another paper from 2018, Revisiting Small Batch Training for Deep Neural Networks (h/t to Nicolas Gervais), which runs contrary to the "larger is better" advice; quoting from the abstract:

The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.

Flatboat answered 9/10, 2017 at 22:27 Comment(10)
It doesn't really answer my question. I want the largest batch size possible in terms of my model, which will fit into my GPU memory.Bigener
Understood. In practice, especially if you use a GPU, the powers of 2 requirement is so limiting that, even if you get an 'optimal' size of, say, 800, you never use it; what you do is start with an n (power of 2) and, if you get an OOM, try with n/2, then with n/4 etc (if not, you try 2*n) - see 4th bullet aboveFlatboat
Going down with the size if an error occurs is a big nuisance when you're experimenting with hyperparameters and topologies. A generic formula would be great, even if the result were rounded to a power of 2.Bigener
I don't see how your excerpts led you to the conclusion that larger is better. Maybe you could pinpoint the exact source that made you conclude this?Neolamarckism
@NicolasGervais what about the very first bullet, "Larger batches provide a more accurate estimate of the gradient"??Flatboat
That might not be as meaningful as you seem to think. Especially in light of evidence that is more recent than any of your sources, which strongly argues against batch size over 32.Neolamarckism
@NicolasGervais That's another matter (answer hasn't been updated since 2017), and not what you asked in the first place. Based on what has been quoted here, I cannot see any inconsistency, as you seem to imply.Flatboat
@NicolasGervais that paper on small batch sizes has a lot of weaknesses. Besides the fact that it is not published in any peer reviewed venue, it does not cover much recent work on learning rate schedules. In particular it does not reference any of the work by Leslie N. Smith on one-shot training schedules with very high learning rates, the Super-Convergence paper in particular. Tuning the learning rate is essential to training performance, but the authors have punted in favor of a naive linear scaling as batch size increases.Bromide
Don't get me wrong, it's an interesting theoretical tack to take. But it seems like a very narrow view to take in practice.Bromide
On a practical side, I'm [re]training a shallow DNN on a machine with a single GPU. If the batch size is 2048, it takes ~20 min per epoch (~12 epochs to converge). If I set the batch size to 32, the estimated time to converge is 188 hours. On a CPU it's similarly unrealistic time-wise.Tool

Use the summaries provided by torchsummary (pip install torchsummary) or Keras (the built-in model.summary()).

E.g.

from torchsummary import summary
summary(model, input_size=input_shape)  # torchsummary needs the per-sample input shape, e.g. (channels, H, W)
.....
.....
================================================================
Total params: 1,127,495
Trainable params: 1,127,495
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 13.93
Params size (MB): 4.30
Estimated Total Size (MB): 18.25
----------------------------------------------------------------

Each instance you put in the batch requires its own forward/backward pass in memory; the model itself you only need once. People seem to prefer batch sizes that are powers of two, probably because of automatic layout optimization on the GPU.

Don't forget to linearly increase your learning rate when increasing the batch size.
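A one-line sketch of that rule (the base values here are placeholders for whatever you tuned originally):

base_lr, base_batch_size = 0.1, 256               # learning rate tuned at a reference batch size
new_batch_size = 1024
lr = base_lr * new_batch_size / base_batch_size   # linear scaling -> 0.4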

Let's assume we have a Tesla P100 at hand with 16 GB memory.

(16000 - model_size) / (forward_back_ward_size)
(16000 - 4.3) / 13.93 = 1148.29
rounded to powers of 2 results in batch size 1024
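A small sketch of that calculation, using the numbers torchsummary printed above (the helper name and the rounding to a power of 2 are my own additions; it treats the forward/backward size as a per-sample figure, as this answer does):

import math

def estimate_batch_size(gpu_mem_mb, params_mb, fwd_bwd_mb_per_sample):
    # How many forward/backward passes fit next to the model weights,
    # rounded down to the nearest power of 2
    raw = (gpu_mem_mb - params_mb) / fwd_bwd_mb_per_sample
    return 2 ** int(math.log2(raw))

print(estimate_batch_size(16000, 4.30, 13.93))   # -> 1024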
Centrifuge answered 26/1, 2020 at 23:13 Comment(1)
summary() missing 1 required positional argument: 'input_size'Martymartyn

Here is a heuristic function to pick a batch size for training a model:

def FindBatchSize(model):
    """model: Keras model architecture that is yet to be trained. Returns a heuristic batch size."""
    import os
    import gc
    import psutil
    from keras import backend as K

    BatchFound = 16

    try:
        total_params = int(model.count_params())
        GCPU = "CPU"
        # Find out whether a GPU is available
        try:
            GCPU = "CPU" if K.tensorflow_backend._get_available_gpus() == [] else "GPU"
        except Exception:
            # Older setups without K.tensorflow_backend: query the TF device list directly
            from tensorflow.python.client import device_lib

            def get_available_gpus():
                local_device_protos = device_lib.list_local_devices()
                return [x.name for x in local_device_protos if x.device_type == 'GPU']

            GCPU = "GPU" if get_available_gpus() else "CPU"

        # Decide the batch size based on GPU availability and model complexity
        if GCPU == "GPU" and os.cpu_count() > 15 and total_params < 1000000:
            BatchFound = 64
        if os.cpu_count() < 16 and total_params < 500000:
            BatchFound = 64
        if GCPU == "GPU" and os.cpu_count() > 15 and 1000000 <= total_params < 2000000:
            BatchFound = 32
        if GCPU == "GPU" and os.cpu_count() > 15 and 2000000 <= total_params < 10000000:
            BatchFound = 16
        if GCPU == "GPU" and os.cpu_count() > 15 and total_params >= 10000000:
            BatchFound = 8
        if os.cpu_count() < 16 and total_params > 5000000:
            BatchFound = 8
        if total_params > 100000000:
            BatchFound = 1
    except Exception:
        pass

    try:
        # Shrink the batch further if system memory is already under pressure
        memoryused = psutil.virtual_memory().percent
        if memoryused > 75.0:
            BatchFound = 8
        if memoryused > 85.0:
            BatchFound = 4
        if memoryused > 90.0:
            BatchFound = 2
        if total_params > 100000000:
            BatchFound = 1
        print("Batch Size:  " + str(BatchFound))
        gc.collect()
    except Exception:
        pass

    return BatchFound
Gainless answered 4/4, 2019 at 7:2 Comment(1)
Can you please explain the code and why the if conditions point to a specific batch size? Does your code deal with the memory size of each sample?Unqualified

I ran into a similar GPU memory error, which was solved by configuring the TensorFlow session with the following:

# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

see: google colaboratory `ResourceExhaustedError` with GPU
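If you are on TF 2.x (where, as a comment below notes, ConfigProto has moved to tensorflow.compat.v1), the equivalent is the per-device memory-growth setting; a sketch, assuming TF 2.1 or later:

import tensorflow as tf

# Allocate GPU memory incrementally instead of grabbing it all up front
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)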

Radix answered 31/1, 2018 at 2:14 Comment(3)
Unfortunately, it changes nothing for a large network :(Bigener
Yes. In my case colaboratory launches with 12GB but with the option enabled it can grow to 52GBRadix
in tf2.0, you should from tensorflow.compat.v1 import ConfigProto firstPrimitive

Finding the maximum batch size is a cumbersome and often time-consuming process, and the other answers here only propose approximate estimates. I came up with a method that finds the maximum batch size my GPU can handle iteratively: keep decreasing the batch size until a run completes without running out of memory; the last size that works is the maximum. If you want to use this method and you have a large dataset whose preprocessing takes a while, I recommend working on a small subset of the dataset and skipping the preprocessing steps. A comprehensive explanation of this process is available on the page "Calculate the maximum batch size".
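A minimal sketch of that idea in PyTorch (my own illustration, not the code from the linked page; the function name, the starting size, and the dummy input shape are placeholders):

import torch

def find_max_batch_size(model, input_shape, start=1024, device="cuda"):
    """Halve the batch size until one forward/backward pass fits in GPU memory."""
    model = model.to(device)
    batch_size = start
    while batch_size >= 1:
        try:
            x = torch.randn(batch_size, *input_shape, device=device)
            model(x).sum().backward()     # one forward/backward pass at this size
            return batch_size             # it fit: this is the maximum
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise                     # a genuine error, not an OOM
            batch_size //= 2              # OOM: try half the size
            x = None                      # drop tensors from the failed attempt
            torch.cuda.empty_cache()
    raise RuntimeError("Even batch_size=1 does not fit on this GPU")

# e.g. for a hypothetical image model taking 3x224x224 inputs:
# max_bs = find_max_batch_size(my_model, (3, 224, 224))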

Studio answered 9/1, 2024 at 11:3 Comment(0)

Following up on @Ario's answer:

Use from torchinfo import summary instead of from torchsummary import summary. The torchsummary package has the following bug (at least for sequential data processing): the number of trainable parameters reported by summary differs from sum(p.numel() for p in model.parameters() if p.requires_grad).
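A quick usage sketch (the input shape is just an example; install with pip install torchinfo):

from torchinfo import summary

# torchinfo reports per-layer output shapes plus input/params/forward-backward sizes,
# which plug directly into the memory estimates above
summary(model, input_size=(1, 3, 224, 224))   # (batch, channels, height, width) -- example shape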

Nuncle answered 21/6, 2024 at 8:56 Comment(0)
