Will a larger batch size reduce computation time in machine learning?

Asked 2/2, 2016 at 16:12 Answered 11/10, 2020 at 16:59

Solved machine-learning neural-network conv-neural-network torch gradient-descent

I am trying to tune a hyperparameter, the batch size, in a CNN. I have a computer with a Core i7 and 12 GB of RAM, and I am training a CNN on the CIFAR-10 dataset, which can be found in this blog.

First, here is what I have read and learned about batch size in machine learning:

let's first suppose that we're doing online learning, i.e. that we're using a minibatch size of 1. The obvious worry about online learning is that using minibatches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be superaccurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.

Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. As we know we can use matrix techniques to compute the gradient update for all examples in a minibatch simultaneously, rather than looping over them. Depending on the details of our hardware and linear algebra library this can make it quite a bit faster to compute the gradient estimate for a minibatch of (for example) size 100, rather than computing the minibatch gradient estimate by looping over the 100 training examples separately. It might take (say) only 50 times as long, rather than 100 times as long. Now, at first it seems as though this doesn't help us that much.

With our minibatch of size 100 the learning rule for the weights looks like:

$$w \rightarrow w' = w - \eta \frac{1}{100} \sum_x \nabla C_x,$$

where the sum is over training examples in the minibatch. This is versus

$$w \rightarrow w' = w - \eta \nabla C_x$$

for online learning. Even if it only takes 50 times as long to do the minibatch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the minibatch case we increase the learning rate by a factor 100, so the update rule becomes

$$w \rightarrow w' = w - \eta \sum_x \nabla C_x.$$

That's a lot like doing 100 separate instances of online learning with a learning rate of η. But it only takes 50 times as long as doing a single instance of online learning. Still, it seems distinctly possible that using the larger minibatch would speed things up.
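
The update rules described above can be checked numerically. What follows is only a minimal sketch in plain Python, assuming a hypothetical one-parameter model with per-example cost C_x = 0.5 * (w*x - y)^2. The data, the value of `eta`, and the function names are made up for illustration and come from neither the quoted text nor the linked code.

```python
import random

def grad_single(w, x, y):
    """Gradient of the per-example cost C_x = 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def minibatch_update(w, batch, eta):
    """w -> w - eta * (1/m) * sum of per-example gradients over the batch."""
    m = len(batch)
    return w - eta * sum(grad_single(w, x, y) for x, y in batch) / m

def online_update(w, example, eta):
    """w -> w - eta * gradient of a single example."""
    x, y = example
    return w - eta * grad_single(w, x, y)

random.seed(0)
# Hypothetical data: y = 3x plus a little noise.
batch = []
for _ in range(100):
    x = random.uniform(-1, 1)
    batch.append((x, 3.0 * x + random.gauss(0, 0.1)))

w0, eta = 0.0, 0.001

# One minibatch step with the learning rate scaled by the batch size (100 * eta).
w_scaled = minibatch_update(w0, batch, eta * len(batch))

# The same step written as eta times the plain sum of gradients at w0.
w_summed = w0 - eta * sum(grad_single(w0, x, y) for x, y in batch)

# 100 genuine online steps: the gradient is re-evaluated after every update.
w_online = w0
for example in batch:
    w_online = online_update(w_online, example, eta)

print(w_scaled, w_summed, w_online)
```

The first two values agree up to floating-point rounding, while the genuinely online result is close but not identical, because each online step sees the weight already moved by the previous steps. That is why the quoted text says "a lot like" rather than "the same as".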

I tried this with the MNIST digit dataset: I ran a sample program with the batch size set to 1 and noted the training time needed for the full dataset. Then I increased the batch size and noticed that training became faster.
But when training with this code and github link, changing the batch size doesn't decrease the training time. It stayed the same whether I used 30, 64, or 128. They say they got 92% accuracy, and above 40% after two or three epochs. But when I ran the code on my computer, changing nothing other than the batch size, I got a worse result: only 28% after 10 epochs, and the test accuracy stayed stuck there in the following epochs. Then I thought that since they used a batch size of 128, I should use that too. I did, but the result became even worse: only 11% after 10 epochs, and it stayed stuck there. Why is that?

Penology answered 2/2, 2016 at 16:12 Comment(1)

Yes, it will reduce the computation time. But, it will increase the amount of memory used. So, if your PC is already utilizing most of the memory, then do not go for large batch size, otherwise you can. – Wildebeest 4/4, 2019 at 6:57

Neural networks learn by gradient descent on an error function in weight space, which is parametrized by the training examples. This means the variables are the weights of the neural network. The function is "generic" and becomes specific when you use training examples. The "correct" way would be to use all training examples to make the specific function. This is called "batch gradient descent" and is usually not done for two reasons:

It might not fit in your RAM (usually GPU, as for neural networks you get a huge boost when you use the GPU).
It is actually not necessary to use all examples.

In machine learning problems, you usually have several thousands of training examples. But the error surface might look similar when you only look at a few (e.g. 64, 128 or 256) examples.

Think of it as a photo: To get an idea of what the photo is about, you usually don't need a 2500x1800px resolution. A 256x256px image will give you a good idea what the photo is about. However, you miss details.

So imagine gradient descent to be a walk on the error surface: You start at one point and you want to find the lowest point. To do so, you walk down. Then you check your height again, check in which direction it goes down and make a "step" (of which the size is determined by the learning rate and a couple of other factors) in that direction. When you have mini-batch training instead of batch training, you walk down on a different error surface: the low-resolution error surface. A step there might actually go up on the "real" error surface. But overall, you will go in the right direction. And you can make single steps much faster!

Now, what happens when you make the resolution lower (the batch size smaller)?

Right, your image of what the error surface looks like gets less accurate. How much this affects you depends on factors like:

Your hardware/implementation
Dataset: How complex is the error surface, and how well is it approximated by only a small portion of the data?
Learning: How exactly are you learning (momentum? newbob? rprop?)
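
The "lower resolution" picture can be made concrete by measuring how far a mini-batch gradient lands from the full-batch gradient. This is only a rough sketch in plain Python, assuming a hypothetical one-parameter linear model with squared error on synthetic data. The dataset, the batch sizes, and the helper names `mean_gradient` and `typical_error` are illustrative choices, not something taken from the question's code.

```python
import random
import statistics

random.seed(0)

# Hypothetical dataset: y = 2x plus noise; the model is y_hat = w * x.
data = []
for _ in range(5000):
    x = random.uniform(-1, 1)
    data.append((x, 2.0 * x + random.gauss(0, 0.5)))

def mean_gradient(w, examples):
    """Mean gradient of 0.5 * (w*x - y)**2 with respect to w over the examples."""
    return sum((w * x - y) * x for x, y in examples) / len(examples)

w = 0.0
full_grad = mean_gradient(w, data)  # the "high-resolution" gradient over all examples

def typical_error(batch_size, trials=200):
    """Average absolute gap between a mini-batch gradient and the full gradient."""
    gaps = []
    for _ in range(trials):
        batch = random.sample(data, batch_size)
        gaps.append(abs(mean_gradient(w, batch) - full_grad))
    return statistics.mean(gaps)

errors = {b: typical_error(b) for b in (1, 16, 64, 256)}
for b, e in errors.items():
    print(f"batch size {b:>3}: mean gradient error {e:.4f}")
```

In this toy setting the gap shrinks as the batch grows, roughly with the square root of the batch size. How much that extra accuracy is worth on a real network still depends on the factors listed above.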

Secretarygeneral answered 2/2, 2016 at 20:9 Comment(4)

Is batch size related to accuracy? – Penology 3/2, 2016 at 11:56

@setubasak When you have the right training algorithm and train long enough, the batch size should not have significant influence on the accuracy. – Secretarygeneral 3/2, 2016 at 12:2

Will a larger batch size reduce the training time if my hardware supports it? – Penology 3/2, 2016 at 12:31

@setubasak A larger batch size means the computation of the error function takes longer per step, but overall you need fewer steps to get to the same accuracy. I am not aware of any systematic research about this. In fact, I think this question probably can't be answered satisfactorily, because it depends on the factors I gave above. Just try it. Common batch sizes are 64, 128, 256. – Secretarygeneral 3/2, 2016 at 12:35

I'd like to add to what's already been said here that a larger batch size is not always good for generalization. I've seen cases myself where an increase in batch size hurt validation accuracy, particularly for a CNN trained on the CIFAR-10 dataset.

From "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima":

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say 32–512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions—and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.

Bottom-line: you should tune the batch size, just like any other hyperparameter, to find an optimal value.
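
Treating the batch size as a hyperparameter can be sketched as a small sweep scored on held-out data. The example below is a toy illustration in plain Python, assuming a hypothetical 1-D linear regression problem with a fixed learning rate and epoch budget; all names (`make_data`, `train`, `mse`) and numbers are made up. It does not reproduce the generalization gap described in the paper, it only shows the shape of such a sweep.

```python
import random

def make_data(n, rng):
    """Hypothetical 1-D regression data: y = 2x + 1 plus noise."""
    out = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        out.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.1)))
    return out

def mse(params, examples):
    """Mean squared error of the linear model (w, b) on the examples."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)

def train(train_set, batch_size, lr, epochs, rng):
    """Plain mini-batch SGD on a linear model; returns the fitted (w, b)."""
    w, b = 0.0, 0.0
    examples = list(train_set)
    for _ in range(epochs):
        rng.shuffle(examples)
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            # Mean gradient of 0.5 * squared error over the mini-batch.
            gw = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

rng = random.Random(0)
train_set = make_data(512, rng)
val_set = make_data(256, rng)

results = {}
for batch_size in (8, 32, 128, 512):
    # Same learning rate, epoch budget, and shuffling seed for every batch size.
    params = train(train_set, batch_size, lr=0.1, epochs=5, rng=random.Random(1))
    results[batch_size] = mse(params, val_set)

best = min(results, key=results.get)
for batch_size, loss in results.items():
    print(f"batch size {batch_size:>3}: validation MSE {loss:.4f}")
print("best batch size in this toy sweep:", best)
```

Because the learning rate and the number of epochs are held fixed here, the largest batch takes far fewer update steps and ends up with a worse validation loss. In a real sweep the learning rate would normally be tuned together with the batch size.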

Gypsophila answered 5/10, 2017 at 16:16 Comment(0)

A 2018 opinion, retweeted by Yann LeCun, comes from the paper Revisiting Small Batch Training For Deep Neural Networks by Dominic Masters and Carlo Luschi, which suggests that a good generic maximum batch size is 32, with some interplay with the choice of learning rate.

The earlier 2016 paper On Large-batch Training For Deep Learning: Generalization Gap And Sharp Minima gives some reasons for not using big batches. To paraphrase it badly: big batches are likely to get stuck in "sharp" minima, which generalize poorly, while small batches tend to end up in flat ones.

Chair answered 11/10, 2020 at 16:59 Comment(0)
