I am trying to tune a hyperparameter, namely the batch size, in a CNN. I have a Core i7 machine with 12 GB of RAM, and I am training a CNN on the CIFAR-10 dataset, which can be found in this blog.
First, here is what I have read and learned about batch size in machine learning:
let's first suppose that we're doing online learning, i.e. that we're using a minibatch size of 1. The obvious worry about online learning is that using minibatches which contain just a single training example will cause significant errors in our estimate of the gradient. In fact, though, the errors turn out to not be such a problem. The reason is that the individual gradient estimates don't need to be superaccurate. All we need is an estimate accurate enough that our cost function tends to keep decreasing. It's as though you are trying to get to the North Magnetic Pole, but have a wonky compass that's 10-20 degrees off each time you look at it. Provided you stop to check the compass frequently, and the compass gets the direction right on average, you'll end up at the North Magnetic Pole just fine.
Based on this argument, it sounds as though we should use online learning. In fact, the situation turns out to be more complicated than that. As we know, we can use matrix techniques to compute the gradient update for all examples in a minibatch simultaneously, rather than looping over them. Depending on the details of our hardware and linear algebra library, this can make it quite a bit faster to compute the gradient estimate for a minibatch of (for example) size 100, rather than computing the minibatch gradient estimate by looping over the 100 training examples separately. It might take (say) only 50 times as long, rather than 100 times as long. Now, at first it seems as though this doesn't help us that much.
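To make the point about matrix techniques concrete, here is a small NumPy sketch (my own illustration, not from the book): the gradient of a squared-error cost for a linear model, computed once by looping over a minibatch of 100 examples and once as a single matrix operation. The two results agree, but the matrix version does the work in a couple of vectorized calls.

```python
import numpy as np

# Illustrative setup: linear model y = X @ w with squared-error cost.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # minibatch of 100 examples, 10 features
y = rng.normal(size=100)
w = rng.normal(size=10)

# Looping over the 100 examples one at a time:
grad_loop = np.zeros(10)
for i in range(100):
    err = X[i] @ w - y[i]        # per-example residual
    grad_loop += err * X[i]      # per-example gradient contribution
grad_loop /= 100

# The same gradient as one matrix operation over the whole minibatch:
grad_matrix = X.T @ (X @ w - y) / 100

assert np.allclose(grad_loop, grad_matrix)
```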
With our minibatch of size 100 the learning rule for the weights looks like:

$$w \rightarrow w' = w - \frac{\eta}{100} \sum_x \nabla C_x$$

where the sum is over training examples in the minibatch. This is versus

$$w \rightarrow w' = w - \eta \nabla C_x$$
for online learning. Even if it only takes 50 times as long to do the minibatch update, it still seems likely to be better to do online learning, because we'd be updating so much more frequently. Suppose, however, that in the minibatch case we increase the learning rate by a factor of 100, so the update rule becomes

$$w \rightarrow w' = w - \eta \sum_x \nabla C_x$$
That's a lot like doing 100 separate instances of online learning with a learning rate of $\eta$. But it only takes 50 times as long as doing a single instance of online learning. Still, it seems distinctly possible that using the larger minibatch would speed things up.
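The algebra in the quoted passage can be checked directly. In the sketch below (a toy cost of my own choosing, $C_x(w) = (w - x)^2 / 2$, so $\nabla C_x = w - x$), one minibatch step with the learning rate scaled up by 100 equals the sum of 100 online-style steps of size $\eta$, provided all 100 gradients are evaluated at the same starting weight. Real online learning differs only in that it updates $w$ between examples.

```python
import numpy as np

# Toy cost C_x(w) = (w - x)^2 / 2, so grad C_x(w) = w - x.
# "xs" plays the role of 100 training examples; the names are illustrative.
rng = np.random.default_rng(1)
xs = rng.normal(loc=3.0, size=100)
eta = 0.01
w0 = 0.0

# One minibatch step with the learning rate scaled up by a factor of 100:
#   w -> w - (100 * eta) * (1/100) * sum_x grad C_x(w0)
w_batch = w0 - (100 * eta) * np.mean(w0 - xs)

# The sum of 100 online-style steps of size eta, all evaluated at w0:
#   w -> w - eta * sum_x grad C_x(w0)
w_summed = w0 - eta * np.sum(w0 - xs)

assert np.isclose(w_batch, w_summed)
```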
Now I tried the MNIST digit dataset: I ran a sample program and set the batch size to 1 at first. I noted down the training time needed for the full dataset. Then I increased the batch size, and training became faster.
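The timing experiment can be mimicked with a small stand-in (fake MNIST-shaped data and a linear model, not the actual sample program): the Python-level loop dominates at batch size 1, while larger batches amortize it into a few fast matrix operations.

```python
import time
import numpy as np

# Stand-in for the timing experiment: fake MNIST-shaped data, linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(6000, 784)).astype(np.float32)
y = rng.normal(size=6000).astype(np.float32)
w = np.zeros(784, dtype=np.float32)

times = {}
for batch_size in (1, 32, 128):
    start = time.perf_counter()
    for i in range(0, len(X), batch_size):
        Xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        grad = Xb.T @ (Xb @ w - yb) / len(Xb)  # one minibatch gradient
    times[batch_size] = time.perf_counter() - start
    print(f"batch_size={batch_size:4d}  epoch time={times[batch_size]:.3f}s")
```

On my machine the batch-size-1 pass is clearly the slowest, matching what I observed with the real MNIST code.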
But in the case of training with this code and GitHub link, changing the batch size does not decrease the training time. It stayed the same whether I used 30, 64, or 128. They say they got 92% accuracy, and that after two or three epochs they were already above 40% accuracy. But when I ran the code on my computer, without changing anything except the batch size, I got worse results: after 10 epochs only 28%, and the test accuracy stayed stuck there in the following epochs. Then I thought that since they had used a batch size of 128, I needed to use that too. But with batch size 128 it became even worse, giving only 11% after 10 epochs and getting stuck there. Why is that?
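One thing I can quantify (assuming the standard CIFAR-10 training set of 50,000 images): the number of weight updates per epoch shrinks as the batch size grows, so at batch size 128 the network gets far fewer updates in 10 epochs than at batch size 30 unless the learning rate is raised to compensate, which matches the quoted argument above.

```python
# Updates per epoch for CIFAR-10's 50,000 training images (assumed size),
# dropping the final partial batch as most training loops do.
n_train = 50000
for batch_size in (30, 64, 128):
    updates = n_train // batch_size
    print(f"batch_size={batch_size:3d}  updates/epoch={updates:5d}  "
          f"updates in 10 epochs={10 * updates}")
```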