Batch training uses sum of updates? or average of updates?

Asked 18/7, 2017 at 7:55 Answered 18/7, 2017 at 8:6

I have few questions about batch training of neural networks.

First, when we update weights using batch training, the amount of change is accumulated gradients for the batch size. In this case, the amount of change is sum of the gradients? or average of the gradients?

If the answer is the sum of the gradients, the amount of change will be much bigger than online training, because the amounts are accumulated. In this case, I don't think the weights can be optimized well.

Otherwise, if the answer is the average of the gradients, then it seems very reasonable to optimize the weights well. However, in this case, we have to train much more times than online training because it updates the weight little only once for the batch size of data.

Second, whatever the answer of the first question is, when I use CNN sample codes of Tensorflow for MNIST as following, it can optimizer the weight so fast, so the training accuracy becomes above 90% even in the second step.

=======================================================================

train_step = tf.train.GradientDescentOptimizer(0.001).minimize(cross_entropy) accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float")) for i in range(1000): batch = mnist.train.next_batch(100) if i%100 == 0: train_accuracy = sess.run(accuracy, feed_dict={x:batch[0], y_:batch[1], keep_prob: 1.0}) sess.run(train_step, feed_dict={x: batch[0], y_:batch[1], keep_prob:1.0})

========================================================================

Please explain how does Tensorflow optimize the weight so very fast.

Clair answered 18/7, 2017 at 7:55 Comment(0)

The answer to this question depends on your loss function.

If loss_element is your loss function for one element of the batch, then, the loss of your batch will be some function of all your individual losses.

For example if you choose to use tf.reduce_mean, then your loss is averaged on all the elements of your batch. And so is the gradient. If you use tf.reduce_sum, then your gradient will be the sum of all your gradients element-wise.

Angst answered 18/7, 2017 at 8:6 Comment(0)

This is the same to use sum of gradients or average gradient because you later have to find a good learning rate that will most likely take into account the division by the batch size in the average of gradient.

However, using the average over the batch has the advantage of having a comparable loss between two training using different batch size.

Binate answered 18/7, 2017 at 8:4 Comment(0)

Recommended topics

Hot tags