How does the epsilon hyperparameter affect tf.train.AdamOptimizer?

When I set epsilon=10e-8, AdamOptimizer doesn't work. When I set it to 1, it works just fine.

Quinze answered 5/4, 2017 at 3:5

Adam updates each variable as follows:

t <- t + 1

lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g

v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g

variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)

where g is the gradient.
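To make the role of epsilon concrete, here is a minimal sketch of one update for a single scalar variable, written in plain Python rather than TensorFlow. The function name `adam_step` and the choice to pass the state around explicitly are my own assumptions for illustration; the default values mirror the usual Adam defaults.

```python
import math

def adam_step(variable, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update to a scalar, following the equations above.

    adam_step is a hypothetical helper, not a TensorFlow function.
    """
    t = t + 1
    # Step size with the bias correction folded in.
    lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    # Moving averages of the gradient and of the squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # epsilon keeps the denominator away from zero when v is tiny.
    variable = variable - lr_t * m / (math.sqrt(v) + epsilon)
    return variable, m, v, t

# One step from variable = 1.0 with gradient 2.0 and zero-initialised moments.
x, m, v, t = adam_step(1.0, 2.0, m=0.0, v=0.0, t=0)
print(x, m, v, t)
```

With the default epsilon the first step moves the variable by almost exactly learning_rate, whatever the size of the gradient.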

Epsilon is there to avoid a divide-by-zero error in the variable update above when the gradient is almost zero, so ideally it should be a small value. But a small epsilon in the denominator produces larger weight updates, and with subsequent normalization the larger weights will always be normalized to 1.

So my guess is that when you train with a small epsilon, the optimizer becomes unstable.
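One way to see how a small epsilon leads to larger updates is to compute the size of the very first step for a tiny gradient under different epsilon values. This is only a sketch in plain Python under my own assumptions: `first_step_size` is a hypothetical helper, the moments start at zero, and the gradient value 1e-6 is picked purely for illustration.

```python
import math

def first_step_size(g, epsilon, learning_rate=0.001, beta1=0.9, beta2=0.999):
    """Size of the first Adam update for gradient g (moments start at zero)."""
    lr_t = learning_rate * math.sqrt(1 - beta2) / (1 - beta1)
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    return abs(lr_t * m / (math.sqrt(v) + epsilon))

tiny_gradient = 1e-6  # illustrative value, not taken from the question
for eps in (1e-8, 1e-4, 1.0):
    print(eps, first_step_size(tiny_gradient, eps))
```

With epsilon=1e-8 the step stays at the same order of magnitude as learning_rate even though the gradient is one millionth, because the gradient magnitude cancels out of the ratio. With epsilon=1.0 the denominator is dominated by epsilon, so the step shrinks roughly in proportion to the gradient.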

The trade-off is that the bigger you make epsilon (and with it the denominator), the smaller the weight updates are and the slower training progresses. Most of the time you want the denominator to be able to get small. Usually, an epsilon value greater than 10e-4 (that is, 1e-3) performs better.
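The slowdown from a large epsilon can be sketched on a toy problem. The snippet below runs the update rule on f(x) = x**2 for a fixed number of steps and compares where x ends up. The function `minimise_quadratic`, the starting point, and the step count are assumptions chosen for illustration; a real network will behave differently.

```python
import math

def minimise_quadratic(epsilon, steps=200, learning_rate=0.001,
                       beta1=0.9, beta2=0.999):
    """Run Adam on f(x) = x**2 from x = 1.0 and return the final x."""
    x, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * x  # gradient of x**2
        lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        x -= lr_t * m / (math.sqrt(v) + epsilon)
    return x

final_small_eps = minimise_quadratic(epsilon=1e-8)
final_large_eps = minimise_quadratic(epsilon=1.0)
print(final_small_eps, final_large_eps)
```

After the same number of steps, the run with epsilon=1.0 is still further from the minimum at 0 than the run with epsilon=1e-8, which is the slower progress described above.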

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet, a current good choice is 1.0 or 0.1 (see the TensorFlow documentation for tf.train.AdamOptimizer).

Confrere answered 30/6, 2017 at 10:41
