Initial bias values for a neural network

Asked 3/7, 2017 at 10:58 Answered 24/6, 2020 at 12:24

Solved machine-learning tensorflow bias-neuron

I am currently building a CNN in TensorFlow and I am initialising my weight matrix using He normal weight initialisation. However, I am unsure how I should initialise my bias values. I am using ReLU as my activation function between each convolutional layer. Is there a standard method for initialising bias values?

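import numpy as np
import tensorflow as tf

# He normal initialisation (Xavier with the ReLU correction described by He):
# std = sqrt(2 / fan_in), where fan_in = kernel_height * kernel_width * in_channels
def xavier_over_two(shape):
    fan_in = shape[0] * shape[1] * shape[2]
    std = np.sqrt(2.0 / fan_in)
    return tf.random_normal(shape, stddev=std)  # tf.random.normal in TF 2.x

def bias_init(shape):
    return #???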

Taratarabar answered 3/7, 2017 at 10:58 Comment(0)

Initializing the biases. It is possible and common to initialize the biases to be zero, since the asymmetry breaking is provided by the small random numbers in the weights. For ReLU non-linearities, some people like to use small constant value such as 0.01 for all biases because this ensures that all ReLU units fire in the beginning and therefore obtain and propagate some gradient. However, it is not clear if this provides a consistent improvement (in fact some results seem to indicate that this performs worse) and it is more common to simply use 0 bias initialization.

source: http://cs231n.github.io/neural-networks-2/
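
As a minimal sketch of the two options described above, the following uses NumPy rather than TensorFlow so it runs on its own. The `value` parameter and the layer size of 64 are assumptions chosen for illustration, not something the quoted source prescribes.

```python
import numpy as np

def bias_init(shape, value=0.0):
    """Return a constant bias array; value=0.0 is the common default."""
    return np.full(shape, value, dtype=np.float32)

# One bias per output channel of a conv layer with 64 filters (hypothetical size).
zero_bias = bias_init((64,))
# Small positive constant, the variant some people use with ReLU.
small_positive_bias = bias_init((64,), value=0.01)
```

In TensorFlow the same constants could be produced with `tf.zeros(shape)` or `tf.constant(0.01, shape=shape)`.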

Yeseniayeshiva answered 3/7, 2017 at 12:13 Comment(1)

Thanks, exactly what I was looking for! – Taratarabar 3/7, 2017 at 13:12

Be aware of the specific case of the last layer's bias. As Andrej Karpathy explains in his Recipe for Training Neural Networks:

init well. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iteration your network is basically just learning the bias.
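
The two examples in the quote can be sketched as follows. This is an illustration under the assumption that the final layer's weights are small enough at initialisation that the output is dominated by the bias; the helper names and the toy `targets` array are hypothetical. For a sigmoid output, the bias that yields an initial probability p is the logit log(p / (1 - p)).

```python
import numpy as np

def regression_output_bias(targets):
    """Final-layer bias for regression: the mean of the training targets."""
    return float(np.mean(targets))

def logit_bias_for_prior(p):
    """Final-layer bias so that sigmoid(bias) equals the prior probability p."""
    return float(np.log(p / (1.0 - p)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy regression targets with a mean of 50 (made-up values).
targets = np.array([48.0, 50.0, 52.0])
reg_bias = regression_output_bias(targets)

# Bias giving an initial predicted probability of 0.1, as in the quote.
cls_bias = logit_bias_for_prior(0.1)
initial_probability = sigmoid(cls_bias)
```

Note that a strict 1:10 ratio of positives to negatives corresponds to a positive fraction of 1/11, roughly 0.09; the quote rounds this to 0.1.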

Surmullet answered 24/6, 2020 at 12:24 Comment(1)

An explanation for the examples Karpathy made can be found here – Osteophyte 5/5, 2023 at 19:39
