Proper way to implement biases in Neural Networks
I can make a neural network; I just need clarification on bias implementation. Which way is better: implement the bias matrices B1, B2, ..., Bn for each layer in their own matrix, separate from the weight matrix, or include the biases in the weight matrix by appending a 1 to the previous layer's output (this layer's input)? In images, I am asking whether this implementation:

[Image: layer computed with a separate bias matrix, W·x + B]

Or this implementation:

[Image: layer computed with the biases folded into the weight matrix and a 1 appended to the input]

is the better approach. Thank you.
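For concreteness, here is a minimal NumPy sketch of the two formulations for a single layer (my own illustration, not taken from either image; all names and sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(3)          # previous layer's output (3 units)
    W = rng.standard_normal((2, 3))     # weights of a layer with 2 units
    b = rng.standard_normal(2)          # biases of that layer

    # Option 1: keep the bias in its own vector
    z_separate = W @ x + b

    # Option 2: fold the bias into the weight matrix and append a 1 to the input
    W_aug = np.hstack([W, b[:, None]])  # shape (2, 4), last column holds the biases
    x_aug = np.append(x, 1.0)           # shape (4,)
    z_folded = W_aug @ x_aug

    print(np.allclose(z_separate, z_folded))  # True: both give the same pre-activations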

Threecolor answered 29/6, 2018 at 23:48 Comment(5)
The gradients for the bias are often simpler to calculate than the gradients for the normal nodes. In Andrew Ng's original ML course (2012), I remember including the biases in the matrix with the column of 1s, but in his 2016 course he has them separate. I'm assuming that it is more performant to keep them separate, as matrix multiplication is worse than quadratic time. I'm not sure if that changes when you have GPUs. You could try looking into the source code for popular libraries and see how they're doing it.Ampere
Mathematically, the two are equivalent.Jael
In terms of computational cost, in any sane implementation doing it either way won't make a difference: if the lower layer has N neurons and the next, upper layer has M neurons, then the computational cost of adding M biases will be dwarfed by the M * M x N operation of passing the activity in the lower layer via the weights to the upper.Jael
IMO, the only substantial distinction is in terms of code readability and maintainability. Having implemented several backprop and RBM variants from scratch, I would argue having the biases separate from the weights results in much cleaner code, especially if you want to play around with learning rules or different initializations. However, as prefaced, that is just an opinion.Jael
Correction: M * M x N should read N * N x M in the comment above.Jael

I think the best way is to have two separate matrices, one for the weights and one for the bias. Why?

  • I don't believe there is any meaningful increase in computational load: on a GPU, computing W*x + b costs essentially the same as the folded-in version. Mathematically and computationally the two are equivalent.

  • Greater modularity. Let's say you want to initialize the weights and the biases using different initializers (ones, zeros, Glorot, ...). With two separate matrices this is straightforward (see the sketch after this list).

  • Easier to read and maintain.
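
A minimal sketch of the modularity point, assuming NumPy and a hand-rolled Glorot-uniform-style initializer (both are illustrative, not from the answer): with separate matrices, weights and biases can use different initializers without any index bookkeeping.

    import numpy as np

    def glorot_uniform(fan_in, fan_out, rng):
        # Glorot/Xavier-style uniform initialization (illustrative helper)
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_out, fan_in))

    rng = np.random.default_rng(42)
    layer_sizes = [784, 128, 10]   # e.g. an MNIST-sized network

    # Weights: Glorot-style initialization, one matrix per layer
    weights = [glorot_uniform(n_in, n_out, rng)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    # Biases: simply zeros, chosen independently of the weight initializer
    biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]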

Ursula answered 15/11, 2018 at 9:28 Comment(0)

include the biases in the weight matrix by appending a 1 to the previous layer's output (this layer's input)

This seems to be what is implemented here: Machine Learning with Python: Training and Testing the Neural Network with MNIST data set, in the paragraph "Networks with multiple hidden layers".

I don't know if it's the best way to do it though. (Maybe not related but still: in the mentioned example code, it worked with sigmoid, but failed when I replaced it with ReLU).
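
A rough sketch of how that folded-in-bias forward pass can look for a network with multiple hidden layers (my own illustration, not the tutorial's code; sizes and initialization are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, weight_matrices):
        """Each weight matrix has shape (n_out, n_in + 1); the extra column is the bias."""
        a = x
        for W in weight_matrices:
            a = sigmoid(W @ np.append(a, 1.0))  # append a 1 before every layer
        return a

    rng = np.random.default_rng(0)
    sizes = [4, 5, 3]                                    # input, hidden, output
    Ws = [rng.standard_normal((n_out, n_in + 1)) * 0.1   # the +1 column holds the bias
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    print(forward(rng.standard_normal(4), Ws))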

Rainie answered 13/11, 2018 at 11:15 Comment(0)

In my opinion, implementing the bias matrices separately for each layer is the way to go. This adds more parameters for your model to learn, but it also gives the model more freedom to fit the data and converge.

For more information read this.

Bolduc answered 15/11, 2018 at 12:3 Comment(0)
