Does bias in the convolutional layer really make a difference to the test accuracy?

Asked 22/8, 2018 at 3:35 Answered 14/12, 2020 at 20:24

Solved python tensorflow deep-learning conv-neural-network bias-neuron

I understand that biases are required in small networks to shift the activation function. But in a deep network with multiple convolutional layers, pooling, dropout and other non-linear activations, does bias really make a difference? A convolutional filter learns local features, and the same bias is used at every position of a given conv output channel.

This is not a dupe of this link. The above link only explains the role of bias in a small neural network and does not attempt to explain the role of bias in deep networks containing multiple CNN layers, dropout, pooling and non-linear activation functions.

I ran a simple experiment, and the results indicated that removing the bias from the conv layers made no difference to the final test accuracy. Two models were trained, and their test accuracy is almost the same (slightly better in the one without bias):

model_with_bias
model_without_bias (bias not added in the conv layers)

Are they being used only for historical reasons?

If using bias provides no gain in accuracy, shouldn't we omit it? That would mean fewer parameters to learn.

I would be thankful if someone with deeper knowledge than me could explain the significance (if any) of these biases in deep networks.

Here is the complete code and the result of the bias-vs-no_bias experiment:

batch_size = 16
patch_size = 5
depth = 16
num_hidden = 64

graph = tf.Graph()

with graph.as_default():

  # Input data.
  tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, image_size, image_size, num_channels))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  # Variables.
  layer1_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, num_channels, depth], stddev=0.1))
  layer1_biases = tf.Variable(tf.zeros([depth]))
  layer2_weights = tf.Variable(tf.truncated_normal(
      [patch_size, patch_size, depth, depth], stddev=0.1))
  layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth]))
  layer3_weights = tf.Variable(tf.truncated_normal(
      [image_size // 4 * image_size // 4 * depth, num_hidden], stddev=0.1))
  layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
  layer4_weights = tf.Variable(tf.truncated_normal(
      [num_hidden, num_labels], stddev=0.1))
  layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

  # Define a model with bias in the convolutional layers.
  def model_with_bias(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer1_biases)
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv + layer2_biases)
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

  # Define a model without bias in the convolutional layers.
  def model_without_bias(data):
    conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv)  # layer1_biases is not added
    conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
    hidden = tf.nn.relu(conv)  # layer2_biases is not added
    shape = hidden.get_shape().as_list()
    reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
    # Biases are added only in the fully connected layers (layer 3 and layer 4).
    hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
    return tf.matmul(hidden, layer4_weights) + layer4_biases

  # Training computation.
  logits_with_bias = model_with_bias(tf_train_dataset)
  loss_with_bias = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_with_bias))

  logits_without_bias = model_without_bias(tf_train_dataset)
  loss_without_bias = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits_without_bias))

  # Optimizer.
  optimizer_with_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_with_bias)
  optimizer_without_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_without_bias)

  # Predictions for the training, validation, and test data.
  train_prediction_with_bias = tf.nn.softmax(logits_with_bias)
  valid_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_valid_dataset))
  test_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_test_dataset))

  # Predictions for the model without conv biases.
  train_prediction_without_bias = tf.nn.softmax(logits_without_bias)
  valid_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_valid_dataset))
  test_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_test_dataset))

num_steps = 1001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  for step in range(num_steps):
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
    session.run(optimizer_with_bias, feed_dict=feed_dict)
    session.run(optimizer_without_bias, feed_dict=feed_dict)
  print('Test accuracy(with bias): %.1f%%' % accuracy(test_prediction_with_bias.eval(), test_labels))
  print('Test accuracy(without bias): %.1f%%' % accuracy(test_prediction_without_bias.eval(), test_labels))

Output:

Initialized

Test accuracy(with bias): 90.5%

Test accuracy(without bias): 90.6%

Phebe answered 22/8, 2018 at 3:35 Comment(3)

Biases are needed for convolutional layers for the same reason why they are needed for other layers. #2481150 – Twobit 22/8, 2018 at 6:6

I understand that biases are required in small networks to shift the activation function. But in the case of a deep network that has layers of CNN and other non-linear activations, is bias making a difference? Omitting the bias term in the above makes almost no difference. – Phebe 23/8, 2018 at 1:34

I am new to tensorflow, but having just done a Conv2d NN using the coursera course deepmind.ai, I was under the impression that tensorflow automatically takes care of the bias: Andrew Nugyen: "You don't need to worry about bias variables as you will soon see that TensorFlow functions take care of the bias." Perhaps you are seeing the same performance (slightly worse with bias) because they both have bias, you just are giving the one with bias an additional set of duplicate bias terms. If you look at the nn.conv2d method, you see it contains a bias which is added after the convolution. – Mortality 3/4, 2019 at 15:21

Biases are tuned alongside weights by learning algorithms such as gradient descent. Biases differ from weights in that they are independent of the output from previous layers. Conceptually, a bias is the weight on an input from a neuron with a fixed activation of 1, and so it is updated by subtracting just the product of the delta value and the learning rate.
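As a rough illustration of that update rule, here is a minimal sketch for a single linear neuron with squared-error loss. The function name `sgd_step` and the numbers are made up for the example; the only point is that the bias gradient is the delta multiplied by a constant input of 1.

```python
# Minimal sketch: one linear neuron y = w * x + b with loss 0.5 * (y - target)**2.
# sgd_step and all values below are illustrative assumptions.
def sgd_step(w, b, x, target, lr=0.1):
    y = w * x + b
    delta = y - target              # dL/dy for the squared-error loss
    w_new = w - lr * delta * x      # weight gradient is scaled by the input x
    b_new = b - lr * delta * 1.0    # bias gradient uses the fixed input 1
    return w_new, b_new


w, b = sgd_step(w=0.5, b=0.0, x=2.0, target=3.0)
print("updated weight:", w)
print("updated bias:", b)
```

The weight update depends on the previous layer's output `x`, while the bias update does not.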

In a large model, removing the bias inputs makes very little difference because each node can make a bias node out of the average activation of all of its inputs, which by the law of large numbers will be roughly normal. At the first layer, the ability for this to happen depends on your input distribution. On a small network you of course need a bias input, but on a large network removing it makes almost no difference.
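One way to picture this is the toy simulation below. It assumes, purely for illustration, that a unit's inputs fluctuate around a non-zero mean activation and that the unit spreads a small uniform weight over them. The names `emulated_bias` and `spread` and all constants are hypothetical. The weighted sum then behaves like a constant offset whose fluctuation shrinks as the number of inputs grows.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def emulated_bias(n_inputs, target_bias=1.0, mean_act=0.5, noise=0.2):
    # Inputs fluctuate around a non-zero mean activation (an assumption).
    inputs = [mean_act + rng.gauss(0.0, noise) for _ in range(n_inputs)]
    # A small uniform weight turns the average activation into an offset.
    w = target_bias / (n_inputs * mean_act)
    return sum(w * x for x in inputs)


def spread(n_inputs, trials=200):
    # Standard deviation of the emulated offset across random input draws.
    return statistics.pstdev([emulated_bias(n_inputs) for _ in range(trials)])


spread_small = spread(10)
spread_large = spread(1000)
print("fluctuation with 10 inputs:", spread_small)
print("fluctuation with 1000 inputs:", spread_large)
```

With few inputs the emulated offset is noisy, which fits the remark that small networks still need an explicit bias.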

Although removing the bias makes little difference in a large network, this still depends on the network architecture. For instance, in an LSTM:

Most applications of LSTMs simply initialize the LSTMs with small random weights which works well on many problems. But this initialization effectively sets the forget gate to 0.5. This introduces a vanishing gradient with a factor of 0.5 per timestep, which can cause problems whenever the long term dependencies are particularly severe. This problem is addressed by simply initializing the forget gates bias to a large value such as 1 or 2. By doing so, the forget gate will be initialized to a value that is close to 1, enabling gradient flow.
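The numbers in that passage can be checked with a few lines of plain Python. This is only a sketch: it assumes the weighted inputs to the forget gate are close to zero at initialization, so the gate value is just the sigmoid of its bias, and `timesteps = 50` is an arbitrary sequence length chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def retained_fraction(forget_bias, timesteps):
    # Fraction of a signal that survives `timesteps` multiplications by the
    # forget gate, assuming the gate stays at its initial value.
    return sigmoid(forget_bias) ** timesteps


timesteps = 50  # hypothetical sequence length
for forget_bias in (0.0, 1.0, 2.0):
    gate = sigmoid(forget_bias)
    kept = retained_fraction(forget_bias, timesteps)
    print("bias=%.1f gate=%.3f retained after %d steps=%.3g"
          % (forget_bias, gate, timesteps, kept))
```

A bias of 0 gives a gate of 0.5 and an essentially vanished signal after 50 steps, while a bias of 2 keeps the gate near 0.88 and the retained fraction many orders of magnitude larger.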


Aircraft answered 23/8, 2018 at 14:47 Comment(4)

I understand bias's role in a neural network. However, in a conv layer, where we just want to learn local features such as edges, patterns, moments, etc. that depend on the previous layer, do we really need bias? Is there any explanation for the results comparing test accuracy with and without bias? – Phebe 23/8, 2018 at 15:0

@Phebe In a large model, removing the bias inputs makes very little difference because each node can make a bias node out of the average activation of all of its inputs, which by the law of large numbers will be roughly normal. At the first layer, the ability for this to happen depends on your input distribution. For MNIST, for example, the input's average activation is roughly constant. On a small network you of course need a bias input, but on a large network removing it makes almost no difference. (But why would you remove it?) – Aircraft 23/8, 2018 at 15:12

My point is exactly what you have written: "On a small network, of course you need a bias input, but on a large network, removing it makes almost no difference." If it makes no difference, why add it? Removing the bias means fewer parameters to learn and less training time. There are hundreds or even thousands (in deeper architectures) fewer parameters to learn. – Phebe 23/8, 2018 at 15:26

@Aparajuli, in today's NN architectures, hundreds of biases are negligible compared to millions of parameters. Unfortunately, I couldn't find a mathematical reason. – Aircraft 23/8, 2018 at 19:11

In most networks you have a batchnorm layer after the conv layer, and that layer has its own bias. So if you have a batchnorm layer, adding a conv bias gives no gain. See: Can not use both bias and batch normalization in convolution layers
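A small sketch of why the conv bias becomes redundant in that case: batch normalization subtracts the per-channel mean, so any constant added before it cancels out, and the learned shift inside batchnorm takes over the bias role. The `batch_norm` helper below is a simplified stand-in (one channel, no learned scale or shift), and the values are invented for the example.

```python
import math


def batch_norm(values, eps=1e-5):
    # Simplified normalization of one channel over a batch:
    # subtract the mean, divide by the standard deviation.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


conv_out = [0.2, -1.3, 0.7, 2.1]  # hypothetical conv outputs for one channel
bias = 0.9                        # hypothetical conv bias for that channel

with_bias = batch_norm([v + bias for v in conv_out])
without_bias = batch_norm(conv_out)

# The constant shift is removed by the mean subtraction.
max_diff = max(abs(a - b) for a, b in zip(with_bias, without_bias))
print("largest difference after batchnorm:", max_diff)
```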

Otherwise, from a math perspective, you are learning different functions. However, it turns out that, in particular if you have a very complex network for a simple problem, you might achieve almost the same thing without biases as with biases, but end up using more parameters. In my experience, using a factor of 2-4 more parameters than needed rarely hurts performance in deep learning, in particular if you regularize. So it is hard to notice any difference. However, you might try using few channels (I don't think the depth of the network matters as much as the number of channels of the convolution) and see if bias makes a difference. I would guess so.
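To put the parameter counts discussed in this thread into numbers, here is a rough tally for the network in the question. `patch_size`, `depth` and `num_hidden` come from the question's code. `image_size = 28`, `num_channels = 1` and `num_labels = 10` are not given there and are assumed here only for illustration.

```python
def conv_params(kernel, in_channels, out_channels):
    # Returns (number of weights, number of biases) for one conv layer.
    return kernel * kernel * in_channels * out_channels, out_channels


# Values taken from the question's code.
patch_size, depth, num_hidden = 5, 16, 64
# Assumed values, not stated in the question.
image_size, num_channels, num_labels = 28, 1, 10

w1, b1 = conv_params(patch_size, num_channels, depth)
w2, b2 = conv_params(patch_size, depth, depth)
flat = (image_size // 4) ** 2 * depth      # features after two stride-2 convs
w3, b3 = flat * num_hidden, num_hidden     # first fully connected layer
w4, b4 = num_hidden * num_labels, num_labels

total_weights = w1 + w2 + w3 + w4
conv_biases = b1 + b2
total_params = total_weights + conv_biases + b3 + b4

print("weights:", total_weights)
print("conv biases:", conv_biases)
print("conv bias share of all parameters: %.3f%%"
      % (100.0 * conv_biases / total_params))
```

Under these assumptions the conv biases are a tiny share of the total, and for a single conv layer the ratio of biases to weights is 1 / (kernel * kernel * in_channels), so it only becomes noticeable with very small kernels and few input channels.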

Expedition answered 14/12, 2020 at 20:24 Comment(0)
