What does the losses property of TensorFlow Probability's Bayesian layers represent?

I am running the example code for a Bayesian neural network implemented using TensorFlow Probability.

My question is about the implementation of the ELBO loss used for variational inference. The loss is the (negative) ELBO, which is the sum of two terms, namely neg_log_likelihood and kl, as implemented in the code. I have difficulty understanding the implementation of the kl term.
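
As a sketch in standard notation, with q(w) the variational posterior over the weights and p(w) the prior, the loss being minimized is the negative ELBO:

  $$-\mathrm{ELBO}(q) = \mathbb{E}_{q(w)}\big[-\log p(y \mid x, w)\big] + \mathrm{KL}\big(q(w)\,\|\,p(w)\big)$$

where the first term corresponds to neg_log_likelihood and the second to kl.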

Here is how the model is defined:

with tf.name_scope("bayesian_neural_net", values=[images]):
  neural_net = tf.keras.Sequential()
  for units in FLAGS.layer_sizes:
    layer = tfp.layers.DenseFlipout(units, activation=FLAGS.activation)
    neural_net.add(layer)
  neural_net.add(tfp.layers.DenseFlipout(10))
  logits = neural_net(images)
  labels_distribution = tfd.Categorical(logits=logits)

Here is how the kl term is defined:

kl = sum(neural_net.losses) / mnist_data.train.num_examples

I am not sure what neural_net.losses returns here, since there is no loss function defined for neural_net. Clearly, neural_net.losses returns some values, but I don't know what the returned values mean. Any comments on this?

My guess is that it is the L2 norm, but I am not sure. If that is the case, we are still missing something. In Appendix B of the VAE paper, the authors derive the KL term in closed form when the prior is a standard normal. It turns out to be pretty close to an L2 norm of the variational parameters, except that there are additional log-variance terms and a constant term. Any comments on this?
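
For reference, the closed-form expression from Appendix B of the VAE paper, for a diagonal Gaussian posterior and a standard normal prior, is

  $$\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\big) = \tfrac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1\big),$$

i.e. the squared L2 norm of the means plus the variance, log-variance, and constant terms.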

Marinara answered 27/4, 2018 at 14:40

The losses attribute of a TensorFlow Keras Layer represents side-effect computations such as regularizer penalties. Unlike regularizer penalties on specific TensorFlow variables, here the losses represent the KL divergence computation. Check out the implementation as well as the docstring's example:

We illustrate a Bayesian neural network with variational inference, assuming a dataset of features and labels.

  import tensorflow as tf
  import tensorflow_probability as tfp

  model = tf.keras.Sequential([
      tfp.layers.DenseFlipout(512, activation=tf.nn.relu),
      tfp.layers.DenseFlipout(10),
  ])
  logits = model(features)
  neg_log_likelihood = tf.nn.softmax_cross_entropy_with_logits(
      labels=labels, logits=logits)
  kl = sum(model.losses)  # KL terms contributed by the DenseFlipout layers
  loss = neg_log_likelihood + kl
  train_op = tf.train.AdamOptimizer().minimize(loss)

It uses the Flipout gradient estimator to minimize the Kullback-Leibler divergence up to a constant, also known as the negative Evidence Lower Bound. It consists of the sum of two terms: the expected negative log-likelihood, which we approximate via Monte Carlo; and the KL divergence, which is added via the regularizer terms that are arguments to the layer.
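
The divergence that gets added to losses can also be customized through the layer's arguments. Below is a minimal sketch (with an assumed num_examples training-set size) of scaling the KL per example via kernel_divergence_fn, matching the kl = sum(neural_net.losses) / mnist_data.train.num_examples line in the question:

  import tensorflow_probability as tfp

  tfd = tfp.distributions
  num_examples = 60000  # assumed training-set size (e.g. MNIST)

  # Divide the layer's KL term by the training-set size so that it is on the
  # same per-example scale as the mean negative log-likelihood.
  scaled_kl_fn = lambda q, p, _: tfd.kl_divergence(q, p) / num_examples

  layer = tfp.layers.DenseFlipout(10, kernel_divergence_fn=scaled_kl_fn)
  # Once the layer has been called on an input, the scaled KL term shows up in
  # layer.losses, so sum(model.losses) can be added directly to the mean NLL.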

Perturbation answered 27/4, 2018 at 17:31 Comment(1)
Hey Dustin, in the TFP examples the Bayesian loss is computed as neg_log_likelihood = -tf.reduce_mean(input_tensor=labels_distribution.log_prob(labels)), which seems to be more explicitly in line with the ELBO loss than softmax cross-entropy. Are those equivalent? – Voltz
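
The two should agree up to the label encoding: tfd.Categorical(logits=logits).log_prob(labels) is the log-softmax probability of the true class, i.e. minus the sparse softmax cross-entropy. A minimal sketch with made-up logits and integer labels comparing the two quantities:

  import tensorflow as tf
  import tensorflow_probability as tfp

  tfd = tfp.distributions

  logits = tf.constant([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])  # made-up logits
  labels = tf.constant([0, 2])                               # integer class labels

  # Log-probability of the true class under a Categorical distribution ...
  log_prob = tfd.Categorical(logits=logits).log_prob(labels)
  # ... equals minus the sparse softmax cross-entropy on the same logits/labels,
  # so -tf.reduce_mean(log_prob) matches tf.reduce_mean(xent).
  xent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)

  # Note: the docstring's softmax_cross_entropy_with_logits variant expects
  # one-hot labels but computes the same per-example quantity.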
