What is the purpose of the Tensorflow Gradient Tape?

I watched the TensorFlow Dev Summit video on Eager Execution in TensorFlow, and the presenter gave an introduction to "Gradient Tape." Now I understand that Gradient Tape tracks the automatic differentiation that occurs in a TF model.

I am trying to understand why I would use Gradient Tape. Can anyone explain how Gradient Tape is used as a diagnostic tool? Why would someone use Gradient Tape versus just TensorBoard visualization of weights?

So I get that the automatic differentiation that occurs with a model is to compute the gradients at each node--meaning the adjustment of the weights and biases at each node, given some batch of data. So that is the learning process. But I was under the impression that I can actually use a tf.keras.callbacks.TensorBoard() callback to see the TensorBoard visualization of training--so I can watch the weights on each node and determine if there are any dead or oversaturated nodes.
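
(For reference, this is roughly what I mean by watching training in TensorBoard; the toy model and data below are just placeholders I made up for illustration.)

import numpy as np
import tensorflow as tf

# Toy data and model purely for illustration.
x_train = np.random.rand(64, 4).astype("float32")
y_train = np.random.randint(0, 2, size=(64,)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# histogram_freq=1 also logs weight histograms every epoch.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs", histogram_freq=1)
model.fit(x_train, y_train, epochs=5, callbacks=[tb_callback])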

Is the use of Gradient Tape only to see if some gradients go to zero or get really big, etc.? Or is there some other use of Gradient Tape?

Fibrosis answered 28/12, 2018 at 2:39 Comment(0)

Having worked on this for a while since posting the initial question, I have a better sense of where Gradient Tape is useful. The most useful application of Gradient Tape seems to be when you design a custom layer in your Keras model, for example--or, equivalently, when you design a custom training loop for your model.

If you have a custom layer, you can define exactly how the operations occur within that layer, including how the gradients are computed and how the accumulated loss is calculated.

So Gradient Tape simply gives you direct access to the individual gradients in the layer.

Here is an example from Aurélien Géron's 2nd-edition book on TensorFlow (Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow).

Say you have a function that you want to use as your activation:

def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

Now, if you want to take the derivatives of this function with respect to w1 and w2:

import tensorflow as tf

w1, w2 = tf.Variable(5.), tf.Variable(3.)   # the variables we differentiate with respect to
with tf.GradientTape() as tape:
    z = f(w1, w2)                           # operations in this block are recorded on the tape

gradients = tape.gradient(z, [w1, w2])      # [dz/dw1, dz/dw2] = [36.0, 10.0]

So the tape will calculate the gradients and give you access to those values. Then you can double them, square them, clip them, whatever you like. Whatever you choose to do, you can then hand the adjusted gradients to your optimizer to apply in the backpropagation step of your custom training loop.
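
For example, here is a minimal sketch of what that looks like in a custom training step. The names model, loss_fn, and optimizer are just placeholders for whatever you have defined, and the gradient clipping is only one example of an adjustment you might make:

import tensorflow as tf

def train_step(model, loss_fn, optimizer, x_batch, y_batch):
    # `model`, `loss_fn`, and `optimizer` are assumed to be an existing
    # Keras model, loss function, and optimizer (placeholders here).
    with tf.GradientTape() as tape:
        predictions = model(x_batch, training=True)   # forward pass is recorded on the tape
        loss = loss_fn(y_batch, predictions)

    gradients = tape.gradient(loss, model.trainable_variables)

    # Adjust the raw gradients however you like before applying them,
    # e.g. clip them to keep training stable.
    gradients = [tf.clip_by_norm(g, 1.0) for g in gradients]

    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss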

Fibrosis answered 3/6, 2019 at 0:46 Comment(2)
Why do we write the gradients = tape.gradient(z, [w1, w2]) line outside the with block? I've seen examples with it inside the block, and others outside.Marathon
Hmm, I think everything within the context manager is recorded for automatic differentiation. So if you put gradients = tape.gradient(z, [w1, w2]) inside the block, you end up recording the gradient computation itself on the tape, rather than just the expression you want to differentiate. In this case we only want the derivatives of z, so we keep just those expressions on the tape and call tape.gradient once we are outside it. Then TF can do whatever optimizations it likes to accelerate the gradient computation.Fibrosis

With eager execution enabled, TensorFlow calculates the values of tensors as they occur in your code. That means it doesn't precompute a static graph whose inputs are fed in through placeholders. So, to backpropagate errors, you have to keep track of the gradients of your computation yourself and then apply these gradients to an optimiser.

This is very different from running without eager execution, where you would build a graph and then simply use sess.run to evaluate your loss and pass it into an optimiser directly.

Fundamentally, because tensors are evaluated immediately, you don't have a graph from which to calculate gradients, and so you need a gradient tape. It is not so much that it is just used for visualisation, but more that you cannot implement gradient descent in eager mode without it.
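
As a minimal sketch of that point (not code from this answer), here is gradient descent on a single variable in eager mode; the tape is what supplies the gradient for each update:

import tensorflow as tf

x = tf.Variable(10.0)                     # parameter to optimise
learning_rate = 0.1

for _ in range(50):
    with tf.GradientTape() as tape:
        loss = (x - 3.0) ** 2             # evaluated eagerly, recorded on the tape
    grad = tape.gradient(loss, x)         # dloss/dx for the current value of x
    x.assign_sub(learning_rate * grad)    # manual gradient-descent update

print(x.numpy())                          # converges towards 3.0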

Obviously, TensorFlow could just keep track of every gradient for every computation on every tf.Variable. However, that could be a huge performance bottleneck. Instead, it exposes a gradient tape so that you can control which areas of your code need the gradient information. Note that in non-eager mode, this is statically determined based on the computational branches that are descendants of your loss, but in eager mode there is no static graph and so no way of knowing.
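
As a small illustration of that control (a sketch, not from the original answer): trainable variables are watched by the tape automatically, while plain tensors are only tracked if you explicitly ask for it:

import tensorflow as tf

v = tf.Variable(2.0)        # trainable variables are watched automatically
c = tf.constant(3.0)        # constant tensors are not watched by default

with tf.GradientTape() as tape:
    tape.watch(c)           # explicitly opt this tensor in
    y = v * c

dy_dv, dy_dc = tape.gradient(y, [v, c])
print(dy_dv.numpy(), dy_dc.numpy())   # 3.0 and 2.0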

Deirdre answered 1/1, 2019 at 12:9 Comment(0)

I think the most important thing to say in answer to this question is simply that GradientTape is not a diagnostic tool. That's the misconception here.

GradientTape is a mathematical tool for automatic differentiation (autodiff), which is the core functionality of TensorFlow. It does not "track" the autodiff; it is a key part of performing the autodiff.

As the other answers describe, it is used to record ("tape") a sequence of operations performed on some input that produces some output, so that the output can be differentiated with respect to the input via backpropagation (reverse-mode autodiff), in order to then perform gradient-descent optimisation.
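
As a tiny illustration of that recording (a sketch, not from the answer itself): each operation in the chain below is taped, and tape.gradient then walks back through the tape to give the derivative of the output with respect to the input:

import tensorflow as tf

x = tf.Variable(1.0)            # the input we differentiate with respect to

with tf.GradientTape() as tape:
    y = tf.sin(x)               # each operation is recorded as it runs
    z = y ** 2                  # output = sin(x)^2

dz_dx = tape.gradient(z, x)     # reverse-mode autodiff back through the tape
print(dz_dx.numpy())            # 2*sin(1)*cos(1) ≈ 0.909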

Jaquesdalcroze answered 15/11, 2020 at 2:54 Comment(2)
But why do we need the with tf.GradientTape() as tape block with the function inside? Why can't we just use z.gradient(w1)?Marathon
Because calculating gradients requires that you store the values of all the nodes in the computation graph, which for optimisation reasons is not something you usually want to do.Jaquesdalcroze
