how is backpropagation the same (or not) as reverse automatic differentiation?

The Wikipedia page for backpropagation has this claim:

The backpropagation algorithm for calculating a gradient has been rediscovered a number of times, and is a special case of a more general technique called automatic differentiation in the reverse accumulation mode.

Can someone expound on this and put it in layman's terms? What is the function being differentiated? What is the "special case"? Is it the adjoint values themselves that are used, or the final gradient?

Update: since writing this I have found that this is covered in the Deep Learning book, section 6.5.9. See https://www.deeplearningbook.org/ . I have also found this paper to be informative on the subject: "Stable architectures for deep neural networks" by Haber and Ruthotto.

Nutter answered 6/5, 2014 at 3:48

"What is the function being differentiated? What is the "special case?""

The most important distinction between backpropagation and reverse-mode AD is that reverse-mode AD computes the vector-Jacobian product of a vector-valued function from R^n -> R^m, while backpropagation computes the gradient of a scalar-valued function from R^n -> R. Backpropagation is therefore the special case of reverse-mode AD in which the output is a scalar.

When we train neural networks, we always have a scalar-valued loss function, so we are always using backpropagation. This loss is the function being differentiated. And since backprop is a special case of reverse-mode AD, we are also using reverse-mode AD whenever we train a neural network.
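A minimal JAX sketch (assuming jax is installed; the functions f and loss below are made-up placeholders, not anything from the question) makes the distinction concrete: jax.vjp gives the vector-Jacobian product of a vector-valued function, while jax.grad of a scalar loss is the backpropagation special case.

    import jax
    import jax.numpy as jnp

    def f(x):                        # vector-valued function R^3 -> R^2
        return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

    def loss(x):                     # scalar-valued function R^3 -> R
        return jnp.sum(f(x) ** 2)

    x = jnp.array([1.0, 2.0, 3.0])

    # Reverse-mode AD: pull an arbitrary cotangent v in R^2 back through f.
    y, vjp_fn = jax.vjp(f, x)
    v = jnp.array([1.0, 0.5])
    vjp_x, = vjp_fn(v)               # v^T J, shape (3,)

    # Backpropagation: the scalar special case, where the "v" is just 1.0.
    grad_x = jax.grad(loss)(x)       # gradient of the loss, shape (3,)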

"Is it the adjoint values themselves that are used or the final gradient?"

The adjoint of a variable is the gradient of the loss function with respect to that variable. When we train a neural network, we use the gradients of the loss with respect to the parameters (weights, biases, etc.) to update those parameters. So we do use the adjoints, but only the adjoints of the parameters (which are exactly the gradient of the loss with respect to the parameters).
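Concretely, a gradient-descent step uses exactly those adjoints. Here is a hedged sketch in JAX (the tiny one-layer model, the data, and the learning rate are all made up for illustration):

    import jax
    import jax.numpy as jnp

    def loss(params, x, y):
        W, b = params
        pred = jnp.tanh(W @ x + b)          # a tiny one-layer "network"
        return jnp.mean((pred - y) ** 2)    # scalar-valued loss

    params = (jnp.ones((2, 3)), jnp.zeros(2))        # illustrative W, b
    x, y = jnp.arange(3.0), jnp.array([0.5, -0.5])

    # Adjoints of the parameters = gradient of the loss w.r.t. the parameters.
    grads = jax.grad(loss)(params, x, y)

    # Plain gradient descent: update each parameter with its adjoint.
    lr = 0.1
    params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)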

Oconnell answered 28/1, 2020 at 17:33

In neural network training, we want to find a set of weights w that minimizes the error E(N(w, x) - y), where x is the training input, y is the training output, N is the network, and E is some error function.

The standard way to do an optimization like this is gradient descent, which uses the derivative of the network, say N'. We could represent the network as a matrix product and do this manually with matrix calculus, but we can also write (automatic) algorithms.

Backpropagation is one such algorithm, and it has certain advantages. For example, it makes it easy to take the derivative with respect to only a selected sample of the weights, as is needed for stochastic gradient descent. It also specifies how the feed-forward values (the actual network activations) are saved so that they are easily accessible when computing the needed derivatives.

You should be able to find the exact code for the specific algorithm in textbooks as well as online.
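As a hedged illustration (a toy two-layer network written with jax.numpy; the shapes, the tanh activation, and the squared-error E are my own choices, not taken from any particular textbook), the forward pass saves the intermediate values that the backward pass then reuses, and the result matches reverse-mode AD:

    import jax
    import jax.numpy as jnp

    def forward(w1, w2, x):
        # Forward pass: keep the feed-forward values around for later.
        z1 = w1 @ x          # hidden pre-activation
        a1 = jnp.tanh(z1)    # hidden activation (saved for the backward pass)
        out = w2 @ a1        # network output N(w, x)
        return z1, a1, out

    def backprop(w1, w2, x, y):
        z1, a1, out = forward(w1, w2, x)
        # E = 0.5 * ||N(w, x) - y||^2, so dE/d(out) is just the residual.
        d_out = out - y
        # Walk backwards layer by layer, reusing the saved forward values.
        d_w2 = jnp.outer(d_out, a1)
        d_a1 = w2.T @ d_out
        d_z1 = d_a1 * (1.0 - jnp.tanh(z1) ** 2)   # derivative of tanh
        d_w1 = jnp.outer(d_z1, x)
        return d_w1, d_w2

    w1, w2 = 0.1 * jnp.ones((4, 3)), 0.1 * jnp.ones((2, 4))
    x, y = jnp.arange(3.0), jnp.array([1.0, -1.0])

    def error(w1, w2):
        return 0.5 * jnp.sum((forward(w1, w2, x)[2] - y) ** 2)

    # Sanity check: manual backprop agrees with reverse-mode AD.
    auto_w1, auto_w2 = jax.grad(error, argnums=(0, 1))(w1, w2)
    man_w1, man_w2 = backprop(w1, w2, x, y)
    print(jnp.allclose(auto_w1, man_w1), jnp.allclose(auto_w2, man_w2))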

Lineup answered 21/5, 2014 at 9:56