I have a question about Theano's implementation. How does Theano compute the gradient of an arbitrary loss function with the following call (T.grad)? Thank you for your help.
gparams = T.grad(cost, self.params)
Edit: this answer was wrong in saying that Theano uses Symbolic Differentiation. My apologies.
Theano implements reverse mode autodiff, but confusingly they call it "symbolic differentiation". This is misleading because symbolic differentiation is something quite different. Let's look at both.
Symbolic differentiation: given a graph representing a function f(x), it uses the chain rule to compute a new graph representing the derivative of that function, f'(x). They call this "compiling" f(x). One problem with symbolic differentiation is that it can output a very inefficient graph, but Theano automatically simplifies the output graph.
Example:
"""
f(x) = x*x + x - 2
Graph =
ADD
/ \
MUL SUB
/ \ / \
x x x 2
Chain rule for ADD=> (a(x)+b(x))' = a'(x) + b'(x)
Chain rule for MUL=> (a(x)*b(x))' = a'(x)*b(x) + a(x)*b'(x)
Chain rule for SUB=> (a(x)-b(x))' = a'(x) - b'(x)
The derivative of x is 1, and the derivative of a constant is 0.
Derivative graph (not optimized yet) =
ADD
/ \
ADD SUB
/ | | \
MUL MUL 1 0
/ | | \
1 x x 1
Derivative graph (after optimization) =
ADD
/ \
MUL 1
/ \
2 x
So: f'(x) = 2*x + 1
"""
Reverse mode autodiff: works in two passes over the computation graph: first a forward pass (from the inputs to the outputs), then a backward pass that applies the chain rule (if you are familiar with backpropagation, this is exactly how it computes gradients).
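Here is the same f(x) = x*x + x - 2 differentiated by hand in the reverse-mode style, just to show the two passes; this is plain Python for illustration, not Theano code.

# Reverse-mode autodiff for f(x) = x*x + x - 2, spelled out by hand.
def f_and_grad(x):
    # Forward pass: evaluate every node, keeping intermediate values.
    a = x * x                  # a = x*x
    b = x - 2                  # b = x - 2
    y = a + b                  # y = f(x)

    # Backward pass: propagate dy/d(node) from the output back to x.
    dy_dy = 1.0
    dy_da = dy_dy * 1.0        # y = a + b  =>  dy/da = 1
    dy_db = dy_dy * 1.0        # y = a + b  =>  dy/db = 1
    dy_dx = dy_da * (2 * x)    # a = x*x    =>  da/dx = 2*x
    dy_dx += dy_db * 1.0       # b = x - 2  =>  db/dx = 1
    return y, dy_dx

print(f_and_grad(3.0))         # (10.0, 7.0): f(3) = 10, f'(3) = 2*3 + 1 = 7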
See this great post for more details on the various automatic differentiation approaches and their pros and cons.
Look up automatic differentiation, in particular its reverse (backward) mode, which is what is used to evaluate gradients efficiently.
Theano is, as far as I can see, a hybrid between the code-rewriting and the operator-based approach. It uses operator overloading in Python to construct the computational graph, then optimizes that graph and generates from it (optimized) sequences of operations to evaluate the required kinds of derivatives.
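For completeness, a small usage sketch with Theano itself (assuming a standard Theano install): T.grad returns a new symbolic graph for the gradient, and theano.function then optimizes and compiles it.

import theano
import theano.tensor as T

x = T.dscalar('x')
cost = x * x + x - 2               # symbolic graph for f(x)
g = T.grad(cost, x)                # symbolic graph for f'(x)

f_prime = theano.function([x], g)  # optimize and compile the gradient graph
print(f_prime(3.0))                # 7.0, i.e. 2*3 + 1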
cost is probably a concatenation of such operations. The gradient is obtained by a simple application of the chain rule and the knowledge of the derivatives of the atomic operations. – Angularity