How does theano implement computing every function's gradient?

I have a question about Theano's implementation. How does Theano compute the gradient of an arbitrary loss function through the following call (T.grad)? Thank you for your help.

 gparams = T.grad(cost, self.params) 
Gerius answered 3/2, 2015 at 12:52 Comment(1)
Almost every operator you can use in Theano contains information about its own derivative. In your case, cost is probably a composition of such operations. The gradient is obtained by a simple application of the chain rule and the knowledge of the derivatives of the atomic operations. – Angularity
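For illustration, here is a hedged sketch of that pattern with a tiny logistic-regression cost; the names (X, y, W, b) are made up for this example rather than taken from the question, and Theano is assumed to be installed.

import numpy as np
import theano
import theano.tensor as T

# Hypothetical data and parameters, for illustration only.
X = T.dmatrix('X')                         # inputs
y = T.dvector('y')                         # binary targets
W = theano.shared(np.zeros(3), name='W')   # parameters as shared variables
b = theano.shared(0.0, name='b')

# The cost is a composition of differentiable atomic operations.
p = T.nnet.sigmoid(T.dot(X, W) + b)
cost = T.nnet.binary_crossentropy(p, y).mean()

# T.grad walks the graph of `cost`, applying the chain rule operation by
# operation, and returns one symbolic gradient expression per parameter.
gW, gb = T.grad(cost, [W, b])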

Edit: this answer was wrong in saying that Theano uses Symbolic Differentiation. My apologies.

Theano implements reverse mode autodiff, but confusingly they call it "symbolic differentiation". This is misleading because symbolic differentiation is something quite different. Let's look at both.

Symbolic differentiation: given a graph representing a function f(x), it uses the chain rule to compute a new graph representing the derivative of that function f'(x). They call this "compiling" f(x). One problem with symbolic differentiation is that it can output a very inefficient graph, but Theano automatically simplifies the output graph.

Example:

"""
f(x) = x*x + x - 2
Graph =
          ADD
         /   \
        MUL  SUB
       /  \  /  \
       x  x  x  2

Chain rule for ADD=> (a(x)+b(x))' = a'(x) + b'(x)
Chain rule for MUL=> (a(x)*b(x))' = a'(x)*b(x) + a(x)*b'(x)
Chain rule for SUB=> (a(x)-b(x))' = a'(x) - b'(x)
The derivative of x is 1, and the derivative of a constant is 0.

Derivative graph (not optimized yet) =
          ADD
         /   \
       ADD    SUB
      /  |    |  \
   MUL  MUL   1   0
  /  |  |  \
 1   x  x   1

Derivative graph (after optimization) =
          ADD
         /   \
       MUL    1
      /   \
     2     x

So: f'(x) = 2*x + 1
"""

Reverse-mode autodiff: works in two passes through the computation graph, first going forward through the graph (from the inputs to the outputs) to compute the value of each node, and then backward, applying the chain rule to propagate the gradient from the output back to the inputs (if you are familiar with backpropagation, this is exactly how it computes gradients).
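As a hedged illustration (a minimal sketch in plain Python, not Theano's actual code), reverse mode can be implemented with a forward pass that records local derivatives and a backward pass that accumulates them via the chain rule:

class Var:
    """A scalar value that remembers how it was computed."""
    def __init__(self, value, parents=()):
        self.value = value        # computed during the forward pass
        self.parents = parents    # pairs of (parent Var, local derivative)
        self.grad = 0.0           # filled in during the backward pass

    def _wrap(self, other):
        return other if isinstance(other, Var) else Var(other)

    def __add__(self, other):
        other = self._wrap(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __sub__(self, other):
        other = self._wrap(other)
        return Var(self.value - other.value, [(self, 1.0), (other, -1.0)])

    def __mul__(self, other):
        other = self._wrap(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, grad=1.0):
        # Chain rule: accumulate d(output)/d(self), then push it to the parents.
        self.grad += grad
        for parent, local in self.parents:
            parent.backward(grad * local)

x = Var(3.0)
f = x * x + x - 2       # forward pass: f(3) = 10
f.backward()            # backward pass
print(f.value, x.grad)  # 10.0 7.0  (f'(x) = 2*x + 1, so f'(3) = 7)

A real implementation such as Theano's works on whole tensors and visits the graph in reverse topological order rather than recursing eagerly, but the chain-rule bookkeeping is the same.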

See this great post for more details on the various automatic differentiation solutions and their pros and cons.

Transposition answered 22/8, 2016 at 14:33 Comment(0)

Look up automatic differentiation, and in particular its reverse (backwards) mode, which is what is used to evaluate gradients efficiently.

Theano is, as far as I can see, a hybrid between the code-rewriting and the operator-based approach. It uses operator overloading in Python to construct the computational graph, then optimizes that graph and generates from it (optimized) sequences of operations to evaluate the required kinds of derivatives.
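For example (a hedged sketch assuming Theano is installed), the whole pipeline of graph construction, differentiation, optimization, and code generation for the f(x) = x*x + x - 2 used above looks like this:

import theano
import theano.tensor as T

x = T.dscalar('x')
cost = x * x + x - 2                # operator overloading builds the symbolic graph
g = T.grad(cost, x)                 # a new symbolic graph for df/dx
f_prime = theano.function([x], g)   # graph optimization and code generation happen here
print(f_prime(3.0))                 # 7.0, i.e. 2*3 + 1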

Checkered answered 3/2, 2015 at 14:49 Comment(2)
Are you sure it's backwards-mode autodiff? Theano's documentation mentions "compiling" the graph representing the function and outputting another graph. Sounds like symbolic differentiation to me? – Transposition
Symbolic differentiation produces the formula of the derivative. Automatic/algorithmic differentiation produces a procedure to evaluate the derivative. The graph is the result of the parser and is the same in both cases. The difference is the size of the output: AD increases the size of the evaluation function by about a factor of 3, whereas symbolic differentiation can have exponential growth of the expression's length. – Checkered
