Understanding backpropagation in PyTorch

I am exploring PyTorch, and I do not understand the output of the following example:

# Initialize x, y and z to values 4, -3 and 5
x = torch.tensor(4., requires_grad = True)
y = torch.tensor(-3., requires_grad = True)
z = torch.tensor(5., requires_grad = True)

# Set q to sum of x and y, set f to product of q with z
q = x + y
f = q * z

# Compute the derivatives
f.backward()

# Print the gradients
print("Gradient of x is: " + str(x.grad))
print("Gradient of y is: " + str(y.grad))
print("Gradient of z is: " + str(z.grad))

Output

Gradient of x is: tensor(5.)
Gradient of y is: tensor(5.)
Gradient of z is: tensor(1.)

I have little doubt that my confusion stems from a minor misunderstanding. Can someone explain this step by step?

Medievalism answered 28/9, 2021 at 20:10 Comment(0)

I can provide some insights on the PyTorch aspect of backpropagation.

When manipulating tensors that require gradient computation (requires_grad=True), PyTorch keeps track of the operations performed on them and constructs the corresponding computation graph on the fly, so it can backpropagate through it later.
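For instance, here is a minimal sketch (the values are just placeholders) showing how the result of a tracked operation carries a grad_fn node, and how torch.no_grad() suppresses that tracking:

import torch

a = torch.tensor(2., requires_grad=True)
b = torch.tensor(3.)  # requires_grad defaults to False

c = a * b
print(c.requires_grad, c.grad_fn)  # True <MulBackward0 ...> -- the operation was recorded

with torch.no_grad():              # temporarily disable graph construction
    d = a * b
print(d.requires_grad, d.grad_fn)  # False None -- nothing was recorded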

Let's look at your example:

q = x + y 
f = q * z

Its corresponding computation graph can be represented as:

  x   -------\
              -> x + y = q ------\
  y   -------/                    -> q * z = f
                                 /
  z   --------------------------/ 

Where x, y, and z are called leaf tensors. The backward propagation consists of computing the gradients of x, y, and z, which correspond to dL/dx, dL/dy, and dL/dz respectively, where L is a scalar value based on the graph output f. Each operation performed needs to have a backward function implemented (which is the case for all mathematically differentiable PyTorch builtins). For each operation, this function is effectively used to compute the gradient of the output w.r.t. the input(s).
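As a rough illustration with the tensors from the question (is_leaf and grad_fn are standard tensor attributes, though the exact printed class names can vary between versions):

# q and f are intermediate results, so they each carry a grad_fn node of the graph;
# x, y, and z were created directly by the user, so they are leaves with no grad_fn.
print(x.is_leaf, q.is_leaf, f.is_leaf)  # True False False
print(q.grad_fn)                        # e.g. <AddBackward0 ...>
print(f.grad_fn)                        # e.g. <MulBackward0 ...>

# By default, only leaf tensors with requires_grad=True have their .grad populated by backward().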

The backward pass would look like this:

dL/dx <------\    
  x   -----\  \ 
            \ dq/dx 
             \  \ <--- dL/dq-----\
              -> x + y = q ----\  \
             /  /               \ df/dq
            / dq/dy              \  \ <--- dL/df ---
  y   -----/  /                   -> q * z = f
dL/dy <------/                   /  /
                                / df/dz
  z   -------------------------/  /
dL/dz <--------------------------/

The "d(outputs)/d(inputs)" terms for the first operator are: dq/dx = 1, and dq/dy = 1. For the second operator they are df/dq = z, and df/dz = q.

Backpropagation comes down to applying the chain rule: dL/dx = dL/dq * dq/dx = dL/df * df/dq * dq/dx. Intuitively, this decomposes dL/dx in the opposite order to the one backpropagation actually follows, which is to navigate the graph from the output back towards the leaves.

Without shape considerations, we start from dL/df = 1. In reality, dL/df has the shape of f (see my other answer linked below). This results in dL/dx = 1 * z * 1 = z. Similarly for y and z, we have dL/dy = z and dL/dz = q = x + y, which are the results you observed.
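If you want to see the intermediate dL/dq term as well, one way (a small sketch that re-runs the forward pass from the question) is to ask PyTorch to keep the gradient of the non-leaf tensor q with retain_grad():

import torch

x = torch.tensor(4., requires_grad=True)
y = torch.tensor(-3., requires_grad=True)
z = torch.tensor(5., requires_grad=True)

q = x + y
q.retain_grad()        # keep dL/dq even though q is not a leaf tensor
f = q * z

f.backward()           # implicitly starts from dL/df = 1

print(q.grad)          # tensor(5.)  -> dL/dq = df/dq = z
print(x.grad, y.grad)  # tensor(5.) tensor(5.)  -> dL/dq * dq/dx = z * 1
print(z.grad)          # tensor(1.)  -> df/dz = q = x + y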


Some answers I gave to related topics:

Chariness answered 28/9, 2021 at 23:33 Comment(0)

I hope you understand that when you do f.backward(), what you get in x.grad is df/dx.

In your case, f = (x + y) * z. So, simply (with elementary calculus):

df/dx = z, df/dy = z, and df/dz = x + y

If you plug in your values for x, y and z, that explains the outputs.
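As a quick sanity check, here is a sketch using torch.autograd.grad, which returns the gradients directly instead of storing them in .grad:

import torch

x = torch.tensor(4., requires_grad=True)
y = torch.tensor(-3., requires_grad=True)
z = torch.tensor(5., requires_grad=True)

f = (x + y) * z

# df/dx, df/dy and df/dz in one call
print(torch.autograd.grad(f, (x, y, z)))  # (tensor(5.), tensor(5.), tensor(1.))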

But this isn't really the "backpropagation" algorithm. This is just partial derivatives (which is all you asked about in the question).

Edit: If you want to know about the Backpropagation machinery behind it, please see @Ivan's answer.

Spider answered 28/9, 2021 at 20:29 Comment(1)
Thank you. I am working my way through a PyTorch course on DataCamp and this was one of the examples they used. – Medievalism

You just have to understand what the operations are and which partial derivatives you should use to arrive at each result. For example:

x = torch.tensor(1., requires_grad = True)
q = x*x
q.backward()

print("Gradient of x is: " + str(x.grad))

will give you 2, because the derivative of x*x is 2*x, which is 2 at x = 1.

If we take your example for x, we have:

q = x + y
f = q * z

which can be modified as:

f = (x+y)*z = x*z+y*z

If we take the partial derivative of f with respect to x, we end up with just z.

To arrive at this result, you have to treat all the other variables as constants and apply the derivative rules you already know.

But keep in mind that the process PyTorch uses to get these results is neither symbolic nor numeric differentiation; it is automatic differentiation, a computational method for efficiently obtaining gradients.
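To make the distinction concrete, here is a small sketch comparing the two: autograd gives the exact value df/dx = z, while numeric differentiation (a central finite difference) only approximates it:

import torch

def f(x, y, z):
    return (x + y) * z

x = torch.tensor(4., requires_grad=True)
y = torch.tensor(-3.)
z = torch.tensor(5.)

# Automatic differentiation: exact gradient from the recorded graph
f(x, y, z).backward()
print(x.grad)  # tensor(5.)

# Numeric differentiation: central finite-difference approximation
eps = 1e-3
with torch.no_grad():
    approx = (f(x + eps, y, z) - f(x - eps, y, z)) / (2 * eps)
print(approx)  # roughly 5.0, up to floating-point error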

Take a closer look at:

https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/slides/lec10.pdf

Sextant answered 29/9, 2021 at 16:54 Comment(0)
