I can provide some insights on the PyTorch aspect of backpropagation.
When manipulating tensors that require gradient computation (requires_grad=True), PyTorch keeps track of the operations needed for backpropagation and constructs the computation graph on the fly.
Let's look at your example:
q = x + y
f = q * z
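To make this concrete, here is a minimal runnable sketch (the values are made up, any shapes that broadcast would do) showing that PyTorch attaches a grad_fn node to the result of each tracked operation:

import torch

# hypothetical values; only requires_grad=True matters for graph tracking
x = torch.tensor([1., 2.], requires_grad=True)
y = torch.tensor([3., 4.], requires_grad=True)
z = torch.tensor([5., 6.], requires_grad=True)

q = x + y   # recorded with an AddBackward0 node
f = q * z   # recorded with a MulBackward0 node

print(x.is_leaf, q.is_leaf)  # True False
print(q.grad_fn)             # <AddBackward0 object at 0x...>
print(f.grad_fn)             # <MulBackward0 object at 0x...>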
Its corresponding computation graph can be represented as:
x -------\
          -> x + y = q ------\
y -------/                    -> q * z = f
                             /
z ---------------------------/
Where x, y, and z are called leaf tensors. The backward propagation consists of computing the gradients of x, y, and z, which correspond to dL/dx, dL/dy, and dL/dz respectively, where L is a scalar value based on the graph output f. Each operation performed needs to have a backward function implemented (which is the case for all mathematically differentiable PyTorch builtins). For each operation, this function is effectively used to compute the gradient of the output w.r.t. the input(s).
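To illustrate what such a backward function looks like, here is a sketch of a hand-written multiplication built on torch.autograd.Function (this is not how the builtin * is actually implemented, just an equivalent illustration):

import torch

class MyMul(torch.autograd.Function):
    # A custom op must pair a forward with a backward.
    @staticmethod
    def forward(ctx, q, z):
        ctx.save_for_backward(q, z)  # stash the inputs needed by backward
        return q * z

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output is dL/df; the local derivatives are df/dq = z and df/dz = q
        q, z = ctx.saved_tensors
        return grad_output * z, grad_output * q

Calling MyMul.apply(q, z) in place of q * z would yield the same gradients.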
The backward pass would look like this:
dL/dx <------\
x -----\      \
        \    dq/dx
         \      \  <--- dL/dq -----\
          -> x + y = q ----\        \
         /      /           \     df/dq
        /    dq/dy           \        \  <--- dL/df ---
y -----/      /               -> q * z = f
dL/dy <------/               /          /
                            /         df/dz
z -------------------------/         /
dL/dz <-----------------------------/
The "d(outputs)/d(inputs)"
terms for the first operator are: dq/dx = 1
, and dq/dy = 1
. For the second operator they are df/dq = z
, and df/dz = q
.
Backpropagation comes down to applying the chain rule: dL/dx = dL/dq * dq/dx = dL/df * df/dq * dq/dx. Intuitively, we decompose dL/dx in the opposite order to the one backpropagation actually follows, which is to navigate the graph bottom up, from the output back to the inputs.
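As a sanity check, this decomposition can be reproduced step by step with torch.autograd.grad, again with made-up values, using ones_like(f) in the role of dL/df:

import torch

x = torch.tensor([1., 2.], requires_grad=True)
y = torch.tensor([3., 4.], requires_grad=True)
z = torch.tensor([5., 6.], requires_grad=True)

q = x + y
f = q * z

# dL/dq = dL/df * df/dq, taking dL/df = ones_like(f), i.e. L = f.sum()
dL_dq, = torch.autograd.grad(f, q, grad_outputs=torch.ones_like(f),
                             retain_graph=True)
# dL/dx = dL/dq * dq/dx, pushing dL/dq through the first operation
dL_dx, = torch.autograd.grad(q, x, grad_outputs=dL_dq)

print(dL_dq)  # tensor([5., 6.]) == z, since df/dq = z
print(dL_dx)  # tensor([5., 6.]) == z too, since dq/dx = 1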
Without shape considerations, we start from dL/df = 1. In reality dL/df has the shape of f (see my other answer linked below). This results in dL/dx = 1 * z * 1 = z. Similarly for y and z, we have dL/dy = z and dL/dz = q = x + y, which are the results you observed.
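The same result can be checked end to end with backward() (made-up values again; since f is not a scalar here, an explicit dL/df must be passed, and ones is equivalent to choosing L = f.sum()):

import torch

x = torch.tensor([1., 2.], requires_grad=True)
y = torch.tensor([3., 4.], requires_grad=True)
z = torch.tensor([5., 6.], requires_grad=True)

q = x + y
f = q * z

f.backward(torch.ones_like(f))  # dL/df = 1 for every element of f

print(x.grad)  # tensor([5., 6.]) == z
print(y.grad)  # tensor([5., 6.]) == z
print(z.grad)  # tensor([4., 6.]) == x + y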
Some answers I gave to related topics: