Autograd For Systems Interviews
For a systems role, I do not need a model-zoo tour. I need a concrete explanation of reverse-mode autodiff, how a tiny engine stores the computation graph, and why tensor placement and data movement matter once this becomes PyTorch on a GPU.
Thirty-second answer. Reverse-mode autodiff builds a graph during the forward pass. Each operation stores a local rule for propagating an upstream gradient to its inputs. Backward seeds the final scalar output with gradient 1, walks nodes in reverse topological order, and accumulates gradients into parameters by the chain rule.
The Scalar Value Object
A micrograd-style engine starts with a scalar `Value` object:
Value:
data
grad
parents / previous nodes
operation label
local backward function
Operations create new `Value` nodes and remember how to backpropagate through themselves. For example:
- If `z = x + y`, then `dz/dx = 1` and `dz/dy = 1`.
- If `z = x * y`, then `dz/dx = y` and `dz/dy = x`.
- If `z = x^n`, then `dz/dx = n * x^(n-1)`.
- If `z = relu(x)`, then the local gradient is `1` when `x > 0` and `0` otherwise.
Why Topological Order
The graph is a DAG. A node's gradient may receive contributions from multiple downstream paths, so backward must visit children before parents. A topological traversal records a legal dependency order; reversing it gives the backpropagation order.
Gradients accumulate with `+=`, not assignment, because a single value can influence the output along multiple paths:
d = a * b
e = a + d
f = d * e
Here `d` affects `f` directly and through `e`, and `a` affects the result both directly and through `d`.
Why Seed The Output With 1
If the final scalar loss is `L`, then `dL/dL = 1`. Backward starts from that identity and repeatedly applies the chain rule through local operations.
loss.grad = 1.0
for node in reversed(topo):
node._backward()
Before a new backward pass, gradients must be reset. Otherwise, gradients from the previous training step remain and contaminate the next update.
From Value To MLP
A tiny neural network stack can be built from the same scalar object:
Neuron:
weights: list[Value]
bias: Value
activation: dot(weights, inputs) + bias
Layer:
list[Neuron]
MLP:
list[Layer]
parameters(): recursively collect weights and biases
The training loop is conceptually simple: forward pass, scalar loss, zero gradients, backward pass, gradient descent update, repeat.
PyTorch Bridge
PyTorch scales this idea from scalar toy graphs to tensor operations and optimized kernels. The review vocabulary I want warm:
- tensors, shapes, broadcasting
- `requires_grad`, `grad_fn`, `.backward()`, `.grad`
- `nn.Module`, `parameters()`, `zero_grad()`
- optimizer loop: forward, loss, backward, step
- CPU vs GPU tensor placement
- why host-device data movement can dominate runtime
- why PyTorch calls CUDA kernels and libraries underneath
Systems Angle
The systems questions are not only "what is backprop?" They are also: where does the data live, when does it move, what kernels are launched, what gets fused, and whether the computation is bandwidth or compute limited.
For GPU execution, moving tensors back and forth between CPU and GPU inside a training loop can erase the benefit of acceleration. The right mental model is ownership and placement: keep tensors on the device across repeated operations, move only at boundaries, and profile before guessing.
Review Questions
- What is a computation graph?
- Why does reverse-mode need reverse topological order?
- Why does each operation store a local backward closure?
- Why do gradients accumulate with `+=`?
- Why does `zero_grad()` exist?
- How does a toy `Neuron -> Layer -> MLP` map to a small PyTorch-like API?
- Why can CPU/GPU data movement dominate simple models?