Backpropagation

April 7, 2024

Backpropagation is used to calculate the gradient of the loss function with respect to the parameters. Technically, it is the combination of the forward pass, which computes the activations, and the backward pass, which computes the gradients.

Forward Pass

A forward pass is the process of passing data sequentially through the layers of a model.

Take, for example, the model below:

$$f = f^D \circ f^{D-1} \circ \cdots \circ f^1$$

In this model, $f$ is composed of multiple individual mappings $f^d$, called the layers of the network. The model works by starting with the initial value $x^0 = x$ and then iteratively applying the mappings to it (a short code sketch of this loop appears after the list below).

The individual mappings can be expressed in the equation below:

$$x^d = f^d(x^{d-1}), \qquad d = 1, \dots, D$$

where the final value is:

$$x^D = f(x)$$

In this equation:

  • $x^d$ is an intermediate value known as an activation.
  • $D$ represents the depth of the model.
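To make the iteration above concrete, here is a minimal sketch of a forward pass in NumPy. The helper name `forward_pass` and the two-layer example (an affine map followed by a ReLU) are illustrative assumptions, not part of the original note.

```python
import numpy as np

def forward_pass(x, layers):
    """Apply the layers f^1, ..., f^D in order, keeping every activation x^d."""
    activations = [x]           # x^0 = x
    for layer in layers:
        x = layer(x)            # x^d = f^d(x^{d-1})
        activations.append(x)
    return activations          # [x^0, x^1, ..., x^D]

# Hypothetical two-layer model: an affine map followed by a ReLU.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
layers = [lambda x: W @ x, lambda x: np.maximum(x, 0.0)]
activations = forward_pass(rng.normal(size=4), layers)
```

Storing every intermediate activation matters because the backward pass (next section) reuses them when computing the layer Jacobians.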

Backward Pass

After the forward pass, the gradient is calculated through the backward pass.

In the backward pass, the gradient at each layer is calculated by working backward from the final layer. In other words, the gradients are computed recursively from the output layer back to the input layer.

This can be seen in the equation below:

$$\frac{\partial L}{\partial x^{d-1}} = \frac{\partial L}{\partial x^{d}} \, \frac{\partial x^{d}}{\partial x^{d-1}}$$

In this equation:

  • $\frac{\partial x^{d}}{\partial x^{d-1}}$ is the Jacobian matrix representing how the output of layer $d$ changes with respect to its input.
  • $L$ is the loss function.

Therefore, the gradient with respect to the activations of the previous layer, $x^{d-1}$, is obtained from the gradient with respect to the activations of layer $d$.
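As a rough illustration of this recursion (a sketch under assumptions, not the note's own code: the layers, the all-ones output gradient, and the helper name `backward_pass` are made up for the example), the code below multiplies an upstream gradient by the explicit Jacobian of each layer, working from the last layer back to the first, matching the equation above.

```python
import numpy as np

def backward_pass(grad_output, jacobians):
    """Propagate dL/dx^D backward through the Jacobians dx^d/dx^{d-1}."""
    grads = [grad_output]                # dL/dx^D
    for J in reversed(jacobians):
        grads.append(grads[-1] @ J)      # dL/dx^{d-1} = dL/dx^d * dx^d/dx^{d-1}
    return list(reversed(grads))         # [dL/dx^0, ..., dL/dx^D]

# Hypothetical example matching the forward-pass sketch above.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x0 = rng.normal(size=4)
x1 = W @ x0                              # affine layer
x2 = np.maximum(x1, 0.0)                 # ReLU layer
jacobians = [W, np.diag((x1 > 0).astype(float))]     # dx^1/dx^0, dx^2/dx^1
grads = backward_pass(np.ones(3), jacobians)          # assume dL/dx^2 is all ones
grad_x0 = grads[0]                       # dL/dx^0
```

In practice, frameworks rarely materialize the full Jacobian of each layer; they compute the vector-Jacobian product directly, which is what keeps the backward pass affordable.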

It is also important to note the vanishing gradient problem. Learn more about the vanishing gradient problem here: [[The Vanishing Gradient Problem]].