The goal of training is to find the trainable parameter values that minimize the loss; this search is an optimization problem.
One popular optimization method is gradient descent. The model minimizes the loss through gradient descent by calculating the gradient of the loss function at each step and then moving the parameters in the opposite direction.
Here is a step-by-step overview of how gradient descent works.
Overview of Gradient Descent
Gradient descent starts by initializing the trainable parameters with random values $w_0$. It then iteratively improves on this random initialization, one gradient step at a time.
At each gradient step, the model calculates the gradient of the loss function with respect to the parameters. In other words, the model finds the direction that increases the loss the most.
The gradient is calculated through backpropagation. Learn more about backpropagation here: [[Backpropagation]].
To minimize the loss, the model simply adjusts the parameters to move in the opposite direction of the calculated gradient.
This process can be expressed through the following update rule:

$$w_{n+1} = w_n - \eta \, \nabla L(w_n)$$
In this equation:
- $w_{n+1}$ denotes the trainable parameters at the next gradient step $n+1$.
- $w_n$ denotes the trainable parameters at the current gradient step $n$.
- $\nabla L(w_n)$ is the gradient of the loss function with respect to the parameters, evaluated at $w_n$.
- $\eta$ is the learning rate, which is a hyperparameter that controls the size of each gradient step.
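To make the update rule concrete, here is a minimal sketch in Python that minimizes the simple quadratic loss $L(w) = (w - 3)^2$, whose gradient $2(w - 3)$ is known in closed form (so no backpropagation is needed); the loss, initialization range, and step count are illustrative choices, not part of the text above.

```python
import random

def loss(w):
    # Simple quadratic loss with its minimum at w = 3 (illustrative choice).
    return (w - 3.0) ** 2

def grad(w):
    # Closed-form gradient of the loss; a real model would obtain this
    # via backpropagation instead.
    return 2.0 * (w - 3.0)

eta = 0.1                     # learning rate (hyperparameter)
w = random.uniform(-10, 10)   # random initialization w_0

for step in range(50):
    w = w - eta * grad(w)     # w_{n+1} = w_n - eta * gradient

print(f"w after 50 steps: {w:.4f}, loss: {loss(w):.6f}")
```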
The learning rate ($\eta$) determines how big each gradient step is, and therefore how quickly gradient descent proceeds. Choosing it well is important for the following reasons:
- If the learning rate is too small, progress is very slow and the model might get stuck in an early local minimum.
- If the learning rate is too big, the model might overshoot and bounce around a minimum without being able to descend into it (both failure modes are demonstrated in the sketch below).
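Both failure modes can be seen by rerunning the loop above with different learning rates. On the quadratic loss from the earlier sketch, each step multiplies the error $(w - 3)$ by $(1 - 2\eta)$, so a tiny $\eta$ barely makes progress while $\eta > 1$ makes the iterates oscillate with growing amplitude; the specific values below are illustrative.

```python
def grad(w):
    # Gradient of the quadratic loss (w - 3)^2 from the previous sketch.
    return 2.0 * (w - 3.0)

def descend(eta, w=10.0, steps=20):
    for _ in range(steps):
        w = w - eta * grad(w)   # one gradient step
    return w

for eta in (0.001, 0.1, 1.1):
    print(f"eta = {eta}: w after 20 steps = {descend(eta):.4f}")
# eta = 0.001 barely moves from the start, eta = 0.1 lands near the
# minimum at 3, and eta = 1.1 oscillates with growing amplitude.
```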
Stochastic Gradient Descent (SGD)
The loss calculated in gradient descent is essentially the average of the individual losses of each sample in the dataset. Having to calculate the gradient for every single data point at every step is computationally heavy and inefficient.
Stochastic Gradient Descent (SGD) is a more efficient variation of gradient descent that updates the trainable parameters based on gradients computed over small random subsets of the data known as batches (or mini-batches).
In other words, instead of summing the individual losses over every data point in the entire dataset, SGD calculates the gradient from a partial sum. Even though this gradient is noisier and less accurate, it is still an unbiased estimator of the full-dataset gradient.
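As a rough sketch of this idea (the synthetic data, batch size, and learning rate are illustrative assumptions), the loop below fits a one-dimensional linear model $y \approx wx + b$ with mini-batch SGD, computing the mean-squared-error gradient over a random batch at each step instead of the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: y = 2x + 1 plus noise (illustrative).
X = rng.uniform(-1, 1, size=1000)
y = 2.0 * X + 1.0 + 0.1 * rng.standard_normal(1000)

w, b = 0.0, 0.0               # trainable parameters
eta, batch_size = 0.1, 32     # learning rate and batch size (assumed values)

for step in range(500):
    # Draw a random mini-batch instead of using the full dataset.
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    err = (w * xb + b) - yb
    # Gradients of the mean squared error over this batch only.
    grad_w = 2.0 * np.mean(err * xb)
    grad_b = 2.0 * np.mean(err)
    w -= eta * grad_w
    b -= eta * grad_b

print(f"w = {w:.3f} (true 2.0), b = {b:.3f} (true 1.0)")
```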
The most popular optimizer built on SGD is Adam, which extends the basic SGD update with momentum and per-parameter adaptive step sizes.
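Adam keeps exponentially decaying averages of the gradient (momentum) and of its square (scale), applies bias corrections, and uses them to adapt the step size per parameter. The sketch below implements the standard Adam update on the quadratic loss used earlier; the hyperparameter values are the commonly cited defaults, and the loss is an illustrative choice.

```python
import math

def grad(w):
    # Gradient of the quadratic loss (w - 3)^2 (illustrative).
    return 2.0 * (w - 3.0)

w = 10.0
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8   # common default hyperparameters
m, v = 0.0, 0.0   # running averages of the gradient and the squared gradient

for t in range(1, 201):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment (scale) estimate
    m_hat = m / (1 - beta1 ** t)           # bias corrections for the zero init
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(f"w after 200 Adam steps: {w:.4f}")  # approaches the minimum at w = 3
```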