Training a model essentially involves adjusting its trainable parameters, via an optimizer, to minimize a loss.
Loss
The first step in training a model is choosing the correct loss function.
For regression tasks, the Mean Squared Error (MSE) is often used. The MSE computes the loss by averaging the squared differences between the predicted values and the ground truth. Learn more about the MSE here: [[Mean Squared Error]].
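As a minimal sketch (assuming PyTorch; the tensor values are made up), the MSE can be computed by hand or with the built-in criterion:

```python
import torch

# Made-up predictions and ground-truth targets.
pred = torch.tensor([2.5, 0.0, 2.1])
target = torch.tensor([3.0, -0.5, 2.0])

# MSE by hand: the mean of the squared differences.
mse_manual = ((pred - target) ** 2).mean()

# The same value via PyTorch's built-in criterion.
mse_builtin = torch.nn.MSELoss()(pred, target)

print(mse_manual.item(), mse_builtin.item())  # both 0.17
```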
For density modeling, the loss function is the negative sum of the log-probabilities, i.e. the negative log-likelihood. For example, if the model assigns high log-probabilities to the data, it considers the data very likely, so the loss is small.
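A tiny sketch of this (assuming PyTorch; the log-probability values are made up):

```python
import torch

# Made-up log-probabilities the model assigns to three data points.
# Values close to 0 mean the model considers the data very likely.
log_probs = torch.tensor([-0.1, -0.2, -0.05])

# Negative log-likelihood: negate the sum of the log-probabilities.
nll = -log_probs.sum()
print(nll.item())  # approximately 0.35 (small loss: the data is deemed likely)
```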
For classification tasks, the Cross Entropy Loss is usually used. Learn more about the Cross Entropy Loss here: [[Cross Entropy Loss]].
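As a minimal sketch (assuming PyTorch; the logits and labels are made up):

```python
import torch
import torch.nn.functional as F

# Made-up unnormalized scores (logits) for 3 samples over 4 classes.
logits = torch.randn(3, 4)
labels = torch.tensor([0, 2, 1])  # ground-truth class indices

# Cross-entropy internally applies log-softmax, then takes the
# negative log-probability of the correct class, averaged over samples.
loss = F.cross_entropy(logits, labels)
print(loss.item())
```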
The Contrastive Loss is often used in setups where the goal is to measure the similarities and differences between data samples. Learn more about the Contrastive Loss here: [[Contrastive Loss]].
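One common formulation pulls similar pairs together and pushes dissimilar pairs at least a margin apart. A sketch of that variant (assuming PyTorch; the function name, embeddings, labels, and margin are all hypothetical):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, is_similar, margin=1.0):
    """Pairwise contrastive loss: pull similar pairs together,
    push dissimilar pairs at least `margin` apart."""
    d = F.pairwise_distance(emb1, emb2)
    loss = is_similar * d.pow(2) + (1 - is_similar) * F.relu(margin - d).pow(2)
    return loss.mean()

emb1 = torch.randn(8, 16)  # made-up embedding batch
emb2 = torch.randn(8, 16)
is_similar = torch.randint(0, 2, (8,)).float()  # 1 = similar pair, 0 = dissimilar
print(contrastive_loss(emb1, emb2, is_similar).item())
```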
It is important to remember that the loss is only a simplified proxy for the true performance of the model, chosen primarily because it is easy to optimize. The true performance of a model should be measured with other metrics.
Optimization
The next step in training a model is optimization. This is the process of adjusting the model's trainable parameters to minimize the loss.
A common way to do this is through gradient descent, where the model minimizes the loss by calculating the gradient of the loss function at each step and then moving in the opposite direction of that gradient. Learn more about gradient descent here: [[Gradient-Descent and SGD]].
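As a toy illustration (plain Python; the loss function and learning rate are made up), one gradient-descent loop on a one-dimensional loss looks like this:

```python
# Gradient descent on the toy loss f(w) = (w - 3)^2, whose gradient is 2(w - 3).
w = 0.0
lr = 0.1  # learning rate (made up)
for _ in range(50):
    grad = 2 * (w - 3)  # gradient of the loss at the current w
    w -= lr * grad      # move opposite to the gradient
print(w)  # converges toward 3, the minimizer of the loss
```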
The model calculates the gradient of the loss function with respect to the parameters through a process called backpropagation. Learn more about backpropagation here: [[Backpropagation]].
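In practice, frameworks such as PyTorch perform backpropagation automatically. A minimal training-step sketch (the model, data, and hyperparameters are all stand-ins):

```python
import torch

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

x = torch.randn(32, 3)  # made-up inputs
y = torch.randn(32, 1)  # made-up targets

for step in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(x), y)  # forward pass
    loss.backward()                # backpropagation: d(loss)/d(parameters)
    optimizer.step()               # gradient-descent update
```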
A historic problem tied to backpropagation is the vanishing gradient problem. Learn more about the vanishing gradient problem here: [[The Vanishing Gradient Problem]].
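The effect is easy to demonstrate: each sigmoid's derivative is at most 0.25, so in a deep stack of sigmoid layers the gradients shrink multiplicatively during backpropagation. A quick sketch (assuming PyTorch; the layer sizes and depth are arbitrary):

```python
import torch
import torch.nn as nn

# 20 stacked sigmoid layers (sizes and depth are arbitrary).
layers = nn.Sequential(*[nn.Sequential(nn.Linear(16, 16), nn.Sigmoid())
                         for _ in range(20)])
x = torch.randn(1, 16, requires_grad=True)
layers(x).sum().backward()
print(x.grad.abs().mean().item())  # typically vanishingly small
```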
Capacity
During training, it is important to consider the capacity of a model, i.e. its ability to capture patterns in the data. Finding the right capacity is important for preventing both underfitting (too little capacity) and overfitting (too much).
Learn more about the model's capacity here: [[Capacity]].
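Capacity is often loosely tied to the number of trainable parameters. As a rough illustration (assuming PyTorch; the layer sizes are hypothetical):

```python
import torch.nn as nn

def num_params(model):
    return sum(p.numel() for p in model.parameters())

# Two hypothetical MLPs with the same input/output but different capacity.
small = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 1))
large = nn.Sequential(nn.Linear(10, 512), nn.ReLU(), nn.Linear(512, 1))

print(num_params(small), num_params(large))  # 97 vs. 6145 trainable parameters
```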
Training Protocols
As the loss and optimization steps above suggest, training typically follows certain protocols. These protocols can be extended to include validation sets, hyperparameter tuning, and fine-tuning.
Learn more about training protocols here: [[Training Protocols]].
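For instance, a held-out validation set estimates performance on data the optimizer never sees. A sketch (assuming PyTorch, with random stand-in data and an untrained model):

```python
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# Random stand-in data: 1000 samples split 80/20 into train and validation.
data = TensorDataset(torch.randn(1000, 3), torch.randn(1000, 1))
train_set, val_set = random_split(data, [800, 200])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

model = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()

# Validation pass: measure loss on held-out data without updating parameters.
model.eval()
with torch.no_grad():
    val_loss = sum(criterion(model(x), y).item() for x, y in val_loader) / len(val_loader)
print(val_loss)
```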
For NLP models, it is also important to tokenize the input sequence before training. Learn more about tokenization here: [[Tokenization]].
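A word-level sketch of the idea (plain Python; real pipelines usually use subword tokenizers such as BPE, but the principle is the same: text is mapped to integer IDs):

```python
text = "training a model involves minimizing a loss"
vocab = {word: idx for idx, word in enumerate(sorted(set(text.split())))}
token_ids = [vocab[word] for word in text.split()]
print(token_ids)  # [5, 0, 4, 1, 3, 0, 2]
```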
Autoregressive Models
A common model architecture that can be used to demonstrate training concepts is the Autoregressive Model, which predicts each element of a sequence based on the previous elements. It underlies familiar applications in fields such as NLP and computer vision.
Learn more about Autoregressive Models here: [[Autoregressive Models]].
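A sketch of autoregressive generation (assuming PyTorch; a tiny untrained GRU stands in for a trained sequence model, and the vocabulary size, hidden sizes, and start token are all made up):

```python
import torch
import torch.nn as nn

vocab_size = 100
embed = nn.Embedding(vocab_size, 32)
rnn = nn.GRU(32, 64, batch_first=True)
head = nn.Linear(64, vocab_size)

sequence = [0]  # hypothetical start token
for _ in range(10):
    x = embed(torch.tensor([sequence]))  # (1, len, 32): the prefix so far
    h, _ = rnn(x)                        # hidden states summarize the prefix
    logits = head(h[:, -1])              # distribution over the next element
    next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
    sequence.append(next_token)
print(sequence)  # each element was sampled conditioned on the ones before it
```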
Additionally, navigate through the links below to learn more about the DL training process: