Essential Deep-Learning Theory

April 8, 2024

I'm sure we have all been, at some point in time, unable to understand an article on Substack or a paper on arXiv because the theory or the terminology is confusing. Through this short article, I hope to demystify, to a certain degree, some of the basic theory and jargon behind deep learning. At the very least, if you encounter some obscure new concept in the future, you should know where to look.

Since the theory of deep learning leans heavily on multivariable calculus and linear algebra, some basic knowledge of both will be helpful!

In order to add credibility, I loosely based the structure of this article on The Little Book of Deep Learning, a widely respected reference for deep learning theory.

Part 1: Machine Learning:

Deep learning technically falls under statistical machine learning, as both are based on the ability to learn representations from data.

Machine learning works by using a parametric model: a model with trainable parameters that can be adjusted based on the training dataset.

How well the model performs is formalized with a loss. Therefore, training a model is essentially adjusting the trainable parameters to minimize the loss.
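To make this concrete, here is a minimal sketch (a toy example of my own, not taken from the book) of a one-parameter model and a loss that measures how far its predictions are from the training targets:

```python
import numpy as np

# A tiny parametric model: its predictions depend on one trainable parameter w
def model(x, w):
    return w * x

# The loss formalizes "how well the model performs": lower means better predictions
def loss(w, x, y):
    return np.mean((model(x, w) - y) ** 2)

x = np.array([1.0, 2.0, 3.0])   # inputs
y = np.array([2.0, 4.0, 6.0])   # targets (generated with w = 2)

print(loss(1.0, x, y))  # ~4.67: w = 1 fits poorly
print(loss(2.0, x, y))  # 0.0:   w = 2 fits perfectly
```

Training is then just the search for the value of w that makes this loss as small as possible.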

A key concept to know is capacity. This is essentially the model's ability to capture patterns. Not making much sense? Don't worry, it will soon. A model with a high capacity can capture lots of patterns within the data - sometimes a little too many. When the dataset is very small and the model has a very high capacity, a phenomenon known as overfitting can occur, where the model begins to learn characteristics specific to the training data rather than patterns that generalize. On the other end of the spectrum, a model with a low capacity can result in underfitting, where the model can't capture enough detail and therefore has a very high loss. Something else that is helpful to know: fitting the data essentially means training the model on the data.

Machine learning can be organized into three main categories:

  • Regression: These models predict a continuous value given the input data.
  • Classification: These models predict the class that a given input belongs to. They usually do so by predicting a set of unnormalized scores corresponding to the categories for each input, turning these scores into probabilities through a softmax function, and then picking the category with the maximum score. If this seems confusing, no worries - I will go into more depth later on, and there is a short sketch right after this list.
  • Density Modeling: This is a form of unsupervised learning aimed at modeling the probability distribution of the data.
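To give a feel for the classification recipe above, here is a minimal softmax sketch (the scores are made up purely for illustration):

```python
import numpy as np

def softmax(scores):
    # Subtracting the max keeps the exponentials numerically stable
    # without changing the resulting probabilities
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

scores = np.array([2.0, 0.5, -1.0])   # unnormalized scores for 3 made-up classes
probs = softmax(scores)               # ~[0.79, 0.18, 0.04] - they now sum to 1
predicted_class = int(np.argmax(probs))
print(probs, predicted_class)         # the model picks class 0
```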

Now that we have a nice understanding of basic machine learning, we are ready to dive into some of the fundamental topics of deep learning.

Part 2: Deep Learning Basics:

Let me start off by defining what a tensor is. A tensor is essentially a generalized matrix or vector. Formally, it is a collection of scalars organized along a set number of dimensions. What you need to know is that it is the most common way data is expressed in deep learning. Input data, trainable parameters, and activations (intermediate results passed along between neurons - this will make more sense later, I promise) are all stored in tensors. This is mainly because the shape of a tensor is stored separately from its actual content, which makes it computationally easy to apply transformations to it.
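For example, here is what that looks like in PyTorch (the shapes are just illustrative):

```python
import torch

# A batch of 32 RGB images, each 64x64 pixels, stored as one 4-dimensional tensor
x = torch.randn(32, 3, 64, 64)
print(x.shape)        # torch.Size([32, 3, 64, 64])

# Reshaping only rewrites the shape metadata - the underlying values are untouched
y = x.view(32, -1)    # flatten each image into a vector of 3 * 64 * 64 = 12288 values
print(y.shape)        # torch.Size([32, 12288])
```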

Remember: training a model is basically modifying the trainable parameters to minimize the loss through an optimizer. Let's talk about losses first and then optimizers.

Let's first go over a few common types of losses without going too deep into the mathematics:

  • For simple regression tasks, the mean squared error loss is a great choice. It's basically just averaging the squared differences between the ground truths and the predicted values.
  • For classification tasks, we often use the cross-entropy loss. This is a bit more complicated, but it essentially takes the negative logarithm of the probability that the softmax function (remember!) assigns to the correct class, so the loss grows larger the further that probability is from 1. See the sketch after this list.
  • There is also the contrastive loss, which comes in very handy for setups where the goal is to measure the similarity or difference between data samples. Two points of the same class are pulled closer together in a shared latent-space representation, while those of different classes are pushed further apart.
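As a quick illustration, here is how the first two losses can be computed with PyTorch's built-in functions (the numbers are arbitrary toy values):

```python
import torch
import torch.nn.functional as F

# Mean squared error on a toy regression batch
preds   = torch.tensor([2.5, 0.0, 2.1])
targets = torch.tensor([3.0, -0.5, 2.0])
mse = F.mse_loss(preds, targets)            # mean of the squared differences

# Cross-entropy on a toy 3-class classification example
logits = torch.tensor([[2.0, 0.5, -1.0]])   # unnormalized scores
label  = torch.tensor([0])                  # index of the correct class
ce = F.cross_entropy(logits, label)         # softmax + negative log-probability

print(mse.item(), ce.item())
```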

It is important to note that while the loss is great for guiding the optimizer, it is only a proxy for what we actually care about - which is why the performance of a model is rarely reported as its loss, but rather through metrics such as accuracy.

Now, let's talk about optimization. One common way of optimizing - that is, finding the trainable parameters that minimize the loss - is gradient descent.

Gradient descent starts by initializing the trainable parameters with random values and then improves on them in small, gradual steps. At each step, it calculates the gradient of the loss with respect to the parameters - the direction in which the loss increases fastest. To minimize the loss, it then moves the parameters in the opposite direction (the direction that decreases the loss the most).
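Here is a minimal, hand-rolled sketch of gradient descent on the toy one-parameter model from earlier (the dataset and learning rate are made up):

```python
import numpy as np

# Toy dataset generated by y = 2x: gradient descent should recover w close to 2
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

w  = np.random.randn()   # step 1: random initialization
lr = 0.05                # learning rate: how big each step is

for step in range(200):
    preds = w * x
    grad = np.mean(2 * (preds - y) * x)   # gradient of the MSE loss w.r.t. w
    w -= lr * grad                        # step 2: move against the gradient

print(w)   # ends up very close to 2
```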

There is an important hyperparameter called the learning rate, which determines how big each of the steps described above is. If it is too small, training is very slow and the model might get stuck in a poor local minimum. If it is too big, the parameters might bounce around the minimum, unable to descend further into it.

One problem with gradient descent is that calculating the gradient over every single datapoint in the dataset is very computationally expensive. Therefore, there is another optimization method known as stochastic gradient descent (SGD), which updates the trainable parameters based on gradients computed over small, random subsets of the data known as batches. In other words, instead of computing the full loss at each step, it computes a partial sum of losses over just the batch.
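A sketch of the same toy problem trained with mini-batches might look like this (the batch size and step count are arbitrary choices):

```python
import numpy as np

# A larger, noisy dataset scattered around y = 2x
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1000)
y = 2.0 * x + rng.normal(0, 0.1, size=1000)

w, lr, batch_size = rng.standard_normal(), 0.1, 32

for step in range(500):
    idx = rng.choice(len(x), size=batch_size, replace=False)  # pick a random batch
    xb, yb = x[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)   # gradient estimated on the batch only
    w -= lr * grad

print(w)   # still close to 2, but each update only looked at 32 points
```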

The most popular optimizer in practice is actually a variant of SGD called the Adam optimizer, which adapts the step size for each parameter using running averages of past gradients.
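In practice you rarely write the update rule yourself - the deep learning library provides it. Here is what one training step with Adam looks like in PyTorch (the model, data, and learning rate are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # a toy one-layer model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam handles the update rule
loss_fn = nn.MSELoss()

x = torch.randn(32, 10)   # a random batch of 32 inputs
y = torch.randn(32, 1)    # and their (random) targets

optimizer.zero_grad()           # clear gradients from the previous step
loss = loss_fn(model(x), y)     # forward pass + loss
loss.backward()                 # backpropagation computes the gradients
optimizer.step()                # Adam updates the trainable parameters
```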