Autoregressive models predict each element of a sequence based on the previous elements. They are widely used in both natural language processing (NLP) and computer vision.
Autoregressive models predict the next token by modeling its probability conditioned on the tokens that came before it. Using the chain rule of probability, the joint probability of a sequence factorizes into a product of these conditional probabilities: each token's probability depends on the previous tokens in the sequence.
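Concretely, for a sequence of tokens $x_1, x_2, \dots, x_T$, the chain rule factorization is:

$$
p(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
$$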
The model is represented as a function $f$ that takes in a sequence of previous tokens and produces a vector of logits: unnormalized scores that, once passed through a softmax, become a probability distribution over the next token.
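As a minimal sketch (the vocabulary size and logit values below are placeholders, not tied to any particular model), the logits produced by $f$ can be converted into next-token probabilities with a softmax:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical logits over a tiny 4-token vocabulary.
logits = np.array([2.0, 0.5, -1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())  # probabilities over the next token, summing to 1
```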
Training an autoregressive model involves minimizing the cross-entropy loss between the predicted distribution and the actual next token. Performance is typically measured with perplexity, which can be interpreted as the average number of choices the model is effectively weighing when predicting the next token. A lower perplexity corresponds to a more confident model.
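Perplexity is simply the exponential of the average cross-entropy loss over the sequence:

$$
\text{PPL} = \exp\!\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_1, \dots, x_{t-1})\right)
$$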
Causal Autoregressive Structure
Traditional autoregressive models are often slow to train because the model processes a new input at every time step, one step at a time. For very long sequences, this sequential processing becomes a bottleneck.
The solution is to predict the logits for all time steps in a single pass. A causal structure ensures that the prediction at each position is based only on past time steps, thereby preserving the model's autoregressive nature.
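As an illustrative sketch (not tied to any particular architecture), one common way to enforce this causal structure is a lower-triangular mask that blocks each position from looking at future positions:

```python
import numpy as np

def causal_mask(seq_len):
    # Entry (i, j) is True when position i may use position j,
    # i.e. only positions j <= i (the current and earlier steps).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```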
RNNs, LSTMs, and N-grams
Two common examples of autoregressive models are the Recurrent Neural Network (RNN) and the Long Short-Term Memory (LSTM) network. Transformers, such as the Generative Pre-trained Transformer (GPT), are also autoregressive by nature.
While commonly associated with autoregressive models, the n-gram model is different: it conditions on a fixed context of only the $n-1$ most recent tokens (hence the name n-gram), whereas neural autoregressive models can, in principle, consider all previous tokens.
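As a toy illustration of this fixed context (the corpus and whitespace tokenization here are chosen purely for brevity), a bigram model ($n = 2$) predicts the next word from counts of the single preceding word:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat slept".split()

# Count how often each word follows each context word (context size n-1 = 1).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

# Conditional probabilities for the next word after "the".
context = "the"
total = sum(bigram_counts[context].values())
probs = {word: count / total for word, count in bigram_counts[context].items()}
print(probs)  # e.g. {'cat': 0.67, 'mat': 0.33}
```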
Autoregressive Generative Models
Autoregressive generative models generate new sequences by repeatedly choosing the next token from the probability distribution they have learned, conditioned on the tokens produced so far. This process is known as sampling.
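Below is a minimal sketch of such a sampling loop. The `next_token_logits` function is a hypothetical stand-in for a trained model; in practice it would be a neural network mapping the previous tokens to a logit vector.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 5

def next_token_logits(sequence):
    # Hypothetical stand-in for a trained model's forward pass.
    return rng.normal(size=VOCAB_SIZE)

def sample_sequence(length, temperature=1.0):
    sequence = []
    for _ in range(length):
        logits = next_token_logits(sequence) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Draw the next token from the learned distribution (sampling),
        # then feed it back in as context for the next step.
        next_token = rng.choice(VOCAB_SIZE, p=probs)
        sequence.append(int(next_token))
    return sequence

print(sample_sequence(8))
```

The temperature parameter rescales the logits before the softmax: values below 1 make sampling more conservative, while values above 1 make it more diverse.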