GPU Basics
The Graphics Processing Unit (GPU) was originally created for real-time image synthesis, the task of rendering images quickly enough for interactive display. This required the GPU to have a highly parallel architecture - meaning that it could handle many small tasks simultaneously.
Deep learning computations, dominated by operations such as matrix multiplication, consist of many independent calculations that can be performed simultaneously. This makes the GPU's highly parallel architecture very useful for DL.
A GPU has thousands of smaller processing units working in parallel, supported by fast on-chip memory. Frequently used data is kept close to the processing units in cache memory - a small, fast, temporary storage unit - so it can be stored and accessed very quickly.
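As a minimal sketch (assuming PyTorch; the GPU path requires a CUDA-capable card), the snippet below runs a matrix multiplication whose output elements are all independent, so the GPU's cores can compute them in parallel:

```python
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The matrix size is illustrative.
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# Every element of the product can be computed independently,
# so the GPU's many processing units work on it simultaneously.
c = a @ b
```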
The Tensor Processing Unit
As the influence of Artificial Intelligence increased, GPUs started shipping with tensor cores, dedicated units that accelerate operations on tensors. Specialized chips built entirely for this purpose, such as Google's Tensor Processing Unit (TPU), are also produced.
Learn more about the definition of tensors here: [[Tensors]].
The Bottleneck
The main bottleneck of the GPU isn't processing power - it is copying data between the CPU's memory and the GPU's memory. This transfer often takes more time than the computations themselves.
For the GPU to process input data, that data first needs to be copied from the CPU's memory into the GPU's memory. After completing the computations, the GPU then needs to transfer the results back to the CPU's memory.
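The sketch below (illustrative only; it assumes PyTorch and a CUDA-capable GPU, and the tensor size is made up) times the host-to-device copy against a cheap element-wise computation on the same data, to show where the time goes:

```python
import time
import torch

assert torch.cuda.is_available()
_ = torch.zeros(1, device="cuda")  # warm-up: initialize the CUDA context

x_cpu = torch.randn(8192, 8192)    # data starts in the CPU's memory

t0 = time.perf_counter()
x_gpu = x_cpu.to("cuda")           # copy CPU (host) -> GPU (device)
torch.cuda.synchronize()           # wait until the copy has finished
t1 = time.perf_counter()

y = x_gpu * 2.0                    # a cheap element-wise computation
torch.cuda.synchronize()           # wait until the kernel has finished
t2 = time.perf_counter()

print(f"copy:    {t1 - t0:.4f} s")
print(f"compute: {t2 - t1:.4f} s")
```

For a cheap operation like this, the copy typically dominates: it crosses the comparatively slow PCIe bus, while the computation stays on the chip.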
To optimize the performance of the GPU, data can be organized into batches. This way, many samples are copied to and from the CPU in a single transfer, instead of one transfer per sample.
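A sketch of batched loading, assuming PyTorch; the dataset shapes and the batch size of 256 are hypothetical:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 10,000 samples of 128 features each.
data = torch.randn(10_000, 128)
labels = torch.randint(0, 10, (10_000,))
loader = DataLoader(TensorDataset(data, labels), batch_size=256)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for xb, yb in loader:
    # One copy per batch of 256 samples, not one copy per sample.
    xb, yb = xb.to(device), yb.to(device)
    # ... run the model on xb here ...
```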