Rise of GPUs and GPU architecture
👷♂️ Software Architecture Series — Part 24.
In 1965, Gordon Moore, who would go on to co-found Intel, published a paper in which he observed that the number of transistors that could be placed on an integrated circuit at minimum cost had roughly doubled every year up to that point. Based on this observation, he predicted that the number of transistors on an integrated circuit would keep doubling at regular intervals (a rate he later revised to every two years) with minimal rise in cost.
His observation was not derived from first principles but from an intuitive understanding of the field, yet it has stood the test of time, and the microprocessors of the Intel Pentium family further cemented his legacy by tracking his prediction closely. These microprocessors, based on a single central processing unit (CPU), drove rapid performance increases and cost reductions in computer applications for more than two decades, which allowed application software to provide more functionality, offer better user interfaces, and generate more useful results. However, customer demands are rarely satisfied, and expectations rise at each turn of the wheel. The ever-increasing demand for performance and cutthroat competition created a positive cycle for the computer industry, and software developers came to rely on the underlying hardware to make their applications faster. Since 2003, however, energy-consumption and heat-dissipation issues have pushed processor clock speeds against an upper limit. Vendors therefore switched to designs that place multiple processing units, or processor cores, on each chip to increase processing power, and this shift has changed programming practice ever since.
Traditionally, software developers approached programming in a sequential fashion, as if programs were executed by a human stepping through the code one statement at a time. A sequential program, however, runs on only one processor core, and single-core performance has practically hit its ceiling, while the demand for faster software has not let up. Hence, modern software developers have embraced parallel programming, which increases the effective processing power of a computer by putting multiple cores to work at the same time. With new microprocessors acting as parallel computers, the shift towards parallel programming has accelerated dramatically, but there is still a long way to go.
CPUs are primarily optimized for sequential code performance. They employ sophisticated control logic to execute instructions from a single thread in parallel or out of order while maintaining the appearance of sequential execution. CPUs are equipped with large cache memories to reduce the latency of accessing instructions and data. These cache memories are essential for handling large, complex applications efficiently. However, they do not directly contribute to peak calculation speed. Instead, they optimize the execution of sequential code by minimizing latency. This design philosophy prioritizes maximizing the performance of sequential tasks. General-purpose multicore CPUs typically have a smaller number of large processor cores. These cores are designed to deliver strong sequential code performance, which is crucial for tasks that cannot be effectively parallelized.
In contrast, many-core GPUs are designed with a focus on parallel processing power. With general-purpose microprocessors hitting their performance limit, the computer industry started adopting GPUs, whose large number of smaller cores is optimized for workloads that can be parallelized, such as graphics rendering, scientific simulations, and machine learning. Since 2003, GPUs have outpaced CPUs in floating-point performance: the ratio between the peak floating-point throughput of GPUs and CPUs has been steadily increasing, reaching around 10 to 1 by 2009. Many industries, including scientific research, finance, healthcare, and entertainment, have benefited from the acceleration GPUs provide; complex simulations, data analytics, and deep learning training, for example, complete much faster on GPUs than on traditional CPUs.
Consider this: playing a video game at 60 frames per second (fps) at a resolution of 1920 x 1080 (Full HD or 1080p) entails rendering over 2 million pixels for each frame. Rendering involves numerous computational tasks, including matrix multiplication and vector transformation, which are essential for drawing 3D graphics, applying visual effects, and performing calculations for physics simulation, lighting, and shading. Unlike traditional CPUs, which typically have a few cores optimized for sequential processing, modern GPUs are equipped with thousands of cores that can perform these calculations simultaneously.
The high number of cores in modern GPUs enables them to perform parallel processing tasks such as matrix multiplication and vector transformations rapidly. This parallelism is essential for meeting the demands of real-time rendering in video games, where numerous calculations must be performed for each pixel on the screen within a tight timeframe to maintain smooth gameplay at 60 fps or higher.
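To put rough numbers on that timeframe (the per-pixel cost assumed below is purely illustrative, not a measured figure):

$$1920 \times 1080 \times 60\ \text{fps} \approx 1.24 \times 10^{8}\ \text{pixels per second}$$

At even a few hundred floating-point operations of shading work per pixel, that is tens of billions of operations per second before physics and post-processing are counted, which is precisely the kind of uniform, independent, per-element work that thousands of small cores absorb well.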
The performance of GPUs is often measured in teraflops (trillions of floating-point operations per second), a figure that quantifies how many floating-point calculations a GPU can carry out each second and therefore its raw processing power.
For example, a GPU like the NVIDIA RTX 4090 has over 16,000 CUDA cores and delivers tens of teraflops of computational performance. This immense parallel processing capability allows modern GPUs to efficiently handle the complex computations required for rendering high-resolution graphics, applying advanced visual effects, and executing compute-intensive tasks such as AI inference, scientific simulations, and cryptocurrency mining.
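As a rough sanity check, a theoretical peak can be estimated from the core count, the boost clock, and the fact that each core can retire one fused multiply-add (two floating-point operations) per cycle. Plugging in the commonly reported RTX 4090 figures of 16,384 CUDA cores and a boost clock of about 2.52 GHz gives a number close to the advertised FP32 rating:

$$16{,}384\ \text{cores} \times 2\ \tfrac{\text{FLOPs}}{\text{cycle}} \times 2.52 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} \approx 8.26 \times 10^{13}\ \tfrac{\text{FLOPs}}{\text{s}} \approx 82.6\ \text{TFLOPS}$$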
However, this does not mean the end of the road for general-purpose CPUs. While CPUs excel at tasks requiring strong sequential code performance, GPUs are optimized for parallel processing tasks, particularly numeric computations. Therefore, leveraging both CPUs and GPUs in an application can offer a best-of-both-worlds solution: efficient execution of sequential tasks on the CPU and accelerated processing of numerically intensive tasks on the GPU.
The CUDA (Compute Unified Device Architecture) programming model, introduced by NVIDIA in 2007, was a step in this direction, as it facilitates the joint execution of CPU and GPU tasks within an application. CUDA allows developers to offload computationally intensive portions of their code to the GPU while keeping the rest of the application running on the CPU. This approach, often referred to as heterogeneous computing, maximizes overall performance by utilizing the strengths of both types of processors. By using CUDA or similar programming models, software developers can take advantage of the parallel processing power of GPUs without sacrificing the benefits of sequential execution that CPUs already provide.
CUDA programs consist of kernel functions, written in CUDA C/C++, which are special functions executed in parallel by multiple threads on the GPU. Kernel functions are typically designed to perform a specific computation or task on individual data elements. Each thread in a kernel executes the same code, but with different data inputs. Threads in CUDA are organized into blocks, which are further organized into a multi-dimensional grid. Blocks are groups of threads that execute concurrently on a single streaming multiprocessor within the GPU. Each block within a grid (collection of blocks that execute the same kernel function) has its own unique block index and is limited in size by hardware constraints, such as the maximum number of threads per block and the maximum amount of shared memory per block. The grid structure allows for flexible organization of threads and blocks to adapt to the requirements of different computational tasks. For example, a two-dimensional grid may be suitable for image processing tasks where data is naturally arranged in rows and columns. This allows CUDA programs to scale efficiently across different GPU architectures and hardware configurations. By adjusting the number of threads per block and the number of blocks in the grid, CUDA programs can effectively utilize the available resources on GPUs of varying capabilities.
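To make this concrete, here is a minimal sketch of a kernel for a two-dimensional workload of the kind just described. The kernel name, image layout, and gain parameter are illustrative rather than taken from any particular codebase; each thread handles exactly one pixel, locating it from its block and thread indices. A matching host-side launch appears after the next paragraph.

```cuda
// Illustrative kernel: each thread brightens exactly one pixel of a
// grayscale image. The 2D grid/block layout mirrors the row/column
// structure of the data, as described above.
__global__ void brightenKernel(unsigned char *img, int width, int height, float gain)
{
    // Global 2D coordinates of this thread = block offset + thread offset.
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row

    // Guard: the grid may overhang the image when its size is not a
    // multiple of the block size, so out-of-range threads do nothing.
    if (x < width && y < height) {
        float v = img[y * width + x] * gain;
        img[y * width + x] = (unsigned char)(v > 255.0f ? 255.0f : v);
    }
}
```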
CUDA provides a runtime API that allows developers to manage GPU resources, launch kernel functions, and transfer data between the CPU and GPU. It abstracts many low-level GPU operations, making it easier for developers to write GPU-accelerated applications. NVIDIA provides a set of libraries and tools that leverage CUDA for specific tasks, such as cuBLAS for linear algebra computations, cuFFT for fast Fourier transforms, cuDNN for deep learning, and more.
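Continuing the sketch above (again with illustrative names and no error checking), the host side of such a program might look roughly like this. cudaMalloc, cudaMemcpy, and cudaFree are the runtime API's basic tools for managing device memory and moving data across the CPU-GPU boundary, and the <<<grid, block>>> syntax launches the kernel with the chosen thread organization:

```cuda
#include <cuda_runtime.h>

// Host-side driver for the kernel sketched earlier (error checking omitted).
void brightenOnGpu(unsigned char *hostImg, int width, int height)
{
    size_t bytes = (size_t)width * height;
    unsigned char *devImg = nullptr;

    cudaMalloc(&devImg, bytes);                                   // allocate GPU global memory
    cudaMemcpy(devImg, hostImg, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU transfer

    dim3 block(16, 16);                                           // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,                   // enough blocks to cover
              (height + block.y - 1) / block.y);                  // the whole image
    brightenKernel<<<grid, block>>>(devImg, width, height, 1.2f); // launch on the GPU

    cudaMemcpy(hostImg, devImg, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU transfer
    cudaFree(devImg);                                             // release GPU memory
}
```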
The release of CUDA not only changed how software was designed but also propelled the hardware industry to adopt designs that support parallel computing workloads directly, without going through the graphics interface. One of the key advantages of CUDA was its compatibility with familiar programming languages such as C/C++, making it easier for developers to harness GPUs for parallel programming. By providing a familiar programming environment, CUDA let programmers transition smoothly from traditional CPU programming to GPU-accelerated computing. Moreover, it eliminated the need to express computing applications through graphics APIs such as OpenGL; instead, CUDA programs could access the parallel computing capabilities of the GPU directly through a dedicated parallel programming interface.
Let us now delve into the architecture of a modern CUDA-capable GPU:
A GPU is organized as an array of highly threaded streaming multiprocessors (SMs), each of which contains a number of streaming processors (SPs) that share control logic and an instruction cache. These processors are responsible for executing parallel threads and managing their execution. The number of streaming multiprocessors per chip varies from one generation of CUDA GPUs to another. The GPU also comes equipped with its own graphics double data rate (GDDR) DRAM (dynamic random-access memory), a few gigabytes on early CUDA GPUs and tens of gigabytes on current ones, which serves as global memory: storage that all threads running on the device can access.
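Because these figures differ from one GPU generation to the next, the CUDA runtime lets a program query them at startup. A minimal sketch, printing a few of the available fields for device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("GPU:                       %s\n", prop.name);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Global memory:             %.1f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block:   %zu KB\n", prop.sharedMemPerBlock / 1024);
    return 0;
}
```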
GDDR DRAM serves a dual purpose: for graphics applications it primarily functions as the frame buffer, storing video images and texture information for 3D rendering, while for computing tasks it acts as very-high-bandwidth, off-chip memory. Although it offers significantly higher bandwidth than the typical system memory found on CPU motherboards, that bandwidth is still finite, and its latency tends to be somewhat higher. The higher latency slows down instructions that depend on data stored in DRAM, and the limited bandwidth can restrict the rate at which data moves between memory and processor, especially for large datasets or memory-intensive applications.
The long latency and limited bandwidth of DRAM can significantly impact the performance of processing units such as CPU cores or GPU streaming multiprocessors. These units rely on fast access to data stored in memory to perform computations efficiently. When the memory subsystem becomes a bottleneck, processing units may spend more time waiting for data to be fetched from memory, reducing their overall utilization and efficiency.
This trade-off was acceptable for graphics processing, where high throughput is crucial for rendering complex scenes, and it proved fast enough for many parallel computing tasks in which data throughput, rather than latency, is paramount. By combining a large number of streaming processors with the high bandwidth of GDDR DRAM, GPUs could execute a vast number of parallel threads simultaneously.
The NVIDIA G80 chip, the first GPU built on the CUDA architecture, had a memory bandwidth of 86.4 GB/s to its GDDR DRAM, plus an 8 GB/s communication bandwidth with the CPU. While the CPU link may seem slow compared to the memory bandwidth, it is not as limiting as it appears at first glance: the PCI Express bandwidth was comparable to the CPU front-side-bus bandwidth to system memory, which kept communication from becoming the dominant bottleneck, and interconnect bandwidth has continued to grow with advances in CPU bus technology. The G80 itself was massively parallel, featuring 128 streaming processors (SPs) organized into 16 streaming multiprocessors (SMs) of 8 SPs each. Each SP had a multiply-add (MAD) unit and an additional multiply unit, contributing to a total computational power of over 500 gigaflops. On top of that, each SP was massively threaded, capable of running thousands of threads per application: a typical application could run 5,000 to 12,000 threads simultaneously on this chip. The GT200, a later iteration, exceeded 1 teraflop with 240 SPs and supported 1,024 threads per SM, pushing the total to roughly 30,000 threads.
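Both headline figures can be reconstructed from the commonly reported GeForce 8800 GTX (G80) specifications, assuming a 384-bit memory interface running GDDR3 at an effective 1.8 GT/s and a 1.35 GHz shader clock, with each SP issuing a MAD (2 FLOPs) plus a MUL (1 FLOP) per cycle:

$$\text{memory bandwidth} = \tfrac{384\ \text{bits}}{8\ \text{bits/byte}} \times 1.8 \times 10^{9}\ \tfrac{\text{transfers}}{\text{s}} = 86.4\ \text{GB/s}$$

$$\text{peak compute} = 128\ \text{SPs} \times 3\ \tfrac{\text{FLOPs}}{\text{cycle}} \times 1.35 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} \approx 518\ \text{GFLOPS}$$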
To mitigate the impact of memory bottlenecks, modern processors employ on-chip cache memories. Caches store frequently accessed data and instructions close to the processor cores, cutting down the number of accesses that must go all the way to the slower off-chip main memory (DRAM). This improves overall system performance by exploiting the principle of locality: programs tend to access the same set of memory locations repeatedly within a short period of time.
To balance memory size and access speed, modern processors employ a hierarchical structure of caches. This hierarchy consists of multiple cache levels, each with varying sizes, latencies, and access speeds. The numbering convention for cache levels reflects their proximity to the processor. Caches closer to the processor core have lower latency and higher bandwidth but are smaller in size compared to caches farther away.
For instance, the L1 cache is the cache attached directly to a processor core. It operates at a speed close to that of the core, offering low latency and high bandwidth, but to stay that fast it must remain small, typically 16 to 64 KB. L1 caches are dedicated to each processor core and store frequently accessed data and instructions. L2 caches are larger than L1 caches and are often shared among multiple processor cores, or among streaming multiprocessors in GPU architectures. They typically range in size from a few hundred kilobytes to a few megabytes and have higher latency than L1 caches. Some high-end processors feature an additional level of cache known as L3. L3 caches are larger still, ranging from several megabytes to, on large server parts, hundreds of megabytes. They provide further caching for data shared among multiple processor cores or streaming multiprocessors, which helps reduce memory access latency and improve overall system performance.
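On CUDA GPUs these hardware-managed caches are complemented by a small block of programmer-managed on-chip shared memory per streaming multiprocessor (the per-block shared memory limit mentioned earlier). The sketch below, with illustrative names and a simple zero-padded boundary, shows the idea: each block stages a tile of the input into shared memory once, so neighboring threads reuse fast on-chip data instead of issuing repeated loads to off-chip DRAM.

```cuda
#define TILE 256   // launch with blockDim.x == TILE

// Three-point moving average over a 1D array. Each block loads its tile
// (plus a one-element halo on each side) into shared memory, then every
// thread averages three neighboring values read from that on-chip copy.
__global__ void blur3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    int lid = threadIdx.x + 1;                        // local index, shifted past the left halo

    tile[lid] = (gid < n) ? in[gid] : 0.0f;           // interior element (zero-padded past the end)
    if (threadIdx.x == 0)                             // first thread also loads the left halo
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)                // last thread also loads the right halo
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                                  // make the whole tile visible to the block

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```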
Starting with NVIDIA’s Pascal architecture and continuing in later data-center generations such as Volta and Ampere, there has been a shift towards High-Bandwidth Memory (HBM and HBM2). These technologies depart from traditional GDDR DRAM by integrating the DRAM stacks tightly with the GPU in the same package. This proximity allows for shorter data paths between the memory and the processing units and reduces power consumption, since less energy is lost to resistance and capacitance; HBM also operates at a lower voltage than GDDR DRAM.
Internally, HBM/HBM2 utilizes a stacked memory architecture, where multiple DRAM dies are vertically stacked on top of each other and interconnected using through-silicon vias (TSVs). The integration of DRAM modules in close proximity to the GPU enables higher levels of parallelism in memory access operations. This translates to faster data transfer rates and improved overall system performance, particularly in memory-bound applications such as high-resolution gaming, deep learning, and scientific simulations.
End Note: The level of parallelism supported by GPUs is bound to increase in the coming years, as applications continue to demand more speed. We may feel that today’s applications are fast enough, but tomorrow’s mass-market applications will look more like supercomputing applications, deriving insights at a much deeper level than current ones. Consider fields like molecular biology, climate modeling, materials science, traffic flow optimization, and genomic analysis, all of which stand to benefit from advanced computational modeling and simulation. Computational models let researchers test a far wider range of hypotheses and scenarios than traditional instrumentation allows. As computing speed continues to improve, so will the size and complexity of the systems that can be modeled, and simulations will be able to cover longer reaction times within tolerable response times, allowing more accurate and realistic representations of the underlying processes.