The big news at this year’s virtual GPU Technology Conference (GTC) was Nvidia’s release of the A100 (Fig. 1). The A100 GPU, with its 54.2 billion transistors, can get a bit toasty with a max thermal design power (TDP) of 400 W, and a large array of modules can be connected using the built-in NVLink interconnect. Each module supports 600 GB/s of NVLink bandwidth in addition to a 64-GB/s PCI Express (PCIe) Gen 4 interface. The PCIe interface supports single-root I/O virtualization (SR-IOV).
The A100 GPU is based on the company’s new Ampere architecture, which provides a significant performance boost over the earlier Volta (V100) and Turing architectures. The A100 includes 40 GB of on-module HBM2 memory with 1,555 GB/s of memory bandwidth. There’s also a 40-MB L2 cache, almost seven times the size of the V100’s.
The A100 GPU incorporates seven GPU processing clusters (GPCs), each containing up to eight texture processing clusters (TPCs) and up to 16 streaming multiprocessors (SMs) (Fig. 2). Ten 512-bit memory controllers support five HBM2 memory stacks. The SMs support the full range of data types, from FP64 and the new TF32 format down to FP16, BF16, INT8, INT4, and binary. A new shared-memory-based barrier unit provides asynchronous barriers that pair with the new asynchronous copy instructions, which move data from global memory directly into shared memory. Each SM supports 32 threads per warp and up to 64 warps.
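CUDA 11 surfaces this machinery through the cuda::barrier and cuda::memcpy_async primitives in its libcu++ library. The kernel below is a minimal sketch rather than Nvidia sample code; the kernel name and buffers are invented, and it assumes compilation for sm_80 and a launch that provides blockDim.x floats of dynamic shared memory.

#include <cooperative_groups.h>
#include <cuda/barrier>

namespace cg = cooperative_groups;

// Hypothetical kernel: stage a tile of 'in' into shared memory using the
// asynchronous copy hardware, synchronizing on a shared-memory barrier.
__global__ void scale_kernel(const float* in, float* out, float factor) {
    extern __shared__ float tile[];   // blockDim.x floats, sized at launch
    auto block = cg::this_thread_block();

    // The barrier object lives in shared memory; one thread initializes it.
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0)
        init(&bar, block.size());
    block.sync();

    // Kick off the async global-to-shared copy; 'bar' tracks completion.
    size_t offset = (size_t)blockIdx.x * blockDim.x;
    cuda::memcpy_async(block, tile, in + offset,
                       sizeof(float) * block.size(), bar);

    bar.arrive_and_wait();            // wait for the copy to land

    out[offset + threadIdx.x] = tile[threadIdx.x] * factor;
}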
Usually, multiple A100s are tied together over the NVLink interface, enabling very large models to run across an array of chips. A new feature, Multi-Instance GPU (MIG), allows the opposite: it splits a single GPU’s resources into dedicated, protected islands of computation. Up to seven instances can be defined, each running its own CUDA applications. CUDA 11 is Nvidia’s latest programming environment.
Each MIG instance has separate, isolated paths through the entire memory system. Other resources, such as the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address buses, are also allocated to these logical islands, providing predictable throughput and latency: one instance’s L2 cache allocation and DRAM utilization are unaffected by the operation of the others. Error and fault isolation are likewise maintained within each instance.
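To a CUDA program, a MIG instance simply looks like a smaller GPU. The short host-side sketch below (a generic device query, not a MIG-specific API) would report an instance’s dedicated share of SMs and memory when run inside one; the exact numbers depend on the MIG profile chosen.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // On a MIG-partitioned A100, the visible device reflects the
        // instance's dedicated slice of SMs and DRAM, not the full GPU.
        printf("device %d: %s, %d SMs, %zu MiB\n", i, prop.name,
               prop.multiProcessorCount,
               (size_t)(prop.totalGlobalMem >> 20));
    }
    return 0;
}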