Sparsity augments AI acceleration on Nvidia’s A100 GPU

May 24, 2020 | By William G. Wong, Electronic Design
The A100 platform pushes the limits of machine learning from the edge to the enterprise. Peak 64-bit floating-point performance is 9.7 teraFLOPS (TFLOPS); at the other end of the spectrum, INT4 performance reaches 1,248 TOPS.

The big news at this year’s virtual GPU Technology Conference (GTC) was Nvidia’s release of the A100 (Fig. 1). The A100 GPU with its 54.2 billion transistors can get a bit toasty with a max thermal design power (TDP) of 400 W, and a large array of modules can be connected using the built-in NVLinks. Each module supports 600 GB/s of NVLink bandwidth in addition to the 64 GB/s PCI Express (PCIe) Gen 4 interfaces. The PCIe interfaces support SR-IOV.  

1. The A100, based on Nvidia’s Ampere architecture, has a bandwidth of 1.5 TB/s.

The A100 GPU is based on the company’s new Ampere architecture, which provides a significant performance boost over the earlier Volta (V100) and Turing architectures. The A100 includes 40 GB of HBM2 on-module memory with a memory bandwidth of 1,555 GB/s, plus a 40-MB L2 cache that is almost seven times larger than the V100’s.
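To put those numbers in context, the memory capacity, L2 cache size, and SM count can all be read back at runtime through the standard CUDA device-property query. The short host program below is only an illustrative sketch, not something from the article; it assumes a CUDA 11 toolchain and estimates peak DRAM bandwidth from the reported memory clock and bus width, which on an A100 should come out close to the 1,555 GB/s quoted above.

// Illustrative sketch (not from the article): query the specs cited in the
// text -- HBM2 capacity, L2 cache size, SM count -- via the CUDA runtime.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is in kHz; HBM2 transfers twice per clock, so peak
    // bandwidth is roughly 2 * clock * (bus width / 8) bytes per second.
    double peak_gbps = 2.0 * prop.memoryClockRate * 1e3 *
                       (prop.memoryBusWidth / 8.0) / 1e9;

    printf("Device        : %s\n", prop.name);
    printf("Global memory : %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("L2 cache      : %.1f MB\n", prop.l2CacheSize / 1e6);
    printf("SM count      : %d\n", prop.multiProcessorCount);
    printf("Peak BW (est.): %.0f GB/s\n", peak_gbps);
    return 0;
}

Built with nvcc and run on an A100, this should report roughly 40 GB of HBM2, a 40-MB L2 cache, and 108 SMs.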

Seven GPU processing clusters (GPCs), each containing seven or eight texture processing clusters (TPCs), are incorporated into the A100 GPU, with up to 16 streaming multiprocessors (SMs) per GPC, two per TPC (Fig. 2). The ten 512-bit memory controllers support five HBM2 memory stacks. The SMs support all data types. A new shared-memory-based barrier unit provides asynchronous barriers to handle the new asynchronous copy instructions. The SMs support 32 threads per warp and 64 warps per SM.

2. The streaming multiprocessor (SM) is the building block for the A100 system.
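The asynchronous barriers and copy instructions mentioned above are exposed in CUDA 11 through cooperative groups. The kernel below is a minimal, hypothetical sketch, not code from the article: it stages a tile of global memory into shared memory with cooperative_groups::memcpy_async and then waits on the barrier before using the data.

// Hypothetical sketch (not from the article): Ampere-era asynchronous copy
// of global memory into shared memory, as exposed by CUDA 11.
#include <cstdio>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // threads per block and elements per shared tile

__global__ void scale_async(const float* in, float* out, int n) {
    __shared__ float tile[TILE];
    cg::thread_block block = cg::this_thread_block();

    int base = blockIdx.x * TILE;
    int count = min(TILE, n - base);  // handle the ragged last tile

    // Stage the tile into shared memory asynchronously.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * count);
    cg::wait(block);  // barrier: wait until the copy has landed in shared memory

    if (threadIdx.x < count)
        out[base + threadIdx.x] = 2.0f * tile[threadIdx.x];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);

    scale_async<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", out[42]);  // expect 84.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}

On the A100, this copy path can move data from global to shared memory without a round trip through the register file, which is what makes the dedicated barrier hardware worthwhile.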

Usually, multiple A100s are tied together via NVLink, enabling very large models to be run across an array of chips. A new feature is Multi-Instance GPU (MIG), which allows the opposite: splitting up the GPU resources into dedicated and protected islands of computation. Up to seven instances can be defined, each running its own CUDA applications. CUDA 11 is Nvidia’s latest programming environment.

Each MIG instance has separate and isolated paths through the entire memory system. Other resources like the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address buses are also allocated to these logical islands. This provides predictable throughput and latency. L2 cache allocation and DRAM utilization will not be affected by the operation of other instances. Error and fault isolation are maintained within each instance.
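From the application’s point of view, a MIG instance simply looks like a smaller GPU. The sketch below is an illustrative example, not code from the article; it assumes the process has been restricted to one instance, for example by pointing CUDA_VISIBLE_DEVICES at that instance’s MIG UUID, so that device enumeration reports only the slice’s SMs and memory.

// Illustrative sketch (not from the article): enumerate the CUDA devices a
// process can see. Run inside a MIG-pinned process or container, the counts
// below reflect the instance's dedicated slice rather than the whole A100.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Visible CUDA devices: %d\n", count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // Under MIG these are the instance's share of SMs and DRAM.
        printf("  device %d: %s, %d SMs, %.1f GB\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / 1e9);
    }
    return 0;
}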


