The chip’s architecture offers near-linear scaling of training performance, from single-device systems up to systems using hundreds of Gaudi processors.
This scalability comes from Gaudi’s on-chip integration of RDMA over Converged Ethernet (RoCE v2), which lets the AI processor communicate over standard Ethernet. Users can scale up and scale out AI training systems with standard Ethernet switching. Ethernet switches are multi-sourced and offer almost unlimited scalability in speeds and port count, as already deployed in datacenters for compute and storage systems. In contrast, GPU-based systems rely on proprietary interconnects that can limit scalability and choice.
The Gaudi processor includes 32GB of HBM2 memory and is currently offered in two forms:
HL-200 – a PCIe card supporting eight ports of 100Gb Ethernet;
HL-205 – an OCP-OAM compliant mezzanine card, supporting 10 ports of 100Gb Ethernet or 20 ports of 50Gb Ethernet.
Habana will also introduce an 8-Gaudi system called HLS-1, which includes eight HL-205 mezzanine cards, PCIe connectors for external host connectivity, and 24 ports of 100Gb Ethernet for connecting to off-the-shelf Ethernet switches, allowing scale-up in a standard 19’’ rack by populating multiple HLS-1 systems.
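The 24 external ports follow from simple port arithmetic; the sketch below assumes (this wiring is our assumption, not stated above) that the eight cards inside HLS-1 are connected all-to-all, one 100Gb link per card pair:

```python
# Illustrative port budget for an HLS-1-style system.
# Assumption: all-to-all internal topology among the eight HL-205 cards,
# one 100Gb link per card pair (not stated in the text above).

CARDS = 8
PORTS_PER_CARD = 10  # HL-205 supports 10 ports of 100Gb Ethernet

internal_links = CARDS * (CARDS - 1) // 2    # 28 card-to-card links
internal_ports = internal_links * 2          # 56 ports consumed inside the box
external_ports = CARDS * PORTS_PER_CARD - internal_ports

print(external_ports)  # 24 ports left for off-the-shelf Ethernet switches
```

Under that assumption, exactly 24 of the 80 total ports remain free for connecting HLS-1 systems to external switches.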
The Gaudi processor is fully programmable and customizable, incorporating a second-generation Tensor Processing Core (TPC) cluster along with development tools, libraries, and a compiler that collectively deliver a comprehensive and flexible solution. Habana Labs’ SynapseAI software stack consists of a rich kernel library and an open toolchain that lets customers add proprietary kernels.