LONDON – Following the launch of its AI inference chip last year, Habana Labs (Tel-Aviv, Israel) has unveiled an AI training chip, built on the same architecture, that can outpace the incumbent technology by a substantial margin and features on-chip RoCE (remote direct memory access over Converged Ethernet) communications for scalability.
The company’s inference chip, Goya, set records for ResNet-50 inference in September 2018, and the new training chip, Gaudi, offers similarly high performance. Gaudi can process 1650 images per second at a batch size of 64 when training a ResNet-50 network, which Habana claims is a new world record for this benchmark. This throughput is delivered at a power consumption of 140W, which the company says is also a substantial advantage over competing solutions.
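To put the quoted figures in perspective, the short Python sketch below derives power efficiency, energy per image and time per batch from them. It assumes the 1650 images per second and 140W figures describe the same sustained operating point, which the article implies but does not state outright; the function names are illustrative only.

```python
# Back-of-envelope arithmetic on the figures quoted for Gaudi training ResNet-50.
# Assumption: throughput and power refer to the same sustained operating point.

IMAGES_PER_SECOND = 1650  # quoted training throughput
POWER_WATTS = 140         # quoted power consumption
BATCH_SIZE = 64           # quoted batch size


def images_per_second_per_watt(throughput: float, power_w: float) -> float:
    """Throughput normalised by power draw."""
    return throughput / power_w


def millijoules_per_image(throughput: float, power_w: float) -> float:
    """Energy spent per image; 1 W is 1 J/s, so J/image = W / (images/s)."""
    return power_w / throughput * 1000.0


def milliseconds_per_batch(throughput: float, batch_size: int) -> float:
    """Average time to work through one batch at the given throughput."""
    return batch_size / throughput * 1000.0


if __name__ == "__main__":
    print(f"{images_per_second_per_watt(IMAGES_PER_SECOND, POWER_WATTS):.2f} images/s per watt")
    print(f"{millijoules_per_image(IMAGES_PER_SECOND, POWER_WATTS):.1f} mJ per image")
    print(f"{milliseconds_per_batch(IMAGES_PER_SECOND, BATCH_SIZE):.1f} ms per batch of {BATCH_SIZE}")
```

On these numbers the chip handles roughly 11.8 images per second per watt, or about 85 mJ per training image.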
Impressive, but is Habana’s architecture designed specifically to beat the ResNet-50 benchmark, or will it offer similar throughput advantages for other types of neural networks?
“There is nothing in the architecture that limits it to be a ResNet-50 machine, not at all,” said Eitan Medina, Habana Labs’ Chief Business Officer. “A company like Facebook wouldn’t have spent the time to integrate Goya [into its Glow machine learning compiler] if it was just a ResNet-50 machine… Our customers are implementing Goya in anything from vision processing to recommendation systems – many, many types of applications.”
The new Gaudi training chip joins the Goya inference chip in the Habana portfolio. Like Goya, Gaudi has eight VLIW SIMD (very long instruction word, single instruction multiple data) vector processor cores, which Habana calls tensor processor cores (TPCs), that are specially designed for AI workloads. The differences include the data types supported: while both chips are mixed precision, Goya focuses on integer multipliers, whereas Gaudi places greater emphasis on more precise data formats such as BF16 and FP32.
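For readers unfamiliar with BF16, the Python sketch below illustrates how the format relates to FP32: a BF16 value has the same sign bit and 8-bit exponent as FP32 but only 7 mantissa bits, so it corresponds to the upper 16 bits of the FP32 encoding. The conversion here simply truncates the low bits, a simplification chosen for clarity; it is not a description of how Gaudi itself rounds, and the helper names are hypothetical.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision encoding of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the FP32 encoding (sign, 8 exponent bits,
    7 mantissa bits). This truncates; hardware often rounds to nearest instead."""
    return float32_bits(x) >> 16


def bf16_bits_to_float(bits: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float by zero-filling
    the 16 mantissa bits that were dropped."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def bf16_round_trip(x: float) -> float:
    """Value of x after being stored as (truncated) BF16."""
    return bf16_bits_to_float(to_bf16_bits(x))


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1e38):
        print(value, "->", bf16_round_trip(value))
```

With truncation, 3.14159 becomes 3.140625, while a value as large as 1e38 stays finite because BF16 keeps the full FP32 exponent range. That trade of mantissa precision for range is the usual argument for BF16 in training.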
The high-level architecture of Habana Labs’ Gaudi processor. (Source: Habana Labs)
“The TPC core itself is also different. By now this is a second generation of the TPC core, a VLIW machine designed from scratch with our instruction set,” said Medina. “The training chip also has a different type of memory. With Goya, we have DDR4 memory, with Gaudi, we have four HBM2 (high bandwidth memory, second generation) memories. So, it’s a different balance of throughput and on-chip memory compared to Goya.”
Gaudi comes on an OCP (Open Compute Project) accelerator module-compatible mezzanine card with 32GB of HBM2 memory (HL-205), or in an eight-card supercomputer box for datacentres (HLS-1). The exact amount of on-chip memory wasn’t disclosed, other than being described as substantial, but Medina said, “The training solution has an incredible amount of throughput to the HBM2s, so we are not that sensitive to on-chip memory size, and we designed the specialised memory controller to deliver 1 TB/s of throughput, that’s very high throughput for any scale of processor.”
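The memory figures can be related to one another with some simple arithmetic, sketched below in Python. This is an illustration under stated assumptions, not vendor data: it treats the 1 TB/s figure as a per-chip peak, uses decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), and reuses the 1650 images per second ResNet-50 figure quoted earlier in the article.

```python
# Rough arithmetic relating the quoted HBM2 capacity and bandwidth figures.
# Assumptions: decimal units, and 1 TB/s is a per-chip peak figure.

HBM_CAPACITY_BYTES = 32 * 10**9      # 32GB of HBM2 per HL-205 card
HBM_BANDWIDTH_BYTES_PER_S = 10**12   # 1 TB/s memory throughput
CARDS_PER_HLS1 = 8                   # cards in one HLS-1 system
IMAGES_PER_SECOND = 1650             # quoted ResNet-50 training throughput


def full_sweep_seconds(capacity_bytes: int, bandwidth_bytes_per_s: int) -> float:
    """Time to read the entire HBM2 capacity once at peak bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


def system_capacity_bytes(capacity_bytes: int, cards: int) -> int:
    """Total HBM2 capacity across all cards in one system."""
    return capacity_bytes * cards


def bandwidth_budget_per_image(bandwidth_bytes_per_s: int, images_per_s: float) -> float:
    """Upper bound on memory traffic available per training image."""
    return bandwidth_bytes_per_s / images_per_s


if __name__ == "__main__":
    sweep_ms = full_sweep_seconds(HBM_CAPACITY_BYTES, HBM_BANDWIDTH_BYTES_PER_S) * 1000
    total_gb = system_capacity_bytes(HBM_CAPACITY_BYTES, CARDS_PER_HLS1) / 10**9
    budget_mb = bandwidth_budget_per_image(HBM_BANDWIDTH_BYTES_PER_S, IMAGES_PER_SECOND) / 10**6
    print(f"Full 32GB sweep at peak bandwidth: {sweep_ms:.0f} ms")
    print(f"HBM2 capacity per HLS-1: {total_gb:.0f} GB")
    print(f"Peak memory traffic budget per image: {budget_mb:.0f} MB")
```

Under these assumptions, the whole 32GB can be streamed in about 32 ms, which gives some sense of why Medina describes the design as not very sensitive to on-chip memory size.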
The Habana Labs HLS-1 system combines eight Gaudi accelerator cards. (Source: Habana Labs)
Habana’s software stack, SynapseAI, directly interfaces with deep learning frameworks such as TensorFlow, PyTorch, and Caffe2. There is also a complete programming toolchain for the TPC.