Nvidia unveiled its next-generation GPU architecture, named Hopper, alongside the new flagship GPU based on that architecture, the H100. Perhaps surprisingly, Nvidia has not gone down the trendy chiplet route favored by Intel and AMD for their mammoth GPUs. While the H100 is the first GPU to use HBM3, its compute die is monolithic: 80 billion transistors in 814 mm², built on TSMC’s 4N process. Memory and compute are packaged together via TSMC’s CoWoS 2.5D packaging.
Named for US computer science pioneer Grace Hopper, the Nvidia Hopper H100 will replace the Ampere A100 as the company’s flagship GPU for AI and scientific workloads. It will offer between 3x and 6x the raw performance of the A100 (4 PFLOPS of FP8 performance, or 60 TFLOPS of FP64). As the first GPU with HBM3 technology, its memory bandwidth is a staggering 3 TB/s, and it’s also the first GPU to support PCIe Gen5. The chip has nearly 5 TB/s of external connectivity. To put this into context, twenty H100 GPUs could sustain the equivalent of the entirety of global internet traffic today.
The new Nvidia Hopper H100 GPU – Nvidia’s new flagship GPU for data center AI and scientific workloads (Source: Nvidia)
The Hopper architecture has a few tricks up its sleeve for AI processing and scientific workloads.
The first is a new transformer engine. Transformer networks, already the de facto standard for natural language processing, are showing promise in many other AI applications, including protein folding and even computer vision. Today, they power many conversational AI applications. The trouble with transformer networks is that they are enormous, with billions or even trillions of parameters, which makes them extremely computationally expensive to train. Training a decent-sized transformer today can take months, depending on the computing power at your disposal.
Nvidia has introduced a new low-precision number format, FP8, for its Hopper tensor cores. The new transformer engine can mix FP16 and FP8 formats to speed up transformer training where appropriate. The challenge is knowing when to switch to lower precision to increase throughput while maintaining the accuracy of the end result. Nvidia has developed strategies that can make this decision dynamically during training.
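The basic idea behind per-tensor FP8 scaling can be illustrated in plain Python. This is a simplified sketch, not Nvidia's implementation: the `quantize_fp8_e4m3` helper is a hypothetical stand-in that mimics FP8 E4M3's 3-bit mantissa and its maximum representable value of 448, which the hardware handles natively.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def quantize_fp8_e4m3(x):
    """Crude simulation of per-tensor FP8 quantization: scale the
    tensor into E4M3's dynamic range, round to a 3-bit mantissa,
    then rescale. Illustration only -- Hopper does this in hardware."""
    scale = E4M3_MAX / max(np.abs(x).max(), 1e-12)
    x_scaled = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)
    # keep only 3 mantissa bits, as E4M3 does
    m, e = np.frexp(x_scaled)
    x_q = np.ldexp(np.round(m * 8) / 8, e)
    return x_q / scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
w_q = quantize_fp8_e4m3(w)
# per-element error is bounded by the mantissa rounding step (<= 12.5%)
print(np.max(np.abs(w - w_q)) / np.max(np.abs(w)))
```

In real mixed-precision training the scale factor is tracked per tensor and updated as value distributions shift during training; the transformer engine's contribution is deciding, layer by layer, when FP8 is accurate enough and when to fall back to FP16.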
Combine the transformer engine with the other improvements Hopper brings, and the result is up to a 9x reduction in time to train transformer networks. In Nvidia's example, training the 395-billion-parameter Mixture of Experts network drops from 7 days on A100 to 20 hours on H100. For inference on Megatron-530B, with its 530 billion parameters, H100 outperforms A100 by as much as 30x.
Time to train the Mixture of Experts transformer network for H100 versus A100 (Source: Nvidia)
Another neat trick is the addition of new instructions to accelerate dynamic programming, a technique used by popular scientific algorithms including Floyd-Warshall (for route optimization) and Smith-Waterman (for DNA sequence alignment), among many others. In general, dynamic programming breaks an algorithm's problem into smaller sub-problems that are easier to solve, then stores the answers to those sub-problems for re-use to avoid having to recalculate them.
Hopper’s DPX instructions are tailored for operations like these. Until now, these workloads have largely been run on CPUs and FPGAs. With the H100, Floyd-Warshall can be run 40x faster than on a CPU.
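To make the sub-problem reuse concrete, here is a minimal plain-Python Floyd-Warshall, the route-optimization algorithm mentioned above. Each pass through `k` reuses the already-computed best paths `d[i][k]` and `d[k][j]` rather than re-deriving them; it is this kind of inner-loop compare-and-update that DPX accelerates in hardware. The four-node graph is a made-up example.

```python
import math

def floyd_warshall(dist):
    """All-pairs shortest paths. dist[i][j] is the direct edge weight
    (math.inf when no edge exists). The triple loop asks: is the path
    i -> k -> j, built from two stored sub-answers, shorter than the
    best i -> j path found so far?"""
    n = len(dist)
    d = [row[:] for row in dist]  # work on a copy of the input matrix
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

INF = math.inf
graph = [
    [0,   5, INF, 10],
    [INF, 0,   3, INF],
    [INF, INF, 0,   1],
    [INF, INF, INF, 0],
]
shortest = floyd_warshall(graph)
print(shortest[0][3])  # 9: the path 0 -> 1 -> 2 -> 3 beats the direct edge of 10
```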
The H100 also features second-generation multi-instance GPU (MIG) technology. MIG allows large data center GPUs to be partitioned into multiple smaller GPU instances, which can run multiple workloads simultaneously on the same chip. Next-generation MIG adds secure multi-tenant configurations for cloud environments, so that each GPU instance's computing power can be securely divided between different users or cloud tenants.
In yet another first, Nvidia claims the H100 is the first GPU with confidential computing capabilities. The idea is to protect sensitive or private data even while it is in use (and therefore decrypted). Today's confidential computing schemes are CPU-based, and therefore impractical for AI or high performance computing (HPC) at scale.
Nvidia’s confidential computing scheme uses hardware and software to create a trusted execution environment through a confidential virtual machine. Data transfers between CPU and GPU, and between GPUs, are encrypted and decrypted at full PCIe line rate. The H100 also has a hardware firewall that secures the workload's memory and compute engines, so that no one other than the owner of the trusted execution environment, who holds the key, can see the data or the code.
The H100 is also the first to use Nvidia's fourth-generation NVLink communications technology. When scaling to multiple GPUs, communication between them is often the bottleneck. A new NVLink switch can create networks of up to 256 H100 GPUs, 32x more than before, with 11x higher bandwidth than Quantum InfiniBand technology.
Superchips and Supercomputers
Nvidia also unveiled several “superchips.” The Grace CPU superchip is a module with two Grace CPU dies on it; the combination is a 144-Arm-core single-socket CPU behemoth with 1 TB/s of memory bandwidth for hyperscale data center AI and scientific computing. This is a class above current data center CPUs on the market. This module consumes 500W.
There is also the Grace Hopper superchip: a Grace CPU plus a Hopper GPU.
Nvidia’s “superchips” combine two Grace CPUs or a Grace CPU and a Hopper GPU (Source: Nvidia)
The enabling technology here is a brand-new memory-coherent chip-to-chip interface, NVLink-C2C, which provides a 900 GB/s link between dies. It can be used at the PCB, multi-chip module (MCM), interposer, or wafer level.
During his GTC keynote, Nvidia CEO Jensen Huang said that NVLink-C2C will be made available to customers and partners who want to build custom chips that connect to Nvidia's platforms. The company said separately that it would support UCIe, the chiplet-to-chiplet standard backed by Intel, AMD, Arm and others, though it didn't say how or when. (UCIe is an emerging open standard intended to enable an off-the-shelf chiplet ecosystem.)
Both the Grace CPU superchip and the Grace Hopper superchip will ship in the first half of next year.
There will, of course, be scaled-up systems based on the H100, including the DGX H100 (eight H100 chips, 0.5 PFLOPS of FP64 compute) and the new DGX SuperPOD, which combines 32 DGX H100 nodes for 1 ExaFLOPS of FP8 AI performance.
As a sister to its A100-based AI supercomputer Selene, Nvidia will build a new supercomputer called Eos, comprising 18 DGX SuperPODs. This 18-ExaFLOPS beast will have 4,608 H100 GPUs, 360 NVLink switches, and 500 Quantum InfiniBand switches. It will be used by Nvidia's AI research teams.
Eos is expected to come online later this year and Nvidia expects it will be the number one AI supercomputer at that time.
>> This article was originally published on our sister site, EE Times.