Overcoming the AI memory bottleneck - Embedded.com

Overcoming the AI memory bottleneck

AI skeptics argue that the inability to accelerate data movement between processor and memory is holding back useful real-world applications.

Skeptics of artificial intelligence have criticized the memory bottleneck that exists in the current technology, arguing that the inability to accelerate the data movement between processor and memory is holding back useful real-world applications.

AI accelerators used to train AI models in data centers require the highest memory bandwidth available. While storing an entire model in a processor would eliminate off-chip memory from the equation, it isn’t a feasible solution, as the largest models measure in the billions or trillions of parameters.

Where yesterday’s systems were memory-constrained, today’s data center architectures use a variety of techniques to overcome memory bottlenecks.

High-bandwidth memory

A popular solution is to use high-bandwidth memory (HBM), which involves connecting a 3D stack of four, eight, or 12 DRAM dies to the processor via a silicon interposer. The latest version of the technology, HBM2E, features faster signaling rates per pin than its predecessor, up to 3.6 Gb/s per pin, thereby boosting bandwidth. Samsung and SK Hynix each offer eight-die HBM2E stacks for a total of 16-GB capacity, providing 460-GB/s bandwidth (compared with 2.4 GB/s for DDR5 and 64 GB/s for GDDR6, according to SK Hynix). HBM3 is set to push speeds and capacities even higher.
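The 460-GB/s figure follows directly from the per-pin signaling rate. A minimal sketch of the arithmetic, assuming the standard 1,024-bit data interface per HBM2E stack (the interface width is not stated above and is supplied here as an assumption):

```python
# Peak bandwidth of one HBM2E stack, derived from its per-pin signaling rate.
# Assumption: 1,024 data pins per stack (standard HBM2E interface width).
HBM_INTERFACE_BITS = 1024


def stack_bandwidth_gb_per_s(pin_rate_gbit_per_s: float,
                             width_bits: int = HBM_INTERFACE_BITS) -> float:
    """Return peak bandwidth in GB/s: pins x Gb/s per pin / 8 bits per byte."""
    return pin_rate_gbit_per_s * width_bits / 8


hbm2e_peak = stack_bandwidth_gb_per_s(3.6)
print(f"HBM2E stack at 3.6 Gb/s per pin: {hbm2e_peak:.1f} GB/s")
```

At 3.6 Gb/s per pin this works out to 460.8 GB/s per stack, in line with the vendors' quoted number.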

The latest version of Nvidia’s flagship data center GPU, the A100, provides 80 GB of HBM2E memory with 2 TB/s of memory bandwidth. The A100 incorporates five 16-GB stacks of DRAM, joining a 40-GB version that uses HBM2 for a total bandwidth of 1.6 TB/s. The difference between the two yields a threefold increase in AI model training speed for the deep-learning recommendation model, a known memory hog.

Meanwhile, data center CPUs are leveraging HBM bandwidth. Intel’s next-generation Xeon data center CPUs, Sapphire Rapids, will introduce HBM to the Xeon family. They are Intel’s first data center CPUs to use new AMX instruction extensions designed specifically for matrix multiplication workloads like AI. They will also be able to use either off-chip DDR5 DRAM or DRAM plus HBM.

“Typically, CPUs are optimized for capacity, while accelerators and GPUs are optimized for bandwidth,” said Arijit Biswas, an Intel senior principal engineer, during a recent Hot Chips presentation. “However, with the exponentially growing model sizes, we see constant demand for both capacity and bandwidth without tradeoffs. Sapphire Rapids does just that by supporting both, natively.”

Nvidia’s A100 data center GPU with six stacks of HBM2E memory (only five stacks are used, for yield reasons) (Source: Nvidia)

The approach is enhanced through memory tiering, “which includes support for software-visible HBM plus DDR, and software transparent caching that uses HBM as a DDR-backed cache,” Biswas added.
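The software-transparent mode Biswas describes can be pictured as a small, fast tier sitting in front of a large, slow one. The toy model below is only a sketch of that idea: the class name `HbmCachedMemory`, the line-granular addressing, and the least-recently-used replacement policy are all assumptions made for illustration, not a description of how Sapphire Rapids manages its HBM.

```python
from collections import OrderedDict


class HbmCachedMemory:
    """Toy model: a small fast tier (HBM) caching lines of a large slow tier (DDR)."""

    def __init__(self, ddr: dict, hbm_lines: int):
        self.ddr = ddr              # backing store: address -> value
        self.hbm = OrderedDict()    # cached lines, least recently used first
        self.hbm_lines = hbm_lines  # capacity of the fast tier, in lines
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        if addr in self.hbm:
            self.hits += 1
            self.hbm.move_to_end(addr)    # mark as most recently used
            return self.hbm[addr]
        self.misses += 1
        value = self.ddr[addr]            # slow path: fetch from DDR
        self.hbm[addr] = value
        if len(self.hbm) > self.hbm_lines:
            self.hbm.popitem(last=False)  # evict the least recently used line
        return value


ddr = {addr: addr * 2 for addr in range(8)}
mem = HbmCachedMemory(ddr, hbm_lines=2)
values = [mem.read(addr) for addr in (0, 1, 0, 2, 1)]
print(values, "hits:", mem.hits, "misses:", mem.misses)
```

The caller only ever issues `read`; whether a value came from the fast or the slow tier is invisible to it, which is the sense in which such caching is transparent to software.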

However, the HBM versions come at the cost of die area, Sapphire Rapids’ chief engineer, Nevine Nassif, told EE Times.

“The [HBM-compatible] die is slightly different,” she said. “There’s also an HBM controller that is different than the DDR5 controller. On the version of Sapphire Rapids without HBM, there’s an area of the die where we added accelerators for crypto, compression, etc. All of those go away — except for the data-streaming accelerator — and the HBM controller goes in instead.

“On top of that, we had to make some changes to the mesh to support the bandwidth requirements of HBM,” Nassif added.

Beyond CPUs and GPUs, HBM is popular for data center FPGAs. For example, Intel’s Stratix and the Xilinx Versal FPGAs come in HBM versions, and some AI ASICs also use it. Tencent-backed data center AI ASIC developer Enflame Technology uses HBM for its DTU 1.0 device, which is optimized for cloud AI training. The 80-TFLOPS (FP16/BF16) chip uses two HBM2 stacks, providing 512-GB/s bandwidth connected through an on-chip network.

Enflame’s DTU 1.0 data center AI accelerator chip has two stacks of HBM2 memory. (Source: Enflame Technology)

Performance per dollar

While HBM offers extreme bandwidth for the off-chip memory needed for data center AI accelerators, a few notable holdouts remain.

Graphcore’s comparison of capacity and bandwidth for different memory technologies. While others try to solve both with HBM2E, Graphcore uses a combination of host DDR memory plus on-chip SRAM on its Colossus Mk2 AI accelerator chip. (Source: Graphcore)

Graphcore is among them. During his Hot Chips presentation, Graphcore CTO Simon Knowles noted that faster computation in large AI models requires both memory capacity and memory bandwidth. While others use HBM to boost both capacity and bandwidth, tradeoffs include HBM’s cost, power consumption, and thermal limitations.

Graphcore’s second-generation intelligence processing unit (IPU) instead uses its 896 MiB of on-chip SRAM to supply the memory bandwidth required to feed its 1,472 processor cores. That is enough to avoid the need for high-bandwidth off-chip DRAM, Knowles said. For capacity, AI models too big to fit on-chip use low-bandwidth remote DRAM in the form of server-class DDR. That DRAM is attached to the host processor, and mid-sized models can be spread over the SRAM in a cluster of IPUs.

Given that the company promotes its IPU on a performance-per-dollar basis, Graphcore’s primary reason to reject HBM appears to be cost.

“The net cost of HBM integrated with an AI processor is greater than 10× the cost of server-class DDR per byte,” Knowles said. “Even at modest capacity, HBM dominates the processor module cost. If an AI computer can use DDR instead, it can deploy more AI processors for the same total cost of ownership.”

According to Knowles, 40 GB of HBM effectively triples the cost of a packaged reticle-sized processor. Graphcore’s cost breakdown of 8 GB of HBM2 versus 8 GB of DDR4 reckons that the HBM die is double the size of a DDR4 die (comparing a 20-nm HBM with an 18-nm DDR4, which Knowles argued are contemporaries), thereby increasing manufacturing costs. Then there is the cost of TSV etching, stacking, assembly, and packaging, along with memory and processor makers’ profit margins.

Graphcore’s cost analysis for HBM2 versus DDR4 memory has the former costing 10× more than the latter. (Source: Graphcore)

“This margin stacking does not occur for the DDR DIMM, because the user can source that directly from the memory manufacturer,” Knowles said. “In fact, a primary reason for the emergence of a pluggable ecosystem of computer components is to avoid margin stacking.”
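Knowles’ two ratios can be combined into a rough estimate of how many more processors the same budget buys. The sketch below is a back-of-the-envelope model only: it treats the "greater than 10×" figure as exactly 10×, assumes equal memory capacity in both cases, and ignores every other contributor to total cost of ownership (host servers, power, networking).

```python
# Relative cost model built only from the two ratios quoted above.
PROCESSOR_COST = 1.0          # packaged processor without memory (normalized unit)
HBM_MODULE_MULTIPLIER = 3.0   # "40 GB of HBM effectively triples the cost"
HBM_TO_DDR_COST_RATIO = 10.0  # "greater than 10x" per byte; 10 is the lower bound

# Cost of the memory alone, in units of one bare processor.
hbm_memory_cost = PROCESSOR_COST * (HBM_MODULE_MULTIPLIER - 1.0)
ddr_memory_cost = hbm_memory_cost / HBM_TO_DDR_COST_RATIO  # same capacity in DDR

hbm_system_cost = PROCESSOR_COST + hbm_memory_cost
ddr_system_cost = PROCESSOR_COST + ddr_memory_cost

# How many DDR-based processors fit in the budget of one HBM-based processor.
processors_per_budget_ratio = hbm_system_cost / ddr_system_cost
print(f"DDR-based processors per HBM-based processor: {processors_per_budget_ratio:.2f}")
```

Under these simplifying assumptions, a DDR-based design fits about 2.5 processors into the budget of one HBM-based processor, which is the shape of the argument Knowles makes.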

Going wider

Emerging from stealth mode at Hot Chips, Esperanto Technologies offered yet another take on the memory bottleneck problem. The company’s 1,000-core RISC-V AI accelerator targets hyperscaler recommendation model inference rather than the AI training workloads mentioned above.

Dave Ditzel, Esperanto’s founder and executive chairman, noted that data center inference does not require huge on-chip memory. “Our customers did not want 250 MB on-chip,” Ditzel said. “They wanted 100 MB — all the things they wanted to do with inference fit into 100 MB. Anything bigger than that will need a lot more.”

Ditzel added that customers prefer large amounts of DRAM on the same card as the processor, not on-chip. “They advised us, ‘Just get everything onto the card once, and then use your fast interfaces. Then, as long as you can get to 100 GB of memory faster than you can get to it over the PCIe bus, it’s a win.’”

Comparing Esperanto’s approach with other data center inference accelerators, Ditzel said that others focus on a single giant processor consuming the entire power budget. Esperanto’s approach — multiple low-power processors mounted on dual M.2 accelerator cards — better enables use of off-chip memory, the startup insists. Single-chip competitors “have a very limited number of pins, so they have to go to things like HBM to get very high bandwidth on a small number of pins — but HBM is really expensive, hard to get, and high-power,” Ditzel said.

Esperanto claims to have solved the memory bottleneck by using six smaller chips rather than a single large chip, leaving pins available to connect to LPDDR4x chips. (Source: Esperanto Technologies)

Esperanto’s multichip approach makes more pins available for communication with off-chip DRAM. Alongside six processor chips, the company uses 24 inexpensive LPDDR4x DRAM chips designed for cellphones, running at low voltage with “about the same energy per bit as HBM,” Ditzel said.

“Because [LPDDR4x] is lower-bandwidth [than HBM], we get more bandwidth by going wider,” he added. “We go to 1,500 bits wide on the memory system on the accelerator card, [while one-chip competitors] cannot afford a 1,500-bit–wide memory system, because for every data pin, you’ve got to have a couple of power and a couple of ground pins, and it’s just too many pins.

“Having dealt with this problem before, we said, ‘Let’s just split it up,’” said Ditzel.

The total memory capacity of 192 GB is accessed via 822-GB/s memory bandwidth. The total across all 64-bit DRAM chips works out to a 1,536-bit–wide memory system, split into 96× 16-bit channels to better handle memory latency. It all fits into a power budget of 120 W.
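The width figures can be checked from the chip count. In the sketch below, the chip count, chip width, channel width, capacity, and bandwidth come from the text; the per-chip capacity and per-pin data rate are derived values, not figures quoted by Esperanto.

```python
DRAM_CHIPS = 24             # LPDDR4x chips on the accelerator card
BITS_PER_CHIP = 64          # data width of each DRAM chip
CHANNEL_BITS = 16           # width of one memory channel
TOTAL_CAPACITY_GB = 192
TOTAL_BANDWIDTH_GB_S = 822

bus_width_bits = DRAM_CHIPS * BITS_PER_CHIP            # total memory system width
channels = bus_width_bits // CHANNEL_BITS              # number of 16-bit channels
capacity_per_chip_gb = TOTAL_CAPACITY_GB / DRAM_CHIPS  # derived, not quoted

# Implied data rate per pin (derived): GB/s -> Gb/s, spread over all data pins.
per_pin_gbit_s = TOTAL_BANDWIDTH_GB_S * 8 / bus_width_bits

print(bus_width_bits, "bits wide,", channels, "channels")
print(f"{capacity_per_chip_gb:.0f} GB per chip, {per_pin_gbit_s:.2f} Gb/s per pin")
```

The result is 1,536 bits and 96 channels, matching the text, with roughly 4.3 Gb/s per pin. That is comparable to HBM2E’s 3.6 Gb/s per pin, so the aggregate bandwidth comes from the number of pins rather than from faster signaling.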

Pipelining weights

Wafer-scale AI accelerator company Cerebras Systems has devised a memory bottleneck solution at the far end of the scale. At Hot Chips, the company announced MemoryX, a memory extension system for its CS-2 AI accelerator system aimed at high-performance computing and scientific workloads. MemoryX seeks to enable training huge AI models with a trillion or more parameters.

Cerebras Systems’ MemoryX, an off-chip memory expansion for its CS-2 wafer-scale engine system, behaves as though it were on-chip. (Source: Cerebras Systems)

MemoryX is a combination of DRAM and flash storage that behaves as if on-chip. The architecture is promoted as elastic and is designed to accommodate between 4 TB and 2.4 PB (200 billion to 120 trillion parameters) — sufficient capacity for the world’s biggest AI models.
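Both ends of that range imply the same storage budget per parameter. A quick check, assuming decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes):

```python
TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte

# (capacity in bytes, parameter count) for the two endpoints quoted above
small_config = (4 * TB, 200 * 10**9)
large_config = (24 * PB // 10, 120 * 10**12)  # 2.4 PB kept in integer arithmetic

bytes_per_param_small = small_config[0] / small_config[1]
bytes_per_param_large = large_config[0] / large_config[1]

print(bytes_per_param_small, bytes_per_param_large)
```

Each endpoint works out to 20 bytes per parameter, so capacity scales linearly with model size across the range.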

To make its off-chip memory behave as if on-chip, Cerebras optimized MemoryX to stream parameter and weight data to the processor in a way that eliminates the impact of latency, said Sean Lie, the company’s co-founder and chief hardware architect.

“We separated the memory from the compute, fundamentally disaggregating them,” he said. “And by doing so, we made the communication elegant and straightforward. The reason we can do this is that neural networks use memory differently for different components of the model. So we can design a purpose-built solution for each type of memory and for each type of compute.”

As a result, those components are untangled, thereby “simplify[ing] the scaling problem,” said Lie.

During training, latency-sensitive activation memory must be immediately accessed. Hence, Cerebras kept activations on-chip.


Cerebras uses pipelining to remove latency-sensitive communication during AI training. (Source: Cerebras Systems)

Cerebras stores weights on MemoryX, then streams them onto the chip as required. Weight memory is used relatively infrequently without back-to-back dependencies, said Lie. This can be leveraged to avoid latency and performance bottlenecks. Coarse-grained pipelining also avoids dependencies between layers; weights for a layer start streaming before the previous layer is complete.

Meanwhile, fine-grained pipelining avoids dependencies between training iterations; weight updates in the backward pass are overlapped with the subsequent forward pass of the same layer.

“By using these pipelining techniques, the weight-streaming execution model can hide the extra latency from external weights, and we can hit the same performance as if the weights were [accessed] locally on the wafer,” Lie said.
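The latency-hiding effect of streaming the next layer’s weights while the current layer computes can be illustrated with a simple timing model. This is a sketch under stated assumptions, not Cerebras’ implementation: the per-layer fetch and compute times are hypothetical, weights arrive over a single serial link, and on-chip buffering is assumed sufficient to hold prefetched weights.

```python
def serial_time(fetch, compute):
    """Each layer waits for its own weights, then computes: no overlap."""
    return sum(f + c for f, c in zip(fetch, compute))


def pipelined_time(fetch, compute):
    """Weights for later layers stream in while earlier layers compute."""
    fetch_done = 0.0    # time at which the current layer's weights have arrived
    compute_done = 0.0  # time at which the previous layer finished computing
    for f, c in zip(fetch, compute):
        fetch_done += f                        # fetches queue back to back on the link
        start = max(fetch_done, compute_done)  # need both the weights and a free processor
        compute_done = start + c
    return compute_done


# Hypothetical per-layer times in arbitrary units.
fetch_times = [2, 2, 2, 2]
compute_times = [3, 3, 3, 3]

t_serial = serial_time(fetch_times, compute_times)
t_pipelined = pipelined_time(fetch_times, compute_times)
print("serial:", t_serial, "pipelined:", t_pipelined)
```

With these numbers the serial schedule takes 20 units and the pipelined one 14: only the first layer’s fetch is exposed, and every later fetch hides behind compute. The hiding holds as long as each fetch is no longer than the compute it overlaps.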

This article was originally published on EE Times.

This article was originally published on our sister site, EE Times Europe.
