
How designers are taking on AI’s memory bottleneck

Where yesterday’s systems were memory-constrained, today’s data center architectures use a variety of techniques to overcome memory bottlenecks.

Among the criticisms that skeptics aim at current AI technology is that the memory bottleneck – caused by the inability to accelerate data movement between processor and memory – is holding back useful real-world applications.

AI accelerators used to train AI models in data centers require the highest memory bandwidth available. In an ideal world, an entire model could be stored in a processor, an approach that would eliminate off-chip memory from the equation. That’s not possible given that the largest models measure in the billions or trillions of parameters.


High-bandwidth memory

A popular solution is to use high bandwidth memory (HBM), which involves connecting a 3D stack of 4, 8 or 12 DRAM die to the processor via a silicon interposer. The latest version of the technology, HBM2E, features faster signaling rates per pin than its predecessor, up to 3.6 Gb/s per pin, thereby boosting bandwidth. Samsung and SK Hynix each offer eight-die HBM2E stacks for a total of 16 GB capacity, providing 460 GB/s bandwidth (this compares to 2.4 GB/s for DDR5 and 64 GB/s for GDDR6, SK Hynix says). HBM3 is set to push speeds and capacities even higher.
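The quoted per-stack figure follows directly from HBM's signaling rate and interface width. A quick back-of-the-envelope check (the 1,024-bit per-stack interface is a standard HBM parameter, not stated in the text):

```python
# Back-of-the-envelope HBM2E bandwidth check using figures from the text.
PINS_PER_STACK = 1024   # HBM presents a 1,024-bit-wide interface per stack
GBPS_PER_PIN = 3.6      # HBM2E signaling rate, Gb/s per pin

stack_bw_gbytes = PINS_PER_STACK * GBPS_PER_PIN / 8  # bits -> bytes
print(f"Per-stack bandwidth: {stack_bw_gbytes:.1f} GB/s")
# Per-stack bandwidth: 460.8 GB/s
```

That 460.8 GB/s matches the 460 GB/s figure SK Hynix quotes for an HBM2E stack.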

Nvidia’s A100 data center GPU with six stacks of HBM2E memory (only five stacks are used, for yield reasons) (Source: Nvidia)

The latest version of Nvidia’s flagship data center GPU, the A100, provides 80 GB of HBM2E memory with 2 TB/s of memory bandwidth. The A100 incorporates five 16-GB stacks of DRAM, joining a 40-GB version that uses HBM2 for a total bandwidth of 1.6 TB/s. The difference between the two yields a three-fold increase in AI model training speed for the deep learning recommendation model (DLRM), a known memory hog.

Meanwhile, data center CPUs are leveraging HBM bandwidth. Intel’s next-generation Xeon data center CPUs, Sapphire Rapids, will introduce HBM to the Xeon family. They are Intel’s first data center CPUs to use new AMX instruction extensions designed specifically for matrix multiplication workloads like AI. They will also be able to use either off-chip DDR5 DRAM, or DRAM plus HBM.

“Typically CPUs are optimized for capacity, while accelerators and GPUs are optimized for bandwidth,” said Arijit Biswas, an Intel senior principal engineer, during a recent Hot Chips presentation. “However, with the exponentially growing model sizes, we see constant demand for both capacity and bandwidth without tradeoffs. Sapphire Rapids does just that by supporting both, natively.” The approach is further enhanced through memory tiering, “which includes support for software-visible HBM plus DDR, and software transparent caching that uses HBM as a DDR-backed cache,” Biswas added.
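The software-transparent tier Biswas describes behaves like a cache: hot lines live in HBM, while larger DDR backs everything else. A toy direct-mapped model of that behavior (class name, sizes, and replacement policy are illustrative assumptions, not Intel's implementation):

```python
# Toy model of software-transparent memory tiering: fast HBM acting as a
# direct-mapped cache in front of larger DDR. Hypothetical sketch only.
class TieredMemory:
    def __init__(self, hbm_lines, line_bytes=64):
        self.hbm_lines = hbm_lines      # how many cache lines fit in HBM
        self.line_bytes = line_bytes
        self.tags = {}                  # set index -> tag resident in HBM
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        index, tag = line % self.hbm_lines, line // self.hbm_lines
        if self.tags.get(index) == tag:
            self.hits += 1              # served at HBM bandwidth
        else:
            self.misses += 1            # fetched from DDR, cached in HBM
            self.tags[index] = tag

mem = TieredMemory(hbm_lines=4)
for addr in [0, 64, 0, 512, 0]:  # lines 0, 1, 0 again, 8 (evicts 0), 0
    mem.access(addr)
print(mem.hits, mem.misses)  # 1 4
```

The point of the transparent mode is that software sees one flat address space; only the hit rate, not the programming model, changes.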

The HBM versions come at the cost of die area, however, Sapphire Rapids’ chief engineer Nevine Nassif told EE Times.

“The [HBM-compatible] die is slightly different,” Nassif noted. “There’s also an HBM controller that is different than the DDR5 controller. On the version of Sapphire Rapids without HBM, there’s an area of the die where we added accelerators for crypto, compression, etc. All of those go away–except for the data streaming accelerator–and the HBM controller goes in instead. On top of that, we had to make some changes to the mesh to support the bandwidth requirements of HBM,” she added.

Beyond CPUs and GPUs, HBM is also popular for data center FPGAs. For example, Intel’s Stratix and the Xilinx Versal FPGAs come in HBM versions, and some AI ASICs also use it. Tencent-backed data center AI ASIC developer Enflame uses HBM for its DTU 1.0 device, which is optimized for cloud AI training. The 80 Tflops (FP16/BF16) chip uses two HBM2 stacks, providing 512 GB/s bandwidth connected through an on-chip network.

Enflame’s DTU 1.0 data center AI accelerator chip has two stacks of HBM2 memory (Source: Enflame)

Performance per dollar

While HBM offers extreme bandwidth for the off-chip memory needed for data center AI accelerators, there remain a few notable holdouts.

Graphcore is among them. During his Hot Chips presentation, Graphcore CTO Simon Knowles noted that faster computation in large AI models requires both memory capacity and memory bandwidth. While others use HBM to boost both capacity and bandwidth, tradeoffs include HBM’s cost, power consumption and thermal limitations.

Graphcore’s comparison of capacity and bandwidth for different memory technologies. While others try to solve both with HBM2E, Graphcore uses a combination of host DDR memory plus on-chip SRAM on its Colossus Mk2 AI accelerator chip (Source: Graphcore)

Graphcore’s second-generation intelligence processing unit (IPU) instead uses its large 896-MiB on-chip SRAM to supply the memory bandwidth required to feed its 1,472 processor cores. That’s enough to avoid the higher bandwidth that offloading to DRAM would otherwise demand, Knowles said. To support memory capacity, AI models too big to fit on-chip use low-bandwidth remote DRAM in the form of server-class DDR attached to the host processor, allowing mid-size models to be spread over the SRAM in a cluster of IPUs.

Given that the company promotes its IPU on a performance-per-dollar basis, Graphcore’s primary reason to reject HBM appears to be cost.

“The net cost of HBM integrated with an AI processor is greater than 10x the cost of server class DDR, per byte,” he said. “Even at modest capacity, HBM dominates the processor module cost. If an AI computer can use DDR instead, it can deploy more AI processors for the same total cost of ownership.”

Graphcore’s cost analysis for HBM2 vs. DDR4 memory has the former costing 10 times more than the latter. (Source: Graphcore)

According to Knowles, 40 GB of HBM effectively triples the cost of a packaged reticle-sized processor. Graphcore’s cost breakdown of 8 GB of HBM2 versus 8 GB of DDR4 reckons the HBM die is double the size of a DDR4 die (comparing a 20-nm HBM to an 18-nm DDR4, which Knowles argued are contemporaries), thereby increasing manufacturing costs. Then there is the cost of TSV etching, stacking, assembly and packaging, along with memory and processor makers’ profit margins.
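Knowles's "triples the cost" figure is consistent with his greater-than-10x-per-byte claim. A rough model (all dollar values below are hypothetical placeholders; only the 10x ratio and the 40-GB capacity come from the talk):

```python
# Illustrative cost model based on Knowles's figures. Dollar values are
# made up; only the ~10x-per-byte HBM/DDR ratio is from the source.
ddr_cost_per_gb = 5.0                    # assumed server-class DDR price
hbm_cost_per_gb = 10 * ddr_cost_per_gb   # ">10x the cost ... per byte"
processor_cost = 1000.0                  # assumed packaged reticle-sized die

module_with_hbm = processor_cost + 40 * hbm_cost_per_gb
print(module_with_hbm / processor_cost)  # 3.0 -> roughly tripled
```

With these placeholder numbers, 40 GB of HBM alone costs twice the processor, so the packaged module lands at about 3x — the "HBM dominates the processor module cost" point.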

“This margin stacking does not occur for the DDR DIMM, because the user can source that directly from the memory manufacturer,” Knowles said. “In fact, a primary reason for the emergence of a pluggable ecosystem of computer components is to avoid margin stacking.”

Going wider

Emerging from stealth mode at Hot Chips, Esperanto offered yet another take on the memory bottleneck problem. The company’s 1,000-core RISC-V AI accelerator targets hyperscaler recommendation model inference rather than the AI training workloads mentioned above.

Dave Ditzel, Esperanto’s founder and executive chairman, noted that data center inference does not require huge on-chip memory. “Our customers did not want 250 MB on-chip,” Ditzel said. “They wanted 100 MB – all the things they wanted to do with inference fit into 100 MB. Anything bigger than that will need a lot more.”

Ditzel added that customers prefer large amounts of DRAM on the same card as the processor, not on-chip. “They advised us: ‘Just get everything onto the card once and then use your fast interfaces. Then, as long as you can get to 100 GB of memory faster than you can get to it over the PCIe bus, it’s a win’.”

Comparing Esperanto’s approach to other data center inference accelerators, Ditzel said others focus on a single giant processor consuming the entire power budget. Esperanto’s approach – multiple low-power processors mounted on dual M.2 accelerator cards – better enables use of off-chip memory, the startup insists. Single-chip competitors “have a very limited number of pins, and so they have to go [to] things like HBM to get very high bandwidth on a small number of pins, but HBM is really expensive, and hard to get, and high power,” he said.

Esperanto claims to have solved the memory bottleneck by using six smaller chips rather than a single large chip, leaving pins available to connect to LPDDR4x chips (Source: Esperanto)

Esperanto’s multi-chip approach makes more pins available for communication with off-chip DRAM. Alongside six processor chips, the company uses 24 inexpensive LPDDR4x DRAM chips designed for cellphones, running at low voltage with “about the same energy per bit as HBM,” Ditzel said.

“Because [LPDDR4x] is lower bandwidth [than HBM], we get more bandwidth by going wider,” he added. “We go to 1,500 bits wide on the memory system on the accelerator card [while one-chip competitors] cannot afford a 1,500-bit-wide memory system because, for every data pin, you’ve got to have a couple of power and a couple of ground pins, and it’s just too many pins.

“Having dealt with this problem before, we said, Let’s just split it up.”

The total memory capacity of 192 GB is accessed via 822 GB/s of memory bandwidth. Across the 24 64-bit DRAM chips, that works out to a 1,536-bit-wide memory system, split into 96 16-bit channels to better handle memory latency. It all fits into a power budget of 120 W.
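Those figures hang together arithmetically, and the implied per-pin rate falls right at a standard LPDDR4x speed grade:

```python
# Sanity-check Esperanto's memory-system numbers from the text.
CHIPS, BITS_PER_CHIP = 24, 64
bus_bits = CHIPS * BITS_PER_CHIP           # total memory bus width
channels = bus_bits // 16                  # split into 16-bit channels
per_pin_mtps = 822e9 * 8 / bus_bits / 1e6  # implied per-pin rate, MT/s

print(bus_bits, channels, round(per_pin_mtps))  # 1536 96 4281
```

The ~4,281 MT/s implied per pin is essentially the LPDDR4x-4266 speed bin, which fits the claim of cellphone-class DRAM run wide rather than fast.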

Pipelining weights

Wafer-scale AI accelerator company Cerebras has devised a memory bottleneck solution at the far end of the scale. At Hot Chips, the company announced MemoryX, a memory extension system for its CS-2 AI accelerator system aimed at high-performance computing and scientific workloads. MemoryX seeks to enable the training of huge AI models with a trillion or more parameters.

Cerebras’ MemoryX system is an off-chip memory expansion for its CS-2 wafer-scale engine system, which behaves as though it were on-chip (Source: Cerebras)

MemoryX is a combination of DRAM and flash storage which behaves as if on-chip. The architecture is promoted as elastic, and is designed to accommodate between 4 TB and 2.4 PB (200 billion to 120 trillion parameters), sufficient capacity for the world’s biggest AI models.
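The two endpoints of that range are internally consistent: both imply the same storage budget per parameter.

```python
# Consistency check on MemoryX's stated range: capacity / parameter count
# should agree at both ends of the range given in the text.
low = 4e12 / 200e9       # 4 TB for 200 billion parameters
high = 2.4e15 / 120e12   # 2.4 PB for 120 trillion parameters

print(round(low), round(high))  # 20 20 -> ~20 bytes per parameter
```

Roughly 20 bytes per parameter is plausible for training, where optimizer state and gradients are stored alongside the weights themselves.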

To make its off-chip memory behave as if on-chip, Cerebras optimized MemoryX to stream parameter and weight data to the processor in a way that eliminates the impact of latency, said Sean Lie, Cerebras’ co-founder and chief hardware architect.

“We separated the memory from the compute, fundamentally disaggregating them,” he said. “And by doing so, we made the communication elegant and straightforward. The reason we can do this is that neural networks use memory differently for different components of the model. So we can design a purpose-built solution for each type of memory, and for each type of compute….”

As a result, those components are untangled, thereby “simplify[ing] the scaling problem,” said Lie.

Cerebras uses pipelining to remove latency-sensitive communication during AI training. (Source: Cerebras)

During training, latency-sensitive activation memory must be immediately accessed. Hence, Cerebras kept activations on-chip.

Cerebras stores weights on MemoryX, then streams them onto the chip as required. Weight memory is used relatively infrequently without back-to-back dependencies, said Lie. This can be leveraged to avoid latency and performance bottlenecks. Coarse-grained pipelining also avoids dependencies between layers; weights for a layer start streaming before the previous layer is complete.

Meanwhile, fine-grained pipelining avoids dependencies between training iterations; weight updates in the backward pass are overlapped with the subsequent forward pass of the same layer.
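The coarse-grained idea — stream the next layer's weights while the current layer computes — can be sketched with a simple timeline model (timings and function name are hypothetical, not Cerebras code):

```python
# Minimal sketch of coarse-grained weight-streaming overlap. Streaming
# layer i+1's weights is hidden under layer i's compute whenever
# stream_time <= compute_time. All timings are made-up illustrations.
def training_pass(layers, stream_time, compute_time):
    t = stream_time  # the first layer's weights must arrive before compute
    for i in range(layers):
        # weights for layer i+1 stream concurrently with layer i's compute
        t += max(compute_time, stream_time if i < layers - 1 else 0)
    return t

overlapped = training_pass(10, stream_time=2, compute_time=5)
serial = 10 * (2 + 5)  # naive: stream, then compute, layer by layer
print(overlapped, serial)  # 52 70
```

Only the first layer's streaming latency is exposed; every later transfer hides behind compute, which is Lie's point about matching on-wafer performance.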

“By using these pipelining techniques, the weight streaming execution model can hide the extra latency from external weights, and we can hit the same performance as if the weights were [accessed] locally on the wafer,” Lie said.

>> This article was originally published on our sister site, EE Times.
