Coinciding with the Hot Chips conference, startup Esperanto emerged from stealth mode this week with the highest-performance commercial RISC-V chip to date: a thousand-core AI accelerator designed for hyper-scale data centers. While the chip can run across a range of voltage and power profiles between 10 W and 60 W, its “sweet spot” is 20 W per chip, a configuration that allows six chips to be mounted on a Glacier Point accelerator card while keeping total consumption under 120 W. Total performance from six chips is approximately 800 TOPS.
Esperanto’s ET-SoC-1 is billed as having the most RISC-V cores ever built on a single chip: 1,093. The count includes 1,088 ET-Minion custom RISC-V cores that serve as energy-efficient AI acceleration engines. Also included are four ET-Maxion RISC-V cores and a RISC-V service processor. The entire design is geared toward energy efficiency.
Ahead of Hot Chips, EE Times spoke with industry veteran Dave Ditzel, Esperanto’s founder and executive chairman. (Ditzel’s credentials include co-authoring with David Patterson the seminal paper, “The Case for the Reduced Instruction Set Computer” published in 1980.)
Dave Ditzel (Source: Esperanto)
“We are the first to put a thousand RISC-V cores on a single chip,” Ditzel said. “People have talked about many-core CPUs for years, but we haven’t seen a lot of that. Most of the RISC-V stuff that’s out there is for embedded.
“We said, ‘Let’s show ‘em that RISC-V can do high-end… We’ll show ‘em what really seasoned CPU designers can do here’.”
Ditzel’s team of CPU designers was able to tease out details of hyper-scale data center operators’ requirements.
“They did not want a training chip, they don’t have a problem with training,” Ditzel said. AI training is often an offline problem, and hyper-scalers’ huge x86 CPU capacity is not always at peak load. Hence, that capacity can be used for training when available. “Their real problem is inference,” Ditzel added. “That’s what drives their advertising. They need an answer in 10 milliseconds or less.”
Hence, accelerating the recommendation inference engine for online advertising became a focus of the data center chip. Hyper-scalers’ requirements for accelerating this type of model were fairly explicit.
“Our customers wanted 100 megabytes of memory on-chip – all the things they wanted to do with inference fit into 100 megabytes,” he said. Customers also wanted an external interface for off-chip memory. “The real issue is how much you can hold on the accelerator card,” Ditzel explained. “Think of the card as the unit of compute, not the chip. Once you can get memory on the card, you can access things much faster than going across the PCIe bus to the host.”
Esperanto fit six dual M.2 cards, each with one chip, onto a Glacier Point accelerator card. (Source: Esperanto)
The on-chip memory system has L1, L2, and L3 caches and a full main memory system with register files, totaling just over 100 MB. The on-card memory system, at around 100 GB, can hold most of a model’s weights and activations.
Recommendation models are notoriously difficult to accelerate, which is among the reasons they still run on existing CPU servers.
“When you’re picking out of 100 million customers and what they’ve been buying recently, you’ve got to access this… memory on the card, and you’re doing all kinds of random memory accesses, so caches don’t work. You really need more of a classic computer,” Ditzel said. “x86 servers handle good amounts of memory and they have pre-fetching, and general purpose CPUs handle that workload very well. It’s been tough for any accelerators to break into the recommendation business because of that.”
Also required is support for the INT8, FP16, and FP32 data types. The floating-point requirement stems both from the need to maintain the highest possible prediction accuracy and from customers’ reluctance to port or rewrite programs for lower-precision math. Ditzel said leading x86 server chip makers only recently added 8-bit vector extensions to server CPUs.
“Most of the inference going on in [a hyper-scale data center] on their million x86 servers is still 32-bit float,” he said.
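To illustrate why porting FP32 inference to lower precision is non-trivial, the sketch below shows a generic symmetric INT8 quantization scheme. This is an illustrative example only, not Esperanto’s or any hyper-scaler’s actual method, and the function names are hypothetical:

```python
# Generic symmetric INT8 quantization sketch (not Esperanto-specific).
# FP32 values are mapped onto the 8-bit integer range and back,
# showing the rounding error that a port to lower precision introduces.

def quantize_int8(values):
    """Map FP32 values to INT8 codes with a single symmetric scale factor."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from INT8 codes."""
    return [x * scale for x in q]

weights = [0.91, -0.42, 0.003, 0.67]   # hypothetical model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)         # [127, -59, 0, 94]
print(restored)  # approximate originals, each off by up to half a step
```

Every quantized value carries up to half a quantization step of rounding error; managing that accuracy loss across a large model is the work hyper-scalers would rather avoid by keeping FP32 support.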
Esperanto’s chip on a dual M.2 card is designed to fit into accelerator slots within existing x86 CPU server infrastructure. That results in a power limit of 120 W, requiring air cooling.
Ditzel said Esperanto’s design doesn’t compete directly with internal efforts such as Google TPUs or Amazon Web Services’ Inferentia. Hyper-scalers “are trying to get the whole community to build accelerator chips for them. A lot of these companies believe in open computing and the [Open Compute Project].” Hence, “they buy OCP servers and they would like standardized stuff to go in there. If there’s competition, they love it… they are trying to encourage competition and show people what’s possible.”
Still, the startup insists big data center operators need external suppliers for accelerator chips. “It’s still always a make-versus-buy decision.” For example, one Esperanto customer lacked access to internally developed chips being used by another division. “If you beat what they have, entry into any one of these companies is possible.”
Esperanto has taken the opposite approach from competitors’ giant, power-hungry accelerators, offering a lower-power chip that can be deployed in multiples. The approach also addresses memory bandwidth requirements, since more pins can be devoted to memory I/O without resorting to expensive HBM.
Esperanto’s hardware is also designed as a general-purpose computer; despite the focus on recommendation models, the chip can accelerate parallel processing, according to Ditzel. A six-chip accelerator card includes about 6,000 parallel cores, and each core can execute two threads, which can be “thrown at any arbitrary problem.”
Another trick up Esperanto’s sleeve is an aggressively energy-efficient design. Customer requirements set the total power budget at 120 W, and the Glacier Point card had room for at most six chips, working out to 20 W per chip. By comparison, other AI inference accelerators operate at more than ten times that power.
Esperanto addressed the issue from several angles. Clock frequency was reduced to an optimal level of about 1 GHz. Supply voltage was lowered to around 0.4 V, below the normal operating limit of standard SRAMs. Switching capacitance was cut by using lean RISC-V cores with the smallest commercially viable instruction set, reducing the number of transistors. And an advanced but stable process technology, TSMC 7nm, was chosen.
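The leverage behind these choices follows from the standard dynamic-power relation for CMOS logic, P ≈ C·V²·f. The baseline voltage and frequency below are illustrative assumptions for a conventional high-clocked design, not Esperanto’s published figures:

```python
# Dynamic CMOS power scales as P ~ C * V^2 * f.
# The 0.9 V / 2 GHz baseline is an assumed conventional design point;
# 0.4 V / 1 GHz matches the operating point described in the article.

def dynamic_power(c_rel, volts, freq_ghz):
    """Relative dynamic power for switched capacitance c_rel (arbitrary units)."""
    return c_rel * volts**2 * freq_ghz

baseline = dynamic_power(1.0, 0.9, 2.0)  # hypothetical high-voltage, 2 GHz design
low_v    = dynamic_power(1.0, 0.4, 1.0)  # ~0.4 V near 1 GHz
print(baseline / low_v)  # => 10.125, i.e. ~10x less power from V^2 and f alone
```

Because voltage enters the equation squared, halving it buys roughly a 4x power reduction before the frequency and capacitance savings are even counted.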
Esperanto identified a “sweet spot” for operation at around 1 GHz. (Source: Esperanto)
Esperanto’s chip includes 1,088 ET-Minion cores, which process the AI workload. The cores are 64-bit, in-order RISC-V processors, with Esperanto’s own AI-optimized vector and tensor unit taking up much of the chip real estate. Floating-point MACs dominate the configuration; unusually, integer MACs have twice the processing width of floating point (per customer requirements, Ditzel noted). Also supported are vector transcendental instructions, such as the sigmoid functions common in deep learning models. Because the cores run in a single low-voltage domain, the small L1 cache uses SRAM cells with extra transistors to ensure robust operation at low voltage.
Esperanto’s chip contains 1,088 ET-Minion cores (Source: Esperanto)
Each core is capable of 128 GOPS per GHz. A custom multi-cycle tensor instruction performs large matrix multiplications: a separate controller takes over and runs for up to 512 cycles using the full 512-bit width, allowing a single tensor instruction to perform more than 64,000 arithmetic operations before the controller fetches the next RISC-V instruction. Because the bulk of the workload uses the tensor instruction, instruction fetch bandwidth drops to as little as one instruction per 512 clock cycles.
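The figures quoted above can be sanity-checked directly: 128 operations per clock sustained over 512 cycles gives 65,536 operations per tensor instruction, matching the “more than 64,000” claim:

```python
# Back-of-the-envelope check of the tensor instruction's throughput,
# using only the figures stated in the text.

OPS_PER_CYCLE = 128        # 128 GOPS per GHz = 128 operations per clock
MAX_TENSOR_CYCLES = 512    # the controller runs up to 512 cycles per instruction

ops_per_tensor_instruction = OPS_PER_CYCLE * MAX_TENSOR_CYCLES
print(ops_per_tensor_instruction)  # 65536 -- "more than 64,000" operations
```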
Eight ET-Minion cores constitute a “neighborhood,” and modified instructions take advantage of their physical proximity. Another feature called “cooperative loads” allows cores to transfer data directly from each other without a cache fetch. That configuration saves power. The eight cores also share a large L2 cache for energy efficiency.
Zooming out again, four eight-core neighborhoods make up a “Minion Shire,” with 34 shires per chip, for a total of 1,088 cores. (Ditzel said the chip can also compute with only 1,024 cores to improve yield.) Four ET-Maxion cores, each with performance roughly comparable to an Arm Cortex-A72, are intended for future standalone operation rather than the current accelerator configuration.
Threshold voltage variation is mitigated by providing each shire its own voltage supply, so that individual voltages can be fine-tuned.
Each chip has four 64-bit DDR interfaces, and each interface actually comprises four 16-bit channels, for 16 channels per chip and a total of 96 16-bit channels across the six-chip card. The design uses LPDDR4x, developed as low-power memory for smartphones. Energy per bit is roughly equivalent to HBM, but the aggregate 1,536-bit-wide memory interface on the six-chip accelerator card yields higher total memory bandwidth.
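The channel arithmetic from these figures works out as follows:

```python
# Channel and bus-width arithmetic from the article's figures: four 64-bit
# LPDDR4x interfaces per chip, each made up of four 16-bit channels, with
# six chips per Glacier Point card.

INTERFACES_PER_CHIP = 4
CHANNELS_PER_INTERFACE = 4
CHANNEL_WIDTH_BITS = 16
CHIPS_PER_CARD = 6

channels_per_chip = INTERFACES_PER_CHIP * CHANNELS_PER_INTERFACE  # 16
channels_per_card = channels_per_chip * CHIPS_PER_CARD            # 96
card_bus_width = channels_per_card * CHANNEL_WIDTH_BITS           # 1,536 bits
print(channels_per_card, card_bus_width)  # 96 1536
```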
Esperanto mounted its chips on dual M.2 cards; six fit onto an OCP Glacier Point v2 accelerator card (three on the front, three on the back). That delivers about 800 TOPS with the chips running at 1 GHz. The chips can also be mounted on low-profile (half-height, half-length) PCIe cards that raise each chip’s power budget to around 60 W, and can operate anywhere between 300 MHz and 2 GHz, depending on the application.
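A back-of-the-envelope calculation from the article’s own figures lands slightly above the quoted number, consistent with “approximately 800 TOPS”:

```python
# Rough peak-throughput estimate from the figures in the article: 1,088
# ET-Minion cores per chip, 128 GOPS per core per GHz, a 1 GHz clock, and
# six chips per Glacier Point card.

CORES_PER_CHIP = 1088
GOPS_PER_CORE_PER_GHZ = 128
CLOCK_GHZ = 1.0
CHIPS_PER_CARD = 6

chip_tops = CORES_PER_CHIP * GOPS_PER_CORE_PER_GHZ * CLOCK_GHZ / 1000  # ~139 TOPS
card_tops = chip_tops * CHIPS_PER_CARD                                 # ~835 TOPS
print(round(chip_tops), round(card_tops))  # 139 836
```

The ~835 TOPS peak figure is in line with the “approximately 800 TOPS” the company quotes for the six-chip card at 1 GHz.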
Based on hardware emulation results, Ditzel asserted six Esperanto chips on a Glacier Point card can outperform competitors. The startup’s advantage is pronounced for recommendation benchmarks when the memory system design and performance-per-watt figures are considered, a consequence of the focus on a low-voltage design.
A future product could include a scaled-down version of ET-SoC-1 for edge applications. Ditzel said the current version should launch in “the next couple of months.”
>> This article was originally published on our sister site, EE Times.