As semiconductor design teams race to capitalize on “More than Moore,” new architecture options and challenges are blossoming. Take hyperscaler hardware, where a spectrum of workloads – database analytics, AI, microservices, video encoding, and high-complexity computational algorithms – demands a spectrum of processor solutions. Performance, power, and cost are still critical, but now architects also own delivery. There is no one “best” architecture; processors must be designed to best serve specific classes of workload and price/performance profiles.
Architecture challenges in multi-core
The AWS Graviton2 has 64 Arm Neoverse N1 cores tiled in a coherent mesh network on a single die. Other designs have already extended to multi-die with cache-coherent connections between die. Multi-die implementations open up room for further growth and potential for reducing cost on less advanced processes. While these new architecture options expand possibilities, they also create new design challenges. Among many choices, which architectures are really going to deliver higher throughput for the right workloads at the right price-points?
One question here is how distributed system cache in a coherent mesh network should be partitioned against physical memory for a target class of applications. Optimizing these choices, and even choosing which CPU cores best meet the need, requires running real workloads with cycle-level accuracy. High-level models simply aren’t accurate enough for this purpose.
Figure: Differing I/O latencies in a multi-die implementation. (Source: Cadence)
Communication latencies across an array of processors in a coherent mesh will be relatively uniform within a single die, but latencies can vary significantly between die in a multi-die implementation (see Figure). As designs grow, a variety of architectures could be employed: fully connected meshes, hub-and-spoke memory systems, and others, in both 2D and 3D structures in which one chiplet provides a big system cache and main memory access, while the other chiplets in the stack communicate with one another and with main memory through that hub.
Exploring all these options effectively depends very much on accurately modeling performance against realistic workloads, and modeling at that fidelity is only possible in the RTL domain, using emulation and prototyping.
A different kind of problem for a server architect is OS compatibility. On most laptops, you can boot any Linux distribution, hypervisor, or Windows straight out of the box. On Arm-based servers, that responsibility is divided between the server builders and Arm.
Arm has developed a compliance suite called SystemReady that standardizes a set of minimum requirements to address this and other compatibility issues. PCIe compliance is an especially important component because it directly delivers or underlies the primary I/O for many server interface protocols for fast storage, fast networking, and off-die coherent interfaces. Of special importance here is remote server boot over PCIe. Arm provides this compliance suite as software running on the UEFI (BIOS) layer. Cadence has been working with Arm for a couple of years to reduce the tests to a minimal bare-metal test suite with a PCIe traffic generation library, which emulates faster than the UEFI test suite for quick turn-around hardware debug.
One more challenge for server developers is that PCIe uses a strongly ordered memory model. Arm supports a loosely ordered memory model, which is allowed under the standard. But only strong ordering guarantees no deadlocks; under loose ordering, the hardware/firmware developer must provide that guarantee. Unfortunately, this can’t be checked through compliance testing. The integrator must prove the design is deadlock-safe through extensive use-case testing, again on an emulator or prototyping system.
An approach using Cadence System verification IP enables engineers to get a system-level test suite up and running within half a day that can validate PCIe integration against SystemReady requirements. This methodology can also be used to demonstrate boot of SUSE Linux and Windows from a flash memory device model connected to PCIe, which is creating a lot of interest in the advanced server community.
Not just for servers
Arm Neoverse platforms aren’t just designed for high-end servers. The family is already moving into other cloud applications and communications infrastructure, all the way out to the edge. In some of these applications, multi-core architectures are already important. In most such applications (automotive for example), out-of-the-box support for a range of open and commercial OSes is essential.
I believe that tools to automate the generation of system-level content and verify compliance to system-level objectives will have broad applicability across many markets. The EDA industry needs to think beyond the traditional single-interface single-protocol scope of Verification IP (VIP) towards a new era of multi-interface multi-protocol system-level VIP.
Paul Cunningham is senior vice president and general manager of the system verification group at Cadence Design Systems. His product responsibilities include logic simulation, emulation, prototyping, formal, VIP, and debug. Prior to this, he was responsible for Cadence’s frontend digital design tools, including logic synthesis and design-for-test. Paul joined Cadence in 2011 through the acquisition of Azuro, a startup developing concurrent physical optimization and useful-skew clock tree synthesis technologies, where he was a co-founder and CEO. Paul holds a Master’s Degree and a Ph.D. in Computer Science from the University of Cambridge, UK.