Performance and power efficiency are key advantages, but they're also challenging as the number of cores increases.
Although multicore technology offers a game-changing opportunity for improvements in processing performance and power efficiency, it also brings about many new design and programming challenges. As the processor industry moves forward, the “three Ps,” power efficiency, performance, and programmability, are the yardstick by which various architectures will be judged. Interestingly, power efficiency and performance are not only the biggest opportunities afforded by multicore technology but also the most significant challenges as we scale the number of cores beyond today's single-digit designs. Another daunting challenge involves the programming model required for efficient use of multiple cores on a chip. Here's where organizations such as the Multicore Association will help alleviate some of these challenges.
Application demand for computing cycles in virtually every domain, from the embedded systems market to the desktop PC, continues to increase unabatedly. Modern video workloads, for example, require 10 to 100 times more compute power than that of a few years ago due to increasing resolutions (from standard definition to HD), more sophisticated compression algorithms (MPEG2 to H.264), and greater numbers of channels.
Unfortunately, the delivered performance of conventional sequential processors and digital signal processors hasn't kept pace with this demand. Reasons for this widening gap include diminishing returns from single-processor mechanisms such as caching and pipelining, wire delays, and power envelopes. Similarly, custom silicon is too expensive to build and FPGAs are too difficult to program.
Along with the wide range of application domains that do (or will) use multicore technologies, comes a wide range of definitions and implementations of multicore. In fact, there are so many definitions and implementations that it suffices to say that multicore refers to a single chip containing multiple visibly distinct processing engines, each with independent control (or program counters). In a sense, this can be viewed as a multiple-instruction-multiple-data (MIMD) style of computation. But even simplified as this definition is, multicore implementations can take many different forms; Figure 1 provides just a few examples.
Closing Moore's Gap
Moore's Law states that the number of transistors will double every 18 months. Over the past few years, it's become more obvious that the traditional processing model has run out of steam. Specifically, performance kept pace with Moore's Law until 2002 due to techniques such as pipelining, superscalar design, and multithreading (simultaneous, coarse grain, or fine grain). The performance scaling fell apart in 2002 due to three factors, namely diminishing returns from single CPU mechanisms such as caching and pipelining, the power envelopes (both active and leakage related), and wire delay.
A popular example of the break in performance scaling was demonstrated by the Pentium 4, which was first implemented in the same technology as the Pentium 3 (0.18-micron). Even though the Pentium 4 had 50% more transistors than the Pentium 3, its performance, based on the SPECint 2000, was only 15% greater. This situation introduced Moore's Gap, which is the increasing difference between the exponentially growing number of transistors on a single chip and the delivered performance of the chip (see Figure 2).
Two available factors will help to close Moore's Gap. First, today's applications and workloads have ample parallelism. In addition to the video example given, many other applications exhibit parallelism. These include networking (such as IP forwarding), wireless (Viterbi decode and FIR filters), security firewalls (AES), and automotive (engine control).
The second factor is that multiple full-fledged cores can be integrated on the same chip. Not only is this made practical by today's process technologies, but in terms of energy savings, this level of integration is significantly more efficient than using discrete components on a board. For example, transferring a 32-bit value over a 32-wire, 1-mm long channel consumes about 5 pJ of energy in 90-nm technology. Transferring the same value between discrete chips can consume as much as 500 pJ. Furthermore, the latency of the on-chip transfer is an order of magnitude less, going from tens of nanoseconds down to a fraction of a nanosecond. Likewise, the bandwidth of the intra-chip communication can run up to several Tbits/s (measured in terms of bisection bandwidth), whereas between chips the achievable bandwidth is about 100 Gbits/s, without resorting to expensive technologies.
Taken together, the two factors (plentiful application parallelism and integration efficiency) enable us to harness the “power of n,” where n cores can yield an n-fold increase in performance. Because n cores can yield n times the performance, multicore technology can put performance back on the same trajectory as Moore's curve, thereby closing Moore's Gap. However, be aware that this only applies to applications with inherent parallelism; multicore performance on a single sequential application might be worse than that of a high-powered sequential CPU.
Multicore increases performance
The best way to appreciate the benefits gained by a multicore implementation over a single core device is to compare how the transistors would be used. For example, suppose we have a single core in 90-nm technology that occupies chip area A and that the processor and cache portions each occupy half the chip's area. Let's call this the base case. In 65-nm technology, the architects have twice the number of transistors for the same area A. To take advantage of the transistor doubling, architects can keep the same CPU architecture and triple the cache size (Case 1). Alternatively, the architects can double the number of cores and maintain the same cache size for each core (Case 2).
We can apply a simplified model to demonstrate the performance implications of the two alternatives. Assume that the miss rate for some workload for the base case is 1% and the latency to main memory for a cache miss is 100 cycles. Also assume that each of the instructions ideally executes in one cycle. Starting with the base case, the instructions per clock (IPC) will be 0.5 (IPC can be calculated as 1/(1 + 0.01 × 100) = 0.5). Assuming that the workload has ample parallelism, the IPC doubles by applying Case 2 (2 × 1/(1 + 0.01 × 100) = 1) since the rate of executing instructions is twice that of the base case.
On the other hand, assuming the memory latency hasn't changed, the IPC of Case 1 will be 0.62, or 1/(1 + 0.006 × 100), which is lower than that of the multicore alternative (Case 2). This calculation is derived from a commonly used rule of thumb for cache miss rates, the square-root rule, which suggests that the miss rate falls in proportion to the square root of the cache size. (More accurate models can be found in a 1989 article in ACM Transactions on Computer Systems [1].) Thus the cache miss rate for Case 1 will be 1%/sqrt(3) = 0.6%. Case 1 also helps to demonstrate that caches offer diminishing returns: tripling the cache raises the IPC only from 0.5 to 0.62.
If the multicore argument holds true, we should be able to reduce cache size and put more cores on the same die for better performance. Let's try three cores, each with one-third the cache size of the base case (Case 3). The cache miss rate will be 1% × sqrt(3) = 1.7%, and the IPC increases further to 1.1, or 3 × 1/(1 + 0.017 × 100). However, utilizing the same die size, does the trend continue? With four cores, there's barely room for cache, but assume that we can squeeze in a cache that's one-sixteenth the size of the base case for each core. The miss rate will be 1% × sqrt(16) = 4%, and the IPC drops to 0.8, or 4 × 1/(1 + 0.04 × 100). Notice that in our example, Case 3 has the optimal number of cores and cache sizes. This demonstrates that the balance of resources in a core is of great importance in multicore architectures. See Moritz et al. for more information [2].
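The simplified model above is easy to reproduce. The following Python sketch is only an illustration of that arithmetic: the function names are ours, and it assumes the square-root rule and a one-cycle ideal instruction time exactly as stated in the text. Because it does not round the miss rate to 0.6% first, it reports about 0.63 for Case 1 rather than the 0.62 quoted above.

```python
import math

def miss_rate(base_miss_rate, cache_size_ratio):
    """Square-root rule: miss rate scales as 1/sqrt(cache size relative to the base case)."""
    return base_miss_rate / math.sqrt(cache_size_ratio)

def chip_ipc(cores, cache_size_ratio, base_miss_rate=0.01, miss_latency=100):
    """Aggregate IPC for `cores` identical cores, assuming ample parallelism
    and one cycle per instruction when there is no cache miss."""
    m = miss_rate(base_miss_rate, cache_size_ratio)
    return cores / (1 + m * miss_latency)

# (cores, per-core cache size relative to the base case)
designs = {
    "base case: 1 core, 1x cache": (1, 1),
    "Case 1: 1 core, 3x cache": (1, 3),
    "Case 2: 2 cores, 1x cache each": (2, 1),
    "Case 3: 3 cores, 1/3 cache each": (3, 1 / 3),
    "4 cores, 1/16 cache each": (4, 1 / 16),
}

for name, (cores, ratio) in designs.items():
    print(f"{name}: IPC = {chip_ipc(cores, ratio):.2f}")
```

Changing `miss_latency` or `base_miss_rate` shows how sensitive the optimal core count is to the workload and memory system, which is the point of the balance argument.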
Increasing power efficiency
The past decade was one in which operating frequency was synonymous with performance. A significant proportion of the performance gains of sequential processors came from frequency increases, and MHz and GHz became touted metrics of performance.
It's well known that power consumption increases nonlinearly with frequency. To demonstrate this, let's look at an example using a simple 32-bit multiplier. The circuit for this multiplier is synthesized using a commercial synthesis tool to target a given frequency, and its power is measured, as shown in Figure 3.
The experiment assumes 90-nm technology. Notice that as the frequency target is increased from 250 to 650 MHz (keeping the voltage constant), the synthesizer is able to meet the frequency goal, and the power consumed by the circuit increases more or less linearly.
For our given technology, the synthesizer was unable to create a higher frequency circuit at the given voltage. However, keeping the same circuit, we can increase the voltage and achieve a proportionally higher frequency because for digital CMOS circuits, frequency is proportional to voltage. The problem is that power relates to the square of the voltage, which results in the steep power increase as the frequency increases beyond 650 MHz.
This discussion implies that increasing the frequency by increasing the voltage is extremely wasteful of power. As a matter of fact, because frequency was increased by a proportional increase in voltage, power is proportional to the cube of the frequency. In other words, a 1% increase in frequency results in a 3% increase in power. Based on this fact and the prior arguments, it can also be demonstrated that a multicore device running at a fraction of the operating frequency of a sequential processor can match or exceed the performance of one very fast device while being significantly more power-efficient. However, keep in mind that although these discussions point out that multicore is beneficial, it's only beneficial for parallel workloads and is no better (and probably slightly worse) for sequential applications.
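The cubic relationship can be checked with a few lines of Python. This is a rough sketch under idealized assumptions that go beyond the text: voltage is taken to scale down as well as up in proportion to frequency (real circuits have a minimum operating voltage), leakage is ignored, and the workload is assumed to be perfectly parallel. The function names are illustrative.

```python
def relative_power(freq_ratio):
    """Power relative to a baseline core when voltage scales with frequency.
    P ~ V^2 * f and V ~ f, so P ~ f^3."""
    return freq_ratio ** 3

def multicore_relative_power(n_cores):
    """Total power of n cores, each clocked at 1/n of the baseline frequency,
    relative to one baseline core. Aggregate throughput matches the baseline
    only if the workload parallelizes perfectly (an assumption)."""
    return n_cores * relative_power(1 / n_cores)

# A 1% frequency increase costs roughly 3% more power.
print(f"1% faster: {100 * (relative_power(1.01) - 1):.2f}% more power")

# Same aggregate throughput from slower cores, at a fraction of the power.
for n in (2, 4):
    print(f"{n} cores at 1/{n} frequency: {multicore_relative_power(n):.4f}x power")
```

Under these assumptions the total power falls as 1/n², which is why even a modest number of slower cores compares so favorably with one very fast core on parallel work.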
Multicore's performance challenge
While performance and power efficiency are the major advantages of the multicore versus single-processor approach, they're also challenging to improve as the number of cores increases beyond the single digits and moves to tens or hundreds of cores. The main performance challenge of dealing with a greater number of cores relates to the network that connects the various cores to each other and to main memory.
Current multicore systems rely on buses or rings for their interconnect. These don't scale and, therefore, will become the performance bottleneck. One common interconnect topology is a bus shared equally between all the cores that are attached to it. Arbitration for use of the bus is a centralized function and only one core is allowed to use it in any given cycle. Alternatively, a ring is a one-dimensional connection between cores and arbitration is a local function in each of the switches.
The mesh, a two-dimensional packet-switched extension, is yet another interconnect topology. It works well with the 2D VLSI (very large-scale integration) technology and is also the most effective interconnect when scaling to larger numbers of cores. The mesh scales to 64 cores and beyond because its bisection bandwidth increases as more cores are added, and its latency increases only as the square root of the number of cores. (Bisection bandwidth is defined as the bandwidth between two equal halves of the multicore chip.) Contrast this with the bus topology that can only handle up to about eight cores, whereas the ring is viable up to about 16 cores. For both the bus and the ring, the bisection bandwidth is fixed even as more cores are added, and the latency increases in proportion to the number of cores.
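The scaling difference between the ring and the mesh can be illustrated by counting links. The Python sketch below is a simplification of our own: it measures bisection bandwidth as the number of links cut when the cores are split into two equal halves, and latency as the worst-case hop count between two cores. It assumes a bidirectional ring and a square mesh with an even number of cores per side. The bus is left out because it is a single shared medium whose bisection bandwidth stays at one link's worth regardless of core count.

```python
import math

def ring(n_cores):
    """Bidirectional ring: any equal split cuts exactly two links,
    and the farthest core is half-way around."""
    return {"bisection_links": 2, "max_hops": n_cores // 2}

def mesh(n_cores):
    """Square k x k mesh: a cut down the middle crosses k links,
    and the farthest core is at the opposite corner."""
    k = math.isqrt(n_cores)
    if k * k != n_cores or k % 2 != 0:
        raise ValueError("this sketch assumes a square mesh with an even side length")
    return {"bisection_links": k, "max_hops": 2 * (k - 1)}

for n in (16, 64, 256):
    print(n, "cores | ring:", ring(n), "| mesh:", mesh(n))
```

At 64 cores the mesh has four times the bisection links of the ring and less than half the worst-case hop count, and the gap keeps widening as cores are added.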
In addition to the latencies associated with arbitration through the interconnect, there are potential performance issues due to the memory bandwidth bottleneck related to both instructions and data. As discussed earlier, multicore can solve one aspect of the memory bandwidth problem by distributing caches along with the processors within the cores. Similarly, the main memory bandwidth can be increased by implementing multiple main memory ports and a large number of memory banks. Thus, in multicore, the so-called memory bandwidth bottleneck isn't really a memory problem. Rather, it's an interconnect issue and the problem lies in how the memory units get interfaced to the cores.
The interface between memory banks and the cores is the interconnect, which includes both the on-chip network and the pins. Previous solutions to the issue of scalable interconnection networks apply to the memory issue as well. Packets simply transport memory data instead of processor-to-processor data, so scalable on-chip networks ease the memory bandwidth problem. Pins that allow a multicore processor to access off-chip DRAM are also part of the interconnection network. High-speed serial memory interfaces (such as FB-DIMMs and XDR DRAM) will provide a significant short-term boost in the available pin bandwidth for memory. In the long term, however, off-chip bandwidth will continue to be significantly more expensive than on-chip bandwidth, and so our programming models need to adapt to replace off-chip memory-based communication with direct on-chip processor-to-processor communication.
This discussion relates to the hardware aspects of inter-core communications. Interestingly, in a complete system, communication latency is rarely an interconnect wire problem. Rather it's a “last-mile” issue. In fact, latency usually comes from cache coherence or messaging protocol overhead, and processor-to-network interface overhead. For example, in a system that implements the Message Passing Interface (MPI), we've measured a fixed software protocol overhead of about 1000 cycles. This is depicted in Figure 4 which shows the end-to-end latency (in cycles) to send a message between a pair of cores with various amounts of payload (one word to over a million words).
MPI was designed for communication between computers over a local area network. On a multicore chip, it incurs far too much overhead and squanders the extremely low inter-core latency that multicore offers. Multicore requires much lighter-weight communications protocols with significantly lower latencies. The multicore challenge will be to reduce the protocol overhead to a few tens of cycles, thereby exploiting the few nanosecond inter-core communication latency. This is one of the challenges currently being tackled by the Multicore Association. Specifically, this industry consortium is working on a standardized API for communication and synchronization between closely distributed cores and/or processors in embedded systems. This communication API (CAPI) will target systems that span multiple dimensions of heterogeneity (such as core, interconnect, memory, operating-system, software-tool chain, and programming-language heterogeneity).
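To see why a fixed protocol overhead matters most for short messages, consider a simple linear latency model. The Python sketch below is illustrative only. The 1000-cycle overhead is the measured figure quoted above, but the one-cycle-per-word transfer cost and the 30-cycle "lightweight" overhead are assumptions chosen to stand in for "a few tens of cycles," not measurements.

```python
def message_latency(payload_words, overhead_cycles=1000, cycles_per_word=1):
    """End-to-end latency in cycles: fixed software protocol overhead
    plus an assumed per-word transfer cost."""
    return overhead_cycles + payload_words * cycles_per_word

def overhead_fraction(payload_words, overhead_cycles=1000, cycles_per_word=1):
    """Share of the total latency spent in protocol overhead."""
    total = message_latency(payload_words, overhead_cycles, cycles_per_word)
    return overhead_cycles / total

for words in (1, 1_000, 1_000_000):
    heavy = message_latency(words)                       # MPI-like overhead
    light = message_latency(words, overhead_cycles=30)   # hypothetical lightweight protocol
    print(f"{words:>9} words: {heavy:>9} vs {light:>9} cycles "
          f"(overhead share {overhead_fraction(words):.1%})")
```

For a one-word message nearly all of the latency is protocol overhead, while for a million-word message the overhead is negligible. Fine-grained inter-core communication therefore gains the most from a lighter protocol.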
Kill if less than linear
Because several cores make up a multicore chip, reducing the power of each core becomes paramount and requires a rethinking of CPU architecture. Existing CPUs don't make good cores. The good news is that the potential exists for power efficient multicore. Reducing the frequency alone won't suffice. We must look to other sources to increase power efficiency. A useful rule of thumb for VLSI design is that area equates to power. The more area consumed by a core, the more power that's consumed.
What are ways by which we can reduce the area occupied by a core, and thereby its power? One approach is to be judicious about how area is spent in multicore. Recall the “power of n” for multicore, where n cores can yield n times the performance for parallel applications. The power of n offers the path to a recipe for how to apportion area to processing and memory resources. Because the power of n says that we can increase multicore performance linearly with area by using that area for more cores, using any area for increasing the size of a resource within a core must be justified by a proportional increase in the core's performance. For example, if we increase a resource's size within a core by allocating it an additional 1% of the core area, we can justify this increase only if the performance of the core also increases by at least 1%. If the performance increase is less than 1%, then the area would have been better spent on increasing the number of cores.
We capture this insight into a simple rule called the “kill rule for multicore for power efficient design,” or “kill if less than linear.” The rule states that a resource in a core must be increased in area only if there's a proportional performance increase for the core. In an example applying this rule, let's start with a baseline multicore design containing 100 cores, each with a 512-byte data cache. Assume that the cache occupies 1% of the core's area. Because power relates to area, we use the kill rule to find the design (in our case, the data cache size and resulting number of cores) that yields the highest performance, keeping the area constant to that of 100 cores with a 512-byte data cache.
For the baseline design, assume that the IPC can be calculated as 0.11. The aggregate IPC for the entire chip of 100 cores is therefore 11. If the cache per core is increased from 512 bytes to 2 kbytes, the cache grows from 1% to 4% of the baseline core area, so each core becomes about 3% larger. Only 97 cores will now fit in the same area. The cache miss rate decreases, and the IPC for each core increases to 0.38; the IPC for the entire chip with the 2-kbyte cache becomes 0.38 × 97 = 37.
Now, let's double the cache size to 4 kbytes, which occupies 8% of the baseline core area. The miss rate decreases further, and the IPC increases to 0.50, a 32% increase over the 2-kbyte cache. By the kill rule, this is a good tradeoff, because the 32% performance increase is greater than the 4% area increase over the 2-kbyte cache. This trend continues until we try to implement a cache greater than 8 kbytes. To go from 8 to 16 kbytes, notice that an additional 10% of the chip area must be devoted to data cache. The resulting increase in the core's performance is only 4%. This is a bad tradeoff, as the 4% performance increase is less than the 10% area increase (chip IPC reduces from 49 to 44).
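The bookkeeping in this example can be sketched in Python. The helper names are ours, and the sketch assumes that cache area grows in proportion to capacity (512 bytes is 1% of the baseline core, 2 kbytes is 4%, 4 kbytes is 8%). The per-core IPC values are the ones quoted in the text. The 4-kbyte chip IPC it prints is derived by the sketch and is not a figure stated in the article.

```python
def cores_that_fit(cache_pct, base_cache_pct=1, base_cores=100):
    """Number of cores that fit in the area of the baseline chip.
    Areas are in percent of the baseline core area; integer arithmetic
    avoids floating-point surprises when rounding down."""
    core_area = 100 - base_cache_pct + cache_pct
    return (base_cores * 100) // core_area

def pct_gain(new, old):
    """Percentage increase from old to new."""
    return 100 * (new - old) / old

def passes_kill_rule(perf_gain_pct, area_gain_pct):
    """Kill if less than linear: grow a resource only if core performance
    grows at least in proportion to the area spent."""
    return perf_gain_pct >= area_gain_pct

# cache size -> (cache area in % of baseline core, per-core IPC from the text)
designs = {"512 bytes": (1, 0.11), "2 kbytes": (4, 0.38), "4 kbytes": (8, 0.50)}

for name, (cache_pct, ipc) in designs.items():
    n = cores_that_fit(cache_pct)
    print(f"{name}: {n} cores, chip IPC = {ipc * n:.1f}")

# 2 kbytes -> 4 kbytes: about 32% more performance for about 4% more area.
print(passes_kill_rule(pct_gain(0.50, 0.38), 4))
# 8 kbytes -> 16 kbytes: 4% more performance for 10% more area (figures from the text).
print(passes_kill_rule(4, 10))
```

The first check passes and the second fails, matching the conclusion that growth beyond 8 kbytes should be "killed" in favor of more cores.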
This last example demonstrates that multicore designs must carefully allocate chip area among a core's resources. The existing single-processor design approach of building ever increasing caches can be counterproductive to power efficiency and performance. Conversely, this example also demonstrates that simply increasing the number of cores without creating the right balance of resources within a core (and hence the core size) can be a wasteful exercise. There's a danger that the number of cores will become the new MHz paradigm; the kill rule demonstrates that the number of cores can become yet another meaningless indicator of performance.
A twist on inter-core communications
In 90-nm technology, the transfer of a 32-bit word between neighboring cores over a 1-mm channel expends only 3 pJ. This is on the same order of magnitude as a 32-bit add operation or a register file access but is more than an order of magnitude less than the 50 pJ expended in a 32-kbyte cache read operation and over two orders of magnitude less than the 500 pJ expended in a typical off-chip memory access. What's surprising is that it might be “cheaper” to recompute results or communicate data on-chip, rather than to store results in memory. In other words, the power efficiency of multicore encourages a migration from memory-oriented computation models to communication centric models.
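The gaps quoted above can be expressed directly as orders of magnitude. This short Python sketch uses only the per-operation energy figures from the paragraph; the dictionary keys and function name are illustrative.

```python
import math

# Energy per 32-bit operation in picojoules, as quoted in the text (90-nm technology).
ENERGY_PJ = {
    "neighbor-core transfer (1 mm)": 3,
    "32-kbyte cache read": 50,
    "off-chip memory access": 500,
}

def orders_of_magnitude(a, b):
    """How many powers of ten separate a from b."""
    return math.log10(a / b)

transfer = ENERGY_PJ["neighbor-core transfer (1 mm)"]
for name, pj in ENERGY_PJ.items():
    print(f"{name}: {pj} pJ, {pj / transfer:.0f}x the on-chip transfer, "
          f"{orders_of_magnitude(pj, transfer):.1f} orders of magnitude")
```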
Many multicore designs require the cores to communicate with each other by writing into and reading out of main memory or higher cache levels. Worse still, communication is sometimes a side effect of cache coherence protocols. Energy is inefficiently expended both in coherence protocols (for example, widespread snooping and broadcasting) and in cache and memory accesses for the coherence state information and the data. This is one reason that ASICs have significantly lower power than processors. They generally minimize the use of large centralized memories, and prefer local registers or recomputation. Further, these local registers provide direct communication in a pipelined stream-like fashion between compute entities and avoid the use of shared memories to pass values.
Markus Levy is the founder and president of the Embedded Microprocessor Benchmark Consortium and serves as the president of the Multicore Association. He's worked for Intel as a senior applications engineer and customer training specialist. You can reach him at firstname.lastname@example.org.
Anant Agarwal is a professor of electrical engineering and computer science at MIT and a member of the CSAIL Laboratory. He is also the founder and chief technology officer of Tilera Corp. Agarwal holds a Ph.D. in electrical engineering from Stanford University and a bachelor's from IIT Madras (1982).
2. Moritz, Csaba Andras, Donald Yeung, and Anant Agarwal. “SimpleFit: A Framework for Analyzing Design Tradeoffs in Raw Architectures,” IEEE Transactions on Parallel and Distributed Systems, July 2001.