Overcoming the embedded CPU performance wall
Figure 3 shows these effects for two different temperatures. The first curve, with T=300K, plots the exponential dependence of leakage current on threshold voltage presented earlier. The second curve, with T=330K, is an estimate that also accounts for the shift in threshold voltage caused by the rise in temperature. The abscissa still represents the nominal threshold voltage, but the real threshold voltage of the transistor is biased toward lower values by the higher temperature, which increases leakage current further.
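The two temperature curves can be approximated numerically. The sketch below does not reproduce the article's exact equation; it assumes the common subthreshold form, leakage proportional to exp(-Vth / (n·kT/q)), with an assumed slope factor `n` and an assumed threshold shift of -2 mV per kelvin (`dvth_dt`). Both values are illustrative, not taken from Figure 3.

```python
import math

K_B_OVER_Q = 8.617333e-5  # Boltzmann constant divided by electron charge, in V/K


def thermal_voltage(temp_k):
    """Thermal voltage kT/q in volts."""
    return K_B_OVER_Q * temp_k


def relative_leakage(vth_nominal, temp_k, n=1.5, t_ref=300.0, dvth_dt=-0.002):
    """Subthreshold leakage up to a constant prefactor (assumed model).

    The effective threshold is the nominal one shifted by dvth_dt volts
    for every kelvin above t_ref; n and dvth_dt are illustrative guesses.
    """
    vth_eff = vth_nominal + dvth_dt * (temp_k - t_ref)
    return math.exp(-vth_eff / (n * thermal_voltage(temp_k)))


# Compare the two temperatures at a few nominal threshold voltages.
for vth in (0.2, 0.3, 0.4):
    cold = relative_leakage(vth, 300.0)
    hot = relative_leakage(vth, 330.0)
    print(f"Vth={vth:.1f} V  leakage(330K)/leakage(300K) = {hot / cold:.1f}x")
```

Under these assumptions the hotter curve sits above the cooler one at every nominal threshold voltage, because the higher temperature both lowers the effective threshold and raises the thermal voltage.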
Leakage current also depends on gate insulator thickness. With very thin gate dielectrics, electrons can tunnel across the insulator, generating tunneling currents that lead to high power consumption. This effect is very important in current semiconductor process technologies, which use gate lengths of 32nm and below.
Of course, the core of a processor is not the only component on a chip that consumes energy. Memories, for example, also consume a considerable amount of energy and modern processors dedicate a large area of the die to incorporate several levels of cache memory.
Engineers apply several design techniques to reduce leakage current or the activity factor of the memory (the A factor in the power dissipation equation shown) and in this way they mitigate power consumption.
For example, organizing cache memory hierarchically in levels not only improves data access time, it also helps reduce the power consumed, since smaller, nearer caches require less energy per access than larger, more distant ones. With this organization it is possible to reduce power while preserving performance. In line with this idea, another common solution is to organize memory into banks, so that only the bank being accessed is activated, which saves energy.
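As a rough illustration of why banking helps, the sketch below assumes the usual dynamic power form P = A·C·V²·f for the power dissipation equation, and a simplified memory in which each of N equal banks carries 1/N of the total switched capacitance. Decoder and routing overhead are ignored, and all numeric values are hypothetical.

```python
def dynamic_power(activity, capacitance_f, vdd, freq_hz):
    """Dynamic power in watts, assuming P = A * C * V^2 * f."""
    return activity * capacitance_f * vdd ** 2 * freq_hz


def banked_memory_power(num_banks, total_cap_f, vdd, freq_hz, access_rate):
    """Power of a memory split into equal banks where only the accessed bank switches.

    Simplification: each bank holds total_cap_f / num_banks, and the
    extra decoding logic that selects a bank is not modeled.
    """
    bank_cap = total_cap_f / num_banks
    return dynamic_power(access_rate, bank_cap, vdd, freq_hz)


# Hypothetical memory: 2 nF switched capacitance, 1.0 V, 1 GHz, accessed on half the cycles.
monolithic = banked_memory_power(1, 2e-9, 1.0, 1e9, 0.5)
banked = banked_memory_power(8, 2e-9, 1.0, 1e9, 0.5)
print(f"monolithic: {monolithic:.3f} W, 8 banks: {banked:.3f} W")
```

In this idealized model the power falls in proportion to the number of banks. A real design would recover less, since bank selection and wiring add their own capacitance.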
However, pursuing higher performance is not always the right thing to do. Sometimes it is acceptable to reduce power at the cost of some throughput. Some processors are dedicated to specific applications and always perform the same kind of calculations, for example DSPs. Audio processing, digital filters, and data compression algorithms are typical applications for these devices, which are assessed by how much energy an operation requires and how long the processor takes to perform it.
A processor that takes more time than another to execute an algorithm but consumes less power can, in the end, be more energy efficient. A metric used to measure this efficiency is MIPS/W (million instructions per second per watt). Although the MIPS metric has to be treated with care, in general devices with higher MIPS/W are considered more efficient, which is especially interesting for embedded devices, particularly battery-powered ones. Indeed, there is now increasing interest in, and pressure for, energy-efficient processors in the world of servers and data centers.
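A small worked comparison makes the point concrete. The two processors below are hypothetical, and the sketch assumes constant power draw while running and ignores idle power once the work is finished.

```python
def mips_per_watt(mips, watts):
    """Efficiency metric: million instructions per second per watt."""
    return mips / watts


def run_time_seconds(instructions, mips):
    """Time to execute a fixed number of instructions."""
    return instructions / (mips * 1e6)


def energy_joules(instructions, mips, watts):
    """Energy for a fixed workload: execution time multiplied by power."""
    return run_time_seconds(instructions, mips) * watts


# Hypothetical processors and workload (not measurements of real parts).
fast = {"mips": 2000.0, "watts": 4.0}
slow = {"mips": 800.0, "watts": 1.0}
workload = 4e9  # instructions

for name, cpu in (("fast", fast), ("slow", slow)):
    t = run_time_seconds(workload, cpu["mips"])
    e = energy_joules(workload, cpu["mips"], cpu["watts"])
    eff = mips_per_watt(cpu["mips"], cpu["watts"])
    print(f"{name}: {t:.1f} s, {e:.1f} J, {eff:.0f} MIPS/W")
```

With these numbers the slower processor needs 5 s instead of 2 s but spends 5 J instead of 8 J, which matches its higher MIPS/W figure (800 versus 500).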
Transmission delays on a chip
The other main factor limiting increases in transistor density and frequency on a chip is wire transmission delay. The very high frequencies used in modern processors, on the order of gigahertz, mean that a clock cycle lasts only a fraction of a nanosecond. This short cycle time is becoming a problem for signal propagation.
Reducing feature size on a chip has decreased the gate length and capacitance of transistors and so increased clock rates, overcoming capacity-bound constraints. But wires on a chip are becoming slower due to higher resistance and capacitance. Wires are now narrower and thinner, and the smaller cross-sectional area results in higher resistance.
With smaller area and hence less wire surface, surface-related capacitance decreases but the distance between neighboring wires is also being reduced and this produces a higher coupling capacitance. Coupling capacitance increases at a faster pace than surface capacitance decreases, thus counteracting its effect and producing a combined effect of higher overall wire capacitance.
Wire transmission delay is directly proportional to the product of the wire's resistance and capacitance, Rw × Cw, so each new technology generation that shrinks feature size brings higher wire delays. With faster clock rates and slower signal propagation along wires, the distance a signal can travel, and hence the chip area that can be reached in a single clock cycle, is reduced. This leads to a new situation in which designs are communication bound.
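The relationship between Rw × Cw, clock rate, and reachable distance can be sketched numerically. The model below is an assumption, not the article's: it treats the wire as a distributed RC line with Elmore delay 0.5·r·c·L², uses bulk copper resistivity (thin on-chip wires are worse in practice), and takes an illustrative capacitance of 0.2 fF per micrometer.

```python
import math

RHO_CU = 1.7e-8  # ohm*m, bulk copper resistivity (optimistic for thin wires)


def wire_r_per_m(width_m, height_m, rho=RHO_CU):
    """Wire resistance per meter: resistivity over cross-sectional area."""
    return rho / (width_m * height_m)


def wire_delay(length_m, r_per_m, c_per_m):
    """Elmore delay of a distributed RC line: 0.5 * r * c * L^2."""
    return 0.5 * r_per_m * c_per_m * length_m ** 2


def reach_in_one_cycle(freq_hz, r_per_m, c_per_m):
    """Longest unrepeated wire whose delay fits within one clock period."""
    return math.sqrt(2.0 / (freq_hz * r_per_m * c_per_m))


C_PER_M = 2e-10  # assumed 0.2 fF/um, held constant for simplicity

wide = wire_r_per_m(100e-9, 200e-9)   # 100 nm x 200 nm cross-section
narrow = wire_r_per_m(50e-9, 100e-9)  # both dimensions halved

for label, r in (("100x200 nm", wide), ("50x100 nm", narrow)):
    reach_mm = reach_in_one_cycle(3e9, r, C_PER_M) * 1e3
    print(f"{label}: r = {r * 1e-6:.2f} ohm/um, reach at 3 GHz = {reach_mm:.2f} mm")
```

Halving both wire dimensions quadruples the resistance per unit length, and because delay grows with the square of length, the distance reachable in one cycle is halved. The sketch keeps capacitance fixed, whereas the article notes that coupling capacitance rises too, which would shorten the reach further.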
For a fixed micro-architecture this would not be a big problem, since circuit area shrinks quadratically with feature size. But to make the most of smaller transistors and achieve higher IPC, designers develop more complex micro-architectures with deeper pipelines, more execution units, and large micro-architectural structures. Higher communication delays across the chip now put a practical limit on the size and even the placement of these structures, and on the maximum operating frequency.
As an example, the misprediction pipeline of the Intel Pentium 4 required twice as many stages as that of the Pentium III. With higher clock rates and wire delays, the pipeline has to be divided into smaller pieces, each stage doing less work. But wire delays had become so large that two stages of the Pentium 4 pipeline were extra stages needed only to drive signals from one stage to the next. Without them, so much of the clock cycle would be spent by the signal reaching the next stage that too little time would remain for the required computation.
A similar example of how wire delay affects a design can be found in the Advanced Microcontroller Bus Architecture (AMBA) specification from ARM. The Advanced System Bus (ASB), introduced in the first AMBA specification and designed to interconnect high-performance system modules, uses bidirectional buses and a master/slave architecture.
In the second AMBA specification, the Advanced High-performance Bus (AHB) was introduced to improve support for higher performance and to replace ASB. In this new bus specification, among other changes, bidirectional buses have been replaced by a multiplexed bus scheme. At first this modification would seem to add unnecessary wires and complexity to the circuit. But the effect of wire delays in very high performance systems sometimes makes it necessary to introduce repeater drivers (as seen in the Pentium 4 case). This is possible in the unidirectional buses that make up a combined multiplexed bus, but it is very hard in bidirectional buses.
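The benefit of repeater drivers can be estimated with a simple model. The sketch below assumes, as an approximation, that an unrepeated wire has Elmore delay 0.5·r·c·L² and that each repeater adds a fixed delay; the wire parameters and the 20 ps repeater delay are illustrative values, not figures from the AMBA specification or the Pentium 4.

```python
def unrepeated_delay(length_m, r_per_m, c_per_m):
    """Elmore delay of a distributed RC wire: 0.5 * r * c * L^2."""
    return 0.5 * r_per_m * c_per_m * length_m ** 2


def repeated_delay(length_m, r_per_m, c_per_m, segments, repeater_delay_s):
    """Delay of a wire cut into equal segments with a repeater between each pair.

    A wire with N segments contains N - 1 repeaters, so segments=1 is the
    plain unrepeated wire.
    """
    seg_len = length_m / segments
    wire_part = segments * unrepeated_delay(seg_len, r_per_m, c_per_m)
    return wire_part + (segments - 1) * repeater_delay_s


# Assumed wire: 0.85 ohm/um and 0.2 fF/um, 5 mm long; 20 ps per repeater.
R_PER_M, C_PER_M, LENGTH = 8.5e5, 2e-10, 5e-3

for n in (1, 2, 5, 10):
    delay_ns = repeated_delay(LENGTH, R_PER_M, C_PER_M, n, 20e-12) * 1e9
    print(f"{n:2d} segment(s): {delay_ns:.2f} ns")
```

Splitting the wire turns the quadratic growth of delay with length into roughly linear growth, at the cost of extra drivers. Since a simple repeater drives a signal in one direction only, the model is consistent with the article's point that repeaters fit naturally in unidirectional buses.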
The challenges ahead
We have seen the two main restrictions that technology imposes on continuing to apply Moore's law and improve processor performance. But technology is constantly evolving. Scaling down feature sizes has enabled higher transistor density and frequency, and designers are still managing to shrink transistors and increase the number of transistors on a chip to more than a billion.
Predictions were that semiconductor process technology would reach 35nm gate lengths in 2014, but manufacturers have actually been producing chips at 22nm since 2011. Power dissipation and transmission delay problems are motivating everyone in the industry to investigate new materials for making transistors, and new organizational and architectural solutions are already being applied in modern processors. High-k gate oxides (k refers to the dielectric constant of a material) are replacing the silicon dioxide gate dielectric used for decades, allowing thinner insulators while keeping leakage currents under control.
New use of low-k dielectrics makes it possible to reduce coupling capacitance and therefore transmission delays. Traditional micro-architectures implementing a single and large monolithic core are evolving to simpler multicore micro-architectures to allow mainly local communications and thus avoid large delays.
Recently, some chip manufacturers, such as Intel, have announced three-dimensional transistor structures. Intel's new Ivy Bridge family of processors, the successor to the Sandy Bridge family, is based on a new tri-gate transistor technology that boosts processing power while reducing the amount of energy needed.
Using 3-D transistors instead of the previous planar structure transistors, pipeline stages can be vertically stacked on top of each other, effectively reducing the distance between blocks and eliminating wire delay effects. According to Intel, its 22nm 3-D Tri-Gate transistors consume less than half the power when operated at the same clock frequencies as planar transistors on 32nm chips, exceeding what is typically achieved from one process generation to the next.
Multicore architectures are evolving quickly. For example, Tilera has developed the first 100-core processor on a single chip! To achieve such a level of integration, Tilera combines a processor with a communications switch in a unit that its designers call a “tile.” By combining such tiles, the company is able to build a piece of silicon that forms a mesh network. Processors are usually connected to each other through a bus, but as the number of processors increases this bus quickly becomes a bottleneck. With a Tilera tiled mesh, every processor gets a switch and they all talk to each other as in a peer-to-peer network. In addition, each tile can independently run a real-time operating system. Alternatively, you can group multiple tiles together to run an operating system like SMP Linux.
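To get a feel for how a mesh scales, the sketch below counts neighbor links and hop distances in a square grid of tiles. It assumes minimal routing, where the hop count is the Manhattan distance between tiles; the article does not describe Tilera's actual routing, so this is a generic 2-D mesh model rather than a description of that chip.

```python
from itertools import product


def mesh_hops(src, dst):
    """Hops between two tiles under minimal routing (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def average_hops(side):
    """Mean hop count over all ordered pairs of distinct tiles."""
    tiles = list(product(range(side), repeat=2))
    pairs = [(a, b) for a in tiles for b in tiles if a != b]
    return sum(mesh_hops(a, b) for a, b in pairs) / len(pairs)


def mesh_links(side):
    """Number of neighbor-to-neighbor links in a side x side mesh."""
    return 2 * side * (side - 1)


for side in (2, 4, 10):
    print(f"{side * side:3d} tiles: {mesh_links(side):3d} links, "
          f"average distance {average_hops(side):.2f} hops")
```

A 10 × 10 mesh has 180 short local links, many of which can carry different transfers at the same time, whereas a single shared bus serializes them. The price is that a message between distant tiles crosses several switches.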
Research is also under way on graphene transistors, each of which is made from a sheet of carbon just one atom thick. Theoretically, these transistors could reach very high operating frequencies, approaching 1 THz (1000 GHz), and it will even be possible to manufacture them on flexible substrates. There are still many challenges for this technology, though, and we will probably have to wait several years to see these advances become reality.
The problem now facing the industry is how to take full advantage of this huge parallel processing power. But the embedded software industry is already developing powerful tools to help build the new and complex many-core applications world.
Proposals like OpenMP and MPI for shared and distributed memory architectures, or OpenCL (Open Computing Language), the open standard for parallel programming of heterogeneous systems, are very promising. With OpenCL you can develop software for systems with a mix of multicore CPUs, GPUs, and even DSPs. But probably the biggest challenge is to change programmers’ mindsets to learn how to write highly parallel and reliable software in these systems.
Julio Díez is a software engineer with fifteen years of experience, mainly in the embedded world. He has spent the last six years developing communication and security software for embedded systems, including the first secure communication system in its class for the Spanish NSA. He is interested in multicore architectures, operating systems, software design, and parallel programming. He holds a bachelor’s degree in telecommunications engineering from the Technical University of Madrid, Spain. You can reach him at firstname.lastname@example.org.