Overcoming the embedded CPU performance wall - Embedded.com

Overcoming the embedded CPU performance wall


The physical limitations of current semiconductor technology have made it increasingly difficult to achieve frequency improvements in embedded processors, and so designers are turning to parallelism in multicore architectures to achieve the high performance required for current designs. This article explains these silicon limitations and how they affect CPU performance, and indicates how engineers are overcoming this situation with multicore design.

Current status of multicore SoC design and use
In the last few years there has been an increase in microprocessor architectures featuring multithreading or multicore CPUs. They are now the rule for desktop computers, and are becoming common even for CPUs in the high-end embedded market. This increase is the result of processor designers' desire to achieve higher performance. But silicon technology is approaching its limits for raising single-core performance. The solution to the need for ever-increasing processing power depends on architectural solutions such as replicating processor cores inside microprocessor-based systems-on-chip (SoCs).

Moore's law states that the number of transistors that can be fit onto a square inch of silicon doubles every two years, as the size of transistors shrinks. It was postulated in 1965 by Gordon E. Moore, who at that time was Fairchild Semiconductor's Director of R&D and who later co-founded Intel. Moore originally projected a doubling every year and revised the rate to every two years in 1975.

Although the word “law” is used to describe his projection, Moore's prediction is not a law of physics, but a conjecture based on empirical observation of the technology in the 1960s and 1970s. In the short history of modern computing, there have been many guesses and predictions, and not a few mistakes. That makes Moore's law all the more impressive, considering that it has held from the time it was first postulated right up to the present, and it is expected to hold for at least another decade.

Moore's law continues to hold because the ability to shrink the components on a chip has enabled designers to continuously increase the density of transistors in processors, memories, and other devices. With smaller transistors you can add more functional units to your processor and build more complex architectures in the same area.

Thanks to this higher density, techniques like branch prediction or out-of-order execution are now common features in modern processors, even though they are resource hungry. This leads to improved IPC (Instructions Per Cycle), i.e. improved instruction throughput, one of the two fundamental sources of overall processor performance; the other is clock rate. A smaller transistor size also allows higher clock rates. When you shrink the gate length of a transistor by a factor of k, transistors switch faster and circuit delay falls by the same factor, so you can achieve a clock rate multiplied by a factor of k. Operating at higher frequencies, processors achieve higher performance, but at a cost.

However, designers are now encountering some practical restrictions on following this progression. Increasing the density of transistors and the frequency on a chip has side effects that grow more severe as transistor size shrinks. The two main barriers to further progress are higher power consumption and higher transmission delays.

Power consumption on a chip
The power consumption on a chip and the associated heat dissipation are becoming a big barrier for hardware designers. With the constant increase in the number of transistors, current processors demand a considerable amount of energy in a very small area. This means a high power density to be dissipated. And it is not only the number of transistors: high operating frequencies also have a serious impact on power consumption, as we will see next.

To give an idea of how these parameters have evolved in recent decades, Figure 1 shows the increases in transistor count and operating frequency for x86 Intel architectures over a period of 20 years, starting with the 80386 architecture, the first 32-bit x86 processor.

Figure 1: Transistor count and frequency for the X86 architecture

Note that both parameters are shown on logarithmic scales, which reflects how steeply they have grown. With respect to power, Figure 2 shows typical power dissipation for these processors, this time on a linear scale.

Figure 2: Power consumption of succeeding generations of X86 processors

The increase in the number of transistors continues. Some of the latest Intel Core i7 processors feature more than 2,200 million transistors. The dissipated power also increases slightly, depending on the model, reaching values of 130 W. However, clock frequency in these new processors is not increasing and remains around 3.5 GHz.

One of the reasons for this stagnation is that current integrated circuits have reached the physical limits of power density, generating as much heat as the chip package is able to dissipate, and consequently hardware designers have had to limit frequency increases. Intel has traditionally favored performance over power efficiency, but now the physical consequences leave it with no option but to look carefully at power consumption.

Some equations better demonstrate how frequency and transistor count affect power consumption on a chip. A few simple mathematical relationships will make it clear why these parameters are so important in today's designs.

The following equation shows how power dissipation on a chip relates to operating frequency and other factors:

$$P = A C V^2 f + t A V I_{sc} f + V I_{leak}$$

This is the expression for power dissipation in CMOS technology, the dominant semiconductor technology for integrated circuits today. The first addend of the equation accounts for the dynamic power consumption on the chip (i.e. the power consumed by charging and discharging capacitive loads when transistors switch), which represents the useful work performed by the chip. A is the activity factor, meaning the proportion of transistors that switch in each cycle (since not all transistors have to switch every clock cycle); C is the capacitive load of the transistors; V is the supply voltage; and f is the frequency.

The second addend in the equation also accounts for dynamic power, although a smaller amount, in this case because of the transitory short-circuit current (Isc) that flows through transistors from voltage source to ground during the finite rise or fall time t. The last addend accounts for the static power consumption, i.e. the power consumption due to leakage current (Ileak), the only one that is present in a circuit that is powered but inactive. It applies to the whole circuit independently of the transistors' state, and therefore the activity factor does not appear in this addend.
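The three addends described above can be evaluated with a short sketch. It assumes the usual three-term form of the CMOS power expression, P = A·C·V²·f + t·A·V·Isc·f + V·Ileak; the function name `cmos_power` and every numeric value are hypothetical and chosen only for illustration, not taken from any real processor.

```python
def cmos_power(a, c, v, f, t, i_sc, i_leak):
    """Return the (dynamic, short-circuit, static) addends in watts.

    a: activity factor, c: switched capacitance (F), v: supply voltage (V),
    f: frequency (Hz), t: rise/fall time (s), i_sc: short-circuit current (A),
    i_leak: leakage current (A). Assumed three-term CMOS power model.
    """
    dynamic = a * c * v**2 * f            # charging/discharging capacitive loads
    short_circuit = t * a * v * i_sc * f  # brief supply-to-ground current
    static = v * i_leak                   # leakage, independent of activity
    return dynamic, short_circuit, static


# Hypothetical operating point, for illustration only.
params = dict(a=0.1, c=1e-7, t=1e-11, i_sc=5.0, i_leak=10.0)
base = cmos_power(v=1.2, f=3e9, **params)
doubled_f = cmos_power(v=1.2, f=6e9, **params)
halved_v = cmos_power(v=0.6, f=3e9, **params)

print("dynamic, short-circuit, static (W):", base)
print("dynamic power ratio at twice the frequency:", doubled_f[0] / base[0])
print("dynamic power ratio at half the voltage:", halved_v[0] / base[0])
```

Doubling the frequency doubles the dynamic term and leaves the static term untouched, while halving the voltage cuts the dynamic term to a quarter, which is the quadratic dependence the text relies on.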

If we look at the first term of the equation we can see why power has been increasing only linearly while frequency has been increasing exponentially. The reason is the quadratic dependence on the voltage.

Engineers have been able to continuously reduce this voltage from 5V down to below 1V, which has helped them to control dissipated power without losing performance. Unfortunately, many factors are interdependent and engineers constantly have to make trade-offs. For example, imagine we want to decrease dynamic power consumption on a chip (consider only the first term of the equation) by reducing the supply voltage, initially fixed at 2V. If we are able to reduce it to 1.7V, that is only a 15% decrease in voltage but it gives a significant 28% decrease in power. However, reducing supply voltage has a side effect on the maximum frequency of the circuit, which also depends on the threshold voltage of the transistors (Vth, the voltage at which a transistor switches on):

$$f_{max} \propto \frac{(V - V_{th})^2}{V}$$

In our example, if you had a threshold voltage of 0.5V and the circuit was operating at a frequency of 4GHz, you would have to reduce the threshold voltage to approximately 0.32V in order to maintain the same operating frequency. However, this might not be feasible, since threshold voltage depends on technological parameters, and beyond some specific value it is not possible to reduce it without making changes to your semiconductor manufacturing process. Without changing the threshold voltage, the maximum frequency would be reduced to 3GHz, a 25% decrease.
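The arithmetic of this example can be checked with a few lines of code. The sketch assumes that dynamic power scales with V² and that maximum frequency is proportional to (V − Vth)²/V, a relation that reproduces the figures quoted in the example; the function names are hypothetical.

```python
import math


def dynamic_power_ratio(v_new, v_old):
    """Ratio of dynamic power after a supply-voltage change (P ~ V^2)."""
    return (v_new / v_old) ** 2


def relative_fmax(v, v_th):
    """Quantity proportional to maximum frequency: (V - Vth)^2 / V (assumed model)."""
    return (v - v_th) ** 2 / v


def threshold_for_same_fmax(v_new, v_old, v_th_old):
    """Threshold voltage needed at v_new to keep the frequency reached at v_old."""
    target = relative_fmax(v_old, v_th_old)
    return v_new - math.sqrt(target * v_new)


power_saving = 1.0 - dynamic_power_ratio(1.7, 2.0)
new_fmax_ghz = 4.0 * relative_fmax(1.7, 0.5) / relative_fmax(2.0, 0.5)
new_v_th = threshold_for_same_fmax(1.7, 2.0, 0.5)

print(f"dynamic power saving: {power_saving:.1%}")
print(f"max frequency with Vth unchanged: {new_fmax_ghz:.2f} GHz")
print(f"Vth needed to stay at 4 GHz: {new_v_th:.2f} V")
```

Under these assumptions the saving comes out near 28%, the frequency near 3GHz, and the required threshold voltage near 0.32V, matching the values in the text.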

On the other hand, even if you were able to reduce supply and threshold voltages without affecting performance, leakage current depends exponentially on threshold voltage:

$$I_{leak} \propto \exp\left(-\frac{V_{th}}{V_T}\right), \qquad V_T = \frac{kT}{q}$$

The voltage VT is the thermal voltage, which depends on the absolute temperature T; k is the Boltzmann constant and q is the charge of an electron. The thermal voltage is about 26 mV at room temperature (300 K) and closer to 30 mV at typical chip operating temperatures. When the threshold voltage is large compared to the thermal voltage, leakage current is negligible, but for small threshold voltages, around 100mV, it becomes significant.

Moreover, the thermal voltage is not the only quantity that depends on temperature: threshold voltage usually also varies with temperature, and both variations compound in their effect on leakage current. The increase in leakage current implies an increase in static power consumption, so this imposes a practical lower limit on how far voltage can be reduced.
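As a rough illustration of this exponential dependence, the sketch below computes the thermal voltage kT/q and a relative leakage factor exp(−Vth/VT). It is a simplified model under stated assumptions: it ignores the temperature dependence of the threshold voltage itself and any process-specific factors, and the function names are hypothetical.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_k):
    """Thermal voltage VT = kT/q in volts."""
    return BOLTZMANN * temp_k / ELECTRON_CHARGE


def relative_leakage(v_th, temp_k):
    """Unitless factor proportional to leakage current: exp(-Vth / VT)."""
    return math.exp(-v_th / thermal_voltage(temp_k))


for temp_k in (300.0, 330.0):
    vt_mv = thermal_voltage(temp_k) * 1e3
    # How much leakage grows when Vth drops from 0.5 V to 0.3 V.
    growth = relative_leakage(0.3, temp_k) / relative_leakage(0.5, temp_k)
    print(f"T = {temp_k:.0f} K: VT = {vt_mv:.1f} mV, leakage growth = {growth:.0f}x")
```

Even in this simplified form, lowering the threshold voltage by a couple of hundred millivolts multiplies leakage by orders of magnitude, and raising the temperature increases leakage at a given threshold voltage.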

Figure 3 shows these effects for two different temperatures. The first curve, with T = 300 K, is the exponential equation in threshold voltage presented above. The second curve, with T = 330 K, is an estimate that takes into account the variation of threshold voltage as a result of increasing temperature. In this way, the abscissa still represents nominal threshold voltage, but the real threshold voltage on the transistor is biased toward lower values by the effect of temperature, thus having a greater effect on leakage current.

Figure 3: Effect of threshold voltage and temperature on leakage current

Leakage current also depends on gate insulator thickness. With very thin gate dielectrics, electrons can tunnel across the insulation, generating tunneling currents and leading to high power consumption. This effect is very important in current semiconductor process technologies, given the gate lengths of 32nm and below now in use.

Of course, the core of a processor is not the only component on a chip that consumes energy. Memories, for example, also consume a considerable amount of energy, and modern processors dedicate a large area of the die to several levels of cache memory.

Engineers apply several design techniques to reduce leakage current or the activity factor of the memory (the A factor in the power dissipation equation shown) and in this way they mitigate power consumption.

For example, the hierarchical organization in levels of cache not only improves data access time, it also helps in reducing power consumed, since smaller, nearer caches require less energy than larger, farther ones. With this organizational solution it is possible to reduce power while preserving performance. In line with this idea, another commonly used solution is to organize memory into banks for efficiency. In this case it is possible to activate only the bank being accessed and thereby save energy.

However, pursuing higher performance is not always the right thing to do. Sometimes it is appropriate to reduce power at the cost of some throughput. There are processors dedicated to specific applications that always perform the same kind of calculations, for example DSPs. Audio processing, digital filters, or data compression algorithms are typical applications on these devices, which are assessed by how much energy an operation requires and how long the processor takes to perform it.

A processor that takes more time than another to execute an algorithm but consumes less power can, in the end, be more energy efficient. A metric employed for measuring this efficiency is MIPS/W (Million Instructions Per Second per Watt). Although the MIPS metric has to be treated with care, in general devices with higher MIPS/W are considered more efficient, and this is especially interesting for embedded devices, particularly battery-powered ones. Indeed, there is now increasing interest in, and pressure for, energy-efficient processors in the world of servers and data centers.
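The point that a slower processor can be the more energy-efficient one can be made concrete with a small sketch. The two processors and the workload below are hypothetical, invented purely to illustrate the MIPS/W metric, and the sketch assumes constant power draw while the workload runs.

```python
def mips_per_watt(mips, watts):
    """Efficiency metric: million instructions per second per watt."""
    return mips / watts


def energy_joules(million_instructions, mips, watts):
    """Energy to run a workload, assuming constant power while it executes."""
    seconds = million_instructions / mips
    return watts * seconds


# Hypothetical processors and workload, for illustration only.
workload = 10_000  # million instructions
fast = {"mips": 2000, "watts": 10.0}
slow = {"mips": 1000, "watts": 2.0}

for name, cpu in (("fast", fast), ("slow", slow)):
    efficiency = mips_per_watt(cpu["mips"], cpu["watts"])
    energy = energy_joules(workload, cpu["mips"], cpu["watts"])
    runtime = workload / cpu["mips"]
    print(f"{name}: {runtime:.0f} s, {energy:.0f} J, {efficiency:.0f} MIPS/W")
```

In this made-up case the slow processor takes twice as long but uses less than half the energy for the same work, which is what its higher MIPS/W figure expresses.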

Transmission delays on a chip
The other main factor limiting increases in transistor density and frequency on a chip is wire transmission delay. The very high frequencies, on the order of gigahertz, used in modern processors mean that a clock cycle lasts only a fraction of a nanosecond. This small cycle time is becoming a problem for signal propagation.

Reducing feature size on a chip has enabled a decrease in gate length and capacitance on transistors and so increases clock rates, overcoming capacity-bound constraints. But wires on a chip are becoming slower due to higher resistance and capacitance. The width and height of wires are now smaller, and this results in higher resistance due to a smaller wire cross-sectional area.

With smaller area and hence less wire surface, surface-related capacitance decreases, but the distance between neighboring wires is also being reduced, and this produces a higher coupling capacitance. Coupling capacitance increases at a faster pace than surface capacitance decreases, so the combined effect is higher overall wire capacitance.

Wire transmission delay is directly proportional to the product of its resistance and capacitance, Rw × Cw, so with each new technology generation that shrinks feature size we get higher wire delays. With faster clock rates and slower wire transmission velocity, the distance that a signal can travel, and hence the chip area that can be reached in a single clock cycle, are reduced, leading to a new situation in which the constraint is now communication bound.
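A minimal sketch of this reasoning follows. It assumes a simple model in which both the resistance and the capacitance of a wire grow linearly with its length, so that delay grows with the square of the length, and it omits any constant of proportionality. The per-millimeter values and function names are hypothetical.

```python
import math


def wire_delay(r_per_mm, c_per_mm, length_mm):
    """Quantity proportional to wire delay: Rw x Cw for a wire of given length."""
    resistance = r_per_mm * length_mm   # ohms
    capacitance = c_per_mm * length_mm  # farads
    return resistance * capacitance     # seconds, up to a constant factor


def reachable_length(r_per_mm, c_per_mm, cycle_time):
    """Longest wire (mm) whose Rw x Cw product fits in one clock cycle."""
    return math.sqrt(cycle_time / (r_per_mm * c_per_mm))


# Hypothetical figures: a newer process doubles wire resistance per mm
# and doubles the clock rate (halving the cycle time).
older = reachable_length(r_per_mm=100.0, c_per_mm=0.2e-12, cycle_time=1e-9)
newer = reachable_length(r_per_mm=200.0, c_per_mm=0.2e-12, cycle_time=0.5e-9)

print(f"reachable length, older process: {older:.2f} mm")
print(f"reachable length, newer process: {newer:.2f} mm")
print(f"ratio: {newer / older:.2f}")
```

Under these assumptions, doubling both the wire resistance and the clock rate halves the distance a signal can cover in one cycle, so the reachable chip area shrinks to a quarter.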

For a fixed micro-architecture this would not be a big problem, since circuit area would decrease quadratically. But in order to make the most of smaller transistor sizes and get higher IPC, designers develop more complex micro-architectures, making deeper pipelines, adding more execution units, and using large micro-architectural structures. Now, higher delays in communications across the chip put a practical limit on the size and even the placement of these structures, and on the maximum operating frequency.

As an example, the design of the misprediction pipeline used in the Intel Pentium 4 required twice as many stages as the Pentium III pipeline. With higher clock rates and wire delays, the pipeline has to be divided into smaller pieces that do less work during each stage. But wire delays had become so large that two of the stages of the Pentium 4 pipeline were extra stages required only to drive signals from one stage to the following one, so that there was enough time left to perform the required computation, since much of the clock cycle time was spent by the signal in reaching the next stage.

A similar example of how wire delay affects a design can be found in the Advanced Microcontroller Bus Architecture (AMBA) specification from ARM. The Advanced System Bus (ASB), introduced in the first AMBA specification and designed to interconnect high-performance system modules, uses bidirectional buses and a master/slave architecture.

In the second AMBA specification, the Advanced High-performance Bus (AHB) was introduced to improve support for higher performance and as a replacement for ASB. In this new bus specification, among other changes, bidirectional buses have been replaced by a multiplexed bus scheme. At first this modification would seem to add unnecessary wires and complexity to the circuit. But the effect of wire delays in very high performance systems sometimes makes it necessary to introduce repeater drivers (as seen in the Pentium 4 case). This is possible in the unidirectional buses that make up a combined multiplexed bus, but it is very hard in bidirectional buses.

The challenges ahead
We have seen the two main restrictions that technology imposes on continuing to apply Moore's law and improve processor performance. But technology is constantly evolving. Scaling down feature sizes has enabled increased transistor density and frequency, and designers are still managing to shrink transistor size and increase the number of transistors on a chip to more than a billion.

Predictions were that semiconductor process technology would reach 35nm gate lengths in 2014, but in fact chips have been manufactured at 22nm since 2011. Power dissipation and transmission delay problems are motivating everyone in the industry to investigate new materials for making transistors, and new organizational and architectural solutions are already being applied in modern processors. High-k gate oxides (k refers to the dielectric constant of a material) are replacing the silicon dioxide gate dielectric used for decades, allowing a physically thicker insulator for the same gate capacitance and thereby controlling leakage currents.

New low-k dielectrics between wires make it possible to reduce coupling capacitance and therefore transmission delays. Traditional micro-architectures implementing a single large monolithic core are evolving into simpler multicore micro-architectures in which communication is mainly local, thus avoiding large delays.

Recently, some chip manufacturers, such as Intel, have announced three-dimensional transistors. Its new Ivy Bridge family of processors, the successor to the Sandy Bridge family, is based on a new tri-gate transistor technology that boosts processing power while reducing the amount of energy needed.

Instead of the flat channel of the previous planar transistors, a tri-gate transistor uses a raised silicon fin that the gate wraps on three sides, which gives the gate tighter control over the channel and reduces leakage. (Stacking circuit blocks vertically to shorten the distance between them and reduce wire delays is a different technique, three-dimensional integration, and is not what tri-gate transistors provide.) According to Intel, its 22nm 3-D Tri-Gate transistors consume less than half the power when operated at the same clock frequencies as planar transistors on 32nm chips, exceeding what is typically achieved from one process generation to the next.

Multicore architectures are evolving quickly. For example, Tilera has developed the first 100-core processor on a single chip! To achieve such a level of integration, Tilera combines a processor with a communications switch in a unit that its designers call a “tile.” By combining such tiles, the company is able to build a piece of silicon that forms a mesh network. Processors are usually connected to each other through a bus, but as the number of processors increases this bus quickly becomes a bottleneck. With a Tilera tiled mesh, every processor gets a switch and they all talk to each other as in a peer-to-peer network. In addition, each tile can independently run a real-time operating system. Alternatively, you can group multiple tiles together to run an operating system like SMP Linux.

And research is under way to develop amazing graphene transistors, each of which is made from a sheet of carbon just one atom thick. Theoretically, these transistors will reach very high operating frequencies, approaching 1 THz (1000 GHz), and it will even be possible to manufacture them on flexible substrates. There are still lots of challenges for this technology, though, and we will probably have to wait several years to see these advances become reality.

The problem now facing the industry is how to take full advantage of this huge parallel processing power. But the embedded software industry is already developing powerful tools to help build the new and complex world of many-core applications.

Proposals like OpenMP and MPI for shared and distributed memory architectures, respectively, or OpenCL (Open Computing Language), the open standard for parallel programming of heterogeneous systems, are very promising. With OpenCL you can develop software for systems with a mix of multicore CPUs, GPUs, and even DSPs. But probably the biggest challenge is to change programmers’ mindsets so that they learn how to write highly parallel and reliable software for these systems.

Julio Díez is a software engineer with fifteen years of experience, mainly in the embedded world. He has spent the last six years developing communication and security software for embedded systems, including the first secure communication system in its class for the Spanish NSA. He is interested in multicore architectures, operating systems, software design, and parallel programming. He holds a bachelor’s degree in telecommunications engineering from Technical University of Madrid, Spain. You can reach him at .
