Advanced Processor Features and Why You Should Care: Part 1

The CPU architectures used in embedded systems are becoming increasingly complex. Multi-core processors are fast becoming the rule rather than the exception in our industry. Convergent processors, mixtures of both the standard MCU and DSP worlds, are also starting to become prevalent. And these are just the beginning of what our industry is turning to in order to deliver increasing levels of services in ever-shrinking packages.

So, what do all of these new processor architectures portend for embedded systems developers? How will these advances change the way we develop embedded systems?

While 8- and 16-bit processors still have a place in certain applications, the industry is increasingly moving to 32- and 64-bit processors. In some cases, it can be difficult to justify the transition to 64-bit processors other than to say that we do it because they are the best processors the industry has to offer. Of course, software always has a way of consuming as much CPU horsepower and memory as we are willing to throw at the problem. So, I’m sure that developers will find something to do with the extra performance.

On the large end of the embedded spectrum, we have the carrier-grade systems. Communications switching systems and the like are huge consumers of CPU and memory. In this market, it’s not uncommon to see hyper-threaded, symmetric multiprocessing engines with 16 Gbytes or more of RAM and terabytes of disk space. This can be seen in products like Tyan’s Thunder quad Opteron server board.

Even battery-operated devices are becoming increasingly complex. This can easily be seen in portable game machines and cellular telephones. The convergence of the cell phone as camera, PDA, game machine and streaming media device is forcing developers to resort to multiple cores to be able to meet performance goals while maintaining battery life.

Figure 1. TI OMAP 2420 (Source: Texas Instruments)

For example, the Texas Instruments OMAP2420 is shown in Figure 1, above. This processor has both an ARM11 core and a TMS320C55x fixed-point DSP on board. In addition, a number of specialty silicon units such as hardware encryption engines, graphics accelerators, WLAN interfaces and even TV out are all included on-chip. This is all in addition to its primary function as a cell phone processor.

These levels of extreme integration come with a significant price in programming complexity. Simple sequential thinking during development will not result in a satisfactory product. We must look at the architecture of the CPU vis-à-vis our use cases for the product, optimizing the design and the assignment of software entities to hardware, to ensure that we are getting the maximum “bang for the buck” from such sophisticated silicon.

SISD, SIMD and Super-Scalar Processors
In the simplest case, we have a single processor (single execution unit) with a single instruction/data stream. This is referred to as a Single Instruction/Single Data (SISD) architecture. This type of CPU is what we see most frequently in embedded applications. The typical MPC860DT PowerPC processor falls into this category. However, this is the starting point for much more interesting processor architectures.

From the simple SISD architecture, we can add a specialized co-processor such as the AltiVec unit found on the PowerPC G4 (MPC74xx) processors. The AltiVec unit is designed to handle vectors of data in a single instruction. Technically, this makes the AltiVec a Single Instruction/Multiple Data (SIMD) architecture with an SISD front-end processor.

Intel-compatible processors also have a variation on the SIMD front. The MMX and SSE instruction sets perform vectorized processing on a register of data. Referred to as SIMD-within-a-Register (SWAR), these specialized instructions optimize graphics transformations found in many common applications.
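To make the SWAR idea concrete, here is a minimal sketch in C using the SSE2 intrinsics available on Intel-compatible processors. The function name, buffer sizes and alignment requirements are illustrative assumptions, not taken from the article; the point is simply that a single instruction operates on an entire 128-bit register, 16 pixels at a time.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical example: add two 8-bit pixel buffers with unsigned
     * saturation, 16 pixels per SIMD instruction. Assumes n is a
     * multiple of 16 and that the buffers are 16-byte aligned. */
    void add_pixels_sse2(uint8_t *dst, const uint8_t *a,
                         const uint8_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m128i va = _mm_load_si128((const __m128i *)(a + i));
            __m128i vb = _mm_load_si128((const __m128i *)(b + i));
            /* One instruction adds all 16 bytes held in the register. */
            _mm_store_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
        }
    }

A scalar version of the same loop would issue one add per pixel; the SWAR version issues one add per sixteen.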

The typical embedded RISC processor is capable of executing one instruction per cycle. This is referred to as a scalar processor. However, if we add additional Arithmetic Logic Units (ALUs, or execution units) to the processor core, we have the ability to execute more than one instruction per cycle, yielding a super-scalar processor.

Processor families such as the Intel Pentium and i80960 were early examples of super-scalar processors. Today, most of the modern 32- and 64-bit processors have super-scalar elements. For example, the Freescale Semiconductor MPC7455 shown in Figure 2 below can execute 3 instructions plus a branch per cycle. This is exclusive of AltiVec instructions.

Figure 2. Freescale Semiconductor MPC7455 Super-Scalar CPU (Source: Freescale Semiconductor)

It’s important to distinguish between super-scalar processors and multi-core processors, however. Processors like the MPC7455 are not multi-core. There is a single pre-fetch unit and pipeline with most super-scalar processors. The ability to execute multiple instructions simultaneously relies upon the ability of the processor to schedule instructions that can be executed in parallel because of their relative independence from each other. This is referred to as Instruction-Level Parallelism, or ILP.

ILP introduces potential problems for the developer due to an ILP feature known as “speculative” execution. With speculative execution, the instruction scheduler tries to guess where your code may branch next based on past behavior and executes that code before you need it.

For instance, let’s say that you are in a loop of code that frequently calls a function located elsewhere in your program. If the data required for the function call is available, the instruction scheduler may assign an otherwise idle ALU to the call so that its return value is available before the main loop code actually reaches the function call.

In general, this can be quite beneficial for overall performance. However, occasionally, this out-of-order (OOO) execution can cause problems. For example, device driver writers frequently need to set up a series of registers prior to executing I/O to a hard disk.

If we need to set the number of sectors, the sector number, cylinder number and head number registers prior to issuing the command to read a sector, OOO execution could cause problems if the read command was issued prior to setting the cylinder number. This is such a significant problem that operating systems like Linux provide memory barriers for developers to ensure that all of the memory operations issued up to that point have completed prior to continuing.
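As a sketch of what that looks like in practice, the fragment below is written in Linux-kernel style for the disk-register scenario above. The register offsets, command value and function name are placeholders rather than a real device’s layout; the point is the wmb() write barrier that forces the setup writes to complete before the command register is written.

    #include <linux/io.h>     /* writeb() and the other I/O accessors */
    #include <linux/types.h>  /* u8, u16 */

    /* Hypothetical register offsets, for illustration only. */
    #define REG_SECT_COUNT  0x02
    #define REG_SECT_NUM    0x03
    #define REG_CYL_LOW     0x04
    #define REG_CYL_HIGH    0x05
    #define REG_HEAD        0x06
    #define REG_COMMAND     0x07
    #define CMD_READ_SECTOR 0x20

    static void issue_read_sector(void __iomem *base, u8 sect, u16 cyl, u8 head)
    {
        writeb(1,          base + REG_SECT_COUNT);
        writeb(sect,       base + REG_SECT_NUM);
        writeb(cyl & 0xff, base + REG_CYL_LOW);
        writeb(cyl >> 8,   base + REG_CYL_HIGH);
        writeb(head,       base + REG_HEAD);

        /* Write memory barrier: ensure every setup write above has been
         * issued to the device before the command register is touched. */
        wmb();

        writeb(CMD_READ_SECTOR, base + REG_COMMAND);
    }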

How would you know to use a memory barrier? In general, any time the execution order of I/O instructions is important to the application. But if the behavior of the system is radically different with debugging enabled, or because of enabling a compiler optimization, then you might be a victim of the hardware trying to help you with OOO execution. If your processor is super-scalar and supports OOO execution, then you should at least consider the possibility that some of your bugs may be induced by the hardware doing more for you than you expected.

Pipelines, Caches and Branch Prediction
Another set of issues that we need to address relates to the internals of the processor’s instruction pipeline, branch prediction and its use of caches. Understanding how these issues are interrelated will help us understand how to design code that best takes advantage of the CPU’s architecture, especially once we see how differences in coding style can have major impacts on CPU performance.

The basic processor pipeline contains four stages. These stages are instruction fetch, decode, execute and write back (retire). In an ideal world, each of these stages would take a single instruction time to execute. However, depending on whether the processor is RISC or CISC-based, each of these steps can take multiple machine cycles to complete. Figure 3 shows a basic 4-stage processor pipeline.

Figure 3. Basic 4-stage pipeline (Source: Ars Technica)

The processor pipeline is analogous to the automobile assembly line. Let’s say that each stage of the work can be accomplished in 1 nanosecond. That would mean that it would take 4 nanoseconds to completely process one instruction. If each stage is independent, and we continue to pump instructions in at one end of the pipeline, it means that after the first 4 nanoseconds that it takes to load the pipeline, we start completing one instruction every subsequent nanosecond. So, the use of the pipeline has an initial delay, but once the delay is paid for, we see a significant increase in performance.
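This is easy to generalize. For an idealized k-stage pipeline with stage time t, and assuming (as above) that a new instruction enters the pipeline every stage time, the total time for N instructions is:

    T_{\text{unpipelined}} = k\,N\,t
    T_{\text{pipelined}} = (k + N - 1)\,t

With the numbers above (k = 4, t = 1 ns) and an assumed run of N = 100 instructions, the pipelined processor finishes in 103 ns instead of 400 ns, approaching the ideal 4x speedup as N grows.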

Because the pipeline stages may take multiple cycle times, some processor architectures add additional stages to sub-divide the work performed at each stage so the pipeline can move in lock step with the processor clock. The reduction of work performed at each stage has another effect as well: with smaller steps, we can run faster processor clock speeds. This results in two basic pipeline models. The first is referred to as the “shallow-and-wide” model while the second is the “deep-and-narrow” model.

The shallow-and-wide model is designed for energy efficiency. An excellent example of this model is the Freescale (Motorola) G4e processor. In this processor there are five 7-stage instruction pipelines feeding 10 execution units (8 filled per cycle). This means that at any point in time there may be 16 in-flight instructions with 32 registers and 16 shadow registers. This shallow-and-wide approach uses a lower processor clock speed and a lower transistor count, resulting in lower power consumption and less thermal dissipation.

Before moving on to the deep-and-narrow model, note that one of the key considerations in keeping the pipeline full is the ability to fetch instructions in a single cycle time. This is where the cache memory comes into play. The cache holds a duplicate of a portion of physical RAM. However, whereas SDRAM may have 50 ns access times, the cache may have 1 ns access times.

The cache memory in modern processors is typically divided into level 1 and level 2 caches. The level 1 (or L1) cache is usually very high-speed static RAM. Static RAM typically takes 5-6 transistors per cell to create. Consequently, the amount of L1 cache is limited by the amount of processor real estate we are willing to dedicate to static RAM.

L2 cache, on the other hand, is composed of dynamic RAM and may be in the 5-10 ns access time range. Since dynamic RAM requires only 1-2 transistors per cell, we can get a lot more L2 cache in the same amount of space as we would with L1 cache. In the for-what-it’s-worth category, some processors are even starting to add L3 caches. Typical L1 cache sizes are in the 16-32 Kbyte range. L2 cache sizes range from 512 Kbytes to as much as 2 Mbytes.

What this all implies is that we have a memory hierarchy as shown in Figure 4, below. Assuming a 1 GHz processor to make the math easier, accessing the L1 cache can be done in a single clock cycle. Accessing the L2 cache might take 5 clock cycles, and accessing physical SDRAM might take 50 clock cycles.

Figure 4. Memory Hierarchy
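A standard textbook way to put a single number on this hierarchy (not from the article) is the average memory access time, which weights each level’s latency by how often we miss and fall through to the next level. Using the 1-, 5- and 50-cycle figures above, and assumed hit rates of 95% for L1 and 90% for L2:

    \text{AMAT} = t_{L1} + m_{L1}\,(t_{L2} + m_{L2}\,t_{\text{RAM}})
                = 1 + 0.05 \times (5 + 0.10 \times 50) = 1.5 \text{ cycles}

Even modest changes in those miss rates move the average dramatically, which is why the cache-aware coding practices discussed below matter.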

The L1 cache contains a subset of the data found in the L2 cache, and the L2 cache is a subset of the data found in main RAM. If the pre-fetch unit goes to fetch an instruction and it can be found in the cache, then we have a cache hit. Given a 7-stage pipeline, fetching from the L1 cache will keep us running at full speed. Fetching from the L2 cache would cause a 5-cycle delay, creating a gap in the pipeline referred to as a pipeline bubble.

This is illustrated in Figure 5, below. If the instruction to be fetched is not in either of the caches, we have a cache miss and must fetch the instruction from main RAM. This means that we will run the pipeline dry in 7 clock cycles and then have to wait for 43 more clock cycles before we can start the pipeline again.

Figure 5. Seven-Stage Pipeline with 5-cycle Bubble

As mentioned previously, manufacturers add additional pipeline stages so they can reduce the amount of work performed at each stage and can then increase the clock speed of the processor. Perhaps the ultimate implementation of this approach is the Pentium 4. This processor follows the “deep-and-narrow” pipeline model with three 20-stage pipelines and 7 execution units. This results in over 128 instructions in flight at any point in time. This situation is exacerbated with the P4 Prescott and its 31-stage pipeline.

These deep pipelines allow the processor clock speeds to approach 5 GHz, and they are optimized for streaming media applications. However, if the instruction stream contains a jump to a location outside of the caches, we run the risk of having as many as 70-80 idle clock cycles for every cache miss, as opposed to the 50 idle clock cycles encountered by the shallow-and-wide processor pipelines. This problem is somewhat mitigated by the fact that the deep-and-narrow processor is running at a higher clock speed. Nonetheless, extremely “branchy” code will perform poorly regardless of the clock speed.

To try to mitigate the effects of a cache miss, most modern processors add branch prediction hardware. This circuitry tries to predict which code will be executed when a branch is encountered in the instruction stream and pre-fetches that code into the caches. If the prediction is correct, the processor continues at full speed. However, if the prediction is wrong, then we have to flush the instruction pipelines and start over. This results in the loss of 16 in-flight instructions in the shallow-and-wide pipeline, but the loss of over 128 in-flight instructions in the deep-and-narrow model.
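One practical lever here is to tell the compiler which way a branch almost always goes, so that the common path is laid out as straight-line, prediction-friendly code. GCC’s __builtin_expect (the mechanism behind the Linux kernel’s likely()/unlikely() macros) is one such hint; the error-handling scenario below is an illustrative assumption rather than something from the article.

    #include <stddef.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Hypothetical example: the error check almost never fires, so hint
     * the compiler to keep the hot loop on the fall-through path. */
    long sum_samples(const short *buf, size_t n)
    {
        if (unlikely(buf == NULL || n == 0))
            return -1;               /* rare path, kept out of the hot code */

        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += buf[i];           /* highly predictable loop branch */
        return sum;
    }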

What all of this means to embedded developers is that the deep-and-narrow processor is well suited to linear code with few branches where we are doing a lot of repetitive operations. Examples of this type of code can be found in streaming media applications. However, the shallow-and-wide pipeline model is better suited for very “branchy” code, as we might find in the case of multiple, asynchronous threads running simultaneously.

In either case, we must give serious consideration to structuring our applications to eliminate as many as possible of the conditional function calls that could cause pipeline flushes. Another coding technique is to co-locate code that calls functions with the functions themselves, improving what is referred to as “locality of reference”. This results in code that stands a good chance of fitting completely in either the L1 or L2 caches, minimizing the effects of cache misses.
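As a small illustration of that locality-of-reference advice, the helper below is defined right next to the loop that calls it and marked static inline, so the compiler can place or inline it alongside the hot loop and the pair stands a good chance of sharing the same instruction-cache lines. The function names and the scaling operation are hypothetical.

    #include <stddef.h>

    /* Hypothetical helper, co-located with its only caller. Marking it
     * static inline lets the compiler drop the call/branch entirely and
     * keep the loop body contiguous in the instruction cache. */
    static inline int scale_sample(int s, int gain)
    {
        return (s * gain) >> 8;
    }

    void scale_block(int *samples, size_t n, int gain)
    {
        for (size_t i = 0; i < n; i++)
            samples[i] = scale_sample(samples[i], gain);
    }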

In Part 2, the author moves beyond the familiar, if complex, world of traditional single-processor architecture into designs that use symmetric multi-threading, symmetric multiprocessing and multi-core.

Michael E. Anderson is CTO/Chief Scientist at The PTR Group, Inc.

This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Silicon Valley 2006. Used with permission of the Embedded Systems Conference. Please visit www.embedded.com/esc/sv.

To learn about this general subject on Embedded.com, go to More about multicores, multiprocessing and tools.
