A huge amount of attention is focused on multi-threading architectures such as MIPS Technologies, Inc.'s new MIPS32 34K cores because such architectures offer the potential for substantial performance gains at little cost in chip real estate or power consumption.
The key advantage of hardware multi-threading is that it uses cycles to process instructions from other threads when the processor would otherwise sit idle while waiting for a cache refill.
The cost of adapting consumer device applications for multi-threading is usually small, because most of them are already designed as a set of semi-independent threads. Application threads can be assigned to hardware resources dedicated by the processor to handling an individual thread. Multiple threads can be assigned to such hardware, and share use of CPU cycles to achieve maximum efficiency.
In this article, we will look at how multi-threading architectures work, and how they can enable additional performance at minimal cost. Then we will show how existing applications can easily be converted to run on a multi-threading processor, and how new applications can be written for this powerful new type of architecture.
Embedded computing facing performance barrier
Manufacturers of consumer devices and other embedded computing products are adding new features such as WiFi, VoIP, Bluetooth, video, etc. at a rapid rate. Historically, increased feature sets have been accommodated by ramping up the clock speed of the processor. Clock speed has already increased to the 3+ Gigahertz level in desktops and high Megahertz levels even in embedded applications.
In the embedded space, this approach rapidly loses viability because most devices are already running up against power consumption and real estate constraints that limit additional processor speed increases. Cycle speed increases drive dramatically greater power consumption, making high cycle speeds unmanageable for more and more embedded applications.
|Figure 1: Processor speeds outpacing memory.|
In addition, further processor speed improvements yield diminishing performance gains because memory performance has not kept pace with processor speed, as shown in Figure 1, above.
Processors are already so much faster than memory that more than half the cycles in many applications are spent waiting while the cache line is refilled. Each time there is a cache miss or another condition that requires off-chip memory access, the processor needs to load a cache line from memory, write those words into the cache, write the old cache line into memory, and resume the thread.
MIPS Technologies Inc. has stated that a high-end synthesizable core taking 25 cache misses per thousand instructions (a plausible value for multimedia code) could be stalled more than 50% of the time if it has to wait 50 cycles for a cache fill. And since processor speed is continuing to advance at a faster rate than memory speed, the problem is growing worse.
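That stall figure can be sanity-checked with a few lines of arithmetic. The sketch below uses the numbers quoted above (25 misses per thousand instructions, a 50-cycle fill) and assumes one cycle per instruction when the pipeline is not stalled; that last assumption is ours, not the article's:

```c
/* Fraction of time stalled, assuming one cycle per instruction when
 * not waiting on memory (that assumption is ours, not the article's). */
double stall_fraction(double misses_per_1000_insns, double fill_cycles)
{
    double busy  = 1000.0;                              /* execution cycles  */
    double stall = misses_per_1000_insns * fill_cycles; /* cache-fill cycles */
    return stall / (busy + stall);
}

/* stall_fraction(25, 50) -> 1250 / 2250, roughly 0.56: the core is
 * stalled about 56% of the time, matching "more than 50%" above. */
```

With those inputs the core spends 1250 of every 2250 cycles waiting on memory, which is where the "more than 50%" figure comes from.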
The multi-threading alternative
Multi-threading solves this problem by using the cycles that the processor would otherwise waste while waiting for memory access to handle multiple concurrent threads of program execution. When one thread stalls waiting for memory, another thread immediately presents itself to the processor to keep computing resources fully occupied.
Notably, conventional processors cannot employ this approach because it takes a large number of cycles to switch the context from one thread to another. Multiple application threads must be immediately available and “ready-to-run” on a cycle-by-cycle basis for this approach to work.
The 34K processor is the first such multi-threading product from a major supplier of embedded processors targeting the consumer device market. Each software thread runs on a TC, or Thread Context. A TC includes a complete set of general-purpose registers and a program counter (that's where the name comes from: in OS terminology, the vital registers and PC of a thread make up the “Thread Context”).
Each TC has its own instruction prefetch queue, and the queues are kept full independently. That means that the core can switch between the threads on a cycle-by-cycle basis to avoid the overhead involved in a software context switch. Adding more TCs requires very little additional silicon. TCs share most of the CPU hardware, including the execution unit, ALU, and caches. Moreover, adding a TC does not require the CPU to have another copy of the CP0 registers by which OS software runs the CPU.
A set of shared CP0 registers and the TCs affiliated with them make up a Virtual Processing Element (VPE). A TC runs a thread and a VPE hosts an OS: if you have two VPEs, you can either have two independent operating systems or one SMP-style OS. A VPE with one TC looks exactly like a traditional MIPS32 architecture CPU and is fully compliant with the MIPS architecture specification: it's a complete virtual processor.
The 34K core can be built with up to nine TCs and two VPEs. The affiliation of TCs to VPEs is determined at run-time. By default, all TCs which are ready for execution get a fair share of processing time, but the 34K also includes handles for an application to influence thread scheduling when some particularly demanding thread might otherwise be starved. That is, software can control the Quality of Service (QoS) for each thread. Application software interacts with a hardware Policy Manager, which assigns dynamically-changing priorities to individual TCs. A hardware Dispatch Scheduler then assigns threads to the execution unit on a cycle-by-cycle basis to meet the QoS requirements.
|Figure 2: Multi-threading improves pipeline efficiency.|
In a multi-threaded environment such as the 34K, performance can be substantially improved because whenever one thread waits for memory access, other threads use the processor cycles that would otherwise be wasted. Figure 2, above, shows how multi-threading can speed up an application. With just Thread 0 running, only 5 out of 13 processor cycles are used for instruction execution and the rest are spent waiting for the cache line to refill. The efficiency in this case, using conventional processing, is only 38%.
Adding Thread 1 makes it possible to use 5 additional processor cycles previously wasted waiting. With 10 out of 13 processor cycles now used, efficiency improves to 77%, providing a 100% speedup over the base case. Adding Thread 2 makes it possible to fully load the processor, executing instructions on 13 out of 13 cycles for 100% efficiency. This represents a 160% speedup (2.6x the base throughput) compared to the base case.
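The efficiency figures above are just the ratio of busy cycles to total cycles; a minimal sketch using the Figure 2 cycle counts:

```c
/* Pipeline efficiency is simply busy cycles over total cycles.
 * The cycle counts in the comments are those of the Figure 2
 * example (a 13-cycle window). */
double efficiency(int busy_cycles, int total_cycles)
{
    return (double)busy_cycles / total_cycles;
}

/* efficiency(5, 13)  -> ~0.38  one thread    (38%)
 * efficiency(10, 13) -> ~0.77  two threads   (77%)
 * efficiency(13, 13) ->  1.00  three threads (100%) */
```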
Using EEMBC performance benchmarks, the 34K core shows an application speedup of 60% over the 24KE family using just two threads, with only a 14% increase in die size, as shown in Figure 3, below.
|Figure 3: EEMBC benchmark performance examples show a 60% improvement with just two threads.|
Adapting software to multi-threading
A key advantage of the multi-threading approach is that in most cases it will run existing software with relatively little modification. Most consumer device applications are already written as a series of semi-independent threads. Each thread can automatically or manually be assigned to a particular hardware TC.
If the currently executing thread cannot continue because of a delay caused by a cache miss or other circumstance, CPU execution switches from that TC to one whose thread is ready to run, without any wasted cycles. The more threads in the application, the greater the potential to use cycles that would otherwise be wasted waiting for memory access.
Multi-threading is ideal for anyone using or considering using an RTOS, since RTOS applications are inherently multi-threaded. There is no need to rewrite an RTOS application for multi-threading because the RTOS can map the application threads to the TCs automatically under program control, in the same way that it would map the threads to a conventional processor.
|Figure 4: Mapping threads to TCs.|
If there are more threads than TCs, a conventional context switch is usually required. These context switches occur just as in a conventional processor: the RTOS saves the state for the current task, loads the context for another task, and begins execution. The multi-threading environment obviously has the potential for more context switches than a conventional processor, so the speed at which context switching can be accomplished becomes more significant.
Linux/Windows vs RTOSes for multithreading
This highlights an advantage of a fast RTOS relative to OSes such as Linux and the embedded versions of Windows. Typical real-time performance for Linux is generally in the range of a few hundred microseconds to a few milliseconds, but worst-case Linux real-time performance is unbounded. A fast RTOS, on the other hand, provides deterministic real-time performance in the range of about 1 to 2 microseconds on a single-threaded processor, and even faster on a multi-threaded processor.
The RTOS assigns unique (non-duplicated) resources to unique TCs. By convention, the single floating point unit (FPU) is always assigned to TC0. Any thread that performs hardware-based floating point operations needs to be mapped to TC0, meaning that all such threads must share TC0. This leads to some interesting programming choices, particularly the choice of whether to implement floating point operations in hardware or software.
Doing floating point in hardware is obviously faster but, on the other hand, requires sharing the FPU. If threads only do a small amount of floating point work, it may make sense to implement their floating point operations in software, while threads that do intensive floating point work would normally use hardware floating point and be mapped to TC0. It's important to note that this choice does not require recoding, because the decision of whether to use hardware or software floating point can be implemented with a compiler switch.
Assigning weights to threads
If the application doesn't specify weights for the various threads, the scheduler will assign equal weights to all. Alternatively, time slicing can be used so that threads share CPU cycles according to user-specified weights. Assigning weights is equivalent to assigning a proportion of CPU cycles to particular threads. Thread weights are mapped to the hardware TCs transparently by the RTOS.
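Concretely, a thread's weight buys it a proportional fraction of the available cycles. The sketch below shows that arithmetic; the three weights and the function name are hypothetical, and the actual TC mapping is handled by the RTOS:

```c
/* Proportional share: a thread with weight w out of total weight W
 * receives w / W of the available cycles. The weights used below are
 * hypothetical examples, not values from the article. */
double cycle_share(int weight, const int *weights, int count)
{
    int total = 0;
    for (int i = 0; i < count; i++)
        total += weights[i];            /* sum all thread weights */
    return (double)weight / total;      /* this thread's fraction */
}

/* With weights {3, 2, 1}, the threads receive 1/2, 1/3, and 1/6
 * of the CPU cycles respectively. */
```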
Some existing applications may be designed for conventional processors under the assumption that a lower priority thread will not be allowed to run while a higher priority thread is “ready”. In embedded programming, ready means that all conditions necessary for the thread to run have been met and the only thing preventing it from running is its priority.
Multi-threading runs the risk of violating this condition, since lower priority threads are able to run whenever higher priority threads are stalled. Writing code to eliminate the need for this condition will optimize performance.
On the other hand, existing code written based on this condition can be run unmodified on multi-threading processors simply by setting an operating system switch that only allows threads of the same priority to be loaded in TCs at the same time. When setting this switch, be sure to assign threads that can be run in parallel to the same priority level whenever possible.
The importance of being interruptible
Interrupts are critical in a conventional embedded application because they provide the primary, and in many cases the only, means for switching from one thread to another. Interrupts fulfill exactly the same role in multi-threading applications as they do in a conventional application. But there is an important difference: in a multi-threaded application, changes from one thread to another occur not only through interrupts but also whenever spare CPU cycles can be put to use.
It's important to avoid the situation where one thread is interrupted while modifying a critical data structure, enabling a different thread to make other changes to the same structure. This could result in the data structure being left in an inconsistent state, with potentially catastrophic results.
Most conventional applications overcome this problem by briefly locking out interrupts while an ISR or system service is modifying crucial data structures inside the RTOS. This reliably prevents any other program from jumping in and making uncoordinated changes to the critical area being used by the executing code.
This approach, however, is not sufficient in a multi-threaded application because of the potential for a switch to a different TC that is not impeded by the interrupt lockout and therefore might operate on the critical area. This can be overcome by using the DMT instruction in the 34K architecture to disable multi-threading while the data structure is being modified.
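The pattern is the familiar critical-section bracket, with the disable/enable of multi-threading taking the place of (or accompanying) the interrupt lockout. The sketch below is a portable analogue using a pthreads mutex so it runs anywhere; on a 34K RTOS, the lock and unlock calls would correspond to the DMT and EMT instructions, and all of the names below are made up for illustration:

```c
#include <pthread.h>

/* Portable sketch of guarding a critical data structure. On the 34K,
 * the RTOS would bracket the update with DMT (disable multi-threading)
 * and EMT (enable multi-threading); a pthreads mutex stands in here
 * so the idea is runnable on any host. Names are illustrative only. */

static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;
static int shared_count = 0;          /* the "critical data structure" */

void update_shared(void)
{
    pthread_mutex_lock(&guard);       /* 34K analogue: DMT            */
    shared_count++;                   /* modify the structure safely  */
    pthread_mutex_unlock(&guard);     /* 34K analogue: EMT            */
}

/* Helper: perform n guarded updates and report the final count. */
int run_updates(int n)
{
    for (int i = 0; i < n; i++)
        update_shared();
    return shared_count;
}
```

The key point is that the bracket must stop not just interrupts but also hardware thread switches, which is exactly what DMT provides on the 34K.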
With these relatively simple exceptions, application code can run unchanged when moving from conventional to multi-threaded processors. This makes it easy and inexpensive to take advantage of the ability of multi-threading to use the CPU cycles that are often wasted by conventional RISC processors. Multi-threading meets the demands of today and tomorrow for consumer, networking, storage and industrial device applications for high performance with only minor increases in cost and power consumption.
Compared to its primary competitor, the multicore approach, multi-threading requires a smaller die area and consumes less power while also being simple to program and requiring few or no changes to existing applications. Multicore approaches have their own advantages and strengths, and there is no reason the two approaches cannot be combined for “the best of both worlds.” In applications that require high performance, low cost, and minimum power consumption, multi-threading is a compelling approach.
John A. Carbone, vice president of marketing for Express Logic, has 35 years of experience in real-time computer systems and software, ranging from embedded system developer and FAE to vice president of sales and marketing. Prior to joining Express Logic, Mr. Carbone was vice president of marketing for Green Hills Software.
To read more about this topic go to “More about Multithreading and Multicores.”