
Achieving multicore performance in a single core SoC using a multi-threaded virtual multiprocessor: Part 1

Designers everywhere face ever-increasing constraints on system cost and power consumption, while at the same time being required to add more performance and functionality to their designs. This is a difficult trade-off to address successfully.

One previous approach has been to ramp up the clock speed of the processor, but this usually results in increased power consumption. Additionally, memory performance has not kept pace with processor technology (see Figure 1, below), and this mismatch limits any significant gains in system performance. Consequently, the higher-frequency approach has led to diminishing returns.

Figure 1: Processor-memory mismatch causes system performance bottlenecks

A multi-core system is another option, but this suffers from a larger die area and higher cost. Any increase in performance comes at a fairly substantial cost in silicon and system power consumption. Multiple-issue processors with two or more execution units offer another option, but they struggle to make best use of hardware resources, and also have an area penalty. Additionally, the software has to be revised in many cases to make best use of the multiple pipelines.

Multi-threading on a Single Core
Supporting multiple software threads on a single processor core offers the benefits of these traditional approaches without any of the associated disadvantages. While multi-threading is gaining traction in the server and desktop markets, it has not yet been optimized for the embedded market.

Figure 2: Pipelines can stall easily, slowing down applications

One of the main problems with traditional single-threaded processors is that the execution pipeline will stall for many reasons, including cache misses, branch mis-predicts and other pipeline interlocking events (Figure 2, above). The key to obtaining the maximum performance from any processor core is controlling the way the threads are executed in the pipeline (Figure 3, below).

Figure 3: Improving pipeline efficiency with multiple threads

To resolve such problems without resorting to a totally new architecture, the MIPS32 34K core uses a new Virtual Processing Element (VPE) approach, supported by carefully chosen application-specific extensions to the MIPS Instruction Set Architecture. This allows efficient multi-threaded use of the 34K core's nine-stage execution pipeline (Figure 4, below), coupled with a small amount of hardware to handle the virtual processors, the thread contexts and the quality of service (QoS) prioritization.

Figure 4: The 34K's nine-stage pipeline

The VPE hardware
As illustrated in Figure 5 below, each thread has its own dedicated hardware, called the thread context (TC). This allows each thread to have its own instruction buffer with pre-fetching, so that the processor can switch between threads on a cycle-by-cycle basis to keep the pipeline as full as possible. All this avoids the costly overheads of context switching.

Figure 5: 34K processor design

Each TC has its own set of general-purpose registers and a PC (program counter), which allows a TC to run a thread from a complex operating system such as Linux. A TC also shares resources with other TCs, particularly the CP0 registers used by the privileged code in an OS kernel.

The set of shared CP0 registers and the TCs affiliated with them make up a VPE (Virtual Processing Element). A VPE running one thread (i.e. with one TC) looks exactly like an independent MIPS CPU, and is fully compliant with the MIPS32 architecture specification.

All threads (in either VPE) share the same caches, so cache coherency is inherently maintained. This eliminates the problem in multi-core and multiprocessor systems, where many cycles and additional logic are used to manage the different processors and ensure cache coherency.

Depending on the application requirements, the 34K core can be configured for up to nine TCs that are supported across a maximum of two VPEs. It is this combination of the VPEs with the TCs that provides the most area-efficient and flexible solution. For example, one VPE could be configured to run a complete real-time operating system (RTOS) or a data-plane application for DSPs, while the other could run Linux or a control-plane application.

Alternatively, the processor could also be configured for VSMP (virtual symmetric multiprocessing), offering significantly higher application throughput with only a very small increase in die area. In both of these scenarios, a single 34K core replaces multiple discrete processors.

Quality of Service (QoS)
The QoS engine picks instructions from runnable threads using a weighted round-robin algorithm, interleaving instructions on a cycle-by-cycle basis. For maximum overall application throughput, the processing bandwidth is shared finely between the different threads, so that each thread can exploit the processing “gaps” left by the others.

Alternatively, it can also achieve QoS for real-time tasks such as communications, video and audio processing by allocating dedicated processing bandwidth to specific threads.

The QoS is handled with a hierarchical approach (see Figure 6, below), where the user can program various levels of processing bandwidth for the available TCs. Based on this allocated bandwidth, the integrated Policy Manager assigns priorities to the individual TCs, constantly monitors the progress of the threads, and provides critical “hints” to the Dispatch Scheduler as needed. The Dispatch Scheduler, in turn, schedules the threads to the execution unit on a cycle-by-cycle basis, ensuring that the QoS requirements are met.

Figure 6: QoS hierarchy
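The internals of the 34K's Policy Manager and Dispatch Scheduler are hardware blocks not detailed here, but the basic scheduling idea is easy to model in software. The C sketch below implements a weighted round-robin pick over a set of thread contexts; the structure, credit scheme and weights are illustrative assumptions, not the 34K's actual logic.

/* Sketch of weighted round-robin thread selection, loosely modeling
 * how a dispatch scheduler might interleave runnable TCs on a
 * cycle-by-cycle basis. All names, the credit scheme and the weights
 * are illustrative assumptions, not the 34K's actual Policy Manager
 * or Dispatch Scheduler logic. */
#include <stdio.h>

#define NUM_TCS 4

struct tc {
    int weight;   /* allocated share of issue slots per round   */
    int credit;   /* issue slots remaining in the current round */
    int runnable; /* 0 if blocked, e.g. waiting on a cache miss */
};

/* Pick the TC to issue from this cycle; -1 means nothing runnable. */
static int pick_tc(struct tc tcs[], int n)
{
    for (;;) {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (tcs[i].runnable && tcs[i].credit > 0 &&
                (best < 0 || tcs[i].credit > tcs[best].credit))
                best = i;
        if (best >= 0) {
            tcs[best].credit--;          /* consume one issue slot */
            return best;
        }
        /* Round exhausted: recharge every TC from its weight. */
        int any_runnable = 0;
        for (int i = 0; i < n; i++) {
            tcs[i].credit = tcs[i].weight;
            any_runnable |= tcs[i].runnable;
        }
        if (!any_runnable)
            return -1;                   /* all TCs blocked */
    }
}

int main(void)
{
    /* TC0 gets half the bandwidth, e.g. for a real-time task. */
    struct tc tcs[NUM_TCS] = { {4,0,1}, {2,0,1}, {1,0,1}, {1,0,1} };

    for (int cycle = 0; cycle < 16; cycle++)
        printf("cycle %2d: issue from TC%d\n", cycle, pick_tc(tcs, NUM_TCS));
    return 0;
}

Over each eight-slot round, TC0 receives four issue slots and the remaining TCs share the rest in proportion to their weights, which is the essential guarantee a bandwidth-based QoS scheme provides.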

Why the new approach to multi-threading?
As processor operating frequency increases, it becomes increasingly difficult to hide latencies inherent in the operation of a computer system. A high-end synthesizable core taking 25 cache misses per thousand instructions (a plausible value for “multimedia” code) could be stalled more than 50% of the time if it has to wait 50 cycles for a cache fill.
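The arithmetic behind that claim is easy to verify. The short C program below simply restates the example's assumptions: an ideal one-instruction-per-cycle baseline, 25 misses per thousand instructions, and a 50-cycle fill penalty.

#include <stdio.h>

int main(void)
{
    /* Assumptions taken from the example above. */
    double misses_per_kinst = 25.0;   /* cache misses per 1000 instructions */
    double fill_cycles      = 50.0;   /* cycles to service each miss        */
    double base_cycles      = 1000.0; /* ideal 1-IPC cycles per 1000 insts  */

    double stall_cycles = misses_per_kinst * fill_cycles;  /* 1250 */
    double total_cycles = base_cycles + stall_cycles;      /* 2250 */

    printf("stalled %.0f%% of the time\n",
           100.0 * stall_cycles / total_cycles);  /* prints ~56% */
    return 0;
}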

More generally, individual computer instructions have specific semantics, such that different classes of instructions require different resources to perform the desired operation. Integer loads don't exploit the logic or registers of a floating-point unit, any more than register shifts require the resources of a load/store unit.

No single instruction consumes all of a computer's resources, and the proportion of the total system resources that is used by the average instruction diminishes as one adds more pipeline stages and parallel functional units to high-performance designs.

Multi-threading arises in large measure from the notion that, if a single sequential program is fundamentally unable to make fully efficient use of a processor's resources, the processor should be able to share some of those resources among multiple concurrent threads of program execution.

The result does not necessarily make any particular program execute more quickly – indeed, some multi-threading schemes actually degrade the performance of a single thread of program execution – but it allows a collection of concurrent instruction streams to run in less time and/or on a smaller number of processors.

Multi-threading can provide benefits beyond improved multitasking throughput, however. Binding program threads to critical events can reduce event response time, and thread-level parallelism can, in principle, be exploited within a single application program to improve absolute performance.

Varieties of Multi-threading
A number of implementation models for multi-threading have been proposed, some of which have been implemented commercially.

Interleaved multi-threading is a TDM-style approach which switches from one thread to another on each instruction issued. Interleaved multi-threading assures some degree of “fairness” in scheduling threads, but implementations which do static allocation of issue slots to threads generally limit the performance of a single program thread. Dynamic interleaving ameliorates this problem, but is more complex to implement.

The diagram in Figure 7 below shows how instructions from program threads “A” and “B” might be issued. In the classical scalar RISC case, we see four consecutive instructions from program A being issued, with one wasted cycle due to a pipeline stall.

Figure 7: Interleaved Multithreading

A statically interleaved scheme which alternates between two threads might only be able to issue three instructions in the same amount of time if A is the only thread available to run, as shown in the left-hand column. But if A and B are both runnable, as shown in the right-hand column, the alternation between the two fills the pipeline and hides the stall in A. Using a dynamic interleaving scheme allows a single thread to run without degradation relative to the scalar RISC case, while still achieving better efficiency if multiple threads can be run.
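To make the static-versus-dynamic distinction concrete, here is a toy issue-slot model in C. The per-cycle “runnable” traces, the placement of the stall, and the eight-cycle window are invented for illustration and do not correspond to the exact sequences in Figure 7.

/* Toy issue-slot model contrasting static and dynamic interleaving.
 * A '1' in a trace means that thread could issue an instruction on
 * that cycle; the one-cycle stall in thread A at cycle 3 is invented. */
#include <stdio.h>

#define CYCLES 8

static const int trace_a[CYCLES] = { 1, 1, 1, 0, 1, 1, 1, 1 };
static const int trace_b[CYCLES] = { 1, 1, 1, 1, 1, 1, 1, 1 };

static void run(const char *name, int have_b, int dynamic)
{
    int issued = 0;
    printf("%-18s ", name);
    for (int c = 0; c < CYCLES; c++) {
        char sel = '-';                      /* '-' = wasted cycle */
        if (dynamic) {
            /* Dynamic: issue from any runnable thread, A first. */
            if (trace_a[c])                sel = 'A';
            else if (have_b && trace_b[c]) sel = 'B';
        } else {
            /* Static: even cycles belong to A, odd cycles to B. */
            if ((c & 1) == 0 && trace_a[c])           sel = 'A';
            if ((c & 1) == 1 && have_b && trace_b[c]) sel = 'B';
        }
        putchar(sel);
        issued += (sel != '-');
    }
    printf("  %d instructions in %d cycles\n", issued, CYCLES);
}

int main(void)
{
    run("static, A only",   0, 0);  /* half the slots wasted on absent B */
    run("static, A and B",  1, 0);  /* alternation hides the stall in A  */
    run("dynamic, A only",  0, 1);  /* no penalty versus a scalar RISC   */
    run("dynamic, A and B", 1, 1);  /* full pipeline, stall hidden       */
    return 0;
}

Running it shows the single-thread penalty of static interleaving (four instructions issued in eight cycles) disappearing under dynamic interleaving (seven, the same as a scalar RISC with one stall bubble), while either scheme fills all eight slots when both threads are runnable.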

Blocked multi-threading issues consecutive instructions from a single program thread until some designated blocking event, such as a cache miss, causes that thread to be suspended and another thread activated.

The diagram in Figure 8 below shows how instructions from program threads “A” and “B” might be issued in a blocked multi-threading system, relative to a scalar RISC. The scalar RISC stalls for as long as it takes to perform the cache refill, shown here as a wildly optimistic three cycles.

Figure 8: Blocked Multi-threading

A blocked multi-threading design might behave identically to the scalar RISC if there is no other thread to run, but given two threads, the blocked multi-threading processor switches from thread A to thread B as soon as thread A encounters the major stall. Note that the thread switch may not be instantaneous, and that while thread A is runnable on the last cycle shown, thread B retains the processor, since it has not yet been blocked.

Because blocked multi-threading changes threads less frequently, its implementation can be simplified. On the other hand, it is less “fair” in scheduling threads. A single thread can monopolize the processor for a long time if it is lucky enough to find all of its data in the cache. Hybrid scheduling schemes combining elements of blocked and interleaved multi-threading have also been built and studied.

Simultaneous multi-threading is a scheme implemented on superscalar processors wherein instructions from different threads can be issued concurrently. The diagram in Figure 9 below shows a superscalar RISC issuing up to two instructions per cycle, and a simultaneously multi-threaded superscalar pipeline issuing up to two instructions per cycle from either of the two threads.

Those cycles where dependencies or stalls prevented full utilization of the processor by a single program thread are filled by issuing instructions for another thread.

Simultaneous multi-threading is thus a very powerful technique for recovering lost efficiency in superscalar pipelines. It is also the most complex multi-threading system to implement. More than one thread may be active at a given pipeline stage on a given cycle, complicating the implementation of memory access protection, etc.

Figure 9: Simultaneous Multithreading

Multi-threading versus multicore/multiprocessors
Multi-threading and multiprocessing are closely related. Indeed, one could argue that the difference is one of degree: whereas multiprocessors share only memory and/or connectivity, multi-threaded processors share those, but also share instruction fetch and issue logic, and potentially other processor resources.

In a single multi-threaded processor, the various threads compete for issue slots and other resources, which limits parallelism. Some “multi-threaded” programming and architectural models assume that new threads are assigned to distinct processors, to execute fully in parallel.

In very-high-end processors like the Intel Pentium 4, the throughput improvement from the limited parallelism provided by a multi-threaded processor seems to be quite good relative to the incremental silicon cost. Figures like 65% more throughput in return for 5% more silicon have been claimed. It must be understood, however, that the silicon cost of multi-threading can be much higher as a percentage of total area in a small embedded core, relative to a Pentium 4-class processor.

In the light of all this, the approach used in the MIPS MT ASE strives to provide a framework both for the management of parallel threads on the same CPU and for the management of parallel threads across multiple cores, and indeed for the migration of threads from one multi-threaded processor to another.

The Virtual Multiprocessor Approach
Mainstream operating systems technology understands symmetric multiprocessing (SMP) reasonably well. Several Microsoft operating systems support SMP platforms, as does Linux. “Multi-threaded” applications exist which exploit the parallelism of such platforms, using “heavyweight” threads provided by the operating system.

The virtual multiprocessor (VMP) model of the proposed MIPS multi-threading ASE is designed to provide maximum leverage to this technology. A multi-threaded processor, configured as two single-threaded VPEs, is indistinguishable to applications software from a 2-way SMP multiprocessor. The operating system would have no need to use any of the new instructions or privileged resources defined by the ASE. Only in the case of a dynamically configurable VMP would logic need to be added to the boot code to set up the various VPEs.

Each MIPS MT TC has its own interrupt “exemption” bit and its own MMU address space identifier (ASID), which allows operating systems to be modified or written to use a “symmetric multi-TC” (SMTC) model, wherein each TC is treated as an independent “processor”. Because multiple TCs may share the privileged resources of a single VPE, an SMTC operating system requires additional logic and complexity to coordinate the use of the shared resources, relative to a standard MIPS32/MIPS64 OS, but the SMTC model allows SMP-like concurrency up to the limit of available TCs.

While the default configuration of multiple VPEs in a MIPS MT processor provides each VPE with an independently programmable MMU, such that legacy SMP memory management code will work correctly, it is possible for software to configure the processor to share MMU TLB resources across all VPEs. This requires a level of coordination between “CPUs” (really TCs or VPEs) that is not present in legacy SMP operating systems, but allows an advanced SMP/SMTC operating system to achieve a more favorable TLB miss rate.

Master/Slave VPEs: How they accomplish their tasks
One or more VPEs on a processor may power up as “master” VPEs, indicated by the MVP field of the VPEConf0 register. A master VPE can access the registers of other VPEs by using MTTR/MFTR instructions, and can, via a DVPE instruction, suspend all other VPEs in a processor. (More on these instructions and registers later, in Part 2 of this series.)

This Master/Slave model allows a multi-tasking master “application processor” VPE running an operating system such as Linux to dispatch real-time processing tasks on another VPE on behalf of various applications. While this could be done using an SMP paradigm, handing work off from one OS to another, MIPS MT also allows this to be done more directly.

A master VPE can take control of another VPE of the same processor at any time. This is enabled through the use of a special DVPE instruction, which suspends all threads of a core except the one that executed the instruction.

Once a DVPE instruction has been issued by the master VPE, the slave VPE's CP0 privileged resource state can be set up as needed using MTTR instructions targeting TCs that are bound to the slave VPE. The necessary instructions and data can then be placed in memory visible to the slave VPE, and one or more TCs of the slave VPE can be initialized using MTTR instructions to set up their TCRestart addresses (and indeed their GPR register values, if appropriate). Finally, the slave VPE can be dispatched to begin execution on the configured TCs by the master VPE executing an EVPE instruction.
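That hand-off sequence maps naturally onto a short C sketch. The helper functions below are hypothetical wrappers around the DVPE, MTTR and EVPE instructions; they are not part of any MIPS-supplied API, and the task and argument-passing details are illustrative.

/* Hypothetical sketch of the master/slave dispatch sequence described
 * above. Each helper is an assumed wrapper around the corresponding
 * MIPS MT instruction; none of these functions is a real MIPS API. */

extern unsigned dvpe(void);               /* DVPE: suspend other threads */
extern void     evpe(unsigned prev);      /* EVPE: re-enable them        */
extern void     mttr_set_cp0(int tc, int reg, int sel, unsigned val);
extern void     mttr_set_gpr(int tc, int gpr, unsigned val);
extern void     mttr_set_restart_pc(int tc, void (*pc)(void));

void rt_task(void);                       /* code the slave VPE will run */

void dispatch_on_slave(int slave_tc, unsigned arg)
{
    unsigned prev = dvpe();     /* 1. suspend all other threads          */

    /* 2. set up the slave VPE's CP0 privileged state via MTTR on TCs
     *    bound to it, as the application requires (values omitted).    */

    /* 3. rt_task's instructions and data are assumed already placed
     *    in memory visible to the slave VPE.                           */

    /* 4. initialize the slave TC: restart address, plus an argument
     *    in $a0 (GPR 4 in the MIPS calling convention).                */
    mttr_set_restart_pc(slave_tc, rt_task);
    mttr_set_gpr(slave_tc, 4, arg);

    evpe(prev);                 /* 5. EVPE: the slave begins executing   */
}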

Threads versus VPEs
Why the distinction between threads and VPEs? Because there are two ways for software to approach multi-threading: one which is easy, but relatively expensive in silicon support and limited in the leverage provided to applications; and another which is more difficult to program, but which provides leverage for finer degrees of parallelism at a lower cost in silicon.

VPE parallelism is equivalent to symmetric multiprocessor (SMP) parallelism. This means that operating systems which know how to deal with SMP system configurations can easily be adapted to run on multi-VPE cores, and that programs already written using SMP multi-threading or multi-tasking can exploit VPE parallelism.

Thread parallelism in the context of the proposed ASE refers to fine-grained, explicitly controlled thread parallelism. This requires new OS/library/compiler support, but takes full advantage of low thread creation and destruction overhead to exploit parallelism at a granularity that would otherwise be impractical. The hardware support requirement for a TC is less than that of a VPE, so more TCs can be instantiated per unit of chip area.

Implementation and Instantiation of Thread Contexts
The “name space” for TCs in MIPS MT is flat, with thread numbers ranging from 0 to 255. This does not mean that 256 TCs need to be implemented, nor that the full implemented complement of architecturally visible threads needs to be instantiated as high-speed, multi-ported register files. Designers may implement a hierarchy of thread storage, provided that the software semantics of the ASE are respected. A typical implementation would be a flat structure of 4-8 TCs.

Figure 10: Thread resources superimposed on thread scheduling state machine

It is important to distinguish between the software-visible state of a TC as defined by the MIPS MT ASE, and the hardware state associated with the selection and scheduling of runnable threads.

As seen by software, a MIPS MT TC may be in either a free or an activated allocation state, and, independent of its allocation state, it may be halted. An activated TC should not be confused with a “running” thread (though a running thread must have an activated context), nor should a halted TC be confused with a “waiting” thread. The diagram in Figure 10, above, shows the TC resource states superimposed on an implementation's thread scheduling state machine.
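Because the allocation and halted states are orthogonal, the prose above is easy to misread; a small C model may make it clearer. The encoding below is invented for illustration; the real ASE exposes these states through fields of privileged TC registers.

#include <stdbool.h>

/* Illustrative model of the software-visible TC states described
 * above; the encoding is invented, not the ASE's register layout. */
struct tc_state {
    bool activated;  /* allocation state: free (false) or activated */
    bool halted;     /* orthogonal to the allocation state          */
};

/* An activated, non-halted TC is merely *eligible* to run. Whether
 * it is actually running or waiting at any instant is decided by
 * the hardware thread-scheduling state machine, not by these bits. */
static bool eligible_to_run(const struct tc_state *tc)
{
    return tc->activated && !tc->halted;
}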

Next in Part 2: The MIPS MT Application Specific Extensions and how to use them.

Kevin D. Kissell is Principal Architect at MIPS Technologies, and has been a part of the MIPS architecture development team since 1997. He was first involved in the architecture and design of RISC microprocessors when he joined the original Fairchild Clipper design team in 1983. In between, Kevin has been variously responsible for processor, systems and software architecture, for decoupled access/execute supercomputers at ACRI, massively parallel distributed memory computers at nCUBE, and large-scale shared-memory supercomputers at Evans & Sutherland. His prior work at MIPS includes having been principal architect of the SmartMIPS extensions for smart cards.

Peter DelVecchio is a Product Marketing Manager at MIPS Technologies, responsible for the MIPS32 24K, 24KE and 34K core families. Peter began his career at Sun Microsystems, where he worked in Engineering on the SuperSPARC and UltraSPARC processors. He then joined CompCore Multimedia, a startup focused on IP licensing for audio and video technology, and worked at Mobilygen Corp. as the Director of Product Marketing.

To learn more about the subject of multicore and multithreaded architectures on Embedded.com, go to More About Multicores, Multiprocessors and Multithreading.
