
Get multicore performance from one core

An SoC with a multithreaded virtual multiprocessor might be just what you're looking for.

System-on-chip (SoC) designers know what it's like to do more with less. They're constantly challenged by ever-increasing constraints on system cost and power consumption while being tasked with increasing the performance and functionality of their designs. The tricks of the trade available to designers are, at best, a set of difficult trade-offs.

For example, some designers ramp up the processor's clock speed, but this approach usually results in higher power consumption. In addition, memory performance hasn't kept pace with processor technology, as Figure 1 illustrates, and this mismatch limits any significant gains in system performance. A multicore system is another option, but this suffers from a larger die area and higher cost. Any performance increase comes at a fairly substantial cost in silicon and system power.


Multiple-issue processors, with two or more execution units, are another option, but they struggle to make best use of hardware resources and have an area penalty. In addition, the software has to be revised in many cases to make best use of the multiple pipelines.

Multithreading on a single core
One problem with traditional single-threaded processors is that the execution pipeline stalls for many reasons, including cache misses, branch mispredictions, and other pipeline interlocking events. The key to obtaining maximum performance from any processor core is controlling how threads are executed in the pipeline.

Supporting multiple software threads on one processor core offers the benefits of these traditional approaches without any of the associated disadvantages. While multithreading is gaining traction in the server and desktop markets, it hasn't yet been optimized for the embedded systems market.

Why multithreading?
As processor operating frequency increases, it becomes increasingly difficult to hide latencies inherent in the operation of a computer system. A high-end synthesizable core taking 25 cache misses per thousand instructions (a plausible value for “multimedia” code) could be stalled more than 50% of the time if it must wait 50 cycles for a cache fill.
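
To see where that figure comes from, consider a minimal back-of-the-envelope model, assuming purely for illustration an ideal rate of one instruction per unstalled cycle and a flat 50-cycle fill penalty:

#include <stdio.h>

int main(void)
{
    const double misses_per_kinst = 25.0;   /* cache misses per 1,000 instructions */
    const double miss_penalty     = 50.0;   /* cycles per cache fill */
    const double busy_cycles      = 1000.0; /* ideal cycles to execute 1,000 instructions */

    double stall_cycles   = misses_per_kinst * miss_penalty;            /* 1,250 cycles */
    double stall_fraction = stall_cycles / (busy_cycles + stall_cycles);

    printf("Stalled %.0f%% of the time\n", stall_fraction * 100.0);     /* prints roughly 56 */
    return 0;
}

With roughly 1,250 stall cycles for every 1,000 useful ones, the pipeline sits idle about 56% of the time, consistent with the "more than 50%" figure above.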

More generally, individual computer instructions have specific semantics, such that different classes of instructions require different resources to perform the desired operation. Integer loads don't exploit the logic or registers of a floating-point unit, any more than register shifts require the resources of a load/store unit. No single instruction consumes all of a computer's resources, and the proportion of the total system resources used by the average instruction diminishes as you add pipeline stages and parallel functional units to high-performance designs.

Multithreading arises in large measure from the notion that, if a single sequential program is fundamentally unable to make fully efficient use of a processor's resources, the processor should be able to share some of those resources among multiple concurrent threads of program execution. The result doesn't necessarily make a particular program execute more quickly, but it allows a collection of concurrent instruction streams to run in less time and on fewer processors, as shown in Figure 2.


Multithreading can provide benefits beyond improved multitasking throughput, however. Binding program threads to critical events can reduce event response time, and thread-level parallelism can, in principle, be exploited within a single application program to improve absolute performance.

Varieties of multithreading
A number of implementation models for multithreading have been proposed, some of which have been implemented commercially. Interleaved multithreading is a time division multiplexing (TDM)-style approach that switches from one thread to another on each instruction issued. This technique assures some degree of “fairness” in scheduling threads, but implementations that do static allocation of issue slots to threads generally limit the performance of a single program thread. Dynamic interleaving ameliorates this problem but is more complex to implement.

Blocked multithreading issues consecutive instructions from one program thread until some designated blocking event, such as a cache miss, causes that thread to be suspended and another thread activated. Because blocked multithreading changes threads less frequently, its implementation can be simplified. On the other hand, it's less “fair” in scheduling threads. One thread can monopolize the processor for a long time if it's lucky enough to find all of its data in the cache.

Hybrid scheduling schemes combining elements of blocked and interleaved multithreading have also been built and studied. Simultaneous multithreading is a scheme implemented on superscalar processors wherein instructions from different threads can be issued concurrently. Simultaneous multithreading is thus a powerful technique for recovering lost efficiency in superscalar pipelines. It's also the most complex multithreading scheme to implement: more than one thread can be active at a given pipeline stage on a given cycle, which complicates mechanisms such as memory access protection.
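
As an illustration only, the interleaved and blocked policies can be contrasted as thread-selection routines. The following C sketch is a simplified software model, not a description of any particular core's issue logic; the names and structures are invented for the example.

enum { NUM_THREADS = 4 };

struct thread_state {
    int runnable;   /* has an instruction ready to issue */
};

/* Interleaved (TDM-style): rotate to a different runnable thread
 * on every issue slot. */
int pick_interleaved(const struct thread_state t[], int last)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int candidate = (last + i) % NUM_THREADS;
        if (t[candidate].runnable)
            return candidate;
    }
    return -1;  /* nothing to issue this cycle */
}

/* Blocked: keep issuing from the current thread until it blocks
 * (for example, on a cache miss), then fall back to the rotating search. */
int pick_blocked(const struct thread_state t[], int current)
{
    if (current >= 0 && t[current].runnable)
        return current;
    return pick_interleaved(t, current < 0 ? 0 : current);
}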

Multithreading versus multicore/multiprocessing
Multithreading and multiprocessing are closely related. One could argue that the difference is one of degree: whereas multiprocessors share only memory or connectivity (or both), multithreaded processors share memory, connectivity, instruction fetch, issue logic, and potentially other processor resources. In a single multithreaded processor, the various threads compete for issue slots and other resources, which limits parallelism. Some multithreaded programming and architectural models assume that new threads are assigned to distinct processors, to execute fully in parallel.

When to implement multithreading
Multithreading makes sense whenever an application with some degree of concurrency is to be run on a processor that would otherwise find itself stalled a significant portion of the time waiting for instructions and operands. This is a function of core frequency, memory technology, and program memory behavior. Well-behaved real-world programs in a typical single-threaded SoC processor/memory environment might be stalled as little as 30% of the time at 500 MHz, but less cache-friendly codes may be stalled a whopping 75% of the time in the same environment. Systems where the processor speeds and memory are so well matched that there's no efficiency loss due to latency won't get any significant bandwidth improvement from multithreading.
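
A coarse way to size the opportunity, assuming independent threads and ignoring effects such as shared-cache contention, is to treat multithreaded utilization as the single-thread utilization multiplied by the thread count, capped at 100%. This is an illustrative model, not a prediction for any particular core:

#include <stdio.h>

/* Coarse model: n threads, each busy a fraction u of the time in isolation,
 * can keep the pipeline busy at most min(1, n * u) of the time. */
static double utilization(int n_threads, double single_thread_util)
{
    double u = n_threads * single_thread_util;
    return (u > 1.0) ? 1.0 : u;
}

int main(void)
{
    /* Cache-unfriendly case: 75% stalled, so each thread is busy 25% of the time. */
    printf("4 threads at 25%% each: %.0f%%\n", 100.0 * utilization(4, 0.25));  /* 100% */

    /* Well-behaved case: only 30% stalled, so there is little left to recover. */
    printf("2 threads at 70%% each: %.0f%%\n", 100.0 * utilization(2, 0.70));  /* capped at 100% */
    return 0;
}

In the first case, four threads can in principle soak up nearly all of the lost bandwidth; in the second, a single extra thread already saturates the pipeline, which is why well-matched systems see little benefit.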

Beyond multicore
The additional resources of a multithreaded processor can be used for things other than simply recovering lost bandwidth, if the multithreading architecture provides for it. A multithreaded processor can thus have capabilities that have no equivalent in a multicore system based on conventional processors. For example, in a conventional processor, when an external interrupt event needs to be serviced, the processor takes an interrupt exception, where instruction fetch and execution suddenly restart at an exception vector. The interrupt vector code must save the current program state before invoking the interrupt service code and must restore the program context before returning from the exception.

A multithreaded processor, by definition, can switch between two program contexts in hardware, without the need for decoding an exception or saving/restoring state in software. A multithreaded architecture targeted for real-time applications can potentially exploit this and allow for execution threads to be suspended, then unblocked directly by external signals to the core, providing for zero-latency handling of interrupt events.
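
The following sketch shows the shape of such an event-handling thread. It is conceptual only: mt_yield(), UART_RX_SIGNAL, uart_read_byte(), and enqueue_byte() are hypothetical names standing in for a "suspend this thread until these external signals assert" primitive and a simple device driver, not a documented API.

#define UART_RX_SIGNAL  (1u << 3)   /* hypothetical external signal bit */

extern unsigned mt_yield(unsigned signal_mask);  /* hypothetical: blocks this thread */
extern int  uart_read_byte(void);                /* hypothetical device access       */
extern void enqueue_byte(int b);

/* Runs in its own thread context: no exception vector is taken, and no
 * software save/restore of the interrupted program's registers is needed. */
void uart_service_thread(void)
{
    for (;;) {
        mt_yield(UART_RX_SIGNAL);       /* sleep until the external signal fires */
        enqueue_byte(uart_read_byte()); /* handle the event directly             */
    }
}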

One of MIPS' latest processor cores employs a virtual processing element (VPE), supported by application-specific extensions to the instruction set. The VPE enables efficient multithreaded use of the core's execution pipeline, coupled with a small amount of hardware to handle the virtual processors, the thread contexts, and quality-of-service (QoS) prioritization. Each thread has its own dedicated hardware, called a thread context (TC). This allows each thread to have its own instruction buffer with prefetching, so the processor can switch between threads on a cycle-by-cycle basis to keep the pipeline full. All this avoids the costly overhead of software context switching.

Each TC has its own set of general-purpose registers and a program counter, which lets a TC run a thread from a complex operating system such as Linux. A TC also shares resources with other TCs, particularly the CP0 registers used by privileged code in an operating system kernel. The set of shared CP0 registers and the TCs affiliated with them make up a VPE. A VPE running one thread (that is, one with a single TC) looks exactly like an independent CPU.
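
A simplified data-structure view of that split between private and shared state might look like the following. Field names and sizes are illustrative; this is not the hardware register map.

#include <stdint.h>

#define MAX_TCS 9   /* the core described here supports up to nine TCs */

struct thread_context {            /* per-TC: private to one thread          */
    uint32_t gpr[32];              /* its own general-purpose registers      */
    uint32_t pc;                   /* its own program counter                */
};

struct vpe {                       /* per-VPE: shared by the TCs bound to it */
    uint32_t cp0[32];              /* privileged CP0 state seen by the OS    */
    struct thread_context *tc[MAX_TCS];
    int num_tcs;
};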

All threads (in either VPE) share the same caches, so cache coherency is inherently maintained. This eliminates a problem of multicore and multiprocessor systems, where many cycles and additional logic are spent managing the different processors and keeping their caches coherent.

Depending on the application requirements, the processor core can be configured with up to nine TCs supported across a maximum of two VPEs. It's this combination of VPEs and TCs that provides the most area-efficient and flexible solution. For example, one VPE could be configured to run a complete real-time operating system or a data-plane DSP application, while the other runs Linux or a control-plane application. Alternatively, the processor could be configured for VSMP (virtual symmetric multiprocessing), offering significantly higher application throughput with only a very small increase in die area. In both of these scenarios, one processor core replaces multiple discrete processors.

Quality of service
The QoS engine picks instructions from runnable threads using a weighted round-robin algorithm, interleaving instructions on a cycle-by-cycle basis. For maximum overall application throughput, processing bandwidth is shared finely among the threads, so each thread fills the processing "gaps" left by the others. Alternatively, it can achieve QoS for real-time tasks such as communications, video, and audio processing by allocating dedicated processing bandwidth to specific threads.

QoS is handled with a hierarchical approach in which the user can program various processing-bandwidth levels for the available TCs. Based on this allocated bandwidth, the integrated Policy Manager assigns priorities to the individual TCs, constantly monitors the threads' progress, and provides "hints" to the Dispatch Scheduler as needed. The scheduler, in turn, schedules the threads onto the execution unit on a cycle-by-cycle basis, ensuring that the QoS requirements are met.
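
In software terms, a weighted round-robin pick over runnable thread contexts can be modeled as below. This is an illustrative model of the policy only, not the actual Policy Manager or Dispatch Scheduler logic.

enum { NUM_TCS = 4 };

struct tc_sched {
    int weight;   /* programmed share of issue slots */
    int credit;   /* slots remaining in this round   */
    int runnable; /* has an instruction ready        */
};

/* Pick the TC to issue from this cycle; each TC gets issue slots
 * in proportion to its programmed weight. */
int pick_next_tc(struct tc_sched tc[])
{
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < NUM_TCS; i++) {
            if (tc[i].runnable && tc[i].credit > 0) {
                tc[i].credit--;
                return i;
            }
        }
        /* Everyone is out of credit (or blocked): start a new round. */
        for (int i = 0; i < NUM_TCS; i++)
            tc[i].credit = tc[i].weight;
    }
    return -1;  /* no runnable thread this cycle */
}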

Virtual multiprocessors
Mainstream operating-systems technology understands symmetric multiprocessing (SMP) reasonably well. Linux and several Microsoft operating systems support SMP platforms. “Multithreaded” applications exist that exploit the parallelism of such platforms, using “heavyweight” threads provided by the operating system.

Threads versus VPEs
A distinction is made between threads and VPEs because there are two ways for software to approach multithreading. One way is easy but relatively expensive in silicon support and limited in the leverage provided to applications. The other way is more difficult to program but provides leverage for finer degrees of parallelism at a lower cost in silicon.

VPE parallelism is equivalent to SMP parallelism. This means that operating systems that know how to deal with SMP system configurations can easily be adapted to run multi-VPE cores, and that programs already written using SMP multithreading or multitasking can exploit VPE parallelism.

Thread parallelism in the context of the proposed application-specific extension (ASE) refers to fine-grained, explicitly controlled thread parallelism. This requires new operating system/library/compiler support, but takes full advantage of low thread creation and destruction overhead to exploit parallelism at a granularity that would otherwise be impractical. The hardware support requirement for a TC is less than that of a VPE, so more TCs can be instantiated per unit of chip area.
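
A sketch of what that fine-grained style could look like from C follows. Here mt_fork() and mt_join() are hypothetical wrappers around a low-overhead "allocate a free TC and start it at this function" primitive; they are not a documented API.

struct work { const int *data; int n; long result; };

extern int  mt_fork(void (*entry)(void *), void *arg);  /* hypothetical: returns TC id or -1 */
extern void mt_join(int tc);                            /* hypothetical: wait for that TC    */

static void half_sum(void *arg)
{
    struct work *w = arg;
    long s = 0;
    for (int i = 0; i < w->n; i++)
        s += w->data[i];
    w->result = s;
}

/* Split a small loop across two hardware thread contexts. */
long parallel_sum(const int *data, int n)
{
    struct work lo = { data, n / 2, 0 };
    struct work hi = { data + n / 2, n - n / 2, 0 };

    int tc = mt_fork(half_sum, &lo);   /* runs on another TC if one is free */
    half_sum(&hi);                     /* do the other half on this TC      */
    if (tc >= 0)
        mt_join(tc);
    else
        half_sum(&lo);                 /* no free TC: fall back to serial   */
    return lo.result + hi.result;
}

With heavyweight operating-system threads, the cost of creating and destroying a thread would swamp a loop this small; low thread creation and destruction overhead is what makes this granularity worthwhile.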

The “name space” for TCs in a MIPS core is flat, with thread numbers ranging from 0 to 255. This doesn't mean that 256 TCs need to be implemented, nor that the full implemented complement of architecturally visible threads needs to be instantiated as high-speed, multiported register files. Designers can implement a hierarchy of thread storage, provided that the software semantics of the ASE are respected. A typical implementation would be a flat structure of four to eight TCs.

Thread context state versus thread scheduling state
It's important to distinguish between a TC's software-visible state, as defined by the ASE, and the hardware state associated with the selection and scheduling of runnable threads. As seen by software, a TC may be in either a free or an activated allocation state and, independently of its allocation state, it may be halted. An activated TC shouldn't be confused with a “running” thread, though a running thread must have an activated context. Likewise, a halted TC shouldn't be confused with a “waiting” thread.
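
One way to keep the two attributes straight is to model them as independent fields, as in this illustrative, non-architectural sketch:

enum tc_allocation { TC_FREE, TC_ACTIVATED };

struct tc_status {
    enum tc_allocation alloc;  /* free vs. activated              */
    int halted;                /* orthogonal to allocation state  */
};

/* A TC is eligible for scheduling only if it is activated and not halted.
 * Even then, its thread may be blocked waiting on an event rather than
 * actually running. */
static int tc_schedulable(const struct tc_status *s)
{
    return s->alloc == TC_ACTIVATED && !s->halted;
}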

Kevin D. Kissell, principal architect at MIPS Technologies, has participated on the MIPS architecture development team since 1997. He holds a degree in computer science from the University of California at Berkeley.

Peter Del Vecchio is a product marketing manager at MIPS Technologies, responsible for the MIPS32 24K, 24KE, and 34K core families. He has a BSEE and an MSEE from Cornell University and has worked in the semiconductor industry for more than 17 years.
