Effective use of RTOS programming concepts to support advanced multithreaded architectures

A huge amount of attention is focused on multi-threading architectures such as MIPS Technologies, Inc.'s new MIPS32 34K cores, because such architectures offer the potential for substantial performance gains at little cost in chip real estate or power consumption.
The key advantage of hardware multi-threading is that it uses cycles to process instructions from other threads when the processor would otherwise sit idle while waiting for cache refill.
The cost of adapting consumer device applications for multi-threading is usually small, because most of them are already designed as a set of semi-independent threads. Application threads can be assigned to hardware resources dedicated by the processor to handling an individual thread. Multiple threads can be assigned to such hardware, and share use of CPU cycles to achieve maximum efficiency.
In this article, we will look at how multi-threading architectures work, and how they can enable additional performance at minimal cost. Then we will show how existing applications easily can be converted to run on a multi-threading processor, and how new applications can be written for this powerful new type of architecture.
Embedded computing facing performance limits
Manufacturers of consumer devices and other embedded computing products are adding new features such as WiFi, VoIP, Bluetooth, video, etc. at a rapid rate. Historically, increased feature sets have been accommodated by ramping up the clock speed of the processor. Clock speeds have already reached the 3+ GHz level in desktops and hundreds of MHz even in embedded applications.
In the embedded space, this approach rapidly loses viability because most devices are already running up against power consumption and real estate constraints that limit additional processor speed increases. Cycle speed increases drive dramatically greater power consumption, making high cycle speeds unmanageable for more and more embedded applications.
|Figure 1: Processor speeds outpacing memory.|
In addition, further processor speed improvements yield diminishing performance gains because memory performance has not kept pace with processor speed as shown in Figure 1, above.
Processors are already so much faster than memory that more than half the cycles in many applications are spent waiting while the cache line is refilled. Each time there is a cache miss or another condition that requires off-chip memory access, the processor needs to load a cache line from memory, write those words into the cache, write the old cache line into memory, and resume the thread.
MIPS Technologies Inc. has stated that a high-end synthesizable core taking 25 cache misses per thousand instructions (a plausible value for multimedia code) could be stalled more than 50% of the time if it has to wait 50 cycles for a cache fill. And since processor speed is continuing to advance at a faster rate than memory speed, the problem is growing worse.
The multi-threading alternative
Multi-threading solves this problem by using the cycles the processor would otherwise waste waiting for memory to execute instructions from other concurrent threads. When one thread stalls waiting for memory, another thread immediately presents itself to the processor to keep computing resources fully occupied.
Notably, conventional processors cannot employ this approach because it takes a large number of cycles to switch context from one thread to another. Multiple application threads must be immediately available and "ready-to-run" on a cycle-by-cycle basis for this approach to work.
The 34K processor is the first such multi-threading product from a major supplier of embedded processors targeting the consumer device market. Each software thread runs on a TC, or Thread Context. A TC includes a complete set of general-purpose registers and a program counter (that's where the name comes from: in OS terminology, the vital registers and PC of a thread make up the "thread context").
Each TC has its own instruction prefetch queue, and the queues are kept full independently. That means that the core can switch between the threads on a cycle-by-cycle basis to avoid the overhead involved in software context switching. Adding more TCs requires very little additional silicon. TCs share most of the CPU hardware including the execution unit, ALU, and caches. Moreover, adding a TC does not require the CPU to have another copy of the CP0 registers by which OS software runs the CPU.
A set of shared CP0 registers and the TCs affiliated with them make up a Virtual Processing Element (VPE). A TC runs a thread and a VPE hosts an OS: if you have two VPEs you can either have two independent Operating Systems or one SMP-style OS. A VPE with one TC looks exactly like a traditional MIPS32 architecture CPU and is fully compliant with the MIPS architecture specification: it's a complete virtual processor.
The 34K core can be built with up to nine TCs and two VPEs. The affiliation of TCs to VPEs is determined at run-time. By default all TCs which are ready for execution get a fair share of processing time, but the 34K also includes handles for an application to influence thread scheduling when some particularly demanding thread might otherwise get starved. That is, software can control the Quality of Service (QoS) for each thread. Application software interacts with a hardware Policy Manager which assigns dynamically-changing priorities to individual TCs. A hardware Dispatch Scheduler then assigns threads to the execution unit on a cycle-by-cycle basis to meet the QoS requirements.
|Figure 2: Multi-threading improves pipeline efficiency.|
In a multi-threaded environment such as the 34K, performance can be substantially improved because whenever one thread waits for memory access, other threads use the processor cycles that would otherwise be wasted. Figure 2 above shows how multi-threading can speed up an application. With just Thread0 running, only 5 out of 13 processor cycles are used for instruction execution and the rest are spent waiting for the cache line to refill. The efficiency in this case using conventional processing is only 38%.
Adding Thread1 makes it possible to use 5 additional processor cycles previously wasted waiting. With 10 out of 13 processor cycles now used, efficiency improves to 77%, providing a 100% speedup over the base case. Adding Thread2 makes it possible to fully load the processor, executing instructions on 13 out of 13 cycles for 100% efficiency. This represents a 160% speedup compared to the base case.
Using EEMBC performance benchmarks, the 34K core shows a 60% application speedup over the 24KE family using just two threads, with only a 14% increase in die size, as shown in Figure 3 below.
|Figure 3: EEMBC benchmark performance examples shows 60% improvement with just two threads.|
Adapting software to multi-threading
A key advantage of the multi-threading approach is that in most cases it will run existing software with relatively little modification. Most consumer device applications are already written as a series of semi-independent threads. Each thread can automatically or manually be assigned to a particular hardware TC.
If the currently executing thread cannot continue because of a delay caused by a cache miss or other circumstance, CPU execution switches from that TC to one whose thread is ready to run without any wasted cycles. The more threads in the application, the greater the potential to use cycles that would otherwise be wasted waiting for memory access.
Multi-threading is ideal for anyone using or considering using an RTOS since RTOS applications are inherently multi-threaded. There is no need to rewrite an RTOS application for multi-threading because the RTOS can map the application threads to the TCs automatically under program control in the same way that it would map the threads to a conventional processor.
|Figure 4: Mapping threads to TCs.|
If there are more threads than TCs, a conventional context switch is usually required. These context switches occur just as in a conventional processor. The RTOS saves the state for the current task, loads the context for another task, and begins execution. The multi-threading environment obviously has the potential for more context switches than a conventional processor so the speed at which context switching can be accomplished becomes more significant.
Linux/Windows vs RTOSes for multi-threading
This highlights an advantage of a fast RTOS relative to OSes such as Linux and the embedded versions of Windows. Typical real-time performance for Linux is generally in the range of a few hundred microseconds to a few milliseconds. But worst case Linux real-time performance is unbounded. On the other hand, a fast RTOS provides deterministic real-time performance in the range of about 1 to 2 microseconds on a single-threaded processor and even faster on a multi-threaded processor.
The RTOS assigns non-duplicated hardware resources to specific TCs. By convention, the single floating point unit (FPU) is always assigned to TC0. Any thread that performs hardware floating point operations must therefore be mapped to TC0, and all such threads must share TC0. This results in some interesting programming choices, particularly the choice of whether to implement floating point operations in hardware or software.
Doing floating point in hardware is obviously faster, but on the other hand requires sharing the FPU. If threads only do a small amount of floating point work, then it may make sense to implement their floating point operations in software, while threads that do intensive floating point work would normally use the hardware FPU and be mapped to TC0. It's important to note that this choice does not require recoding, because the decision of whether to use hardware or software floating point can be implemented with a compiler switch.
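With a GCC-based MIPS cross toolchain, the choice can typically be made per source file with the `-mhard-float` and `-msoft-float` flags. A build-fragment sketch (the toolchain prefix and file names are illustrative):

```shell
# Thread doing heavy FP work: compile for the hardware FPU.
# Such a thread must then be mapped to TC0, which owns the FPU.
mips-elf-gcc -mhard-float -c dsp_thread.c

# Thread with only occasional FP: software floating point,
# so it can run on any TC.
mips-elf-gcc -msoft-float -c ui_thread.c
```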
Assigning weights to threads
If the application doesn't specify weights for the various threads, the application scheduler will assign equal weights to all. On the other hand, time slicing can be used so that threads share CPU cycles according to user-specified weights. Assigning weights is equivalent to assigning a proportion of CPU cycles to particular threads. Thread weights are mapped to the hardware TC transparently by the RTOS.
Some existing applications may be designed for conventional processors under the assumption that a lower priority thread will not be allowed to run while a higher priority thread is "ready". In embedded programming, ready means that all conditions necessary for the thread to run have been met and the only thing preventing it from running is its priority.
Multi-threading runs the risk of violating this condition since lower priority threads are able to run whenever higher priority threads are stalled. Writing code to eliminate the need for this condition will optimize performance.
On the other hand, existing code written based on this condition can be run unmodified on multi-threading processors simply by setting an operating system switch that only allows threads of the same priority to be loaded in TCs at the same time. When setting this switch, be sure to assign threads that can run in parallel to the same priority level whenever possible.
Interrupts are critical in a conventional embedded application because they provide the primary and in many cases the only means for switching from one thread to another. Interrupts fulfill exactly the same role in multi-threading applications as they do in a conventional application. But there is an important difference: in a multi-threaded application, changes from one thread to another occur not only through interrupts but also whenever the hardware switches TCs to use spare CPU cycles.
It's important to avoid the situation where one thread is interrupted while modifying a critical data structure, enabling a different thread to make other changes to the same structure. This could result in the data structure being left in an inconsistent state, with potentially catastrophic results.
Most conventional applications overcome this problem by briefly locking out interrupts while an ISR or system service is modifying crucial data structures inside the RTOS. This reliably prevents any other program from jumping in and making uncoordinated changes to the critical area being used by the executing code.
This approach, however, is not sufficient in a multi-threaded application because of the potential for a switch to a different TC that is not impeded by the interrupt lockout and therefore might operate on the critical area. This can be overcome by using the DMT instruction in the 34K architecture to disable multi-threading while the data structure is being modified.
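A sketch of what such a critical section might look like, using the MT ASE's dmt/emt instructions via inline assembly. The function names are illustrative, the assembly is a sketch rather than a production RTOS port, and the non-MIPS stubs exist only so the logic can be exercised on a host machine:

```c
#if defined(__mips__)
/* Disable hardware multi-threading: no other TC runs past this point. */
static inline void mt_enter(void) {
    __asm__ __volatile__(".set push; .set mt; dmt; ehb; .set pop" ::: "memory");
}
/* Re-enable hardware multi-threading. */
static inline void mt_exit(void) {
    __asm__ __volatile__(".set push; .set mt; emt; ehb; .set pop" ::: "memory");
}
#else
static inline void mt_enter(void) { }  /* host-build stub */
static inline void mt_exit(void)  { }  /* host-build stub */
#endif

static volatile int shared_count;

void bump_shared_count(void) {
    /* A real RTOS port would also mask interrupts around this section. */
    mt_enter();           /* other TCs cannot touch the shared data now */
    shared_count++;       /* safe read-modify-write of the shared value */
    mt_exit();            /* thread switching resumes */
}
```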
With these relatively simple exceptions, application code can run unchanged when moving from conventional to multi-threaded applications. This makes it easy and inexpensive to take advantage of the ability of multi-threading to use the CPU cycles that are often wasted by conventional RISC processors. Multi-threading meets the demands of today and tomorrow for consumer, networking, storage and industrial device applications for high performance with only minor increases in cost and power consumption.
Compared to its primary competitor, the multicore approach, multi-threading requires a smaller die area and consumes less power while also being simple to program and requiring few or no changes to existing applications. Multicore approaches have their own advantages and strengths, and there is no reason the two approaches cannot be combined for "the best of both worlds." In applications that require high performance, low cost, and minimum power consumption, multi-threading is a compelling approach.
John A. Carbone, vice president of marketing for Express Logic, has 35 years experience in real-time computer systems and software, ranging from embedded system developer and FAE to vice president of sales and marketing. Prior to joining Express Logic, Mr. Carbone was vice president of marketing for Green Hills Software.
To read more about this topic go to "More about Multithreading and Multicores."