Effective use of RTOS programming concepts to support advanced multithreaded architectures
How your RTOS can make more effective use of new advanced multithreading architectures
By John A. Carbone
A huge amount of attention is focused on multi-threading architectures
such as MIPS Technologies, Inc.'s new
MIPS32 34K cores because such
architectures offer the potential for substantial performance gains at
little cost in chip real estate or power consumption.
The key advantage of hardware multi-threading is that it uses the
cycles in which the processor would otherwise sit idle, waiting for a
cache refill, to execute instructions from other threads.
The cost of adapting consumer device applications for
multi-threading is usually small, because most of them are already
designed as a set of semi-independent threads. Application threads can
be assigned to hardware resources dedicated by the processor to
handling an individual thread. Multiple threads can be assigned to such
hardware, and share use of CPU cycles to achieve maximum efficiency.
In this article, we will look at how multi-threading architectures
work, and how they can enable additional performance at minimal cost.
Then we will show how existing applications can easily be converted to
run on a multi-threading processor, and how new applications can be
written for this powerful new type of architecture.
Embedded computing facing performance barrier
Manufacturers of consumer devices and other embedded computing products
are adding new features such as WiFi, VoIP, Bluetooth, video, etc. at a
rapid rate. Historically, increased feature sets have been accommodated
by ramping up the clock speed of the processor. Clock speed has already
increased to the 3+ Gigahertz level in desktops and high Megahertz
levels even in embedded applications.
In the embedded space, this approach rapidly loses viability because
most devices are already running up against power consumption and real
estate constraints that limit additional processor speed increases.
Higher clock speeds drive dramatically greater power consumption,
making them unmanageable for more and more embedded applications.
Figure 1: Processor speeds outpacing memory.
In addition, further processor speed improvements yield diminishing
performance gains because memory performance has not kept pace with
processor speed as shown in Figure 1,
above.
Processors are already so much faster than memory that more than
half the cycles in many applications are spent waiting while the cache
line is refilled. Each time there is a cache miss or another condition
that requires off-chip memory access, the processor needs to load a
cache line from memory, write those words into the cache, write the old
cache line into memory, and resume the thread.
MIPS
Technologies Inc. has stated that a high-end synthesizable core
taking 25 cache misses per thousand instructions (a plausible value for
multimedia code) could be stalled more than 50% of the time if it has
to wait 50 cycles for a cache fill. And since processor speed is
continuing to advance at a faster rate than memory speed, the problem
is growing worse.
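The arithmetic behind that estimate can be checked with a short calculation. The sketch below is a back-of-the-envelope model, not MIPS's own methodology: it assumes the core would otherwise retire one instruction per cycle (the `base_cpi` parameter is our assumption) and that every miss stalls the pipeline for the full fill latency. The function name is illustrative.

```python
def stall_fraction(misses_per_kilo_instr, fill_cycles, base_cpi=1.0):
    """Fraction of all cycles spent stalled on cache fills.

    Assumes (our simplification) that misses do not overlap and that the
    core retires one instruction every `base_cpi` cycles when not stalled.
    """
    busy_cycles = 1000 * base_cpi                       # useful work per 1000 instructions
    stall_cycles = misses_per_kilo_instr * fill_cycles  # cycles spent waiting on memory
    return stall_cycles / (busy_cycles + stall_cycles)


# Figures quoted above: 25 misses per 1000 instructions, 50-cycle fill.
print(f"{stall_fraction(25, 50):.1%}")
```

Under these assumptions, 1,250 of every 2,250 cycles are stalls, or roughly 56%, which is consistent with the "more than 50%" figure.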
The multi-threading alternative
Multi-threading addresses this problem by using the cycles that the
processor would otherwise waste waiting for memory access to execute
other concurrent threads of program execution. When one
thread stalls waiting for memory, another thread immediately presents
itself to the processor to keep computing resources fully occupied.
Notably, conventional processors cannot employ this approach because
it takes a large number of cycles to switch the context from one
thread to another. Multiple application threads must be immediately available
and "ready-to-run" on a cycle-by-cycle basis for this approach to work.
The 34K processor is the first such multi-threading product from a
major supplier of embedded processors targeting the consumer device
market. Each software thread runs on a
TC, or Thread Context.
A TC includes a complete set of general-purpose registers and a program
counter (that's where the name comes from: in OS terminology, the vital
registers and PC of a thread make up the "thread context").
Each TC has its own instruction prefetch queue, and the queues are
kept full independently. That means that the core can switch between
the threads on a cycle-by-cycle basis, avoiding the overhead involved in
software context switching. Adding more TCs requires very little additional
silicon. TCs share most of the CPU hardware including the execution
unit, ALU, and caches. Moreover, adding a TC does not require the CPU
to have another copy of the CP0 registers by
which OS software runs the CPU.
A set of shared CP0
registers and the TCs affiliated with them make up a Virtual
Processing Element (VPE). A TC runs a thread and a VPE hosts
an OS: if you have two VPEs you can either have two independent
Operating Systems or one SMP-style OS. A VPE with one TC looks exactly
like a traditional MIPS32 architecture CPU and is fully compliant with
the MIPS architecture specification: it's a complete virtual processor.
The 34K core can be built with up to nine TCs and two VPEs. The
affiliation of TCs to VPEs is determined at run-time. By default all
TCs which are ready for execution get a fair share of processing time,
but the 34K also includes handles for an application to influence
thread scheduling when some particularly demanding thread might
otherwise get starved. That is, software can control the Quality of
Service (QoS) for each thread. Application software interacts with a
hardware Policy Manager which assigns dynamically-changing priorities
to individual TCs. A hardware Dispatch Scheduler then assigns threads
to the execution unit on a cycle-by-cycle basis to meet the QoS
requirements.
Figure 2: Multi-threading improves pipeline efficiency.
In a multi-threaded environment such as the 34K, performance can be
substantially improved because whenever one thread waits for memory
access, other threads use the processor cycles that would otherwise be
wasted. Figure 2 above shows
how multi-threading can speed up an application. With just Thread0 running,
only 5 out of 13 processor cycles are used for instruction execution
and the rest are spent waiting for the cache line to refill. The
efficiency in this case using conventional processing is only 38%.
Adding Thread1
makes it possible to use 5 additional processor cycles previously
wasted waiting. With 10 out of 13 processor cycles now used, efficiency
improves to 77%, providing a 100% speedup over the base case. Adding Thread2 makes it
possible to fully load the processor, executing instructions on 13 out
of 13 cycles for 100% efficiency. This is 2.6 times the throughput of
the base case, or a 160% speedup.
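The Figure 2 numbers follow from simple counting. The sketch below reproduces them under the assumption, taken from the figure description, that each thread can issue on 5 cycles of a 13-cycle window and that added threads fill otherwise idle cycles perfectly. The function and parameter names are ours.

```python
def pipeline_efficiency(threads, busy_cycles_per_thread=5, window_cycles=13):
    """Fraction of cycles in the window that issue an instruction.

    Idealized: threads never collide, and the total is capped at the
    window length because the pipeline issues at most once per cycle.
    """
    used = min(threads * busy_cycles_per_thread, window_cycles)
    return used / window_cycles


base = pipeline_efficiency(1)
for n in (1, 2, 3):
    eff = pipeline_efficiency(n)
    print(f"{n} thread(s): efficiency {eff:.0%}, {eff / base:.1f}x base throughput")
```

Throughput relative to the single-thread case is 2.0x with two threads and 2.6x with three. Real workloads will fall short of this ideal whenever several threads stall at once.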
On EEMBC performance benchmarks, the 34K core running just two threads
shows a 60% application speedup over the 24KE family, with only a 14%
increase in die size, as shown in Figure 3, below.
Figure 3: EEMBC benchmark performance examples show 60% improvement with just two threads.
Adapting software to multi-threading
A key advantage of the multi-threading approach is that in most cases
it will run existing software with relatively little modification. Most
consumer device applications are already written as a series of
semi-independent threads. Each thread can automatically or manually be
assigned to a particular hardware TC.
If the currently executing thread cannot continue because of a delay
caused by a cache miss or other circumstance, CPU execution switches,
without any wasted cycles, from that TC to one whose thread is ready
to run. The more threads in the application, the greater the potential
to use cycles that would otherwise be wasted waiting for memory access.
Multi-threading is ideal for anyone using or considering using an RTOS
since RTOS applications are inherently multi-threaded. There is no need
to rewrite an RTOS application for multi-threading because the RTOS can
map the application threads to the TCs automatically under program
control in the same way that it would map the threads to a conventional
processor.
Figure 4: Mapping threads to TCs.
If there are more threads than TCs, a conventional context switch is
usually required. These context switches occur just as in a
conventional processor. The RTOS saves the state for the current task,
loads the context for another task, and begins execution. The
multi-threading environment obviously has the potential for more
context switches than a conventional processor so the speed at which
context switching can be accomplished becomes more significant.
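The split between threads that sit in a TC and threads that must wait for a software context switch can be pictured with a small model. This is a sketch only: the article does not describe the RTOS's actual placement policy, so the priority-ordered choice, the "lower number means higher priority" convention and all names here are assumptions.

```python
def map_threads_to_tcs(ready_threads, num_tcs):
    """Split ready threads into TC-resident threads and waiting threads.

    ready_threads: list of (name, priority) tuples, lower number = higher
    priority (assumed convention). Returns (resident, waiting), where
    resident maps a TC index to a thread name.
    """
    ordered = sorted(ready_threads, key=lambda t: t[1])  # stable sort keeps ties in input order
    resident = {tc: name for tc, (name, _) in enumerate(ordered[:num_tcs])}
    waiting = [name for name, _ in ordered[num_tcs:]]    # need a conventional context switch
    return resident, waiting


threads = [("audio", 1), ("video", 2), ("net", 3), ("ui", 4), ("log", 5), ("stats", 6)]
resident, waiting = map_threads_to_tcs(threads, num_tcs=4)
print(resident)
print(waiting)
```

In this hypothetical six-thread application on four TCs, the hardware switches among the four resident threads at no cost, while `log` and `stats` run only after the RTOS saves and restores a context.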
Linux/Windows vs RTOSes for multithreading
This highlights an advantage of a fast RTOS relative to OSes such as Linux and the embedded versions
of Windows. Typical real-time performance for Linux is generally in the
range of a few hundred microseconds to a few milliseconds. But worst
case Linux real-time performance is unbounded. On the other hand, a
fast RTOS provides a deterministic real-time performance in the range
of about 1 to 2 microseconds on a single-threaded processor and even
faster on a multi-threaded processor.
The RTOS assigns unique (non-duplicated) resources to unique TCs. By
convention, the single floating point unit (FPU) is always assigned to
TC0. This means that any thread that performs hardware-based floating
point operations must be mapped to TC0, and all such threads must share TC0.
This results in some interesting programming choices, particularly the
choice of whether to implement floating point operations in hardware or
software.
Doing floating point in hardware is obviously faster but on the
other hand requires sharing the FPU. If threads only do a small amount
of floating point work, then it may make sense to implement them in
software while threads that do intensive floating point work would
normally be implemented as floating point operations in hardware and be
mapped to TC0. It's important to note that this change does not require
recoding because the decision of whether to use hardware or software
floating point can be implemented with a compiler switch.
Assigning weights to threads
If the application doesn't specify weights for the various threads, the
application scheduler will assign equal weights to all. On the other
hand, time slicing can be used so that threads share CPU cycles
according to user-specified weights. Assigning weights is equivalent to
assigning a proportion of CPU cycles to particular threads. Thread
weights are mapped to the hardware TC transparently by the RTOS.
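Because a weight is simply a proportion of CPU cycles, the share each thread receives can be derived by normalizing the weights. The sketch below illustrates that relationship, including the equal-weight default. It is not an RTOS API, and the function and thread names are hypothetical.

```python
def cycle_shares(thread_names, weights=None):
    """Return each thread's fraction of CPU cycles.

    weights: optional dict mapping thread name to a positive weight.
    If omitted, every thread gets an equal weight, as described above.
    """
    if weights is None:
        weights = {name: 1 for name in thread_names}
    total = sum(weights[name] for name in thread_names)
    return {name: weights[name] / total for name in thread_names}


print(cycle_shares(["audio", "video", "ui"]))
print(cycle_shares(["audio", "video", "ui"], {"audio": 2, "video": 5, "ui": 1}))
```

With weights of 2, 5 and 1, the three threads would receive 25%, 62.5% and 12.5% of the cycles respectively.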
Some existing applications may be designed for conventional
processors under the assumption that a lower priority thread will not
be allowed to run while a higher priority thread is "ready". In
embedded programming, ready means that all conditions necessary for the
thread to run have been met and the only thing preventing it from
running is its priority.
Multi-threading runs the risk of violating this assumption since
lower priority threads are able to run whenever higher priority threads
are stalled. Code written so that it does not depend on this
assumption will get the best performance.
On the other hand, existing code written based on this condition can
be run unmodified on multi-threading processors simply by setting an
operating system switch that allows only threads of the same
priority to be loaded in TCs at the same time. When setting this
switch, be sure to assign threads that can be run in parallel to the
same priority level whenever possible.
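The effect of such a switch can be modeled in a few lines. The sketch below is an assumption-laden illustration, not the behavior of any particular RTOS: the `same_priority_only` flag stands in for the operating system switch, and lower numbers are taken to mean higher priority.

```python
def threads_allowed_in_tcs(ready_threads, num_tcs, same_priority_only):
    """Names of the ready threads that may occupy TCs at the same time.

    ready_threads: list of (name, priority) tuples, lower number = higher
    priority (assumed convention).
    """
    ordered = sorted(ready_threads, key=lambda t: t[1])
    if same_priority_only and ordered:
        top = ordered[0][1]
        # Only threads at the highest ready priority may be loaded together.
        ordered = [t for t in ordered if t[1] == top]
    return [name for name, _ in ordered[:num_tcs]]


ready = [("ctrl", 1), ("dsp_a", 1), ("log", 3)]
print(threads_allowed_in_tcs(ready, 3, same_priority_only=True))
print(threads_allowed_in_tcs(ready, 3, same_priority_only=False))
```

With the switch set, `log` stays out of the TCs while `ctrl` and `dsp_a` are ready, so one TC sits empty. This is why threads that can safely run in parallel are best given the same priority level.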
The importance of being interruptible
Interrupts are critical in a conventional embedded application because
they provide the primary and in many cases the only means for switching
from one thread to another. Interrupts fulfill exactly the same role in
multi-threading applications as they do in a conventional application.
But there is an important difference: in a multi-threaded application,
switches from one thread to another occur not only in response to
interrupts but also whenever spare CPU cycles become available.
It's important to avoid the situation where one thread is
interrupted while modifying a critical data structure, enabling a
different thread to make other changes to the same structure. This
could result in the data structure being left in an inconsistent state,
with potentially catastrophic results.
Most conventional applications overcome this problem by briefly
locking out interrupts while an ISR or system service is modifying
crucial data structures inside the RTOS. This reliably prevents any
other program from jumping in and making uncoordinated changes to the
critical area being used by the executing code.
This approach, however, is not sufficient in a multi-threaded
application because of the potential for a switch to a different TC
that is not impeded by the interrupt lockout and therefore might
operate on the critical area. This can be overcome by using the DMT
instruction in the 34K architecture to disable multi-threading while
the data structure is being modified.
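A toy model makes the hazard concrete. The Python sketch below is not 34K code: it simulates two TCs that each perform a two-step read-modify-write on a shared counter, with the hardware alternating between them every cycle. The `guard_with_dmt` flag is a hypothetical stand-in for bracketing the update with DMT (and re-enabling multi-threading afterwards), and interrupt lockout plays no role in this model because no interrupt occurs.

```python
def run_two_tcs(updates_per_tc, guard_with_dmt):
    """Simulate two TCs incrementing a shared counter and return its final value."""
    shared = 0
    # phase 0 = about to read the counter, phase 1 = about to write it back
    tcs = [{"left": updates_per_tc, "tmp": 0, "phase": 0} for _ in range(2)]
    current = 0
    while any(tc["left"] > 0 for tc in tcs):
        tc = tcs[current]
        if tc["left"] > 0:
            if tc["phase"] == 0:
                tc["tmp"] = shared          # read step
                tc["phase"] = 1
                if guard_with_dmt:
                    continue                # multi-threading off: same TC issues again
            else:
                shared = tc["tmp"] + 1      # write step
                tc["phase"] = 0
                tc["left"] -= 1
        current = 1 - current               # hardware switches TCs every cycle
    return shared


print(run_two_tcs(3, guard_with_dmt=False))
print(run_two_tcs(3, guard_with_dmt=True))
```

Without the guard, both TCs read the same value before either writes, so half of the six updates are lost. With the guard, each read-modify-write completes before the other TC can issue, and all six updates are counted.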
With these relatively simple exceptions, application code can run
unchanged when moving from conventional to multi-threaded processors.
This makes it easy and inexpensive to take advantage of the ability of
multi-threading to use the CPU cycles that are often wasted by
conventional RISC processors. Multi-threading meets the demand for
high performance in today's and tomorrow's consumer, networking,
storage and industrial devices with only minor increases in
cost and power consumption.
Compared to its primary competitor, the multicore approach,
multi-threading requires a smaller die area and consumes less power
while also being simple to program and requiring few or no changes to
existing applications. Multicore approaches have their own advantages
and strengths, and there is no reason the two approaches cannot be
combined for "the best of both worlds." In applications that require
high performance, low cost, and minimum power consumption,
multi-threading is a compelling approach.
John A. Carbone, vice president of
marketing for Express
Logic, has 35 years of experience in real-time computer systems and
software, ranging from embedded system developer and FAE to vice
president of sales and marketing. Prior to joining Express Logic, Mr.
Carbone was vice president of marketing for Green Hills Software.
To read more about this topic go to "More about Multithreading and Multicores."