Multiprocessing with real-time operating systems

In an ideal world, programmers switching from 1 to n processors would see their code run n times as fast, with no code changes. Things aren't that simple, unfortunately, but as you'll see, there are a number of ways the RTOS can make things easier.

Although the concept of a multiprocessor system has been around for decades, it has only recently attained commercial viability as demand grows for scalable, high-performance, highly reliable systems. This increased demand for so-called “highly available” CPU-intensive systems (systems with 99.999% uptime) has spawned several types of multiprocessor systems, each designed for a specific application. Multiprocessor systems are widely used in 3D graphics applications with audio/video compression and decompression running on specialized multiprocessor chips (for example, the Fuzion 150 from PixelFusion). They also appear in high-bandwidth network switch/router designs, where special network-management features run on specialized multiprocessors (for example, the SB1250, with its SB-1 cores, from Broadcom's Mercurian family of programmable network processors).

This article focuses on a specific type of multiprocessor system, the tightly coupled shared-memory symmetric multiprocessor (SMP), and how it's supported by a typical real-time operating system (RTOS).

Our SMP
In our SMP configuration, the term tightly coupled means that the individual processor cores lie close to each other and are physically connected over a common high-speed bus. The processors share a global memory module (shared memory) and peripheral devices through a common I/O bus interface.

It's important to know the difference between such an SMP system and a distributed multiprocessor environment. In a distributed multiprocessor system, the individual processing units typically reside as separate nodes. Each of these nodes can have a different processor type and its own memory and I/O devices. Each processor runs its own operating system and synchronizes with other processors using messages or semaphores over an interconnect such as Ethernet or a more specialized high-speed interconnect such as Infiniband.

A tightly coupled shared-memory SMP system, on the other hand, runs a single copy of an operating system that coordinates simultaneous activity occurring in each of the identical CPUs. Since these tightly coupled CPUs access a common memory region, they synchronize with each other using a communication mechanism based on low-latency shared memory.

Symmetry
An SMP system, by definition, has multiple identical processors connected to a set of peripherals, such as memory and I/O devices, on a common high-speed bus. The term symmetric is used because each processor has the same access mechanism to the shared memory and peripheral devices.


Figure 1: Shared-memory symmetric multiprocessor architecture

Figure 1 shows an SMP system with two identical processors (P1 and P2), a global memory module, and I/O devices such as disk controllers and network interfaces. The two processors share the memory device over a common high-speed bus and are typically connected to each of the peripheral devices through a high-speed bus interface.

The processors have to contend for access to the bus (for example, to read/write to the global memory).

If P1 has occupied the bus for receiving data from the network interface, P2 has to wait for the operation to complete before getting access to write data to the disk. Thus, while any one device is busy accessing the memory for a read-write operation, all the other devices seeking access enter the busy-wait state.

However, each of these devices typically has some local memory reserved exclusively for the device's own use. For example, an I/O device, such as a typical network interface controller, has some local memory to buffer and preprocess data frames that are received over the network before they're transferred (typically through DMA) to the global memory for further processing by P1 or P2.

In addition to the per-CPU interrupt controller, such a system usually has another interrupt controller (an IO-APIC from Intel, for example) that routes the interrupt request signals from the peripheral devices to one of the CPUs for servicing.

Owing to the symmetric hardware architecture, the software design remains independent of any specific interconnection scheme of the various devices or other hardware-related aspects, including the actual number of processing elements in the system.

Why SMP?
Before we explore how a typical RTOS supports an SMP system, let's look at some benefits to using such a system:

  • Two processing units running in parallel complete a set of tasks more quickly than one processor. Four processors are, likewise, better than two.
  • An SMP system can be scaled by adding more processing cores and/or peripheral devices to execute more tasks in parallel, making it a viable platform for growing workloads.
  • If one processor suffers a fatal failure, another can take over its work seamlessly, reducing downtime.

The operating system
Another benefit is that an SMP system is relatively straightforward to use through the operating system. A suitable operating system design hides the actual number of processing units in the hardware from the application programmer. This enables the operating system to run existing application software without any SMP-specific modification.

Much like its uniprocessor equivalent, a preemptive RTOS running on an SMP system guarantees that at any given time the tasks of the highest priority are running. The difference is that each processor can run one task, so multiple tasks can be running at the same time. If a higher-priority task becomes ready to run, it preempts the lowest-priority running task.
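To the application, all of this is transparent. A portable multithreaded program such as the following sketch (POSIX threads, shown purely for illustration) runs unchanged whether there is one processor or several; on an SMP RTOS its threads may genuinely run in parallel, one per processor.

#include <pthread.h>
#include <stdio.h>

// The same code runs unchanged on one CPU or many. On an SMP
// system the scheduler is free to place each thread on its own
// processor; on a uniprocessor, the threads simply time-share.
void *worker(void *arg)
{
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}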

In a multithreaded application running on a typical uniprocessor system, the RTOS restricts access to a global resource to one task at a time through the principle of mutual exclusion. This principle extends to a multiprocessor system through the addition of spin locks.

The main decisions you'll make in using an RTOS in this type of multiprocessor system are the number of spin locks and the appropriate kernel architecture. Let's start by looking at spin locks.

Spin locks
A spin lock is the most basic SMP primitive. It is used to ensure data integrity in a kernel supporting SMP systems. Listing 1 illustrates a simple spin-lock implementation in pseudo-code.

Listing 1: Lock-unlock for short-term mutual exclusion

SPINLOCK sp_lock = 0;

lock(&sp_lock) {
    // Try to obtain exclusive access to the lock.
    while (swap_atomic(&sp_lock, 1) == 1) {
        // Lock in use. Try again.
    }
    // The caller now has exclusive access to the lock.
}

unlock(&sp_lock) {
    // Make the lock available to the next user.
    swap_atomic(&sp_lock, 0);
}

The SPINLOCK sp_lock is a kernel data structure. A task running on any one processor that acquires sp_lock, by calling the lock() function successfully, exits the “spin” loop therein and gains exclusive access to the shared resource the lock protects. When the task is done with the shared resource, it releases the lock by calling the unlock() function. In the meantime, any other running tasks (on other processors) trying to acquire that lock spin idly, checking for sp_lock to be reset to 0. Only one of the waiting tasks will successfully acquire the lock when it is released.
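The swap_atomic() primitive stands in for an atomic test-and-set or exchange instruction. As a rough sketch of how the same lock might look in real code (using C11 atomics; the names are ours, not from any particular RTOS):

#include <stdatomic.h>

typedef atomic_int spinlock_t;

void lock(spinlock_t *sp)
{
    // atomic_exchange() stores 1 and returns the previous value;
    // a return of 1 means the lock was already held, so keep spinning.
    while (atomic_exchange(sp, 1) == 1)
        ;  // busy-wait
}

void unlock(spinlock_t *sp)
{
    atomic_exchange(sp, 0);  // make the lock available again
}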


Figure 2: Propagation of single giant lock among tasks

Consider the example depicted in Figure 2, which illustrates how the spin lock works in a four-processor environment. First, T1 acquires the lock before entering its critical section. While T1 is still accessing the shared resource, T2, which is running on a different processor, tries to acquire the same lock and spins waiting for T1 to release it. Subsequently, T3 also calls lock() and spins idly. T2 and T3 both spin idly waiting for the lock. When T1 completes its critical code execution and releases the lock, T3 acquires the lock first, even though it arrived later than T2. There are no context switches during this sequence of operations.

Immediately after the release of the spin lock, the spinning task that next tries to acquire the lock gets it and exits the while loop to continue with its kernel service. If the RTOS supports prioritized spin locks, then upon release of the spin lock by T1, the lock is given to the spinning task of the highest priority.[1]

An alternative approach is to let the task that is spinning idly sleep for a while and switch to running another user-mode task. Upon release of the spin lock, a wakeup() function is called that wakes up all the tasks sleeping on the spin lock.

Listing 2: Lock-unlock with sleep-wakeup for long-term mutual exclusion

SPINLOCK sp_lock = 0;

lock(&sp_lock) {
    // Try to obtain exclusive access to the lock.
    while (swap_atomic(&sp_lock, 1) == 1) {
        // Lock in use. Sleep until it's released.
        sleep(&sp_lock);
    }
    // The caller now has exclusive access to the lock.
}

unlock(&sp_lock) {
    // Make the lock available to the next user.
    swap_atomic(&sp_lock, 0);
    // Wake up the tasks that slept waiting for this lock.
    wakeup(&sp_lock);
}

Listing 2 shows pseudo-code for a spin lock implementation with an associated sleep-wakeup. This implementation is better suited to long-term mutual exclusion. If the lock is held only for very short durations, the context-switch overhead of these sleep-wakeup calls can outweigh whatever performance is gained by running other tasks instead of keeping the processor tied up with one task spinning idly for the lock, as in Listing 1.

Now that we have seen how a basic spin lock works, let's see how spin locks are used to ensure data integrity inside a multiprocessor kernel.

Architectures
There are several possible architectures for an operating system to support SMP hardware.

Master-slave
The simplest way to modify a uniprocessor kernel to support multiprocessor systems is to treat the entire operating system as one indivisible entity and restrict all kernel-mode activity to run on the same processor, known as the master processor. The other processors, called slave processors, execute only user-mode operations and, hence, the resulting software architecture is no longer symmetric.

A task running on one of the slave processors may request a kernel service by entering a queue of tasks waiting to run on the master for kernel-mode execution. When the master processor is free, the highest priority task waiting in this queue is allowed to run. Upon completion of the requested kernel service, the task, if it's still ready to run, enters a second queue for user-mode tasks that need execution on one of the slaves.

Note that the two queues (one for managing tasks requesting execution on the master and the other for any slave) must be protected via spin locks. Tasks running in parallel on different processors may try to access these queues at the same time. The spin locks prevent race conditions associated with the queues.
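In pseudo-code, the protected queues might look something like this (a sketch built on the lock()/unlock() primitives of Listing 1; the queue routines and names are illustrative):

SPINLOCK master_q_lock = 0;  // protects the master's request queue
SPINLOCK slave_q_lock = 0;   // protects the slaves' run queue

// Called on a slave when a task requests a kernel service.
request_kernel_service(task) {
    lock(&master_q_lock);
    enqueue_by_priority(&master_queue, task);
    unlock(&master_q_lock);
}

// Called on the master whenever it becomes free.
dispatch_kernel_work() {
    lock(&master_q_lock);
    task = dequeue_highest_priority(&master_queue);
    unlock(&master_q_lock);
    run_in_kernel_mode(task);  // kernel mode runs only on the master
}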

Other kernel data structures do not require spin-lock protection, since the kernel always executes on the master processor. Thus, as far as the kernel is concerned, it is still running in a uniprocessor environment. Only the user-level tasks are able to exploit the parallelism offered by the slave processors.

A good example of such an implementation is the pSOS+m RTOS from Wind River Systems.

The drawback of the master-slave architecture is that it works well only when the tasks execute mostly in user mode. In applications that involve a significant percentage of kernel-mode operations, a master-slave kernel serializes such operations of all tasks for execution on the master. Therefore, the system as a whole performs as if it's running on a uniprocessor system. Furthermore, if several tasks concurrently request kernel services, the overhead associated with these tasks entering and leaving the run queues can significantly slow overall performance.

With a master-slave RTOS, if the application consists of 40% kernel-mode execution, adding a second CPU will result in at best a 30% improvement in system performance. As the kernel mode execution time increases as a percentage of the overall CPU usage, and as more tasks enter and leave the run queues, the application performance degrades rapidly.
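That figure follows from a simple Amdahl's-law estimate: if 40% of the work remains serialized on the master while the other 60% splits evenly across two CPUs, execution time falls from 1.0 to 0.4 + 0.6/2 = 0.7 of the original, a 30% improvement at best. Queueing overhead only erodes it further.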

Giant lock, coarse-grained locking
A refinement of the master-slave kernel is the giant lock architecture. The giant lock treats the entire operating system as one monolithic entity protected by one spin lock, but doesn't restrict the execution of kernel-mode operations to one specific processor.

Any task requesting a kernel-mode operation can acquire the giant lock and continue running on the processor it was already using. However, kernel-mode execution is still restricted to any one of the processors at any given time. So while one task holds the giant lock, all others waiting to use kernel services will idle.
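In pseudo-code, every kernel entry path (system call or trap) is simply bracketed by the one lock. This is a sketch; the entry and exit hooks are illustrative:

SPINLOCK giant_lock = 0;  // one lock serializes all kernel-mode work

kernel_entry() {
    lock(&giant_lock);    // spin while any other CPU is in the kernel
}

kernel_exit() {
    unlock(&giant_lock);  // now another CPU may enter the kernel
}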

Since there's no notion here of a master or a slave processor—true to the symmetric hardware architecture of the SMP system—this RTOS design allows any processor to execute kernel-mode operations. Although the overhead of switching to a different processor is eliminated in this architecture, note that the task waiting for the lock will spin idly, unlike a master-slave kernel where the processor can switch to another user-mode task while the current task waits in the master queue.

One such design, in which the entire kernel is treated as one entity and protected by a single spin lock, is shown in Figure 2. Notice that T2 spends a significant amount of time spinning idly, waiting for the giant lock to be released by T3. If T1 and then T3 hold the lock for extended periods, T2 may continue to spin idly and never accomplish anything useful, even though it occupies one of the processors the whole time. Hence, if several such tasks are waiting for a kernel service, they will execute serially, one after the other, as if on a uniprocessor system, rather than concurrently, as desired when we opt for a multiprocessor system.

This is the essential drawback of the giant lock kernel architecture. Having multiple processors may not actually improve the application performance if several tasks need kernel mode operation for significant amounts of time. As a result of this poor scaling capability and the adverse impact it has on interrupt latencies, this design is not used in most RTOSes that support multiple processors.

An example of such a kernel-lock implementation is the FreeBSD 4.2 operating system, which has a traditional non-real-time UNIX kernel.

The giant lock kernel is one extreme of coarse-grained kernel locking. To provide good scaling behavior, an SMP operating system needs separate locks for different kernel subsystems. So-called fine-grained kernel locking is designed to allow individual kernel subsystems to run as separate threads on different processors at the same time, allowing a greater degree of parallelism among the tasks in an application.

Fine-grained locking
The goal underlying a fine-grained kernel locking architecture is to allow tasks running on different processors to concurrently execute kernel-mode operations. The result is often called a threaded kernel. This is accomplished by using separate spin locks for different kernel subsystems, so that tasks trying to access these individual subsystems can be allowed to run concurrently.

The granularity of the locking mechanism determines the maximum number of concurrent kernel threads. For example, the core scheduler and file-system components of most operating systems are relatively independent kernel services, so two different spin locks can protect these two subsystems. A task performing a large file read/write involving time-consuming disk I/O thus does not block another task attempting to access the scheduler to, say, activate another task. Having separate spin locks for these two subsystems results in a threaded kernel, in which two different tasks can execute in kernel mode at the same time.
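In pseudo-code, such an arrangement might look like the following sketch (the subsystem entry points shown are illustrative):

SPINLOCK sched_lock = 0;  // protects the scheduler's run queues
SPINLOCK fs_lock = 0;     // protects file-system state

fs_read(file, buf, nbytes) {
    lock(&fs_lock);       // a long disk I/O holds only the FS lock
    // ... perform the disk I/O ...
    unlock(&fs_lock);
}

activate_task(task) {
    lock(&sched_lock);    // does not contend with holders of fs_lock
    make_runnable(task);
    unlock(&sched_lock);
}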


Figure 3: Two separate spin locks

As shown in Figure 3, while T1 is busy doing disk I/O, T2 is able to activate the high-priority task T3, thereby allowing T2 and T3 to do useful work instead of spinning idly during this time. Notice that T3 releases the spin lock it acquired before suspending itself. Otherwise, any other task trying to acquire the same lock would spin idly forever.

A fine-grained kernel locking mechanism can enable a much greater degree of parallelism in execution of user tasks. This helps boost CPU utilization and the overall application throughput.

However, having additional spin locks for different kernel subsystems introduces the overhead of having to manage each spin lock separately in order to guarantee mutual exclusion for each subsystem.

Having multiple spin locks also introduces a potential for deadlock. For example, in the scenario depicted in Figure 3, an alternative outcome is:

  1. T1 acquires SP_LOCK1
  2. T2 acquires SP_LOCK2
  3. T2 then tries to acquire SP_LOCK1 and spins waiting for T1 to release SP_LOCK1
  4. T1 then tries to acquire SP_LOCK2 and spins waiting for T2 to release SP_LOCK2

This results in a deadlock in which the entire system hangs. The problem gets more complicated if there are several such spin locks in the operating system. (This problem can be addressed by implementing the prioritized spin lock with priority ceiling or priority inheritance protocols.[1])
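In pseudo-code, the deadlocked sequence looks like the following sketch. A common defense, besides the prioritized spin locks cited above, is to require every task to acquire the locks in the same fixed order:

task_T1() {
    lock(&SP_LOCK1);
    lock(&SP_LOCK2);  // spins forever if T2 already holds SP_LOCK2
    // ... critical section using both resources ...
    unlock(&SP_LOCK2);
    unlock(&SP_LOCK1);
}

task_T2() {
    lock(&SP_LOCK2);
    lock(&SP_LOCK1);  // spins forever if T1 already holds SP_LOCK1
    // ...
    unlock(&SP_LOCK1);
    unlock(&SP_LOCK2);
}

// Deadlock-free variant: both tasks acquire SP_LOCK1 before SP_LOCK2,
// so neither can hold one lock while spinning on the other.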

Thus, you must carefully balance the overhead of managing multiple spin locks with the desired degree of parallelism in the application. It's critical that you identify the kernel subsystems requiring separate spin locks. By carefully selecting a set of kernel subsystems to be protected by separate spin locks, a typical RTOS design can provide good scaling behavior of its SMP implementation for certain types of applications.

MontaVista Linux Professional Edition, a derivative of the Linux 2.4 kernel with a fully preemptive scheduler, has a fine-grained locking mechanism inside its SMP kernel for improved scalability. This design allows user tasks that invoke kernel services to run concurrently as separate kernel-mode threads on different processors.

Among the other more popular commercial RTOS vendors, QNX and Wind River also offer multiprocessor support. The multiprocessor support for VxWorks, VxMP, provides synchronization primitives based on a message-passing interface between distinct VxWorks instances running on each node. It is targeted more toward multi-board designs with a board-level interconnect than a traditional SMP design.

QNX, on the other hand, supports a true SMP design with a microkernel rather than a monolithic architecture. The kernel itself supports only a core set of services, such as scheduling and interrupt management; all other services, including network I/O and the protocol stacks, run as separate processes. Kernel-mode operations therefore run for very short periods of time, and the microkernel can still be treated as a single entity in implementing the SMP locking mechanism. For a more detailed discussion of QNX and its microkernel architecture, see the QNX System Architecture manual.[2]

Interrupt management
An important aspect of RTOS design that I've intentionally left out of the discussion so far is the mechanism for servicing the various interrupt-driven devices, such as the system timer and I/O devices. Interrupt latency affects the application's overall responsiveness to external events and internal alarms. In the case of the system timer, the latency of system clock interrupts is directly related to the scheduling latency that determines the real-time properties of the kernel.

In a typical uniprocessor RTOS, potential race conditions between tasks and interrupt handlers accessing the same global shared resource are prevented by having the task temporarily disable interrupts while it executes in the critical code section. In a multiprocessor system, however, disabling interrupts is not enough to prevent such race conditions.

Listing 3: A multiprocessor design with a race condition
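A sketch of what such code might look like follows; the helper names (disable_interrupts(), TICK_USEC) and the layout of time are illustrative, not from any particular RTOS.

struct {
    long seconds;
    long microseconds;
} time;                        // global time, shared by ISR and tasks

Timer_ISR() {
    // May run on whichever processor takes the clock interrupt.
    time.microseconds += TICK_USEC;
    if (time.microseconds >= 1000000) {
        time.microseconds -= 1000000;
        time.seconds++;
    }
}

Task_1() {
    disable_interrupts();      // masks interrupts on THIS CPU only
    secs = time.seconds;       // Timer_ISR may update time on another
    usecs = time.microseconds; //   CPU between these two reads
    enable_interrupts();
}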

Consider a situation where the interrupt handler Timer_ISR(), as shown in Listing 3, is being serviced, and Task_1(), running on another processor, disables interrupts and enters its critical code section. The system clock interrupt handler Timer_ISR() is now running concurrently with Task_1(), both accessing the global variable time. The result is a race condition between Task_1() and Timer_ISR() in which the current time in seconds, as read by Task_1(), may not correspond to the value in microseconds.

In order to avoid race conditions between tasks and interrupt service routines in an SMP system, both the tasks and the ISRs must gain exclusive access to the shared resource by obtaining a spin lock for the resource used by their critical section.

Modified pseudo-code for Task_1() and Timer_ISR() is shown in Listing 4. If the ISR finds the spin lock already in use, it too will spin idly, waiting for the task or another ISR to release it. This affects interrupt latency and, with it, overall system responsiveness to external events.

Listing 4: A version that lacks the race condition
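A sketch of the corrected design follows; as before, the helper names are illustrative.

SPINLOCK time_lock = 0;        // protects the global time variable

Timer_ISR() {
    lock(&time_lock);          // the ISR must also take the spin lock
    time.microseconds += TICK_USEC;
    if (time.microseconds >= 1000000) {
        time.microseconds -= 1000000;
        time.seconds++;
    }
    unlock(&time_lock);
}

Task_1() {
    disable_interrupts();      // prevent this CPU's own ISR from
    lock(&time_lock);          //   spinning on a lock its task holds
    secs = time.seconds;
    usecs = time.microseconds;
    unlock(&time_lock);
    enable_interrupts();
}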

In the master-slave configuration, since all interrupt handlers, being kernel-mode operations, are restricted to run on the master, this type of race condition is implicitly avoided. In the giant lock and fine-grained kernel lock designs, however, we do need to implement this spin-lock protection in the interrupt handler routines.

The adverse impact on interrupt latency becomes especially bad with coarse-grained kernel locking. In such operating systems, all tasks and ISRs must vie for access to the single kernel spin lock.

Typically, if an application is I/O-intensive, with a high degree of interrupt-driven data transfer involving kernel mode operations, fine-grained kernel locking is essential to accomplish a reasonable degree of parallelism on an SMP system.

More is better
Because of their unique capabilities, tightly coupled shared-memory SMP systems can be used in place of existing uniprocessor systems to provide scalable platforms for CPU-intensive and high-availability designs.

Extending a uniprocessor RTOS to support an SMP system is relatively straightforward. This is typically accomplished by the addition of some basic primitives in the operating system such as spin locks. With spin locks and the choice of an appropriate kernel architecture, user-level tasks can safely and efficiently compete for shared resources while running on parallel processors. You should determine the number of spin locks based on the degree of parallelism you desire in the application and the overhead of managing the locks correctly.

Migration to an SMP system from a uniprocessor system may not result in an improvement in the overall throughput. However, many multithreaded applications and parallelizable algorithms can achieve significant speed-ups.

Srinivas Dharmasanam is an engineer at Pillar Data Systems. He has an MS degree in aerospace engineering from the University of Michigan, Ann Arbor.

References:

  1. Johnson, Theodore and Krishna Harathi, “A Prioritized Multiprocessor Spin Lock,” Tech. Rep. TR-93-005, Department of Computer Science, University of Florida, 1993.


  2. “System Architecture,” QNX Real-Time Operating System, QNX Software Systems Ltd.