Do most embedded projects still need a real-time operating system (RTOS)? It's a good question, given the speed of today's high-performance processors and the availability of real-time patches for Linux, Windows, and other general-purpose operating systems (GPOSs). The answer lies in the very nature of embedded devices.
Devices that, in most cases, are manufactured in the thousands, or millions, of units. Devices where even a $1 reduction in per-unit hardware costs can save the manufacturer a small fortune. Devices, in other words, that can't afford multi-gigahertz processors or a large memory array. In the automotive telematics market, for instance, the typical 32-bit processor runs at about 200MHz – a far cry from the 2GHz or faster processors now common in desktops and servers.
In an environment like this, an RTOS designed to extract extremely fast (and predictable) real-time response times from lower-end hardware offers a serious economic advantage. Savings aside, the services provided by an RTOS make many computing problems easier to solve, particularly when multiple activities compete for a system's resources.
Consider, for instance, a system where users expect (or need) immediate response to input. With an RTOS, a developer can guarantee that operations initiated by the user will execute in preference to other system activities, unless a more important activity (for instance, an operation that helps protect the user's safety) must execute first.
Consider also a system that must satisfy quality of service (QoS) requirements, such as a device that presents live video. If the device depends on software for any part of its content delivery, it can experience dropped frames at a rate that users perceive as unacceptable – from the users' perspective, the device is unreliable. But, with an RTOS, the developer can precisely control the order in which software processes execute and thereby ensure that playback occurs at an appropriate and consistent media rate.
RTOSs Aren't “Fair”
The need for “hard” real time – and for OSs that enable it – remains prevalent in the embedded industry. The question is, what does an RTOS have that a GPOS doesn't? And how useful are the realtime extensions now available for some GPOSs? Can they provide a reasonable facsimile of RTOS performance?
Let's begin with task scheduling. In a GPOS, the scheduler typically uses a “fairness” policy to dispatch threads and processes onto the CPU. Such a policy enables the high overall throughput required by desktop and server applications, but offers no assurances that high-priority, time-critical threads will execute in preference to lower-priority threads.
For instance, a GPOS may decay the priority assigned to a high-priority thread, or otherwise dynamically adjust the priority in the interest of fairness to other threads in the system. A high-priority thread can, as a consequence, be preempted by threads of lower priority. In addition, most GPOSs have unbounded dispatch latencies: the more threads in the system, the longer it takes for the GPOS to schedule a thread for execution. Any one of these factors can cause a high-priority thread to miss its deadlines, even on a fast CPU.
In an RTOS, on the other hand, threads execute in order of their priority. If a high-priority thread becomes ready to run, it can, within a small and bounded time interval, take over the CPU from any lower-priority thread that may be executing. Moreover, the high-priority thread can run uninterrupted until it has finished what it needs to do – unless, of course, it is preempted by an even higher-priority thread. This approach, known as priority-based preemptive scheduling, allows high-priority threads to meet their deadlines consistently, no matter how many other threads are competing for CPU time.
In most GPOSs, the OS kernel isn't preemptible. Consequently, a high-priority user thread can never preempt a kernel call, but must instead wait for the entire call to complete – even if the call was invoked by the lowest-priority process in the system. Moreover, all priority information is usually lost when a driver or other system service, usually performed in a kernel call, executes on behalf of a client thread. Such behavior causes unpredictable delays and prevents critical activities from completing on time.
In an RTOS, on the other hand, kernel operations are preemptible. There are still windows of time in which preemption may not occur, but in a well-designed RTOS, those intervals are extremely brief, often on the order of hundreds of nanoseconds. Moreover, the RTOS will impose an upper bound on how long preemption is held off and interrupts disabled; this allows developers to ascertain worst-case latencies.
To realize this goal, the RTOS kernel must be as simple and elegant as possible. And the best way to achieve this simplicity is to design a kernel that only includes services with a short execution path. By excluding work-intensive operations (for instance, process loading) from the kernel and assigning them to external processes or threads, the RTOS designer can help ensure that there is an upper bound on the longest non-preemptible code path through the kernel.
In a few GPOSs, such as Linux v2.6, some degree of preemptibility has been added to the kernel. However, the intervals during which preemption may not occur are still much longer than those in a typical RTOS; the length of any such preemption interval will depend on the longest critical section of any modules (for instance, networking, file systems) incorporated into the GPOS kernel. Moreover, a preemptible GPOS kernel doesn't address other conditions that can impose unbounded latencies, such as the loss of priority information that occurs when a client invokes a driver or other system service.
Mechanisms to Avoid Priority Inversion
Even in an RTOS, a lower-priority thread can inadvertently prevent a higher-priority thread from accessing the CPU – a condition known as priority inversion. When an unbounded priority inversion occurs, critical deadlines can be missed, resulting in outcomes that range from unusual system behavior to outright failure. Unfortunately, priority inversion is often overlooked during system design. Many examples of priority inversion exist, including one that plagued the Mars Pathfinder project in July 1997.
Generally speaking, priority inversion occurs when two tasks of differing priority share a resource, and the higher-priority task cannot obtain the resource from the lower-priority task. To prevent this condition from exceeding a bounded interval of time, an RTOS may provide a choice of mechanisms, including priority inheritance and priority ceiling emulation. We couldn't possibly do justice to both mechanisms here, so let's focus on an example of priority inheritance.
To begin, we must consider how task synchronization can result in blocking, and how this blocking can, in turn, cause priority inversion. Let's say two jobs are running, Job 1 and Job 2, and that Job 1 has the higher priority. If Job 1 is ready to execute, but must wait for Job 2 to complete an activity, we have blocking.
This blocking may occur because of synchronization; for instance, Job 1 and Job 2 share a resource controlled by a lock or semaphore, and Job 1 is waiting for Job 2 to unlock the resource. Or, it may occur because Job 1 is requesting a service currently used by Job 2.
The blocking allows Job 2 to run until the condition that Job 1 is waiting for occurs (for instance, Job 2 unlocks the resource that both jobs share). At that point, Job 1 gets to execute. The total time that Job 1 must wait may vary, with a minimum, average, and maximum time. This interval is known as the blocking factor. If Job 1 is to meet any of its timeliness constraints, this factor can't vary according to any parameter, such as the number of threads or an input into the system. In other words, the blocking factor must be bounded.
Now let's introduce a third job – Job 3 – that has a higher priority than Job 2 but a lower priority than Job 1 (see Figure 1 below). If Job 3 becomes ready to run while Job 2 is executing, it will preempt Job 2, and Job 2 won't be able to run again until Job 3 blocks or completes. This will, of course, increase the blocking factor of Job 1; that is, it will further delay Job 1 from executing. The total delay introduced by the preemption is a priority inversion.
|Figure 1 – Job 1 is waiting for Job 2 to complete an activity, when Job 3 preempts Job 2. This further delays Job 1 from executing.|
In fact, multiple jobs can preempt Job 2 in this way, resulting in an effect known as chain blocking. Under these circumstances, Job 2 might be preempted for an indefinite period of time, yielding an unbounded priority inversion and causing Job 1 to fail to meet any of its timeliness constraints.
This is where priority inheritance comes in. If we return to our scenario and make Job 2 run at the priority of Job 1 during the synchronization period, then Job 3 won't be able to preempt Job 2, and the resulting priority inversion is avoided (see Figure 2 below).
|Figure 2 – Job 2 inherits Job 1's higher priority, thereby preventing Job 3 from preempting Job 2. Job 3 no longer delays Job 1 from executing.|
Partitioning Schedulers for Guaranteed CPU Availability
As discussed, RTOSs use priority-based preemptive scheduling to determine which task gets control of the processor. While this approach provides developers with an easy method to define scheduling priority, it does pose a problem: If a given task is even one priority level higher than another task, that higher-priority task has the power to completely starve the less-critical task of CPU time. For instance, let's say you have two processes, process A and process B, where A has a slightly higher priority than B.
If process A becomes swamped with work or becomes the target of a denial of service attack, it will lock out process B (as well as any other lower-priority process) from accessing the CPU. Simply put, priority-based scheduling offers no guarantee that lower-priority processes will receive at least a fraction of the CPU. If a high-priority thread runs in a loop, it can prevent other threads from accessing the CPU and effectively freeze the entire system.
This inability to provide guaranteed CPU budgets can make it difficult to integrate subsystems from multiple development teams, since it allows tasks in one subsystem to starve tasks in other subsystems — a problem that may not become obvious until integration and verification testing. Lack of CPU guarantees also makes the system vulnerable to denial of service attacks and other malicious exploits.
To compensate for this flaw, some RTOSs offer a fixed partition scheduler. Using this scheduler, the system designer can divide tasks into groups, or partitions, and allocate a percentage of CPU time to each partition. With this approach, no task in any given partition can consume more than the partition's statically defined percentage of CPU time. For instance, let's say a partition is allocated 30% of the CPU.
If a process in that partition subsequently becomes the target of a denial of service attack, it will consume no more than 30% of CPU time. This allocated limit not only allows other partitions to maintain their availability, but can also ensure that the user interface (for instance, a remote terminal) remains accessible. As a result, operators can access the system and resolve the problem – without having to hit the reset switch.
There is a problem, however. Because the scheduling algorithm is fixed, a partition can never use CPU cycles allocated to other partitions, even if those partitions haven't used their allotted cycles. This approach squanders CPU cycles and prevents partitions from effectively handling peak demands. Systems designers must, as a result, use more-expensive processors, tolerate a slower system, or limit the amount of functionality that the system can support.
Another approach, called adaptive partitioning, addresses these drawbacks by providing a more dynamic scheduling algorithm. Like static partitioning, adaptive partitioning allows the system designer to reserve CPU cycles for a process or group of processes. The designer can thus guarantee that the load on one subsystem or partition won't affect the availability of other subsystems. Unlike static approaches, however, adaptive partitioning uses standard priority-based scheduling when the system isn't under full CPU load or attack.
As a result, threads in one partition can access any spare CPU cycles unused by threads in any other partition. This approach offers the best of both worlds: it can enforce CPU guarantees when the system runs out of excess cycles (for maximum security and guaranteed availability of lower-priority services) and can dispense free CPU cycles when they become available (for maximum performance).
|Figure 3 – Adaptive partitioning prevents high-priority tasks from consuming more than their assigned CPU percentage, unless the system contains unused CPU cycles. For instance, tasks A and D can run in time allocated to Partition 3 because tasks E and F don't require the rest of their budgeted CPU cycles.|
GPOSs – including Linux, Windows, and various flavors of Unix – typically lack the realtime mechanisms discussed thus far. Nonetheless, vendors have developed a number of realtime extensions and patches in an attempt to fill the gap. There is, for example, the dual-kernel approach, in which the GPOS runs as a task on top of a dedicated realtime kernel (see Figure 4 below). Any tasks that require deterministic scheduling run in this kernel, but at a higher priority than the GPOS. These tasks can thus preempt the GPOS whenever they need to execute and will yield the CPU to the GPOS only when their work is done.
Unfortunately, tasks running in the realtime kernel can make only limited use of existing system services in the GPOS – file systems, networking, and so on. In fact, if a realtime task calls out to the GPOS for any service, it will be subject to the same preemption problems that prohibit GPOS processes from behaving deterministically. As a result, new drivers and system services must be created specifically for the realtime kernel, even when equivalent services already exist for the GPOS.
Also, tasks running in the realtime kernel don't benefit from the robust MMU-protected environment that most GPOSs provide for regular, nonrealtime processes. Instead, they run unprotected in kernel space. Consequently, a realtime task that contains a common coding error, such as a corrupt C pointer, can easily cause a fatal kernel fault.
To complicate matters, different implementations of the dual-kernel approach use different APIs. In most cases, services written for the GPOS can't easily be ported to the realtime kernel, and tasks written for one vendor's realtime extensions may not run on another's.
|Figure 4 – In a typical dual-kernel implementation, the GPOS runs as the lowest-priority task in a separate realtime kernel.|
Realtime Patches for GPOS Kernels
Rather than use a second kernel, other approaches modify the GPOS itself, such as by adding high-resolution timers or a modified process scheduler. Such approaches have merit, since they allow developers to use a standard kernel (albeit with proprietary patches) and programming model. Moreover, they help address the requirements of reactive, event-driven systems. Unfortunately, such low-latency patches don't address the complexity of most realtime environments, where realtime tasks span larger time intervals and have more dependencies on system services and other processes than do tasks in a simple event-driven system.
For instance, in systems where realtime tasks depend on device drivers, file systems, or other services, the problem of priority inversion would have to be addressed. In Linux, the driver and virtual file system (VFS) frameworks would effectively have to be rewritten – along with any device drivers and file systems that employ them. Without such modifications, realtime tasks could experience unpredictable delays when blocked on a service. As a further problem, most existing Linux drivers aren't preemptible.
To ensure predictability, programmers would also have to insert preemption points into every driver in the system. All this points to the real difficulty, and immense scope, of making a GPOS capable of supporting realtime behavior. This isn't a matter of “RTOS good, GPOS bad,” however. GPOSs such as Linux, Windows XP, and the various Unixes all serve their intended purposes very well. They only fall short when forced into deterministic environments they weren't designed for, such as in-car telematics systems, medical instruments, and continuous media applications.
Extending the RTOS
Still, there are undoubted benefits to using a GPOS, such as support for widely used APIs and, in the case of Linux, the open source model. With open source, a developer can customize OS components for application-specific demands and save considerable time troubleshooting.
The RTOS vendor can't afford to ignore these benefits. Extensive support for POSIX APIs – the same APIs used by Linux and various Unixes – is an important first step. So is providing well-documented source code and customization kits that address the specific needs and design challenges of embedded developers.
The architecture of the RTOS also comes into play. An RTOS based on a microkernel design, for instance, can make the job of OS customization fundamentally easier to achieve. In a microkernel RTOS, only a small core of fundamental OS services (for instance, signals, timers, scheduling) resides in the kernel itself.
All other components – drivers, file systems, protocol stacks, applications – run outside the kernel as separate, memory-protected processes (see Figure 5 below). As a result, developing custom drivers and other application-specific OS extensions doesn't require specialized kernel debuggers or kernel gurus. In fact, as user-space programs, such extensions become as easy to develop as standard applications, since they can be debugged with standard source-level tools and techniques.
For instance, if a device driver attempts to access memory outside its process container, the OS can identify the process responsible, indicate the location of the fault, and create a process dump file viewable with source-level debugging tools. The dump file can include all the information the debugger needs to identify the source line that caused the problem, along with diagnostic information such as the contents of data items and a history of function calls.
Such an architecture also provides superior fault isolation and recovery: If a driver, protocol stack, or other system service fails, it can do so without corrupting other services or the OS kernel. In fact, “software watchdogs” can continuously monitor for such events and restart the offending service dynamically, without resetting the entire system or involving the user in any way. Likewise, drivers and other services can be dynamically stopped, started, or upgraded, again without a system shutdown.
These benefits shouldn't be taken lightly – the biggest disruption that can occur to realtime performance is an unscheduled system reboot! Even a scheduled reboot to incorporate software upgrades disrupts operation, though in a controlled manner. To ensure that deadlines are always met, developers must use an OS that can remain continuously available, even in the event of software faults or service upgrades.
|Figure 5 – In a microkernel RTOS, system services run as standard, user-space processes, simplifying the task of OS customization.|
A Strategic Decision
An RTOS can help make complex applications both predictable and reliable; in fact, the precise control over timing made possible by an RTOS adds a form of reliability that cannot be achieved with a GPOS. (If a system based on a GPOS doesn't behave correctly due to incorrect timing behavior, then we can justifiably say that the system is unreliable.) Still, choosing the right RTOS can itself be a complex task. The underlying architecture of an RTOS is an important criterion, but so are other factors. These include:
1) Flexible choice of scheduling algorithms – Does the RTOS support a choice of scheduling algorithms (FIFO, round robin, sporadic, etc.)? Can you assign those algorithms on a per-thread basis, or does the RTOS force you into assigning one algorithm to all threads in your system?
2) Guaranteed CPU availability – Does the RTOS support a partitioning scheduler that provides tasks with a guaranteed percentage of CPU time, regardless of what other tasks, including higher-priority tasks, are doing? Such guarantees simplify the job of integrating subsystems from multiple development teams or vendors. They also ensure that critical tasks can remain available and meet their deadlines, even when the system is subjected to denial of service attacks and other malicious exploits.
3) Graphical user interfaces – Does the RTOS use primitive graphics libraries or does it provide advanced graphics capabilities such as multi-layer interfaces, multi-headed displays, accelerated 3D rendering, and a true windowing system? Can you easily customize the GUI's look-and-feel? Can the GUI display and input multiple languages (Chinese, Korean, Japanese, English, Russian, etc.) simultaneously? Can 2D and 3D applications easily share the same screen?
4) Tools for remote diagnostics – Because downtime is intolerable for many embedded systems, the RTOS vendor should provide diagnostics tools that can analyze a system's behavior without interrupting services that the system provides. Look for a vendor that offers tools for code coverage, application profiling, system profiling, and memory analysis.
5) Open development platform – Does the RTOS vendor provide a development environment based on an open platform like Eclipse, which lets you “plug in” your favorite third-party tools for modeling, version control, and so on? Or is the development environment based on proprietary technology?
6) Standard APIs – Does the RTOS lock you into a proprietary API, or does it provide full support for a standard API like POSIX, which makes it easier to port code to and from other OSs? Also, does the RTOS offer comprehensive support for the API, or does it support only a small subset of the defined interfaces?
7) Support for multi-core processors – The ability to migrate to multi-core processors is becoming a requirement for a variety of high-performance designs. Does the RTOS support a choice of multiprocessing models (symmetric multiprocessing, asymmetric multiprocessing, bound multiprocessing) to help you take best advantage of multi-core hardware? And is the RTOS supported by system-tracing tools that let you diagnose and optimize the performance of a multi-core system? Without tools that can highlight resource contention, excessive thread migration between cores, and other problems common to multi-core designs, optimizing a multi-core system can quickly become an onerous, time-consuming task.
8) Source code kits – Does the RTOS vendor provide well-documented source and customization kits to help tailor the RTOS to your specific requirements? Does the vendor also offer driver development kits, including source code, to help you develop drivers for custom hardware?

Choosing an RTOS is a strategic decision for any project team. Once an RTOS vendor has provided clear answers to the above questions, you'll be much closer to choosing the RTOS that's right for you now — and in the future.
|This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Boston 2006. Used with permission of the Embedded Systems Conference. For more information, please visit www.embedded.com/esc/boston/|
Paul N. Leroux is Technology Analyst and Jeff Schaffer is Senior Applications Engineer at QNX Software Systems.