The Linux operating system, specifically the Linux kernel, wasdesigned and tuned by its creator, Linus Torvalds (and others) toprovide a combination of throughput and fairness in the allocationof CPU resources. But, as Linux continues to move into industrialand embedded computing environments, other requirements come to thefore. Specifically, there is a growing need for guaranteed responseto external events. Also, the application designer may need torigorously control the allocation of CPU resources.
Three critical areas of kernel design impact responsiveness:
- External interrupt management
- Process (thread) preemptability
- Process (or thread) scheduling.
Here, the term “process” refers to the primary schedulableentity in the Linux kernel. Many operating systems directlyschedule threads as entities separately represented in kernel datastructures, and, in those cases, the term “thread” would beappropriate. However, because Linux uses most of the internal datastructure of a process to represent a single thread, “process” ismore appropriate.
When an external interrupt occurs, an external interface and/ordevice needs attention or service. A system considered real-timemust respond within a critical time window to many types ofexternal interrupts. Hard real-time systems are those where failureof the system to respond by a particular deadline results incatastrophic failure of the control system. In other words, thedeadline (whatever it may be) cannot be missed.
Soft real-time systems are those where the system under controldegrades with increasing delays in responding to interrupts, butdoesn't suffer a catastrophic failure. So missed deadlines, whilenot desirable, are allowable (because the event can be retriedand/or recovered).
An important issue in external interrupt management is the useof an interrupt-off processor mode. Such a mode can be used toensure the kernel is not reentered—when the hardwareacknowledges an external interrupt and then executes a softwareinterrupt handler. Reentrancy is a problem relative to criticalsections of code execution, which must be executed from beginningto end without intervening operations to ensure correctness. Asimple example of a critical section occurs when we increment aglobal integer variable—implemented in most machinearchitectures as a load from memory, an increment in a register,then a store back to memory. Intermixing these three operationsacross multiple execution streams will cause errors in the value ofthe stored integer.
Kernel architectures are frequently designed so that particulardata structures and operations are off limits to interrupthandlers—so that the data structures and operations do notneed protection from interrupt handlers. For the most part, this isthe case in the Linux kernel. As a result, interrupt-off periodsare not used extensively, but they are used and they do impactworst-case delays between the arrival of an external interrupt andinitiation of the interrupt handler.
Another critical design issue for interrupt management is theability of the system to nest interrupts—to accept a higherpriority interrupt while still servicing an interrupt. By providingsuch a feature, the system designer can specify the relativepriority of external event servicing, and ensure that criticalexternal operations get priority of service by interrupt handlers.Linux does allow nested interrupts, and most hardware allows thepriority of different interrupt types to be specified in someway.
Many systems have relatively simple software for real-timecontrol; it can be entirely structured into the interrupt serviceroutine associated with a particular I/O interface or device. Butother control systems require more complex data-processing and/orcontrol logic running with real-time constraints. Such systemsemploy user-level processes (single-threaded or multi-threaded),which are scheduled (directly or via signals) from externalinterrupt handlers. This type of architecture may cause systemdelays before executing the real-time process that must be run inresponse to an external event.
A simple approach to this problem would be to suppress oreliminate other system activity. But that would be unrealistic inmost cases. A system complex enough to require real-time processesat the user level will also include non-real-time (or lowerpriority) processes for other operations. Examples of theseoperations include data processing, event logging, userinterfacing, or other less time-critical applications. Theseprocesses may be executing when a high-priority external eventgenerates an interrupt, causing the interrupt handler to schedule auser-level process that has a priority higher than the executingprocess.
If the interrupted process runs in user mode, the kernel can(and usually will) switch context immediately to the newlyexecutable highest-priority process after exiting the interruptservice routine. (However, as we shall see later, Linux may not dothat.) Of course, if the interrupted process executes in kernelmode, then kernel design with respect to preemptability becomes amajor concern.
A truly preemptable kernel can switch from one process runningin kernel mode to a different process, which may then enter orcontinue in kernel mode. A kernel that is not preemptable cansafely execute critical sections of code (e.g. an incrementoperation on a global integer variable) without risk of intermixingsuch operations and creating errors. Once again, we assume theexclusion of external interrupt handlers operating on datamanipulated in the critical section.
A preemptable kernel must protect all critical sections fromreentrancy. This protection can be achieved in various ways. Oneapproach turns off all interrupts during execution of the criticalsection, and enables the interrupts again upon exit. A secondapproach uses semaphores—locking on entry to the criticalsection, and unlocking on exit. An attempt to enter a lockedcritical section results in a blocked running process. Reschedulingand re-dispatching follows after the locking process continuesexecution and exits the critical section.
A third hybrid kernel design (or a retro-design) allowskernel-mode execution to be preempted only at fixed, known safepoints, typically called preemption points. A simplistic approachto improving kernel responsiveness in a non-preemptable kernelplaces preemption points at roughly-equal spacing—thusreducing delays in initiating context switches to higher-priorityprocesses.
Finally, a fourth hybrid design uses a context-switch lockaround critical sections. Such a design sets and clears a globalflag upon entry and exit for critical sections. A set flag tellsthe process scheduler that a context switch is notallowed—requiring a straight return from interrupt, even if ahigher-priority process was scheduled by an interrupt serviceroutine. When the process holding the context-switch lock releasesit, the presence of a higher-priority executable process must thenbe noted—allowing an immediate context switch to occur.
Each of the kernel designs has a specific impact onprocess-level response to external interrupts.
A non-preemptable kernel raises two concerns:
- What is the worst-case path length (in terms of execution time)through kernel code? This time will constitute the majority of theworst-case delay in running a newly executable real-time process(scheduled by an external interrupt handler).
- The second concern involves general responsiveness. In a softreal-time system, a non-preemptable kernel will have generallylonger and more frequent context-switch delays—because everykernel entry by a process initiates a potential delay.
In a kernel with preemption points, the issues are very similarto those for the non-preemptable case. The maximum path lengthbetween preemption points corresponds to the maximum kernel pathlength in a non-preemptable kernel. Similarly, typical path lengthsbetween preemption points are equivalent to typical kernel pathlengths in a non-preemptable kernel.
In both a preemption-point approach and a non-preemptable kernelapproach, it becomes difficult to maintain any given specificationof responsiveness when code additions or modifications areexecuted. It requires frequent and careful repeats of measurements,and/or tight control over modifications.
In the preemptable kernel design, the issues are quitedifferent. With interrupt management (off/on operations aroundcritical sections), we must consider the maximum interrupt-offperiod, the number of critical sections, and the typical length ofsuch sections. Here again, re-measurement and/or very carefulengineering must be applied whenever code inside critical sectionsis modified, or when new critical sections are added. Note that anysystem modification may cause timing changes—sometimes hugeones—due to the impact on data, code, or address-translationcache content.
If a semaphore technique is used for critical-sectionmanagement, there will be no delay in initiating a context switchto a newly executable higher priority process. However, such aprocess may run into a locked critical region necessitating twoextra context switches before continuation: a switch back to theprocess holding the lock, followed by release of the lock, followedby a switch back to the higher-priority process. The frequency withwhich common locks (to lock common data structures or criticalsections) are used, and the time periods for which they are held,are the critical factors in determining effective process response.Latency problems can be avoided if the real-time process does notneed to use kernel services in response to externalinterrupts—or if it is guaranteed to use only kernel servicesthat don't use locks, or if it uses locks unique to itself.
The lock method causes extra delay in context-switch initiationduring execution of the critical section. However, it eliminatespotential delays later if the new running process attempts to enterthe same critical section.
The third critical area in kernel design that impactsresponsiveness (in this case, process responsiveness) involves thebehavioral characteristics of the scheduler. In a real-time controlsystem, the system designer must be able to specify the relativeimportance of operations—which are then executed by acombination of interrupt-service and process-level software.Process priorities are specified via an API, and in the case ofLinux and most current versions of Unix, by way of the POSIX1003.1b real-time API's (specifically, sched_setscheduler(2)).
The existence of the required API meets just one of therequirements. We also require that the scheduler must actuallyhonor the real-time priorities—by always selecting anddispatching the highest priority process first, regardless of otherfactors. In a third requirement, the scheduler must select anddispatch the highest priority process in a fixed (deterministic)time period. We refer to such a scheduler as a real-timescheduler.
Note that several other approaches exist for schedulers thatinclude timing attributes. These include rate monotonic scheduling,fair-share scheduling, and others—and such schedulers may bereferenced as having real-time attributes. Here, however, we definereal-time scheduling as the simple property of selecting anddispatching processes in their specified priority order—anddoing so quickly and deterministically.
For a Linux kernel, the design focuses on applicationthroughput—along with overall design simplicity and elegance.As a corollary to throughput, the Linux scheduler also attempts toprovide a fair share policy to ensure that all process groups getbalanced CPU resources.
These objectives have driven some basic Linux design choices forreal-time control applications. We have the following alternativesfor the Linux kernel:
- Use interrupt-off management for critical sections
- Make it non-preemptable
- Use a merged-process selection algorithm focused on ensuringaggregate throughput and overall forward progress of allprocesses
- Refuse to allow nested interrupts and interrupt prioritization(dependent on the hardware).
For Linux, an interrupt-off technique manages sections of codethat are critical relative to interrupt-service-routine operations.Research by MontaVista has identified a problem area in Linuxinvolving termination of a process that has a large number ofactive child processes. When a loop is executing with interruptsoff (with the number of child processes as the iteration parameter)this creates a theoretically unbounded interrupt-off period inLinux. Careful application design can avoid the problem (alsoMontaVista offers a software patch).
A uniprocessor system usually employs a non-preemptable Linuxkernel. For a throughput-oriented system, this is a reasonabledesign choice. Allowing kernel preemption would mean that manyoperations become critical sections that need protection using oneor more of the methods we have discussed. Such protection thencreates overhead—negatively impacting throughput. That is whyoperating system engineers argue that the design of a kernel forreal-time operation directly contradicts the design of a system forthroughput.
Note, however, that this reasoning may be flawed—thesignificance of I/O operations has not been factored into therationale. Generally speaking, overall throughput is enhanced whenthe entire computer system is fully used at all times. Simply put,the operating system must keep the I/O channels busy. Processes inan execution mode of short period of execution, initiate I/O, andwait should be given priority over those in a mode of long periodof execution.
The non-preemptability of Linux raises questions about thesuitability of the OS for real-time control applications. Initialmeasurements by MontaVista of non-preemptable paths in the Linuxkernel on a 266-MHz Pentium II processor show path lengths of over130-ms.
We will not attempt here to explain in detail the specificalgorithms used by the Linux scheduler to combine throughput andfairness. Suffice to say that when processes are given a real-timepriority, that priority does not provide the primary algorithmicdeterminant of process selection. Rather, the scheduler fullyreviews every executable process. If no real-time processes areencountered, then a combination of factors are used in finalselection—with an emphasis on keeping I/O channels busy,assuring that a single process group doesn't overly monopolize theCPU resource, and that process starvation doesn't occur. Ifexecutable real-time processes are found, then the highest priorityprocess is dispatched.
The problem of linear overhead in process selection and dispatch(as a function of the number of executable processes) has been asubject of extensive debate. Some argue that real-world systemsnever have more than a very small number of executable processes.Or, if they do, they are poorly designed systems or improperlyoverloaded). Others disagree. They point out that real-worldengineering teams don't always implement excellent designs.Instead, they create systems that may have large numbers ofexecutable processes, and then expect the scheduler to properlymanage the load.
At MontaVista, we believe both views may be correct. As acommercial company, we must meet customer needs, and those needsinclude having a process scheduler that will select and dispatch ahigh priority real-time process in a fixed period of time,independent of process loading on the CPU. We cannot just argueaway the perceived problem by blaming a customer's flaweddesign.
The first approach to solving these issues for use of Linux in areal-time control application is “don't”. In other words, designthe application solution around the problems. The followingapproaches are essential:
- If possible, address all real-time response needs directly withinterrupt service routines.
- Avoid known excessive interrupt-off periods in Linux. I/Odrivers can and do turn off interrupts. So the specific driversshould be evaluated for excessive interrupt-off operations, and ifproblems are found, they must be corrected.
- If a process component is required in the real-time controlpath, then consider aggregate system loading. (Might otherprocesses be running when the real-time process becomes executable?If so, what operations could those processes be performing? Isthere any risk of a long kernel call in execution? If so, can theapplication deal with the delay? If not, are there ways torestructure the application to avoid the problem?)
To avoid problems that can occur when a process component isrequired in the real-time control path, system designers shouldseriously consider using the MontaVista real-time scheduler. Thedefault Linux scheduler will not guarantee an important processwill get immediately and rapidly selected and dispatched.
Another approach to dealing with these issues is to use a hybridkernel. Specifically,
A possible problem with hybrid-kernel approaches is theircomplexity—from both design and implementation perspectives.Debugging a real-time control system can be difficult enough in asingle environment. Also, the question, “Can the real-timecomponent be addressed in a standard Linux interrupt serviceroutine?” really should be considered and addressed before jumpingto a hybrid-kernel solution. Again, the key issue with theinterrupt-service-routine solution is the impact of interrupt-offoperations in Linux. Also we must consider potential needs for moresoftware complexity and interaction in the real-time component thanis reasonable to embed in an interrupt service routine.
A third approach addresses Linux issues relative to real-timecontrol applications directly in Linux. There has been substantialwork in this area. Important examples include using preemptionpoints in the Linux kernel, a rate-monotonic scheduler developed atUCIrvine, the
MontaVista's approach to providing improved responsiveness inLinux proper is to execute the following plan:
- Measure the interrupt-off periods, provide the information forthe usage of application designers, and if possible, reduce thelength of any outliers (see the
MontaVista Web page forcurrent measurement information)
- Provide a priority driven real-time process scheduler withfixed and low overhead for process selection and dispatch
- Measure process preemption delays, provide the information forthe usage of application designers, and if possible, reduce thelength of any outliers
- Develop, assess, and provide a fully preemptable Linuxkernel.
The priority-driven scheduler provided by MontaVista isstructured as a kernel configuration option. The company recommendsthis scheduler for any application needing firm control of theexecution priority of selected processes. The scheduler selects anddispatches processes with a real-time priority (specified via thesched_setscheduler(2) system call), in priority order. Thisscheduler extends the real-time priority range from the defaultLinux range of 0-100 to 0-127.
The scheduler uses an optimized process-selection algorithm,based on a bit-map and a set of 128 executable process queues.Typically the priority of the most important process is known, andthe process can be immediately de-queued and executed. On occasion,the scheduler must perform a bit-map search to determine thehighest priority non-empty queue. The maximum overhead in such asituation is four tests for non-zero words, followed by a set-bitsearch through a 32-bit word. Hence, while there is somevariability in process selection time, the overhead isfixed—and extremely fast, even in slow mode.
Whenever there is no real-time process to execute, the schedulerdrops through to the standard Linux scheduler. The scheduler knowsthat there is no real-time executable task present, so the overheadincurred is almost zero. There is a very small impact in code sizeand in the data array of 128 process-list pointers.
During the development of the real-time scheduler, significantdefects in Linux were discovered—perhaps not for the firsttime. The yield (2) system call was found to not always yield toanother process (induce a context switch to another runnableprocess). This can be observed when yield is used as a basis forperforming context-switch measurements. The assumption of suchbenchmarks is that yield always induces a context switch to anotherexecutable process (of equal priority). Hence, such measurementscan actually measure substantially fewer context switches thanexpected.
The MontaVista scheduler package includes a correction to thedefect in the standard Linux scheduler. A related problem found,and fixed, is that the exhaustion of time slice in a real-timeprocess scheduled with a round-robin scheduling policy does notmake it yield to another executable process at the same priority.The company's process preemption delay study is still inprogress—results will be shared when the study iscomplete.
There has been much debate about restructuring the Linux kernelto be fully preemptable. MontaVista intends to explore this as aplausible alternative to dramatically improve Linux process-levelresponse times. The potential of such an approach to improvethroughput (by increasing the total parallelism of all operationsin the computing hardware complex) will be studied and measured.Note that such an improvement will be highlyapplication-dependent—clearly, a non-I/O oriented load willnot benefit. By leveraging the Linux (version 2.4) multiprocessorlocks as the critical-section management technique, we believe,though it remains to be proven, that Linux source code needs verylittle modification to implement such a system.