By Kevin D. Morgan
The Linux operating system, specifically the Linux kernel, was
designed and tuned by its creator, Linus Torvalds (and others) to
provide a combination of throughput and fairness in the allocation
of CPU resources. But, as Linux continues to move into industrial
and embedded computing environments, other requirements come to the
fore. Specifically, there is a growing need for guaranteed response
to external events. Also, the application designer may need to
rigorously control the allocation of CPU resources.
Three critical areas of kernel design impact responsiveness:
- External interrupt management
- Process (thread) preemptability
- Process (or thread) scheduling.
Here, the term "process" refers to the primary schedulable
entity in the Linux kernel. Many operating systems directly
schedule threads as entities separately represented in kernel data
structures, and, in those cases, the term "thread" would be
appropriate. However, because Linux uses most of the internal data
structure of a process to represent a single thread, "process" is
more appropriate.
External Interrupt Management
When an external interrupt occurs, an external interface and/or
device needs attention or service. A system considered real-time
must respond within a critical time window to many types of
external interrupts. Hard real-time systems are those where failure
of the system to respond by a particular deadline results in
catastrophic failure of the control system. In other words, the
deadline (whatever it may be) cannot be missed.
Soft real-time systems are those where the system under control
degrades with increasing delays in responding to interrupts, but
doesn't suffer a catastrophic failure. So missed deadlines, while
not desirable, are allowable (because the event can be retried
and/or recovered).
An important issue in external interrupt management is the use
of an interrupt-off processor mode. Such a mode can be used to
ensure the kernel is not reentered—when the hardware
acknowledges an external interrupt and then executes a software
interrupt handler. Reentrancy is a problem relative to critical
sections of code execution, which must be executed from beginning
to end without intervening operations to ensure correctness. A
simple example of a critical section occurs when we increment a
global integer variable—implemented in most machine
architectures as a load from memory, an increment in a register,
then a store back to memory. Intermixing these three operations
across multiple execution streams will cause errors in the value of
the stored integer.
Kernel architectures are frequently designed so that particular
data structures and operations are off limits to interrupt
handlers—so that the data structures and operations do not
need protection from interrupt handlers. For the most part, this is
the case in the Linux kernel. As a result, interrupt-off periods
are not used extensively, but they are used and they do impact
worst-case delays between the arrival of an external interrupt and
initiation of the interrupt handler.
Another critical design issue for interrupt management is the
ability of the system to nest interrupts—to accept a higher
priority interrupt while still servicing an interrupt. By providing
such a feature, the system designer can specify the relative
priority of external event servicing, and ensure that critical
external operations get priority of service by interrupt handlers.
Linux does allow nested interrupts, and most hardware allows the
priority of different interrupt types to be specified in some
way.
Process Preemptability
Many systems have relatively simple software for real-time
control; it can be entirely structured into the interrupt service
routine associated with a particular I/O interface or device. But
other control systems require more complex data-processing and/or
control logic running with real-time constraints. Such systems
employ user-level processes (single-threaded or multi-threaded),
which are scheduled (directly or via signals) from external
interrupt handlers. This type of architecture may cause system
delays before executing the real-time process that must be run in
response to an external event.
A simple approach to this problem would be to suppress or
eliminate other system activity. But that would be unrealistic in
most cases. A system complex enough to require real-time processes
at the user level will also include non-real-time (or lower
priority) processes for other operations. Examples of these
operations include data processing, event logging, user
interfacing, or other less time-critical applications. These
processes may be executing when a high-priority external event
generates an interrupt, causing the interrupt handler to schedule a
user-level process that has a priority higher than the executing
process.
If the interrupted process runs in user mode, the kernel can
(and usually will) switch context immediately to the newly
executable highest-priority process after exiting the interrupt
service routine. (However, as we shall see later, Linux may not do
that.) Of course, if the interrupted process executes in kernel
mode, then kernel design with respect to preemptability becomes a
major concern.
A truly preemptable kernel can switch from one process running
in kernel mode to a different process, which may then enter or
continue in kernel mode. A kernel that is not preemptable can
safely execute critical sections of code (e.g. an increment
operation on a global integer variable) without risk of intermixing
such operations and creating errors. Once again, we assume the
exclusion of external interrupt handlers operating on data
manipulated in the critical section.
A preemptable kernel must protect all critical sections from
reentrancy. This protection can be achieved in various ways. One
approach turns off all interrupts during execution of the critical
section, and enables the interrupts again upon exit. A second
approach uses semaphores—locking on entry to the critical
section, and unlocking on exit. An attempt to enter a locked
critical section results in a blocked running process. Rescheduling
and re-dispatching follows after the locking process continues
execution and exits the critical section.
A third hybrid kernel design (or a retro-design) allows
kernel-mode execution to be preempted only at fixed, known safe
points, typically called preemption points. A simplistic approach
to improving kernel responsiveness in a non-preemptable kernel
places preemption points at roughly-equal spacing—thus
reducing delays in initiating context switches to higher-priority
processes.
Finally, a fourth hybrid design uses a context-switch lock
around critical sections. Such a design sets and clears a global
flag upon entry and exit for critical sections. A set flag tells
the process scheduler that a context switch is not
allowed—requiring a straight return from interrupt, even if a
higher-priority process was scheduled by an interrupt service
routine. When the process holding the context-switch lock releases
it, the presence of a higher-priority executable process must then
be noted—allowing an immediate context switch to occur.
Each of the kernel designs has a specific impact on
process-level response to external interrupts.
A non-preemptable kernel raises two concerns:
- What is the worst-case path length (in terms of execution time)
through kernel code? This time will constitute the majority of the
worst-case delay in running a newly executable real-time process
(scheduled by an external interrupt handler).
- The second concern involves general responsiveness. In a soft
real-time system, a non-preemptable kernel will have generally
longer and more frequent context-switch delays—because every
kernel entry by a process initiates a potential delay.
In a kernel with preemption points, the issues are very similar
to those for the non-preemptable case. The maximum path length
between preemption points corresponds to the maximum kernel path
length in a non-preemptable kernel. Similarly, typical path lengths
between preemption points are equivalent to typical kernel path
lengths in a non-preemptable kernel.
In both a preemption-point approach and a non-preemptable kernel
approach, it becomes difficult to maintain any given specification
of responsiveness when code additions or modifications are
executed. It requires frequent and careful repeats of measurements,
and/or tight control over modifications.
In the preemptable kernel design, the issues are quite
different. With interrupt management (off/on operations around
critical sections), we must consider the maximum interrupt-off
period, the number of critical sections, and the typical length of
such sections. Here again, re-measurement and/or very careful
engineering must be applied whenever code inside critical sections
is modified, or when new critical sections are added. Note that any
system modification may cause timing changes—sometimes huge
ones—due to the impact on data, code, or address-translation
cache content.
If a semaphore technique is used for critical-section
management, there will be no delay in initiating a context switch
to a newly executable higher priority process. However, such a
process may run into a locked critical region necessitating two
extra context switches before continuation: a switch back to the
process holding the lock, followed by release of the lock, followed
by a switch back to the higher-priority process. The frequency with
which common locks (to lock common data structures or critical
sections) are used, and the time periods for which they are held,
are the critical factors in determining effective process response.
Latency problems can be avoided if the real-time process does not
need to use kernel services in response to external
interrupts—or if it is guaranteed to use only kernel services
that don't use locks, or if it uses locks unique to itself.
The lock method causes extra delay in context-switch initiation
during execution of the critical section. However, it eliminates
potential delays later if the new running process attempts to enter
the same critical section.
Process Scheduling
The third critical area in kernel design that impacts
responsiveness (in this case, process responsiveness) involves the
behavioral characteristics of the scheduler. In a real-time control
system, the system designer must be able to specify the relative
importance of operations—which are then executed by a
combination of interrupt-service and process-level software.
Process priorities are specified via an API, and in the case of
Linux and most current versions of Unix, by way of the POSIX
1003.1b real-time API's (specifically, sched_setscheduler(2)).
The existence of the required API meets just one of the
requirements. We also require that the scheduler must actually
honor the real-time priorities—by always selecting and
dispatching the highest priority process first, regardless of other
factors. In a third requirement, the scheduler must select and
dispatch the highest priority process in a fixed (deterministic)
time period. We refer to such a scheduler as a real-time
scheduler.
Note that several other approaches exist for schedulers that
include timing attributes. These include rate monotonic scheduling,
fair-share scheduling, and others—and such schedulers may be
referenced as having real-time attributes. Here, however, we define
real-time scheduling as the simple property of selecting and
dispatching processes in their specified priority order—and
doing so quickly and deterministically.
Real-Time Control with Linux
For a Linux kernel, the design focuses on application
throughput—along with overall design simplicity and elegance.
As a corollary to throughput, the Linux scheduler also attempts to
provide a fair share policy to ensure that all process groups get
balanced CPU resources.
These objectives have driven some basic Linux design choices for
real-time control applications. We have the following alternatives
for the Linux kernel:
- Use interrupt-off management for critical sections
- Make it non-preemptable
- Use a merged-process selection algorithm focused on ensuring
aggregate throughput and overall forward progress of all
processes
- Refuse to allow nested interrupts and interrupt prioritization
(dependent on the hardware).
For Linux, an interrupt-off technique manages sections of code
that are critical relative to interrupt-service-routine operations.
Research by MontaVista has identified a problem area in Linux
involving termination of a process that has a large number of
active child processes. When a loop is executing with interrupts
off (with the number of child processes as the iteration parameter)
this creates a theoretically unbounded interrupt-off period in
Linux. Careful application design can avoid the problem (also
MontaVista offers a software patch).
A uniprocessor system usually employs a non-preemptable Linux
kernel. For a throughput-oriented system, this is a reasonable
design choice. Allowing kernel preemption would mean that many
operations become critical sections that need protection using one
or more of the methods we have discussed. Such protection then
creates overhead—negatively impacting throughput. That is why
operating system engineers argue that the design of a kernel for
real-time operation directly contradicts the design of a system for
throughput.
Note, however, that this reasoning may be flawed—the
significance of I/O operations has not been factored into the
rationale. Generally speaking, overall throughput is enhanced when
the entire computer system is fully used at all times. Simply put,
the operating system must keep the I/O channels busy. Processes in
an execution mode of short period of execution, initiate I/O, and
wait should be given priority over those in a mode of long period
of execution.
Suitability of Linux for Real-Time Control
The non-preemptability of Linux raises questions about the
suitability of the OS for real-time control applications. Initial
measurements by MontaVista of non-preemptable paths in the Linux
kernel on a 266-MHz Pentium II processor show path lengths of over
130-ms.
We will not attempt here to explain in detail the specific
algorithms used by the Linux scheduler to combine throughput and
fairness. Suffice to say that when processes are given a real-time
priority, that priority does not provide the primary algorithmic
determinant of process selection. Rather, the scheduler fully
reviews every executable process. If no real-time processes are
encountered, then a combination of factors are used in final
selection—with an emphasis on keeping I/O channels busy,
assuring that a single process group doesn't overly monopolize the
CPU resource, and that process starvation doesn't occur. If
executable real-time processes are found, then the highest priority
process is dispatched.
The problem of linear overhead in process selection and dispatch
(as a function of the number of executable processes) has been a
subject of extensive debate. Some argue that real-world systems
never have more than a very small number of executable processes.
Or, if they do, they are poorly designed systems or improperly
overloaded). Others disagree. They point out that real-world
engineering teams don't always implement excellent designs.
Instead, they create systems that may have large numbers of
executable processes, and then expect the scheduler to properly
manage the load.
At MontaVista, we believe both views may be correct. As a
commercial company, we must meet customer needs, and those needs
include having a process scheduler that will select and dispatch a
high priority real-time process in a fixed period of time,
independent of process loading on the CPU. We cannot just argue
away the perceived problem by blaming a customer's flawed
design.
Current Linux Solutions
The first approach to solving these issues for use of Linux in a
real-time control application is "don't". In other words, design
the application solution around the problems. The following
approaches are essential:
- If possible, address all real-time response needs directly with
interrupt service routines.
- Avoid known excessive interrupt-off periods in Linux. I/O
drivers can and do turn off interrupts. So the specific drivers
should be evaluated for excessive interrupt-off operations, and if
problems are found, they must be corrected.
- If a process component is required in the real-time control
path, then consider aggregate system loading. (Might other
processes be running when the real-time process becomes executable?
If so, what operations could those processes be performing? Is
there any risk of a long kernel call in execution? If so, can the
application deal with the delay? If not, are there ways to
restructure the application to avoid the problem?)
To avoid problems that can occur when a process component is
required in the real-time control path, system designers should
seriously consider using the MontaVista real-time scheduler. The
default Linux scheduler will not guarantee an important process
will get immediately and rapidly selected and dispatched.
Another approach to dealing with these issues is to use a hybrid
kernel. Specifically, RTLinux or RTAI can
be used as an underlying kernel technology, providing hard
real-time behavior. Such kernels effectively run Linux as the idle
task. The real-time control components of the application are
threads executing in the underlying kernel environment. Meanwhile,
all non-time critical components can be standard Linux processes.
Interrupt off/on operations by Linux are emulated by the
lower-level kernel. So Linux semantics are not impacted, while
actual interrupts (particularly those handled by the lower-level
kernel tasks) are not blocked.
A possible problem with hybrid-kernel approaches is their
complexity—from both design and implementation perspectives.
Debugging a real-time control system can be difficult enough in a
single environment. Also, the question, "Can the real-time
component be addressed in a standard Linux interrupt service
routine?" really should be considered and addressed before jumping
to a hybrid-kernel solution. Again, the key issue with the
interrupt-service-routine solution is the impact of interrupt-off
operations in Linux. Also we must consider potential needs for more
software complexity and interaction in the real-time component than
is reasonable to embed in an interrupt service routine.
A third approach addresses Linux issues relative to real-time
control applications directly in Linux. There has been substantial
work in this area. Important examples include using preemption
points in the Linux kernel, a rate-monotonic scheduler developed at
UC
Irvine, the KURT
program at the University of Kansas, and the
Linux
SRT project.
Improving Linux Responsiveness
MontaVista's approach to providing improved responsiveness in
Linux proper is to execute the following plan:
- Measure the interrupt-off periods, provide the information for
the usage of application designers, and if possible, reduce the
length of any outliers (see the MontaVista Web page for
current measurement information)
- Provide a priority driven real-time process scheduler with
fixed and low overhead for process selection and dispatch
- Measure process preemption delays, provide the information for
the usage of application designers, and if possible, reduce the
length of any outliers
- Develop, assess, and provide a fully preemptable Linux
kernel.
The priority-driven scheduler provided by MontaVista is
structured as a kernel configuration option. The company recommends
this scheduler for any application needing firm control of the
execution priority of selected processes. The scheduler selects and
dispatches processes with a real-time priority (specified via the
sched_setscheduler(2) system call), in priority order. This
scheduler extends the real-time priority range from the default
Linux range of 0-100 to 0-127.
The scheduler uses an optimized process-selection algorithm,
based on a bit-map and a set of 128 executable process queues.
Typically the priority of the most important process is known, and
the process can be immediately de-queued and executed. On occasion,
the scheduler must perform a bit-map search to determine the
highest priority non-empty queue. The maximum overhead in such a
situation is four tests for non-zero words, followed by a set-bit
search through a 32-bit word. Hence, while there is some
variability in process selection time, the overhead is
fixed—and extremely fast, even in slow mode.
Whenever there is no real-time process to execute, the scheduler
drops through to the standard Linux scheduler. The scheduler knows
that there is no real-time executable task present, so the overhead
incurred is almost zero. There is a very small impact in code size
and in the data array of 128 process-list pointers.
During the development of the real-time scheduler, significant
defects in Linux were discovered—perhaps not for the first
time. The yield (2) system call was found to not always yield to
another process (induce a context switch to another runnable
process). This can be observed when yield is used as a basis for
performing context-switch measurements. The assumption of such
benchmarks is that yield always induces a context switch to another
executable process (of equal priority). Hence, such measurements
can actually measure substantially fewer context switches than
expected.
The MontaVista scheduler package includes a correction to the
defect in the standard Linux scheduler. A related problem found,
and fixed, is that the exhaustion of time slice in a real-time
process scheduled with a round-robin scheduling policy does not
make it yield to another executable process at the same priority.
The company's process preemption delay study is still in
progress—results will be shared when the study is
complete.
There has been much debate about restructuring the Linux kernel
to be fully preemptable. MontaVista intends to explore this as a
plausible alternative to dramatically improve Linux process-level
response times. The potential of such an approach to improve
throughput (by increasing the total parallelism of all operations
in the computing hardware complex) will be studied and measured.
Note that such an improvement will be highly
application-dependent—clearly, a non-I/O oriented load will
not benefit. By leveraging the Linux (version 2.4) multiprocessor
locks as the critical-section management technique, we believe,
though it remains to be proven, that Linux source code needs very
little modification to implement such a system.
About the Author
Kevin Morgan has 20 years of experience
developing embedded and real-time computer systems for
Hewlett-Packard. While at Hewlett-Packard, he worked as an
engineer, project manager and section manager spanning the
development of five operating systems and was a member of the HP
1000 computer software design team. Most recently serving as HP-UX
Operating System Laboratory Manager, Kevin was responsible for
overall HP-UX release planning, execution and delivery. He is now
working with Linux as the vice president of engineering at
MontaVista Software.