
Advanced Embedded X86 Programming: Protection and Segmentation
By Jean Gareau
In this second part of a three-part series, the author explains how protection
and segmentation are implemented in protected mode.
This article is the second in a series of three describing protected mode features of the 386 up to the Pentium. Because this family of CPUs is extremely popular and is affecting the embedded industry, RTOSes and embedded applications can be updated to take advantage of its 32-bit programming capabilities and larger, simpler memory models. Development tools also obtain some benefits because popular memory models, such as the flat memory model, are simpler to support.
The previous article introduced the 32-bit protected mode, in which each segment register is an index to a table of descriptors, each descriptor describing a memory segment by its base address, limit, type, and protection fields. A 32-bit linear address is produced from a segment/offset logical address by adding the offset to the base address found in the descriptor that's pointed to by the segment. That article also demonstrated how to switch from real mode to protected mode.
This article explains how protection and segmentation are implemented in protected mode to design robust and reliable systems. These qualities are becoming more important as some embedded systems now include multiple tasks that run concurrently, such as an embedded Java Virtual Machine that supports concurrent Java applets. Whereas the lack of protection allows tasks to access, corrupt, and destroy anything from other tasks' data to the operating system (OS) internal structures, protection permits the building of robust and powerful applications by restricting them to their own code and data. Another advantage offered by protection is the ability for the OS to remove a faulty task (such as one that copies memory outside its address space, a typical problem with memcpy() in C). For example, when a task commits a fault, the CPU can notify the OS, which can then halt the task before any damage is done; the other tasks continue undisturbed. Protection also simplifies task development: when an error is detected, protection helps to pinpoint the exact problem where it happened, so developers don't have to search for it. In other words, protection simplifies building and deploying safe and secure embedded tasks that can work together but not harm each other.
Complete examples, fully documented and tested, that implement various kernel designs can be downloaded from www.embedded.com/code.htm. These examples demonstrate the concepts I've explained in this series, including a port of µC/OS to protected mode.[1] The source code is provided as ready-to-run executables and additional tools. These examples provide an excellent start to help you improve your applications or even implement your own system.
WHAT PROTECTION?
The protection we're talking about here is three-fold: 1) to prevent a task from executing privileged instructions, such as setting or clearing the interrupt flag; 2) to prevent a task from accessing another task's code and data; and 3) to prevent tasks from calling privileged kernel code in an uncontrolled fashion or corrupting privileged kernel data. This protection doesn't deal with user authentication, since this concept is implemented in the OS itself, not in the CPU. All protection features are implemented in the CPU and activated by the OS, freeing application developers from having to worry about protection.
In the x86, protection is a feature associated with segments, and automatically kicks in when the CPU executes in protected mode.
Whenever a segment register is referred to, the CPU accesses the related descriptor and analyzes its control bits. If the operation doesn't agree with these bits, the processor raises an exception, which is typically caught by the OS, resulting in the application being shut down. Examples include writing to a code segment (code segments cannot normally be altered), jumping into a data segment (data isn't executable), and so on.
Another protection check involves checking the offset used in the address calculation against the segment's limit. If an operation tries to address beyond the limit, an exception is raised. The most common example is the use of an incorrect pointer or an invalid jump. This limit check feature is useful because it constrains a task to its own segments. Many tasks might be in memory at once, but if they all have their own separate segments, they can't see each other, and therefore can't alter one another.
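The type and limit checks described above can be modeled in plain C. The sketch below is only a software illustration of the rules, not how the CPU implements them: segment_model, access_kind, segment_access_ok(), and linear_address() are hypothetical names, the limit is assumed to be expressed in bytes, and code segments are assumed to be readable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified software model of a segment (not the hardware descriptor layout). */
typedef struct {
    uint32_t base;      /* linear address where the segment starts */
    uint32_t limit;     /* highest valid offset, in bytes (assumption) */
    bool     is_code;   /* true for a code segment, false for a data segment */
    bool     writable;  /* only meaningful for data segments */
} segment_model;

typedef enum { ACCESS_READ, ACCESS_WRITE, ACCESS_EXECUTE } access_kind;

/* Returns true when the access would pass; false stands for "exception raised". */
bool segment_access_ok(const segment_model *seg, uint32_t offset,
                       uint32_t size, access_kind kind)
{
    assert(seg != NULL && size > 0);

    /* Limit check: the last byte touched must not lie beyond the limit. */
    if (offset > seg->limit || size - 1 > seg->limit - offset)
        return false;

    /* Type check: code is not writable, data is not executable. */
    switch (kind) {
    case ACCESS_EXECUTE: return seg->is_code;
    case ACCESS_WRITE:   return !seg->is_code && seg->writable;
    case ACCESS_READ:    return true;
    }
    return false;
}

/* Linear address = descriptor base + offset (valid only after the checks pass). */
uint32_t linear_address(const segment_model *seg, uint32_t offset)
{
    assert(seg != NULL);
    return seg->base + offset;
}
```

With a 64KB data segment (limit 0xFFFF), a four-byte write at offset 0xFFFC passes, while the same write at offset 0xFFFD is rejected because its last byte falls beyond the limit.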
Each task typically requires at least two descriptors, one for code access and one for data access. Code segments are read-only and can't be used to modify data, hence the need for a second segment for data access. The descriptors normally don't cover the same address space (i.e., they don't overlap), in order to maintain the initial protection (not to overwrite code, and so forth). Through its own descriptors, a task is restricted to its code and data. The flat memory model's segments do overlap, but this model is normally used in conjunction with paging, the subject of part three of this series of articles.
The previous article explained that descriptors were kept in either the global descriptor table (GDT) or local descriptor table (LDT). Let's forget for a moment about the LDT and suppose that all the code and data descriptors of all tasks are kept in the GDT. One problem that arises at this point is that any task can load another task's data descriptor from the GDT and alter this other task's data segment. The problem exists because the GDT is unique and a task can theoretically access any descriptor in it. Providing a distinct address space per task isn't sufficient in itself; the solution comes through the use of the LDTs, task state segments (TSSes), privilege levels, and gate descriptors, all explained in the following sections.
LDTs, like the GDT and the interrupt descriptor table (IDT), are built by the OS, not by the task. A given LDT usually contains a task's code and data descriptors, and is built when the task is loaded in memory. If the task is going to be in ROM, the LDT can be hard-coded and ROMable. Although a system may have more than one LDT, only one is active at a time (see Figure 1), pointed to by the LDTR register. Note that no LDT might be active at all, if none is referred to.
An LDT is described by a descriptor whose base address is the LDT address in memory. The limit is the LDT size, as it's useful to limit the number of entries in it.
In a system where each task has its own LDT, you can keep the LDT selector in the task control block (or a similar structure) to identify this LDT as the current one when the task is selected to run. Such a system also provides a good deal of protection, because each task sees either the GDT or its LDT, but not the other tasks' LDTs. Consequently, a task can't alter or even look at another task's code and data, a must with segmentation.
If a system is designed to provide such isolation among the tasks, the OS must provide primitives to transfer information among tasks (to send and receive messages, for instance); this is something tasks can't do by themselves because they're restricted to their own address space.
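As a rough illustration of keeping selectors in a task control block, the sketch below builds the selectors of a task that owns a private LDT. The selector layout (table index in bits 15 to 3, a table indicator in bit 2, and a requested privilege level in bits 1 to 0) comes from the Intel architecture; the requested privilege level is not covered in this article and is simply set to the task's level here. The task_control_block type, make_selector(), make_task_selectors(), and the choice of LDT entries 0 and 1 for code and data are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SEL_TI_GDT 0u   /* selector refers to the GDT */
#define SEL_TI_LDT 1u   /* selector refers to the current LDT */

/* Build a 16-bit selector: index (bits 15..3), table indicator (bit 2), RPL (bits 1..0). */
uint16_t make_selector(uint16_t index, unsigned ti, unsigned rpl)
{
    assert(index < 8192u && ti <= 1u && rpl <= 3u);
    return (uint16_t)(((unsigned)index << 3) | (ti << 2) | rpl);
}

/* Hypothetical per-task bookkeeping kept by the OS. */
typedef struct {
    uint16_t ldt_selector;   /* GDT selector of this task's LDT descriptor */
    uint16_t code_selector;  /* selector of the code descriptor inside the LDT */
    uint16_t data_selector;  /* selector of the data descriptor inside the LDT */
} task_control_block;

/* Assumes the task's LDT holds its code descriptor in entry 0 and its data in entry 1. */
task_control_block make_task_selectors(uint16_t ldt_gdt_index)
{
    task_control_block tcb;
    tcb.ldt_selector  = make_selector(ldt_gdt_index, SEL_TI_GDT, 0);
    tcb.code_selector = make_selector(0, SEL_TI_LDT, 3);
    tcb.data_selector = make_selector(1, SEL_TI_LDT, 3);
    return tcb;
}
```

When the scheduler picks a task, the OS would load ldt_selector with the lldt instruction; from then on, the task's code and data selectors resolve through that LDT only.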
PRIVILEGE LEVELS
Despite the LDTs, a task can still refer to any GDT descriptor and alter some data. To prevent this from happening, privilege levels are introduced. The Intel CPUs support four levels: 0 (most privileged) to 3 (least privileged). Level 0 allows the execution of privileged instructions (such as setting or clearing interrupts, accessing I/O ports, loading the GDT, LDT, or IDT, and the like), whereas this is normally forbidden at any other level. Operating systems must run at level 0 to have no restriction at all, whereas you must decide whether device drivers, OS extensions, and application tasks run at level 0, 1, 2, or 3. Using level 0 for all system software and level 3 for application software, while simply ignoring levels 1 and 2, is very common. Tasks running at level 3 can neither execute privileged instructions nor alter the state of the system.
A privilege level is always associated with the currently executing code. When entering protected mode, the current privilege level (CPL) is 0 (the highest) because OS code is expected to be running. All descriptors contain a descriptor privilege level (DPL), which is a two-bit value (zero to three) identifying the privilege of the related segment. The DPL has a different meaning depending on the segment type (code or data).
When the code segment register (the CS register) is loaded with a valid code selector (via a jump, a call, or returning from a function or an interrupt), the CPU examines the descriptor and the DPL becomes the CPL, as seen in Figure 2. For instance, once the OS initialization (running at CPL 0) is completed, the first task is executed by loading CS with a selector referring to the task's code descriptor with a DPL of 3; consequently, the task starts running at CPL 3.
But there is a trick: when loading CS, the CPL can never become more privileged; it has to stay at the same or a lesser privilege. Thus, a CPL 3 task cannot load CS with a selector referring to a descriptor with a DPL of 0, 1, or 2; doing so will raise an exception. The OS would typically catch that exception, analyze it, delete the faulty task, and schedule another one.
When a data segment register is referred to, the DPL of the related descriptor indicates the minimum CPL required to access it: the CPL must have the same or a higher privilege than the DPL. Thus, a CPL 2 task cannot refer to a data descriptor carrying a DPL of 0 or 1; it can only access data descriptors with a DPL of 2 or 3.
The OS, by carefully setting the DPL of all GDT and LDT descriptors, prevents unauthorized tasks from accessing sensitive or protected code and data. For instance, all OS descriptors are set with a DPL of 0, whereas all task descriptors are marked with a DPL of 3. Consequently, tasks cannot directly access OS code and data. As far as the applications are concerned, they only see their code and data, nothing else.
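The data-access rule above reduces to a numeric comparison, because a smaller number means a higher privilege. The following sketch is a hypothetical helper written for illustration only; it models the CPL and DPL comparison described in the text and deliberately leaves out the selector's requested privilege level, which the real CPU also takes into account.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the data-segment privilege check: access is granted when the
 * current privilege level is at least as privileged as the descriptor's,
 * that is, when CPL is numerically less than or equal to DPL.
 * Simplification: the selector's requested privilege level is ignored.
 */
bool data_access_allowed(unsigned cpl, unsigned dpl)
{
    assert(cpl <= 3u && dpl <= 3u);
    return cpl <= dpl;
}
```

For a task running at CPL 2, the function rejects descriptors with a DPL of 0 or 1 and accepts those with a DPL of 2 or 3, matching the example in the text.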
TASK STATE SEGMENTS
Before we explore privilege levels in detail, we must introduce task state segments (TSSes). A TSS is a placeholder for all the registers of a task when that task doesn't run. As with LDTs, only one TSS is active at a time, and it is interpreted at some point by the CPU. It is also described by a descriptor that indicates the base address, the size (which may vary because extra data can be stored in each of them), the protection, and the type, which in this case is a TSS. The TR register contains the selector of the active TSS.
Having one TSS per task in a segmented system is common. In this case, TSS descriptors are usually kept in the GDT, with a DPL of 0 to prevent a task (with a CPL of 1, 2, or 3) from accessing them. A task's TSS selector can also be stored in a task control block for quick reference. The operating system can indicate which TSS is active by executing the ltr instruction (load task register).
TSSes are special: a far jump to a TSS selector (the offset is ignored) makes a complete context switch from the current task to the task referred to by the selected TSS. That switch not only saves and reloads the task registers, but also manages the segment registers, the current LDT selector, the current TSS selector, and so on, all with a single instruction. Keeping all LDT and TSS descriptors in the GDT is best, to ensure their accessibility during the switch.
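To make the TSS more concrete, here is a sketch of the 32-bit TSS as a C structure. The field order follows the 104-byte layout documented by Intel for the 386; the tss32 type, its field names, and the tss_init() helper are choices made for this example, and segment selectors are assumed to occupy the low 16 bits of their 32-bit slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit TSS as laid out in memory; every slot is 4 bytes wide. */
typedef struct {
    uint32_t back_link;                  /* selector of the previous TSS */
    uint32_t esp0, ss0;                  /* stack used when entering level 0 */
    uint32_t esp1, ss1;                  /* stack used when entering level 1 */
    uint32_t esp2, ss2;                  /* stack used when entering level 2 */
    uint32_t cr3;                        /* page directory base (paging) */
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx;
    uint32_t esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;     /* selectors in the low 16 bits */
    uint32_t ldt;                        /* selector of the task's LDT */
    uint16_t trap;                       /* bit 0: debug trap on task switch */
    uint16_t io_map_base;                /* offset of the I/O permission map */
} tss32;

_Static_assert(sizeof(tss32) == 104, "a 32-bit TSS is 104 bytes long");

/* Clear a TSS and set the level-0 stack that the CPU switches to on a privilege increase. */
void tss_init(tss32 *tss, uint16_t kernel_ss, uint32_t kernel_esp)
{
    assert(tss != NULL);
    memset(tss, 0, sizeof *tss);
    tss->ss0 = kernel_ss;
    tss->esp0 = kernel_esp;
    /* Assumption: no I/O permission map, so the base points past the end of the TSS. */
    tss->io_map_base = (uint16_t)sizeof *tss;
}
```

An OS that switches tasks in software still needs at least one such structure whenever tasks run at a lower privilege than the kernel, because the CPU reads ss0 and esp0 from the current TSS.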
MIXING PRIVILEGE LEVELS
Maximum protection is achieved by mixing privilege levels: the OS is given full privileges, while the tasks receive no special privileges. Luckily, the x86 has much to offer in order to mix privileges.
If you want to execute higher-privileged code, such as directly calling an OS function at CPL 0 from a CPL 3 task, you can use call gates (see Figure 3). Call gates are a special type of descriptor and can reside in the GDT (making them sharable among all tasks) or a task's LDT (making them private to that task). A call gate is no more than an indirect, controlled call to a more privileged function, typically an OS service. The call gate contains a code selector and the address of the function to call within that selector. The code selector usually has a higher privilege, allowing the system service to run with adequate privileges. Call gates are initialized and maintained by the OS, but used by the tasks.
Call gates
are accessed via a far call (a selector/offset combination), though only the selector is meaningful (the offset is discarded). Call gates can be hidden in a normal function (such as open()), making them invisible to the application programmers.
The CPU ensures that the current task has enough privilege to use the call gate. For instance, if the call gate has a DPL of 2, only tasks executing at CPL 0, 1, or 2 can use it; tasks running at CPL 3 are excluded. But it is common to set all call gates' DPL to 3, to make them available to all tasks. Also, the target code segment's DPL must have the same or a higher privilege than the CPL. For example, if a task running at CPL 2 uses a call gate that refers to a segment with a DPL of 3, a fault is triggered. Call gates are only used to increase privilege levels, not to decrease them; otherwise, upon returning, there would be an uncontrolled privilege increase (which would be disastrous if the return address had been altered by the task). For that same reason, when the system service terminates, various checks are performed to ensure that control is returned to a code segment of the same or lesser privilege.
Note that executing a far call directly to a more-privileged segment, without a gate, is possible as long as it is a conforming segment. A segment is conforming when a special control bit is set in its descriptor. Such a segment conforms to its caller in that it executes under the caller's CPL. For instance, if the current task runs at a CPL of 3 and calls a function in a conforming segment of DPL 2, that function will also run at CPL 3. Thus, calling a conforming segment doesn't alter the calling task's CPL. Conforming segments are a useful way to implement system libraries callable by any task, regardless of its privilege. In fact, there is no other way to call a more-privileged segment without going through a gate. However, conforming segments are still quite rare, since libraries (such as the C library) are usually bound to applications at link time, not run time.
But a call gate isn't enough to ensure a successful execution: enough stack space must be provided for the OS service to run. Because the calling task may have very little stack space, the call gate will perform a stack switch if the privilege level is increased. The TSS is important here because, in addition to the task's registers, it holds stack pointers for privilege levels 0, 1, and 2, all initialized by the OS. Here's an example of how it works: a CPL 3 task uses a call gate to execute a system service at DPL 0; the task's TSS (the current TSS) is looked up to get the privilege 0 stack pointer (selector/offset), and this value becomes the effective stack. The original task's stack pointer (selector/offset) is pushed on the new stack to have a link back to it, as shown in Figure 4. In addition to the stack switch, up to 31 double-word parameters (the gate's count field is five bits wide) can be copied from the task's stack to the new stack, which is a convenient feature. The number of parameters per call gate is fixed; call gates do not support a variable number of parameters.
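The sketch below shows how an OS might fill in a 32-bit call gate and how the two privilege conditions mentioned earlier can be expressed. The eight-byte layout (offset split in two halves around the selector, a five-bit parameter count, and an access byte holding the present bit, the DPL, and type 0xC) follows the Intel descriptor format; call_gate, make_call_gate(), and call_gate_usable() are hypothetical names, and the check ignores the selector's requested privilege level for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 32-bit call gate descriptor as stored in the GDT or an LDT. */
typedef struct {
    uint16_t offset_low;   /* entry point offset, bits 15..0 */
    uint16_t selector;     /* code selector of the OS service */
    uint8_t  param_count;  /* double words copied to the new stack (0..31) */
    uint8_t  access;       /* present bit, DPL, and gate type */
    uint16_t offset_high;  /* entry point offset, bits 31..16 */
} call_gate;

_Static_assert(sizeof(call_gate) == 8, "a descriptor is 8 bytes long");

#define GATE_PRESENT     0x80u
#define GATE_TYPE_CALL32 0x0Cu

call_gate make_call_gate(uint16_t code_selector, uint32_t entry_offset,
                         unsigned dpl, unsigned param_count)
{
    assert(dpl <= 3u && param_count <= 31u);
    call_gate g;
    g.offset_low  = (uint16_t)(entry_offset & 0xFFFFu);
    g.selector    = code_selector;
    g.param_count = (uint8_t)param_count;
    g.access      = (uint8_t)(GATE_PRESENT | (dpl << 5) | GATE_TYPE_CALL32);
    g.offset_high = (uint16_t)(entry_offset >> 16);
    return g;
}

/*
 * The caller must be at least as privileged as the gate (CPL <= gate DPL),
 * and the target code must be at least as privileged as the caller
 * (target DPL <= CPL), since gates never lower the privilege.
 */
bool call_gate_usable(unsigned cpl, unsigned gate_dpl, unsigned target_dpl)
{
    assert(cpl <= 3u && gate_dpl <= 3u && target_dpl <= 3u);
    return cpl <= gate_dpl && target_dpl <= cpl;
}
```

A gate created with a DPL of 3 and a target at DPL 0 is usable from a CPL 3 task, whereas the same gate with a DPL of 2 is not.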
The alternative to using call gates is to use a trap interface, which consists of calling OS functions by raising software interrupts. A trap interface involves the interrupt table (IDT), which can contain three types of descriptors:
* Interrupt gate, which refers to a specific function, usually in the kernel
* Trap gate, which is similar to an interrupt gate
* Task gate, which points to a TSS descriptor
Like call gates, invoking a function through the IDT is a way to increase the CPL. Each IDT descriptor, like any other descriptor, contains a DPL, usually set to 0 by the OS. Such a DPL prevents unprivileged tasks from directly triggering the interrupt (via the int instruction). Note that some IDT descriptors may have a less-privileged DPL (such as 3), making them callable by the tasks, to implement system calls. These software interrupts or traps may also be hidden in a library function (such as close()), to hide them from the application programmers.
Whenever a hardware interrupt is raised or a valid software interrupt is encountered, the related descriptor of the IDT is analyzed and the CPL is set to the DPL of the code segment that the gate refers to (as seen in Figure 5). For an interrupt gate, execution starts at the address found in the descriptor (typically an interrupt handler in the OS) with the interrupts disabled; when the handler terminates, extra protection checks occur to ensure a proper return to the caller. Trap gates are almost identical: the same processing happens, but with the interrupts enabled. With a task gate, a context switch occurs (as described earlier with the TSS). Note that task gates in the IDT are not a convenient way to implement multitasking, because context switches normally occur under OS conditions (time slice expired, system calls that make a higher-priority task ready, and the like), rather than raised interrupts. Nevertheless, they can be a useful way to invoke special tasks, such as a debugger.
The CPU invokes the handlers described by the interrupt or trap gates in a way similar to call gates: if the privilege is increased, a stack switch occurs using the TSS to get the new stack pointer. As opposed to a call gate, though, no parameter can be copied. If parameters are needed, they can be passed via the registers or they may be traced back via the task's stack pointer, which is on the new stack.
Choosing between call gates and interrupt/trap gates to execute an OS service depends on how many system calls you have and whether they require arguments.
If you have many system calls, an interrupt/trap gate is better because it offers a single entry point in the kernel; however, a register must be set to identify the service called. On the other hand, because a call gate refers to one function, having many system calls implies many call gates. Since high-end systems might have hundreds of calls, call gates might be tough to maintain. Moreover, if they are placed in the LDT (to prevent some specific tasks from using them), things can get very complicated.
If the task passes parameters, call gates allow you to transfer them into the more privileged stack, from where the system call can access them as local parameters (easy). Via the interrupt/trap interface, you might have to trace back the calling stack (which is just annoying).
If you prefer call gates (because of the fixed parameter transfer facility) and you have hundreds of system calls, only a few gates with
specific numbers of parameters (one, two, four, and so forth) may be all it takes. All system calls that have one parameter go through the call gate with one parameter, and so on. Each call requires a register/parameter to identify the service because many services converge toward the same call gates.
As I mentioned earlier, call gates can only transfer a fixed number of parameters. If some services have a variable number of parameters, an argument count and an argument pointer must be passed instead.
Call gates require far pointers, whereas traps or interrupts are simply triggered via one instruction (no pointer nor segment register at all), which is faster.
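The single-entry-point approach can be sketched as a table-driven dispatcher. In the hypothetical example below, the handler reached through the interrupt/trap gate would read the service number and two arguments from registers and pass them to syscall_dispatch(); the service numbers, the mailbox-based sys_send() and sys_receive() stubs, and the error codes are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS    4
#define SYS_EBADCALL (-1)   /* unknown service number */
#define SYS_EINVAL   (-2)   /* invalid argument */

enum { SYS_SEND = 0, SYS_RECEIVE = 1, SYS_COUNT };

typedef int32_t (*syscall_fn)(uint32_t a0, uint32_t a1);

/* Toy one-slot mailbox per task, standing in for real message queues. */
static uint32_t mailbox[MAX_TASKS];

static int32_t sys_send(uint32_t dest, uint32_t msg)
{
    if (dest >= MAX_TASKS)
        return SYS_EINVAL;
    mailbox[dest] = msg;
    return 0;
}

static int32_t sys_receive(uint32_t self, uint32_t unused)
{
    (void)unused;
    if (self >= MAX_TASKS)
        return SYS_EINVAL;
    return (int32_t)mailbox[self];
}

static const syscall_fn syscall_table[SYS_COUNT] = { sys_send, sys_receive };

/* Called by the trap handler with the values the task placed in registers. */
int32_t syscall_dispatch(uint32_t service, uint32_t a0, uint32_t a1)
{
    /* The number comes from an unprivileged task, so it must be validated. */
    if (service >= SYS_COUNT)
        return SYS_EBADCALL;
    assert(syscall_table[service] != NULL);
    return syscall_table[service](a0, a1);
}
```

The same table could sit behind a small set of call gates grouped by parameter count, since both designs need a service identifier once several services share one entry point.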
A word of caution: whenever privilege checks are performed by the CPU, execution cycles increase dramatically. Here are some examples on a 386 (for reference, the fastest instruction, excluding lock, requires two cycles):
* Operations on descriptors, such as lsl (load segment limit), lar (load access rights byte), and the like usually take more than 10 cycles. Directly accessing the tables where they reside might be a faster way to get the information
* Loading a segment register in itself takes at least 18 cycles, compared to two with general registers. (This extra time has to do with the internal validation of the descriptor.) Remember that within a task, you should load a segment register only if the new value is different (the comparison instruction itself only takes two cycles)
* Loading the current LDT (the lldt instruction) takes 20 cycles
* Loading the task register, which sets the current TSS (the ltr instruction), takes at least 23 cycles
* A call/interrupt/trap gate to a higher-privilege descriptor takes a minimum of 90 cycles (and it increases with the number of parameters for a call gate)
* And the worst case: a task switch through a TSS takes more than 300 cycles! TSS task switches are only useful if all registers, especially the segment registers, must be reloaded with new values
IMPLEMENTATION EXAMPLES
These four features (LDT, TSS, privilege levels, and the variety of gates) can be used in a multitude of ways to satisfy specific needs.[2] Following are some examples of implementing protection in an OS, from the easiest to the hardest.
Case 1.
The OS and a single task run at privilege 0 (see Figure 6). This case is easy to achieve and is ideal for a simple, dedicated, real-time 32-bit controller, such as a fuel-air mixture analyzer that needs 32-bit registers to perform calculations with a certain precision.
* The OS and the task form one combined image
* Two entries are required in the GDT (in addition to the first entry): one code descriptor (zero to 4GB, DPL 0), one data descriptor (zero to 4GB, DPL 0). Segment registers always refer to these code and data descriptors
* Hardware interrupts are implemented via interrupt gates, all DPL 0, which call interrupt handlers
* Neither call gates nor task gates are used; the system services can be called directly
* A TSS isn't required because there is only one task and no privilege transition
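The three-entry GDT of Case 1 can be sketched as follows. The bit layout of a segment descriptor (limit and base split across the eight bytes, an access byte, and a flags nibble) follows the Intel format; the access values 0x9A and 0x92 and the 0xC flags (4KB granularity, 32-bit segment) are the usual choices for flat DPL 0 code and data segments. The make_descriptor() and build_flat_gdt() helpers are hypothetical, and storing each descriptor as a uint64_t assumes a little-endian target such as the x86 itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACCESS_CODE_DPL0 0x9Au  /* present, DPL 0, code, executable and readable */
#define ACCESS_DATA_DPL0 0x92u  /* present, DPL 0, data, readable and writable */
#define FLAGS_4K_32BIT   0xCu   /* limit counted in 4KB pages, 32-bit segment */

/* Pack a base, a 20-bit limit, an access byte, and a flags nibble into a descriptor. */
uint64_t make_descriptor(uint32_t base, uint32_t limit20,
                         uint8_t access, uint8_t flags)
{
    assert(limit20 <= 0xFFFFFu && flags <= 0xFu);
    uint64_t d = 0;
    d |= (uint64_t)(limit20 & 0xFFFFu);              /* limit 15..0  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;         /* base 23..0   */
    d |= (uint64_t)access << 40;                     /* access byte  */
    d |= (uint64_t)((limit20 >> 16) & 0xFu) << 48;   /* limit 19..16 */
    d |= (uint64_t)flags << 52;                      /* flags nibble */
    d |= (uint64_t)(base >> 24) << 56;               /* base 31..24  */
    return d;
}

/* Entry 0 is the mandatory null descriptor; entries 1 and 2 cover zero to 4GB. */
void build_flat_gdt(uint64_t gdt[3])
{
    assert(gdt != NULL);
    gdt[0] = 0;
    gdt[1] = make_descriptor(0, 0xFFFFFu, ACCESS_CODE_DPL0, FLAGS_4K_32BIT);
    gdt[2] = make_descriptor(0, 0xFFFFFu, ACCESS_DATA_DPL0, FLAGS_4K_32BIT);
}
```

With this table loaded, the code selector is 0x08 and the data selector is 0x10, and the segment registers never need to change afterward.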
Case 2.
The OS and many tasks run at level 0. This model would be best for a multitasking, real-time kernel, such as µC/OS. This case is ideal for a braking system that needs multiple tasks to simultaneously control hydraulic systems, control braking force, collect statistical data, and the like. Such a system has a similar architecture to the previous case.
* Because all tasks have the highest privilege, they can share one single segment, so there is no need to use LDTs. Thus, only one code and one data descriptor (DPL 0, zero to 4GB) are required in the GDT
* Interrupts are implemented via interrupt gates (DPL 0)
* TSSes aren't required because no privilege transition exists and segment registers don't change. Task switches are done by saving application registers on the stack, switching stacks, and restoring registers from the new stack
Case 3.
The OS runs at level 0 and many tasks run at level 3 (see Figure 7). This model would be best for a simple system running untrusted tasks, such as an embedded Java Virtual Machine that supports unknown Java applets.
* All descriptors in the GDT have a DPL of 0 to prevent tasks from using them directly
* The GDT has one code and one data descriptor (DPL 0, zero to 4GB) for kernel use only
* Each task runs in its own address space, and needs its private LDT with two entries: one for the code and one for the data. For a flat memory model, the code and data segments of a given task may overlap; for better protection, they may be distinct. In the latter case, each task must be built apart (not linked with the kernel) using a small memory model, and loaded in order to be run (a loader is required). Tasks run at privilege 3 (which prevents accessing kernel code and data directly). LDTs prevent tasks from seeing each other. LDT descriptors reside in the GDT
* The IDT contains interrupt gates (which keep interrupts disabled when the handler is called) referring to interrupt handlers in the kernel, at DPL 0, to ensure that kernel code always runs at level 0
* TSSes can't be avoided because a privilege transition requires a stack switch, which is done from the current TSS. TSS descriptors reside in the GDT
* If a flat memory model is retained, system services may be callable through interrupts (to avoid the far system calls required with call gates)
* Message-based systems usually have few
system calls (send, receive), which may equally be called via interrupts or call gates, passing parameters via registers
Case 4.
The OS runs at level 0, system libraries at level 1, device drivers at level 2, and many tasks at level 3, each of them with multiple segments. This case is a complicated variant of Case 3 (more descriptors, more privilege levels) and requires more effort to implement. This case would be best for a high-end system rather than an embedded one. Compared to Case 3:
* System libraries are accessed through call gates that reside in the GDT, making them available to all tasks. Application libraries may hide these call gates from the tasks
* Device drivers have their own code and data segments, at DPL 2, in the GDT (if they are public) or in their own LDT (if they are accessible by the kernel only)
* Each task has its own TSS, and for this case, switching via task state segments might be justifiable, since all registers must be changed
A system such as the one in Case 4 is hardly justifiable, since it can be simplified and rendered more powerful by using paging, which is the subject of the next article.
Jean Gareau received an M.S. in electrical engineering from the Polytechnic School of the University of Montreal. Since 1989, he has been involved in the development of operating systems, system tools, and large commercial applications. He can be reached at jeangareau@yahoo.com.
REFERENCES
1. µC/OS is a portable, ROMable, preemptive, real-time, multitasking kernel for microprocessors, and it's free!
2. So many possible combinations actually exist that this extra flexibility is a difficulty in itself.
BIBLIOGRAPHY
Intel Corp., 80386 Programmer's Reference Manual. Order Number 230985-001, 1987.
Labrosse, Jean J. µC/OS, The Real-Time Kernel, Fourth Printing. Lawrence, KS: R&D Publications, 1992.