
Managing Tasks on x86 Processors

Intel's x86 microprocessors can automatically manage tasks just like a simple operating system. There are plenty of tricks and pitfalls, but with the right approach the programmer can get great performance at zero cost.

Just about every embedded system does some sort of task switching or task management. You don't always have to use a full-size operating system or RTOS to do task management; sometimes a little kernel executive, or even a quick time-slice interrupt, is enough. In the extreme case, you don't need any software at all: the processor can manage tasks for you. In this article, we'll drill into the x86 task-management features.

Hardware task management started with the '386 and continues to this day on chips like the old '486 and Pentium as well as the newer Athlon, Opteron, and Pentium 4 processors. It's a terrific feature for embedded systems programmers because it makes task management fairly simple and foolproof. Whether you're managing just two tasks or dozens, you can probably let the chip do it all for you—for free.

Taking the processor to task
When tasks are managed by software, that code generally “freezes” a task by storing all of its volatile information somewhere in memory until the task is “thawed” and run again. You can call this a state frame, a task image, a context store, or whatever you like, but in x86-land it's called a task state segment (TSS). TSSes correlate to tasks one-to-one. As an x86 programmer, you get to create one TSS for every task, whether you've got just two tasks or the processor's theoretical limit of 8,192 tasks.

The TSS is nothing but a hundred bytes or so of memory set aside for storage. As the name suggests, it's yet another type of special memory segment defined in the global descriptor table (GDT) of memory segments. Peculiarly, this memory is not accessible through its task-segment selector from any program, even at the highest privilege level. Only the processor itself can access a TSS once it's defined; it's the chip's own private task storage area. A minimal TSS is shown in Figure 1. This format is fixed; you can't rearrange the structure of a TSS, although you can make it longer, as we'll see in a minute.


Figure 1: Minimal TSS

The TSS stores all eight of the processor's general-purpose registers as well as its six segment registers. No surprises there. The instruction pointer (EIP) and flags (EFLAGS) also get saved. Here's where it gets weird: the TSS holds three different pairs of stack segments and pointers, one for each of the three most-privileged levels (rings 0, 1, and 2). This is in case the task changes privilege levels as it runs. The processor also stores away Control Register 3 (CR3), which controls memory paging, among other features. This allows page tables to be changed on a task-by-task basis. Toward the top end of the table is a selector to an LDT—a local descriptor table that allows each task to have its own private table of memory descriptors along with the GDT. We'll see more of this anon.

The last bit in the TSS is the trace (T) bit, which is used for on-chip debugging; we'll ignore it for today. Nearby is an optional 16-bit pointer to the I/O permission bit map. This allows you to grant or deny permission to each and every one of the 65,536 addresses in the I/O address space. The first bit of the first byte corresponds to I/O address 0, and so on. If you choose to set permissions for every I/O address, the bit map will add 8KB to the TSS. You don't have to define an I/O permission bit map for all I/O addresses or for all tasks, just the ports and tasks you think will need it. Finally, at the beginning of the TSS there's a Back Link field. This 16-bit value helps the processor connect a chain of tasks together when one task switches to another.
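If it helps to see the layout as code, here's a minimal sketch of Figure 1 as a MASM structure. The field names are illustrative, not anything the hardware requires; only the offsets matter, and the total comes to 104 (68h) bytes.

TSS32	STRUC
back_link	DW	?		;selector of the calling task's TSS
		DW	0		;(reserved)
esp0		DD	?		;ring-0 stack pointer
ss0		DW	?		;ring-0 stack segment
		DW	0
esp1		DD	?		;ring-1 stack pointer
ss1		DW	?
		DW	0
esp2		DD	?		;ring-2 stack pointer
ss2		DW	?
		DW	0
cr3_reg		DD	?		;page-directory base for this task
eip_reg		DD	?
eflags_reg	DD	?
eax_reg		DD	?
ecx_reg		DD	?
edx_reg		DD	?
ebx_reg		DD	?
esp_reg		DD	?
ebp_reg		DD	?
esi_reg		DD	?
edi_reg		DD	?
es_reg		DW	?
		DW	0
cs_reg		DW	?
		DW	0
ss_reg		DW	?
		DW	0
ds_reg		DW	?
		DW	0
fs_reg		DW	?
		DW	0
gs_reg		DW	?
		DW	0
ldt_sel		DW	?		;selector of this task's LDT (0 = none)
		DW	0
t_bit		DW	?		;bit 0 is the trace (T) bit
iomap_base	DW	?		;offset of the I/O permission bit map
TSS32	ENDS				;104 (68h) bytes in all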

Throwing the switch
So that's what a TSS looks like. What does it do? As you might guess, the processor loads all 104 bytes from a TSS when it enters a new task. (It also reads the I/O permission bit map if you've defined one.) Unlike a subroutine call, where most registers are left untouched, a task switch changes everything. Tasks share nothing; no information crosses task boundaries. Tasks are as separate as they can be while still sharing the same piece of silicon. This keeps tasks from contaminating one another and also permits tasks to be stopped and started at any time, in any order, without side effects.

The reverse happens when a task is switched out: The processor stores its internal state into the first 104 bytes of the TSS. (The I/O permission bit map isn't stored because it doesn't change.) The areas shaded in orange only get loaded, not stored, because they're not supposed to change during execution. The processor saves a bit of time by not writing these values to memory on every task switch.

That's it in a nutshell. But you know that's not the whole story. Defining a place to store dormant tasks is one thing. Getting the processor to switch among them is something else. Here's where it gets fun.

You can define whatever task-switching algorithm you like: simple time slice, round robin, priority based, linked list, random, whatever. That's up to you, and the processor won't interfere. The mechanics of switching tasks is handled automatically but why they switch is left as an exercise for the reader.

You can switch tasks with a simple JMP or CALL instruction, or as the result of an interrupt. A fault or exception (like a divide-by-zero error) can also launch a new task. Basically, any change of flow that could change code segments can also change tasks. It's not too different from a subroutine call, except that subroutines typically get passed parameters, do some useful work, then return to the caller. In contrast, forcing a task switch means one task deliberately puts itself to sleep and awakens another, unrelated dormant task. The new task receives no parameters from the old task and is typically under no obligation to reawaken its caller at any time. It's intentionally difficult for one task to affect another in any way.

To switch tasks, you use a task gate, which is similar to the call gates in my August 2004 column. As Figure 2 shows, task gates are simpler than call gates because all they have to do is point to a TSS. Just like call gates, you simply code a FAR CALL or FAR JMP that targets this gate segment. The processor recognizes the segment as a task instead of a normal code segment and initiates a task switch. The whole ordeal is shown in Figure 3.


Figure 2: Task gates are simpler than call gates


Figure 3: Task gates to task segments

Task gates have privilege levels, so your code must be privileged enough to initiate the task switch. This doesn't mean you have to be privileged enough to run the code in that task, just privileged enough to request the switch. In keeping with the total separation of tasks, your calling program doesn't know (or care) the privilege level of the task that's about to run.
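For concreteness, here's roughly what the two GDT entries behind Figures 2 and 3 might look like in assembler. This is a sketch, not a template: the TSS base address is assumed to be patched in at run time, everything sits at privilege level 0, and task1_tss_selector is an equate you'd define for the byte offset of task1_tss_desc within the GDT.

task1_tss_desc:				;available 32-bit TSS descriptor
	DW	67h			;limit = 103, the minimal TSS
	DW	0			;base 15:0 (patch in the TSS address)
	DB	0			;base 23:16
	DB	89h			;present, DPL 0, type 9 = available TSS
	DB	0			;limit 19:16, byte granular
	DB	0			;base 31:24

task1_gate:				;task gate aimed at the descriptor above
	DW	0			;(unused)
	DW	task1_tss_selector	;selector of task1_tss_desc
	DB	0			;(unused)
	DB	85h			;present, DPL 0, type 5 = task gate
	DW	0			;(unused)

A FAR JMP or FAR CALL through either selector, using the same PWORD-pointer trick you'll see in Listing 2, is what actually triggers the switch.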

Chain of command
So when a task switch occurs, the processor empties its brain into the current (outgoing) TSS, then loads an all-new context from the incoming TSS. Which is the current TSS? That's specified in the processor's task register (TR). You can load and store TR with the LTR and STR instructions, respectively. Be sure you load TR with a pointer to a blank TSS before your first task switch or else the processor won't know where to store the current context.
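Loading TR is a two-instruction affair; task0_tss_selector here is an assumed equate for the GDT selector of that spare, writable TSS:

	MOV	AX, task0_tss_selector	;selector of a valid, blank TSS
	LTR	AX			;where the first outgoing context will land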

Here are a couple of subtle points. Tasks aren't reentrant, and you can't call them recursively. The processor secretly sets a “busy bit” on the currently running task and won't load a task that has the bit set. The processor also sets the TS (task switched) bit in its internal control register CR0 after every task switch. The processor never clears TS on its own; software does that with the CLTS instruction. While TS is set, floating-point instructions will fault, which makes it a handy hook for lazily managing shared resources like a coprocessor.
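Here's a brief sketch of that idea; TS is bit 3 of CR0, CLTS requires privilege level 0, and save_fpu_context is an assumed routine that swaps coprocessor state:

	MOV	EAX, CR0
	TEST	EAX, 8			;TS is bit 3: has a task switch happened?
	JZ	fpu_still_ours
	CALL	save_fpu_context	;assumed: spill/reload coprocessor state
	CLTS				;clear TS until the next task switch
fpu_still_ours: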

Although tasks can't be reentrant, they can be nested. If you switched tasks through a FAR CALL (as opposed to a FAR JMP), the new task is considered to be nested within the old task. The new task will have its Back Link field filled in with a pointer to the old task's TSS; this happens automatically. Both tasks will also have their “busy bit” set, so they can't be called again. The processor also sets the NT (nested task) flag in its EFLAGS register. This is essentially a “valid bit” that says the Back Link field is now officially binding.

With NT set and the back link updated, the processor knows this task is a child of another task. An IRET (interrupt return) instruction will now un-nest the tasks. Normally, an IRET just pops a return address off the stack and performs a normal interrupt return. When NT is set, however, the processor performs a task switch, clears the NT bit, and follows the back link to the parent task. Pretty nice work for a collection of transistors, huh?

By the way, a normal RET instruction won't un-nest tasks, even if they were nested by CALL instructions. Only an IRET will do this.

A simple task manager
Let's make a quick-and-dirty task manager. First, we'll code up two tasks, then write the code that switches between them. Later you can add more tasks and a more elaborate task manager.

Our two simple tasks light up a pair of LEDs. One task will turn on a green LED; the other will turn on a red LED. The first task is shown in Listing 1; the second task is similar but has the _on and _off constants swapped.

Listing 1: A task that lights LEDs

task1:	MOV	DX, green_led
	MOV	AL, _on
	OUT	DX, AL		;turn on green LED
	MOV	DX, red_led
	MOV	AL, _off
	OUT	DX, AL		;turn off red LED
	JMP	task1

Yes, it's an infinite loop. That's okay because we know that both tasks will be interrupted from the outside by our task manager. You might also notice that neither task is concerned about preserving registers—or any other state. So even though we, as omniscient programmers, know that these tasks will be interrupted, we also know that they don't care. All task switching is transparent to the tasks involved.

Why infinite loops? Why not just turn on the LEDs once and leave them? That's because each task undoes the work of the other, turning one LED on and the other one off. If they didn't continually hammer the I/O ports for the LEDs, Task 2 would overwrite the effects of Task 1 and then they'd both spin doing nothing. We want the LEDs to toggle back and forth every time the tasks alternate.
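For completeness, here's a sketch of that second task, assuming the same green_led, red_led, _on, and _off equates as Listing 1:

task2:	MOV	DX, red_led
	MOV	AL, _on
	OUT	DX, AL		;turn on red LED
	MOV	DX, green_led
	MOV	AL, _off
	OUT	DX, AL		;turn off green LED
	JMP	task2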

Next, you'll need to create two TSSes to hold the state you want at the beginning of each task's program. Most of the registers in our example won't matter except for CS and EIP, which obviously need to point to the program code. Both tasks should probably have valid stack pointers, too, just in case.
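As a sketch, initializing Task 1's TSS might look something like this, using the TSS32 structure from earlier; flat_data is an assumed equate for a writable data-segment selector, and the EFLAGS value just keeps interrupts enabled inside the task:

task1_tss	TSS32	<>		;Task 1's TSS, in a writable segment
task1_stack	DD	256 DUP (?)	;a modest stack for Task 1
task1_stack_top	LABEL	DWORD

	MOV	task1_tss.eip_reg, OFFSET task1
	MOV	task1_tss.cs_reg, SEG task1
	MOV	task1_tss.esp_reg, OFFSET task1_stack_top
	MOV	task1_tss.ss_reg, flat_data
	MOV	task1_tss.ds_reg, flat_data
	MOV	task1_tss.eflags_reg, 202h	;IF = 1 so the timer tick still fires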

All is in readiness. You load the processor's TR with a pointer to Task 1's TSS and make a FAR JMP to Task 2's TSS. The red LED comes on and stays on. What's wrong with our programs? A couple of things. First, there is no provision for changing from one task to another. The tasks don't do it voluntarily, nor is there a time-slice or other interrupt defined to make it happen automatically. The other problem is that when TR is loaded and the first task switch is made, the information written into Task 1's TSS is obliterated.

The trouble with starting multitasking is that the current state of the processor must be saved somewhere, even if you don't want it. There are at least two solutions to this. You can create a third, garbage TSS in which the original state of the processor is stored as you begin the first task. The TSS will probably never be used again, and so it's a waste of 104 bytes, but it works. The second workaround is to make the initialization code part of the first task. In other words, you can append Task 1 to your setup code and “fall through” into it when initialization is complete. This saves a TSS and has no ill side effects.
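A sketch of that second workaround, with task1_tss_selector again assumed to be the GDT selector for Task 1's TSS:

	MOV	AX, task1_tss_selector
	LTR	AX			;the running context now belongs to Task 1
	STI				;let the time-slice interrupt take it from here
task1:					;...and fall straight into Task 1 (Listing 1)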

Given that neither task is voluntarily going to suspend itself and force a task switch to its twin, we need to create a time-slice interrupt and write a handler for it that will suspend the currently running task and start the other. The example in Listing 2 assumes we have a timer interrupt that calls this routine. We then modify the return stack so that the IRET “returns” to a FAR JMP instruction that forces a task switch.

Listing 2: A simple task manager

switch_to_task_1:
	MOV	DWORD PTR SS:[ESP]+0, OFFSET start1
	MOV	WORD PTR SS:[ESP]+4, SEG start1
	IRET
start1:	JMP	PWORD PTR task1ptr
	JMP	switch_to_task_2
-----------
switch_to_task_2:
	MOV	DWORD PTR SS:[ESP]+0, OFFSET start2
	MOV	WORD PTR SS:[ESP]+4, SEG start2
	IRET
start2:	JMP	PWORD PTR task2ptr
	JMP	switch_to_task_1
-----------
task1ptr	LABEL	PWORD
	DD	0			;32-bit offset
	DW	task1_tss_selector	;16-bit selector
task2ptr	LABEL	PWORD
	DD	0			;32-bit offset
	DW	task2_tss_selector	;16-bit selector

Scheduler as task
You can also take an entirely different approach to the scheduler. Rather than implementing it as an interrupt-service routine, you can promote it to a task of its own. This has several advantages, especially in real-life multitasking systems, where the scheduling software is apt to be more complex than our example. If the scheduler is a task unto itself, it doesn't have to operate in the context of the task that was interrupted or be so cautious about preserving registers. It might run at a different privilege level and might share data with the operating system. In fact, for a really simple real-time kernel, the scheduler is the operating system.

To implement the scheduler as a task, you need to define and initialize a third TSS, set it up with the proper instruction pointer to your scheduling code, and put a task gate descriptor in the appropriate slot in the interrupt descriptor table (IDT). When your time-slice interrupt arrives, the processor immediately performs a task switch, saving the state of the interrupted task and loading the state of the scheduler task. This in itself is an improvement, since the interrupted task has had its entire context saved already, and the scheduler doesn't have to be concerned with contaminating it. Because the task switch was caused by an interrupt, the scheduler's task is considered a nested task. (You can verify this by testing the NT bit in EFLAGS.) As a nested task, the scheduler will have the outgoing task's TSS selector stored in its Back Link field. If the scheduler can read its own TSS (that is, if it has access to a data segment that encompasses the TSS), it can determine which task was running, which might help you decide which task to run next.

Performing the switch is now even easier than it was in the interrupt-service routine procedure. Whenever a nested task executes an IRET instruction, the processor un-nests it by returning control to the task specified in the Back Link field—normally the task that called it. But if the scheduler task can write to its own TSS, it can modify the Back Link field to point to the task it wants to run next and then execute an IRET. This cleanly terminates the interrupt and performs a “directed” task switch.

There's one trick to this that's important if you want to avoid crashing your system. You need to set the busy bit of the incoming task's TSS descriptor (the one the back link now points to) and clear the busy bit of the task that was originally interrupted, since the processor will no longer un-nest into it. The processor thinks it's returning to the parent task, and if that task isn't marked busy, the chip detects foul play and generates a fault.
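Putting those pieces together, the tail end of the scheduler task might look something like the sketch below. Everything named here is an assumption: sched_tss is a data-segment alias of the scheduler's own TSS (laid out like the TSS32 structure from earlier), gdt_alias is a data-segment alias of the GDT, and next_task and prev_task hold the selectors your scheduling policy has chosen.

	MOV	AX, next_task
	MOV	sched_tss.back_link, AX		;aim our back link at the next task
	MOVZX	EBX, next_task
	AND	EBX, 0FFF8h			;strip the TI and RPL bits
	OR	BYTE PTR gdt_alias[EBX+5], 02h	;mark the incoming task busy
	MOVZX	EBX, prev_task
	AND	EBX, 0FFF8h
	AND	BYTE PTR gdt_alias[EBX+5], 0FDh	;the interrupted task is no longer busy
	IRET					;NT is set, so this follows the back link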

There's another trick to making task management work safely and reliably. If you've set up your segments properly and been careful about granting (and denying) privilege levels, then most programs won't be able to modify, or even see, any of the TSS structures. That's important, because it prevents errant or malicious code from tweaking the dormant state of another task, modifying back links, adjusting registers, and so on. In particular, you want to prevent tasks from modifying their own Back Link field. Unfortunately, any program can set the NT bit in the EFLAGS register and then execute an IRET, which will follow the back link to wherever it points.

The best solution is probably to initialize all Back Link fields to point to a “safety task.” Then if NT gets set when it shouldn't, control will return to a predefined task, which should probably terminate the offending program.

Sharing data among tasks
At the top of the article we glossed over something called the local descriptor table or LDT. Sharp-eyed readers may have also noticed the LDT field at offset 0x0060 in the TSS. So, you ask, what's an LDT and why would I want one?

Back in April (“Taming the x86 Beast,” April 2004) we showed how you must define every memory segment with a segment descriptor and that all descriptors live in one big happy table, the GDT. Well, that's not entirely true. Segment descriptors can live in the GDT but they can also live in one or more secondary tables called LDTs. As the name suggests, a local descriptor table is local; it's private to one task, rather than globally shared among all tasks. In a multitasking x86 system, each task can use descriptors from exactly two sources: the GDT and its own private LDT. So there are really two sets of descriptor tables active at any one time, and it doesn't really matter which of the two tables holds a particular descriptor.

You're not required to define an LDT for each task; LDTs are optional. If there are no local descriptors (and hence, no LDT) for a given task, that task will see only those segments defined in the GDT. On the other hand, if you do give a task an LDT, then only that task can see the memory segments defined in it. Got it? Good, because we're about to go down the rabbit hole.

To find a local segment, the processor has to go through several levels of indirection. Here's how it works. Your program loads a 16-bit value into a segment register, such as DS. We know that the processor uses that value as a pointer into a descriptor table—usually the GDT. But if bit 2 of the segment register is 1, the processor looks in the current LDT instead of the GDT. For example, loading 0x0008 (binary 00001000) into DS will point to the second descriptor in the GDT, but loading 0x000C (binary 00001100) into DS will point to the second descriptor in the LDT. The processor finds the LDT by looking at offset 0x0060 in the task's TSS, uses that pointer to find an “LDT descriptor” in the GDT, then uses that pointer to find the LDT in memory, then uses the pointer in DS to find the data segment, then uses the offset register (say, ESI) to find the target address in that segment. Figure 4 diagrams how this sleight of hand works.


Figure 4: Finding local descriptors

If you don't have an LDT for a specific task (that is, if the LDT field at offset 0x0060 in the TSS is zero), then you can't load your segment registers with any value that has bit 2 set. Doing so will yield a fault.
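Put another way, a selector is just the descriptor's index multiplied by eight, plus the table-indicator bit (4 selects the LDT) and a two-bit requested privilege level. A quick, purely illustrative example with arbitrary names:

gdt_code	EQU	(1 SHL 3) OR 0 OR 0	;0x08: GDT entry 1, privilege 0
ldt_data	EQU	(1 SHL 3) OR 4 OR 0	;0x0C: LDT entry 1, privilege 0
ldt_user	EQU	(2 SHL 3) OR 4 OR 3	;0x17: LDT entry 2, privilege 3

	MOV	AX, ldt_data
	MOV	DS, AX			;faults if the current task has no LDT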

You can put an LDT anywhere you want, just like you can place the GDT anywhere you want. Similarly, the LDT can have as many or as few descriptors as you like. Every time the processor switches tasks, it also switches from one LDT to another. (The GDT never changes; that's why it's global.) The incoming LDT is specified by the 16-bit LDT field in the incoming task's TSS.

Security and reliability
So what's the point of an LDT? It's a tool to help you separate the code, data, and stack spaces of different tasks so they don't contaminate each other. If two tasks have different LDTs with different memory-segment descriptors, there's no way those two tasks can interact with each other. Task 1 can't see Task 2's memory, and vice versa. Only segments defined in the GDT can be shared. To really separate tasks, put all of their code, data, and stack descriptors in their respective LDTs, not in the GDT.

Or you can do just the opposite. If you want all of your tasks to share some code or data, put the descriptor for that segment in the GDT so anyone can use it. Or if you want to be selective and have just two (or more) tasks share a semi-private area of memory, put its descriptor into those two (or more) tasks' LDTs.

You can even share parts of a segment. Segmented segmentation, you might call it. You could define one data segment for Task 1 and a different data segment for Task 2, but have the two segments overlap slightly. Then put the data-segment descriptors in the respective tasks' LDTs, and each task will have a small window into the other's data space. Even if one task accidentally (or maliciously) overruns its buffer, it can't touch anything in the other task's space beyond that deliberate window. Pretty neat, huh?
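Here's an illustrative sketch of that overlap, with completely arbitrary addresses: two 4KB read/write data segments that share a 256-byte window. Task 1's segment covers 0x00100000-0x00100FFF and Task 2's covers 0x00100F00-0x00101EFF.

task1_data_desc:
	DW	0FFFh			;limit 15:0 (4KB, byte granular)
	DW	0000h			;base 15:0
	DB	10h			;base 23:16 -> base = 0x00100000
	DB	92h			;present, DPL 0, read/write data
	DB	40h			;32-bit, byte granular, limit 19:16 = 0
	DB	0			;base 31:24
task2_data_desc:
	DW	0FFFh			;limit 15:0
	DW	0F00h			;base 15:0
	DB	10h			;base 23:16 -> base = 0x00100F00
	DB	92h			;present, DPL 0, read/write data
	DB	40h
	DB	0			;base 31:24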

Too bad Microsoft doesn't use this feature. Windows has been plagued by buffer-overflow bugs that could easily be prevented by the processor's segmentation features. Alas, even though these features have been built into every x86 chip for more than 15 years, Microsoft has never used them. Instead, Windows creates a “flat” memory system with no segmentation, no tasking, no bounds checking, and no privilege protection, and then struggles to duplicate all those features in software. The result has been famously ineffective.

To make matters worse, Microsoft recently worked with Intel and AMD to define a new NX (no execute) attribute bit that is supposed to prevent viruses from executing code from the data or stack space—even though this very feature has been around since 1988. Windows XP SP2 and server operating systems use this new bit to approximate the features of the chip's segmentation and protection model. Naturally, not all systems running Windows XP SP2 have processors built with the new NX bit, so in those cases Windows emulates the feature as best it can. Finally, some Windows applications include self-modifying code or code that does just-in-time translation, so the NX feature needs to be disabled for them—hence the new Windows Control Panel setting that allows users to disable Data Execution Prevention (DEP) on a case-by-case basis. It seems inevitable that new viruses will slyly ask users to disable this feature.

Intel and AMD naturally promote NX as a benefit. AMD calls it Enhanced Virus Protection on its Athlon64 processors; Intel labels it Execute Disable, or XD. Some processors cache the NX bit while others have to read it from memory. In the latter case, NX will degrade performance while the chip reads NX over and over from slow system DRAM. Such is the price we pay for relative security.

Jim Turley is the editor in chief of Embedded Systems Programming, a semiconductor industry veteran, and the author of seven books, including the Essential Guide to Semiconductors. You can reach him at .
