A system call is “just” a subroutine implemented by the kernel for a user-space program. Some system calls just return information known in the kernel but not outside (the accurate time of day, for example).
But there are two reasons why it's more difficult than that. The first has to do with security: the kernel is supposed to be robust in the face of faulty or malicious application code.
The second reason has to do with stability. The Linux kernel should be able to run applications built for it, but should also be able to run any application built for the same or a compatible architecture at any time in the past. A system call, once defined, may only be removed after a lot of argument, work, and waiting.
But back to security. Since the system call will run in kernel mode, the entry to kernel mode must be controlled. For the MIPS architecture that's done with a software-triggered exception from the syscall instruction, which arrives in the kernel exception handler with a distinctive code (8, or “Sys”) in the CPU's Cause(ExcCode) register field.
The lowest-level exception handler will consult that field to decide what to do next and will switch into the kernel's system call handler.
That's just a single entry point: the application sets a numeric argument to select which of several hundred different syscall functions should be called. Syscall values are architecture-specific: the MIPS system call number for a function may not be the same as the x86 number for the same function.
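The single-entry-point idea can be sketched in ordinary C as a table of function pointers indexed by the syscall number. This is only an illustration, not the kernel's code: the handler names, the numbering, and the error value are all made up here.

```c
#include <stddef.h>

/* Hypothetical handlers standing in for real syscall implementations. */
static long sys_demo_getpid(long a0, long a1, long a2, long a3)
{
    (void)a0; (void)a1; (void)a2; (void)a3;
    return 42;                          /* made-up process ID */
}

static long sys_demo_add(long a0, long a1, long a2, long a3)
{
    (void)a2; (void)a3;
    return a0 + a1;
}

typedef long (*syscall_fn)(long, long, long, long);

/* Made-up numbering; real numbers are fixed separately per architecture. */
enum { DEMO_NR_getpid = 0, DEMO_NR_add = 1, DEMO_NR_COUNT };

static const syscall_fn demo_table[DEMO_NR_COUNT] = {
    [DEMO_NR_getpid] = sys_demo_getpid,
    [DEMO_NR_add]    = sys_demo_add,
};

#define DEMO_BAD_SYSCALL (-1L)          /* illustrative error value */

/* Single entry point: validate the number, then dispatch via the table. */
long demo_syscall_entry(unsigned long nr, long a0, long a1, long a2, long a3)
{
    if (nr >= DEMO_NR_COUNT || demo_table[nr] == NULL)
        return DEMO_BAD_SYSCALL;
    return demo_table[nr](a0, a1, a2, a3);
}
```

Calling `demo_syscall_entry(DEMO_NR_add, 2, 3, 0, 0)` dispatches to the second handler, while an out-of-range number is rejected before the table is ever indexed.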
What Happens on a System Call
System call arguments are passed in registers, as far as possible. It's good to avoid unnecessary copies between user and kernel space. In the 32-bit MIPS system:
1) The syscall number is put in v0.
2) Arguments are passed as required by the “o32” ABI. Most of the time, that means that up to four arguments are passed in the registers a0 through a3; but there are corner cases where something else happens. (Many of the darkest corners of argument passing don't affect the kernel, because it doesn't deal in floating-point values or pass or return structures by value.)
Although the kernel uses something like o32 to define how arguments are passed into system calls, that does not mean that userland programs are obliged to use o32.
Good programming practice requires that userland calls to the OS should always pass through the C runtime library or its equivalent, so that the system call interface need only be maintained in one place. The library can translate from any ABI to the o32-like syscall standard, and that's done when running 32-bit software on 64-bit MIPS Linux.
3) The return value of the syscall is usually in v0. But the pipe(2) system call returns an array of two 32-bit file descriptors, and it uses v0 and v1; perhaps one day another system call will do the same.
(I hear a chorus of “Over our dead bodies!” from the kernel maintainers.)
All system calls that can ever fail return a summary status (0 for good, nonzero for bad) in a3.
4) In keeping with the calling conventions, the syscall preserves the values of those registers that o32 defines as surviving function calls.
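The return convention in items 3 and 4 is what a library wrapper has to translate into the familiar C style of “return -1 and set errno.” The sketch below models the registers as a plain struct. The struct and function names are hypothetical, and it assumes that when a3 signals failure, v0 carries the error code.

```c
#include <errno.h>

/* Hypothetical snapshot of the registers a syscall leaves behind. */
struct demo_sysret {
    long v0;   /* result, or (assumed) an error code when a3 is nonzero */
    long v1;   /* second result, used by pipe(2) */
    long a3;   /* summary status: 0 for good, nonzero for bad */
};

/* What a library wrapper might do with an ordinary syscall's result. */
long demo_wrap_result(struct demo_sysret r)
{
    if (r.a3 != 0) {
        errno = (int)r.v0;
        return -1;
    }
    return r.v0;
}

/* pipe(2)-style wrapper: the two descriptors come back in v0 and v1. */
int demo_wrap_pipe(struct demo_sysret r, int fds[2])
{
    if (r.a3 != 0) {
        errno = (int)r.v0;
        return -1;
    }
    fds[0] = (int)r.v0;
    fds[1] = (int)r.v1;
    return 0;
}
```

Keeping this translation in one wrapper layer is the reason the text insists that userland reach the OS only through the runtime library.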
Making sure the kernel implementation of a system call is safe involves a healthy dose of paranoia. Array indexes, pointers, and buffer lengths must be checked.
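As a minimal sketch of that paranoia, the checks below validate a user-supplied buffer and index before use. The function names are hypothetical; the 0x80000000 limit is the top of user space on a 32-bit Linux/MIPS system. A range check like this only shows that the addresses are ones the user is allowed to name, not that they are actually mapped.

```c
#include <stdbool.h>
#include <stdint.h>

/* User space on a 32-bit Linux/MIPS system ends just below this address. */
#define DEMO_USER_LIMIT UINT32_C(0x80000000)

/* True if [addr, addr + len) lies wholly in user space.
 * Written as a subtraction so that addr + len can never wrap around. */
bool demo_user_range_ok(uint32_t addr, uint32_t len)
{
    if (addr >= DEMO_USER_LIMIT)
        return false;
    return len <= DEMO_USER_LIMIT - addr;
}

/* Bounds-check an index supplied by the application before using it. */
bool demo_index_ok(uint32_t index, uint32_t table_size)
{
    return index < table_size;
}
```

The naive test `addr + len <= limit` would accept a huge `len` that overflows 32 bits, which is why the comparison is rearranged.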
Not all application programs are permitted to do everything, and the code for a system call must check the caller's “capabilities” when required (most of the time, a process running for the root super-user can do anything).
The kernel code running a system call is in process context, the least restrictive context for kernel code. System call code can do things that may require the thread to sleep while some I/O event happens, or can access virtual memory that may need to be paged in.
When copying data in and out, you're dependent on pointers provided by the application, and they may be garbage. If we use a bad pointer, we'll get an exception, which could crash the kernel, and that is not acceptable.
So copying inputs from user space or outputs to user space is done with the functions copy_to_user() and copy_from_user(), provided for that purpose. If the user did pass a bad address, that will cause a kernel exception.
The kernel, though, maintains a list of the functions that are trusted to do dangerous things: copy_to_user() and copy_from_user() are on that list. When the exception handler sees that the exception restart address is in one of these functions, it returns to a carefully constructed error handler. The application will be sent a rude signal, but the kernel survives.
(It's actually a list of instruction address ranges.)
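The list of trusted address ranges can be pictured as a small lookup table consulted by the exception handler. The following is a sketch under assumptions, not the kernel's data structure: the struct layout, the addresses, and the function name are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* One trusted code range and the handler to resume at if it faults. */
struct demo_ex_range {
    uintptr_t start;   /* first instruction address in the range */
    uintptr_t end;     /* one past the last instruction address */
    uintptr_t fixup;   /* where to resume after a bad user access */
};

/* Made-up addresses standing in for the user-copy routines. */
static const struct demo_ex_range demo_ex_table[] = {
    { 0x80101000u, 0x80101080u, 0x80109000u },
    { 0x80101080u, 0x80101100u, 0x80109040u },
};

/* Return the fixup address for a faulting PC, or 0 if the fault did not
 * happen in trusted code (which would be a genuine kernel bug). */
uintptr_t demo_find_fixup(uintptr_t fault_pc)
{
    size_t n = sizeof demo_ex_table / sizeof demo_ex_table[0];

    for (size_t i = 0; i < n; i++) {
        if (fault_pc >= demo_ex_table[i].start &&
            fault_pc < demo_ex_table[i].end)
            return demo_ex_table[i].fixup;
    }
    return 0;
}
```

A nonzero result tells the handler to resume at the error path instead of treating the fault as fatal.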
Return from a system call is done by a return through the end of the syscall exception handler, which ends with an eret. The important thing is that the change back to user mode and return to the user instruction space happen together.
Figure 14.1. Memory map for a Linux thread
How Addresses Get Translated in Linux/MIPS Systems
Before getting down to how it's done, we should get an overview of the job. Figure 14.1 above sketches out the memory map seen by a Linux thread on a 32-bit Linux/MIPS system. The map has to fit onto what the hardware does, so user-accessible memory is necessarily in the bottom half of the map.
(The 64-bit Linux/MIPS map is built on the same principles, but the more complicated hardware map makes it relatively confusing.)
Useful things to remember:
1) Where the kernel is built to run: The MIPS Linux kernel's code and data are built to run in kseg0; virtual addresses from 0x8000.0000 upward. Addresses in this region are just a window onto the low 512 MB of physical memory, requiring no TLB management.
2) Exception entry points: In most MIPS CPUs to date, these are hard-wired near the bottom of kseg0. The latest CPUs may provide the EBase register to allow them to be relocated, mainly so that multiple memory-sharing CPUs can use different exception handlers without the trouble of special-casing memory decoding.
In the Linux kernel, even when there are multiple CPUs, they should all run the same exception-handling code, so this is unlikely to be used for Linux.
3) Where user programs are built to run: MIPS Linux applications (which are run in low-privilege “user mode”) have virtual addresses from 0 through 0x7FFF.FFFF. Addresses in this region are accessible in user mode and translated through the TLB.
The main program of an application is built to run starting somewhere near zero. Not quite zero: one or more pages from virtual address zero are never mapped, so that an attempt to use a null pointer will be caught as a memory-management error.
The library components of an application, though, are loaded incrementally into user space at load time or even later. That's possible because a library component is built to be position-independent and is adjusted to fit the place that the loader finds for it.
4) User stack and heap: An application's stack is initially set to the top of user-accessible memory (about 2 GB up in virtual space) and grows down. The OS detects accesses to not-yet-mapped memory near the lowest stack entries it has allocated and automatically maps more pages to let the stack grow.
Meanwhile, new shared libraries or explicit user data allocated with malloc() and its descendants are growing up from the bottom of the user space. So long as the sum of all these remains less than 2 GB, all is well: This restriction is rarely onerous in any but the biggest servers.
5) Physical memory up to 512 MB: Can be accessed cached through kseg0 and uncached through kseg1. Historically, the Linux kernel assumed it had direct access to all the physical memory of the machine.
For smaller MIPS systems that use 512 MB or less of physical memory range, that's true; in this case, all memory is directly accessible (cached) in kseg0 and (uncached) in kseg1.
6) Physical memory over 512 MB is “high memory”: Now, 512 MB is no longer enough, even for embedded systems. Linux has an architecture-independent concept of high memory (physical memory that requires special, architecture-dependent handling), and for a 32-bit Linux/MIPS system physical memory above 512 MB is high memory. When we need to access it, we'll create appropriate translation entries and have them copied into the TLB on demand.
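The map described in the list above can be summarized as a few address comparisons. This sketch uses hypothetical names; the kseg1 and kseg2 boundaries (0xA000.0000 and 0xC000.0000) are the standard 32-bit MIPS layout rather than something spelled out in the text above.

```c
#include <stdbool.h>
#include <stdint.h>

enum demo_region { DEMO_KUSEG, DEMO_KSEG0, DEMO_KSEG1, DEMO_KSEG2 };

/* Classify a 32-bit virtual address by the region it falls in. */
enum demo_region demo_classify(uint32_t vaddr)
{
    if (vaddr < UINT32_C(0x80000000)) return DEMO_KUSEG; /* user, via TLB */
    if (vaddr < UINT32_C(0xA0000000)) return DEMO_KSEG0; /* unmapped, cached */
    if (vaddr < UINT32_C(0xC0000000)) return DEMO_KSEG1; /* unmapped, uncached */
    return DEMO_KSEG2;                                   /* kernel, via TLB */
}

/* kseg0 and kseg1 are both windows onto the low 512 MB of physical
 * memory, so the physical address is just the low 29 bits. */
uint32_t demo_unmapped_to_phys(uint32_t vaddr)
{
    return vaddr & UINT32_C(0x1FFFFFFF);
}

/* Physical memory at or above 512 MB cannot be reached through the
 * windows and counts as "high memory" on this kind of system. */
bool demo_is_high_memory(uint32_t paddr)
{
    return paddr >= UINT32_C(0x20000000);
}
```

For example, 0x80001000 (kseg0) and 0xA0001000 (kseg1) both refer to physical address 0x1000, differing only in whether the access is cached.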
Early MIPS CPUs sought applications in UNIX workstations and servers, so the MIPS memory-management hardware was conceived as the minimum hardware that could hope to provide memory management for BSD UNIX. The BSD UNIX system was the first UNIX OS to provide real paged virtual memory on the DEC VAX minicomputer.
The VAX was in many ways the model for the 32-bit paged-translation virtual-memory architectures that have dominated computing ever since; perhaps it's not surprising that there are some echoes of the VAX memory-management organization in MIPS.
But this is a RISC, and the MIPS hardware does much less. In particular, many problems that the VAX (or an x86) solves with microcode are left to software by the MIPS system.
In this series, we'll start close to where MIPS started, with the requirements of a basic UNIX-like OS and its virtual memory system; but this time the OS is Linux. We'll show how the essence of the MIPS hardware is a reasonable response to that requirement.
What's Memory Translation For?
Memory translation hardware (while we're being general we'll call it the MMU, for memory management unit) serves several distinct purposes, listed below. (Given just how much the MMU contributes, it's remarkable that a fairly workable minimal Linux system such as ucLinux can be built without it, but that's another story.)
1) Hiding and protection: User-privilege programs can only access data whose program address is in the kuseg memory region (lower program addresses). Such a program can only get at the memory regions that the OS allows.
Moreover, each page can be individually specified as writable or write protected; the OS can even stop a program from accidentally overwriting its code.
2) Allocating contiguous memory to programs: With the MMU, the OS can build contiguous program space out of physically scattered pages of memory, allowing us to allocate memory from a simple pool of fixed-size pages.
3) Extending the address range: Some CPUs can't directly access their full potential physical memory range. MIPS32 CPUs, despite their genuine 32-bit architecture, arrange their address map so that the unmapped address space windows kseg0 and kseg1 (which don't depend on the MMU tables to translate addresses) are windows onto the first 512 MB of physical memory. If you need a memory map that extends further, you must go through the MMU.
4) Making the memory map suit your program: With the MMU, your program can use the addresses that suit it. In a big OS, there may be many copies of the same program running, and it's much easier for them all to be using the same program addresses.
5) Demand paging: Programs can run as if all the memory resources they needed were already allocated, but the OS can actually give them out only as needed. A program accessing an unallocated memory region will get an exception that the OS can process; the OS then loads appropriate data into memory and lets the program continue.
In theory (and sometimes in computer science textbooks), demand paging is useful because it allows you to run a larger program than will fit into physical memory.
This is true, if you've got a lot of time to spare, but in reality if a program really needs to use more memory than is available it will keep displacing bits of itself from memory and will run very, very slowly.
But demand paging is still very useful, because large programs are filled with vast hinterlands of code that won't get run, at least on this execution.
Perhaps your program has built-in support for a data format you're not using today; perhaps it has extensive error handling, but those errors are rare. In a demand-paged system, the chunks of the program that aren't used need never be read in. And it will start up fast, which will please an impatient user.
6) Relocation: The addresses of program entry points and predeclared data are fixed at program compile/build time, though this is much less true for position-independent code, used for all Linux's shared libraries and most applications. The MMU allows the program to be run anywhere in physical memory.
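Two of the purposes listed above, contiguous program space built from scattered pages (item 2) and demand paging (item 5), can be shown together in a toy model. Everything here is hypothetical and far smaller than a real system: an eight-page address space, a fixed list of free frames in deliberately jumbled order, and a "fault" that simply assigns the next free frame on first touch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12u     /* 4 KB pages */
#define DEMO_NPAGES     8u      /* tiny virtual address space */
#define DEMO_NFRAMES    8u

struct demo_pte {
    bool     present;           /* has a frame been assigned yet? */
    uint32_t frame;             /* physical frame number */
};

struct demo_space {
    struct demo_pte pte[DEMO_NPAGES];
    uint32_t next_free;         /* index into the free-frame list */
    uint32_t faults;            /* pages filled in on demand so far */
};

/* Free frames listed out of order: physically scattered memory. */
static const uint32_t demo_free_frames[DEMO_NFRAMES] = {
    5, 2, 7, 0, 3, 6, 1, 4
};

/* Translate a virtual address, assigning a frame on first touch.
 * Returns false if the address is out of range or no frames remain. */
bool demo_translate(struct demo_space *s, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> DEMO_PAGE_SHIFT;
    uint32_t off = vaddr & ((1u << DEMO_PAGE_SHIFT) - 1u);

    if (vpn >= DEMO_NPAGES)
        return false;
    if (!s->pte[vpn].present) {             /* the toy "page fault" */
        if (s->next_free >= DEMO_NFRAMES)
            return false;
        s->pte[vpn].frame = demo_free_frames[s->next_free++];
        s->pte[vpn].present = true;
        s->faults++;
    }
    *paddr = (s->pte[vpn].frame << DEMO_PAGE_SHIFT) | off;
    return true;
}
```

Virtual pages 0 and 1 are adjacent from the program's point of view but land in frames 5 and 2, and pages the program never touches never consume a frame at all.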
The essence of the Linux memory manager's job is to provide each program with its own memory space. Well, really Linux has separate concepts: Threads are what is scheduled, and address spaces (memory maps) are the protection units. Many threads in a thread group can share one address space.
The memory translation system is obviously interested in address spaces and not in threads. Linux's implementation is that all threads are equal, and all threads have memory management data structures. Threads that run in the same address space, in fact, share most of those data structures.
But for now it will be simpler if we just consider one thread per address space, and we can use the older word “process” for them both.
If the memory management job is done properly, the fate of each process is independent of the others (the OS protects itself too): A process can crash or misbehave without bringing down the whole system. This is obviously a useful attribute for a university departmental computer running student programs, but even the most rigorous commercial environment needs to support experimental or prototype software alongside the tried and tested.
The MMU is not just for big, full virtual memory systems; even small embedded programs benefit from relocation and more efficient memory allocation. Any system in which you may want to run different programs at different times will find it easier if it can map the program's idea of addresses onto whatever physical address space is readily available.
Multitasking and separation between various tasks' address spaces used to be only for big computers, then migrated into personal computers and small servers, and are increasingly common in the vanishingly small computers of consumer devices.
However, few non-Linux embedded OSs use separate address spaces. This is probably not so much because this would not be useful, but is due to the lack of consistent features on embedded CPUs and their available operating systems. And perhaps because once you add separate address spaces to your system you're too close to reinventing Linux to make sense!
This is an unexpected bonanza for the MIPS architecture. The minimalism that was so necessary to make the workstation CPU simple in 1986 is a great asset to the embedded systems of the early 21st century.
Even small applications, beset by rapidly expanding code size, need to use all known tricks to manage software complexity; and the flexible software-based approach pioneered by MIPS is likely to deliver whatever is needed. A few years ago it was hard to convince CPU vendors addressing the embedded market that the MMU was worth including; now Linux is everywhere.
Basic Process Layout and Protection
From the OS's point of view, the low part of memory (kuseg) is a safe “sandbox” in which the user program can play all it wants. If the program goes wild and trashes all its own data, that's no worry to anyone else.
From the application's point of view, this area is free for use in building arbitrarily complicated private data structures and to get on with the job. Inside the user area, within the program's sandbox, the OS provides more stack to the program on demand (implicitly, as the stack grows down).
It will also provide a system call to make more data available, starting from the highest predeclared data addresses and growing up; systems people call this a heap. The heap feeds library functions, such as malloc(), that provide your program with chunks of extra memory.
Stack and heap are supplied in chunks small enough to be reasonably thrifty with system memory but big enough to avoid too many system calls or exceptions. However, on every system call or exception, the OS has a chance to police the application's memory consumption. An OS can enforce limits that make sure the application doesn't get so large a share of memory as to threaten other vital activities.
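The interplay between a library allocator and the heap-growing system call might look roughly like the toy below. It is a sketch with invented names, not how malloc() is implemented: the "system call" is an ordinary function that extends the heap in page-sized chunks and refuses once a limit is reached, and the allocator only calls it when its current chunk runs out. Arithmetic overflow on absurdly large requests is ignored for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_CHUNK 4096u        /* the heap grows in page-sized chunks */

struct demo_heap {
    size_t   brk;               /* current end of heap, in bytes */
    size_t   used;              /* bytes already handed to the program */
    size_t   limit;             /* ceiling enforced by the "OS" */
    unsigned grow_calls;        /* how often the "system call" ran */
};

/* Stand-in for the system call: extend the heap, policing the limit. */
static bool demo_grow(struct demo_heap *h, size_t bytes)
{
    size_t chunks  = (bytes + DEMO_CHUNK - 1) / DEMO_CHUNK;
    size_t new_brk = h->brk + chunks * DEMO_CHUNK;

    if (new_brk > h->limit)
        return false;
    h->brk = new_brk;
    h->grow_calls++;
    return true;
}

/* Library-side bump allocator. Returns the offset of the new block,
 * or (size_t)-1 if the OS refuses to grow the heap any further. */
size_t demo_alloc(struct demo_heap *h, size_t bytes)
{
    size_t avail = h->brk - h->used;

    if (bytes > avail && !demo_grow(h, bytes - avail))
        return (size_t)-1;

    size_t off = h->used;
    h->used += bytes;
    return off;
}
```

Two small allocations in a row cost only one trip to the "OS", which is the thrift the chunking is meant to buy, and the limit check is the point where consumption gets policed.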
A Linux thread keeps its identity inside the OS kernel, and when it runs in process context in the kernel (as in a system call) it's really just calling a careful subroutine.
The operating system's own code and data are of course not accessible to user space programs. On some systems, this is done by putting them in a completely separate address space; on MIPS, the OS shares the same address space, and when the CPU is running at the user-program privilege level, access to these addresses is illegal and will trigger an exception.
Note that while each process's user space maps to its own private real storage, the privileged space is shared. The OS code and resources are seen at the same address by all processes (an OS kernel is a multithreaded but single-address-space system inside), but each process's user space addresses access its own separate space. Kernel routines running in process context are trusted to cooperate safely, but the application need not be trusted at all.
We mentioned that the stack is initialized to the highest permissible user addresses: That's done so we can support programs that use a great deal of memory. The resulting wide spread of addresses in use (with a huge hole in between) is one characteristic of this address map with which any translation scheme must cope.
Linux maps application code as read-only to the application, so that code can be safely shared by threads in different address spaces; after all, it's common to have many processes running the same application.
Many systems share not just whole applications but chunks of applications accessed through library calls (shared libraries). Linux does that through shared libraries, but we'll pick up that story later.
Mapping Process Addresses to Real Memory
Which mechanisms are needed to support this model?
We want to be able to run multiple copies of the same application, and they'd better use different underlying copies of the data. So during program execution, application addresses are mapped to physical addresses according to a scheme fixed by the OS when the program is loaded.
Although it would be possible for the OS to rush around patching all the address translation information whenever we switched contexts from one process to another, it would be very inefficient. Instead, we award each active thread's memory map an 8-bit number called the address space ID or ASID.
When the hardware looks up a translation for an address, it accepts only an entry that matches the current ASID as well as the address, so the hardware can hold translations for different spaces without getting them mixed up. (Actually, we'll see that some translation entries can be ASID-independent or “global.”)
Now all the software needs to do when scheduling a thread in a different address space is to load a new ASID into EntryHi(ASID).
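A software model of an ASID-tagged lookup makes the point concrete. This is a simplified sketch with hypothetical structure and field names, not the MIPS TLB entry format: each entry carries a page number, an ASID, and a global flag, and a single field stands in for EntryHi(ASID).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_tlb_entry {
    uint32_t vpn;       /* virtual page number */
    uint8_t  asid;      /* 8-bit address space ID */
    bool     global;    /* matches regardless of the current ASID */
    bool     valid;     /* slot in use */
    uint32_t pfn;       /* physical frame number */
};

struct demo_tlb {
    struct demo_tlb_entry e[4];
    uint8_t current_asid;       /* plays the role of EntryHi(ASID) */
};

/* Look up vpn under the current ASID; returns true on a hit. */
bool demo_tlb_lookup(const struct demo_tlb *t, uint32_t vpn, uint32_t *pfn)
{
    for (size_t i = 0; i < sizeof t->e / sizeof t->e[0]; i++) {
        const struct demo_tlb_entry *e = &t->e[i];

        if (e->valid && e->vpn == vpn &&
            (e->global || e->asid == t->current_asid)) {
            *pfn = e->pfn;
            return true;
        }
    }
    return false;
}
```

Two entries for the same page number can sit in the table at once, one per address space; changing `current_asid` is all it takes to switch which one the lookup sees.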
The mapping facility also allows the OS to discriminate between different parts of the user address space. Some parts of the application space (typically code) can be mapped read-only, and some parts can be left unmapped and accesses trapped, meaning that a program that runs amok is likely to be stopped earlier.
The kernel part of the process's address space is shared by all processes. The kernel is built to run at particular addresses, so it mostly doesn't need a flexible mapping scheme and can take advantage of the kseg0 region, leaving memory mapping resources for application use.
Some dynamically generated kernel areas are conveniently built from mapped memory (for example, memory used to hold kernel loadable modules for device drivers and so on is mapped).
Paged Mapping Preferred
Many exotic schemes have been tried for mapping addresses. Until the mid-1980s or so, the industry kept thinking there must be a better solution than fixed-size pages.
Fixed-size pages have nothing to do with the needs or behavior of application programs, after all, and no top-down analysis will ever invent them.
But if the hardware gives programs what they want (chunks of memory in whichever size is wanted at that moment), available memory rapidly becomes fragmented into awkward-sized pieces. When we finally make contact with aliens, their wheelbarrows will have round wheels and their computers will probably use fixed-size pages.
So all practical systems map memory in pages: fixed-size chunks of memory. Pages are always a power of 2 bytes big. Small pages might take too much management (big tables of translations, many translations needed for a program); big pages may be inefficient, since quite a lot of small objects end up occupying a whole page. A page size of 4 KB is the overwhelmingly popular compromise for Linux.
(The MIPS hardware would be quite happy with bigger pages: 16 KB is the next directly supported size and is probably a better trade-off for many modern systems. But the page size is pervasive, with tentacles into the way programs are built; so we almost all still use 4 KB.)
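The trade-off between small and large pages comes down to simple arithmetic: fewer, larger pages mean fewer translations but more unused space at the end of each object. The helper names below are hypothetical.

```c
#include <stdint.h>

/* Number of whole pages needed to hold `bytes` (rounding up). */
uint64_t demo_pages_needed(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Bytes left unused in the last page: internal fragmentation. */
uint64_t demo_wasted_bytes(uint64_t bytes, uint64_t page_size)
{
    return demo_pages_needed(bytes, page_size) * page_size - bytes;
}
```

A 10,000-byte object, for instance, takes three 4-KB pages and wastes 2,288 bytes, or one 16-KB page and wastes 6,384 bytes: one translation instead of three, at the price of nearly three times the waste.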
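With 4-KB pages, a program/virtual address can be simply partitioned thus: the low 12 bits are the address within the page, and the remaining high-order bits (20 of them in a 32-bit address) are the page number.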
The address-within-page bits don't need to be translated, so the memory management hardware only has to cope with translating the high-order addresses (traditionally called virtual page number or VPN) into the high-order bits of a physical address (a physical frame number, or PFN; nobody can remember why it's not PPN).
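In C, splitting a 32-bit address for 4-KB pages and rebuilding a physical address from a translated frame number is a matter of shifts and masks. The macro and function names here are illustrative only.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT  12u                              /* 4 KB = 2^12 */
#define DEMO_PAGE_SIZE   (UINT32_C(1) << DEMO_PAGE_SHIFT)
#define DEMO_OFFSET_MASK (DEMO_PAGE_SIZE - 1u)

/* High-order bits: the virtual page number, which gets translated. */
uint32_t demo_vpn(uint32_t vaddr)
{
    return vaddr >> DEMO_PAGE_SHIFT;
}

/* Low-order bits: the address within the page, passed through as-is. */
uint32_t demo_offset(uint32_t vaddr)
{
    return vaddr & DEMO_OFFSET_MASK;
}

/* Physical address = translated frame number plus untouched offset. */
uint32_t demo_compose(uint32_t pfn, uint32_t offset)
{
    return (pfn << DEMO_PAGE_SHIFT) | (offset & DEMO_OFFSET_MASK);
}
```

For the address 0x00403ABC, the VPN is 0x403 and the offset is 0xABC; if the VPN translated to PFN 0x12345, the physical address would be 0x12345ABC.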
This series of articles is based on material from “See MIPS Run Linux,” by Dominic Sweetman, used with the permission of the publisher, Morgan Kaufmann/Elsevier, which retains full copyrights. It can be purchased online.
Dominic Sweetman is a software/hardware boundary expert based in London, England, who previously served as managing director at Algorithmics Ltd.