Getting down to basics: Running Linux on a 32-/64-bit RISC architecture: Part 3
A system call is "just" a subroutine implemented by the kernel for a user-space program. Some system calls just return information known in the kernel but not outside (the accurate time of day, for example).
But there are two reasons why it's more difficult than that. The first has to do with security - the kernel is supposed to be robust in the face of faulty or malicious application code.
The second reason has to do with stability. The Linux kernel should be able to run applications built for it, but should also be able to run any application built for the same or a compatible architecture at any time in the past. A system call, once defined, may only be removed after a lot of argument, work, and waiting.
But back to security. Since the system call will run in kernel mode, the entry to kernel mode must be controlled. For the MIPS architecture that's done with a software-triggered exception from the syscall instruction, which arrives in the kernel exception handler with a distinctive code (8 or "Sys") in the CPU's Cause(ExcCode) register field.
The lowest-level exception handler will consult that field to decide what to do next and will switch into the kernel's system call handler.
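The field test described above can be sketched in C. This is only an illustration, not kernel code: the helper names are hypothetical, and the sketch assumes the MIPS32 layout in which ExcCode occupies bits 6..2 of the Cause register.

```c
#include <stdint.h>

/* Assumed MIPS32 layout: ExcCode is the 5-bit field in Cause bits 6..2. */
#define CAUSE_EXCCODE_SHIFT 2
#define CAUSE_EXCCODE_MASK  0x1fu

/* The code the text quotes for a syscall exception ("Sys"). */
#define EXCCODE_SYS 8u

/* Extract the exception code from a raw Cause register value. */
static inline unsigned cause_exccode(uint32_t cause)
{
    return (cause >> CAUSE_EXCCODE_SHIFT) & CAUSE_EXCCODE_MASK;
}

/* Hypothetical helper: nonzero if the exception was raised by syscall. */
static inline int is_syscall_exception(uint32_t cause)
{
    return cause_exccode(cause) == EXCCODE_SYS;
}
```

A real low-level handler does this in a few assembly instructions and then uses the code to index a table of handlers; the C form just makes the bit arithmetic visible.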
That's just a single entry point: The application sets a numeric argument to select which of several hundred different syscall functions should be called. Syscall values are architecture-specific: The MIPS system call number for a function may not be the same as the x86 number for the same function.
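From user space, the single entry point and the numeric selector are visible through the C library's generic syscall() wrapper. The sketch below assumes a Linux system with glibc-style headers; the function names raw_getpid and getpid_syscall_number are made up for the example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>   /* SYS_getpid: the number for the build architecture */
#include <unistd.h>        /* syscall(), getpid() */

/* Call getpid by number through the one generic entry point. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Report which number this architecture assigns to getpid. */
long getpid_syscall_number(void)
{
    return SYS_getpid;
}
```

Building the same source for different architectures gives different values from getpid_syscall_number(), while raw_getpid() always agrees with the ordinary getpid() library call.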
What Happens on a System Call
System call arguments are passed in registers, as far as possible. It's good to avoid unnecessary copies between user- and kernel-space. In the 32-bit MIPS system:
1) The syscall number is put in v0.
2) Arguments are passed as required by the "o32" ABI. Most of the time, that means that up to four arguments are passed in the registers a0-a3; but there are corner cases where something else happens. (Many of the darkest corners of argument passing don't affect the kernel, because it doesn't deal in floating-point values or pass or return structures by value.)
Although the kernel uses something like o32 to define how arguments are passed into system calls, that does not mean that userland programs are obliged to use o32.
Good programming practice requires that userland calls to the OS should always pass through the C runtime library or its equivalent, so that the system call interface need only be maintained in one place. The library can translate from any ABI to the o32-like syscall standard, and that's done when running 32-bit software on 64-bit MIPS Linux.
3) The return value of the syscall is usually in v0. But the pipe(2) system call returns an array of two 32-bit file descriptors, and it uses v0 and v1—perhaps one day another system call will do the same.
(I hear a chorus of "Over our dead bodies!" from the kernel maintainers.)
All system calls that can ever fail return a summary status (0 for good, nonzero for bad) in a3.
4) In keeping with the calling conventions, the syscall preserves the values of those registers that o32 defines as surviving function calls.
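The return convention in the list above can be modeled in portable C. This is a sketch of what a C library wrapper might do after the syscall instruction returns, not actual library code: the struct and function names are hypothetical, and it assumes that when a3 is nonzero, v0 carries a positive error number.

```c
#include <errno.h>

/* Hypothetical snapshot of the result registers after a 32-bit MIPS syscall. */
struct o32_syscall_regs {
    long v0;   /* result, or (assumed) a positive error number when a3 != 0 */
    long v1;   /* second result; per the text, only pipe(2) uses it */
    long a3;   /* summary status: 0 for good, nonzero for bad */
};

/* Translate the register convention into the usual C convention:
 * return the result on success, or set errno and return -1 on failure. */
long o32_syscall_result(const struct o32_syscall_regs *regs)
{
    if (regs->a3 != 0) {
        errno = (int)regs->v0;
        return -1;
    }
    return regs->v0;
}
```

Keeping the status in a3 separate from the value in v0 means a successful call can return any bit pattern in v0 without it being mistaken for an error.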
Making sure the kernel implementation of a system call is safe involves a healthy dose of paranoia. Array indexes, pointers, and buffer lengths must be checked.
Not all application programs are permitted to do everything, and the code for a system call must check the caller's "capabilities" when required (most of the time, a process running for the root super-user can do anything).
The kernel code running a system call is in process context, the least restrictive context for kernel code. System call code can do things that may require the thread to sleep while some I/O event happens, or can access virtual memory that may need to be paged in.
When copying data in and out, you're dependent on pointers provided by the application, and they may be garbage. If we use a bad pointer, we'll get an exception, which could crash the kernel—not acceptable.
So copying inputs from user space or outputs to user space can be done safely with the functions copy_to_user()/copy_from_user(), provided for that purpose. If the user did pass a bad address, that will cause a kernel exception.
The kernel, though, maintains a list of the functions (strictly, a list of instruction address ranges) that are trusted to do dangerous things: copy_to_user() and copy_from_user() are on that list. When the exception handler sees that the exception restart address is in one of these functions, it returns to a carefully constructed error handler. The system call fails and reports the error to the application (typically as EFAULT), but the kernel survives.
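The effect of this protection can be observed from user space. The sketch below deliberately hands the kernel an unmapped pointer; it assumes a Linux system where address 8 is not mapped and where the pipe2 system call is available, and the function name is made up for the example. On current kernels the call should fail with EFAULT rather than harming the kernel.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel to write two file descriptors through a bad pointer.
 * Returns the errno reported by the failed call, or 0 if the call
 * unexpectedly succeeded. */
int errno_for_bad_pipe_pointer(void)
{
    int *bad = (int *)(uintptr_t)8;   /* deliberately invalid user address */

    errno = 0;
    long rc = syscall(SYS_pipe2, bad, 0);
    return (rc == -1) ? errno : 0;
}
```

The raw syscall() form is used so that the pointer reaches the kernel untouched; the kernel's copy-out to user space is where the bad address is detected.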
Return from a system call is done by a return through the end of the syscall exception handler, which ends with an eret. The important thing is that the change back to user mode and return to the user instruction space happen together.
Figure 14.1. Memory map for a Linux thread
How Addresses Get Translated in
Before getting down to how it's done, we should get an overview of the job. Figure 14.1 above sketches out the memory map seen by a Linux thread on a 32-bit Linux/MIPS system. The map has to fit onto what the hardware does, so user-accessible memory is necessarily in the bottom half of the map.
(The 64-bit Linux/MIPS map is built on the same principles, but the more complicated hardware map makes it relatively confusing.)
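As a rough companion to the map, the split between user and kernel space on a 32-bit MIPS CPU can be written as a small classifier. The segment names and boundaries below are the conventional MIPS32 ones (kuseg, kseg0, kseg1, kseg2) and are an assumption here, not a transcription of Figure 14.1; the function names are hypothetical.

```c
#include <stdint.h>

/* Conventional MIPS32 virtual address segments (assumed, not from the figure). */
enum mips32_segment {
    SEG_KUSEG,   /* 0x00000000-0x7FFFFFFF: user-accessible, TLB-mapped */
    SEG_KSEG0,   /* 0x80000000-0x9FFFFFFF: kernel only, unmapped, cached */
    SEG_KSEG1,   /* 0xA0000000-0xBFFFFFFF: kernel only, unmapped, uncached */
    SEG_KSEG2    /* 0xC0000000-0xFFFFFFFF: kernel only, TLB-mapped */
};

/* Classify a 32-bit virtual address by the segment it falls in. */
enum mips32_segment mips32_segment_of(uint32_t vaddr)
{
    if (vaddr < 0x80000000u) return SEG_KUSEG;
    if (vaddr < 0xA0000000u) return SEG_KSEG0;
    if (vaddr < 0xC0000000u) return SEG_KSEG1;
    return SEG_KSEG2;
}

/* User-mode code can only touch the bottom half of the map. */
int mips32_user_accessible(uint32_t vaddr)
{
    return mips32_segment_of(vaddr) == SEG_KUSEG;
}
```

The single comparison against 0x80000000 is the "bottom half" rule from the text: the top address bit alone decides whether user mode may use an address.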