A system call is "just" a subroutine implemented by the kernel for a
user-space program. Some system calls just return information known in
the kernel but not outside (the accurate time of day, for example).
But there are two reasons why it's more difficult than that. The
first has to do with security: the kernel is supposed to be robust in
the face of faulty or malicious application code.
The second reason has to do with stability. The Linux kernel should be
able to run applications built for it, but should also be able to run
any application built for the same or a compatible architecture at any
time in the past. A system call, once defined, may only be removed
after a lot of argument, work, and waiting.
But back to security. Since the system call will run in kernel mode,
the entry to kernel mode must be controlled. For the MIPS architecture
that's done with a software-triggered exception from the syscall
instruction, which arrives in the kernel exception handler with a
distinctive code (8 or "Sys") in the CPU's Cause(ExcCode) register
field.
The lowest-level exception handler will consult that field to decide
what to do next and will switch into the kernel's system call handler.
That's just a single entry point: The application sets a numeric
argument to select which of several hundred different syscall functions
should be called. Syscall values are architecture-specific: The MIPS
system call number for a function may not be the same as the x86 number
for the same function.
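To make the single numbered entry point concrete, here is a minimal C sketch that asks for getpid by number through the C library's generic syscall() wrapper instead of through the usual getpid() function. The helper name getpid_by_number is invented for this illustration. SYS_getpid comes from <sys/syscall.h> and expands to whatever number the architecture being compiled for uses, so the same source works on MIPS and x86 even though the numbers differ.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE       /* ask glibc to declare syscall() */
#endif
#include <sys/syscall.h>  /* SYS_getpid: this architecture's number for getpid */
#include <unistd.h>       /* syscall(), getpid() */

/* Hypothetical helper: select the getpid system call by its number. */
long getpid_by_number(void)
{
    return syscall(SYS_getpid);
}
```

The value returned should match what the ordinary getpid() library call reports for the same process.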
What Happens on a System Call
System call arguments are passed in registers, as far as possible. It's
good to avoid unnecessary copies between user- and kernel-space. In the
32-bit MIPS system:
1) The syscall number is put in v0.
2) Arguments are passed as required by the "o32" ABI. Most of the
time, that means that up to four arguments are passed in the registers
a0-a3; but there are corner cases where something else happens. (Many
of the darkest corners of argument passing don't affect the kernel,
because it doesn't deal in floating-point values or pass or return
structures by value.)
Although the kernel uses something like o32 to define how arguments
are passed into system calls, that does not mean that userland programs
are obliged to use o32.
Good programming practice requires that userland calls to the OS
should always pass through the C runtime library or its equivalent, so
that the system call interface need only be maintained in one place.
The library can translate from any ABI to the o32-like syscall
standard, and that's done when running 32-bit software on 64-bit MIPS
Linux.
3) The return value of the syscall is usually in v0. But the pipe(2)
system call returns an array of two 32-bit file descriptors, and it
uses v0 and v1. Perhaps one day another system call will do the same.
(I hear a chorus of "Over our dead bodies!" from the kernel
maintainers.)
All system calls that can ever fail return a summary status (0 for
good, nonzero for bad) in a3.
4) In keeping with the calling conventions, the syscall preserves the
values of those registers that o32 defines as surviving function calls.
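The return convention in step 3 can be sketched in portable C as the kind of translation a library wrapper might perform. This is only a model: the struct mips_sysret and the function mips_sysret_to_c are hypothetical names, the registers are represented as plain struct fields, and the assumption that v0 carries a positive errno value when a3 is nonzero is not stated in the text above.

```c
#include <errno.h>

/* Hypothetical snapshot of the result registers after a syscall returns. */
struct mips_sysret {
    long v0;  /* result, or (assumed) an error code when a3 is nonzero */
    long v1;  /* second result word, used by pipe(2) */
    long a3;  /* summary status: 0 for good, nonzero for bad */
};

/* Translate the raw register state into the familiar C convention:
 * return the result on success, or -1 with errno set on failure. */
long mips_sysret_to_c(const struct mips_sysret *r)
{
    if (r->a3 != 0) {
        errno = (int)r->v0;  /* assumption: v0 holds a positive errno value */
        return -1;
    }
    return r->v0;
}
```

Keeping the status in a separate register means the full range of v0 remains available for legitimate results, which is one plausible reason for the design.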
Making sure the kernel implementation of a system call is safe involves
a healthy dose of paranoia. Array indexes, pointers, and buffer lengths
must be checked.
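As an illustration of that paranoia, here is a small user-space sketch of a pointer-and-length check, loosely in the spirit of the kernel's access_ok() test but not its actual implementation. The name user_range_ok and the constant USER_LIMIT are invented for this example, and the limit assumes a 32-bit map in which user memory occupies the bottom half.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for this sketch: user space is the bottom half of a 32-bit map. */
#define USER_LIMIT 0x80000000u

/* Return true if [addr, addr + len) lies wholly in user space.
 * The comparison is arranged so that addr + len is never computed,
 * which means a huge len cannot wrap around and slip past the check. */
bool user_range_ok(uint32_t addr, uint32_t len)
{
    if (addr >= USER_LIMIT)
        return false;
    return len <= USER_LIMIT - addr;
}
```

A check like this only shows that the range is one a user program is allowed to name; it does not show that the memory is actually mapped.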
Not all application programs are permitted to do everything, and the
code for a system call must check the caller's "capabilities" when
required (most of the time, a process running for the root super-user
can do anything).
The kernel code running a system call is in process context, the least
restrictive context for kernel code. System call code can do things
that may require the thread to sleep while some I/O event happens, or
can access virtual memory that may need to be paged in.
When copying data in and out, you're dependent on pointers provided
by the application, and they may be garbage. If we use a bad pointer,
we'll get an exception, which could crash the kernel—not acceptable.
So copying inputs from user space or outputs to user space can be
done safely with functions copy_to_user()/copy_from_user(), provided
for that purpose. If the user did pass a bad address, that will cause a
kernel exception.
The kernel, though, maintains a list of the functions that are
trusted to do dangerous things: copy_to_user() and copy_from_user() are
on that list. When the exception handler sees that the exception
restart address is in one of these functions, it returns to a carefully
constructed error handler. The application will be sent a rude signal,
but the kernel survives.
(It's actually a list of instruction address ranges.)
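The lookup just described can be modeled in a few lines of user-space C. This is a sketch of the idea only, following the description above of a list of instruction address ranges. The type fixup_range, the function find_fixup, and the table contents are all hypothetical, and the real kernel's table layout may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: a range of trusted instruction addresses
 * and the error handler to resume at if a fault occurs inside it. */
struct fixup_range {
    uintptr_t start;    /* first instruction address covered */
    uintptr_t end;      /* one past the last address covered */
    uintptr_t handler;  /* where to continue after a fault in the range */
};

/* Given the exception restart address, return the matching error
 * handler, or 0 if the fault did not happen in a trusted function. */
uintptr_t find_fixup(const struct fixup_range *tbl, size_t n, uintptr_t epc)
{
    for (size_t i = 0; i < n; i++) {
        if (epc >= tbl[i].start && epc < tbl[i].end)
            return tbl[i].handler;
    }
    return 0;
}
```

In this model a result of 0 stands for the case where the faulting code is not on the list, so the fault is a genuine kernel bug rather than a bad pointer from an application.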
Return from a system call is done by a return through the end of the
syscall exception handler, which ends with an eret. The important thing
is that the change back to user mode and return to the user instruction
space happen together.
Figure 14.1. Memory map for a Linux Thread
How Addresses Get Translated in Linux/MIPS Systems
Before getting down to how it's done, we should get an overview of the
job. Figure 14.1 above sketches out the memory map seen by a Linux
thread on a 32-bit Linux/MIPS system. The map has to fit onto what the
hardware does, so user-accessible memory is necessarily in the bottom
half of the map.
(The 64-bit Linux/MIPS map is built on the same principles, but the
more complicated hardware map makes it relatively confusing.)
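As a rough illustration of the hardware map that the thread's view has to fit onto, the following C sketch classifies a 32-bit virtual address by region. The region names (kuseg, kseg0, kseg1, kseg2) are the conventional MIPS32 ones and are not introduced in this section. The function name mips32_segment is invented for the example, and the top region is treated as a single block although it is sometimes subdivided further.

```c
#include <stdint.h>

/* Hypothetical helper: name the MIPS32 region a virtual address falls in.
 * Only kuseg, the bottom half of the map, is accessible in user mode. */
const char *mips32_segment(uint32_t vaddr)
{
    if (vaddr < 0x80000000u) return "kuseg";  /* user space, translated */
    if (vaddr < 0xA0000000u) return "kseg0";  /* kernel, untranslated, cached */
    if (vaddr < 0xC0000000u) return "kseg1";  /* kernel, untranslated, uncached */
    return "kseg2";                           /* kernel, translated */
}
```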