Getting down to basics: Running Linux on a 32-/64-bit RISC architecture - Part 2 -

Getting down to basics: Running Linux on a 32-/64-bit RISC architecture – Part 2


Let me tell you a story. Well, in fact, let me tell you severalstories/threads illustrating how the MIPS hardware provides thelow-levelfeatures that prop up the Linux kernel – 

Story/thread #1 -The life and times of an interrupt: What happens when some piece ofhardware signals the CPU that it needs attention? Interrupts can causechaos when an interrupt-scheduled piece of code interrupts another codehalfway through some larger operation, so that leads on to a section onthreads, critical regions, and atomicity.

Story/thread #2 – What happens on a systemcall: After a userland application programcalls a kernel subroutine, what happens and how is that kept secure?

Story/thread #3 – How addresses get translated in Linux/MIPS: A fairly long story ofvirtual memory and how the MIPS hardware serves the Linux memory map.

(The first two threads are coveredin thispart in this series and the third covered next in Part 4. Part 1, ifyou remember, was a highlevel view of GNU/Linux and how it operates on the MIPS 32k/64karchitecture. )

Thread #1: The Life and Times of anInterrupt
It all starts when something happens in the real world: Maybe youtapped a key on the keyboard. The device controller hardware picks upthe data and activates an interrupt signal.

That wends its way – probably via various pieces of hardware outsidethe CPU, which are all different and not interesting right now – untilit shows up as the activation of one of the CPU's interrupt signals,most often one of Int0-5*.

The CPU hardware polls those inputs on every clock cycle. The Linuxkernel sometimes disables interrupts, but not often; most often, theCPU is receptive. The CPU responds by taking an interrupt exception thenext time it's ready to fetch an instruction.

It's sometimes difficult to get across just how fast this happens. A500-MHz MIPS CPU presented with an interrupt will fetch the firstinstruction of the exception handler (if in cache) within 3 to 4clocks: That's 8 nanoseconds.

Seasoned low-level programmers know that you can't depend on an 8-nsinterrupt latency in any system, so we'll discuss below what gets inthe way. As the CPU fetches instructions from the exception entrypoint, it also flips on the exception state bit SR(EXL), which willmake it insensitive to further interrupts and puts it inkernel-privilege mode. It will go to the general exception entry point,at 0x8000.0180.1

(Of course, it can't be thatsimple; some MIPS CPUs nowprovide a dedicated interrupt-exception entrypoint, and some will even go to a different entry point according towhich interrupt signal was asserted. But we'll keep it simple – manyOSs still do. )

The general exception handler inspects the Cause register and inparticular the Cause(ExcCode) field: That has a zero value, showingthat this was an interrupt exception. The Cause(IP7-2) field showswhich interrupt input lines are active, and SR(IM7-2) shows which ofthese are currently sensitized.

There ought to be at least one active, enabled interrupt; thesoftware calculates a number corresponding to the number of the activeinterrupt input that will become Linux's irq number.

(In many cases, the exceptionhandler may consult system-dependent external registers to refine theinformation about the interrupt number. )

Before calling out into something more like generic code, the mainexception handler saves all the integer register values – they belongto the thread that was just interrupted and they will need to berestored before that thread is allowed to continue, somewhere in thesystem's future.

(Some MIPS CPUs can avoid thisoverhead for very low level interrupt handlers using shadow Registers ).

The values are saved on the kernel stack of the thread that wasrunning when the interrupt happened, and the stack pointer is setaccordingly. Some key CP0 register values are saved too; those includeSR, EPC, and Cause.

With the SR(EXL) bit set, we're not really ready to call routines inthe main kernel. In this state we can't take another exception, so wehave to be careful to steer clear of any mapped addresses, for example.

(This is not quite inevitable. TheMIPS architecture has some carefully designed corners that make itpossible to nest exceptions in carefully controlled conditions. ButLinux doesn't use them .)

Now that we have saved registers and established a stack we canchange SR, clearing SR(EXL) but also clearing SR(IE) to avoid taking asecond interrupt – after all, the interrupt signal we responded to isstill active.

Now we're ready to call out to do IRQ(). It's a machine-dependentroutine but written in C, with a fairly well standardized flow. doIRQ() is passed a structure full of register values as a parameter. Itsjob is to selectively disable this interrupt in particular andacknowledge any interrupt management hardware that needs a confirmationthat the interrupt has been serviced.

Assuming there is a device ISR (interrupt service routine)registered on this irq number, do IRQ() calls on to handle IRQ event().It's undesirable to run for long with all interrupts disabled; someother device out there may need fast response to its interrupt.

But we're now into an area where it all depends on the nature of thedevice whose interrupt we're handling. If this handler is known to bevery short and efficient (SA INTERRUPT set), we can just go ahead andrun it with all interrupts disabled; if it's going to take longer, thenwe'll re-enable interrupts in general – do IRQ() has disabled theparticular interrupt we're handling.

Multiple interrupt handlers can be registered for this one irqnumber; if so, they're called in turn.

Once the interrupt handlers are finished, handle IRQ event()disables interrupts again and returns (that's what do IRQ() expects).After some more opportunity for machine-dependent tidying up, thatcalls ret from intr() to unwind the interrupt exception.

Before finally returning from the interrupt, we check whethersomething that happened in an ISR means that we should call thescheduler. Linux has a global flag, needs resched, for this purpose;the code is not quite so simple as that because rescheduling from athread interrupted in kernel mode (kernel preemption) may causetrouble.

In fact, if the interrupted thread was executing with kernelprivilege – as shown by the saved value of SR(KSU) – we won'treschedule if the global preemt count is nonzero.

If the system doesn't schedule another thread, we just carry on andreturn from the interrupt. We restore all the saved register values. Inparticular, we restore the post exception value of SR to take us backinto exception mode, and we restore EPC, which holds the restartlocation. Then we just run a MIPS eret.

If the system does schedule another thread, then our interruptedthread is parked exactly as it is, teetering on the edge of returningfrom the interrupt. When our interrupt-victim thread is selected to runagain, it will burst out, back to whatever it was doing wheninterrupted.

Some device drivers just grab a little data from the device and sendit up to the next level; others may do quite a lot of processing atinterrupt time. But ISRs should not run for too long in interruptcontext.

Even after we've re-enabled other interrupts, the driver hasunconditionally seized the attention of the CPU without the OSscheduler getting to apply its policies.Where there's more than a fewtens of instructions' worth of work to do in the ISR, it would bebetter to defer that interrupt processing until the scheduler candecide whether there may be something more vital to do.

Linux traditionally splits the ISR into a “top half,” run atinterrupt time, and a “bottom half,” deferred until after arescheduling opportunity. (The name could be confusing. Linux used tohave a subsystem called “bottom half ” and with functions whose namesstarted “bh,” but that's obsolete and was finally omitted from the 2.6kernel. But we still use “bottom half ” as a generic term for workmoved out of the interrupt handler.

Modern Linux kernels have a system called softirq, which can arrangeto run one of 32 functions; a couple of very demanding devices havetheir own softirq bottom halves, but most share a more flexible systemcalled tasklets, built on top of softirqs.

Both of these are implemented in machine-independent code and areways to arrange that a secondary interrupt handler will run soon afterthe real ISR returns.

A tasklet is ultimately called from the interrupt handler and can'tdo anything that might cause the unrelated interrupt-victim thread tosleep. Tasklet codemust not do anything that might need waiting for.But there's always more than one way to do it. An ISR can also putextra driver tasks on a work queue.

The work queue is serviced by a regular kernel thread, and workqueue functions can do anything, even operations that might make thethread sleep. Since this is all machine-independent, we won't dwell onthe details here; it's described well in Linux Kernel Development byRobert Love (Sam's Publishing).

High-PerformanceInterrupt Handling. Compared with a lightweight OS,there's a fair amount of extra overhead in a Linux ISR. Linuxunconditionally saves all registers and makes several levels offunction calls (for example, isolating interrupt controller handling indo IRQ()).

It's possible to do useful work in a MIPS interrupt handler thatnever leaves exception mode (that is, it keeps SR(EXL) set) – butthat's not directly supported by conventional Linux.

It's also possible to build a MIPS OS that is careful never todisable all interrupts for more than a handful of instructions, butLinux relies on disabling all interrupts at the CPU to avoid interruptsat uncomfortable places.

There are several efforts out there to make Linux more responsive tointerrupts. While reducing the worst-case interrupt latency – the timeit takes from the hardware interrupt being asserted to the ISR beingentered – is important, it's not the whole job. To be responsive, thesystem also has to be prompt about scheduling the thread that isconcerned with the input or output.

One project (the “low-latency patches” project) is evolutionary,aiming to improve average performance by a patient and diligent processof refining kernel locking/preemption control. Some of the low-latencywork is in the mainstream 2.6+ Linux kernel.

Other projects (various kinds of “real-time” Linux) arerevolutionary, commonly jacking up the Linux OS so that it doesn't seereal interrupts at all. Interrupts are handled underneath by a sort ofRTOS, of which the whole Linux kernel is but one thread.

When Linux disables interrupts, it doesn't really disable hardwareinterrupts; it just prevents Linux ISRs from being scheduled by theRTOS layer. It's worth noting that these two approaches arecomplementary – there's no reason not to pursue both.

Threads,Critical Regions, and Atomicity. A thread in Linux has to beaware that the data it's manipulating may also be getting attentionfrom other, concurrent, activity. One of the main difficultiesencountered in implementing a multitasking OS is finding and protectingall such sequences. The concurrent activity can come from:

Code run by another CPU: Ina real multiprocessor system

Code run in interrupt context: Thatis, code run on this same CPU as a result of an interrupt.

Code run by another thread thatpreempts this one: More subtle, this is code run by some otherkernel thread which (no doubt as a result of some interrupt) gotscheduled over ours

The problem arises when a sequence of operations on data will beselfconsistent when completed, but transiently creates someinconsistent state that other software cannot safely interpret oroperate on.

What software wants is the ability to mark a data transformation asatomic: Any outside view should either see the whole change made, ornone of it. Linux provides a small range of simple atomic operations:Those operations can be specially implemented for a particulararchitecture, if the architecture offers some particularly neat way ofdoing them.

But to make complex operations atomic, Linux uses locks. A lock is aparticularly simple form of what is called a semaphore in the moregeneral case ” but it probably helps to start with Linux's simpler andmore minimal definition, to wit:

A lock is a ticket to operate on a particular chunk of data, and solong as all software accessing the data cooperates, only one threadwill get to work on it at once. Any contending accessor will be heldwhen they try to acquire the lock for the data.

The piece of code that does the want-to-be atomic sequence ofoperations on the data is called a critical region. So what we'll do isto get anyone wanting to access contended data to do something likethis:

/* do your stuff in the criticalregion */

Every time this code is called, there's some time spent in thoseacquire and release operations, however they are implemented. Whencontention actually happens, the thread arriving at the acquire pointand finding the lock currently unavailable must be held up, and theother thread arriving at the release must somehow pass the message onthat it's now OK to continue.

When the contender is known to be running on a separate CPU and thecritical region consists of a handful of instructions, it may make mostsense to get the temporarily frustrated acquirer to spin in a loopwatching for the condition to clear. That's a spinlock, and Linuxprovides them for SMP.

When the contender could be a separate thread on the same CPU, youcan't spin (while the waiting thread spins, the thread holding the lockcould not be rescheduled, never gets to finish and release the lock,sowe're all deadlocked forever).

So you need amore heavyweight lock, where a thread that fails toacquire the lock marks itself as not currently runnable and calls theOS scheduler.

Moreover, the release lock() routine must arrange that any otherthread waiting for the lock is told it can have another go. The codethat suspends and wakes Linux threads is not CPU dependent. But thecode that tests the lock state and sets the lock in the commonuncontended case depends on an atomic test-and-set operation, whoseimplementation is done with special MIPS tricks, described below.

Where the contender might be in interrupt context (called, ultimately, from the interruptentry point and borrowing the context of some randomly choseninterrupted thread to do so), Linux depends on disabling therelevant interrupt during the critical region.

And since it's difficult to mask just one interrupt, that oftenmeans disabling all interrupts around the critical region. As mentionedpreviously, code called in interrupt context is expected to bedisciplined and avoid doing things that will cause trouble.

In fact, interrupt handlers are expected to have only simple,stylized interactions with the rest of the kernel. What does the MIPSarchitecture bring to help implement simple atomic operations andlocks?

MIPSarchitecture and atomic operations. MIPS has theload-linked/store-conditional pair. You use them to implement anarbitrary read-modify-write sequence on a variable (just read using ll and write using sc).

The sequence is not atomic, in itself. But the store will do nothingunless it is certain that the sequence did run atomically, and screturns a value the software can test. If the test shows the sc failed,the software can retry the RMW sequence until it finally does succeed.

These instructions were invented primarily for multiprocessorsystems, as an alternative to a guaranteed-atomic RMW operation.They're a good choice there, because they avoid the overhead of asystem-wide atomic transaction, which can stop every memory access in alarge multiprocessor to make sure none of those accesses touch the dataof interest.

The implementation is fairly simple. ll sets a per-CPU link bit and(for multiprocessor or hardware multithreading) records the loadaddress so the CPU can monitor accesses to it. sc will succeed – doingthe store and returning a “1” value – only if the link bit is stillset. The CPU must clear the link bit if it detects that the variable is(or may have been) updated by some unrelated software.

In a multiprocessor system, the external-access detection isimplemented by the snooping logic that keeps the caches coherent. Butwithin a single CPU, the link is broken by any exception. (In most implementations, the link bit isin fact cleared by the eret instruction, which is essential to returnfrom the exception.)

Linux's atomic inc(&mycount) uses the instructions to do anatomic increment of any integer variable:

atomic_inc:     ll v0, 0(a0)                          # a0 has pointer to 'mycount'     addu v0, 1     sc v0, 0(a0)     beq v0, zero,atomic_inc
    nop     jr ra     nop

If you emerge from this routine, you emerge having added 1 tomycount in a sequence that – it is guaranteed – was not interfered withby any SMP partner CPU, a local interrupt allowing other software torun, or another thread running on a hardware-multithreaded machine.

It is possible for the routine to spin for a period of time if it iscontinually frustrated by external writes or interrupts, but – becausethe sequence between the ll and sc is very short – it's extraordinarilyunlikely to have to try as many as three times.

You can probably see from the above that the ll/sc test is notsuitable for use on very complicated operations, where the load andstore are far apart. If you wait long enough, all links are broken.

Linux Spinlocks. Spinlocks use an atomic test-and-set operation to implement thesimplest, cheapest locking primitive that protects multiple CPUs fromeach other. It's an exercise for the reader to track down how ll/sc canbe used to construct an efficient spinlock.

Unlike the atomic operations themselves, spinlocks are quite uselesson uniprocessor (that is, single-threaded uniprocessor) systems.

Historically, Linux kernels were built to be SMP-safe before kernelpreemption was permitted. “Kernel preemption” is when an interruptcauses a thread running in the kernel to be descheduled and another onerun.

It's quite possible to build a Linux system in which a thread in thekernel runs either until its timeslice expires, it sleeps voluntarily(waiting some event), or it attempts to return to user mode.

In fact, standard Linux kernels before 2.6 did that and were stillmore responsive than a contemporary Windows system with a backgroundtask running. But kernel preemption is better still, if you can make itwork properly.

Along the road to the 2.6 release of the kernel, George Anzingerobserved that almost all the critical regions that need softwareprotection from kernel preemption also needed protection from SMPconcurrency, so these regions are already delimited by spinlockoperations.

Spinlocks are macros, so they can be disabled (and will compile away to nothing)for a uniprocessor system. When a 2.6+ kernel is built with the CONFIGPREEMPT configuration defined, the spinlock operations gain the sideeffect of disabling preemption between their acquire and release.

Well, that was nearly it. “Almost all” is a dangerous phrase. Justoccasionally, a piece of data manipulation is inherently SMP-safe (mostobviously when it addresses per-CPU data) so it doesn't need spinlocks,but it does need defense against preemption, since other kernel code onthe same CPU is accessing the same data.

Making the 2.6 kernel reliable with kernel preemption still requireda fair amount of patient combing through code and even more patientdebugging.

Next in Part 3: What happens on aSystem Call
To read Part 1, go to “GNU/Linux from eight miles high”

Thisseries of articles is based on material from “SeeMIPS Run Linux,” by Dominic Sweetman, used with the permission ofthepublisher, Morgan Kaufmann/Elsevier, which retains full copyrights. Itcan be purchased on line.

Dominic Sweetman is asoftware/hardware boundary expert based in London, England, whopreviously served as managing director at Algorithmics Ltd.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.