There are subtler issues beyond those discussed in
We work at that fuzzy interface between hardware and software, which creates additional problems due to the interactions of our code and the device. Some cause erratic and quite impossible-to-diagnose crashes that infuriate our customers.
The worst bugs of all are those that appear infrequently, that can't be reproduced. Yet a reliable system just cannot tolerate any sort of defect, especially the random one that passes our tests, perhaps dismissed with the “ah, it's just a glitch” behavior.
Potential evil lurks whenever hardware and software interact asynchronously. That is, when some physical device runs at its own rate, sampled by firmware running at some different speed.
I was poking through some open-source code and came across a typical example of asynchronous interactions. The RTEMS real-time operating system provided by OAR Corporation is a nicely written, well organized product with a lot of neat features.
But the timer handling routines, at least for the 68302 distribution, are flawed in a way that will fail infrequently but possibly catastrophically. This is just one very public example of the problem I constantly see buried in proprietary firmware.
The code is simple and straightforward, and looks much like any other timer handler.
There's an interrupt service routine invoked when the 16-bit hardware timer overflows. The ISR services the hardware, increments a global variable named timer_hi, and returns.
Therefore, timer_hi maintains the number of times the hardware counted to 65536. Function read_timer returns the current “time” (the elapsed time in microseconds as tracked by the ISR and the hardware timer). It, too, is delightfully free of complications.
Like most of these sorts of routines, it reads the current contents of the hardware's timer register, shifts timer_hi left 16 bits, and adds in the value read from the timer. That is, the current time is the concatenation of the number of overflows and the timer's current value.
Suppose the hardware rolled over 5 times, creating five interrupts. timer_hi equals 5. Perhaps the internal register is, when we call read_timer, 0x1000. The routine returns a value of 0x51000. Simple enough and seemingly devoid of problems.
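A minimal sketch of the handler and reader just described may help. This is not the RTEMS source: timer_count is an ordinary variable standing in for the memory-mapped 16-bit timer register (an assumption so the fragment runs on a desktop), and only timer_hi, read_timer and the ISR's behavior come from the description above.

```c
#include <stdint.h>

/* Assumed stand-in for the hardware: on a real 68302 this would be a
   memory-mapped 16-bit timer register, not a plain variable. */
static volatile uint16_t timer_count;

/* Number of times the hardware timer has overflowed (counted to 65536). */
static volatile uint32_t timer_hi;

/* Invoked when the 16-bit timer overflows. A real ISR would also
   service (acknowledge) the timer hardware here. */
void timer_isr(void)
{
    timer_hi++;
}

/* Naive reader: concatenate the overflow count and the current count. */
uint32_t read_timer(void)
{
    uint16_t low = timer_count;       /* read the hardware register     */
    return (timer_hi << 16) + low;    /* overflows in the upper 16 bits */
}
```

With five overflows and 0x1000 in the register, this returns 0x51000, matching the example above.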
But let's think about this more carefully. There are really two things going on at the same time. Not concurrently, which means “apparently at the same time,” as in a multitasking environment where the RTOS doles out CPU resources so all tasks appear to be running simultaneously.
No, in this case the code in read_timer executes whenever called, and the clock-counting timer runs at its own rate. The two are asynchronous.
A fundamental rule of hardware design is to panic whenever asynchronous events suddenly synchronize. For instance, when two different processors share a memory array there's quite a bit of convoluted logic required to ensure that only one gets access at any time. If the CPUs use different clocks the problem is much trickier, since the designer may find the two requesting exclusive memory access within fractions of a nanosecond of each other.
This is called a “race” condition and is the source of many gray hairs and dramatic failures. One of read_timer's race conditions might be:
It reads the hardware and gets, let's say, a value of 0xffff.
Before having a chance to retrieve the high part of the time from variable timer_hi, the hardware increments again to 0x0000.
The overflow triggers an interrupt. The ISR runs; timer_hi is now 0x0001, not 0 as it was just nanoseconds before.
The ISR returns. Our fearless read_timer routine, with no idea an interrupt occurred, blithely concatenates the new 0x0001 with the previously read timer value of 0xffff, and returns 0x1ffff, a hugely incorrect value.
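The sequence above can be replayed deterministically in a small simulation. Everything here is an assumption made for illustration: hw_count models the 16-bit counter, hw_tick models one clock tick (calling the ISR on rollover), and racy_read_replay is a hypothetical name for a reader with the tick forced to land between its two reads.

```c
#include <stdint.h>

static uint16_t hw_count;   /* simulated 16-bit hardware counter     */
static uint32_t timer_hi;   /* overflow count maintained by the ISR  */

static void timer_isr(void)
{
    timer_hi++;
}

/* Advance the simulated hardware one tick; a rollover fires the ISR. */
static void hw_tick(void)
{
    hw_count++;
    if (hw_count == 0)
        timer_isr();
}

/* Replays the race: the tick lands between the low and high reads. */
uint32_t racy_read_replay(void)
{
    hw_count = 0xffff;
    timer_hi = 0;

    uint16_t low = hw_count;     /* reader gets 0xffff                   */
    hw_tick();                   /* counter rolls to 0x0000, ISR runs    */
    uint32_t high = timer_hi;    /* reader now sees 1, not 0             */

    return (high << 16) + low;   /* stale low half, fresh high half      */
}
```

The result is 0x1ffff, nearly a full timer period beyond either 0xffff or 0x10000, the two values the timer actually passed through.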
Alternatively, suppose read_timer is called during a time when interrupts are disabled, say, if some other ISR needs the time. One of the few perils of writing encapsulated code and drivers is that you're never quite sure what state the system is in when the routine gets called. In this case:
read_timer starts. The timer is 0xffff with no overflows.
Before much else happens it counts to 0x0000. With interrupts off the pending interrupt gets deferred.
read_timer returns a value of 0x0000 instead of the correct 0x10000, or the reasonable 0xffff.
So the algorithm that seemed so simple has quite subtle problems, necessitating a more sophisticated approach. The RTEMS RTOS, at least in its 68k distribution, will likely create infrequent but serious errors.
Sure, the odds of getting a misread are small. In fact, the chance of getting an error plummets as the frequency with which we call read_timer decreases. How often will the race condition surface? Once a week? Monthly?
Many embedded systems run for years without rebooting. Reliable products must never contain fragile code. Our challenge as designers of robust systems is to identify these sorts of issues and create alternative solutions that work correctly, every time.
What options are available?
Fortunately a number of solutions do exist. The easiest is to stop the timer before attempting to read it. There will be no chance of an overflow putting the upper and lower halves of the data out of sync. This is a simple and guaranteed solution, but we will lose time. Since the hardware generally counts the processor's clock, or clock divided by a small number, it may lose quite a few ticks during the handful of instructions executed to do the reads.
The problem will be much worse if an interrupt causes a context switch after disabling the counting. Turning interrupts off during this period will eliminate unwanted tasking, but increases both system latency and complexity.
I just hate disabling interrupts: system latency goes up, and sometimes the debugging tools get a bit funky. When reading code a red flag goes up if I see a lot of disable interrupt instructions sprinkled about. Though not necessarily bad, it's often a sign that either the code was beaten into submission (made to work by heroic debugging instead of careful design), or there's something quite difficult and odd about the environment.
Another solution is to read the timer_hi variable, then the hardware timer, and then reread timer_hi. An interrupt occurred if the two values read from timer_hi aren't identical. Iterate until the two variable reads are equal. The upside: correct data, interrupts stay on, and the system doesn't lose counts.
The downside: in a heavily loaded, multitasking environment, it's possible that the routine could loop for rather a long time before getting two identical reads. The function's execution time is nondeterministic. We've gone from a very simple timer reader to somewhat more complex code that could run for milliseconds instead of microseconds.
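The read, read, reread approach might be sketched as follows. The name read_timer_retry is hypothetical, timer_count again stands in for the hardware register, and the sketch assumes interrupts are enabled on entry and that timer_hi can be read in one indivisible access on the target.

```c
#include <stdint.h>

/* Assumed stand-ins, as in a desktop simulation: on real hardware
   timer_count would be the memory-mapped 16-bit timer register. */
static volatile uint16_t timer_count;
static volatile uint32_t timer_hi;    /* incremented by the overflow ISR */

/* Read timer_hi, then the hardware, then timer_hi again. If the two
   timer_hi reads differ, an overflow interrupt slipped in between,
   so throw the sample away and try again. */
uint32_t read_timer_retry(void)
{
    uint32_t high, high_again;
    uint16_t low;

    do {
        high       = timer_hi;
        low        = timer_count;
        high_again = timer_hi;
    } while (high != high_again);     /* unbounded: nondeterministic time */

    return (high << 16) + low;
}
```

The loop has no upper bound on iterations, which is exactly the nondeterminism described above; it also does nothing for a caller that already has interrupts disabled.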
Another alternative might be to simply disable interrupts around the reads. This will prevent the ISR from gaining control and changing timer_hi after we've already read it, but creates another issue.
We enter read_timer and immediately shut down interrupts. Suppose the hardware timer is at our notoriously problematic 0xffff, and timer_hi is zero. Now, before the code has a chance to do anything else, the overflow occurs. With context switching shut down we miss the rollover.
The code reads a zero from both the timer register and from timer_hi, returning zero instead of the correct 0x10000, or even a reasonable 0x0ffff. Yet disabling interrupts is probably indeed a good thing to do, despite my rant against this practice.
With them on there's always the chance our reading routine will be suspended by higher priority tasks and other ISRs for perhaps a very long time. Maybe long enough for the timer to roll over several times. So let's try to fix the code. Consider the following:
unsigned int low,high;
We've made three changes to the RTEMS code. First, interrupts are off, as described. Second, you'll note that there's no explicit interrupt re-enable. Two new pseudo-C statements have appeared, which push and pop the interrupt state. Trust me for a moment: this is just a more sophisticated way to manage the state of system interrupts.
The third change is a new test that looks at something called “timer_overflow,” an input port that is part of the hardware. Most timers have a testable bit that signals an overflow took place. We check this to see if an overflow occurred between turning interrupts off and reading the low part of the time from the device. With an inactive ISR, the variable timer_hi won't properly reflect such an overflow.
We test the status bit and reread the hardware count if an overflow had happened. Manually incrementing the high part corrects for the suspended ISR. The code then concatenates the two fixed values and returns the correct result, every time. With interrupts off we have increased latency. However, there are no loops; the code's execution time is entirely deterministic.
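For readers who want something runnable, here is one way the three changes could look in plain C. It is a sketch under stated assumptions, not the author's listing: timer_count, timer_overflow and interrupts_enabled are simulated with ordinary variables, and push_and_disable_interrupts and pop_interrupt_state are hypothetical helpers standing in for the push and pop pseudo-C statements.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simulated hardware and CPU state (assumptions for a desktop build). */
static volatile uint16_t timer_count;     /* 16-bit timer register       */
static volatile bool     timer_overflow;  /* overflow status bit         */
static volatile uint32_t timer_hi;        /* bumped by the overflow ISR  */
static bool interrupts_enabled = true;    /* CPU interrupt mask          */

/* Hypothetical helpers: save the interrupt state and disable, then
   restore whatever state the caller had (no unconditional re-enable). */
static bool push_and_disable_interrupts(void)
{
    bool previous = interrupts_enabled;
    interrupts_enabled = false;
    return previous;
}

static void pop_interrupt_state(bool previous)
{
    interrupts_enabled = previous;
}

uint32_t read_timer_fixed(void)
{
    bool saved = push_and_disable_interrupts();

    uint16_t low  = timer_count;
    uint32_t high = timer_hi;

    if (timer_overflow) {       /* rollover the ISR has not yet serviced */
        high++;                 /* account for it by hand                */
        low = timer_count;      /* reread: the old low may predate it    */
    }

    pop_interrupt_state(saved);
    return (high << 16) + low;
}
```

In the troublesome case (counter just rolled to 0x0000, overflow bit set, timer_hi still zero) this returns 0x10000, and a caller that entered with interrupts off leaves with them still off.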
Unhappily, race conditions occur anytime we need more than one read to access data that's changing asynchronously to the software. If you're reading X and Y coordinates, even with just 8 bits of resolution, from a moving machine there's some peril they could be seriously out of sync if two reads are required. A ten-bit encoder managed through byte-wide ports potentially could create a similar risk.
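To make the ten-bit encoder case concrete, here is a small simulation of a torn read. The port layout is an assumption (low eight bits on one port, top two bits on another), and encoder_pos, read_low_port, read_high_port and torn_encoder_read are hypothetical names used only for illustration.

```c
#include <stdint.h>

static uint16_t encoder_pos;   /* simulated 10-bit position, 0..1023 */

/* Assumed byte-wide ports: low 8 bits on one, top 2 bits on the other. */
static uint8_t read_low_port(void)
{
    return (uint8_t)(encoder_pos & 0xff);
}

static uint8_t read_high_port(void)
{
    return (uint8_t)((encoder_pos >> 8) & 0x03);
}

/* The shaft moves a single count between the two port reads. */
uint16_t torn_encoder_read(void)
{
    encoder_pos = 0x0ff;
    uint8_t low = read_low_port();     /* 0xff, from position 0x0ff */

    encoder_pos = 0x100;               /* one count of motion       */
    uint8_t high = read_high_port();   /* 0x01, from position 0x100 */

    return (uint16_t)((high << 8) | low);
}
```

The encoder moved by one count, from 0x0ff to 0x100, yet the software sees 0x1ff, roughly half of full scale away from either true position.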
Having dealt with this problem in a number of embedded systems over the years, I wasn't too shocked to see it in the RTEMS RTOS. It's a pretty obscure issue, after all, though terribly real and potentially deadly.
For fun I looked through the source of uC/OS, another very popular operating system whose source is on the net (
Some of you, particularly those with hardware backgrounds, may be clucking over an obvious solution I've yet to mention. Add an input capture register between the timer and the system; the code sets a “lock the value into the latch” bit, then reads this safely unchanging data. The register is nothing more than a parallel latch, as wide as the input data.
A single clock line drives each flip-flop in the latch; when strobed, it locks the data into the register. The output is fed to a pair of processor input ports.
When it's time to read a safe, unchanging value the code issues a “hold the data now” command, which strobes encoder values into the latch. So all bits are stored and can be read by the software at any time, with no fear of things changing between reads. Some designers tie the register's clock input to one of the port control lines.
The I/O read instruction then automatically strobes data into the latch, assuming one is wise enough to ensure the register latches data on the leading edge of the clock.
The input capture register is a very simple way to suspend moving data for the duration of a couple of reads. At first glance it seems perfectly safe. However, a bit of analysis shows that for asynchronous inputs it is not reliable. We're using hardware to fix a software problem, so we must be aware of the limitations of physical logic devices.
To simplify things for a minute, let's zoom in on that input capture register and examine just one of its bits. Each gets stored in a flip-flop, a bit of logic that might have only three connections: data in, data out, and clock. When the input is a one, strobing clock puts a one at the output.
However, suppose the input changes at about the same time clock cycles. What happens?
The short answer is that no one knows.
Next in Part 3: Metastable States and interrupt latency
To read Part 1 in this series, go to
Jakob Engblom (firstname.lastname@example.org) is technical marketing manager at Virtutech
He was a contributor of material to “