Asynchronicity - Embedded.com

Asynchronicity

Race conditions are surprisingly easy to create, even when programmers know of their dangers. Here's a frequent timer bug to watch out for.

What makes firmware reliable? Certainly, a lot of requirements must be met. One is that it works properly at the boundary conditions, when parameters passed to functions are at extreme values, or, in a real-time system, when system loading skyrockets to its max. Too many systems become brittle when pushed to the edge; worse, it's almost impossible to create tests for these rare-but-possible conditions.

We work at that fuzzy interface between hardware and software, which creates additional problems due to the interactions of our code and the device. Some cause erratic and quite-impossible-to-diagnose crashes that infuriate our customers. The worst bugs of all are those that appear infrequently, that can't be reproduced. Yet a reliable system just cannot tolerate any sort of defect, especially the random one that passes our tests, perhaps dismissed with the “ah, it's just a glitch” excuse.

Potential evil lurks whenever hardware and software interact asynchronously. That is, when some physical device runs at its own rate, sampled by firmware running at a different speed.

I was poking through some open-source code recently and came across a typical example of asynchronous interactions. The RTEMS real-time operating system provided by OAR Corp. (ftp://ftp.oarcorp.com/pub/rtems/releases/4.5.0/) is a nicely written, well-organized product with a lot of neat features. But the timer handling routines, at least for the Motorola 68302 processor, are flawed in a way that will fail infrequently, but possibly catastrophically. This is just one public example of a problem I constantly see in proprietary firmware.

The code is simple and straightforward, and looks like any other timer handler. An ISR is invoked when the 16-bit hardware timer overflows. The ISR services the hardware, increments a global variable named Timer_interrupts, and returns. So Timer_interrupts maintains the number of times the hardware counted to 65,536.

Function Read_timer() returns the current “time” (the elapsed time in microseconds as tracked by the ISR and the hardware timer). It, too, is delightfully free of complications. Like most of these sorts of routines, it reads the current contents of the hardware's timer register, shifts Timer_interrupts left 16 bits, and adds in the value read from the timer. That is, the current time is the concatenation of the timer's current value and the number of overflows.

Suppose the hardware rolled over five times, creating five interrupts. Timer_interrupts equals five. Perhaps the internal register is 0x1000 when we call Read_timer(). The routine returns a value of 0x51000. Simple enough and seemingly devoid of problems.
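To make the arithmetic concrete, here is a minimal host-side sketch of the naive approach described above. It is not the RTEMS source. The variable timer_register and the function name Read_timer_naive() are stand-ins I've assumed for illustration; on a real target the register would be a memory-mapped port.

```c
#include <stdint.h>

/* Assumed stand-ins: on real hardware timer_register would be a
   memory-mapped 16-bit counter, and Timer_interrupts would be
   incremented by the overflow ISR. */
static volatile uint16_t timer_register;
static volatile uint16_t Timer_interrupts;

/* Naive read: concatenate the overflow count and the current count. */
uint32_t Read_timer_naive(void)
{
    uint16_t low  = timer_register;     /* current hardware count    */
    uint16_t high = Timer_interrupts;   /* overflows seen by the ISR */
    return ((uint32_t)high << 16) | low;
}
```

With Timer_interrupts at 5 and the register at 0x1000, the function returns 0x51000, matching the example above.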

Race conditions

But let's think about this more carefully. Two things are going on at the same time. Not concurrently, which means “apparently at the same time,” as in a multitasking environment, where the RTOS doles out CPU resources so all tasks appear to be running simultaneously. No, in this case the code in Read_timer() executes whenever called, and the clock-counting timer runs at its own rate. The two are asynchronous.

A fundamental rule of hardware design is to panic whenever asynchronous events suddenly synchronize. For instance, when two different processors share a memory array, quite a bit of convoluted logic is required to ensure that only one gets access at any time. If the CPUs use different clocks the problem is much trickier, since the designer may find the two requesting exclusive memory access within fractions of a nanosecond of each other. This is called a race condition and is the source of many gray hairs and dramatic failures.

One of Read_timer()'s race conditions might cause a problem if:

  • It reads the hardware and gets, for example, a value of 0xffff.
  • Before having a chance to retrieve the high part of the time from variable Timer_interrupts, the hardware increments again to 0x0000.
  • The overflow triggers an interrupt. The ISR runs. Timer_interrupts is now 0x0001, not 0x0000 as it was just nanoseconds before.
  • The ISR returns. Our fearless Read_timer() routine, with no idea an interrupt occurred, blithely concatenates the new 0x0001 with the previously read timer value of 0xffff, and returns 0x1ffff instead of 0x0ffff or 0x10000.
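The sequence above can be replayed on a desktop machine. This is only a simulation under assumed names (hw_count, tick(), read_with_race()): a single timer tick is forced between the two reads, and the "ISR" is modeled as an increment that happens the instant the counter wraps.

```c
#include <stdint.h>

static uint16_t hw_count;          /* simulated 16-bit hardware timer */
static uint16_t Timer_interrupts;  /* overflow count kept by the ISR  */

/* One simulated timer tick. On a wrap to zero the overflow "ISR"
   runs immediately, as it would with interrupts enabled. */
static void tick(void)
{
    if (++hw_count == 0) {
        ++Timer_interrupts;
    }
}

/* Naive read with one tick forced between the two reads. */
uint32_t read_with_race(void)
{
    uint16_t low = hw_count;            /* low half read first         */
    tick();                             /* counter advances; may wrap  */
    uint16_t high = Timer_interrupts;   /* high half read after tick   */
    return ((uint32_t)high << 16) | low;
}
```

Starting from hw_count = 0xffff and Timer_interrupts = 0, read_with_race() returns 0x1ffff, nearly a full timer period away from either plausible answer.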

Or, suppose Read_timer() is called when interrupts are disabled, say, if some other ISR needs the time. One of the few perils of writing encapsulated code and drivers is that you're never quite sure what state the system is in when the routine gets called. In this case:

  • Read_timer() starts. The timer is 0xffff with no overflows.
  • Before much else happens, it counts to 0x0000. With interrupts off, the pending interrupt gets deferred.
  • Read_timer() returns a value of 0x0000 instead of the correct 0x10000, or the reasonable 0xffff.
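The interrupts-off case can be simulated the same way. In this sketch (all names are assumptions for illustration) a wrap that happens while interrupts are masked only sets a pending flag; the ISR never gets to update Timer_interrupts before the read completes.

```c
#include <stdint.h>
#include <stdbool.h>

static uint16_t hw_count;            /* simulated 16-bit hardware timer   */
static uint16_t Timer_interrupts;    /* overflow count kept by the ISR    */
static bool     interrupts_enabled;  /* simulated CPU interrupt mask      */
static bool     overflow_pending;    /* interrupt request awaiting service */

/* One simulated tick: a wrap either runs the ISR or is deferred. */
static void tick(void)
{
    if (++hw_count == 0) {
        if (interrupts_enabled) {
            ++Timer_interrupts;      /* ISR runs at once        */
        } else {
            overflow_pending = true; /* ISR deferred until later */
        }
    }
}

/* Naive read where the timer ticks once before either half is read. */
uint32_t read_with_interrupts_off(void)
{
    tick();                              /* 0xffff wraps to 0x0000   */
    uint16_t low  = hw_count;            /* reads the wrapped count  */
    uint16_t high = Timer_interrupts;    /* still the stale count    */
    return ((uint32_t)high << 16) | low;
}
```

With interrupts masked, hw_count = 0xffff and Timer_interrupts = 0, the function returns 0x0000 while the overflow sits pending and unserviced.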

So the algorithm that seemed so simple has subtle problems, necessitating a more sophisticated approach. The RTEMS RTOS, at least in its 68k distribution, might create rare but serious errors.

Sure, the odds of getting a mis-read are small. In fact, the chance of getting an error plummets as the frequency with which we call Read_timer() decreases. How often will the race condition surface? Once a week? Monthly?

Many embedded systems run for years without rebooting. Reliable products must never contain fragile code. Our challenge as designers of robust systems is to identify these sorts of issues and create alternative solutions that work correctly, every time.

Just weeks ago an engineer told me his team spent three months tracking down this sort of race condition, also in a timer driver. The bug appeared so infrequently it seemed a ghost, but their safety-critical product could not crash, ever. Can you imagine the cost of three extra months of debugging?

Options

Fortunately, a number of solutions do exist. The easiest is to stop the timer before attempting to read it. There will be no chance of an overflow putting the upper and lower halves of the data out of sync. This is a simple and guaranteed solution.

The timer, though, will start to lose track of time, since we've disabled it for at least a number of microseconds. The problem will be much worse if an interrupt causes a context switch after disabling the counting. Turning interrupts off during this period will eliminate unwanted tasking, but increases both system latency and complexity.

I just hate disabling interrupts; system latency goes up and sometimes the debugging tools get a bit funky. When reading code, a red flag goes up if I see a lot of disable interrupt instructions sprinkled about. Though not necessarily bad, it's often a sign that either the code was beaten into submission (made to work by heroic debugging instead of careful design), or there's something quite difficult and odd about the environment.

Another solution is to read the Timer_interrupts variable, then the hardware timer, and then re-read Timer_interrupts. An interrupt occurred if the two reads of the variable aren't identical. Repeat until the two reads are equal. The upside: correct data, interrupts stay on, and the system doesn't lose counts. The downside: in a heavily loaded, multitasking environment, it's possible that the routine could loop for a rather long time before getting two identical reads. The function's execution time is non-deterministic. We've gone from a very simple timer reader to somewhat more complex code that could run for milliseconds instead of microseconds.
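A rough sketch of the read-twice approach follows. The names timer_register and Read_timer_retry() are assumed for illustration. The sketch also assumes interrupts are enabled when it runs (so the ISR can update Timer_interrupts) and that a 16-bit read is atomic on the target.

```c
#include <stdint.h>

/* Assumed stand-ins for the memory-mapped timer and the ISR's counter. */
static volatile uint16_t timer_register;
static volatile uint16_t Timer_interrupts;

/* Read high, then low, then high again. If the overflow count changed,
   an interrupt slipped in between the reads, so try again. */
uint32_t Read_timer_retry(void)
{
    uint16_t high, low;
    do {
        high = Timer_interrupts;          /* first read of the count  */
        low  = timer_register;            /* hardware count           */
    } while (high != Timer_interrupts);   /* second read must match   */
    return ((uint32_t)high << 16) | low;
}
```

The loop has no fixed bound, which is exactly the non-determinism described above.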

Another alternative might be to simply disable interrupts around the reads. This will prevent the ISR from gaining control and changing Timer_interrupts after we've already read it, but it creates another problem.

We enter Read_timer() and immediately shut down interrupts. Suppose the hardware timer is at our notoriously problematic 0xffff, and Timer_interrupts is zero. Now, before the code has a chance to do anything else, the overflow occurs. With interrupts shut down, the ISR cannot run and we miss the rollover. The code reads a zero from both the timer register and from Timer_interrupts, returning zero instead of the correct 0x10000, or even a reasonable 0x0ffff.

Yet disabling interrupts is probably a good thing to do, despite my rant against this practice. When they're on, it's always possible that our reading routine will be suspended by higher priority tasks and other ISRs for perhaps a very long time. Maybe long enough for the timer to roll over several times. So let's try to fix the code. Consider the following:

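    unsigned long Read_timer(void)
    {
        unsigned short low, high;

        push_interrupt_state;               /* pseudo-C: save the current interrupt state */
        disable_interrupts;                 /* pseudo-C: mask interrupts */
        low  = inword(timer_register);      /* low 16 bits, from the hardware */
        high = Timer_interrupts;            /* high 16 bits, from the ISR's count */
        if (inword(timer_overflow)) {       /* did the timer wrap with interrupts off? */
            ++high;                         /* stand in for the deferred ISR */
            low = inword(timer_register);   /* reread the count after the wrap */
        }
        pop_interrupt_state;                /* pseudo-C: restore the saved state */
        /* Parenthesize the shift: in C, + binds tighter than <<. */
        return (((unsigned long)high) << 16) + (unsigned long)low;
    }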

We've made three changes to the RTEMS code. First, interrupts are off, as described.

Second, you'll note that there's no explicit interrupt re-enable. Two new pseudo-C statements have appeared to push and pop the interrupt state. Trust me for a moment; this is just a more sophisticated way to manage the state of system interrupts.

The third change is a new test that looks at something called “timer_overflow,” an input port that is part of the hardware. Most timers have a testable bit that signals that an overflow took place. We check this to see if an overflow occurred between turning interrupts off and reading the low part of the time from the device. With the ISR inactive, the variable Timer_interrupts won't properly reflect such an overflow.

We test the status bit and reread the hardware count if an overflow did occur. Manually incrementing the high part corrects for the suspended ISR. The code then concatenates the two fixed values and returns the correct result. Every time.

With interrupts off we have increased latency. However, there are no loops; the code's execution time is entirely deterministic.
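The overflow-flag logic can be checked on a host with a small simulation. This is a sketch under assumptions, not target code: overflow_flag stands in for the hardware's timer_overflow bit, tick_masked() models a timer tick while interrupts are off, and the ticks_before_read parameter exists only so a test can force a wrap inside the critical section.

```c
#include <stdint.h>
#include <stdbool.h>

static uint16_t hw_count;          /* simulated 16-bit hardware timer     */
static uint16_t Timer_interrupts;  /* overflow count kept by the ISR      */
static bool     overflow_flag;     /* simulated timer_overflow status bit */

/* One tick with interrupts masked: a wrap sets the status bit,
   but the ISR does not run. */
static void tick_masked(void)
{
    if (++hw_count == 0) {
        overflow_flag = true;
    }
}

/* The corrected read, with interrupts assumed already off. */
uint32_t read_fixed(unsigned ticks_before_read)
{
    while (ticks_before_read--) {
        tick_masked();                 /* timer keeps counting */
    }
    uint16_t low  = hw_count;
    uint16_t high = Timer_interrupts;
    if (overflow_flag) {               /* wrap happened, ISR deferred */
        ++high;                        /* correct the stale count     */
        low = hw_count;                /* reread after the wrap       */
    }
    return ((uint32_t)high << 16) + low;
}
```

Starting at 0xffff with no overflows recorded, read_fixed(1) returns 0x10000 rather than zero, and the path through the function contains no retry loop.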

Push state?

But what's all of this pushing and popping? And where's the enable interrupts instruction?

Good software design encapsulates actions. We localize all access to particular resources with driver routines. If you looked at firmware from 15 or 20 years ago you'd be appalled at how so many developers casually sprinkled I/O instructions throughout the code. Today most (not all, unhappily) of us would use a single routine, like Read_timer(), every time we wanted access to the timer.

Encapsulation implies, though, that the one driver must be quite generic and work properly regardless of the system's state. It shouldn't corrupt an LED's status, for example.

What if, for some reason we can't anticipate when writing this driver, someone calls it with interrupts already disabled? Using the conventional disable/enable pair will cause the system state to change when it returns. That could be catastrophic.

To safely disable and re-enable interrupts, save the interrupt state first, issue a disable instruction, do the work that must not be interrupted, and then pop the saved interrupt state back into the processor status word. If the caller had interrupts off, they stay off; if they were on, they come back on. Use the pragma or similar construct offered by most cross compilers to gain access to these low-level hardware functions. Build a macro that generates a bit of in-line assembly if the compiler is so brain-dead it can't handle interrupts intrinsically.
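Here is a minimal sketch of the save/disable/restore pattern. Everything in it is hypothetical: irq_enabled is a plain variable mocking the CPU's interrupt mask so the logic can run on a host, and irq_save_and_disable() and irq_restore() stand in for whatever intrinsic or in-line assembly your compiler provides.

```c
#include <stdint.h>
#include <stdbool.h>

/* Mock of the CPU interrupt-enable state (an assumption for host use;
   real code would read and write the processor status word). */
static bool irq_enabled = true;

typedef bool irq_state_t;

/* Save the current state, then disable interrupts. */
static irq_state_t irq_save_and_disable(void)
{
    irq_state_t previous = irq_enabled;
    irq_enabled = false;
    return previous;
}

/* Put back whatever state the caller had. */
static void irq_restore(irq_state_t previous)
{
    irq_enabled = previous;
}

static uint32_t shared_counter;   /* data also touched by an ISR */

/* A driver routine that leaves the interrupt state as it found it. */
uint32_t read_shared(void)
{
    irq_state_t saved = irq_save_and_disable();
    uint32_t value = shared_counter;   /* critical section */
    irq_restore(saved);
    return value;
}
```

A plain disable/enable pair would return with interrupts on no matter how the routine was entered; restoring the saved state avoids that side effect.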

Other RTOSes

Unhappily, race conditions occur anytime we need more than one read to access data that's changing asynchronously to the software. If you're reading X and Y coordinates, even with just eight bits of resolution, from a moving machine, they could be seriously out of sync if two reads are required. A 10-bit encoder managed through byte-wide ports could create a similar risk.
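The encoder case is easy to demonstrate in simulation. In this sketch (names and bit layout are assumed for illustration) a 10-bit position is exposed through two byte-wide ports, and the machine moves one count between the two reads.

```c
#include <stdint.h>

static uint16_t position;   /* simulated true 10-bit encoder position */

/* Byte-wide ports: upper 2 bits and lower 8 bits of the position. */
static uint8_t port_high(void) { return (uint8_t)((position >> 8) & 0x03u); }
static uint8_t port_low(void)  { return (uint8_t)(position & 0xffu); }

/* Two-read access with the encoder advancing between the reads. */
uint16_t torn_read(void)
{
    uint8_t high = port_high();                        /* upper bits, old position */
    position = (uint16_t)((position + 1u) & 0x3ffu);   /* machine moves one count  */
    uint8_t low = port_low();                          /* lower bits, new position */
    return (uint16_t)((high << 8) | low);
}
```

If the position is 0x0ff when the first port is read, the result is 0x000, though the encoder was never anywhere near zero: it moved from 0x0ff to 0x100.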

Having dealt with this problem in a number of embedded systems over the years, I wasn't too shocked to find it in the RTEMS source. It's a pretty obscure issue, after all, though terribly real and potentially deadly. For fun I looked through the source of µC/OS, another very popular operating system whose source is on the 'Net (see www.ucos-ii.com). µC/OS never reads the timer's hardware. It only counts overflows as detected by the ISR, as there's no need for higher resolution. There's no chance of an incorrect value.

Some of you, particularly those with hardware backgrounds, may be clucking over an obvious solution I've yet to mention. Add an input capture register between the timer and the system; the code sets a “lock the value into the latch” bit, then reads this safely unchanging data.

That solution, too, is fraught with peril and in many instances will not work. More next month!

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. He founded two companies specializing in embedded systems. Contact him at .

Return to July 2001 Table of Contents
