CMP EMBEDDED.COM

Back to the Basics - Practical Embedded Coding Tips: Part 2
Asynchronous Hardware/Firmware, race conditions and solutions



Embedded.com
There are subtler issues beyond those discussed in Part 1 that result from the interaction of hardware and software. These may not meet the classical definition of reentrancy, but pose similar risks, and require similar solutions.

We work at that fuzzy interface between hardware and software, which creates additional problems due to the interactions of our code and the device. Some cause erratic and quite impossible-to-diagnose crashes that infuriate our customers.

The worst bugs of all are those that appear infrequently and can't be reproduced. Yet a reliable system just cannot tolerate any sort of defect, especially the random one that passes our tests, perhaps dismissed as "ah, it's just a glitch" behavior.

Potential evil lurks whenever hardware and software interact asynchronously. That is, when some physical device runs at its own rate, sampled by firmware running at some different speed.

I was poking through some open-source code and came across a typical example of asynchronous interactions. The RTEMS real-time operating system provided by OAR Corporation is a nicely written, well-organized product with a lot of neat features.

But the timer handling routines, at least for the 68302 distribution, are flawed in a way that will fail infrequently but possibly catastrophically. This is just one very public example of the problem I constantly see buried in proprietary firmware.

The code is simple and straightforward, and looks much like any other timer handler.

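volatile int timer_hi;                   /* overflow count, bumped by the ISR */
interrupt timer(){                       /* invoked on each 16-bit timer overflow */
    ++timer_hi;}

long read_timer(void){
    unsigned int low, high;
    low =inword(hardware_register);      /* current 16-bit hardware count */
    high=timer_hi;                       /* overflows seen so far */
    return (((long)high<<16) + low);}    /* shift parenthesized: + binds tighter than << */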

There's an interrupt service routine invoked when the 16 bit hardware timer overflows. The ISR services the hardware, increments a global variable named timer_hi, and returns.

Therefore, timer_hi maintains the number of times the hardware counted to 65536. Function read_timer returns the current "time" (the elapsed time in microseconds as tracked by the ISR and the hardware timer). It, too, is delightfully free of complications.

Like most of these sorts of routines it reads the current contents of the hardware's timer register, shifts timer_hi left 16 bits, and adds in the value read from the timer. That is, the current time is the concatenation of the number of overflows and the timer's current value.

Suppose the hardware rolled over 5 times, creating five interrupts. timer_hi equals 5. Perhaps the internal register is, when we call read_timer, 0x1000. The routine returns a value of 0x51000. Simple enough and seemingly devoid of problems.
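The arithmetic in this example can be checked with a minimal sketch. The names combine_time and check_worked_example are hypothetical and not part of RTEMS. The sketch assumes fixed-width types so it behaves the same on 16-bit and 32-bit targets, widens the overflow count before shifting, and parenthesizes the shift because + binds tighter than << in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: joins an overflow count and a 16-bit timer value
   into one 32-bit time, high word first. */
uint32_t combine_time(uint16_t overflows, uint16_t count)
{
    /* Widen to 32 bits before shifting so the high word is not lost
       where int is only 16 bits wide. */
    return ((uint32_t)overflows << 16) + count;
}

/* The worked example from the text: 5 overflows, timer at 0x1000. */
void check_worked_example(void)
{
    assert(combine_time(5, 0x1000) == 0x51000UL);
}
```

Calling check_worked_example() passes silently: 5 shifted left 16 bits is 0x50000, and adding 0x1000 gives 0x51000.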

Race Conditions
But let's think about this more carefully. There are really two things going on at the same time. Not concurrently, which means "apparently at the same time," as in a multitasking environment where the RTOS doles out CPU resources so all tasks appear to be running simultaneously.

No, in this case the code in read_timer executes whenever called, and the clock-counting timer runs at its own rate. The two are asynchronous.

A fundamental rule of hardware design is to panic whenever asynchronous events suddenly synchronize. For instance, when two different processors share a memory array there's quite a bit of convoluted logic required to ensure that only one gets access at any time. If the CPUs use different clocks the problem is much trickier, since the designer may find the two requesting exclusive memory access within fractions of a nanosecond of each other.

This is called a "race" condition and is the source of many gray hairs and dramatic failures. One of read_timer's race conditions might be:

It reads the hardware and gets, let's say, a value of 0xffff.

Before having a chance to retrieve the high part of the time from variable timer_hi, the hardware increments again to 0x0000.

The overflow triggers an interrupt. The ISR runs, and timer_hi is now 0x0001, not 0 as it was just nanoseconds before.

The ISR returns. Our fearless read_timer routine, with no idea an interrupt occurred, blithely concatenates the new 0x0001 with the previously read timer value of 0xffff and returns 0x1ffff, a hugely incorrect value.
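The sequence above can be reproduced deterministically on a desktop with a simulated timer. This is only a sketch: sim_counter, sim_tick, read_timer_with_race and demo_race are hypothetical stand-ins, and the tick that causes the rollover is forced to land exactly between the two reads, which on real hardware happens only by bad luck.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated stand-ins for the real hardware; all names are hypothetical. */
static uint16_t sim_counter;            /* the 16-bit hardware timer      */
static volatile uint16_t timer_hi;      /* overflow count kept by the ISR */

static void timer_isr(void) { ++timer_hi; }

/* Advance the simulated timer one tick. An overflow runs the ISR at
   once, as it would with interrupts enabled. */
static void sim_tick(void)
{
    if (++sim_counter == 0)
        timer_isr();
}

/* Same read order as read_timer, with one tick forced between reads. */
uint32_t read_timer_with_race(void)
{
    uint16_t low, high;
    low = sim_counter;      /* reads 0xffff                           */
    sim_tick();             /* timer rolls to 0x0000 and the ISR runs */
    high = timer_hi;        /* reads the already-incremented count    */
    return ((uint32_t)high << 16) + low;
}

uint32_t demo_race(void)
{
    uint32_t t;
    sim_counter = 0xffff;
    timer_hi = 0;
    t = read_timer_with_race();
    assert(t > 0x10000UL);  /* overshoots the true time of about 0x10000 */
    return t;
}
```

demo_race() returns 0x1ffff, nearly double the true elapsed time of about 0x10000 ticks.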

Alternatively, suppose read_timer is called during a time when interrupts are disabled—say, if some other ISR needs the time. One of the few perils of writing encapsulated code and drivers is that you're never quite sure what state the system is in when the routine gets called. In this case:

read_timer starts. The timer is 0xffff with no overflows.

Before much else happens it counts to 0x0000. With interrupts off the pending interrupt gets deferred.

read_timer returns a value of 0x0000 instead of the correct 0x10000, or the reasonable 0xffff.
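The interrupts-off case can be simulated the same way. In this sketch the overflow only latches a pending request while interrupts are masked, so the overflow count is stale when the read happens. All names here (irq_enabled, irq_pending, sim_tick, read_timer_sim, demo_masked) are hypothetical, and the model of interrupt latching is a simplification rather than a description of the 68302.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated hardware state; all names are hypothetical. */
static uint16_t sim_counter;   /* the 16-bit hardware timer        */
static uint16_t timer_hi;      /* overflow count kept by the ISR   */
static bool irq_enabled;       /* false while interrupts are off   */
static bool irq_pending;       /* latched overflow request         */

static void sim_tick(void)
{
    if (++sim_counter == 0)
        irq_pending = true;            /* overflow latches a request */
    if (irq_pending && irq_enabled) {
        irq_pending = false;
        ++timer_hi;                    /* body of the timer ISR      */
    }
}

/* Same read order as read_timer. */
uint32_t read_timer_sim(void)
{
    uint16_t low = sim_counter;
    uint16_t high = timer_hi;
    return ((uint32_t)high << 16) + low;
}

uint32_t demo_masked(void)
{
    uint32_t t;
    sim_counter = 0xffff;
    timer_hi = 0;
    irq_pending = false;
    irq_enabled = false;       /* caller is, say, another ISR         */
    sim_tick();                /* rolls to 0x0000; interrupt deferred */
    t = read_timer_sim();
    assert(irq_pending);       /* the overflow has not been counted   */
    return t;
}
```

demo_masked() returns 0x0000 even though 0x10000 ticks have elapsed, because the deferred ISR has not yet incremented timer_hi.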

So the algorithm that seemed so simple has quite subtle problems, necessitating a more sophisticated approach. The RTEMS RTOS, at least in its 68k distribution, will likely create infrequent but serious errors.

Sure, the odds of getting a misread are small. In fact, the chance of getting an error plummets the less frequently we call read_timer. How often will the race condition surface? Once a week? Monthly?
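A rough feel for those odds comes from a crude model. It assumes calls to read_timer arrive at times unrelated to the timer, so the chance that any one call straddles a rollover is the width of the vulnerable window divided by the rollover period. The function name and every parameter value below are assumptions for illustration, not measurements of RTEMS or any real system.

```c
#include <assert.h>

/* Crude estimate of the mean time, in seconds, between bad reads.
   window_s:    assumed gap between the two reads inside read_timer
   period_s:    time between timer rollovers
   calls_per_s: assumed average rate of read_timer calls */
double mean_seconds_between_errors(double window_s, double period_s,
                                   double calls_per_s)
{
    double p_per_call;
    assert(window_s > 0.0 && period_s > 0.0 && calls_per_s > 0.0);
    p_per_call = window_s / period_s;  /* chance one call hits the window */
    return 1.0 / (p_per_call * calls_per_s);
}
```

Assuming one tick per microsecond, a 16-bit timer rolls over every 65,536 microseconds. With a hypothetical 1 microsecond window and 100 calls per second, mean_seconds_between_errors(1e-6, 65536e-6, 100.0) comes to about 655 seconds, roughly one bad read every 11 minutes. Halving the call rate doubles that interval, which is the sense in which the error rate falls as calls become less frequent.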

Many embedded systems run for years without rebooting. Reliable products must never contain fragile code. Our challenge as designers of robust systems is to identify these sorts of issues and create alternative solutions that work correctly, every time.
