
Solving the Software Safety Paradox
by Doug Brown
Embedded systems running safety-critical applications have a quandary.
How can the software know that its operating correctly? Can a
malfunctioning system diagnose itself and either correct the problem or
halt itself? This article examines some techniques that allow the software
to monitor
itself, with a bias toward those techniques that can be easily
implemented in almost any embedded system.
In Joseph Hellers novel
Catch-22
, an Army Air Corp doctor
faces a perplexing dilemma. One of the WWII bomber squadrons pilots
is obviously insane, but he cant be grounded unless he asks to be
grounded; however, if he asks to be grounded, then the rules state that
this request in itself is evidence of sanity. In other words, he cant
be groundedCatch-22.
Embedded systems running safety-critical
applications face a similar paradox. How can the software know that it is
operating correctly, especially if it isnt? Can the system perform a
sanity check on itself? This has been a major concern for manufacturers and
regulators of safety-critical embedded systems since the beginning of the
computer age, and it is a concern that is growing daily.
From heart
defibrillators to avionics suites, from nuclear power plant controls to the
anti-lock braking system in
most new cars, microprocessors have become an
integral part of everyday systems upon which thousands, perhaps millions,
of lives depend. The safe operation of these systems cannot be taken for
granted due to the potentially catastrophic consequences of a malfunction.
Given the serious nature of a failure in one of these systems, is there
anything we can do, beyond careful design practices and extensive testing,
to ensure their correct operation? Can a malfunctioning system diagnose
itself and either correct
the problem or halt itself before behaving in an
erroneous or potentially dangerous manner?
The answer to both questions
is a qualified yes. Many simple techniques exist, which allow the software
to monitor itself for correct operation. This article examines some of
these techniques with a bias toward those that can be easily implemented in
almost any embedded system.
CPU self-test
I have a friend who compares testing the CPU to holding a gun to your head,
pulling the trigger, and then
asking, How dead am I? While I
find his analogy amusing, I disagree with it.
I admit that
microprocessor failures are rare. They are so well-tested prior to shipment
that a CPU failure is practically unheard of. Even when they do fail, it
will most likely be a catastrophic failure of the entire processor;
nevertheless, it is possible for a microprocessor to fail in varying
degrees, just like any other piece of hardware. In most cases, it
isnt the microprocessor itself, but an external
anomaly, such as a
solder splash, a bent pin, or an error in the gate array interface, that
causes the failure. In one instance, the CPU of a robotic welder failed,
causing an extraneous jump to a move routine that resulted in the
destruction of a small jack.
1
One can only imagine the grim
result if a human had been standing in place of the jack.
So how can a
software program test the CPU on which it is executing? Granted, a paradox
exists in that the processor must be running well enough to execute
the
self-test, but by starting with very simple instructions, verifying their
correct operation, and then proceeding to more complex instructions,
testing the CPU is indeed possible.
The need to test individual
instructions dictates that the CPU test must be written in machine
assembler code instead of a higher-level language. Because branching is the
main cornerstone of any program, it makes sense to start with a simple
unconditional branch or jump instruction. If the branch instruction
doesnt
execute correctly, then the CPU is inherently unreliable.
Listing 1
shows a portion of a CPU test for a 68332
microcontroller. (The complete CPU test can be found on-line at
www.embedded.com/ code.htm
.) This test was written as a subroutine
that can be called from C functions, so some assumptions were made that the
return instruction operates correctly.
In concept, and in fact, a CPU
self-test is quite simple. The branch always instruction should cause the
microprocessor to
jump around the fail code segment to the cpu0 label. If
the branch instruction fails, then the processor will fall into the fail
segment and execute the failure code. You should keep the failure code
fairly simple because if this segment is ever executed, the CPU is assumed
to be unreliable. This segment could simply turn on an LED and then halt
the processor.
Once it has been confirmed that an instruction works
correctly, the subsequent tests can use the instruction. The Branch
on equal/not equal
zero test illustrates this concept. If the beq
instruction fails to execute correctly, the processor uses the bra
instruction to branch to the fail label. If the beq instruction executes
correctly, the processor branches to the cpu10 label and begins the next
test, which can now use the clr, bne, and beq instructions. In this manner,
each successive test builds upon the successes of the previous tests to
perform evermore complex operations.
As a minimum, a CPU self-test should
test each instruction that
may be used by the application; verify the
processors ability to set, reset, and interrogate condition flags;
verify the correct execution of math operations; and, finally, check the
addressability and the read-write capabilities of each register. The
register test should check for cross-talk between individual bits, bytes,
and words. Use at least two bit patterns that are complementary to each
other.
Considering the complexity of some of todays
microprocessors, testing every possible
instruction in the CPUs
instruction set may not be practical. Fortunately, its only necessary
to test those instructions that may be used by the application. For
example, if the CPU includes floating-point instructions but your
application does not perform any floating-point operations, then it
isnt necessary, although its still a good idea, to test the
floating-point operations. To determine which instructions should be
tested, writing a program that parses the output of the compiler and
creates a list of all the generated instructions is a fairly simple
process.
CPU tests are typically run at power-on, following hardware
initialization but before anything else, and during background processing.
Because the test uses individual instructions, the time required to execute
the test can be determined with a high degree of precision by simply adding
the execution time of the assembler instructions.
In heavily regulated
industries, such as medical device manufacturing, CPU self-tests also
provide a substantial economic benefit. Some federal regulations require a
test and inspection of each component of a device. On manufacturing lines
that use thousands of microprocessors each day, the cost of testing each
microprocessor before installation is prohibitive. Implementing a CPU
self-test avoids the cost of implementing a separate test line, while still
satisfying federal regulations.
Guarding against illegal jumps
Probably the first, and simplest, safety technique learned by many
embedded
programmers consists of filling unused program memory with either halt
instructions or illegal instructions. This technique guards against illegal
jumps outside of program space.
Consider the case where unused ROM is
filled with halt instructions. If the program jumped out of bounds and
executed the halt instruction, the system will stop. This is extremely
helpful if it occurs in a debugging environment, since you can trace back
through the stack to determine the cause of the illegal jump. In a
production environment, a watchdog timer could trigger a reset of the unit.
If illegal instructions are used to fill unused program memory, a jump
outside of program space would execute the illegal instruction and trigger
the illegal instruction trap. If the error occurred while youre in a
debug environment, the illegal instruction vector can be used to start a
debugging function. If the error occurs in a production environment, the
illegal instruction trap could vector to a function that attempts to
safe the system. For heart defibrillators, the safe function
could disable the charger and transfer relays. For robotic systems, the
safe function could shut down all the robots motors.
Regardless of
which instruction is used, filling unused ROM provides cheap insurance
against jumps outside of code space.
ROM tests
A ROM test is used to verify the integrity of a program or data stored in
ROM. It is a basic safety precaution that should be performed at power-on
prior to
executing the main application. If the ROM test fails, then we
want to avoid executing the main application because we have no idea what
is programmed in the device.
In addition to performing the ROM test at
power-on, running the ROM test continuously in the background makes sense
in some cases. This is especially true if your device has the ability to
alter its own programming (for instance, flash memory or EEPROMs).
ROM
tests use a checksum or a cyclic redundancy check (CRC) to guard against
corrupted
memory. The technique consists of calculating the checksum or CRC
over the entire ROM space, including unused memory, and storing the result
in a known location. Unused ROM space should be filled with halt or illegal
instructions as I discussed previously. When the test is run, boot code
calculates a checksum or CRC and compares the calculated value to the
stored value. If the values dont compare, then the ROM has been
corrupted. The basic ROM test is simple to comprehend and implement, but
some
careful trade-offs need to be made when deciding on whether to use a
checksum or CRC.
A checksum is simple to implement and is fast; however,
its very easy for multiple errors to cancel each other out. A CRC, on
the other hand, provides a much better integrity check, but is slower. Some
devices require a fast start and cannot tolerate the time required to
complete a CRC check. One potential solution is to use both.
Figure 1
illustrates a program memory layout. The last eight bytes are occupied by a
stored CRC value and a checksum value respectively. The checksum is
computed over the entire contents of ROM, including the CRC value. At
power-up, the checksum is calculated and compared to the stored value. This
provides a quick integrity check that confirms the system can continue the
boot process. Once the boot process is complete, a background task can run
the CRC check continually. This provides a higher integrity check and also
makes sure that the program ROM doesnt become corrupted during
program execution.
Watchdog timers
In its simplest form, a watchdog timer
2
is a circuit or function
that counts down from a preset time until it expires or is reset (tickled).
The idea is for the executing software to tickle the watchdog at
appropriate intervals to prevent its expiration. If the software gets
caught in a loop, or is prevented from tickling the watchdog, the timer
expires and causes a processor reset. In this form, a watchdog timer
isnt strictly a software
technique. It requires a hardware circuit to
reset the processor if it isnt tickled prior to expiration; however,
many real-time operating systems (RTOSes) provide a software watchdog
capability that can be applied only to software tasks. On the hardware
side, watchdog timers have become so common that many chips have a watchdog
timer built in.
Watchdog timers are most effective in cyclic systems
where the watchdog is tickled each time through the loop. Watchdogs are
somewhat more difficult to implement in
multitasking systems due to their
non-determinism.
3
Ive seen some multitasking systems that
use an interrupt to tickle the watchdog. This approach defeats the whole
purpose for having one in the first place. If all the tasks were blocked
and unable to run, the interrupt method would continue to service the
watchdog and the reset would never occur. A better solution is to use a
separate monitor task that not only tickles the watchdog, but monitors the
other system tasks as well. For example,
a task that must run at least once
per second could increment a counter. The monitor task reads the counter
and safes the system if the task counter stops incrementing at
the expected rate. If the monitor task was blocked, it would not tickle the
watchdog and the system would reset.
A potential window of vulnerability
exists with watchdog systems as theyre implemented in most systems.
Many systems only check for failure to strobe the watchdog too slow; if the
system is running at a
higher rate than intended, the traditional watchdog
wont catch it. A better technique is to use a windowed
watchdog.
The windowed watchdog resets the processor if it is tickled
either too fast or too slow. For example, if your task must nominally
execute every 50 +/- 10% milliseconds, a windowed watchdog circuit would
reset the processor if it was tickled before 45ms, or if 55ms passed
without being tickled. Although windowed watchdogs are more difficult to
implement, they provide a higher degree of
protection than standard
watchdog timers.
Redundant storage and cross checks of critical variables
All of the techniques I have discussed to this point guard against gross
errors in the system, but what about more subtle errors, such as corruption
of variables?
Weve all heard stories of bit flips that were caused
by cosmic rays or EMI. Some engineers with whom Ive spoken argue that
cosmic ray bit flips are rare in todays modern memories. Other
engineers argue that with bit
densities going up and low-voltage devices
becoming more prevalent, bit flips are more likely to occur than ever.
Another, and perhaps more common, cause of memory corruption is a rogue
pointer, which can run wild through memory leaving a trail of corrupted
variables in its wake. Regardless of the cause, the designer of
safety-critical software must consider the threat that sometime, somewhere,
a variable will be corrupted. If one of the corrupted variables happened to
store a critical parameter, such as the
energy setting for a heart
defibrillator or the fluid level in a chemical mixing vat, the results
could be quite unsatisfactory.
A technique commonly used in
safety-critical software to guard against variable corruption is redundant
storage of critical variables. Three copies of each critical variable are
stored in memory. Ideally, each copy should be stored in a separate memory
type using different data formats. For example, most microprocessors have a
small amount of onboard (internal) RAM in addition
to the RAM designed onto
the printed circuit board by the hardware designers. This provides two
separate memories for storage of critical variables. One copy can be stored
in internal RAM and the two other copies can be stored in the external RAM.
Of the two copies in external RAM, one is stored using an inverted bit
pattern.
Prior to using the variable, the three copies are compared. If
any of the three variables do not agree, a two-of-three voting determines
which value should be used. If none of the
three values match, then a
default value is used.
An extra level of protection is provided by
grouping critical variables together and keeping a CRC over each group. In
case of a mismatch, the CRC can be used to determine which copy is
corrupted. The good copies can then be used to repair the corrupted
copy.
Two drawbacks are evident when keeping redundant copies of
variables. One is the additional time required to access the variable
copies and perform the integrity checks. The second drawback is the
shear
number of variables used in embedded systems. Keeping redundant copies of
all variables is impractical, so the designer must determine beforehand
which variables are critical to the safe operation of the application.
Stack checks
A stack check guards against stack overflow or the corruption of a
tasks stack. Stacks that overflow quickly are normally caught early
in the development process; however, memory leaks can cause stacks to fill
slowly until they eventually overflow. I once had
a situation in which a
stack would overflow only after several days of continuous operation. A
stack-monitoring task alerted me to the problem and allowed me to correct
it prior to the product shipment.
Many commercial RTOSes contain
facilities and functions that support stack checking, but it can be
implemented without an RTOS as well. The basic technique consists of
initializing the stack with a known hex pattern. One commercial RTOS pours
C0FFEE into the stack at initialization. Although this is a cute
pun, I
prefer to use a pattern that corresponds with an illegal instruction on the
microprocessor. This is for the same reasons I mentioned earlier in
guarding against illegal jumps.
Once the stack has been initialized with
a known pattern, a simple monitor function can be written to examine the
stack, determine where the pattern has been overwritten, and calculate the
remaining stack space. If the stack margin shrinks below some predetermined
amount, the monitor function will call an error processing
routine.
Stack
monitors require little overhead to implement and are an effective means of
guarding against stack overflows.
Program checks
One of the primary functions of a microprocessor is computing the result of
mathematical equations. The equations could be the calculations to
determine the correct energy for a defibrillator shock, aircraft vectors in
an air traffic control system, or control rod positions in a nuclear power
plant. Whatever the purpose of the system, the correct
computation of these
equations is often central to its safe operation. This raises a troubling
question: when doing a mathematical calculation, how do we know the result
is correct?
Most software development organizations try to address this
problem by extensive testing. Testing is important and should not be
omitted in favor of other methods, but it does have the shortcoming of
being incomplete. All possible combinations of numbers cannot be covered.
Another situation occurs if your production line starts using
a different,
yet supposedly compatible, microprocessor. A program that used to operate
correctly on the previous microprocessor fails when the upgraded part is
installed. If you think this isnt a problem, remember the Pentium
Floating-Point Bug? Programs that operated correctly on a 486 processor
could fail when executed on a Pentium processor.
An alternative to
testing is using mathematical proofs to prove that an algorithm is correct.
The problem with proving correctness is the complexity involved
in proving
even the simplest algorithm. Additionally, a proof only proves that the
algorithm is correct. It cannot catch processor errors.
A third, and
probably the best alternative, is to use program checks
4
to
verify calculation results. Program checks address the basic problem of
verifying that a calculation is correct before the result is utilized. The
technique is one that you were taught in elementary school.
| Figure 1: Program Memory Layout
|
|
To this day,
I can still hear Mrs. Stewart admonishing my second grade class to
Check your work. Check subtraction by
adding the result to one
of the numbers in the original problem to get the other number. (for
example, If 10 3 = 7 then 7 + 3 had better equal 10.) Check division
by plugging your answer back into a multiply, and so on. In their simplest
form, program checks are the same thing you were taught in grade school.
When checking calculations, be sure to allow tolerances for checking
your answers. Digital arithmetic isnt exact! The designer must decide
on an error tolerance and only perform the error
processing if the result
falls outside of this tolerance.
Program checks have two drawbacks.
First, program checks roughly double your computation time because each
calculation is essentially performed twice: once for the original
calculation, and again, in reverse, to check the result. In a system where
processor throughput is an issue, checking all of the calculations may not
be practical. Second, if the check signals a mismatch, were not sure
where the error has occurred, whether in the
original calculation or in the
check calculation. We only know that there is a problem and further
analysis is necessary to determine the cause. Despite these drawbacks,
program checks offer a powerful means for programs to check themselves for
correct operation.
Simple, yet effective
These safety self-tests Ive described are only a few of the
techniques that you can employ to have software check itself for proper
executionmany others exist as well. The techniques here are some of
the
simplest, yet most effective. They can be easily implemented in most
embedded systems. Although these safety self-tests can detect a bad
microprocessor, bad ROM, or wild software, they certainly are not a
guarantee of correct operation. They cant diagnose design flaws, nor
can they compensate for requirements errors. Nevertheless, they are
effective tools in designing safe systems and are a welcome addition to a
programmers toolbox.
Doug Brown is a senior project engineer at
Physio-Control. Contact him
at
ddbrown@physio-control.com
| References
|
- Neumann, Peter G.
Computer Related Risks
. Reading, MA: Addison Wesley, 1995, Chapter 2.8.1, p. 66.
- Santic, John S., Watchdog Timer
Techniques,
Embedded Systems Programming
, April 1995, p. 58.
- Lowell, Augustus P., The Care and Feeding of Watchdogs,
Embedded Systems Programming
, April 1992, p. 38.
- Blum, Manuel and Sampath Kannan, Designing Programs that Check Their Work,
Journal of the Association for Computing Machinery
, January 1995, p. 269.
|
| Listing 1: Portion of CPU test for a 68332 microcontroller
|
_cpu_test:
/* Branch Always Test */
bra cpu0 /* branch always */
move.l #FALSE,d0 /* if falls through, error */
rts /* exit */
/* Any failures from the tests below branch to this fail location. */
fail:
movem.l
(sp)+,d0-d7/a0-a6 /* restore registers */
/* d0 was included for testing*/
move.l #FALSE,d0 /* error return */
rts
cpu0:
movem.l d0-d7/a0-a6,-(sp) /* push registers - test later*/
/* d0 is included to test later */
/* Branch on equ/ne zero Test */
clr.l d0 /* clear a register */
bne fail /* should not branch */
beq cpu10 /* should branch */
bra fail /* go fail */
cpu10:
/* Move / Compare / Branch on Conditions / Jmp Tests */
move.l #LONGPOSVALUE,d0 /* set value in register */
cmp.l #LONGPOSVALUE,d0 /* equal ? */
bne fail /* should not branch */
bge cpu20 /* should branch */
bra fail /* go fail */
cpu20:
bgt fail /* should not branch */
cmp.l #0,d0 /* equal ? */
beq fail /* should not branch */
blt fail /* should
not branch */
bgt cpu30 /* should branch */
bra fail /* go fail */
cpu30:
bge cpu31 /* should branch */
bra fail /* go fail */
cpu31:
jmp cpu32 /* should jump */
bra fail /* go fail */
cpu32:
/* Test movem.l and cmpm */
cmp.l 4(sp),d1 /* regs d1 - a6 must compare */
bne fail /* if no compare - fail */
cmp.l 8(sp),d2
bne fail
cmp.l
12(sp),d3
bne fail
cmp.l 16(sp),d4
bne fail
cmp.l 20(sp),d5
bne fail
cmp.l 24(sp),d6
bne fail
cmp.l 28(sp),d7
bne fail
cmpa.l 32(sp),a0
bne fail
cmpa.l 36(sp),a1
bne fail
cmpa.l 40(sp),a2
bne fail
cmpa.l 44(sp),a3
bne fail
cmpa.l 48(sp),a4
bne fail
cmpa.l 52(sp),a5
bne fail
cmpa.l 56(sp),a6
bne fail
|
|