An exception primer - Embedded.com

An exception primer

Most developers are somewhat familiar with the notion of exception handling. Software is rarely implemented with perfect hardware in a perfect environment, and the possibility of an anomalous event or error condition has to be considered. Exception handling in embedded applications is often vitally important; a confused processor can generate expensive and even life-threatening results. Having a program give up and post an "Abort, Retry, Fail?" message is rarely an acceptable option in avionics or medical applications.

The way exceptional conditions are handled depends on the supporting software. In general, the exception handler is an operating-system resource. In a UNIX environment, for example, an error during a program's execution (such as an invalid memory access) will cause the operating system to generate a core file that's written to the disk, then halt program execution.

That's simple exception handling: a fault is recognized, a file is generated to indicate the system state at the time of the fault, the process is killed, and the operating system returns to the user prompt. The operator is then expected to take appropriate action, such as determining the cause of the error and re-executing the program.

In contrast to the UNIX scenario, embedded systems call for sophisticated exception handling techniques. Embedded usually implies that the system performs a known, defined function in the real world. The operating personnel will likely be untrained in computer engineering, or perhaps not even present. As a result, the system must be able to resolve the anomalous condition and continue operation.

An exceptional methodology
The first step in any exception-handling scheme is to define what will constitute an exception. The definition depends largely on whether the underlying operating system is a simple monitor, a kernel such as VRTX or pSOS, or a full-blown operating system such as UNIX.

The existing software foundation will influence your design a great deal, as monitors and kernels generally provide no error handling at all other than reporting that an error occurred during a system call. UNIX, on the other hand, provides a rich set of tools through which complex error-handling systems can be built. When a serious fault occurs, the state of the system is written to the disk in the form of a core file and signals are sent to parent processes to alert them to the error condition.

As comfortable as the UNIX tools may be, the most common environments for embedded applications are kernels and monitors. Accordingly, the examples here are for a kernel-based system.

The exception handler
An exception is, by definition, an anomalous condition. In practice, this term covers a wide variety of sins. Let's say a message is posted to a message queue through a kernel call. If it's posted successfully, the kernel returns a message to the caller that no error occurred. But what if the message queue is full or the request specifies a nonexistent message queue? Exceptions aren't limited to kernel calls; they can also be the default at the end of a case statement (representing an invalid parameter) or some other program-flow problem.

One simple approach to exception handling is to establish a rigorous methodology at the beginning of the project. A central handler can easily be used throughout the system, as represented by the following pseudocode:

    err = system_operation();
    if( err )
        sys_fault();
    else
    {
        normal_processing();
    }

This simple example demonstrates a consistent approach for checking and handling anomalous conditions. Wherever an unrecoverable error could exist, a check is made. If an error occurs, the system-wide exception handler is called.

From this basic method we can build a system in which each exception is assigned a unique error code to identify the line that caused the exception. Using the exception handler is as simple as a normal function call:

sys_fault( SYS_CRASH_001, err );   

The field SYS_CRASH_001 is a unique number within the system and forms a unique exception ID (a long integer, in this example). Implementers of a project can build a central file, similar to the following, that lists the exception IDs and tracks their use in the system:

    /* qpost, sys_chk_msg() */
    #define SYS_CRASH_001 0x0001
    /* bad indx, sys_prcs_msg() */
    #define SYS_CRASH_002 0x0002
    /* qpend, sys_snd_msg() */
    #define SYS_CRASH_003 0x0003

The err field represents additional information that the user desires to pass to the exception handler; for kernel errors, this could be the error returned from the kernel call. For software flow errors, such as the default of a case, it could be the value used in the case.
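The two uses of the err field can be sketched as follows. This is a minimal illustration rather than the article's listing: q_post(), Q_MAX, its return codes, and the last_id/last_err variables are hypothetical stand-ins, and sys_fault() here only prints and remembers its arguments.

```c
#include <stdio.h>

/* Exception IDs, as in the central file above. */
#define SYS_CRASH_001 0x0001L   /* qpost, sys_chk_msg() */
#define SYS_CRASH_002 0x0002L   /* bad indx, sys_prcs_msg() */

/* Hypothetical record of the last exception raised. */
static long last_id;
static long last_err;

/* Central hook: every exception in the system passes through here. */
void sys_fault(long id, long err)
{
    last_id = id;
    last_err = err;
    printf("EXCEPTION id=0x%04lx err=%ld\n", (unsigned long)id, err);
}

/* Hypothetical kernel call: returns 0 on success, nonzero on failure. */
#define Q_MAX 4
static int q_count;

int q_post(int queue, long msg)
{
    (void)msg;
    if (queue != 0)
        return 1;               /* nonexistent message queue */
    if (q_count >= Q_MAX)
        return 2;               /* message queue full */
    q_count++;
    return 0;
}

/* Kernel error: err carries the code returned by the kernel call. */
void sys_chk_msg(int queue, long msg)
{
    int err = q_post(queue, msg);
    if (err)
        sys_fault(SYS_CRASH_001, err);
}

/* Flow error: err carries the value that fell through to the default. */
void sys_prcs_msg(int indx)
{
    switch (indx) {
    case 0:     /* normal processing */
    case 1:     /* normal processing */
        break;
    default:
        sys_fault(SYS_CRASH_002, indx);
        break;
    }
}
```

Calling sys_chk_msg(7, 42L) in this sketch raises SYS_CRASH_001 with err set to the kernel's return code, while sys_prcs_msg(9) raises SYS_CRASH_002 with err set to 9.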

The methodology defined thus far offers the following:

• A consistent approach to invoking exceptions.

• The ability to identify (through the ID code) the exact line of code that caused the exception condition.

• A single hook (in this case, sys_fault()) through which all exceptions pass, enabling us to change the way exceptions are handled in one central procedure.

The actual exception processing can be approached in several ways. During the debug and implementation phases, the programmer needs to know what exception occurred and be able to examine the state of the system at the exact moment the exception occurred. The best way to do this is to print a crash report that gives the exception condition, then freeze the system by returning to the debugger/monitor.

After the project has been completed and shipped to the field, it's usually highly undesirable for the system to freeze when an exception occurs. Exceptions in a production mode may best be handled by recording the exception and automatically restarting the system (advanced exception handling is discussed later). The crash report can subsequently be retrieved to determine the cause of the exception.
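One way to keep both behaviors behind the same hook is to have the handler choose an action from a mode setting. The sketch below is an assumption-laden illustration: sys_mode, the enum names, and sys_fault_dispatch() are hypothetical, and the function only returns the action it selected, leaving the actual jump to the monitor or the warm boot to platform code.

```c
#include <stdio.h>

typedef enum { MODE_DEBUG, MODE_PRODUCTION } sys_mode_t;
typedef enum { ACT_ENTER_MONITOR, ACT_WARM_BOOT } sys_action_t;

/* Hypothetical mode setting, e.g. read from a switch at start-up. */
static sys_mode_t sys_mode = MODE_DEBUG;

/* Hypothetical storage for the most recent production-mode exception. */
static long stored_id;
static long stored_err;
static int  stored_valid;

/* Pick the handling for an exception based on the current mode. */
sys_action_t sys_fault_dispatch(long id, long err)
{
    if (sys_mode == MODE_DEBUG) {
        /* Debug: show the report now, then freeze in the monitor. */
        printf("CRASH id=0x%04lx err=%ld\n", (unsigned long)id, err);
        return ACT_ENTER_MONITOR;
    }

    /* Production: keep the report for later retrieval, then restart. */
    stored_id = id;
    stored_err = err;
    stored_valid = 1;
    return ACT_WARM_BOOT;
}
```

Because the decision lives in one function, switching a shipped system from "freeze" to "record and restart" changes one setting instead of every call site.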

Any project will require exception handling tailored to its environment; the point is to build a methodology that allows the programmer to easily incorporate error handling into a project during implementation. It should provide a means of isolating the problem, locating it within the source code during debugging, and checking the system during production to determine the errors that have occurred but were recovered from.

Listings 1 and 2 illustrate a simple exception handler that formats the crash report shown in Figure 1 and checks a front panel switch to determine how to handle the exception.

Editor's Note: Listing 1 is split in three parts. The code is also available in a Word file.

Listing 2 is called when an exception occurs. It builds a stack frame that meets the definition in Listing 1 and calls the csys_fault() procedure with a pointer to that stack frame. The crash report can be either sent to the screen or appended to a crash file. In the first case, the handler jumps to the debugger (debug mode); in the second, the handler executes a warm boot (production mode).

The crash report illustrated here provides a great deal of information to the system implementer. It displays the crash and error codes, allowing the implementer to check the central crash file for the mnemonic of the crash code and locate the line of source code in which the exception occurred. The error parameter can then be used to determine the cause of the error. The register and stack information is useful in tracing the history of the problem; if the system is implemented in a high-level language, many of the parameters used are passed or stored on the stack.

More important than this information is the fact that the system halted in a controlled fashion. The application decided that an unrecoverable fault condition existed and raised the exception; the crash report was printed and system execution halted. Without a methodology of this kind, a fault might not cause a system halt until subsequent actions have been taken, actions that may alter or destroy critical information needed to isolate the cause of the fault.

In production mode, where the system can't be frozen, the crash report may be stored for later use and the system automatically restarted. Crash reports provide a history of system errors that can point up consistencies and patterns.
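A stored crash history could be as simple as a fixed-size ring of records that the handler appends to before restarting. The following is a sketch under stated assumptions: crash_rec_t, CRASH_LOG_SLOTS, and both functions are hypothetical names, and a real system would place the log in memory that survives a warm boot or append it to a crash file.

```c
#define CRASH_LOG_SLOTS 8

/* One retained crash record (hypothetical layout). */
typedef struct {
    long          id;    /* exception ID passed to the handler */
    long          err;   /* additional error parameter */
    unsigned long seq;   /* running count at the time of the crash */
} crash_rec_t;

static crash_rec_t   crash_log[CRASH_LOG_SLOTS];
static unsigned long crash_total;   /* crashes recorded since the log was cleared */

/* Append a record, overwriting the oldest once the ring is full. */
void crash_log_append(long id, long err)
{
    crash_rec_t *r = &crash_log[crash_total % CRASH_LOG_SLOTS];

    r->id = id;
    r->err = err;
    r->seq = crash_total;
    crash_total++;
}

/* Count retained records with a given ID, to spot recurring faults. */
unsigned crash_log_count(long id)
{
    unsigned long n = (crash_total < CRASH_LOG_SLOTS) ? crash_total
                                                      : CRASH_LOG_SLOTS;
    unsigned long i;
    unsigned count = 0;

    for (i = 0; i < n; i++)
        if (crash_log[i].id == id)
            count++;
    return count;
}
```

A repeated ID in such a log points at a single line of source code that keeps failing, which is the kind of pattern the history is meant to expose.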

Advanced handling
The exception handler described here either halts the system upon detection of a fault or records the exception parameters and restarts the system. This is sufficient from a development and testing perspective but doesn't handle all the situations that may arise. For example, some systems can't tolerate a complete restart; they enter a “limp-along” mode in which they partially recover from the fault without disrupting operation entirely. And some exceptions may not be fatal; the system merely notes them and continues processing. In this case, the exception may need a priority associated with it. High-priority exceptions may require the system to be halted or restarted, while those of low priority are simply logged.

Recovery from an exception without disruption of the entire system can be academic when the code and concepts required to recover from a fatal fault are more complex than the code that generated the fault. If non-disruptive recovery is essential, the system engineer should evaluate how the system will be implemented and decide which run-time environment and language should be used. Ada, with its many run-time resources for handling exceptions, is often the best language for systems that require a high degree of fault tolerance.

Fault recovery can be built into multitasking systems. When the exception handler is invoked, it determines whether a complete system restart or simply a flush and restart of the task that raised the exception is needed. In the latter case, if any outstanding operating-system resources (such as memory partitions) need to be reclaimed, a handler must be invoked to clean them up and return them to the operating system. This is normally handled by a procedure associated with each task that can be restarted. The procedure knows which resources may need reclaiming and how to check the task's internal structures to determine the resources in use at the time of the exception. Once the resources have been reclaimed, the task's data structures need to be reinitialized to a known state (typically the cold-start condition) and the task restarted.
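The per-task cleanup procedure might be represented as a pair of function pointers in a task descriptor. This is only a sketch: task_desc_t, sys_recover(), and the logger example are hypothetical, and the actual task restart is left to whatever call the kernel provides.

```c
#include <stddef.h>

/* Hypothetical descriptor for a task that may be restarted on its own. */
typedef struct {
    const char *name;
    void (*reclaim)(void);   /* return outstanding resources to the OS */
    void (*init)(void);      /* reset task data to the cold-start state */
    int restartable;         /* zero if only a system restart is safe */
} task_desc_t;

typedef enum { RESTART_TASK, RESTART_SYSTEM } restart_t;

/* Decide between a task restart and a full system restart. */
restart_t sys_recover(const task_desc_t *t)
{
    if (t == NULL || !t->restartable ||
        t->reclaim == NULL || t->init == NULL)
        return RESTART_SYSTEM;

    t->reclaim();            /* 1. reclaim resources held at the fault */
    t->init();               /* 2. reinitialize to a known state */
    return RESTART_TASK;     /* 3. caller asks the kernel to restart the task */
}

/* Example task: a logger holding some buffers when it faults. */
static int logger_bufs_held;
static int logger_state;

static void logger_reclaim(void) { logger_bufs_held = 0; }
static void logger_init(void)    { logger_state = 0; }

static const task_desc_t logger_task = {
    "logger", logger_reclaim, logger_init, 1
};
```

Falling back to RESTART_SYSTEM whenever a task has no cleanup procedure keeps the simple behavior as the default and makes task-level recovery opt-in.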

One remaining issue is the interaction and synchronization with other tasks in the system. Since these tasks weren't restarted and are unaware of the faulted task's restart, they could be waiting on that task for processing that was in progress when the exception occurred. One way to handle this problem is to issue a global fault message alerting the other tasks of the restart and asking that any requests being processed by that task be restarted as well.
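On the receiving side, each task could respond to the global fault message by re-issuing whatever it had outstanding with the restarted task. The sketch below assumes a hypothetical request_t bookkeeping table per client task; how the message is broadcast and how a request is actually re-posted depend on the kernel in use.

```c
/* Hypothetical record of a request one task has made of another. */
typedef struct {
    int server;     /* ID of the task the request was sent to */
    int pending;    /* nonzero while a reply is still awaited */
    int reissued;   /* how many times the request has been sent again */
} request_t;

/*
 * Called by a client task when it receives the global fault message.
 * Marks every pending request to the restarted task for re-issue and
 * returns the number of requests affected.
 */
int on_task_restarted(request_t *reqs, int count, int restarted_task)
{
    int i;
    int n = 0;

    for (i = 0; i < count; i++) {
        if (reqs[i].pending && reqs[i].server == restarted_task) {
            reqs[i].reissued++;   /* a real system would post it again here */
            n++;
        }
    }
    return n;
}
```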

Our exception handler could be improved by assigning priorities to the exception or classes of exceptions. If we assume that the exception is a long (32-bit) integer, we can define a crash code as follows:

    typedef struct
    {
        char class;
        char pri;
        short fault;
    } SYS_EXCEPTION;

Since SYS_EXCEPTION can be treated as a long integer, it can still be passed to functions as in our original example.

The class field could be used to determine the type of exception that has occurred: either a report-only, an explicit hard-boot fault, or an exception to be handled based on its priority.

The pri field, on the other hand, can be used to provide dynamic exception-handling characteristics. Upon fault, a master current exception priority field (a global variable set by the user or programmer) is checked. If the class field of the exception indicates “handle by priority,” the exception handler compares the priority passed when the exception was raised against the current exception priority. Exceptions of equal or higher priority halt the system, while exceptions of lower priority are simply noted.
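The class and priority rules can be written as one small decision function. In this sketch the enum names, sys_cur_pri, and exc_decide() are hypothetical, and it assumes that a larger number means a higher priority; the article does not specify the ordering.

```c
/* Hypothetical exception classes, matching the three cases described above. */
typedef enum { CLASS_RPT_ONLY, CLASS_HARD_BOOT, CLASS_BY_PRI } exc_class_t;

/* What the handler should do with the exception. */
typedef enum { DO_LOG, DO_HALT, DO_HARD_BOOT } exc_action_t;

/* Master current exception priority, set by the user or programmer. */
static unsigned char sys_cur_pri = 3;

/* Assumption: larger pri values mean higher priority. */
exc_action_t exc_decide(exc_class_t cls, unsigned char pri)
{
    switch (cls) {
    case CLASS_RPT_ONLY:
        return DO_LOG;                 /* report only, keep running */
    case CLASS_HARD_BOOT:
        return DO_HARD_BOOT;           /* explicit hard-boot fault */
    case CLASS_BY_PRI:
        /* Equal or higher priority halts; lower priority is noted. */
        return (pri >= sys_cur_pri) ? DO_HALT : DO_LOG;
    }
    return DO_HALT;                    /* unknown class: stop, to be safe */
}
```

Raising sys_cur_pri at run time makes the system more tolerant, since fewer "handle by priority" exceptions then reach the halt threshold.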

The fault field is the unique crash code for the exception. A SYS_EXCEPTION data structure would be built at compile time by ORing flags with the crash code as follows:

    sys_fault( (long) (SYS_RPT_ONLY
                     | SYS_PRT_3
                     | SYS_CRASH_001), err );

This fault is report-only with a priority of 3 and a crash code of 1. The values of the flags are such that their bit positions are aligned with the proper fields of the SYS_EXCEPTION data structure.
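One possible set of flag values is sketched below. The numeric values are assumptions, as are the names SYS_HARD_BOOT, SYS_BY_PRI, and the EXC_* macros; the layout places class in the most significant byte, which is where the first structure member falls on a big-endian processor. Decoding with shifts and masks, rather than overlaying the structure on the long, avoids depending on byte order and structure padding.

```c
/* class field, bits 31..24 (values are illustrative) */
#define SYS_RPT_ONLY   0x01000000L
#define SYS_HARD_BOOT  0x02000000L   /* hypothetical name */
#define SYS_BY_PRI     0x03000000L   /* hypothetical name */

/* pri field, bits 23..16 */
#define SYS_PRT_3      0x00030000L

/* fault field, bits 15..0 */
#define SYS_CRASH_001  0x0001L

/* Extract each field from a 32-bit exception code. */
#define EXC_CLASS(x)  ((unsigned)(((unsigned long)(x) >> 24) & 0xFFUL))
#define EXC_PRI(x)    ((unsigned)(((unsigned long)(x) >> 16) & 0xFFUL))
#define EXC_FAULT(x)  ((unsigned)((unsigned long)(x) & 0xFFFFUL))
```

With these values, SYS_RPT_ONLY | SYS_PRT_3 | SYS_CRASH_001 evaluates to 0x01030001, from which the handler recovers class 1, priority 3, and crash code 1.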

For both of these concepts (recovery and classes of exceptions), the procedure csys_fault() shown in Listing 1 requires enhancement to meet the individual needs and requirements of the system. The point is to have the central exception handler in place. The programmer can then start with simple fault handling and, if necessary, increase the capabilities of the exception handler to meet the needs of a growing system.

The method's result
The purpose of exception handling is to build a consistent environment that helps isolate and identify problems. The central exception handler can be updated and modified to support the current system condition, whether it be during implementation or after the product is shipped. During system implementation, an exception handler should allow the programmer to identify problems when they occur. Once the project is completed, the exception handler should record problems as they occur, restart the system, and recover automatically from the exception.

When this article was written, Thomas Besemer was a consultant in Sunnyvale, Calif., specializing in embedded system design and implementation. He has worked on factory automation installations in many roles, including hardware engineer, programmer, and project manager. To find out what he's doing today, go to http://thomas-iv.com/founder.html.
