RTOS diagnostics and error checking - Embedded.com

Error handling is unlikely to be a major feature of any operating system intended for embedded systems applications. This is an inevitable result of resource limitations, and all embedded systems are constrained in some way. It is also logical, as only a limited number of embedded systems can behave like a desktop system, i.e. offer the user a chance to decide what to do next when something exceptional occurs.

In Nucleus SE there are broadly three types of error checking:

  • Facilities to “sanity check” the selected configuration – just to make sure that selected options are consistent

  • Optionally included code to check runtime behavior

  • Specific API functions that facilitate the design of more robust code

These will all be covered in this article along with some ideas about user-implemented diagnostics.

Configuration Checks

Nucleus SE is designed to be very user-configurable so that it can be tailored to make the best use of available resources. This configurability is a challenge, as the number of options, and the interdependencies between them, is quite large. As has been described in many of the previous articles, most user configuration of Nucleus SE is performed by setting #define constants in the file nuse_config.h.

In order to help identify configuration errors, a file – nuse_config_check.h – is included (i.e. by means of a #include into nuse_config.c), which performs a number of consistency checks on the #define symbols. Here is an extract from this file:

/*** Tasks and task control ***/
    #error NUSE: invalid number of tasks – must be 1-16
    #error NUSE: NUSE_Task_Relinquish() selected – not valid with priority scheduler
    #error NUSE: NUSE_Task_Resume() selected – task suspend not
    #error NUSE: NUSE_Task_Suspend() selected – task suspend not
    #error NUSE: Initial task state enabled – task suspend not
/*** Partition pools ***/
    #error NUSE: invalid number of partition pools – must be 0-16
        #error NUSE: NUSE_Partition_Allocate() enabled – no partition pools configured
        #error NUSE: NUSE_Partition_Deallocate() enabled – no partition pools configured
        #error NUSE: NUSE_Partition_Pool_Information() enabled – no partition pools configured

The checks performed include the following:

  • Verification that at least one, but no more than sixteen tasks have been configured

  • Confirmation that selected API functions are not inconsistent with the chosen scheduler or other options

  • Verification that no more than sixteen instances of other kernel objects have been specified

  • Confirmation that API functions have not been selected for objects that are not instantiated at all

  • Ensuring that API functions for signals and system time are not selected when these facilities have not been enabled

  • Verification of selected scheduler type and associated options

In all cases, detection of an error results in the compilation of a #error statement. This normally results in the compilation being terminated with the specified message.

This file does not make it impossible to create an illogical configuration but renders it highly unlikely.
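The extract above shows only the messages; each one sits inside a preprocessor conditional that tests the configuration symbols. A minimal sketch of the pattern follows. The DEMO_ symbol names and values are hypothetical stand-ins for illustration, not the actual symbols defined in nuse_config.h.

```c
/* Hypothetical configuration values; a real build would take these
   from its configuration header. */
#define DEMO_TASK_NUMBER            4
#define DEMO_PARTITION_POOL_NUMBER  0
#define DEMO_PARTITION_ALLOCATE     0   /* API function not selected */

/* Range check: at least one, but no more than sixteen tasks. */
#if DEMO_TASK_NUMBER < 1 || DEMO_TASK_NUMBER > 16
    #error DEMO: invalid number of tasks - must be 1-16
#endif

/* Range check on the number of objects. */
#if DEMO_PARTITION_POOL_NUMBER > 16
    #error DEMO: invalid number of partition pools - must be 0-16
#endif

/* Dependency check: an API function must not be selected when no
   objects of that type are configured. */
#if DEMO_PARTITION_POOL_NUMBER == 0
    #if DEMO_PARTITION_ALLOCATE
        #error DEMO: partition allocate enabled - no partition pools configured
    #endif
#endif

/* Reaching this point means every check passed. */
enum { DEMO_CONFIG_OK = 1 };
```

Changing DEMO_TASK_NUMBER to 0 or 17, or setting DEMO_PARTITION_ALLOCATE to 1, should stop the build at the corresponding #error line.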

API Parameter Checking

Like Nucleus RTOS, Nucleus SE has the facility to optionally include code to verify API function call parameters at run time. Normally this would only be employed during initial debugging and testing as the memory and runtime overhead would be undesirable in production code.

Parameter checking is enabled by setting NUSE_API_PARAMETER_CHECKING in nuse_config.h to TRUE. This enables the compilation of the required additional code. Here is an example of an API function’s parameter checking:

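STATUS NUSE_Mailbox_Send(NUSE_MAILBOX mailbox, ADDR *message,
                           U8 suspend)
{
      STATUS return_value;

   #if NUSE_API_PARAMETER_CHECKING
         /* reject an out-of-range mailbox index */
         if (mailbox >= NUSE_MAILBOX_NUMBER)
               return NUSE_INVALID_MAILBOX;
         /* reject a null message pointer */
         if (message == NULL)
               return NUSE_INVALID_POINTER;
      #if NUSE_BLOCKING_ENABLE   /* symbol name assumed; the directive is not shown in the extract */
            if ((suspend != NUSE_NO_SUSPEND) &&
                  (suspend != NUSE_SUSPEND))
                  return NUSE_INVALID_SUSPEND;
      #else
            /* without blocking support, only NUSE_NO_SUSPEND is valid */
            if (suspend != NUSE_NO_SUSPEND)
                  return NUSE_INVALID_SUSPEND;
      #endif
   #endif
      /* ... remainder of the function not shown ... */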

This parameter checking may result in the return of an error code from the API function call. These are all negative values of the form NUSE_INVALID_xxx (e.g. NUSE_INVALID_POINTER); a complete set of definitions is included in nuse_codes.h.

Extra application code (perhaps conditionally compiled) may be included to process these error values, but it would probably be better to use the data monitoring facilities of a modern embedded debugger to detect them.
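As a rough illustration of such conditionally compiled application code, the sketch below wraps API calls in a macro that records any negative status where a debugger can watch it. Every name here (DEMO_STATUS, DEMO_CHECK, demo_send and the error value) is a hypothetical stand-in; the real type and codes come from the Nucleus SE headers.

```c
#include <stddef.h>

typedef int DEMO_STATUS;               /* stand-in for the kernel status type */
#define DEMO_SUCCESS            0
#define DEMO_INVALID_POINTER  (-2)     /* illustrative value only */

#define DEMO_CHECK_ERRORS       1      /* set to 0 for production builds */

int         demo_error_count;          /* watchable from a debugger */
DEMO_STATUS demo_last_error;

#if DEMO_CHECK_ERRORS
/* Record any negative status, then pass it through unchanged. */
DEMO_STATUS demo_check(DEMO_STATUS status)
{
    if (status < 0)
    {
        demo_error_count++;
        demo_last_error = status;
    }
    return status;
}
#define DEMO_CHECK(call) demo_check(call)
#else
#define DEMO_CHECK(call) (call)        /* no overhead when disabled */
#endif

/* Stand-in for an API function built with parameter checking. */
DEMO_STATUS demo_send(const void *message)
{
    if (message == NULL)
        return DEMO_INVALID_POINTER;
    return DEMO_SUCCESS;
}
```

A call site would then read DEMO_CHECK(demo_send(&msg)), and the wrapper compiles away entirely when DEMO_CHECK_ERRORS is 0.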

Parameter checking introduces overheads in memory (the extra code) and run time performance, so its use is somewhat intrusive. As the full source code to Nucleus SE is available to the developer, checking and debugging can be done “by hand” on production code, if absolute precision is required.

Task Stack Checking

So long as the Run to Completion scheduler is not in use, Nucleus SE has a task stack checking facility available, which is similar to that provided in Nucleus RTOS, and provides an indication of remaining stack space. This API call – NUSE_Task_Check_Stack() – was described in detail in a previous article. Some ideas on stack error checking are discussed in the “User Diagnostics” section later in this article.

Version Information

Nucleus RTOS and Nucleus SE have an API function which simply returns version/release information about the kernel.

Nucleus RTOS API Call

Service call prototype:

CHAR *NU_Release_Information(VOID);

Parameters:

None

Returns:

Pointer to null-terminated version string

Nucleus SE API Call

This API call supports the key functionality of the Nucleus RTOS API.

Service call prototype:

char *NUSE_Release_Information(void);

Parameters:

None

Returns:

Pointer to null-terminated version string

Nucleus SE Implementation of Release Information

The implementation of this API call is almost trivial. A pointer to the string constant NUSE_Release_Info, which is declared and initialized in nuse_globals.c, is returned.

This string takes the form Nucleus SE – Xyymmdd , where:

X is the release status: A = alpha; B = beta; R = released
yy is the year of the release
mm is the month of release
dd is the day of release
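
An application that wants to act on this information could decode the trailing Xyymmdd field. The following is a minimal sketch, assuming only the format described above; the function and structure names are hypothetical and are not part of the Nucleus SE API.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Decoded fields of a release string ending in "Xyymmdd". */
typedef struct
{
    char status;   /* 'A' = alpha, 'B' = beta, 'R' = released */
    int  year;     /* two-digit year */
    int  month;    /* 1-12 */
    int  day;      /* 1-31 */
} demo_release_t;

/* Returns 0 on success, -1 if the last seven characters of the
   string do not have the form Xyymmdd. */
int demo_parse_release(const char *text, demo_release_t *out)
{
    size_t len;
    const char *tail;
    int i;

    if (text == NULL || out == NULL)
        return -1;
    len = strlen(text);
    if (len < 7)
        return -1;
    tail = text + len - 7;
    if (tail[0] != 'A' && tail[0] != 'B' && tail[0] != 'R')
        return -1;
    for (i = 1; i < 7; i++)
        if (!isdigit((unsigned char)tail[i]))
            return -1;

    out->status = tail[0];
    out->year   = (tail[1] - '0') * 10 + (tail[2] - '0');
    out->month  = (tail[3] - '0') * 10 + (tail[4] - '0');
    out->day    = (tail[5] - '0') * 10 + (tail[6] - '0');
    if (out->month < 1 || out->month > 12 || out->day < 1 || out->day > 31)
        return -1;
    return 0;
}
```

In an application, the string passed in would be the pointer returned by NUSE_Release_Information().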

Compatibility with Nucleus RTOS

Nucleus RTOS includes an optional facility for maintaining a history log. The kernel records details of various system activities. API functions are provided to enable the application program to:

  • enable/disable history saving

  • make a history entry

  • retrieve a history entry

This capability is not supported by Nucleus SE.

Nucleus RTOS also includes some error management macros that perform assertions and provide a means by which a user-defined fatal error function may be called. These are conditionally included in an OS build. Nucleus SE does not support this type of facility.

User Diagnostics

So far in this article we have looked at the diagnostic and error checking facilities provided by Nucleus SE itself. This is now a good opportunity to consider how user-defined or application-oriented diagnostics may be implemented using the facilities provided by the kernel and/or applying our knowledge of its internal structure and implementation.

Application Specific Diagnostics

Almost any application program could have additional code added to check its own integrity at run time. With a multi-tasking kernel, dedicating a specific task to this job is convenient and straightforward. Obviously, diagnostics that are very specific to the application are not within the scope of this series of articles, but we can consider some broad ideas.

Memory Checks

The correct operation of memory is obviously critical to the integrity of any processor-based system. Equally obviously, a catastrophic failure would prevent any software, let alone diagnostics, from running at all. But there are situations when a degree of failure occurs and is a major concern but does not totally prevent code execution. Testing memory is quite a complex topic, which is way beyond the scope of this article, so I can only give some general ideas.

The two most common faults that can develop specifically in RAM are: “stuck bits” – where a bit has the value zero or one and cannot be changed; and “cross talk” – where adjacent bits interfere with one another. These can both be tested by writing and reading back appropriate test patterns to each RAM location in turn. Some testing can only really be performed on start-up, before even a stack is established; for example, a “moving ones” test, where each bit in memory is set to one and every other bit is checked to ensure that it is zero. Other byte by byte pattern testing can be performed on the fly, so long as you ensure that a context switch cannot occur while the RAM location is corrupted. The Nucleus SE critical section delimiting macros NUSE_CS_Enter() and NUSE_CS_Exit(), which are defined in nuse_types.h, provide a straightforward and portable way to do this.
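
An on-the-fly test of a single byte might look like the sketch below. It is an illustration under stated assumptions rather than a complete memory test: the function name and the choice of patterns are hypothetical, and the two critical section macros are given empty stand-in definitions so that the fragment builds on a host without nuse_types.h.

```c
#include <stdint.h>

/* Stand-ins for host builds only; in a Nucleus SE application the
   real macros come from nuse_types.h. */
#ifndef NUSE_CS_Enter
#define NUSE_CS_Enter() ((void)0)
#define NUSE_CS_Exit()  ((void)0)
#endif

/* Test one RAM byte with alternating and all-zero/all-one patterns.
   The original contents are saved and restored inside the critical
   section, so no other task ever sees the test patterns.
   Returns 1 if every pattern read back correctly, otherwise 0. */
int demo_ram_byte_test(volatile uint8_t *location)
{
    static const uint8_t patterns[] = { 0x55, 0xAA, 0x00, 0xFF };
    uint8_t saved;
    unsigned i;
    int ok = 1;

    NUSE_CS_Enter();
    saved = *location;
    for (i = 0; i < sizeof patterns; i++)
    {
        *location = patterns[i];
        if (*location != patterns[i])
        {
            ok = 0;
            break;
        }
    }
    *location = saved;
    NUSE_CS_Exit();

    return ok;
}
```

A diagnostic task could step this function through a RAM region a few bytes per activation, which keeps each critical section short.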

Various types of ROM are also subject to occasional failure, but there is limited checking that software can do. A checksum generated when code is built would be useful. This could be checked at start-up and perhaps also during runtime.
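
As a sketch of the idea, the code below computes a simple 16-bit additive checksum and compares it with an expected value. The function names are hypothetical, and how the expected value is generated and stored at build time depends on the toolchain. An additive sum is used only because it is short; a CRC would detect more kinds of corruption.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit additive checksum over a block of bytes. */
uint16_t demo_checksum(const uint8_t *start, size_t length)
{
    uint16_t sum = 0;

    while (length--)
        sum = (uint16_t)(sum + *start++);
    return sum;
}

/* Returns 1 if the block matches the value recorded at build time,
   otherwise 0. */
int demo_rom_ok(const uint8_t *start, size_t length, uint16_t expected)
{
    return demo_checksum(start, length) == expected;
}
```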

A failure in memory addressing logic can affect both ROM and RAM. A test specifically for this may be devised, but it is most likely to show up during the other tests described above.

Peripheral Device Checks

Outside of the CPU, peripheral circuitry may be subject to failure. This will vary greatly, of course, from one system to another, but it is very common for devices to have some means to verify their integrity with diagnostic software. For example, a communications line may have a loop-back mode, whereby any data written to it is immediately returned.

Watchdog Servicing

Embedded systems designers commonly include a “watchdog” circuit. This is a peripheral device which either interrupts the CPU and expects a prompt response, or (better) requires a periodic access initiated by the software. In either case, a common result of a watchdog “biting” is a system reset.

Using a watchdog effectively in a multi-tasking environment is challenging. Simply servicing it from one task only confirms that this particular task is functioning. One way around this is to implement a “supervisor task” – an example of this is described later in this article.

Stack Overflow Checking

Unless you select the Run to Completion scheduler, a Nucleus SE application will include a stack for each task. The integrity of these stacks is vital, but RAM is likely to be limited, so getting the size to be optimal is essential. It is possible, but very difficult, to statically predict the stack requirements for each task: each stack needs sufficient capacity for the deepest chain of nested function calls combined with the most stack-hungry interrupt service routine. A simpler approach is to perform exhaustive run-time testing.

There are broadly two approaches to stack verification. If a sophisticated embedded software debugger is in use, the stack boundaries may be monitored, and any transgressions detected. The location and size of Nucleus SE stacks are readily accessible via the global data structures in ROM: NUSE_Task_Stack_Base[] and NUSE_Task_Stack_Size[].

The alternative is to perform run-time testing. The usual approach is to add “guard words” at the end of each stack – normally these would be the first location in each stack data area. These words are initialized to a recognizable non-zero value. Then, a diagnostic routine/task checks whether they have changed and takes appropriate action. The over-writing of a guard word does not mean the stack has actually overflowed but is an indication that it is about to do so; thus, the software is likely to continue executing long enough to take corrective action or warn the user.
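
A minimal sketch of guard word handling is shown below. It assumes stacks that grow towards lower addresses, so the first location of each stack area is the last to be used. The arrays are hypothetical stand-ins for the task stacks and for the kernel's table of stack base addresses, and the guard value is arbitrary.

```c
#include <stdint.h>

#define DEMO_TASK_COUNT   2
#define DEMO_STACK_WORDS  32
#define DEMO_GUARD_WORD   0xC0DEFACEu   /* recognizable non-zero value */

/* Stand-ins for the task stacks and the table of their base addresses. */
uint32_t demo_stack[DEMO_TASK_COUNT][DEMO_STACK_WORDS];
uint32_t *const demo_stack_base[DEMO_TASK_COUNT] =
{
    demo_stack[0], demo_stack[1]
};

/* Place a guard word in the first location of each stack area.
   Intended to be called once, before any task runs. */
void demo_guard_init(void)
{
    int task;

    for (task = 0; task < DEMO_TASK_COUNT; task++)
        demo_stack_base[task][0] = DEMO_GUARD_WORD;
}

/* Returns the index of the first task whose guard word has changed,
   or -1 if all guard words are intact. */
int demo_guard_check(void)
{
    int task;

    for (task = 0; task < DEMO_TASK_COUNT; task++)
        if (demo_stack_base[task][0] != DEMO_GUARD_WORD)
            return task;
    return -1;
}
```

A diagnostic task would call demo_guard_check() periodically and treat any result other than -1 as a warning that the indicated task's stack is nearly exhausted.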

A Supervisor Task

Although Nucleus SE does not reserve any of the possible sixteen tasks for its own use, the user may choose to dedicate one to diagnostics. This may be a low priority task, which just takes advantage of any “spare” CPU time; or it may be a high priority task which runs for a short time occasionally, and thus ensures that diagnostics are always performed on a regular basis.

Here is how an example might function:

The supervisor task’s signal flags are used to monitor the operation of six critical tasks in the system. These tasks each use a specific flag (bit 0 to bit 5) and are required to set it on a regular basis. The supervisor task clears all the flags and then suspends itself for a period of time. When it resumes, it expects all six tasks to have “checked in” by setting the appropriate flag; it seeks an exact match to the value b00111111 (from nuse_binary.h). If this is satisfied, it clears the flags and suspends again. If not, it calls the critical error handling routine, which may perform a system reset, for example.

An alternative implementation may have used an event flag group. This would make sense if signals were not used elsewhere in the application (as a RAM overhead would be incurred for all tasks) and even more so if event flags were employed for other purposes.
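
The check-in logic described above can be sketched independently of the kernel calls. In the fragment below a plain variable stands in for the supervisor task's signal flags, and all names are hypothetical. In a real application the flags would be set and read through the kernel's signal (or event flag) API, and the supervisor would sleep between cycles.

```c
#include <stdint.h>

#define DEMO_EXPECTED_FLAGS 0x3Fu   /* bits 0 to 5, i.e. b00111111 */

/* Stand-in for the supervisor task's signal flags. */
volatile uint8_t demo_flags;

/* Called by each monitored task to check in. On a real target this
   read-modify-write must be atomic, which is one reason to use the
   kernel's own signal-sending call instead. */
void demo_check_in(unsigned bit)
{
    demo_flags |= (uint8_t)(1u << bit);
}

/* One supervisor cycle: read and clear the flags, then test for an
   exact match. Returns 1 if all six tasks checked in, otherwise 0,
   in which case the caller invokes its critical error handler. */
int demo_supervisor_cycle(void)
{
    uint8_t seen = demo_flags;

    demo_flags = 0;
    return seen == DEMO_EXPECTED_FLAGS;
}
```

If the supervisor services the watchdog only when demo_supervisor_cycle() returns 1, a stall in any one of the six tasks eventually leads to a watchdog reset.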

Tracing and Profiling

Although many modern embedded software debuggers are quite customizable and could be made “RTOS aware”, debugging a multi-threaded application can still be challenging. A widely used approach is post-execution profiling, where the (RTOS) code is instrumented so that a detailed audit of its operations may be retrospectively analyzed. Typically, implementing such a facility involves two components:

  1. Additional code is added to the RTOS (i.e. it is “instrumented”) to log activity. Typically, this will be enclosed in preprocessor directives to facilitate conditional compilation. This code stores a few bytes of information when a significant event occurs – like an API call or a context switch. This information would be things like:

    – The current address (PC).

    – Current task ID (index).

    – Index of any other object involved.

    – A code indicating the operation that was performed.

  2. A task dedicated to emptying the buffer of profile information to external storage – typically a host computer.
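
The two components can be sketched as a small ring buffer with a logging function and a reading function. This is an illustration only: the names, the record layout and the buffer size are hypothetical, and a real implementation would protect the buffer indices against concurrent access and obtain the program counter in a target-specific way.

```c
#include <stdint.h>

#define DEMO_TRACE_ENABLE  1
#define DEMO_TRACE_SIZE    8     /* entries; must be a power of two */

typedef struct
{
    uint32_t pc;       /* address at which the event was logged */
    uint8_t  task;     /* current task index */
    uint8_t  object;   /* index of any other object involved */
    uint8_t  op;       /* code for the operation performed */
} demo_trace_t;

demo_trace_t demo_trace[DEMO_TRACE_SIZE];
unsigned demo_head, demo_tail;   /* free-running write/read counters */

#if DEMO_TRACE_ENABLE
/* Called from instrumented kernel code. The event is dropped if the
   buffer is full. */
void demo_trace_log(uint32_t pc, uint8_t task, uint8_t object, uint8_t op)
{
    demo_trace_t *entry;

    if (demo_head - demo_tail == DEMO_TRACE_SIZE)
        return;
    entry = &demo_trace[demo_head % DEMO_TRACE_SIZE];
    entry->pc = pc;
    entry->task = task;
    entry->object = object;
    entry->op = op;
    demo_head++;
}
#else
#define demo_trace_log(pc, task, object, op) ((void)0)
#endif

/* Called by the draining task. Copies the oldest entry to *out and
   returns 1, or returns 0 if the buffer is empty. */
int demo_trace_read(demo_trace_t *out)
{
    if (demo_tail == demo_head)
        return 0;
    *out = demo_trace[demo_tail % DEMO_TRACE_SIZE];
    demo_tail++;
    return 1;
}
```

The draining task would loop on demo_trace_read() and send each record to the host, for example over a serial link.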

The analysis of this captured data would also need some work, but this may be as simple as using an Excel spreadsheet.

In the next article we will review Nucleus SE compatibility with Nucleus RTOS in detail.

Colin Walls has nearly forty years’ experience in the electronics industry, largely dedicated to embedded software. A frequent presenter at conferences and seminars and author of numerous technical articles and two books on embedded software, Colin is an embedded software technologist with Mentor, a Siemens business, and is based in the UK. His regular blog is located at: http://blogs.mentor.com/colinwalls. He may be reached by email at colin_walls@mentor.com
