CMP EMBEDDED.COM


General Purpose Watchdog Timer Component for a Multitasking System

by Dale Lantrip and Larry Bruner

The software service method and service interval for the watchdog timer vary from processor to processor. Here are two C++ classes that make up a general purpose component to use the watchdog timer.

Several popular microprocessors contain watchdog timers that must be serviced regularly to preclude resets by hardware circuitry in the processors. John Santic's article, "Watchdog Timer Techniques" (ESP, April 1995, pp. 58-69) addressed various hardware implementations of watchdog timers. As noted by the author, the software service method and service interval for the timer vary from processor to processor. The software system is responsible for servicing the watchdog at the service interval, like a heartbeat, to keep the system alive. The system can increase the pulse of the heartbeat by servicing the watchdog several times within the span of the service interval, but if the software fails to service the watchdog for a span of time greater than the service interval, hardware circuits will reset the processor. Triggering a reset implies the system is unreliable: in an unknown state, caught in a loop with interrupts masked, or busy handling a barrage of constant interrupts. As a last resort to handle an unpredictable situation, the processor resets the software system using the watchdog timer mechanism.

This article presents a portable software component design comprising two C++ classes that use the watchdog timer as a system health monitor. The component has been designed for a software system comprising multiple tasks. C++ makes no provision in the language for tasking; therefore, an executive or real-time operating system (RTOS) is necessary to schedule, dispatch, and prioritize system tasks before this solution can be fully implemented in an embedded system. For the purpose of this article, the software was developed using Borland C++ on a PC, and the thread classes provided by Borland (Win32) were used to simulate an RTOS multitasking environment on an embedded system target.

Central to the design is the Watchdog_Manager class, which is responsible for monitoring registered synchronous tasks and the state of an "active" flag to consistently service or terminate service of the watchdog timer. Task registration is performed by declaring an object of the Watchdog_Entry class and using the Register member function of the class to define a relationship between the manager and the registered entry. The manager class continuously monitors the activity of all registered entry class objects. The definition of a synchronous task, in this context, is any task that executes in the system at a consistent periodic frequency. Contained within the watchdog manager class is a synchronous task which executes at an interval just under the service interval for the watchdog timer. An asynchronous task would be any task that is not synchronous within the system. Tasks which execute as a result of events in the system other than time can be considered asynchronous. For example, an I/O routine in a system that polls every 30ms for incoming messages would be considered synchronous, while an I/O handler that executes whenever an applicable interrupt is received would be considered asynchronous.

Often a visual aid is more effective than mere words to communicate the design of a software component. Because this component focuses primarily on the tasks of the system in which it executes, the relationship among those tasks is depicted in Figure 1. Following successful initialization of the component, each synchronous task monitored by the component is required to update a task entry list (internal to the manager) via the Watchdog_Entry::Ping member function within the period defined for the task during entry registration (the period and reset code for each registered task are statically maintained within the internal task list). Any task in the system may directly inhibit the servicing of the hardware watchdog by the manager using the Watchdog_Manager::Terminate member function; for asynchronous tasks, using the terminate function is the only provision available. If the RTOS resumes the manager task within the service interval, the manager will read through the task entry list and examine the termination status. If termination has not been requested and none of the entries in the task list have exceeded their defined periods, the manager will ping the hardware watchdog timer. Otherwise, the manager will save the reset code elected by either the task that requested termination or the task entry that did not ping before its defined period was exceeded.

The Static Watchdog Manager Class

Multiple manager objects with the ability to service the watchdog timer in tandem are not desirable in this component. To enforce a single Watchdog_Manager class per processor, the manager has been defined as a static class. The task defined by the Watchdog_Manager class has the sole responsibility of servicing the watchdog timer at the specified interval. Because the system will perish if the watchdog timer is not serviced by the expected interval, the task associated with the Watchdog_Manager class must run quickly with minimal interference. Logically, the task which invokes the Watchdog_Manager::Run member function to service the watchdog timer must be assigned the highest priority provided by the RTOS.

Any synchronous task in the system to be monitored by the Watchdog_Manager must register by declaring an object of the Watchdog_Entry class. This situation establishes an indirect relationship between the manager and the monitored synchronous task. The manager does not actually monitor synchronous tasks per se; it reads an internal data structure updated by objects of the Watchdog_Entry class. With this design, any synchronous task in the system can establish a virtual watchdog timer through its corresponding Watchdog_Entry class object. A registered synchronous task services this virtual watchdog with the Ping member function prior to the expiration of its registered timeout period (this timeout period is essentially a service interval). Failure of an entry to ping within its registered timeout period has a domino effect on the system. The manager will detect the entry timeout and will react by terminating any subsequent servicing of the actual hardware watchdog timer.

A more direct interface to the Watchdog_Manager is provided for asynchronous tasks. The Terminate method is provided as a mechanism by which any task may halt servicing of the hardware watchdog timer by the manager. To provide a direct mechanism by which an asynchronous task may inform the manager of the need to reset the processor, the Terminate member function is public. Functionally, Terminate is also used to inhibit service of the timer due to a Watchdog_Entry timeout. However, because Terminate is a public member function of the Watchdog_Manager class, the method can be used directly by synchronous and asynchronous tasks alike.

LISTING 1
Watchdog_Manager public member function prototypes.


static void Initialize(Milliseconds period,
                       Ping_HW_Watchdog_Function service_HW_watchdog,
                       Save_Reset_Code_Function save_reset_code,
                       Time_Function current_time,
                       Delay_Function delay);
// Called at system init,
// before Watchdog_Manager::Run.

static void Terminate(Unsigned_32 reset_code);
// Called by asynchronous reset events.
// Will inhibit service of the watchdog
// thereby inducing a reset.

static void Run(void);
// Function that executes every Period
// msecs checking the registration list
// of synchronous tasks. Will induce a
// reset if any of the tasks fail to
// report an iteration within its
// "registered" period or if a
// Terminate is requested.
// NOTE: Intended to be called from a
// thread or task, sole purpose is to
// service the watchdog, i.e.
// no "return" from this function!

Three public member functions are provided by the Watchdog_Manager: Initialize, Run, and Terminate (see Listing 1). Because there are no corresponding objects of this static class, Initialize should be called at system construction to initialize the manager class with system-specific items:

  • period: This is the interval at which the hardware watchdog timer should be serviced, in milliseconds. As a guideline, period should be less than the actual service interval provided by the processor to allot for system overhead (such as task context switching, system clock tick resolution, and so on). As an example, if the documentation for the processor states an expected service interval of 62.5ms, selecting a period of 60ms would yield an overhead margin of 2.5ms. A value of zero for this initialization item effectively deactivates the class to provide an "off" switch for servicing the watchdog timer. Many processors have programmable watchdog timers which may be disabled entirely during software development. This software mechanism of the component is provided to conform with the hardware disable mechanism
  • service_HW_watchdog: This is the address of the function executed by the Watchdog_Manager class to service, or ping, the hardware watchdog timer. Service methods vary from processor to processor. To ensure reusability among these processors, the manager must be provided with a function to service the watchdog specific to the processor on which the software is installed and running. This function should require no input parameters at invocation and should not return a value as a result of execution
  • save_reset_code: This is the address of a function, which when executed, will save the single input parameter (an unsigned 32-bit number) for retrieval after a reset resulting from failure to ping the watchdog timer. The purpose of this function is to provide the user of the component a means of saving information for retrieval after a watchdog timer reset on the processor. The reset code is an unsigned 32-bit value, and is available for use by the user of the component to quickly determine the cause of a failure. With the exception of the hexadecimal value FFFF_FFFF, the reset code value range is available to the user of the component to convey relevant information following a watchdog reset (such as the unique identifier of the task that prompted the watchdog timer reset). Because the manner in which this information may be saved or retrieved is specific to the processor and the overall system design, this input function provides a means through which the system software designer can customize the data feedback method. The input function should not return a value as a result of execution
  • current_time: This is the address of a function which, when called, returns the current system time in milliseconds since system initialization. Because processors and clocks vary in interpretation of time, and all the time periods used by this component are expressed in terms of milliseconds, this function provides the component with a consistent method of time comparison. The time comparison algorithm internal to the component has no provision for a clock rollover; for peak reliability and exhaustive usage, the current time function should return a value of zero if called at the exact moment of system initialization. This function should require no input parameters at invocation and should return the number of milliseconds that have elapsed since system initialization
  • delay: This is the address of a function which, when called with a single input parameter of x milliseconds, will resume execution of the statement following the call to this function no longer than x milliseconds after the function invocation. Furthermore, as a result of executing this function, the calling task must relinquish control of the processor to the RTOS. The desired effect is that the task in which this function is called would not be a candidate for continued execution until the delay period expires. In addition, if the calling task is assigned the highest system priority, the desired result is immediate reinstatement of the calling task after the delay period expires, interrupting any lower priority task in progress. This function allows the manager task, primarily responsible for servicing the hardware watchdog timer based on system conditions, to minimally occupy the CPU at highest system priority at each service interval
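
The five initialization items map naturally onto function pointer types. The following is a minimal sketch rather than the authors' code: the pointer type names come from Listing 1, but the underlying integer types, the sim_ functions, and the Watchdog_Config holder are hypothetical stand-ins for processor- and RTOS-specific code, written so the idea can be tried on a host PC.

```cpp
#include <cassert>
#include <cstdint>

using Milliseconds = std::uint32_t;  // assumed underlying type
using Unsigned_32  = std::uint32_t;  // assumed underlying type

// Pointer types named after the parameter types in Listing 1; the
// signatures follow the descriptions in the text.
using Ping_HW_Watchdog_Function = void (*)();
using Save_Reset_Code_Function  = void (*)(Unsigned_32);
using Time_Function             = Milliseconds (*)();
using Delay_Function            = void (*)(Milliseconds);

// Hypothetical simulated platform state for host-side experiments.
static Milliseconds sim_clock_ms   = 0;
static unsigned     sim_hw_pings   = 0;
static Unsigned_32  sim_saved_code = 0xFFFFFFFFu;  // reserved default value

void         sim_ping_hw()                { ++sim_hw_pings; }
void         sim_save_code(Unsigned_32 c) { sim_saved_code = c; }
Milliseconds sim_now()                    { return sim_clock_ms; }
// A real delay would yield to the RTOS; this one only advances the clock.
void         sim_delay(Milliseconds ms)   { sim_clock_ms += ms; }

// Hypothetical holder for the five items passed to Initialize.
struct Watchdog_Config {
    Milliseconds              period;
    Ping_HW_Watchdog_Function service_HW_watchdog;
    Save_Reset_Code_Function  save_reset_code;
    Time_Function             current_time;
    Delay_Function            delay;

    // A period of zero is the "off" switch described above.
    bool active() const { return period > 0; }
};

// Bundle the simulated hooks with the chosen service period.
Watchdog_Config make_sim_config(Milliseconds period)
{
    Watchdog_Config cfg = { period, sim_ping_hw, sim_save_code,
                            sim_now, sim_delay };
    assert(cfg.service_HW_watchdog && cfg.save_reset_code &&
           cfg.current_time && cfg.delay);
    return cfg;
}
```

On a real target, each sim_ function would be replaced by the processor's watchdog service sequence, a write to storage that survives reset, the system tick counter, and the RTOS delay call, respectively.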

The Initialize member function does not initialize the hardware watchdog timer (if that is required by the target processor). Often a system is designed to perform a variety of functions at initialization before loading and/or passing control to the application software. These functions may include hardware checks such as testing RAM to ensure reliability; downloading the application code from ROM to RAM; defining task control blocks for the RTOS; and initializing hardware modules. In a C++-based system, this initialization includes system level object construction, static class initialization, task creation, and definition of system constants. Any initialization of the watchdog timer for the processor should be done during construction. If the system startup exceeds the service interval for the watchdog timer, the initialization logic is responsible for providing the necessary watchdog timer service until the RTOS assumes control of the processor to dispatch application tasks. The Initialize member function of the static Watchdog_Manager class is executed at system startup to define parameters that are specific to the target system, so the class can properly service the watchdog on behalf of the application following system startup.

The Terminate member function can be called at any time by any task in the system to block subsequent servicing of the watchdog timer by the Watchdog_Manager, forcing a watchdog timer reset by the processor. The function accepts as the single input parameter a reset code, which is an unsigned 32-bit value ranging from zero through FFFF_FFFE. The reset code is defined as an input parameter to several member functions of this component. Although the reset code is not essential to the operation of this component, there is a historical aspect to the code's origin in this software. On several occasions during the field test of a system in which the predecessor of this component was used, resets occurred due to watchdog timeouts. On those occasions, it was difficult to determine the task responsible for the reset and isolate the cause of the system failure. Thus, the component was supplemented with the Save_Reset_Code function (executed by Terminate) and reset codes to provide the ability to quickly identify the cause of a watchdog reset. The reset codes used can be any value in the 32-bit range (other than the reserved value FFFF_FFFF) that will uniquely identify the cause of a reset condition. Examples of reset codes are the task entry point of registered synchronous tasks, or a discrete value for each watchdog entry and termination condition in the system. Upon initialization of the component, the Save_Reset_Code function is used to apply the reserved value FFFF_FFFF by default.
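
One way to organize reset codes is a small table of discrete values plus a routine that interprets the saved code after restart. The sketch below is illustrative only: the code names, their values, and the retained_reset_code variable (standing in for whatever storage survives a reset on the target) are assumptions, while the reserved default FFFF_FFFF follows the text.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

using Unsigned_32 = std::uint32_t;  // assumed underlying type

// Reserved value applied at initialization, per the article.
constexpr Unsigned_32 RESET_CODE_DEFAULT = 0xFFFFFFFFu;

// Hypothetical application-defined codes, one per reset cause.
enum Reset_Code : Unsigned_32 {
    RESET_COMM_POLL_TIMEOUT    = 0x0001,  // synchronous entry timed out
    RESET_CONTROL_LOOP_TIMEOUT = 0x0002,  // synchronous entry timed out
    RESET_BUS_FAULT_HANDLER    = 0x0100   // asynchronous Terminate request
};

// Stands in for storage that survives a watchdog reset on the target.
static Unsigned_32 retained_reset_code = RESET_CODE_DEFAULT;

// Candidate for the save_reset_code argument of Initialize.
void save_reset_code(Unsigned_32 code) { retained_reset_code = code; }

// Every value except the reserved default is available to the user.
bool is_user_code(Unsigned_32 code) { return code != RESET_CODE_DEFAULT; }

// Interpret the retained code during startup after a reset.
std::string describe_last_reset()
{
    switch (retained_reset_code) {
        case RESET_CODE_DEFAULT:         return "lock-up or no code saved";
        case RESET_COMM_POLL_TIMEOUT:    return "comm poll task timed out";
        case RESET_CONTROL_LOOP_TIMEOUT: return "control loop task timed out";
        case RESET_BUS_FAULT_HANDLER:    return "bus fault handler terminated";
        default:                         return "unrecognized reset code";
    }
}
```

An asynchronous handler would then pass a value such as RESET_BUS_FAULT_HANDLER to Terminate, and the startup code would report the text returned by describe_last_reset.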

The Run member function should be executed immediately after the Initialize member function. Run should be executed from the highest priority task assigned by the RTOS in the system. The Run function contains the logic that checks current system conditions to determine whether or not the hardware watchdog service function should be executed. The Run function is the heart of the Watchdog_Manager and ultimately decides "to be, or not to be" for the system. This routine will continuously ping the hardware watchdog timer at the defined service interval unless:

  1. The defined service interval (period) is less than or equal to zero (providing the developer an off switch for the component when used on processors that permit the hardware watchdog timer to be disabled)
  2. A reset has been triggered via the Terminate member function or the detection of the timeout of a registered watchdog entry
  3. The RTOS is unable to resume execution of the Watchdog_Manager::Run task after a call to the delay function (usually due to a processor lockup)

In the event of a reset condition, the Save_Reset_Code function will be executed to save the code presented by the last execution of Terminate or the first watchdog entry timeout detected. If there is a processor lock-up, the reset code defaults to the reserved FFFF_FFFF value.
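
The first two conditions can be seen in a bounded sketch of the Run loop. This is an assumption-laden illustration, not the published source: Manager_Sim, terminate, and run_passes are hypothetical names, the platform is simulated by plain counters, and the scan of registered entries is left out because a timed-out entry feeds the same terminate path. The third condition (the RTOS never resuming the task) cannot be modeled in software; it is precisely the case the hardware timer exists to catch.

```cpp
#include <cassert>
#include <cstdint>

using Milliseconds = std::uint32_t;  // assumed underlying type
using Unsigned_32  = std::uint32_t;  // assumed underlying type

// Hypothetical stand-in for the manager's state plus a simulated platform.
struct Manager_Sim {
    Milliseconds period     = 60;           // service period in ms
    Milliseconds clock_ms   = 0;            // simulated system time
    unsigned     hw_pings   = 0;            // times the hardware was serviced
    bool         terminated = false;        // set once a reset is requested
    Unsigned_32  saved_code = 0xFFFFFFFFu;  // reserved default value
};

// Terminate: save the code and inhibit all further servicing.
void terminate(Manager_Sim& m, Unsigned_32 reset_code)
{
    assert(reset_code != 0xFFFFFFFFu);  // reserved value is not a user code
    m.terminated = true;
    m.saved_code = reset_code;
}

// Bounded version of the Run loop. The real Run never returns; `passes`
// exists only so the sketch can be exercised on a host.
// Returns how many times the hardware watchdog was serviced.
unsigned run_passes(Manager_Sim& m, unsigned passes)
{
    unsigned serviced = 0;
    if (m.period == 0) {
        return serviced;          // condition 1: component switched off
    }
    for (unsigned i = 0; i < passes; ++i) {
        if (!m.terminated) {      // condition 2: no reset has been requested
            ++m.hw_pings;         // stands in for service_HW_watchdog()
            ++serviced;
        }
        m.clock_ms += m.period;   // stands in for delay(period)
    }
    return serviced;
}
```

Once terminate has been called, the loop keeps delaying but never services the hardware again, so the processor's own timer expires and forces the reset.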

There are two private member functions of the Watchdog_Manager: Register and Ping. These private functions are visible only to the friend class Watchdog_Entry. The Watchdog_Entry class has public member functions of the same names, which in turn invoke the corresponding Watchdog_Manager private member functions. These public member functions, which provide an indirect interface between synchronous tasks and the manager class, are the next topics of discussion.

The Watchdog Entry Class

The reason for the dual functions of these two classes was primarily to define a relationship of any number of watchdog entry class objects to a single static manager class, which monitors each registered entry object. This situation provides a level of class abstraction and maintains the integrity of the internal data structure representing each monitored entry.

LISTING 2
Watchdog_Entry public member function prototypes.


void Register(Unsigned_32 reset_code,
              Milliseconds period);
// Used by synchronous tasks to register
// their timeout periods, once registered
// a task must report within the period
// to prevent an induced reset.

void Ping(void);
// Used to indicate completion of a
// synchronous task iteration, MUST be
// called every period after
// registration.

In this implementation, the data structure for each entry is read by Watchdog_Manager::Run to detect a timeout, and updated by the private member functions Register and Ping, which can only be relayed by the corresponding public member functions of the Watchdog_Entry object (see Listing 2). Protection is also provided for the unique entry identifier of each registered Watchdog_Entry object monitored by the manager. The identifier is assigned to the object by the manager through registration and used subsequently by the object to ping the manager. The entry identifier is hidden within the class object with no direct external access provisions to protect it from accidental corruption by the system. The primary purpose of this class is to provide a direct mechanism for synchronous tasks to be monitored by the manager establishing the conceptual virtual watchdog timer relationship between the entry and the manager. The manager should be properly initialized and running before either of the two public member functions of this class can be used by the synchronous tasks of the system.

The Register member function is used to initialize the virtual watchdog relationship between the entry and the manager class. The input parameters of the function define the timeout period for the entry and the reset code to be posted by the Save_Reset_Code function of the manager class should the entry timeout. Register should be called only once per entry object; there is no logic in the class to alert the user to double registration of an entry, should it occur, but double registration would invariably lead to a watchdog timer reset. The timeout period selected for an entry should allow for system overhead as would the period selected in the manager for servicing the hardware watchdog timer. For example, if the synchronous task, for which the entry is defined, executes every 20ms, a timeout period of 22ms would allot a system overhead of 2ms. Many factors influence the timeout period for a given synchronous task: system priority, criticality to the system overall, and conditions handled by the task, just to name three. For instance, low priority tasks that are not critical to the system and conditionally perform activities that are background in nature may be assigned a timeout period two or three times their execution frequency, whereas high priority tasks critical to mission operation may be permitted a 10% overrun for their timeout period.
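
The two sizing rules mentioned here (a percentage overrun for critical tasks, a whole multiple of the task period for background tasks) reduce to simple integer arithmetic. The helper names below are hypothetical and the rounding policy is an assumption; the sketch only reproduces the figures quoted in the text.

```cpp
#include <cassert>
#include <cstdint>

using Milliseconds = std::uint32_t;  // assumed underlying type

// Timeout for a critical task: the task period plus a percentage overrun,
// with the overrun rounded up to a whole millisecond.
Milliseconds timeout_with_overrun(Milliseconds task_period,
                                  unsigned overrun_percent)
{
    assert(task_period > 0);
    return task_period + (task_period * overrun_percent + 99u) / 100u;
}

// Timeout for a non-critical background task: a whole multiple of its
// execution period.
Milliseconds timeout_as_multiple(Milliseconds task_period, unsigned multiple)
{
    assert(multiple >= 1);
    return task_period * multiple;
}
```

For the 20ms task in the example, timeout_with_overrun(20, 10) yields the 22ms timeout quoted above, while a 100ms background task given three periods of slack would register with timeout_as_multiple(100, 3), or 300ms.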

The Ping member function is used to service the virtual watchdog timer established by the Register function of the entry (therefore, an entry should not ping before it is registered). Ping must be called at or before the defined timeout period relative to either the initial entry registration or the previous entry ping. If the task for which the entry is defined fails to ping the manager within the defined timeout period of the entry, the manager will detect the timeout condition and terminate further servicing of the watchdog.

The interaction between the entry and the manager pivots on the maintenance of the data structure for each entry hidden within the manager class. The data structure consists of three fields: reset_code, period, and last_reported_time. When the entry is registered, the reset_code and period for the entry are defined in the data structure as described in the explanation of the Register member function. The Ping member function uses the Current_Time function provided to Watchdog_Manager::Initialize to update the last_reported_time field of the data structure. When the RTOS next resumes the Watchdog_Manager::Run task to service the physical watchdog timer, the data structure of each entry monitored by the manager is examined to determine whether or not the last ping occurred on time. The last_reported_time value for the entry is subtracted from the current system time. If the time difference exceeds the entry's period field value, then Watchdog_Manager::Terminate is called with the value of the entry's reset_code field; the code is saved and the servicing of the hardware watchdog timer is permanently inhibited.
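
The friend relationship and the three-field record can be sketched as follows. This is a simplified illustration under stated assumptions, not the published source: the simulated clock sim_now, the Record struct, and the public Find_Timeout helper (standing in for the scan performed inside Run) are hypothetical, and the record is stamped with the current time at registration so that the first ping is measured from registration.

```cpp
#include <cassert>
#include <cstdint>

using Milliseconds = std::uint32_t;  // assumed underlying type
using Unsigned_32  = std::uint32_t;  // assumed underlying type

static Milliseconds sim_clock_ms = 0;  // hypothetical simulated clock
Milliseconds sim_now() { return sim_clock_ms; }

class Watchdog_Manager {
public:
    // Hypothetical helper standing in for the scan inside Run: reports the
    // reset code of the first entry found past its period.
    static bool Find_Timeout(Unsigned_32& code_out) {
        const Milliseconds now = sim_now();
        for (int i = 0; i < count_; ++i) {
            if (now - table_[i].last_reported_time > table_[i].period) {
                code_out = table_[i].reset_code;
                return true;
            }
        }
        return false;
    }
private:
    friend class Watchdog_Entry;  // only entries may register or ping
    struct Record {
        Unsigned_32  reset_code;
        Milliseconds period;
        Milliseconds last_reported_time;
    };
    static constexpr int MAX_ENTRIES = 64;
    static Record table_[MAX_ENTRIES];
    static int    count_;

    // Returns the entry identifier used for later pings.
    static int Register(Unsigned_32 reset_code, Milliseconds period) {
        assert(count_ < MAX_ENTRIES);
        table_[count_] = Record{reset_code, period, sim_now()};
        return count_++;
    }
    static void Ping(int id) { table_[id].last_reported_time = sim_now(); }
};
Watchdog_Manager::Record Watchdog_Manager::table_[Watchdog_Manager::MAX_ENTRIES];
int Watchdog_Manager::count_ = 0;

class Watchdog_Entry {
public:
    void Register(Unsigned_32 reset_code, Milliseconds period) {
        id_ = Watchdog_Manager::Register(reset_code, period);
    }
    void Ping() {
        assert(id_ >= 0);  // an entry must not ping before it is registered
        Watchdog_Manager::Ping(id_);
    }
private:
    int id_ = -1;  // identifier hidden inside the object
};
```

Because Register and Ping are private to the manager, application code has no way to touch the table except through a Watchdog_Entry object, which is the protection the design aims for.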

Design considerations and caveats

Although the manager could have been implemented to support an almost limitless number of watchdog entry objects by using a dynamic allocation scheme, this implementation uses an internal array to statically limit the number of watchdog entry objects in the system to 64. The justification for static allocation by the compiler is that some embedded system designers frown on the use of dynamic allocation due to the overhead associated with manipulating the heap for memory assignment at runtime. The number of monitored entries was limited to 64 because we felt that an allowance of 64 synchronous tasks in a system is quite liberal. Also, there is a point at which the overhead associated with the manager checking system conditions at highest system priority per service interval would likely impede the system it was designed to support. For example, if there were 250 synchronous tasks with corresponding watchdog entry objects, and the timeout test performed by the manager for each entry required 4µs, the overall check to detect an entry timeout would take 1ms. If the service interval for the watchdog timer were 10ms, the manager task would consume 10% of the processor throughput, which is quite significant for a supporting system component.
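
The figures in this example are consistent with a per-entry test cost of 4 microseconds: 250 entries at 4µs each take 1ms, and 1ms out of every 10ms service interval is 10% of the processor. A small sketch of that arithmetic follows; the function names are hypothetical, and the per-entry cost is an illustrative input rather than a measured value.

```cpp
#include <cassert>
#include <cstdint>

// Time the manager spends scanning all entries on one pass, in microseconds.
constexpr std::uint32_t scan_time_us(std::uint32_t entries,
                                     std::uint32_t per_entry_us)
{
    return entries * per_entry_us;
}

// Share of processor time consumed by the scan, as a percentage of the
// watchdog service interval.
double cpu_load_percent(std::uint32_t scan_us, std::uint32_t interval_ms)
{
    assert(interval_ms > 0);
    return 100.0 * scan_us / (interval_ms * 1000.0);
}

// The example from the text: 250 entries at 4 microseconds each is 1 ms.
static_assert(scan_time_us(250, 4) == 1000, "250 x 4us should be 1ms");
```

With the same assumed per-entry cost, the 64-entry limit chosen for this implementation keeps the scan to scan_time_us(64, 4), or 256µs per pass.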

To maximize efficiency, Watchdog_Manager::Run is the only place in the design where timeouts of registered watchdog entries are detected. However, one known circumstance exists in which this design would not detect a timeout by a watchdog entry object. Suppose the watchdog service interval of the manager exceeds a defined timeout period of a registered watchdog entry by a factor greater than two. At time zero, the watchdog manager has just successfully serviced the watchdog timer by executing the Run task. The entry in question could fail to ping within its registered timeout period initially, then recover in its next task run and ping within its registered timeout period, before the Run task executes to service the watchdog. Because the internal data structure only stores the most recent ping time, the manager would not detect the timeout by the entry (see Figure 2). If such an entry reports late habitually, due to task dynamics, over time it is likely that the timeout will be detected and the manager will terminate servicing of the timer, inducing a watchdog reset. Detection is likely, but not certain.
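
The blind spot can be reproduced with a few numbers. The sketch below is a hypothetical host-side model, not part of the component: it compares what actually happened (the gaps between consecutive pings) with what the manager can see at its next pass (only the most recent ping). The ping list is assumed to be sorted and to begin with the registration time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Milliseconds = std::uint32_t;  // assumed underlying type

// Ground truth: did any gap between consecutive pings exceed the timeout?
bool entry_was_late(const std::vector<Milliseconds>& pings,
                    Milliseconds timeout)
{
    for (std::size_t i = 1; i < pings.size(); ++i) {
        if (pings[i] - pings[i - 1] > timeout) {
            return true;
        }
    }
    return false;
}

// The manager's view at check_time: only the most recent ping is stored,
// so earlier gaps are invisible to it.
bool manager_sees_timeout(const std::vector<Milliseconds>& pings,
                          Milliseconds check_time,
                          Milliseconds timeout)
{
    assert(!pings.empty() && pings.front() <= check_time);
    Milliseconds last = pings.front();
    for (Milliseconds t : pings) {
        if (t <= check_time) {
            last = t;  // later pings overwrite earlier ones
        }
    }
    return check_time - last > timeout;
}
```

With a 22ms entry timeout and a manager pass at 60ms, pings at 0, 35, and 55ms contain a 35ms gap, yet the manager sees only the ping at 55ms and concludes that all is well. Had the entry not recovered at 55ms, the pass at 60ms would have caught the late ping.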

However, if it is essential that all entry timeouts in the system be detected, the Ping member functionality could be expanded to test for a timeout condition and invoke Terminate whenever an entry exceeds the timeout period. The tradeoff is additional overhead in the Ping function. Adding this capability to Ping cannot replace the check by Run. If the timeout check were only present in Ping and a monitored synchronous task failed altogether due to a system exception, the timeout would go completely undetected. Minimally, timeout conditions must be detected by Run, and to enhance detection capabilities under some circumstances, Ping could be updated to include an additional check. Perhaps a nice compromise would be to add a member function that pings with a timeout check for those tasks which are susceptible to this condition. The remainder of the tasks in the system could use the regular Ping function, which has minimal system overhead.
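
A checked ping of this kind might look like the sketch below. It is a suggestion under assumptions, not part of the published component: ping_with_check, the free-standing Record, and the terminate stub are hypothetical names, and the current time is passed in explicitly to keep the example self-contained.

```cpp
#include <cassert>
#include <cstdint>

using Milliseconds = std::uint32_t;  // assumed underlying type
using Unsigned_32  = std::uint32_t;  // assumed underlying type

// Hypothetical copy of the three-field record kept per entry.
struct Record {
    Unsigned_32  reset_code;
    Milliseconds period;
    Milliseconds last_reported_time;
};

// Minimal stand-in for Watchdog_Manager::Terminate.
static bool        terminate_requested = false;
static Unsigned_32 saved_reset_code    = 0xFFFFFFFFu;  // reserved default

void terminate(Unsigned_32 code)
{
    terminate_requested = true;
    saved_reset_code    = code;
}

// Regular ping: record the time and nothing more (minimal overhead).
void ping(Record& r, Milliseconds now)
{
    assert(now >= r.last_reported_time);  // no clock rollover handling
    r.last_reported_time = now;
}

// Checked ping: detect a late report at the moment it arrives, before the
// evidence is overwritten by the new timestamp.
void ping_with_check(Record& r, Milliseconds now)
{
    assert(now >= r.last_reported_time);  // no clock rollover handling
    if (now - r.last_reported_time > r.period) {
        terminate(r.reset_code);
    }
    r.last_reported_time = now;
}
```

The extra cost is one subtraction and one comparison per ping. The scan in Run is still required, because a task that stops running never calls either function.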

Originally, Ping provided an input parameter to change the reset code defined at registration for the entry, if desired. The thought behind the design was to allow a user of the component to define a timeout period for an entry below the time of the synchronous task. This definition may be desirable for designs that incorporate only two synchronous tasks in the system for foreground and background processing. The foreground processing performs each mission critical activity. In this scenario, it may be very beneficial to the system designers to determine whether or not each activity completed in the time allotted. If there are five activities in the foreground processing and the foreground processing runs every 50ms, each activity could be allotted 10ms to complete, and then Ping the manager through the entry, each with a different reset code. With a timeout period for the entry of 12ms, any activity that did not complete could be tracked using the unique reset code. Ultimately, we decided to remove this capability because it was provided only for a very specific design, and the overhead associated with providing a different reset code per Ping did not merit its inclusion in the final draft.

Another consideration in the design of this component which did not make the final draft is the implementation of a Cancel member function to remove a registered entry from the monitor list. This removal could be accomplished, but there were several deterrents in the implementation tested and no strong opinion of added benefit to the component. Other member functions that are beyond the scope of this component, but could be encapsulated within the design, are functions to reset and/or halt the processor directly. As is, the component provides an indirect means of resetting the processor through a watchdog timer reset. This reset's drawback is that it is not immediate. If there is a need for immediate reset of the processor under certain circumstances, and the target processor provides a software switch to reset the processor, a Reset member function could be added to the component, and the software function that performs the software processor reset could be passed into the component, just as the Service_HW_Watchdog function is during initialization. Likewise, if there is a need to halt the processor altogether, a similar alteration to the component could be incorporated.

And it's a wrap...

Although every attempt was made to provide a general-purpose component, every reader will likely define "general purpose" differently. The goal was to isolate those factors that are specific to the target processor as the initialization parameters of the static Watchdog_Manager class. Given the differences in processors and RTOS packages available, this is not a "plug-and-play" solution, but it should be very close. The software files that comprise the classes described are available, along with the Borland test case used to test the behavior of the component, from ftp://ftp.embedded.com/pub/1997/lantrip.txt. Reader feedback is welcome; a great sense of accomplishment will be derived from knowing that this solution minimized development time with reliable and efficient results.

Dale Lantrip is a senior software engineer for Superior Software Systems, a consulting firm centered in Indianapolis, IN. Dale can be reached at dlantrip@atd.gmeds.com.

Larry Bruner is a senior software engineer for Superior Software Systems of Indianapolis, IN. Larry can be reached at lbruner@superior-sw.com.
