General Purpose Watchdog Timer Component for a Multitasking System
by Dale Lantrip and Larry Bruner
The software service method and service
interval for the watchdog timer vary from
processor to processor. Here are two C++
classes that make up a general purpose component
to use the watchdog timer.
Several popular microprocessors
contain watchdog
timers that must be serviced
regularly to preclude
resets by hardware circuitry
in the processors. John Santic's article,
"Watchdog Timer Techniques" (ESP,
April 1995, pp. 58-69) addressed various
hardware implementations of
watchdog timers. As noted by the
author, the software service method
and service interval for the timer vary
from processor to processor. The software
system is responsible for servicing
the watchdog at the interval, like a
heartbeat, to keep the system alive.
The system can increase the pulse of
the heartbeat by servicing the watchdog
several times within the span of
the service interval, but if the software
fails to service the watchdog for a span
of time greater than the service interval,
hardware circuits will reset the
processor. Triggering a reset implies
the system is unreliable: in an
unknown state, caught in a loop with
interrupts masked, or busy handling a
barrage of constant interrupts. As a last
resort to handle an unpredictable situation,
the processor resets the software
system using the watchdog timer
mechanism.
This article presents a portable software
component design comprising
two C++ classes that use the watchdog
timer as a system health monitor. The
component has been designed for a
software system comprising multiple
tasks. C++ makes no provision in the
language for tasking; therefore, an
executive or real-time operating system
(RTOS) is necessary to schedule,
dispatch, and prioritize system tasks
before this solution can be fully implemented
in an embedded system. For
the purpose of this article, the software
was developed using Borland C++ on a
PC, and the thread classes provided by
Borland (Win32) were used to simulate
an RTOS multitasking environment
on an embedded system target.
Central to the design is the
Watchdog_Manager class, which is
responsible for monitoring registered
synchronous tasks and the state of an
"active" flag to consistently service or
terminate service of the watchdog
timer. Task registration is performed
by declaring an object of the
Watchdog_Entry class and using the
Register member function of the class
to define a relationship between the
manager and the registered entry. The
manager class continuously monitors
the activity of all registered entry class
objects. The definition of a synchronous
task, in this context, is any task
that executes in the system at a consistent
periodic frequency. Contained
within the watchdog manager class is a
synchronous task which executes at an
interval just under the service interval for the watchdog timer. An asynchronous
task would be any task that is not
synchronous within the system. Tasks
which execute as a result of events in
the system other than time can be considered
asynchronous. For example, an
I/O routine in a system that polls every
30ms for incoming messages would be
considered synchronous, while an I/O
handler that executes whenever an
applicable interrupt is received would
be considered asynchronous.
Often a visual aid is more effective
than mere words to communicate the
design of a software component.
Because this component focuses primarily
on the tasks of the system in
which it executes, the relationship
among those tasks is depicted in Figure
1. Following successful initialization of
the component, each synchronous task
monitored by the component is
required to update a task entry list
(internal to the manager) via the
Watchdog_Entry::Ping member function
within the period defined for the task
during entry registration (the period
and reset code for each registered task
are statically maintained within the internal
task list). Any task in the system
may directly inhibit the servicing of the
hardware watchdog by the manager
using the Watchdog_Manager::Terminate
member function; for asynchronous
tasks, using the terminate function is
the only provision available. If the manager task is resumed by
the RTOS within the service interval,
the manager will read through the task
entry list and examine the termination
status. If termination has not been
requested and none of the entries in the
task list have exceeded their defined
periods, the manager will ping the
hardware watchdog timer. Otherwise,
the manager will save the reset code
elected by either the task that requested
termination or the task entry that did
not ping before its defined period was
exceeded.
The Static Watchdog Manager Class
Multiple manager objects with
the ability to service the
watchdog timer in tandem
are not desirable in this component. To
enforce a single Watchdog_Manager class
per processor, the manager has been
defined as a static class. The task
defined by the Watchdog_Manager class
has the sole responsibility of servicing
the watchdog timer at the specified
interval. Because the system will perish
if the watchdog timer is not serviced
by the expected interval, the task
associated with the Watchdog_Manager
class must run quickly with minimal
interference. Logically, the task which
invokes the Watchdog_Manager::Run
member function to service the watchdog
timer must be assigned the highest
priority provided by the RTOS.
Any synchronous task in the system
to be monitored by the Watchdog_Manager
must register by declaring an object of
the Watchdog_Entry class. This situation
establishes an indirect relationship
between the manager and the monitored
synchronous task. The manager does
not actually monitor synchronous tasks
per se; it reads an internal data structure
updated by objects of the Watchdog_Entry
class. With this design, any synchronous
task in the system can establish a
virtual watchdog timer through its corresponding
Watchdog_Entry class object.
A registered synchronous task services
this virtual watchdog with the Ping
member function prior to the expiration
of its registered timeout period (this
timeout period is essentially a service
interval). Failure of an entry to ping
within its registered timeout period has a domino effect on the system. The
manager will detect the entry timeout
and will react by terminating any subsequent
servicing of the actual hardware
watchdog timer.
A more direct interface to the
Watchdog_Manager is provided for asynchronous
tasks. The Terminate method
is provided as a mechanism by which
any task may halt servicing of the hardware watchdog timer by the manager.
To provide a direct mechanism by
which an asynchronous task may
inform the manager of the need to reset
the processor, the Terminate member
function is public. Functionally,
Terminate is also used to inhibit service
of the timer due to a Watchdog_Entry
timeout. However, because Terminate
is a public member function of the
Watchdog_Manager class, the method can
be used directly by synchronous and
asynchronous tasks alike.
LISTING 1
Watchdog_Manager public member
function prototypes.
static void
Initialize
(Milliseconds period,
Ping_HW_Watchdog_Function
service_HW_watchdog,
Save_Reset_Code_Function
save_reset_code,
Time_Function current_time,
Delay_Function delay);
// Called at system init,
// before Watchdog_Manager::Run.
static void
Terminate
(Unsigned_32 reset_code);
// Called by asynchronous reset events.
// Will inhibit service of the watchdog
// thereby inducing a reset.
static void
Run
(void);
// Function that executes every Period
// msecs checking the registration list
// of synchronous tasks. Will induce a
// reset if any of the tasks fail to
// report an iteration within its
// "registered" period or if a
// Terminate is requested.
// NOTE: Intended to be called from a
// thread or task, sole purpose is to
// service the watchdog, i.e.
// no "return" from this function!
Three public member functions are
provided by the Watchdog_Manager: Initialize, Run, and Terminate (see
Listing 1). Because there are no corresponding
objects of this static class,
Initialize should be called at system
construction to initialize the manager
class with system-specific items:
- period: This is the interval at which the hardware watchdog timer should be serviced, in milliseconds. As a guideline, period should be less than the actual service interval provided by the processor to allot for system overhead (such as task context switching, system clock tick resolution, and so on). As an example, if the documentation for the processor states an expected service interval of 62.5ms, selecting a period of 60ms would yield an overhead margin of 2.5ms. A value of zero for this initialization item effectively deactivates the class to provide an "off" switch for servicing the watchdog timer. Many processors have programmable watchdog timers which may be disabled entirely during software development. This software mechanism of the component is provided to conform with the hardware disable mechanism
- service_HW_watchdog: This is the address of the function executed by the Watchdog_Manager class to service, or ping, the hardware watchdog timer. Service methods vary from processor to processor. To ensure reusability among these processors, the manager must be provided with a function to service the watchdog specific to the processor on which the software is installed and running. This function should require no input parameters at invocation and should not return a value as a result of execution
- save_reset_code: This is the address of a function, which when executed, will save the single input parameter (an unsigned 32-bit number) for retrieval after a reset resulting from failure to ping the watchdog timer. The purpose of this function is to provide the user of the component a means of saving information for retrieval after a watchdog timer reset on the processor. The reset code is an unsigned 32-bit value, and is available for use by the user of the component to quickly determine the cause of a failure. With the exception of the hexadecimal value FFFF_FFFF, the reset code value range is available to the user of the component to convey relevant information following a watchdog reset (such as the unique identifier of the task that prompted the watchdog timer reset). Because the manner in which this information may be saved or retrieved is specific to the processor and the overall system design, this input function provides a means through which the system software designer can customize the data feedback method. The input function should not return a value as a result of execution
- current_time: This is the address of a function which, when called, returns the current system time in milliseconds since system initialization. Because processors and clocks vary in interpretation of time, and all the time periods used by this component are expressed in terms of milliseconds, this function provides the component with a consistent method of time comparison. The time comparison algorithm internal to the component has no provision for a clock rollover; for peak reliability and exhaustive usage, the current time function should return a value of zero if called at the exact moment of system initialization. This function should require no input parameters at invocation and should return the number of milliseconds that have elapsed since system initialization
- delay: This is the address of a function which, when called with a single input parameter of x milliseconds, will resume execution of the statement following the call to this function no longer than x milliseconds after the function invocation. Furthermore, as a result of executing this function, the calling task must relinquish control of the processor to the RTOS. The desired effect is that the task in which this function is called would not be a candidate for continued execution until the delay period expires. In addition, if the calling task is assigned the highest system priority, the desired result is immediate reinstatement of the calling task after the delay period expires, interrupting any lower priority task in progress. This function allows the manager task, primarily responsible for servicing the hardware watchdog timer based on system conditions, to minimally occupy the CPU at highest system priority at each service interval
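As a rough illustration of the five initialization items, the sketch below declares function-pointer types under the names used in Listing 1 and pairs them with host-side stand-ins. The type definitions, the `host_*` functions, the simulated clock, and the `Watchdog_Config` bundle are all assumptions made for illustration; the article's source files may define these differently.

```cpp
#include <cstdint>

// Assumed definitions; Listing 1 names these types but does not show them.
using Milliseconds = std::uint32_t;
using Unsigned_32  = std::uint32_t;
using Ping_HW_Watchdog_Function = void (*)();
using Save_Reset_Code_Function  = void (*)(Unsigned_32);
using Time_Function             = Milliseconds (*)();
using Delay_Function            = void (*)(Milliseconds);

// Hypothetical host-side state standing in for target hardware.
static unsigned     hw_ping_count   = 0;
static Unsigned_32  saved_code      = 0;
static Milliseconds simulated_clock = 0;   // zero at "system initialization"

void host_service_watchdog()             { ++hw_ping_count; }
void host_save_reset_code(Unsigned_32 c) { saved_code = c; }
Milliseconds host_current_time()         { return simulated_clock; }
// A real RTOS delay would yield the CPU; here the clock simply advances.
void host_delay(Milliseconds ms)         { simulated_clock += ms; }

// Illustrative bundle of the five items handed to Initialize.
struct Watchdog_Config {
    Milliseconds              period;
    Ping_HW_Watchdog_Function service_HW_watchdog;
    Save_Reset_Code_Function  save_reset_code;
    Time_Function             current_time;
    Delay_Function            delay;
};

// A 60 ms period against a 62.5 ms hardware interval leaves 2.5 ms of margin.
Watchdog_Config make_host_config()
{
    return {60, host_service_watchdog, host_save_reset_code,
            host_current_time, host_delay};
}
```

On a real target, each stand-in would be replaced by the processor-specific or RTOS-specific routine, and a period of zero would serve as the off switch described above.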
The Initialize member function
does not initialize the hardware watchdog
timer (if that is required by the target
processor). Often a system is
designed to perform a variety of functions
at initialization before loading
and/or passing control to the application
software. These functions may
include hardware checks such as testing
RAM to ensure reliability; downloading
the application code from
ROM to RAM; defining task control
blocks for the RTOS; and initializing
hardware modules. In a C++-based
system, this initialization includes system
level object construction, static
class initialization, task creation, and
definition of system constants. Any
initialization of the watchdog timer for
the processor should be done during
construction. If the system startup
exceeds the service interval for the
watchdog timer, the initialization logic
is responsible for providing the necessary
watchdog timer service until the
RTOS assumes control of the processor
to dispatch application tasks. The
Initialize member function of the static
Watchdog_Manager class is executed
at system startup to define parameters
that are specific to the target system, so
the class can properly service the
watchdog on behalf of the application
following system startup.
The Terminate member function can
be called at any time by any task in the
system to block subsequent servicing
of the watchdog timer by the
Watchdog_Manager, forcing a watchdog
timer reset by the processor. The function accepts as the single input parameter
a reset code, which is an unsigned
32-bit value ranging from zero through
FFFF_FFFE. The reset code is defined as
an input parameter to several member
functions of this component. Although
the reset code is not essential to the
operation of this component, there is a
historical aspect to the code's origin in
this software. On several occasions
during the field test of a system in
which the predecessor of this component
was used, resets occurred due to
watchdog timeouts. On those occasions,
it was difficult to determine the
task responsible for the reset and isolate
the cause of the system failure.
Thus, the component was supplemented
with the Save_Reset_Code function
(executed by Terminate) and reset codes
to provide the ability to quickly identify
the cause of a watchdog reset. The
reset codes used can be any value in
the 32-bit range (other than the
reserved value FFFF_FFFF) that will
uniquely identify the cause of a reset
condition. Examples of reset codes are
the task entry point of registered synchronous
tasks, or a discrete value for
each watchdog entry and termination
condition in the system. Upon initialization
of the component, the
Save_Reset_Code function is used to
apply the reserved value FFFF_FFFF by
default.
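The reset code handling described above might look something like the following sketch. `Reset_Latch`, `RESERVED_RESET_CODE`, and the `retained_word` storage are hypothetical names introduced here, not part of the published component; the sketch only assumes that initialization stores the reserved value and that each Terminate call stores its code and stops further servicing.

```cpp
#include <cstdint>

using Unsigned_32 = std::uint32_t;                       // assumed definition
using Save_Reset_Code_Function = void (*)(Unsigned_32);  // assumed definition

constexpr Unsigned_32 RESERVED_RESET_CODE = 0xFFFFFFFFu; // default named in the text

// Hypothetical stand-in for target storage that survives a reset.
static Unsigned_32 retained_word = 0;
void save_to_retained_word(Unsigned_32 code) { retained_word = code; }

class Reset_Latch {
public:
    // Initialization applies the reserved value by default.
    explicit Reset_Latch(Save_Reset_Code_Function save)
        : save_(save), active_(true)
    {
        save_(RESERVED_RESET_CODE);
    }

    // Effect attributed to Terminate: save the code, stop servicing.
    void terminate(Unsigned_32 reset_code)
    {
        save_(reset_code);
        active_ = false;
    }

    bool active() const { return active_; }

private:
    Save_Reset_Code_Function save_;
    bool active_;
};
```

If the processor locks up before any Terminate call, the stored value is still the reserved default, which is how a lock-up can be told apart from a requested reset.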
The Run member function should be
executed immediately after the
Initialize member function. Run
should be executed from the highest
priority task assigned by the RTOS in
the system. The Run function contains
the logic that checks current system
conditions to determine whether or not
the hardware watchdog service function
should be executed. The Run function
is the heart of the Watchdog_Manager
and ultimately decides "to be, or not to
be" for the system. This routine will
continuously ping the hardware watchdog
timer at the defined service interval
unless:
- The defined service interval (period) is less than or equal to zero (providing the developer an off switch for the component when used on processors that permit the hardware watchdog timer to be disabled)
- A reset has been triggered via the Terminate member function or the detection of the timeout of a registered watchdog entry
- The RTOS is unable to resume execution of the Watchdog_Manager::Run task after a call to the delay function (usually due to a processor lockup)
In the event of a reset condition, the
Save_Reset_Code function will be executed
to save the code presented by the
last execution of Terminate or the first
watchdog entry timeout detected. If
there is a processor lock-up, the reset
code defaults to the reserved FFFF_FFFF
value.
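The conditions listed above can be condensed into a small decision function, sketched here under stated assumptions. `Action`, `next_action`, and the two boolean inputs are hypothetical names; `Milliseconds` is assumed to be unsigned, so "less than or equal to zero" reduces to a test for zero. The third condition needs no code at all: if the RTOS never resumes the task, nothing is called and the hardware timer expires on its own.

```cpp
#include <cstdint>

using Milliseconds = std::uint32_t;   // assumed unsigned representation

enum class Action { Off, Service, Withhold };

// Decide what one wake-up of the manager task should do.
// `active` is cleared by a Terminate request; `entry_late` is the
// result of scanning the registered entries (both names hypothetical).
Action next_action(Milliseconds period, bool active, bool entry_late)
{
    if (period == 0) {
        return Action::Off;        // component switched off by the developer
    }
    if (!active || entry_late) {
        return Action::Withhold;   // stop pinging; hardware will reset
    }
    return Action::Service;        // ping the hardware watchdog timer
}
```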
There are two private member functions
of the Watchdog_Manager: Register
and Ping. These private functions are
visible only to the friend class
Watchdog_Entry. The Watchdog_Entry
class has public member functions by
the same name, which in turn invoke
the corresponding Watchdog_Manager
private member functions. These public
member functions, which provide
an indirect interface between synchronous
tasks and the manager class, are
the next topics of discussion.
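A minimal sketch of this friend arrangement follows. The class and member function names come from the article, but the private signatures, the integer identifier, the counters, and the guard against pinging before registration are assumptions made only to show the access pattern.

```cpp
#include <cstdint>

using Milliseconds = std::uint32_t;   // assumed definition
using Unsigned_32  = std::uint32_t;   // assumed definition

class Watchdog_Manager {
    friend class Watchdog_Entry;      // only entries reach the private interface
public:
    static int Registered() { return registered_; }   // illustrative accessors
    static int Pings()      { return pings_; }
private:
    static int  Register(Unsigned_32 reset_code, Milliseconds period);
    static void Ping(int id);
    static int  registered_;
    static int  pings_;
};

int Watchdog_Manager::registered_ = 0;
int Watchdog_Manager::pings_      = 0;

// Placeholder bodies: hand out the next identifier and count the pings.
int  Watchdog_Manager::Register(Unsigned_32, Milliseconds) { return registered_++; }
void Watchdog_Manager::Ping(int)                           { ++pings_; }

class Watchdog_Entry {
public:
    void Register(Unsigned_32 reset_code, Milliseconds period)
    {
        id_ = Watchdog_Manager::Register(reset_code, period);
    }
    void Ping()
    {
        if (id_ >= 0) {               // assumed guard: ignore pings before registration
            Watchdog_Manager::Ping(id_);
        }
    }
private:
    int id_ = -1;                     // identifier hidden inside the object
};
```

Application code can call only the public `Watchdog_Entry` functions; a direct call to `Watchdog_Manager::Ping` from a task would not compile.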
The Watchdog Entry Class
The reason for the dual functions
of these two classes was primarily
to define a relationship of
any number of watchdog entry class
objects to a single static manager class,
which monitors each registered entry
object. This situation provides a level
of class abstraction and maintains the
integrity of the internal data structure
representing each monitored entry.
LISTING 2
Watchdog_Entry public member
function prototypes.
void
Register
(Unsigned_32 reset_code,
Milliseconds period);
//used by synchronous tasks to register
//their timeout periods, once registered
//a task must report within the period
//to prevent an induced reset.
void
Ping
(void);
//used to indicate completion of a
//synchronous task iteration, MUST be
//called every period after
//registration.
In this implementation, the data
structure for each entry is read by
Watchdog_Manager::Run to detect a timeout,
and updated by the private member
functions Register and Ping, which
can only be relayed by the corresponding
public member functions of the
Watchdog_Entry object (see Listing 2).
Protection is also provided for the
unique entry identifier of each registered
Watchdog_Entry object monitored
by the manager. The identifier is
assigned to the object by the manager
through registration and used subsequently
by the object to ping the manager.
The entry identifier is hidden
within the class object with no direct
external access provisions to protect it
from accidental corruption by the system.
The primary purpose of this class
is to provide a direct mechanism for
synchronous tasks to be monitored by
the manager, establishing the conceptual
virtual watchdog timer relationship
between the entry and the manager.
The manager should be properly initialized
and running before either of
the two public member functions of
this class can be used by the synchronous
tasks of the system.
The Register member function is
used to initialize the virtual watchdog
relationship between the entry and the
manager class. The input parameters of
the function define the timeout period
for the entry and the reset code to be
posted by the Save_Reset_Code function
of the manager class should the entry
time out. Register should be called only
once per entry object; there is no logic
in the class to alert the user to double
registration of an entry, should it
occur, but double registration would
invariably lead to a watchdog timer
reset. The timeout period selected for
an entry should allow for system overhead
as would the period selected in
the manager for servicing the hardware
watchdog timer. For example, if the
synchronous task, for which the entry
is defined, executes every 20ms, a
timeout period of 22ms would allot a
system overhead of 2ms. Many factors
influence the timeout period for a
given synchronous task: system priority,
criticality to the system overall, and
conditions handled by the task, just to
name three. For instance, low priority
tasks that are not critical to the system
and conditionally perform activities
that are background in nature may be
assigned a timeout period two or three
times their execution frequency,
whereas high priority tasks critical to
mission operation may be permitted a
10% overrun for their timeout period.
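The timeout arithmetic in these examples can be captured in a one-line helper. `timeout_with_margin` is a hypothetical name, not part of the component, and the sketch assumes integer milliseconds, so any fractional allowance is truncated.

```cpp
#include <cstdint>

using Milliseconds = std::uint32_t;   // assumed definition

// Timeout = task period plus a percentage allowance for system overhead.
constexpr Milliseconds timeout_with_margin(Milliseconds task_period,
                                           unsigned margin_percent)
{
    return task_period + (task_period * margin_percent) / 100;
}

// A 20 ms task with a 10% overrun allowance gets a 22 ms timeout.
static_assert(timeout_with_margin(20, 10) == 22, "critical task example");

// A background task allowed three times its period: 200% on top of 20 ms.
static_assert(timeout_with_margin(20, 200) == 60, "background task example");
```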
The Ping member function is used to
service the virtual watchdog timer
established by the Register function of
the entry (therefore, an entry should
not ping before it is registered). Ping
must be called at or before the defined
timeout period relative to either the initial
entry registration or the previous
entry ping. If the task for which the
entry is defined fails to ping the manager
within the defined timeout period
of the entry, the manager will detect
the timeout condition and terminate
further servicing of the watchdog.
The interaction between the entry
and the manager pivots on the maintenance
of the data structure for each
entry hidden within the manager class.
The data structure consists of three
fields: reset_code, period, and
last_reported_time. When the entry is
registered, the reset_code and period
for the entry are defined in the data
structure as described in the explanation
of the Register member function.
The Ping member function uses the
Current_Time function provided to
Watchdog_Manager::Initialize to update
the last_reported_time field of the data
structure. When the RTOS next
resumes the Watchdog_Manager::Run task
to service the physical watchdog timer,
the data structure of each entry monitored
by the manager is examined to
determine whether or not the last ping
occurred on time. The
last_reported_time value for the entry
is subtracted from the current system
time. If the time difference exceeds the
entry's period field value, then
Watchdog_Manager::Terminate is called
with the value of the entry's reset_code
field; the code is saved and the servicing
of the hardware watchdog timer is
permanently inhibited.
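The record and the comparison just described could be sketched as follows. The three field names are taken from the text; their types and the free functions `register_entry`, `ping`, and `timed_out` are assumptions used here in place of the private member functions, whose real signatures are not shown in the article.

```cpp
#include <cstdint>

using Milliseconds = std::uint32_t;   // assumed definition
using Unsigned_32  = std::uint32_t;   // assumed definition

// The three fields named in the text.
struct Entry_Record {
    Unsigned_32  reset_code;
    Milliseconds period;
    Milliseconds last_reported_time;
};

// Registration defines the code and period and starts the timeout window.
Entry_Record register_entry(Unsigned_32 code, Milliseconds period,
                            Milliseconds now)
{
    return {code, period, now};
}

// A ping only refreshes the timestamp.
void ping(Entry_Record& e, Milliseconds now)
{
    e.last_reported_time = now;
}

// The manager's test: time since the last report compared with the period.
bool timed_out(const Entry_Record& e, Milliseconds now)
{
    return (now - e.last_reported_time) > e.period;
}
```

When `timed_out` returns true, the manager would pass the entry's `reset_code` to Terminate.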
Design considerations and caveats
Although the manager could
have been implemented to support
an almost limitless number
of watchdog entry objects by using
a dynamic allocation scheme, this
implementation uses an internal array
to statically limit the number of watchdog
entry objects in the system to 64.
The justification for static allocation
by the compiler is that some embedded
system designers frown on the use of
dynamic allocation due to the overhead
associated with manipulating the heap
for memory assignment at runtime.
The number of monitored entries was
limited to 64 because we felt that an
allowance of 64 synchronous tasks in a
system is quite liberal. Also, there is a
point at which the overhead associated
with the manager checking system
conditions at highest system priority
per service interval would likely
impede the system it was designed to
support. For example, if there were
250 synchronous tasks with corresponding
watchdog entry objects, and
the timeout test performed by the manager for each entry required 4µs, the
overall check to detect an entry timeout
would take 1ms. If the service interval
for the watchdog timer were 10ms,
the manager task would consume 10% of
the processor throughput, which is quite
significant for a supporting system
component.
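The overhead estimate can be checked with a few lines of integer arithmetic. The per-entry cost is taken here as 4 microseconds, the value that yields the 1ms total quoted above; the function names are hypothetical and the percentage is truncated to a whole number.

```cpp
#include <cstdint>

// Time to scan every entry once, in microseconds.
constexpr std::uint32_t scan_time_us(std::uint32_t entries,
                                     std::uint32_t per_entry_us)
{
    return entries * per_entry_us;
}

// Share of each service interval spent scanning, as a whole percentage.
constexpr std::uint32_t load_percent(std::uint32_t scan_us,
                                     std::uint32_t interval_ms)
{
    return (scan_us * 100) / (interval_ms * 1000);
}

// 250 entries at 4 us each take 1000 us, i.e. 1 ms per pass.
static_assert(scan_time_us(250, 4) == 1000, "scan time for 250 entries");

// 1 ms out of every 10 ms service interval is 10% of the processor.
static_assert(load_percent(scan_time_us(250, 4), 10) == 10, "load at 250 entries");

// At the 64-entry limit the same cost is 256 us, about 2.5% (truncated to 2).
static_assert(load_percent(scan_time_us(64, 4), 10) == 2, "load at 64 entries");
```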
To maximize efficiency,
Watchdog_Manager::Run has been solely
designed to detect timeouts of registered
watchdog entries. However, one
known circumstance exists in which
this design would not detect a timeout
by a watchdog entry object. Suppose
the watchdog service interval of the
manager exceeds a defined timeout
period of a registered watchdog entry
by a factor greater than two. At time
zero, the watchdog manager has just
successfully serviced the watchdog
timer by executing the Run task. The
entry in question could fail to ping
within its registered timeout period initially,
then recover in its next task run
and ping within its registered timeout
period, before the Run task executes to
service the watchdog. Because the
internal data structure only stores the
most recent ping time, the manager
would not detect the timeout by the
entry (see Figure 2). If such an entry
reports late habitually, due to task
dynamics, over time it is likely that the
timeout will be detected and the manager
will terminate servicing of the
timer, inducing a watchdog reset. This outcome is likely,
but not certain.
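The blind spot can be reproduced with a short host-side simulation. The timings in the usage note are invented for illustration and are not taken from Figure 2; `any_late_ping` and `manager_detects` are hypothetical helpers that assume the ping times are sorted and that registration happens at time zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Milliseconds = std::uint32_t;   // assumed definition

// Ground truth: did any gap between consecutive pings exceed the timeout?
bool any_late_ping(const std::vector<Milliseconds>& pings,
                   Milliseconds timeout)
{
    for (std::size_t i = 1; i < pings.size(); ++i) {
        if (pings[i] - pings[i - 1] > timeout) {
            return true;
        }
    }
    return false;
}

// What a manager sees when it keeps only the most recent ping time.
bool manager_detects(const std::vector<Milliseconds>& pings,
                     Milliseconds timeout, Milliseconds check_time)
{
    Milliseconds last = 0;            // registration assumed at time zero
    for (Milliseconds p : pings) {
        if (p <= check_time) {
            last = p;                 // later pings overwrite earlier ones
        }
    }
    return (check_time - last) > timeout;
}
```

With a 22ms entry timeout, pings at 0, 35, and 55ms, and a manager check at 60ms, `any_late_ping` reports the 35ms gap, but `manager_detects` sees only the 5ms since the last ping and services the watchdog as usual.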
However, if it is essential that all
entry timeouts in the system be detected,
the Ping member functionality
could be expanded to test for a timeout
condition and invoke Terminate whenever
an entry exceeds the timeout period.
The tradeoff is additional overhead
in the Ping function. Adding this capability
to Ping cannot replace the check
by Run. If the timeout check were only
present in Ping and a monitored synchronous
task failed altogether due to a
system exception, the timeout would
go completely undetected. Minimally,
timeout conditions must be detected by
Run, and to enhance detection capabilities under some circumstances, Ping
could be updated to include an additional
check. Perhaps a nice compromise
would be to add a member function
that pings with a timeout check for
those tasks which are susceptible to
this condition. The remainder of the
tasks in the system could use the regular
Ping function, which has minimal
system overhead.
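One way the compromise might look is sketched below: a regular ping beside a checked variant. `ping_checked` is a hypothetical name, the record layout is assumed, and the sketch returns a flag rather than calling Terminate so that it stays self-contained.

```cpp
#include <cstdint>

using Milliseconds = std::uint32_t;   // assumed definition
using Unsigned_32  = std::uint32_t;   // assumed definition

struct Entry_Record {                 // assumed layout of the hidden record
    Unsigned_32  reset_code;
    Milliseconds period;
    Milliseconds last_reported_time;
};

// Regular ping: refresh the timestamp only, minimal overhead.
void ping(Entry_Record& e, Milliseconds now)
{
    e.last_reported_time = now;
}

// Checked ping: test the elapsed time before refreshing the timestamp.
// Returns true when on time; on false the caller would invoke
// Terminate with e.reset_code.
bool ping_checked(Entry_Record& e, Milliseconds now)
{
    const bool on_time = (now - e.last_reported_time) <= e.period;
    e.last_reported_time = now;
    return on_time;
}
```

The extra comparison is paid only by tasks that opt into the checked call, and the scan in Run remains necessary for tasks that stop pinging altogether.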
Originally, Ping provided an input
parameter to change the reset code
defined at registration for the entry, if
desired. The thought behind the design
was to allow a user of the component
to define a timeout period for an entry
shorter than the period of the synchronous
task. This definition may be desirable
for designs that incorporate only two
synchronous tasks in the system for
foreground and background processing.
The foreground processing performs
each mission critical activity. In
this scenario, it may be very beneficial
to the system designers to determine
whether or not each activity completed
in the time allotted. If there are five
activities in the foreground processing
and the foreground processing runs
every 50ms, each activity could be
allotted 10ms to complete, and then
Ping the manager through the entry,
each with a different reset code. With a
timeout period for the entry of 12ms,
any activity that did not complete
could be tracked using the unique reset
code. Ultimately, we decided to
remove this capability because it was
provided only for a very specific
design, and the overhead associated
with providing a different reset code
per Ping did not merit its inclusion in
the final draft.
Another consideration in the design
of this component which did not make
the final draft is the implementation of
a Cancel member function to remove a
registered entry from the monitor list.
This removal could be accomplished,
but there were several deterrents in the
implementation tested and no strong
opinion of added benefit to the component.
Other member functions that are
beyond the scope of this component,
but could be encapsulated within the
design, are functions to reset and/or
halt the processor directly. As is, the
component provides an indirect means
of resetting the processor through a
watchdog timer reset. This reset's
drawback is that it is not immediate. If
there is a need for immediate reset of
the processor under certain circumstances,
and the target processor provides
a software switch to reset the
processor, a Reset member function
could be added to the component, and
the software function that performs the
software processor reset could be
passed into the component, just as the
Service_HW_Watchdog function is during
initialization. Likewise, if there is a
need to halt the processor altogether, a
similar alteration to the component
could be incorporated.
And it's a wrap...
Although every attempt was
made to provide a general-purpose
component, every reader
will likely define "general purpose"
differently. The goal was to isolate
those factors that are specific to the target
processor as the initialization parameters
of the static Watchdog_Manager
class. Given the differences in processors
and RTOS packages available,
this is not a "plug-and-play" solution,
but it should be very close. The software
files that comprise the classes
described are available, along with the
Borland test case used to test the
behavior of the component, from ftp://ftp.embedded.com/pub/1997/lantrip.txt.
Reader feedback is welcome; a great
sense of accomplishment will be
derived from knowing that this solution
minimized development time with
reliable and efficient results.
Dale Lantrip is a senior software engineer
for Superior Software Systems, a
consulting firm centered in
Indianapolis, IN. Dale can be reached
at dlantrip@atd.gmeds.com.
Larry Bruner is a senior software engineer
for Superior Software Systems of
Indianapolis, IN. Larry can be reached
at lbruner@superior-sw.com.