A systems approach to embedded code fault detection

In technical literature, there are many valid ways to define a system, and an embedded system in particular. For the purposes of this article we will use one of the most general and classic models of systems theory: a system is an interconnection of subunits that can be modeled by data and control inputs, a state machine, and data and control outputs (Figure 1).

Figure 1: Input-state-output system model

What turns this into an embedded system is that most of it is hidden and inaccessible, and it is often characterized by real-time constraints: a sort of black box that must react to a set of stimuli with an expected behavior, according to the end user's (customer's) perception of the system. Any deviation from this behavior is reported by the customer as a defect: a fault.

At the time of their occurrence, faults are characterized by their impact on product functionality and by the chain of events that led to their manifestation. Deciding how to handle a fault when it pops up and how to compensate for its effects is typically a static design issue, dealing with allowed tolerance levels and system functional constraints. On the other hand, collecting run-time information about the causes of the system misbehavior should be a dynamic process, as flexible and scalable as possible.

As a matter of fact, fault handling strategies are commonly defined at the system design stage. They result from balancing the drawbacks that naturally arise when a minimal degree of divergence from the expected behavior is allowed, and from sizing the threshold above which the effects have to be considered unacceptable.

When the degradation of performance resulting from the fault is such that countermeasures are needed, recovery actions have to be well defined by the specifications. In fact the occurrence of an unexpected behavior, unwanted by its nature, does not necessarily mean a complete loss of functionality; that is why establishing back-off solutions is typically part of the design process.

Depending on the system’s nature, recovery actions can be handled by implementing redundancy or by allowing temporary degradation of the service while performing corrections. On the contrary, defining strategies to collect data for debug purposes is a process that is often left out, trusting the filtering applied at test stages. This deficiency is usually due to strict timing constraints in the development phase and a lack of system resources, resulting from fitting a maximum of functions into the product. Often, when there is a need to deal with malfunctions in the field, strategies are decided and measures are set up on the fly.

One of the problems with embedded systems is that they are indeed embedded; that is, information accessibility is usually far from guaranteed. During the several test phases a product goes through, the designers/troubleshooters can usually make use of intrusive tools, like target debuggers and oscilloscopes, to isolate a fault. When the product is in service, instead, it is often impossible to use such instruments, and the available investigation tools may not be sufficient to identify the root cause of the problem within a time that is reasonable from the customer’s perspective. Moreover, establishing some sort of strict synchronization between recording instruments and internal fault detection is not always possible, with the result that data collected at the inputs/outputs cannot clearly be tied to the fault occurrence itself and has to be correlated manually.

The other problem is fault localization. As the system grows in complexity, the possible deviations from expected operation increase. This is one reason (but not the only one) why the more complex the system is, the greater the distance can be between the fault and its symptoms. Symptoms alone are not sufficient to identify the root cause of a problem. The relevant information is hidden in the inputs and in the status of the system when the fault occurred, but in most cases this information is gone forever.

Furthermore, symptoms may contribute to getting closer to the faulty area, but they can also mislead the investigation. It can happen that the evolution of a system degenerates before the user perceives it, and the symptoms noticed are only secondary effects of the uncontrolled process generated by the fault. Thus, providing some mechanism to enhance self-awareness in the system is a real benefit in the error localization process. A specific section of this paper will be dedicated to this aspect.

Summarizing, the fault handling process typically puts the accent on fault prevention (techniques to minimize the number of failures) and fault tolerance issues (how the system should react to avoid loss of performance after a failure), since these are undoubtedly the core part of the reliability of a system. The techniques to speed up troubleshooting during integration tests and maintenance phases, which constitute the core of the error detection process, are often left to a heuristic approach that the following chapters of this paper will try to systematize within an organic view.

The aim of this paper is to suggest a simple approach to the problem of fault detection and provide some hints on how to design debug features into embedded systems which have real-time constraints and suffer from a lack of memory resources. We will discuss:

  • Fault localization: which elements are necessary to isolate a fault
  • How to collect useful data from a fault
  • How to trace runtime events
  • Post mortem debug and diagnostics

Fault Localization

Error notification in an embedded system typically describes a fact. Underflow of a buffer, timeout expiration, wrong signal contents, unexpected/out-of-range data: this is the data normally associated with a fault notification. Sometimes additional information, such as a timestamp, the affected channel/task, or the software component reporting the error, is also included in the communication, and that’s it. In other words, fault notification often turns into a description of the “front end” of an unexpected behavior. From a maintenance perspective there’s a gap between such reporting and the real source of the problem: what caused the buffer underflow? Why has the timeout expired? Why did the task receive that signal with the wrong content?

The analysis might have to go back and implement trap patches in the code to catch the root cause and wait for the fault to happen again. This leads to an iterative approach, which relies on the very weak assumption that the fault is easily reproducible. Sometimes it is, sometimes it is not, and sometimes the fault happens only once every two or three days. Then you need to talk to the customer and explain that it will take some weeks to isolate the fault.

However, this scenario is not that bad. It is worse when you don’t even detect the faulty component and its behavior affects the surrounding devices. An example would be the corruption of a shared memory. Far worse is a pointer that goes out of control and writes unwanted data on memory locations that are also used by other HW/SW components. In that case, not having a fault notification in time and from the right component can slow down the investigation. Fault notification, then, presents two aspects that must be taken into account when designing embedded software:

  • the need for useful information to enhance the ability to catch the root cause at the first occurrence.
  • the capability of the system to recognize misbehaviors at an early stage, that is, mechanisms that increase the system’s self-awareness.

What does “useful information” mean? How can we increase the chances of identifying the fault at the first occurrence? As already stated, a good hint would be having an idea of what kind of solicitations were stressing our system and what our system was doing when stimulated in such a way. Or, speaking more technically, an analysis and classification of the system inputs and of their relationship with the system states.

The possible combinations of data feeding the inputs of a complex system may be infinite; nevertheless they may be grouped into classes. Each input belonging to a specific class will generally be handled by the same part of the system and will always affect it in a similar way. For instance, a voice detector will always operate in a certain way when a voice with a certain power level is recognized.

What makes the difference here are the transitions between input classes, which (could) make the system work in a different way. Above a certain power level, the voice detector will detect speech. Transitions between classes are what we are interested in, since they change the working condition of our system.

Within each class, what we can look for are errors already present at the input side, and the amount of time the system spends working on the edge, handling “limit” conditions. While input errors are easy to take care of, for the edge conditions let’s take another example: if we have a function that handles a packet interface and is synchronous in its output, we probably also have a requirement for the maximum allowed jitter between packets; let’s say we expect data every 10 ms +/- 5 ms.

We design our function to be flexible, allowing an even greater jitter, +/- 7 ms. If the transmitting side has a temporary drift problem and delays packets (one every 13 ms), after a while we will probably have buffering problems. That means we were prepared to handle statistical jitter but not a drift. One could say that the main problem is not in our system, but even in this case, providing a method to evaluate whether the jitter is truly statistical, and being prepared to notify that the expected jitter was biased, is a way to catch a persistent set of inputs that stresses our system before our buffers saturate or run dry; a minimal sketch of such a check follows.
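
As an illustration only, here is a minimal C sketch of such a persistence check on the packet inter-arrival time, assuming the nominal 10 ms period of the example; the timer source (hw_time_ms) and the notification hook (report_input_drift) are hypothetical names, not part of the original design.

#include <stdint.h>
#include <stdlib.h>

#define NOMINAL_PERIOD_MS  10
#define DRIFT_LIMIT_MS      2    /* tolerated bias of the averaged inter-arrival time */
#define DRIFT_PERSISTENCE  64    /* consecutive biased packets before notifying */

extern uint32_t hw_time_ms(void);               /* platform time source (assumed) */
extern void report_input_drift(int32_t avg_ms); /* hypothetical notification hook */

/* Call once per received packet: statistical jitter averages out, a drift does not. */
void packet_arrival_check(void)
{
    static uint32_t last_ms;
    static int32_t  avg_x16 = NOMINAL_PERIOD_MS * 16;  /* average period, fixed point x16 */
    static uint32_t biased;
    static int      first = 1;
    uint32_t now = hw_time_ms();

    if (first) { first = 0; last_ms = now; return; }
    avg_x16 += ((int32_t)(now - last_ms) * 16 - avg_x16) / 16;  /* EMA, alpha = 1/16 */
    last_ms = now;

    if (abs(avg_x16 / 16 - NOMINAL_PERIOD_MS) > DRIFT_LIMIT_MS) {
        if (++biased >= DRIFT_PERSISTENCE) {
            report_input_drift(avg_x16 / 16);   /* input period is biased, not just jittery */
            biased = 0;
        }
    } else {
        biased = 0;
    }
}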

That’s for the input side; what’s still left is the system working mode. Our complex architecture will be able to perform many tasks on many input classes, it will switch smoothly among them, and it will probably be composed of a large set of state machines. Experience teaches that you can have the best SW evaluation tools and the best practices to make your state machines fault-proof, but still, in that nasty condition, in that particular case (and, according to Murphy’s Law, only at your most important customer’s premises), the state machine will experience an unexpected behavior.

Common sense suggests two best practices: test and retest, trying all the worst conditions you can imagine; and report the recent history of your state machines when you notify a fault.

It is not always the case that a state machine model is the best way to represent system evolution. Sometimes it is easier to detect system behavior from the status of some variables. More often it would be much clearer if we had a stack dump or the program counter evolution. In the next chapter we will give more details on that; here we will stop at the consideration that it is important to know, to some extent, what the system status was and had been for a certain period before the fault. Summing up, then:

  • Classify your inputs
  • Focus on transition between classes
  • Register corrupted stimuli
  • Record edge conditions
  • Trace system evolution history

The second item to be explored in this article is the problem of detection at an early stage. This is a very tricky point, as it relies on the designer’s capability to create mechanisms that make the system check its own behavior during normal operation. Some ideas, the more general-purpose ones, will be described in this paper, but many others can be invented that better fit the application they will be used for. The basic rule is to keep MIPS consumption under control. That means fast and compact routines, and the possibility to easily activate/deactivate each early detection mechanism implemented.

In the following we will refer to an architectural scheme, with the aim of balancing the need to discuss general concepts against the need to offer concrete examples. So let’s assume we are dealing with the architecture shown in Figure 2:

Figure 2: Generic embedded system scheme

In the dashed box there’s a DSP system, supervised by a microcontroller that takes care of control functions and O&M. We will also assume an interrupt-driven IPC (Inter-Processor Communication) based on a common memory interface. Let’s also imagine that the DSPs have one or more internal memories, following classic DSP practice. With this example in mind, let’s go into the details of the so-called “self-awareness” features.

Load metering
Meeting deadlines is the basic requirement in real-time applications. There can be many deadlines, corresponding to the many tasks whose completion we have to ensure. Measuring the overall MIPS consumption, average and peak, can be considered a means to guarantee early detection of faults due to missed deadlines. Setting thresholds above which a warning should be raised is a good way to prevent dangerous situations. If we add to the warning signal some information on system history and status, we can also highlight critical paths that are passed through only due to peculiar sequences of inputs. Such metering features, used in conjunction with the logs of the execution flow mentioned above, can help characterize the areas of the application on which to focus in terms of optimization.

There are many ways to implement a MIPS load meter; here we suggest one simple method applicable to architectures in which the operating system provides an idle state. A timer can be used to sample the time value just before the idle state is entered and when it is left. Each difference (x_j) is accumulated into a total T_k until the timer expires.

The timer period can be chosen according to the scheduling scheme; for example, if the operating system is preemptive with 1 ms preemption, the timer period could be 1 ms. Having T_k = 0 does not mean the peak load is 100%, but only that during one timer period a process was running the whole time, until another process with higher priority was scheduled. In order to have a measure of the processor load it is necessary to average a sequence of T_k values. The window for the average can be chosen taking into account the number of processes in the application. In order not to consume MIPS in performing the average computation, you can use a simple moving-average filter (see the sketch below).

Every time the timer expires, the moving average A_k is updated, T_k is zeroed, and the minimum value M of A_k is tracked.

The lower the M value, the greater the peak load in the system.
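
A minimal C sketch of this load meter follows, assuming an OS that calls hooks on idle entry/exit and a free-running hardware timer (hw_timer_now); the warning threshold and the report_load_warning hook are illustrative, not taken from the article.

#include <stdint.h>

#define AVG_WINDOW              32u    /* moving-average window, tune per application */
#define IDLE_WARNING_THRESHOLD 100u    /* example value: warn when the average idle time drops below it */

extern uint32_t hw_timer_now(void);            /* free-running timer (assumed) */
extern void report_load_warning(uint32_t avg); /* hypothetical warning channel */

static volatile uint32_t idle_ticks;           /* T_k: idle time accumulated in the current period */
static uint32_t avg_idle;                      /* A_k: moving average of T_k */
static uint32_t min_avg_idle = 0xFFFFFFFFu;    /* M: minimum of A_k; low M means high peak load */
static uint32_t idle_entry;

void idle_enter_hook(void) { idle_entry = hw_timer_now(); }               /* just before idle */
void idle_exit_hook(void)  { idle_ticks += hw_timer_now() - idle_entry; } /* idle state left */

/* Periodic timer interrupt, period chosen according to the scheduling scheme. */
void load_meter_tick(void)
{
    avg_idle = avg_idle - (avg_idle / AVG_WINDOW) + (idle_ticks / AVG_WINDOW); /* moving average */
    if (avg_idle < min_avg_idle)
        min_avg_idle = avg_idle;               /* remember the busiest window */
    if (avg_idle < IDLE_WARNING_THRESHOLD)
        report_load_warning(avg_idle);         /* early warning before deadlines are missed */
    idle_ticks = 0;                            /* start accumulating the next T_k */
}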

Memory supervision
One consideration is data consistency background checks. In systems with many tasks, most of them exchanging data through messaging and processing their own data chunks locally, the number of pointers handled by the code can easily become very large. But depending on the system architecture it is not always possible to provide access supervision that verifies whether whoever is accessing a section has the right to do so, or is allowed to do so at that time. Performing periodic background checks, at least on locations containing constant sections, is an easy way to get closer to the safe side.

For example, in the architecture depicted in Figure 2, the microcontroller can easily be delegated to perform this task on the common memory, and the DSP can rely on this to always trust the consistency of the data it accesses from there. Following the same approach, whenever memory size allows, it is safe to provide guard areas filled with fixed patterns at the borders of sections containing variables. Such checks permit activation of recovery actions as soon as possible in case of memory corruption, for example diverting the processing to other resources if redundancy is implemented in the system, well before the problem escalates to a system failure.

It is usually worth partitioning the constant sections into several wide chunks and performing background checksum calculations on these macro areas. Following a similar approach on the volatile areas, it is often easy and fast to inspect them to confirm the presence of the expected fixed patterns, for example at the boundaries of allocated areas when memory is released. Such an approach can also be quite useful at the debug stage, once a memory corruption is detected.

In fact, if one of the constant regions systematically fails the check, the affected memory range can be further split into sub-sections and checksums performed on them. This configurable approach to data integrity control is effective in narrowing down the area impacted by the memory corruption, down to a range for which memory-polling strategies can be used effectively to highlight the code section causing the corruption. Figure 3 illustrates such a scalable approach, a first example of convergence between monitoring and debug features; a minimal code sketch follows the figure.

Figure 3: Scalable constant memory check
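
The background checks described above could look like the following C sketch; the region table, the additive checksum and the report_corruption hook are assumptions made for illustration.

#include <stdint.h>
#include <stddef.h>

#define GUARD_PATTERN 0xA5A5u

typedef struct {
    const uint16_t *start;      /* first word of a constant macro area */
    size_t          words;      /* size in words */
    uint16_t        reference;  /* checksum recorded at startup (or at build time) */
} const_region_t;

extern const_region_t const_regions[];               /* filled at initialization (assumed) */
extern size_t num_const_regions;
extern void report_corruption(size_t region_index);  /* hypothetical notification hook */

static uint16_t checksum16(const uint16_t *p, size_t words)
{
    uint16_t sum = 0;
    while (words--)
        sum += *p++;            /* simple additive checksum; a CRC works just as well */
    return sum;
}

/* Background task: one region per call keeps the MIPS impact small and bounded. */
void background_memory_check(void)
{
    static size_t idx;
    const const_region_t *r = &const_regions[idx];
    if (checksum16(r->start, r->words) != r->reference)
        report_corruption(idx);
    idx = (idx + 1) % num_const_regions;
}

/* Guard-word check for a variable area bounded by fixed patterns,
   e.g. invoked when a dynamic buffer is released. */
int guards_intact(const uint16_t *low_guard, const uint16_t *high_guard)
{
    return (*low_guard == GUARD_PATTERN) && (*high_guard == GUARD_PATTERN);
}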

Stack supervision
Exceeding stack boundaries is usually detected by HW-supported mechanisms that, depending on the core architecture, may result in a reset applied to the processing unit or an interrupt being flagged. If reset is supported, handling the exception is part of the post-mortem fault strategies. In the case of an interrupt being triggered, the only actions that can be taken are the ones that can be implemented within the interrupt routine, and it generally takes more than one subroutine to send an error notification externally.

We know we are already out of stack, so if we still allow our system to use the memory adjacent to the stack (if any) for the function call(s), we could overwrite and corrupt some memory area. If we are lucky, our stack error notification will reach its destination; if we are not, the error will not be notified or it will trigger some unpredictable execution flow and mislead the root cause analysis. A simple but effective approach is to leave a dummy area after the stack bottom, sized to exactly what is needed to handle the stack error, so that the notification can be sent out and proper escalation guaranteed.

In addition, handling of the stack area can benefit from filling its boundaries with a fixed pattern, as described in the previous section when referring to memory with variable content. The neighborhood of the stack can be inspected in this way, and the amount of data that exceeded the limit, causing the exception, can be reported as part of the error message itself, or examined after reset as part of the post-mortem debug process. Moreover, if no HW support is available for stack supervision, background checks on such a guard area can perform this critical task; a sketch of such a software guard is shown below.
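
A minimal sketch of such a software guard area, assuming a downward-growing stack and a guard region placed just below the stack bottom by the linker; the symbol names and the report hook are illustrative.

#include <stdint.h>
#include <stddef.h>

#define STACK_GUARD_WORDS   16u
#define STACK_GUARD_PATTERN 0xDEADu

extern uint16_t stack_guard[STACK_GUARD_WORDS];           /* placed below the stack bottom (linker) */
extern void report_stack_overflow(size_t words_spoiled);  /* hypothetical notification hook */

/* Fill the guard once at startup. */
void stack_guard_init(void)
{
    for (size_t i = 0; i < STACK_GUARD_WORDS; i++)
        stack_guard[i] = STACK_GUARD_PATTERN;
}

/* Called from a background task (or from the overflow ISR when HW support exists):
   counts the spoiled guard words, a rough measure of how far the stack overran. */
void stack_guard_check(void)
{
    size_t spoiled = 0;
    for (size_t i = 0; i < STACK_GUARD_WORDS; i++)
        if (stack_guard[i] != STACK_GUARD_PATTERN)
            spoiled++;
    if (spoiled)
        report_stack_overflow(spoiled);
}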

Memory Diagnostics
Detecting memory corruption at an early stage is useful to allow proper recovery action as soon as possible. As we’ve seen, such checks can be performed by the code either periodically on large areas or in a more targeted way directly when handling memory, e.g. when reserving/releasing dynamic data sections.

However, knowing that a memory area contains an unexpected value is useless on its own when it comes to debugging: for a detective, being aware that a murder has been committed is not as interesting as knowing the murderer’s identity. In addition to the features presented so far, a further technique can be used to complement or replace what has been described for monitoring system resources.

Whenever a HW timer or any kind of regular timing source is available, with a frequency comparable to the machine clock and linked to an interrupt line, the setup of a program counter polling routine can be an effective way to trace program execution stages, concurrently with the execution of consistency checks. Especially when the impacted area has been isolated, setting up a tight polling routine can make the difference in fault localization, both in promptness of detection and in pointing to the code area to investigate.

Figure 4 depicts the principle of continuously inspecting the affected area by means of regular code interruptions, the different tasks being represented by the horizontal colored bars.

Figure 4: Periodical memory polling

The polling routine will have the double role of checking data integrity and recording the sections most recently interrupted. Handling a circular buffer to store the program counter values will provide a valuable clue for system debugging once the corruption has been detected. Figure 5 shows a flow diagram for the interrupt routine; a code sketch is given after the figure. This function has to be as fast as possible, so the context saving should be limited to the registers really used/affected by the routine. According to the compare type configured by the diagnostic request (“equal to”, “less than”, “greater than”, and so on), the actual content of a memory location is verified against an expected value (or an expected threshold).

Figure 5: Memory diagnostic flow diagram
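
As an illustration of the flow in Figure 5, the routine might look like the following C sketch; the way the interrupted PC is obtained (get_interrupted_pc) and the reporting hook are platform-specific assumptions.

#include <stdint.h>

#define PC_TRACE_DEPTH 16u

typedef enum { CMP_EQUAL, CMP_LESS, CMP_GREATER } cmp_type_t;

typedef struct {
    volatile uint16_t *address;   /* memory location under diagnosis */
    uint16_t expected;            /* expected value or threshold */
    cmp_type_t cmp;               /* compare type configured by the diagnostic request */
    int enabled;
} mem_diag_cfg_t;

static mem_diag_cfg_t diag_cfg;
static uint32_t pc_trace[PC_TRACE_DEPTH];   /* circular buffer of interrupted PCs */
static uint32_t pc_idx;

extern uint32_t get_interrupted_pc(void);   /* platform specific (assumed) */
extern void report_diag_hit(uint16_t actual, const uint32_t *pc_log);  /* hypothetical hook */

/* Keep the ISR short: save only the registers it actually uses. */
void diag_timer_isr(void)
{
    pc_trace[pc_idx] = get_interrupted_pc();          /* record the interrupted code section */
    pc_idx = (pc_idx + 1) % PC_TRACE_DEPTH;

    if (diag_cfg.enabled) {                           /* check the monitored location */
        uint16_t actual = *diag_cfg.address;
        int hit = (diag_cfg.cmp == CMP_EQUAL) ? (actual != diag_cfg.expected)
                : (diag_cfg.cmp == CMP_LESS)  ? (actual >= diag_cfg.expected)
                :                               (actual <= diag_cfg.expected);
        if (hit)                                      /* the expected condition no longer holds */
            report_diag_hit(actual, pc_trace);        /* ship the evidence: value plus PC history */
    }
}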

Fault Reporting
Whatever the implementation strategy followed, collecting information to investigate the cause of a fault consists mainly of recording the status of the system at error occurrence and the external stimuli that preceded the erroneous behavior. Since this process should be selectively activated/configured, some sort of communication channel with the system under test has to be present to allow direct access for triggering the debug features. Of course, such a communication channel will usually be the same one used to collect the tracings.

Nowadays most systems, even the simplest, include at least one direct link to the application, and it is usually feasible to use such a link without the need to provide an extra access point. In the hypothetical architecture reported in Figure 2, we will assume that the microcontroller handles such a link to and from the DSPs.

Recording either internal state or inputs using the product itself will require, as already highlighted, system resources stolen from the pool of those available to perform the main tasks. The design of such tracing features in systems with strict performance constraints can then be nearly as delicate as developing the product itself.

In a typical commercial embedded application, hardware resources may already be heavily used because of the obvious aim of maximizing system density. That’s why inserting tracing features often requires considerable design effort to minimize the dedicated resources, both in terms of memory consumption and of performance. It is important, then, to understand which information to consider relevant (and to what extent) and to produce highly optimized code, both in terms of program size and of execution time.

Performance and memory consumption should be configurable and scalable. There is no point in having background tasks always running when their impact on system load is noticeable and no fault has been reported. In addition, even when configured as active, such tracings should be scalable to allow a step-by-step approach, depending on the investigation needs and according to the principle that each fault has its own peculiar characteristics.

Internal System State
There are mainly two kinds of data sets useful for fault investigation when it comes to reporting the system’s internal state. The first focuses on the status of the application at fault occurrence, possibly tracing the history in the moments that preceded the malfunction. The second is inspecting memory, looking for data consistency, current values, or even simply collecting tracing information accumulated locally.

Producing a snapshot of the state of an embedded system is then a matter of characterizing the program section currently executing and having some means available to access memory areas, dumping them if the system has been configured to do so.

Depending on the hardware and software architecture this could be a complicated task, so we will abstract here from specific system considerations and deal with general guidelines, applicable with extensions to any specific configuration, but assuming that it is possible to access the device core layer (like memories, HW resources, operating system).

State machines
Whatever the task your system was designed to perform, most probably it will be executed by implementing at least one state machine. Collecting information about the evolution of a state machine is a matter of recording the sequence of states it has passed through and the events that caused the transitions. Depending on the processing power you have available and on the bandwidth of your communication channel, tracing state machines can be performed either on a transition basis, reporting them one by one, or by circularly tracing the most recent ones into an internal log. The first approach is also useful during design phases, to monitor whether your system evolution matches the expected behavior while testing it against the requirements. The second is usually more debug-oriented, typically stealing fewer resources, since the data is collected internally and read out only when needed. The evident drawback in the latter case is that the history of the transitions is limited by the circular buffer size; in systems with fast transitions, this could be limiting. A minimal sketch of such a transition log follows.
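
A minimal C sketch of the second, log-based approach is shown below; the entry layout and the timer source are assumptions for illustration.

#include <stdint.h>

#define FSM_TRACE_DEPTH 32u

typedef struct {
    uint16_t old_state;
    uint16_t event;       /* event that caused the transition */
    uint16_t new_state;
    uint16_t timestamp;   /* e.g. low word of a free-running timer */
} fsm_trace_entry_t;

static fsm_trace_entry_t fsm_trace[FSM_TRACE_DEPTH];
static uint32_t fsm_idx;

extern uint16_t hw_timer_low(void);   /* platform timer (assumed) */

/* Call at every state-machine transition point. */
void fsm_trace_transition(uint16_t old_state, uint16_t event, uint16_t new_state)
{
    fsm_trace_entry_t *e = &fsm_trace[fsm_idx];
    e->old_state = old_state;
    e->event     = event;
    e->new_state = new_state;
    e->timestamp = hw_timer_low();
    fsm_idx = (fsm_idx + 1) % FSM_TRACE_DEPTH;
}

/* Read out on request, or attached to a fault notification. */
const fsm_trace_entry_t *fsm_trace_get(uint32_t *depth)
{
    *depth = FSM_TRACE_DEPTH;
    return fsm_trace;
}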

Program execution
In terms of characterizing the current program execution, dumping the stack is a commonly used technique, both when the exception is capable of triggering an interrupt and when the reporting of the fault is instead software driven, so that its localization is already known at error detection. In fact the stack dump always produces a snapshot of the function call chain and some local parameter values, synchronous with the fault occurrence, as shown in Figure 6, where it is assumed that all parameters are passed on the stack.

The depth of the stack to dump is a variable whose value depends on the code section in which the malfunction was detected. That’s why this parameter should be configurable, to accommodate the needs of the specific section under investigation. But even with a configurable log size, dumping the stack may not be a valid tool for tracing the program flow. Consider for example pre-emptive operating systems, in which the change of context is not even deterministic: fixing a length for such a trace could result in incomplete data due to scheduling issues. To work around this limit, when there is direct access to the operating system, circular buffers can be implemented in the scheduling core to maintain a trace of the context changes applied, at least to some extent.

Figure 6:  Stack dumping info

As depicted in Figure 7, having access to the scheduler automatically provides a means to record system evolution in terms of execution flow. Such information is highly desirable when the fault is triggered by well-defined sequences in the execution chain.

Figure 7: Scheduling tracing

Having access to the scheduling scheme also allows you to register system resource availability at context switching, highlighting bottleneck situations. Stack snapshots and program flow traces provide good debug hints when the sequence of the running sections determined the error conditions. As already illustrated in the memory diagnostics paragraph, when a source of periodic interruption is available, the related interrupt service routine can be used to keep track of the running code section while performing configurable checks on system resources in parallel. Configuring the polling frequency permits either tracing in detail the section running when the error occurred, or recording the task switching before its execution. A minimal sketch of a scheduler-level trace is shown below.
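
The following C sketch shows one way such a scheduler-level trace could be hooked in; the hook point, task-id type and the resource probe are assumptions about a generic RTOS, not a specific API.

#include <stdint.h>

#define CTX_TRACE_DEPTH 64u

typedef struct {
    uint16_t task_out;     /* task being preempted/suspended */
    uint16_t task_in;      /* task being scheduled */
    uint32_t timestamp;    /* free-running timer value at the switch */
    uint16_t free_msgs;    /* example resource figure sampled at the switch */
} ctx_trace_entry_t;

static ctx_trace_entry_t ctx_trace[CTX_TRACE_DEPTH];
static uint32_t ctx_idx;

extern uint32_t hw_timer_now(void);          /* platform timer (assumed) */
extern uint16_t os_free_message_count(void); /* hypothetical resource probe */

/* Call from the scheduler core, immediately before the context switch is applied. */
void ctx_trace_hook(uint16_t task_out, uint16_t task_in)
{
    ctx_trace_entry_t *e = &ctx_trace[ctx_idx];
    e->task_out  = task_out;
    e->task_in   = task_in;
    e->timestamp = hw_timer_now();
    e->free_msgs = os_free_message_count();  /* resource availability at the switch */
    ctx_idx = (ctx_idx + 1) % CTX_TRACE_DEPTH;
}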

As illustrated in Figure 5, the system’s evolution is traced in the program counter buffer, which can be made available for inspection once the fault has occurred. The drawback of such an approach is the higher consumption of system resources: consider for example Figure 8, which depicts the increase in load in a real system when a timer interrupt is dedicated to such a PC polling feature. The measurements show the negative effects of increasing the polling frequency, as well as the different contribution according to the load status of the system while the polling was activated.

Figure 8:  Performance impacts of PC polling

The shape of the curves also depends on the priority assigned to the interrupt and on the amount of data the polling routine has to collect. On the other hand, such a direct approach to program flow tracing is generally more compact in terms of dedicated code size compared to inserting traces directly at the scheduler core level.

Figure 8 also shows how important it is to characterize the MIPS consumption profile of the system. If the application already works under heavy stress conditions, the polling rate cannot be pushed beyond a certain value without endangering its correct behavior, making this feature very powerful but at the same time tricky.

Memory
In case the code section affected by the fault is delimited, the investigation area can be narrowed and the focus can be moved to the memory areas it uses. But even more generally, when the fault pops up in different sections every time, implementing memory logs that are synchronous with the error is a highly useful feature.

Providing configurable dumps at fault occurrence is a simple and powerful method to check actual data values and/or consistency. The pointers to the memory areas to dump can be configured directly while enabling the feature, if the sections to be inspected are statically allocated. Otherwise, the system can be programmed to extract addresses from the input parameters of the function reporting the fault, if this is always the same function. This can be done by retrieving memory pointers from registers or the stack; the dumping procedure will be slightly more complicated, but generally still easy to implement. A minimal sketch of a configurable dump at fault reporting follows.
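
The sketch below illustrates the statically allocated case; the dump descriptor, the tag values and the debug_link_send transport are assumed for illustration.

#include <stdint.h>
#include <stddef.h>

#define MAX_DUMP_REGIONS 4u

typedef struct {
    const uint16_t *addr;    /* area to dump (statically allocated case) */
    size_t          words;
} dump_region_t;

static dump_region_t dump_cfg[MAX_DUMP_REGIONS];
static size_t dump_cfg_count;

extern void debug_link_send(uint16_t tag, const uint16_t *data, size_t words); /* assumed O&M/debug link */

/* Configured over the debug channel when the feature is enabled. */
void fault_dump_configure(const dump_region_t *regions, size_t count)
{
    dump_cfg_count = (count > MAX_DUMP_REGIONS) ? MAX_DUMP_REGIONS : count;
    for (size_t i = 0; i < dump_cfg_count; i++)
        dump_cfg[i] = regions[i];
}

/* Called by the error-notification path, synchronously with the fault. */
void fault_dump_emit(uint16_t fault_code)
{
    debug_link_send(fault_code, NULL, 0);            /* the fault indication itself */
    for (size_t i = 0; i < dump_cfg_count; i++)
        debug_link_send(0xD0 + (uint16_t)i,          /* hypothetical dump tags */
                        dump_cfg[i].addr, dump_cfg[i].words);
}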

Inspection of the memory logs can show whether the problem has arisen because of unexpected stimuli or whether it has to be ascribed to a bug in the data processing chain. If the data collected is inconsistent, the polling method described in the diagnostics section can be set up to highlight the area of code spoiling the incoherent locations. The complexity of such a check is reduced if the memory areas in the system are statically allocated, but could increase quite a lot, or even become impossible, if the area is dynamically allocated.

If the data collected contains values within the allowed range, local testing of the faulty section can be performed off-line, replicating the stimuli traced in the recordings. This can be a good approach for complex data processing functions acting on chunks of information whose possible combinations are so numerous that a complete coverage test set would be too large to implement fully.

External inputs
The techniques mentioned above can be used when previous attempts to isolate/reproduce the fault have failed and suspicion converges on an internal misbehavior. But the first approach to bug fixing is usually to collect information on how the system was stimulated before the fault occurred; that is, how the external world was “perceived” by the application.

As already stated, embedded systems by their nature often have few resources available for tracing recent inputs; on the other hand, their input ports are not always easily accessible externally, so some sort of internal recording is often needed.

A device generating an unexpected output does not exclude the possibility that the device itself received a wrong stimulus. While testing a system, the focus of the test patterns is generally on input sequences that will most probably be received in the field. Negative tests are usually considered a best-effort activity, since possible disturbances to the input are infinite. For this reason too, it can be crucial to be able to collect the data received by the system in order to characterize the fault.

The possibility of recording the external input within the application is useful when it is not possible to trace the input stream using other tools, but also helpful when the data collected externally is not synchronous with the fault. This typically occurs when the error itself does not result in an evident alteration of the output, so that pointing out the exact stimuli causing the fault is not an easy task.

Once again, implementing input tracing features is a matter of trading off the resources dedicated to the task against the level of detail the tracing should reveal. A careful analysis has to be performed at the design stage to highlight the input parameters considered critical. Such parameters have to be logged efficiently into limited circular buffers and reported at the debug stage.

Usually the extraction of such parameters will not be configurable, due to the difficulty of allowing run-time selection while still meeting low memory-consumption requirements. It is typically at the design stage that both the parameters to extract and their mapping into a dedicated packed data structure are defined; a minimal sketch of such a packed input log follows.
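
In the following C sketch, the chosen fields (codec mode, frame type, CRC flag, channel) are illustrative; the real selection is made at design time for the specific application.

#include <stdint.h>

#define IN_LOG_DEPTH 64u

/* Bit-field packing keeps each entry to one 16-bit word. */
typedef struct {
    uint16_t codec_mode : 4;   /* e.g. AMR codec mode index */
    uint16_t frame_type : 3;   /* speech / SID / no data ... */
    uint16_t crc_error  : 1;   /* CRC flag carried by the incoming frame */
    uint16_t channel    : 8;   /* channel the frame belongs to */
} input_log_entry_t;

static input_log_entry_t input_log[IN_LOG_DEPTH];
static uint32_t input_idx;

/* Call from the input handler for every incoming frame. */
void input_log_record(uint16_t codec_mode, uint16_t frame_type,
                      uint16_t crc_error, uint16_t channel)
{
    input_log_entry_t e;
    e.codec_mode = codec_mode & 0x0F;
    e.frame_type = frame_type & 0x07;
    e.crc_error  = crc_error  & 0x01;
    e.channel    = channel    & 0xFF;
    input_log[input_idx] = e;
    input_idx = (input_idx + 1) % IN_LOG_DEPTH;   /* circular: keeps the most recent frames */
}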

Environment probing
Another advantage of developing systems equipped with built-in features to collect recent stimuli is that the same input tracing features can also be used to implement low-cost field environment probing, not necessarily related to the manifestation of a fault. The usefulness of such probing is especially evident for systems in which the quality of the input is not guaranteed, but a deteriorated stimulus affects the output in a way that is perceived externally as a malfunction of the device processing the data.

Collecting internal reports on the quality of the incoming data is generally achievable with small add-ons to what is already implemented to record the external input. Buffers with a simpler structure can be used to trace the selected parameters. No optimized packing scheme is needed, since the monitored events will generally not occur frequently enough to saturate the area dedicated to the logs. In this way, simple features such as plain counters can be provided, or even more complex statistical recordings, e.g. traces to correlate in time the occurrences of multiple events of a different nature.

As an example, consider the case of a customer suddenly complaining about the quality of service of your speech processing product at their premises. Enabling a log on your input and reporting that your application is detecting a high percentage of incoming frames affected by CRC errors can speed up the investigation. Measures can be immediately diverted towards the part of your system providing frames to the processing devices, or even to the traffic link that feeds the complete system. Once again, this kind of report also helps speed up integration in the product development phase, typically highlighting possible interworking issues with the rest of the product architecture at an early stage.

Post Mortem debug
Fault tolerance in an embedded system sometimes makes it difficult to retrieve useful data for troubleshooting. Suppose that when one of the DSPs notifies a fault, the microcontroller attempts a recovery by resetting the DSP. In this case, besides the information associated with the error indication, there is no means to access the DSP internal structures, since the reboot operation has cleared them.

It is not recommended to completely exclude this kind of fault recovery action, since it avoids the propagation of the fault to other DSPs. On the other hand, we need to access some extra information before it vanishes. What we would like to have is a trick to occasionally disable the effect of a restart: a common way to do it is to implement a conditional branch in the DSP start-up code, as shown in Figure 9. The condition on this branch will be the value of a certain common memory location, RecoveryFilter_flag, whose default value will make our DSP complete the boot and go into operating mode.

If the microcontroller has instead modified the flag via an O&M command, the DSP will not complete the normal startup and will instead enter an idle mode for debug purposes. This debug mode will allow access to the DSP internal structures in local memory, so a minimal HW/SW configuration granting this access must be started; a sketch of such a boot-time branch follows the figure.

Figure 9: Recovery filter implementation
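
A minimal C sketch of the branch of Figure 9 is shown below; the flag values and the two init routines are illustrative assumptions.

#include <stdint.h>

#define RECOVERY_FILTER_NORMAL 0x0000u   /* default: complete boot normally */
#define RECOVERY_FILTER_PARK   0xDEB0u   /* example value set by the microcontroller via O&M */

/* RecoveryFilter_flag lives in the common (shared) memory, so the
   microcontroller can change it before releasing the DSP from reset. */
extern volatile uint16_t RecoveryFilter_flag;

extern void dsp_full_init(void);          /* normal application start-up (assumed) */
extern void dsp_debug_minimal_init(void); /* just enough HW/SW to allow memory access (assumed) */

void dsp_boot(void)
{
    if (RecoveryFilter_flag == RECOVERY_FILTER_PARK) {
        /* Park the DSP in debug mode: keep local memories untouched and
           wait for post-mortem inspection over the IPC/debug link. */
        dsp_debug_minimal_init();
        for (;;)
            ;                             /* idle, serviced by debug interrupts */
    }

    /* Default path: normal boot into operating mode. */
    dsp_full_init();
}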

Once the DSP affected by the fault has been “parked” in debug mode, all the needed investigation steps can be performed in parallel while the rest of the system is still handling the main task. When the needed data has been retrieved, the faulty device can once again be inserted in the pool and returned to normal operation.

Cost considerations
The impact on the project budget is of course the first aspect taken into consideration when it comes to implementing features not strictly requested by the end user. Volume and deployment forecasts for the product drive the decision to invest some design effort to guarantee proper support. The product life cycle also plays a role, since future developments on the platform can benefit from the chosen features in design phases to come. Table 1 shows an excerpt from a real project cost sheet. The costs are from a large project, spanning several months, implementing a completely new data protocol on a legacy platform with known load limitations. The microcontroller costs include the design of tools to decode the logs; the actual features were mostly in the DSP.

Table 1: Design&Test actual figures in man-hours

Case study – Transcoder Application
The GSM transcoder is located between the core network and the radio base station (BTS). In the downlink direction it codes PCM speech samples coming from the network into speech parameters that, packed into frames, are sent to the BTS, which forwards them to a mobile terminal over the air interface. In the uplink direction it extracts and decodes the speech parameters coming from the BTS into a PCM stream. The Adaptive Multi-Rate (AMR) codec is one of the algorithms used for the coding and decoding operations; this algorithm adapts the bit rate (codec mode) according to the radio channel conditions.

In the application under analysis, the speech coding and decoding algorithms are implemented in a cluster of DSPs and are executed on several systems-on-chip (SoCs). A board processor controls the SoCs; it collects alarms and debugging information from the DSPs and sends them to an O&M interface.

From this interface it is possible to save a log file, which can be post-processed by a parsing SW program. Each DSP can perform coding and decoding for N full-duplex channels (uplink and downlink directions). In order to boost performance, critical parts of the code are implemented in assembly language; this is a delicate and error-prone development. In this scenario, application errors can be of two kinds:

  • errors affecting only the channel being served
  • errors affecting the whole processor and thus impacting all the channels handled by that unit

A typical example of the second kind is memory corruption (bad stack handling, wrong pointer assignment).

Let’s suppose that our DSP code is affected by a flaw which happens to corrupt the data memory after several days of operation. The microcontroller is able to perform memory supervision on the shared memory and reports a memory corruption, causing a severe fault which resets all DSPs. At post-mortem debugging, a memory dump of all DSP memories reveals that the corruption also affects address 0x0000 of some local memories.

Null pointers usually cause this kind of corruption, but it is quite difficult to find the bug in live, complex assembly code, like a speech codec algorithm, without any clue. Luckily, the DSP application implements a fault-reporting feature that outputs incoming frame information (control data carried by input frames) and a memory diagnostic feature.

As shown in the log below, the first probable corruption is at 0x0000, so we can use the memory diagnostic on that address, so that debug information is captured as soon as the corruption of 0x0000 is detected. When the diagnostic triggers, the DSP reports the content of a buffer with the last frames' control information, which suggests that the fault happened when the DSP was decoding a 5,90 KBit/s speech frame incoming from the BTS.

Code Sample 1

PAM_ERROR_LOG: Cic = 1 Pam = 1 LogType = UL channel info(2)
PamLog[0-40] =
0x0004 0x5c10 0xc200 0x5c50 0xc200 0x5c10 0xc200 0x5c50 0xc200 0x5c10 0xc200 0x5c50 0xc200
0x5c10 0xc200 0x5c50 0xc200 0x5c10 0xc200 0x5c50 0xc200 0x5c10 0xc200 0x5c50 0xc200 0x5c10 0xc200
0x5c50 0xc200 0x5c10 0xc200 0x5c50 0xc200 0x5c10 0xc200 0x5c50 0xc200 0x5c10 0xc200 0x5c50 0xc200
UL Channel Info Decoding : UL Frame Info Adaptive Speech:
Queue[ 0]: (Speech) Queue[ -1]: (Speech) Queue[ -2]: (Speech)
0x5c50 DEBUG 0x5c10 DEBUG 0x5c50 DEBUG
TimeAlign = No T.A. TimeAlign = No T.A. TimeAlign = No T.A.
CMR/CMI Rec = 5,90 KBit/s CMR/CMI Rec = 5,90 KBit/s CMR/CMI Rec = 5,90 KBit/s
RIF = CMR RIF = CMI RIF = CMR
TimeAlignCmd = 56 TimeAlignCmd = 56 TimeAlignCmd = 56
FrType = Speech FrType = Speech FrType = Speech
0xc200 DEBUG 0xc200 DEBUG 0xc200 DEBUG
BuffAdj = 0 BuffAdj = 0 BuffAdj = 0
SpFlag = 1 SpFlag = 1 SpFlag = 1
SPARE = 0 SPARE = 0 SPARE = 0
BFI = 0 BFI = 0 BFI = 0
AntiSyncErr = NO AntiSyncErr = NO AntiSyncErr = NO
CRC_Err = NO CRC_Err = NO CRC_Err = NO
FrClass = Speech Good FrClass = Speech Good FrClass = Speech Good
… … …

The program counter (PC) polling history is reported together with internal system states and information about the channel ID and the processes inserted in the scheduler queues, as shown below:

Code Sample 2

PAM_ERROR_LOG: Cic = 1 Pam = 4 LogType = DSP Info(0)
PamLog[0-41] =
0x0000 0x0004 0x0004 0x000e 0x0001 0x8ae2 0x0001 0x8b15 0x0001 0x8aee 0x0000 0xdeab 0x0001
0x8b06 0x0001 0x8b2c 0x0000 0xb528 0x0000 0xe0ae 0x0001 0x8b2c 0x0001 0x8af9 0x5a6b 0x6d3a 0x5a2c
0x1263 0x5510 0x0016 0x000d 0x0000 0x0000 0x0000 0x0000 0x0003 0x0000 0x0000 0x0000 0x0000 0x0000
0x0000
DSP Info Decoding
0x0000 DEBUG 0x5a6b DEBUG 0x5510 DEBUG
DiagnosticsMemoryAddress = 0x0000 DLTickCounterCh0 = 107 CurrPhInfo( 0)Dir = UL
0x0004 DEBUG ULTickCounterCh0 = 90 CurrPhInfo( 0) Ch = 0
DiagnosticsMemoryExpectedValue = 0x0004 0x6d3a DEBUG CurrPhInfo( 0)MCC = OFF
0x0004 DEBUG DLTickCounterCh1 = 58 PrevPhInfo(-1)Dir = DL
DiagnosticsMemoryActualValue = 0x0004 ULTickCounterCh1 = 109 PrevPhInfo(-1) Ch = 0
0x5a2c DEBUG PrevPhInfo(-1)MCC = OFF
Queue[ 0]: DiagnosticsPCBuffer = 0x0000b528 DLTickCounterCh2 = 44 PrevPhInfo(-2)Dir = DL
Queue[ -1]: DiagnosticsPCBuffer = 0x00018b2c ULTickCounterCh2 = 90 PrevPhInfo(-2) Ch = 2
Queue[ -2]: DiagnosticsPCBuffer = 0x00018b06 0x1263 DEBUG PrevPhInfo(-2)MCC = OFF
Queue[ -3]: DiagnosticsPCBuffer = 0x0000deab CurrPhNumb( 0) = 3 PrevPhInfo(-3)Dir = DL
Queue[ -4]: DiagnosticsPCBuffer = 0x00018aee PrevPhNumb(-1) = 6 PrevPhInfo(-3) Ch = 2
Queue[ -5]: DiagnosticsPCBuffer = 0x00018b15 PrevPhNumb(-2) = 2 PrevPhInfo(-3)MCC = OFF
Queue[ -6]: DiagnosticsPCBuffer = 0x00018ae2 PrevPhNumb(-3) = 1
Queue[ -7]: DiagnosticsPCBuffer = 0x00018af9 0x0016 DEBUG
Queue[ -8]: DiagnosticsPCBuffer = 0x00018b2c NumberElementsInProcessQueuePrio1 = 22
Queue[ -9]: DiagnosticsPCBuffer = 0x0000e0ae 0x000d DEBUG
NumberElementsInProcessQueuePrio2 = 13
0x0000 DEBUG
NumberElementsInProcessQueuePrio3 = 0
0x0000 DEBUG
NumberElementsInProcessQueuePrio4 = 0

From this information it is possible to drastically narrow the investigation: the fault happened during UL processing, when the incoming frame had a codec mode of 5,90 KBit/s, while channel 0 was being processed, and during the 3rd phase of the decoding process. Still, the lines of code belonging to this class are too many to find the fault quickly with a code inspection, but the program counter polling turns out to be decisive. When the PC was 0xb528 (PC[0]), the memory corruption had already happened, while when the PC was 0x18b2c (PC[-1]) the memory was still OK.

The two PCs belong to two different functions, and the program flow from PC[-1] to PC[0] is still too large. Repeating the test with a higher polling frequency is possible but risky, because of the MIPS load it implies. Instead, since the fault also happened on other DSPs, all the data can be used to find the minimal program-flow distance between all the recorded PC[-1] and PC[0] values. In other words, the bug is in a function in the range (max(PC[-1]), min(PC[0])), that is, the intersection of the program segments suspected by each DSP.

Conclusions
Fault detection and correction times are among the key factors in building and maintaining a good long-term relationship with customers. Providing your applications with unintrusive debug facilities is an effective way to shorten fault analysis time throughout the product’s life cycle, but especially in the maintenance phase, when the use of test instrumentation is not always possible. Furthermore, in embedded systems all the relevant data is often unreachable with simple measurements, and granting accessibility to the core of the system is essential to perform the investigation.

System self-awareness and detailed fault reporting are the two main strategies to increase the chance of localizing and identifying the fault at its first occurrence. An early characterization of the misbehavior removes the need to deliver trap-patched test SW versions and wait for the fault to pop up again, hoping to retrieve what is missing in your measurements.

While the advantages are evident, the trade-off of such a debug-oriented design clearly consists in reserving extra effort at development time for tools that are not guaranteed to ever be used, and certainly not used massively. The matter is a technical vs. marketing discussion, but designing and selling a product with effectively high maintainability will bring rewarding benefits in terms of customer relations.

Lorenzo Fasanelli is a senior embedded-software specialist for Ericsson Labs in Italy; Lorenzo Lupini is Embedded SW Specialist and Massimo Quagliani is Embedded SW Senior Specialist at Marconi S.p.A.

This article was part of a class taught by the authors at the Embedded Systems Conference (ESC-360).
