Capturing and Debugging System Crashes
System crashes can be difficult to debug, especially if days elapse between failures. Although the root cause of a system crash tends to be unique, the concepts, techniques, and logic analyzer tools described here can be applied broadly to troubleshoot system crashes in at least three common scenarios that embedded system developers often face:
Scenario #1: If the system clock dies when the system crashes, modern logic analyzers provide 'trigger on user stop' trigger macros and 'stop' buttons to capture events that lock up a system. These convenient trigger and stop commands allow state mode capture of data up to a month after the system crash occurs.
Scenario #2: If the system clock keeps running after the system crashes, logic analyzer triggers need to be more creative to capture the events leading up to the system crash. Developing timeout triggers that fire when signals stop behaving within normal parameters allows the user to view the events immediately preceding the crash.
Scenario #3: When the goal is to simplify logic analyzer measurements, the enhanced features and tools available on modern logic analyzers can help. This section explores different logic analyzer features and tools, and describes how each can be used to debug system crashes.
Regardless of the technique or set of tools used to debug a system crash, recreating the failure conditions is important in determining the root cause of the system crash.
Log files, the software applications or tests that were running, BIOS settings, temperature, and power are all things to consider when isolating system failures. The engineer (or user) should gather as much information as possible about normal system operation and the failure symptoms while preparing to debug the system crash.
Scenario #1: If the system clock dies when the system crashes, logic analyzer triggers vary significantly depending on whether the analyzer is in state or timing mode.
In state mode, the logic analyzer samples using the clock signal from the device under test. This type of clocking makes the sampling of data into the logic analyzer synchronous to the clocked events on the device under test. When the clock from the system under test stops, the logic analyzer stops collecting data.
Many logic analyzers will give a warning "slow or missing clock" when the clock is not present. Some logic analyzers require a continuous clock in state mode, while other logic analyzers can tolerate bursty clocks in state mode (bursty clocks go dormant for periods of time, perhaps while a system is idle). Individual logic analyzer module data sheets provide details on clock requirements.
Note: If the error "clock edges too close together" is shown, the logic analyzer saw clock edges closer together than the instrument can reliably capture. In this case, the system clock (or the logic analyzer connection to the clock) is suspect and must be corrected. Bad ground connections can cause this issue.
As shown in Figure 1 below, the "Run until user stop" trigger macro, or an equivalent macro, is available on most modern logic analyzers. This trigger sets up the logic analyzer to never trigger, so the stop button must be pushed to view captured data.
Figure 1: "Run until user stop" trigger macro from a 16950B Logic Analyzer. Each macro includes a fly-out diagram that describes how the trigger macro works.
In state mode, 'run until user stop' is an effective trigger for capturing a system crash where the system clock dies, because the logic analyzer stops sampling data when the system clock fails. When the "stop" button is hit, the logic analyzer produces two extra clock edges (if needed) so that it can finish processing the data.
Modern logic analyzers have counters that allow up to 32 days between the crash and hitting the stop button. Unfortunately, many older logic analyzer modules had smaller counters that would wrap in seconds, making it virtually impossible to capture data from a system crash by using the stop button. (When the logic analyzer counter wraps, the instrument can no longer determine when events happened relative to one another.)
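The difference between a counter that wraps in seconds and one that lasts for weeks comes down to simple arithmetic. The sketch below is only an illustration: the 32-bit width and the 4ns tick are hypothetical values chosen for the example, not figures from any logic analyzer data sheet.

```python
# Hypothetical illustration: how long a free-running time counter lasts
# before it wraps. Widths and tick sizes are assumptions, not data-sheet values.

def wrap_time_ps(counter_bits: int, tick_ps: int) -> int:
    """Time in picoseconds before a counter of the given width wraps."""
    return (1 << counter_bits) * tick_ps

def bits_needed(duration_ps: int, tick_ps: int) -> int:
    """Smallest counter width that covers duration_ps without wrapping."""
    ticks = -(-duration_ps // tick_ps)  # ceiling division
    return max(1, (ticks - 1).bit_length())

if __name__ == "__main__":
    # A 32-bit counter with a 4 ns tick wraps after about 17.18 seconds.
    print(wrap_time_ps(32, 4000) / 1e12, "seconds")
    # Covering 32 days at the same tick needs a much wider counter.
    thirty_two_days_ps = 32 * 86400 * 10**12
    print(bits_needed(thirty_two_days_ps, 4000), "bits")
```

Under these assumed numbers, a narrow counter wraps long before anyone could reach the stop button, while a counter sized for 32 days needs roughly 50 bits.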
Set the trigger position to 0% post store to ensure the logic analyzer captures as much data as possible and utilizes the full memory depth of the analyzer to show events leading up to the crash.
In timing mode, the logic analyzer provides the clock for sampling data and samples events in the system under test asynchronous to the clock on the system under test.
A guideline for timing measurements is to have a sampling rate at least four times the clock frequency to capture meaningful data. Higher sample rates provide the best resolution. (Remember to consider channel-to-channel accuracy and resolution when making measurements across channels.)
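As a quick illustration of the four-times guideline, the following sketch computes the lowest sample rate that satisfies it and the resulting sample period. The function names are made up for this example, and the 250MHz clock is just a sample input.

```python
# Minimal sketch of the "at least four times the clock frequency" guideline.
# Function names are illustrative only.

def min_timing_sample_rate_hz(clock_hz: float, oversample: float = 4.0) -> float:
    """Lowest sample rate that meets the oversampling guideline."""
    if clock_hz <= 0 or oversample <= 0:
        raise ValueError("clock_hz and oversample must be positive")
    return clock_hz * oversample

def sample_period_ns(sample_rate_hz: float) -> float:
    """Time between samples, in nanoseconds."""
    return 1e9 / sample_rate_hz

if __name__ == "__main__":
    rate = min_timing_sample_rate_hz(250e6)  # 250 MHz system clock
    print(rate, "Hz,", sample_period_ns(rate), "ns per sample")
```

For a 250MHz clock, the guideline calls for sampling at 1GHz or faster, which corresponds to a sample period of 1ns or less.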
When the logic analyzer is in timing mode, pushing a 'stop' button is not an appropriate way to capture system crashes. This is because the logic analyzer usually samples far faster than a human can react.
Even if the user noticed a system crash quickly, by the time the stop button could be pushed, the analyzer memory would be filled with flat-line post-crash data containing no information on the events leading up to the system crash.
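A rough calculation makes the point concrete. In the sketch below, the memory depth, sample rate, and human reaction time are all assumed values picked for illustration; actual figures depend on the analyzer module and the person at the bench.

```python
# Rough sketch: how much time the acquisition memory spans in timing mode,
# compared with an assumed human reaction time. All numbers are hypothetical.

def capture_window_s(memory_depth_samples: int, sample_rate_hz: float) -> float:
    """Length of time covered by a full acquisition memory."""
    return memory_depth_samples / sample_rate_hz

def stop_button_keeps_crash(window_s: float, reaction_s: float) -> bool:
    """True only if the memory still holds pre-crash data when stop is pressed."""
    return window_s > reaction_s

if __name__ == "__main__":
    depth = 64 * 1024 * 1024   # assumed 64M-sample memory
    rate_hz = 1e9              # assumed 1 GHz timing-mode sample rate
    reaction_s = 0.25          # assumed best-case human reaction time
    window = capture_window_s(depth, rate_hz)
    print(f"memory spans {window:.3f} s")
    print("pre-crash data survives:", stop_button_keeps_crash(window, reaction_s))
```

With these assumed values the memory spans well under a tenth of a second, so it would be overwritten with post-crash samples several times over before the stop button was pressed.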
Transitional timing mode is a special mode available on select logic analyzers. In this mode, the logic analyzer samples on its internal clock but stores data only when the signals transition. Because flat-line post-crash data contains no transitions, it consumes very little memory, so trigger on stop is a viable trigger in transitional timing mode.
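The idea behind transitional storage can be sketched in a few lines. This is a simplified model written for illustration, not a description of how any particular analyzer implements the mode: it keeps a sample only when its value differs from the previous one, along with the sample index as a stand-in for a timestamp.

```python
# Simplified model of transitional storage: keep a sample only when the
# value changes. Real instruments differ in detail; this is an illustration.

def transitional_store(samples):
    """Return (sample_index, value) pairs for the first sample and each change."""
    stored = []
    previous = None
    for index, value in enumerate(samples):
        if index == 0 or value != previous:
            stored.append((index, value))
        previous = value
    return stored

if __name__ == "__main__":
    # A signal that toggles for a while and then goes flat after a "crash".
    trace = [0, 1, 0, 1, 0, 1] + [1] * 1000
    kept = transitional_store(trace)
    print(len(trace), "samples taken,", len(kept), "stored")
```

In this model the long flat stretch after the last transition adds nothing to the stored data, which is why the activity before the crash is still in memory when the stop button is finally pressed.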
Figure 2: Edges too far apart. This is an example from a 16950B Logic Analyzer in timing mode. Notice that the macro includes a diagram and description of the trigger, which makes it simple to fill in the blanks with signal names.
One trigger macro available in timing mode that can be used for capturing traces when the system clock crashes is "edges too far apart." As shown in Figure 2 above, the logic analyzer will trigger if a rising edge of the system clock isn't seen within 4ns of the previous rising clock edge.
This trigger is appropriate for system clock frequencies faster than about 270MHz. For slower system clock frequencies, a time period longer than 4ns should be selected.
Rough estimates are fine, since the goal is to get the logic analyzer to notice that the system clock has stopped so that it will trigger before the memory is filled with post-crash data of little interest.
For example, even though 4ns is the period of a 250MHz clock, the user needs to allow for the sampling resolution of the logic analyzer making the timing mode measurement.
If the example 4ns timer trigger is used to capture a crash on a system with a 250MHz clock, the logic analyzer would trigger falsely any time its sampling resolution suggested a violation.
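The false-trigger effect can be reproduced with a small simulation. The sketch below is a simplified model under stated assumptions: it samples an ideal 250MHz, 50% duty-cycle clock at a hypothetical 600ps sample period and measures the spacing between rising edges as the sampled data shows them. It does not model the resolution of a real analyzer's trigger timer, and all names and values are illustrative.

```python
# Simplified model: sample an ideal clock asynchronously and measure the
# rising-edge spacing seen in the sampled data. Times are integer picoseconds.
# The 600 ps sample period is a hypothetical value chosen for illustration.

def rising_edge_times_ps(clock_period_ps, sample_period_ps, n_samples):
    """Sample times at which a 0-to-1 transition is observed."""
    edges = []
    previous = None
    for k in range(n_samples):
        t = k * sample_period_ps
        level = 1 if (t % clock_period_ps) < clock_period_ps // 2 else 0
        if previous == 0 and level == 1:
            edges.append(t)
        previous = level
    return edges

def observed_intervals_ps(edges):
    """Spacing between consecutive observed rising edges."""
    return [b - a for a, b in zip(edges, edges[1:])]

def false_triggers(intervals_ps, timeout_ps):
    """True if any observed spacing exceeds the 'edges too far apart' timeout."""
    return any(interval > timeout_ps for interval in intervals_ps)

if __name__ == "__main__":
    edges = rising_edge_times_ps(4000, 600, 200)   # 250 MHz clock, 600 ps samples
    intervals = observed_intervals_ps(edges)
    print("observed spacings (ps):", sorted(set(intervals)))
    print("4000 ps timeout false-triggers:", false_triggers(intervals, 4000))
    print("4600 ps timeout false-triggers:", false_triggers(intervals, 4600))
```

In this model the healthy clock appears to have edge spacings of 3600ps and 4200ps, so a 4000ps timeout fires even though nothing is wrong, while a timeout of one clock period plus one sample period (4600ps) does not. This matches the advice above that a rough, generous estimate is safer than a timeout equal to the nominal clock period.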