Capturing and Debugging System Crashes - Embedded.com

Capturing and Debugging System Crashes

System crashes can be difficult to debug, especially if days elapse between failures. Although the root cause of each system crash tends to be unique, the concepts, techniques, and logic analyzer tools described here can be applied broadly to troubleshoot system crashes in at least three common scenarios embedded system developers often face:

Scenario #1: If the system clock dies when the system crashes, modern logic analyzers provide 'trigger on user stop' trigger macros and 'stop' buttons to capture events that lock up a system. These convenient trigger and stop commands allow state mode capture of data up to a month after the system crash occurs.

Scenario #2: If the system clock keeps running after the system crashes, logic analyzer triggers need to be more creative to capture the events leading up to the system crash. Developing time-out triggers to capture signals that stop behaving within normal parameters allows the user to view events immediately preceding the crash.

Scenario #3: The need to simplify logic analyzer measurements and take advantage of the enhanced features and tools available on modern logic analyzers. This section explores different logic analyzer features and tools, and describes how each can be used to debug system crashes.

Regardless of the technique or set of tools used to debug a system crash, recreating the failure conditions is important in determining the root cause of the system crash.

Log files, the software applications or tests that are running, BIOS settings, temperature, and power are all things to consider when isolating system failures. The engineer (or user) should gather as much information as possible about normal system operation and the failure symptoms while preparing to debug the system crash.

Scenario #1: If the system clock dies when the system crashes, logic analyzer triggers vary significantly depending on whether the analyzer is in state or timing mode.

In state mode, the logic analyzer samples using the clock signal from the device under test. This type of clocking makes the sampling of data into the logic analyzer synchronous to the clocked events on the device under test. When the clock from the system under test stops, the logic analyzer stops collecting data.

Many logic analyzers will give a “slow or missing clock” warning when the clock is not present. Some logic analyzers require a continuous clock in state mode, while other logic analyzers can tolerate bursty clocks in state mode (bursty clocks go dormant for periods of time, perhaps while a system is idle). Individual logic analyzer module data sheets provide details on clock requirements.

Note: If the error “clock edges too close together” is shown, the logic analyzer saw clock edges closer together than the instrument can reliably capture. In this case, the system clock (or logic analyzer connection to the clock) is suspect and must be corrected. Bad ground connections can cause this issue.

As shown in Figure 1 below, the “run until user stop” trigger macro, or an equivalent macro, is available on most modern logic analyzers. This trigger sets up the logic analyzer to never trigger, and the stop button must be pushed to view captured data.

Figure 1: “Run until user stop” trigger macro from a 16950B Logic Analyzer. Each macro includes a fly-out diagram that describes how the trigger macro works.

In state mode, 'run until user stop' is an effective trigger for capturing a system crash where the system clock dies, because the logic analyzer stops sampling data when the system clock fails. When the “stop” button is hit, the logic analyzer produces two extra clock edges (if needed) so that it can finish processing the data.

Modern logic analyzers have counters that allow up to 32 days between the crash and hitting the stop button. Unfortunately, many older logic analyzer modules had smaller counters that would wrap in seconds, making it virtually impossible to capture data from a system crash by using the stop button. (When the logic analyzer counter wraps, the instrument can no longer determine when events happened relative to one another.)

Set the trigger position to 0% post store to ensure the logic analyzer uses its full memory depth to show events leading up to the crash.

In timing mode, the logic analyzer provides the clock for sampling data and samples events in the system under test asynchronous to the clock on the system under test.

A guideline for timing measurements is to have a sampling rate at least four times the clock frequency to capture meaningful data. Higher sample rates provide the best resolution. (Remember to consider channel-to-channel accuracy and resolution when making measurements across channels.)

When the logic analyzer is in timing mode, pushing a 'stop' button is not an appropriate way to capture system crashes, because the logic analyzer usually fills its memory far faster than a human can react.

Even if the user noticed a system crash quickly, by the time the stop button could be pushed, the analyzer memory would be filled with flat-line post-crash data containing no information on the events leading up to the system crash.
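
To put rough numbers on this, the sketch below applies the four-times guideline and then computes how long a free-running timing mode acquisition takes to fill memory. The 100MHz clock and the 64M-sample memory depth are hypothetical values chosen for illustration, not figures for any particular analyzer.

```python
def min_sample_rate_hz(clock_hz: float, factor: float = 4.0) -> float:
    """Lowest sample rate meeting the 'at least four times the clock' guideline."""
    return clock_hz * factor


def fill_time_s(memory_depth_samples: int, sample_rate_hz: float) -> float:
    """Time for a free-running acquisition to fill memory once."""
    return memory_depth_samples / sample_rate_hz


clock_hz = 100e6              # hypothetical system clock
depth = 64 * 1024 * 1024      # hypothetical memory depth, in samples
rate_hz = min_sample_rate_hz(clock_hz)
window_s = fill_time_s(depth, rate_hz)

print(f"sample rate: {rate_hz / 1e6:.0f} MS/s")
print(f"memory holds about {window_s * 1e3:.0f} ms of activity")
```

With these assumed values the memory holds well under a second of activity, so a stop button pressed even one second after the crash leaves only post-crash samples.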

Transitional timing mode is a special mode available on select logic analyzers. In this mode, the logic analyzer samples on its internal clock but stores data only when the signals transition. Trigger on stop is a viable trigger in transitional timing mode.

Figure 2: Edges too far apart. This is an example from a 16950B Logic Analyzer in timing mode. Notice that the macro includes a diagram and description of the trigger, which makes it simple to fill in the blanks with signal names.

One trigger macro available in timing mode that can be used for capturing traces when the system clock stops is “edges too far apart.” As shown in Figure 2 above, the logic analyzer will trigger if a rising edge of the system clock isn't seen within 4ns of the previous rising clock edge.

This trigger is appropriate for clock frequencies faster than about 270MHz. For slower system clock frequencies, a time period longer than 4ns should be selected.

Rough estimates are fine, since the goal is to get the logic analyzer to notice that the system clock stopped so that it will trigger before the memory is filled with post-crash data of little interest.

For example, even though 4ns is the period of a 250MHz clock, the user needs to allow for the sampling resolution of the logic analyzer making the timing mode measurement.

If the example 4ns timer trigger were used to capture a crash on a system with a 250MHz clock, the logic analyzer would false trigger any time its sampling resolution suggested a violation.
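
The margin between 250MHz and roughly 270MHz can be estimated with a little arithmetic. The sketch below assumes that an asynchronously sampled edge can appear up to one sample period late, so the observed spacing between two healthy clock edges can approach the clock period plus one sample period. The 250ps sample period is an assumption for illustration; the actual timing resolution comes from the analyzer's data sheet.

```python
def min_safe_timeout_ns(clock_hz: float, sample_period_ns: float) -> float:
    """Smallest 'edges too far apart' time that a healthy clock should not violate."""
    clock_period_ns = 1e9 / clock_hz
    return clock_period_ns + sample_period_ns


def slowest_clock_hz(timeout_ns: float, sample_period_ns: float) -> float:
    """Slowest clock that a given timeout can watch without false triggers."""
    if timeout_ns <= sample_period_ns:
        raise ValueError("timeout must exceed the sample period")
    return 1e9 / (timeout_ns - sample_period_ns)


SAMPLE_PERIOD_NS = 0.25  # assumed timing mode resolution (hypothetical)

print(f"250 MHz clock needs a timeout above "
      f"{min_safe_timeout_ns(250e6, SAMPLE_PERIOD_NS):.2f} ns")
print(f"a 4 ns timeout suits clocks faster than about "
      f"{slowest_clock_hz(4.0, SAMPLE_PERIOD_NS) / 1e6:.0f} MHz")
```

Under that assumption a 4ns limit is only safe above roughly 267MHz, which is in line with the "about 270MHz" guidance above.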

Scenario #2: When the system clock keeps running after the system crashes
In this scenario there is a need to be more creative and set up more complex triggers to capture the events leading up to the crash. Waiting until someone hits the 'stop' button after the system crashes is no longer a viable solution even for state mode, and the system clock can no longer be used as a timing mode trigger to capture the crash.

Frequently, the most difficult part of setting up a complex trigger is breaking down the problem. In general, when setting up a complex logic analyzer trigger, do the following:

Step #1. Break down the problem into events that don't happen simultaneously. These correspond to the sequence levels in the logic analyzer trigger.

Step #2. Scan the list of trigger functions to try to find some that match each event identified in the first step. Setting up logic analyzer triggers can be greatly simplified by using predefined trigger functions (macros).

Step #3. If an event doesn't correspond to a predefined trigger macro, break it down into Boolean expressions and a corresponding action for each expression. Each Boolean expression and action pair corresponds to a separate branch within a sequence level.

Boolean expressions evaluate to true or false. For example, PatternA = FFF is either true or false depending on the value of PatternA when it is compared to FFF.
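
The relationship between sequence levels, branches, and Boolean expression and action pairs can be sketched as a small software model. This is an illustration of the concept only, not the analyzer's actual trigger engine; the `Branch` class, the `run_trigger` function, and the `PatternB` label are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Sample = Dict[str, int]


@dataclass
class Branch:
    condition: Callable[[Sample], bool]  # the Boolean expression
    goto_level: Optional[int] = None     # the action: go to a level, or trigger if None


def run_trigger(levels: List[List[Branch]], samples: List[Sample]) -> Optional[int]:
    """Return the index of the sample that triggers, or None if nothing triggers."""
    level = 0
    for index, sample in enumerate(samples):
        for branch in levels[level]:
            if branch.condition(sample):
                if branch.goto_level is None:
                    return index
                level = branch.goto_level
                break  # first matching branch wins for this sample
    return None


levels = [
    # Level 0: wait for PatternA = FFF, then advance to level 1.
    [Branch(lambda s: s["PatternA"] == 0xFFF, goto_level=1)],
    # Level 1: trigger when the (hypothetical) PatternB label reads 0.
    [Branch(lambda s: s["PatternB"] == 0x0)],
]
samples = [
    {"PatternA": 0x123, "PatternB": 0x0},
    {"PatternA": 0xFFF, "PatternB": 0x1},
    {"PatternA": 0x001, "PatternB": 0x0},
]
trigger_index = run_trigger(levels, samples)
print(trigger_index)
```

Each inner list is one sequence level, and each `Branch` is one expression and action pair, so events that cannot happen simultaneously end up in different levels.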

One approach to triggering on system crashes when the clock continues to run is to identify a signal that occurs regularly or within a specified time, and set up a trigger that times out when the event doesn't happen within a time limit.

This could be a signal from a bus specification, an interrupt, a signal that can be injected using a bus exerciser, or any signal or pattern expected on a system running a specific test.

Example: Let's consider an embedded system with DDR2 memory. Refresh is a command made up of three individual signals and is valid only when a chip select line (CS0#) is low.

A refresh must occur at regular intervals to keep the data in memory valid. The static refresh for DDR2 (and previous DRAM technologies) is 64ms. This means that every cell must be refreshed at least every 64ms, or data corruption may occur.

Therefore, a trigger that watches for intervals greater than 64ms between valid refresh commands is a suitable way to capture system crashes on systems with DDR2 memory. See Figure 3 below.
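
The logic of such a time-out trigger can be modeled in a few lines. The sketch below is an illustration only: the sample layout `(time_s, cs0_n, command)` and the `"REFRESH"` symbol name are assumptions made for the example, and a real trigger is configured in the analyzer rather than written as software.

```python
from typing import Iterable, Optional, Tuple

REFRESH_TIMEOUT_S = 64e-3  # the 64ms refresh requirement quoted above

Sample = Tuple[float, int, str]  # (time in seconds, CS0# level, decoded command)


def find_refresh_timeout(samples: Iterable[Sample],
                         timeout_s: float = REFRESH_TIMEOUT_S) -> Optional[float]:
    """Return the time at which the refresh timer expires, or None if it never does."""
    last_refresh = None
    for time_s, cs0_n, command in samples:
        if last_refresh is not None and time_s - last_refresh > timeout_s:
            return last_refresh + timeout_s
        # A command only counts when chip select is low.
        if cs0_n == 0 and command == "REFRESH":
            last_refresh = time_s
    return None


crashing_trace = [
    (0.000, 0, "REFRESH"),
    (0.010, 0, "READ"),
    (0.020, 0, "REFRESH"),
    (0.050, 1, "REFRESH"),  # CS0# high, so this is not a valid command
    (0.090, 0, "READ"),
]
trigger_time = find_refresh_timeout(crashing_trace)
print(trigger_time)
```

The timer restarts on every valid refresh, so the trigger point lands shortly after the last good refresh and the pre-trigger memory holds the activity leading up to the failure.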

Figure 3: An example trigger from a 16950B Logic Analyzer for capturing a DDR2 system crash by watching for too much time between refresh commands.

The use of a symbol table for the 'command' label makes selection of auto refresh/self refresh simple. This concept/algorithm can work for state or timing modes. Comments help to follow the trigger flow and make it easier to modify triggers for similar situations.

The time-out can be tightened to capture the system failure sooner: for DDR2, the average amount of time between auto refresh commands is 7.81 microseconds.

Taking a trace of a healthy system and looking for the normal rate of refresh cycles in a system helps determine a shorter time to use as a time-out indicating a failure for a specific system.
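
As a rough sketch of both ideas, the code below derives the average refresh interval and then picks a time-out from refresh timestamps taken on a healthy system. It assumes 8192 refresh commands per 64ms window, which is what the 7.81 microsecond figure implies; the safety margin of 2 and the timestamps are hypothetical.

```python
from typing import Sequence


def average_refresh_interval_s(retention_s: float = 64e-3,
                               refresh_commands: int = 8192) -> float:
    """Average spacing between refresh commands over one retention window."""
    return retention_s / refresh_commands


def timeout_from_healthy_trace(refresh_times_s: Sequence[float],
                               margin: float = 2.0) -> float:
    """Longest gap seen in a healthy trace, scaled by a safety margin."""
    gaps = [later - earlier
            for earlier, later in zip(refresh_times_s, refresh_times_s[1:])]
    if not gaps:
        raise ValueError("need at least two refresh timestamps")
    return max(gaps) * margin


avg_s = average_refresh_interval_s()
healthy_times_s = [0.0, 7.8e-6, 15.6e-6, 25.0e-6]  # hypothetical measurements
timeout_s = timeout_from_healthy_trace(healthy_times_s)

print(f"average interval: {avg_s * 1e6:.2f} us")
print(f"suggested time-out: {timeout_s * 1e6:.1f} us")
```

Memory controllers may postpone or bunch refresh commands, so the longest gap observed on the real system is a safer basis than the average.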

If the logic analyzer is sampling at regular intervals, a counter can be used in place of a timer: the timer is simulated by counting the number of samples acquired. For example, if the logic analyzer acquires a new sample every 2.5ns (200MHz clock sampling on both rising and falling edges), then 25,600,000 samples represent 64ms.
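
The conversion from a time limit to a sample count is a single division, sketched below for the 200MHz double-edge example. The function name is hypothetical.

```python
def timeout_to_samples(timeout_s: float, sample_period_s: float) -> int:
    """Number of regularly spaced samples that span the given time-out."""
    return round(timeout_s / sample_period_s)


clock_hz = 200e6
sample_period_s = 1 / (2 * clock_hz)  # both edges sampled: 2.5 ns per sample

refresh_limit_samples = timeout_to_samples(64e-3, sample_period_s)
average_gap_samples = timeout_to_samples(7.8125e-6, sample_period_s)

print(refresh_limit_samples)
print(average_gap_samples)
```

At 2.5ns per sample, 64ms corresponds to 25,600,000 samples, and the 7.81 microsecond average refresh interval corresponds to 3,125 samples.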

In this case, a logic analyzer may already have a state mode trigger macro ready for the user to fill in the number of states and signal names. See the example in Figure 4 below. Similarly, in timing mode the trigger macro 'pattern absent for >t time' can be selected.

Figure 4: Too many states between refresh commands. The qualification of CS0# = low was added to indicate a valid command for this example on a 16950B Logic Analyzer module.

Another option is to inject a known pattern or signal that can be easily recognized using a software routine, pattern generator, or bus exerciser. (Many logic analysis systems have built-in pattern generators. In systems with a PCI or PCI Express exerciser, commands can be injected to create known data patterns on the PCI/PCI Express bus, processor bus, or memory bus.)

Scenario #3: Simplify logic analyzer measurements with enhanced features and tools
Different logic analyzer features and tools offer powerful insight for debugging system crashes: deep memory, store qualification, FPGA dynamic probes, compare, application specific probes, eye diagrams on logic analyzers, and integrated logic analyzer and scope traces.

Deep memory. Capturing a trace leading up to a system failure is critical to determining the root cause of the failure. Handshaking, split transactions, pipelining, out-of-order execution, and deep first-in, first-out (FIFO) data storage all mean the flow of data related to a problem can be distributed over thousands, if not millions, of bus cycles.

How often a problem occurs can also vary from once every bus cycle to once in several weeks. Since the actual cause of the crash can occur well before the crash itself, it is helpful to use the deepest memory a logic analyzer module has to offer when setting up a measurement. Many logic analyzer modules allow for after-market memory upgrade licenses.

Store qualification. This capability allows the engineer to use the available acquisition memory more efficiently, rather than filling it with unwanted activity such as idle cycles or wait loops. Store qualification determines if an acquired sample should be placed in memory or discarded.

The simplest method to set up storage qualification is by setting up the default storage, which means “unless a sequence step specifies otherwise, this is what should be stored.”

As an example, a user may want to only store samples if the system isn't in an idle state. By default, the storage is set to store all samples acquired. A user can also set the default storage to store nothing, which means that no samples will be stored unless a sequence step overrides the default storage.

Sequence step storage qualification. This means that within a particular trigger level only certain samples will be stored. The storage qualification applies until a “go to” or “trigger” action is used to leave this sequence step.

This is useful when different storage qualification for each sequence step is required. For example, in a microprocessor system, a user may want to store nothing until ADDR = 1000 and then only store samples with ADDR in the range of 1000 to 2000 for the rest of the measurement.

Establishing a sequence step storage requires the use of an additional branch, and it always overrides the default storage, but only for the conditions specifically mentioned in the sequence step storage. It is important to account for the interaction between default storage and sequence step storage.
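
The ADDR example can be modeled as a two-step filter to show how the two kinds of storage interact. This is a sketch of the concept rather than any analyzer's implementation; the function name is hypothetical, the addresses are treated as plain integers, and storing the sample that causes the step change is an assumption.

```python
from typing import Iterable, List


def store_qualified(addresses: Iterable[int], start: int = 1000,
                    low: int = 1000, high: int = 2000) -> List[int]:
    """Store nothing until ADDR == start, then store only ADDR in [low, high]."""
    stored: List[int] = []
    in_second_step = False
    for addr in addresses:
        if not in_second_step:
            # Step 1: default storage is "store nothing".
            if addr == start:
                in_second_step = True
                stored.append(addr)  # assumed: the matching sample is stored
            continue
        # Step 2: sequence step storage overrides the default for this range only.
        if low <= addr <= high:
            stored.append(addr)
    return stored


acquired = [500, 1500, 1000, 1004, 3000, 1999, 2000, 2001]
kept = store_qualified(acquired)
print(kept)
```

Note that the sample at 1500 is discarded even though it lies in the range, because it arrives before the first step has been left.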

FPGA dynamic probes. These are a result of collaborative development between logic analyzer and FPGA companies (features vary by logic analyzer vendor and FPGA model).

Using an FPGA dynamic probe, a user can view internal FPGA signals without routing each signal to the periphery of the FPGA. Perhaps the user's design has an internal FPGA signal that occurs at a regular interval when the system is operating normally. That signal can be easily accessed to use in a trigger to capture a system crash.

Moving probe points internal to an FPGA used to be time consuming, but using an FPGA dynamic probe can help measure a different set of internal signals within seconds without design changes. It is important to note that FPGA timing stays constant between sets of internal signals that are probed using a dynamic probe.

Feature-rich FPGA dynamic probes automatically map internal signal and bus names from the FPGA design tool to the logic analyzer, eliminating mistakes and saving hours of set-up time.

Compare captured data to reference data on a logic analyzer with a compare display. Set up the measurement to highlight differences between a known-good device under test and the last trace captured, or stop repetitive runs and send an email after a specified number of differences are found in a trace.

A comparison tool may provide the first indication of a problem causing a system crash when troubleshooting failures on systems that are expected to run exactly the same each time they boot (or at least for a test that can be defined or bracketed by markers).
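
The compare-and-stop idea can be sketched in software as follows. This is a simplified model that assumes traces are already aligned sample for sample; the function names and data values are hypothetical.

```python
from itertools import zip_longest
from typing import List, Optional, Sequence


def diff_indices(reference: Sequence[int], captured: Sequence[int]) -> List[int]:
    """Positions where the captured trace differs from the reference trace."""
    return [i for i, (ref, cap) in enumerate(zip_longest(reference, captured))
            if ref != cap]


def first_failing_run(reference: Sequence[int],
                      runs: Sequence[Sequence[int]],
                      max_differences: int) -> Optional[int]:
    """Index of the first run with at least max_differences mismatches, else None."""
    for run_number, captured in enumerate(runs):
        if len(diff_indices(reference, captured)) >= max_differences:
            return run_number
    return None


reference = [0x10, 0x11, 0x12, 0x13]  # trace from a known-good system
runs = [
    [0x10, 0x11, 0x12, 0x13],
    [0x10, 0x11, 0x12, 0x1F],
    [0x10, 0x00, 0x12, 0x00],
]
failing = first_failing_run(reference, runs, max_differences=2)
print(failing, diff_indices(reference, runs[2]))
```

Using `zip_longest` means a captured trace that is shorter or longer than the reference also counts as different at the unmatched positions.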

Application specific probes provide non-intrusive probing plus logic analysis setup, triggering, and decoding for specific applications. They also enable time-correlated analysis and sequenced event triggering across multiple buses, making it easy to follow transactions, data, and packets as data flows through the system. Available application specific probes include (but are not limited to):

PCI Express® (PCIe)
Advanced Switching Interface (ASI)
Serial ATA (SATA) and Serial Attached SCSI (SAS)
Serial RapidIO
Parallel RapidIO
SPI 4.2 (System Packet Interface, POS PHY L4)
InfiniBand
I2C
FlexRay
SPI (Serial Peripheral Interface)

Eye diagrams on logic analyzers can provide insight into signal integrity issues across an entire bus simultaneously in minutes. The results can be viewed as individual signals or as a composite of multiple signals or buses. Eye diagrams can be used to:

* Observe skew between signals
* Find and fix inappropriate clock and signal thresholds
* Identify signal integrity issues related to rise time, fall time, or data valid window widths
* Acquire signal integrity insight rapidly under a wide variety of operating conditions

Expanding the bus can identify individual signals with problems for further parametric analysis.

Integrated logic analyzer and oscilloscope traces in the logic analyzer waveform display help validate and correct logical and timing relationships between the analog and digital portions of a system. Automated wizards and minimal connections using standard LAN and BNC cables simplify measurement set-up.

An incorrect value on a bus captured on a logic analyzer can trigger a scope for investigation of the analog characteristics of the signal. Alternatively, a glitch captured on a scope can trigger the logic analyzer.

As an example, in Figure 5 below, the logic analyzer triggered on a read cycle with suspect data on a DDR3 system. Integrating an external high-performance scope probing the DDR3 system clock, data strobe, and a data signal allows inspection of multiple analog characteristics of the suspect data burst.

Figure 5: The View Scope feature is included on the 16800 and 16900 series Logic Analyzers. In this DDR example, logic signal names are highlighted in purple and scope signal names are highlighted in blue.

The user can zoom out to accurately measure the time between the valid read command and the start of the data burst associated with this command, or zoom in to measure the preamble (the time the data strobe drives low) prior to the data burst.

Logic analyzers offer a wide selection of powerful features and tools that enable insightful techniques for troubleshooting elusive system crashes. When troubleshooting a system crash, the user should follow these steps:

* Gather information about normal system operation
* Recreate the failure conditions
* Consider the nature of the system crash (for example, did the system clock stop?)
* Determine which tools are available to troubleshoot the system
* Plan the logic analyzer trigger using the concepts outlined and adjust accordingly

Conclusion
The concepts, techniques, and logic analyzer tools covered in this article apply broadly for troubleshooting system crashes. When considering the nature of the system crash, it is critical to understand whether the clock stops or continues.

If the system clock dies when the system crashes, modern logic analyzers provide 'trigger on user stop' trigger macros and 'stop' buttons to capture events that lock up a system.

If the system clock keeps running after the system crashes, logic analyzer triggers need to be more creative to capture the events leading up to the system crash. Time-out triggers that capture signals that stop toggling or change from a known behavior allow the user to view events leading up to the crash.

Jennie Grosslight, Technical Marketing Lead Engineer, has 19 years of experience at Agilent Technologies with logic analysis strategy and solutions. Her areas of expertise include system engineering, high-speed hardware design and validation, product marketing, application support, and project management. She earned her B.S.E.E. from the University of Colorado at Colorado Springs in 1989. In addition to spending time with her daughter, Jennie enjoys yoga, hiking, and water sports.
