
How we froze our Software!

The last time we froze our software, things became really cold around the office. We were producing an embedded product with a liquid crystal display (LCD) that was supposed to show the gear positions of the automatic transmission – the so-called prindle (PRNDL).

We took our sample part and dropped the temperature to -40° C (or F.) in a test chamber with the unit powered. Once we had stabilized the part thermally, we observed that it took approximately six seconds for the display to change when shifting simulated gears.

This situation violates a U.S. Department of Transportation Federal Motor Vehicle Safety Standard (FMVSS) requirement for prompt display of information.

Yes, we agree that it wasn't really the software that suffered from a chill but, in the world of embedded software development, the requirements and hardware behavior are never very remote.

The term “firmware” captures the essence of embedded software by implying that the hardware and software are mixed indissolubly as an integrated solution to a product requirement. Considering the impact of these hardware failures on the software in advance is key to securing a robust system in the field.

We have tested automotive embedded software under environmental conditions as varied as the following:

1) Emitting electromagnetic radiation, where we can fry the brains of the micro-controller;

2) Vibrating the sample, where oscillator or clock chips may assume erratic behavior, particularly if they contact the product;

3) Raising or lowering the temperature into regions where components no longer behave as expected;

4) Loading the data bus at high intensity so buffer memories become saturated and dysfunctional;

5) Introducing electronic noise with wholly unexpected results;

6) Pulsing the system with voltage transients or “brown-outs,” lower-than-expected input voltages;

7) Latching the CMOS at voltage boundaries;

8) Cycling EEPROM or flash memory with writes that are not protected against error (a protected-write sketch follows this list); and

9) Inducing code drift through an inadequate watchdog timer strategy (over-reliance on the microprocessor's integrated watchdog timer).
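On point 8, a minimal sketch of what a protected write can look like – store the record with a checksum, read it back, and retry on mismatch. The eeprom_write_byte()/eeprom_read_byte() calls are assumed driver routines, not a specific part's API:

```c
/* Minimal sketch of an error-protected EEPROM record write.
 * eeprom_write_byte()/eeprom_read_byte() are hypothetical driver calls;
 * substitute your part's actual access routines. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

extern void    eeprom_write_byte(uint16_t addr, uint8_t val); /* assumed driver */
extern uint8_t eeprom_read_byte(uint16_t addr);               /* assumed driver */

static uint8_t checksum(const uint8_t *p, size_t n)
{
    uint8_t sum = 0;
    while (n--) sum += *p++;
    return (uint8_t)(~sum + 1u);   /* two's complement: data bytes + checksum == 0 */
}

/* Write data plus a trailing checksum, then read back and verify. */
bool eeprom_write_verified(uint16_t addr, const uint8_t *data, size_t n)
{
    for (int attempt = 0; attempt < 3; ++attempt) {
        for (size_t i = 0; i < n; ++i)
            eeprom_write_byte(addr + (uint16_t)i, data[i]);
        eeprom_write_byte(addr + (uint16_t)n, checksum(data, n));

        uint8_t sum = 0;
        for (size_t i = 0; i <= n; ++i)
            sum += eeprom_read_byte(addr + (uint16_t)i);
        if (sum == 0)              /* record and checksum sum to zero when intact */
            return true;
    }
    return false;                  /* persistent failure: escalate, do not loop */
}
```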

What should have happened with the prindle issue? The following steps would have helped any developer from the start in similar circumstances:

1. Secure the customer requirements specification;
2. Assess whether any regulatory requirements also applied;
3. Translate the requirements specification into a functional analysis system technique (FAST) chart in order to capture the true purpose;
4. Specify the physical parameters for the component;
5. Derive a Design Failure Mode and Effects Analysis (DFMEA) from that information;
6. Use the DFMEA to create a design verification plan and report (DVP&R) to drive the testing;
7. Code the product to the requirements;
8. Test the product to the requirements;
9. Test the product beyond the requirements; and,
10. Iterate steps 7-9 until the code is demonstrably functional.

People forget that embedded software is not merely software that we burn onto a chip; it demands a completely different design approach. For example, memory recovery techniques like garbage collection, which are not temporally deterministic, can cause embedded software to fail to meet scheduler deadlines.
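For contrast, embedded designs typically preallocate fixed-size blocks so that worst-case allocation timing is known. A minimal sketch of a fixed-block pool, with illustrative sizes:

```c
/* Minimal sketch of a fixed-block memory pool: allocation and release run in
 * constant time, unlike garbage collection, so worst-case timing is known.
 * Sizes below are illustrative. */
#include <stddef.h>

#define BLOCK_SIZE   32
#define BLOCK_COUNT  16

typedef union block {
    union block *next;                 /* free-list link while block is unused */
    unsigned char payload[BLOCK_SIZE];
} block_t;

static block_t pool[BLOCK_COUNT];
static block_t *free_list;

void pool_init(void)
{
    for (size_t i = 0; i < BLOCK_COUNT - 1; ++i)
        pool[i].next = &pool[i + 1];
    pool[BLOCK_COUNT - 1].next = NULL;
    free_list = &pool[0];
}

void *pool_alloc(void)                 /* O(1): pop the head of the free list */
{
    block_t *b = free_list;
    if (b != NULL)
        free_list = b->next;
    return b;
}

void pool_free(void *p)                /* O(1): push the block back on the list */
{
    block_t *b = (block_t *)p;
    b->next = free_list;
    free_list = b;
}
```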

Embedded software will not function correctly with advanced micro-controllers when the appropriate chip selects are not set during power-on startup; chip selects define, for both hardware and software, precisely how the pins will behave.
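As a purely hypothetical illustration (the register names and values below are invented; consult your part's reference manual for the real ones), the reset path might configure a chip select before anything touches external memory:

```c
/* Hypothetical sketch of configuring an external chip select early in
 * startup. CSBAR0/CSOR0 and their values are invented for illustration. */
#include <stdint.h>

#define CSBAR0  (*(volatile uint16_t *)0xFFFA48u)  /* CS0 base address (assumed) */
#define CSOR0   (*(volatile uint16_t *)0xFFFA4Au)  /* CS0 options (assumed)      */

void chip_select_init(void)
{
    /* Map CS0 to external SRAM: illustrative base, block size, and timing. */
    CSBAR0 = 0x1003u;   /* base = 0x100000, block size = 64 KB (assumed)    */
    CSOR0  = 0x68F0u;   /* both bytes, read/write, 1 wait state (assumed)   */
}

/* Call chip_select_init() from the reset handler, before main() and before
 * the C runtime copies initialized data into external RAM. */
```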

Additionally, an over-dependence on the internal circuitry of the integrated circuit, specifically internal watchdog timers, can mean a poor recovery from a brown-out or low-power condition.
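A common alternative is to service an external watchdog chip through a port pin, so the supervisor does not share the micro-controller's brown-out fate. The port address, pin, and helper below are assumptions:

```c
/* Minimal sketch of servicing an external watchdog through a port pin,
 * rather than relying solely on the micro-controller's internal timer.
 * WDOG_PORT and WDOG_PIN are assumptions, not a specific vendor's map. */
#include <stdint.h>

#define WDOG_PORT  (*(volatile uint8_t *)0x4000A000u)  /* assumed GPIO data reg */
#define WDOG_PIN   ((uint8_t)(1u << 3))                /* assumed pin           */

static void gpio_toggle(volatile uint8_t *port, uint8_t pin)
{
    *port ^= pin;
}

/* Call from the main loop only after every subsystem reports healthy; the
 * external watchdog chip resets the MCU if the edges stop arriving. */
void watchdog_service(int all_subsystems_ok)
{
    if (all_subsystems_ok)
        gpio_toggle(&WDOG_PORT, WDOG_PIN);
    /* else: withhold the edge and let the external watchdog force a reset */
}
```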

Since the hardware is linked with the software, it is necessary to deal with the consequences of hardware failure using the software. A critique at the embedded system level – the external interfaces of the system – makes it possible to orchestrate the software responses to these failures in a specific, deliberate way.

A very good tool for making this connection is the DFMEA. Correctly executed, the DFMEA allows hardware and software modifications that adjust to system needs, letting the developer manage uncontrolled write cycles, latch-ups, and other such design problems that show up as frozen software.
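One way to carry the DFMEA's findings into the code – a hypothetical sketch, not a prescribed method – is a dispatch table mapping each detected failure mode to one deliberate response, so reactions can be traced back to the DFMEA line by line. The failure modes and handlers below are illustrative only:

```c
/* Hypothetical failure-mode dispatch table: each detected hardware failure
 * maps to one deliberate software response traceable to the DFMEA. */
#include <stddef.h>

typedef enum {
    FAULT_EEPROM_VERIFY,   /* illustrative failure modes */
    FAULT_BROWN_OUT,
    FAULT_BUS_OVERRUN,
    FAULT_COUNT
} fault_t;

typedef void (*fault_handler_t)(void);

static void use_backup_record(void)    { /* switch to a redundant EEPROM copy */ }
static void reinit_after_brownout(void){ /* restore volatile state, re-arm IO */ }
static void flush_and_resync_bus(void) { /* drop stale frames, resynchronize  */ }

static const fault_handler_t fault_table[FAULT_COUNT] = {
    [FAULT_EEPROM_VERIFY] = use_backup_record,
    [FAULT_BROWN_OUT]     = reinit_after_brownout,
    [FAULT_BUS_OVERRUN]   = flush_and_resync_bus,
};

void report_fault(fault_t f)            /* called from the detection paths */
{
    if (f < FAULT_COUNT && fault_table[f] != NULL)
        fault_table[f]();
}
```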

The DFMEA should be informed by a FAST approach, which provides for a logical breakdown of performance expectations for any design or process.

Software testing complexity grows exponentially, but the FAST and DFMEA exercises can add value by defining high-level test requirements which the test team can meet later by using techniques like stochastic testing, combinatorial arrays, and extreme condition tests.

The FAST approach and the DFMEA also provide a list of failure modes which then feed into the failure criteria for the test cases. Multiple test cases may produce the same failure mode – the test team must be aware of these failure modes in order to avoid the logical fallacy known as affirming the consequent.

Affirming the consequent occurs when we assume only one antecedent event could cause the subsequent effect; this fallacy is common and can be deadly to high-quality testing because the team will stop looking for other causes.

Another hardware issue that affects embedded software development is what we call the “bad unit syndrome” or the “worn-out sample disease.” This situation can occur when the sample unit the developer uses for in-circuit emulation and testing becomes damaged through repeated connects/disconnects; that is, the in-circuit emulator pod wears out, the socket becomes damaged, or the actual connectors erode.

The danger in this scenario occurs when the developer writes significant quantities of code to compensate for hardware behavior while believing the issue really resides in the software. Failure to refresh the development hardware can lead to wasted time on the order of weeks, perhaps months.

An even more insidious issue can affect the embedded software. The issue occurs when we attach multiple units to a data bus or a common power source. The hardware laboratory may have already declared the units to exhibit electromagnetic immunity and to present minimal electromagnetic emissions.

The factor that is missing in this melange is the concept of interactive behavior; that is, the whole may exhibit an emergent behavior not exhibited by any single component. Unfortunately, immunity testing has poor reproducibility, and it is possible to find a 'sweet spot' in the test chamber that would allow a marginal unit to pass the test.

If the hardware is crowded together in the final product in such a way that near-field effects occur, the test team may see failures not predicted by individual component testing. If the micro-controller is susceptible to electromagnetic interference, the software will most likely be affected by the schizoid component.

Another case where the hardware can make the software look defective is the famous “intermittent failure.” Intermittent failures are a nightmare because it is often difficult to capture the required data at the exact moment the failure occurs.

Because of the data capture issue, it may be difficult to determine whether the defect is hardware, software, both, or something completely unknown at the time. If the problem occurs during field use, the apparent randomness of its appearance may suggest manufacturing issues, especially if the failure rate is low.
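One mitigation for the data capture problem is to log continuously into a small circular trace buffer and freeze it the instant a fault is detected; the sketch below assumes an invented event-record layout and a free-running timer for timestamps:

```c
/* Minimal sketch of a circular trace buffer for chasing intermittent
 * failures: the firmware logs continuously into a small ring, which is
 * frozen (and later dumped) the moment a fault is detected. */
#include <stdint.h>

#define TRACE_DEPTH 64               /* power of two for cheap wraparound */

typedef struct {
    uint32_t timestamp;              /* e.g. free-running timer count (assumed) */
    uint16_t event_id;
    uint16_t data;
} trace_entry_t;

static trace_entry_t trace_ring[TRACE_DEPTH];
static uint32_t      trace_head;
static volatile int  trace_frozen;

void trace_log(uint32_t now, uint16_t event_id, uint16_t data)
{
    if (trace_frozen)
        return;                      /* preserve the history leading up to the fault */
    trace_entry_t *e = &trace_ring[trace_head & (TRACE_DEPTH - 1)];
    e->timestamp = now;
    e->event_id  = event_id;
    e->data      = data;
    ++trace_head;
}

void trace_freeze_on_fault(void)     /* call from the fault detection path */
{
    trace_frozen = 1;                /* ring now holds the last TRACE_DEPTH events */
}
```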

Wear-out of some critical hardware components can also look like a software failure. One prime example is the durability of a given cell in an EEPROM or flash memory. At some number of writes, the cell will begin to deteriorate or cease to function completely.

The software will look like it is no longer saving the data, and the data itself may appear to be corrupted. The solution in the case of the EEPROM is to provide an algorithm that begins storing the data in other cells.
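A minimal sketch of that rotation idea: writes of one logical record are spread across a ring of physical slots, with a sequence byte marking the newest copy. The slot geometry and the eeprom_* driver calls are assumptions, not a specific part's API:

```c
/* Sketch of simple wear leveling: rotate writes of one logical record
 * across SLOT_COUNT physical slots so no single cell absorbs every cycle. */
#include <stdint.h>

#define SLOT_COUNT 8                 /* physical copies of one logical record */
#define SLOT_BASE  0x0100u           /* assumed EEPROM base address           */
#define SLOT_SIZE  8u                /* bytes: 1 sequence byte + 7 payload    */

extern void    eeprom_write_byte(uint16_t addr, uint8_t val); /* assumed */
extern uint8_t eeprom_read_byte(uint16_t addr);               /* assumed */

/* The newest slot holds the highest sequence byte, compared modulo 256. */
static uint8_t newest_slot(void)
{
    uint8_t best = 0, best_seq = eeprom_read_byte(SLOT_BASE);
    for (uint8_t s = 1; s < SLOT_COUNT; ++s) {
        uint8_t seq = eeprom_read_byte(SLOT_BASE + s * SLOT_SIZE);
        if ((uint8_t)(seq - best_seq) < 128u) {  /* seq is newer, mod 256 */
            best = s;
            best_seq = seq;
        }
    }
    return best;
}

void record_write(const uint8_t payload[7])
{
    uint8_t  cur  = newest_slot();
    uint8_t  next = (uint8_t)((cur + 1u) % SLOT_COUNT);  /* rotate to a fresh slot */
    uint16_t addr = SLOT_BASE + next * SLOT_SIZE;
    uint8_t  seq  = (uint8_t)(eeprom_read_byte(SLOT_BASE + cur * SLOT_SIZE) + 1u);

    eeprom_write_byte(addr, seq);
    for (uint8_t i = 0; i < 7u; ++i)
        eeprom_write_byte(addr + 1u + i, payload[i]);
}
```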

With automotive odometers the trend has generally been to provide redundant cells for error-checking, but this approach can easily become one of “who is watching the watchers watch the watcher.”

Flash presents its own problems, since wise use must be made of the boot block in flash memory so the remainder of the memory can be overwritten when a new version of the software needs to be installed. The boot loader should not be directly accessible from the application code designed for the product.
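One simple fence, assuming a flash map with the boot block at the bottom and a hypothetical flash_program_word() driver, is to bound-check every application-level write against the reserved region:

```c
/* Minimal sketch of fencing the boot block off from application-level flash
 * writes. The memory map (boot block in 0x0000-0x3FFF) and the low-level
 * flash_program_word() call are assumptions for illustration. */
#include <stdint.h>
#include <stdbool.h>

#define BOOT_BLOCK_END 0x00003FFFu   /* assumed: 16 KB boot block from address 0 */

extern void flash_program_word(uint32_t addr, uint32_t val); /* assumed driver */

/* Application-side flash write: refuses any address inside the boot block,
 * so a stray pointer or a corrupted update image cannot brick the loader. */
bool app_flash_write(uint32_t addr, uint32_t val)
{
    if (addr <= BOOT_BLOCK_END)
        return false;                /* boot block is off limits to the app */
    flash_program_word(addr, val);
    return true;
}
```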

Perhaps the most dangerous of all the software/hardware diseases is the condition where we have a failure mode with a minimal failure rate; that is, the probability that the event occurs is low enough that the likelihood of seeing the event with any affordable (in time and money) sample size is small.

These scenarios are dangerous because the test team is unlikely to see the event before the firmware is released to manufacturing and used out in the field. An approach that can be effective in these situations is combinatorial testing.

Combinatorial testing uses tools to help develop arrays of test cases that attack the problem of sample sizes and probability by efficiently stimulating the inputs. A typical example of a combinatorial test would be pairwise testing, wherein every pair of input settings is tested.

Obviously, this approach works for pairwise problems, but not higher-order issues. We can also build an array of three-wise groups for testing. In fact, using the designed experiment approach with orthogonal arrays, we can efficiently assess the software. The method is not complete, but it provides a rational alternative to doing nothing.
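To make this concrete, here is a small sketch (with hypothetical factors and a stub harness call) that drives a unit under test with an L4(2^3) orthogonal array: four runs cover every pairwise combination of three two-level factors, where exhaustive testing would need eight.

```c
/* Sketch of combinatorial testing with an L4(2^3) orthogonal array. The
 * factors and the run_test_case() hook are hypothetical stand-ins. */
#include <stdio.h>

enum { RUNS = 4, FACTORS = 3 };

/* Each row is one test case; each column is a factor at level 0 or 1
 * (e.g. supply voltage low/high, temperature cold/hot, bus load min/max). */
static const int l4_array[RUNS][FACTORS] = {
    {0, 0, 0},
    {0, 1, 1},
    {1, 0, 1},
    {1, 1, 0},
};

/* Hypothetical harness hook: returns 0 on pass, nonzero on failure. */
static int run_test_case(int voltage, int temperature, int bus_load)
{
    (void)voltage; (void)temperature; (void)bus_load;
    return 0;   /* stand-in for stimulating the real unit under test */
}

int main(void)
{
    for (int r = 0; r < RUNS; ++r) {
        int result = run_test_case(l4_array[r][0], l4_array[r][1], l4_array[r][2]);
        printf("run %d: %s\n", r + 1, result == 0 ? "pass" : "FAIL");
    }
    return 0;
}
```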

An additional approach when the probability of a given failure is low is to use stochastic testing; that is, to randomly stimulate the inputs. The approach can be mechanical when using automated testers, or we can rely on the 'gut' instinct of the test engineer to feel out the problem.
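A minimal sketch of the automated form, assuming hypothetical stimulate_input() and check_outputs() harness hooks: the pseudo-random stimulation is seeded so that any failing run can be replayed exactly.

```c
/* Minimal sketch of stochastic testing: seeded pseudo-random stimulation of
 * the inputs, so a failing run is reproducible from its seed alone. */
#include <stdio.h>
#include <stdlib.h>

extern void stimulate_input(int channel, int value);  /* assumed harness hook */
extern int  check_outputs(void);                      /* assumed: 0 == healthy */

int stochastic_test(unsigned seed, int iterations, int channels, int max_value)
{
    srand(seed);                       /* fixed seed makes the run reproducible */
    for (int i = 0; i < iterations; ++i) {
        int channel = rand() % channels;
        int value   = rand() % (max_value + 1);
        stimulate_input(channel, value);
        if (check_outputs() != 0) {
            printf("failure at iteration %d (seed %u)\n", i, seed);
            return -1;                 /* replay with the same seed to reproduce */
        }
    }
    return 0;
}
```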

It is clear, then, that taking an already complex issue such as software development and adding hardware dependency to the stew increases the difficulty of testing the software, the hardware, and the unique combination of both. Because the hardware and software are mixed inextricably, it is indeed possible to freeze your software!

Kim H. Pries is Director of Product Integrity and Reliability/Quality Management System at Stoneridge Electronics – North America, where he is responsible for all test and evaluation activities including laboratory, calibration, hardware-in-the-loop software testing, and automated test equipment. Jon Quigley is EE Systems and Verification Manager at Volvo 3P.
