Test-osterone! A case load of tests
Testing is a tough task. It calls for a different mindset than that required during design. A diligent coder seeks perfection in the product and strives to have the product released as soon as possible--and often overlooks the inherent conflict between these two goals. The tester seeks flaws in the product and strives to prevent it from being released by delivering a never-ending list of bugs to the design team.
A good tester, like a murder suspect, must have opportunity, motive, and intent. Opportunity arises only if the design or test team provides the tools, whether homegrown or third-party, to examine the software's behavior. Motive comes when you reward the tester for finding bugs instead of making life easy for coders. On some small teams the coder and the tester are the same person. This arrangement is destined to fail because the motive is not there--the coder is punished for finding bugs, because it's also his responsibility to fix them. Finally, malice of intent is that little extra factor that makes a tester smile wickedly to himself when he knows he's found a bug that will cause the coders some lost sleep. Some people get a kick out of seeing something break, rather than making something work. Not all engineers possess this nasty streak, but good testers do.
In this article I'll describe some of the tests that apply well to a variety of embedded systems. Whether you test your own work or someone else's, add these tests to your checklist. If you're primarily a systems designer, you might want to build some of these test capabilities into your project to give your favorite testers more work; of course, be prepared for that to backfire on you.
Where to start
Before you test anything else you should test the parts of the system that detect and log errors. This step may sound obvious but it's tempting to start testing the features that are most visible to the user and to consider the error-handling mechanism as part of the housekeeping you'll get to later. Resist the temptation; if you later discover circumstances where errors are not reported or recorded, you'll be left wondering if your earlier tests were valid. If the error-reporting mechanism doesn't work at all, it would probably get spotted early enough anyway, but a more subtle failure might go unnoticed.
Consider the example of a stack-overflow detection mechanism with a bug that never records or reports when the stack limit is violated. If you don't test this feature until late in the test phase and then discover it's faulty, you'll realize that any tests you've performed may have violated the stack limits but gone undetected. A gross stack failure would probably be obvious anyway, but if the limit were breached by only a few bytes the effect could be too subtle to crash the system. If the stack-overflow check isn't functioning properly you've missed an opportunity to catch this bug. If you're curious how to check for stack overflow conditions, have a look at my article "Safe Memory Utilization."1
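One way to test the detector before trusting it is to provoke the fault on purpose. The following is a minimal sketch, assuming a guard-word scheme in which a known pattern sits just beyond the stack limit and is checked periodically. The names `guard_zone`, `guard_check`, and `guard_inject_small_overflow` are hypothetical, and an ordinary array stands in for the real stack boundary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_WORDS   4u
#define GUARD_PATTERN 0xDEADBEEFu

/* Hypothetical stand-in for the words just beyond the stack limit. */
static uint32_t guard_zone[GUARD_WORDS];
/* Hypothetical stand-in for the system's error log. */
static unsigned overflow_reports;

void guard_init(void)
{
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        guard_zone[i] = GUARD_PATTERN;
    }
    overflow_reports = 0;
}

/* Periodic check: logs and returns true if any guard word was overwritten. */
bool guard_check(void)
{
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        if (guard_zone[i] != GUARD_PATTERN) {
            overflow_reports++;
            return true;
        }
    }
    return false;
}

/* Test hook: simulate a breach of only one byte, the subtle case. */
void guard_inject_small_overflow(void)
{
    guard_zone[0] ^= 0xFFu;
}

unsigned guard_report_count(void)
{
    return overflow_reports;
}
```

The test is then to call the injection hook and confirm that the next check both detects the breach and records it. If either step fails, every later test result is suspect.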
While prototype systems are still on the developer's desk they're usually powered differently than they will be in the final design. Often the system is reset by the debugging tool or emulator, so it's not really powered-up cold. When a finished unit is built, it's finally possible to see what happens when it's turned on for real. At this stage you'll inevitably find that turning the unit on and off several times in quick succession causes some previously unseen problem.
Designers give careful thought to gracefully surviving a power-down sequence--possibly allowing for storing settings in nonvolatile storage or turning off certain outputs once an interrupt informs the system that only a few milliseconds of power remain. The same thought isn't always applied to the vulnerable state of the system during the power-up sequence.
Many systems test the watchdog during start-up by deliberately letting it expire and reset the system. So the start-up sequence has two resets, one caused by the system powering up and a second by the watchdog. Typically, a flag in nonvolatile memory indicates that a watchdog test is in progress. This flag enables the code to decide when the watchdog test is complete. After a reset, if the flag is not set, the flag is set and the watchdog test is performed. If the flag is set after reset, the watchdog test is complete and the rest of the start-up sequence is executed. This arrangement can be vulnerable if power is lost while the flag is set. Turning off power will probably lose all data in RAM, but the flag in nonvolatile storage survives and will be read on the next reset. If the software sees from the microprocessor's status bits that the current reset was caused by a power-on, but the flag indicates that a watchdog test is in progress, the software might falsely report a watchdog failure.
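The decision described above can be sketched as a small function. This is only an illustration, assuming the reset cause is available from the processor's status bits and that the flag lives in nonvolatile memory; the type and function names are hypothetical. Here a flag found set after a power-on reset is treated as stale, which is one possible way to avoid the false failure report.

```c
#include <stdbool.h>

typedef enum { RESET_POWER_ON, RESET_WATCHDOG } reset_cause_t;

typedef enum {
    START_WATCHDOG_TEST,              /* arm the test and wait for the reset */
    CONTINUE_STARTUP,                 /* watchdog proved it can reset us     */
    REPORT_UNEXPECTED_WATCHDOG_RESET  /* watchdog fired outside the test     */
} startup_action_t;

/* test_flag points at the (hypothetical) nonvolatile test-in-progress flag. */
startup_action_t decide_startup(reset_cause_t cause, bool *test_flag)
{
    if (cause == RESET_POWER_ON) {
        /* A flag already set here is stale: power was lost mid-test.
           Start the watchdog test again rather than reporting a failure. */
        *test_flag = true;
        return START_WATCHDOG_TEST;
    }
    if (*test_flag) {
        /* Watchdog reset while the test was in progress: test passed. */
        *test_flag = false;
        return CONTINUE_STARTUP;
    }
    return REPORT_UNEXPECTED_WATCHDOG_RESET;
}
```

A test for this logic should include the awkward combination explicitly: a power-on reset with the flag already set.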
Performing many power cycles of varying duration will probably exercise the case I just described, but this error might get missed if the failure window is short. If you want to perform a test that explicitly targets the watchdog test, one approach is to change the watchdog timeout to a longer period. If you can configure it to several seconds the test will take so long that it's easy to time the turn-off to happen during this extended window. Later we'll see other cases of slowing down the system in order to facilitate testing.
Beware your own tests
A few years ago I was supporting the clinical trials of a medical device. One day, one of the clinicians commented that the time and date displayed by the device were incorrect. I didn't pay the comment much heed, since it was quite possible that another user in the hospital could have changed the date using one of the menu options and left it in some silly state. Then a similar report came from a different site. The second site reported a date that was similar to the first--a date in late 1968. This was a bit of a coincidence and required further investigation. The date mentioned at both sites was just after my own birthday--Friday the 13th of December 1968. Ominously, there was a section of the code that referenced that date.
Back when we were conducting the power-on tests, we attempted to read from and write to as many peripherals as possible, including the system's real-time clock. It was trivial to read the clock, but not so trivial to establish if the data read was valid. The test we devised was to read the current time and date and store them in RAM. We then wrote a fixed time and date to the real-time clock, which I vainly chose as my birthday. Once that time and date were correctly read back, the copy of the correct time and date stored in RAM was restored to the device.
In hindsight, the weakness of the test should have been obvious. Turning the device off during the test could leave the clock with the wrong date and time, since the RAM copy would never get restored. We naively assumed that this time window was so small that it was a one-in-a-million chance that a user would turn the device off at this time. Unfortunately, the first few seconds of operation are quite a likely time for the device to be turned off. Frustrated users will sometimes cycle the on/off switch several times. Wiggling the battery while inserting it may lead to a number of on-and-off periods before the batteries settle into place, and those periods of power will be very short. There are many other similar scenarios, which means you have to plan for the device to be turned off after very short periods of operation.
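One alternative that avoids the vulnerable window altogether is a read-only plausibility check. This is not the test we used, and it proves less, since it never exercises the write path. The sketch below assumes a hypothetical `rtc_time_t` structure filled in by the clock driver and simply checks that every field is in range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of a time read from the real-time clock. */
typedef struct {
    uint16_t year;
    uint8_t  month;   /* 1..12 */
    uint8_t  day;     /* 1..31 */
    uint8_t  hour;    /* 0..23 */
    uint8_t  minute;  /* 0..59 */
    uint8_t  second;  /* 0..59 */
} rtc_time_t;

static bool is_leap_year(uint16_t y)
{
    return (y % 4 == 0 && y % 100 != 0) || (y % 400 == 0);
}

/* Caller must pass a month in the range 1..12. */
static uint8_t days_in_month(uint16_t year, uint8_t month)
{
    static const uint8_t days[12] =
        { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 };
    if (month == 2 && is_leap_year(year)) {
        return 29;
    }
    return days[month - 1];
}

/* Read-only check: true if every field holds a value the clock could
   legitimately report. Nothing is written to the clock. */
bool rtc_time_plausible(const rtc_time_t *t)
{
    if (t->month < 1 || t->month > 12) {
        return false;
    }
    if (t->day < 1 || t->day > days_in_month(t->year, t->month)) {
        return false;
    }
    return t->hour < 24 && t->minute < 60 && t->second < 60;
}
```

Because nothing is written, turning the device off during the check cannot leave the clock in a silly state.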
Some power supplies accept commands sent over a serial port via Hyperterm or a similar terminal emulation program. Simple commands can turn the power on and off, and a simple script can power-cycle the unit over and over, with the delay varied. You may want to vary the off-time as well as the on-time if you want to test how the capacitance of the system may keep some parts of the system alive while others are fully reset. I've worked on multiprocessor systems where a short loss of power will reset some processors and not others, which led to all sorts of problems when the processors tried to resynchronize.
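A sketch of how such a script might build its list of delays follows. The commands that actually switch the supply are specific to each power supply, so they are left out; the `power_cycle_t` type and the function name are hypothetical. The function pairs every on-time with every off-time so that both are varied.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t on_ms;   /* how long power stays applied */
    uint32_t off_ms;  /* how long power stays removed */
} power_cycle_t;

/* Fill out[] with every combination of on-time and off-time from min_ms
   to max_ms in steps of step_ms. Returns the number of entries written,
   never more than capacity. Assumes max_ms + step_ms does not overflow. */
size_t build_power_cycle_sweep(power_cycle_t *out, size_t capacity,
                               uint32_t min_ms, uint32_t max_ms,
                               uint32_t step_ms)
{
    size_t n = 0;

    if (step_ms == 0 || min_ms > max_ms) {
        return 0;
    }
    for (uint32_t on = min_ms; on <= max_ms; on += step_ms) {
        for (uint32_t off = min_ms; off <= max_ms; off += step_ms) {
            if (n == capacity) {
                return n;
            }
            out[n].on_ms = on;
            out[n].off_ms = off;
            n++;
        }
    }
    return n;
}
```

The script then walks the list, switching power on and off for the listed durations and checking after each cycle that the unit restarts cleanly.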
Be aware that on some battery-powered devices, removing power may not be equivalent to turning the device on and off with the on/off button. A separate mechanism should be designed to exercise such soft resets.
At the other end of the scale, a system may be left running for a very long time. The mechanical and electronic components of the system may be subjected to life testing. This test typically involves running the system continuously for a number of months in order to establish which components will fail first. In my experience this testing starts early in the project, since the test has, by definition, a long duration. This testing may start with a very early version of the software, which is not a good environment in which to identify software issues such as memory leaks; the leaks may simply not have been written yet.
By the time the software is complete, there may not be enough time on the calendar to do a lengthy test of running the software, so some tests must exaggerate the aging effect. The challenge of leaking memory is important and I covered techniques for managing that problem in earlier issues of Embedded Systems Programming.2, 3
Finding memory leaks rarely depends on simulating lots of calendar time. Most leaks require a particular path through the code to cause a small leak that eventually accumulates into a resource shortage from which the system cannot recover. Usually they're discovered by checking for small leaks over a short period, rather than trying to diagnose the problem after all the memory has been exhausted, which may happen some considerable time after the original leak occurred.
Other bugs are difficult to find unless you can fast-forward the calendar. If your system knows the real time and date you can perform obvious checks such as setting the calendar to just before New Year's Eve to observe how the rollover is handled. Although the predictions of technological collapse at the end of 1999 turned out to be more figment than fact, still plenty of bugs were rectified. A year later I was involved in troubleshooting a problem that arose on the first day of 2001--a bug that was introduced while rectifying a Y2K problem! This tells us that calendar testing is important for plenty of rollover dates other than 2000.
Of course, your system could possibly suffer a Y2K problem even now, but it would only manifest itself if the calendar were rewound to that date. Most users would never rewind the date on a system, so the next test on my list is to try to set the date to some time in the past. Ideally the system would prevent the user from setting a date into the past, but of course the system can't do so without an independent reference. What it can do is disallow any date before the start of the project.
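A minimal sketch of such a floor on the date follows. The earliest date shown is an arbitrary placeholder for the start of the project, and the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t year;
    uint8_t  month;
    uint8_t  day;
} cal_date_t;

/* Placeholder for the first day of the project; no date earlier than
   this can be genuine. */
static const cal_date_t EARLIEST_DATE = { 2005, 1, 1 };

/* Returns negative, zero, or positive as a is before, equal to, or after b. */
static int date_compare(const cal_date_t *a, const cal_date_t *b)
{
    if (a->year != b->year) {
        return (int)a->year - (int)b->year;
    }
    if (a->month != b->month) {
        return (int)a->month - (int)b->month;
    }
    return (int)a->day - (int)b->day;
}

/* True if the user may set the calendar to the requested date. */
bool date_setting_allowed(const cal_date_t *requested)
{
    return date_compare(requested, &EARLIEST_DATE) >= 0;
}
```

The test cases worth writing are the dates on either side of the boundary, plus at least one well into the past.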
You can test many future rollover dates other than New Year's Eve. Test dates where calibrations or licenses may expire. Some systems I've worked on remind the user when service is due. Once the service occurs this counter is reset. Setting the calendar forwards to simulate a system that never got serviced may discover rollover problems in the service timer.
Other aging problems aren't calendar related. If you have nonvolatile memory that's updated on a regular basis, your system may have a time limit based on wearing out the nonvolatile memory. On some types of nonvolatile storage, the limit on write cycles applies to each cell, so the life can be extended by using one location to store a frequently updated piece of data, then moving to another location when the first one is reaching its design limit. This technique is known as wear leveling and most flash file systems implement some variation of this method to increase the life of the flash-memory devices.
Changing the location of data within nonvolatile memory adds to the complexity of the code and therefore means more testing. These location changes may only occur every few months. Accelerating the frequency of writes allows us to simulate the passing of a greater amount of time, in order to test many location transitions. I recently ran such a test over a weekend that simulated 15 years of operation by increasing the write frequency from once every five minutes to once every 100 milliseconds.
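The arithmetic behind that kind of acceleration is simple enough to check in a few lines. The function names here are hypothetical; the figures in the usage note are the ones quoted above.

```c
#include <stdint.h>

/* How many times faster than normal the accelerated test writes.
   Returns 0 if the test interval is zero. */
uint32_t acceleration_factor(uint32_t normal_interval_ms,
                             uint32_t test_interval_ms)
{
    if (test_interval_ms == 0) {
        return 0;
    }
    return normal_interval_ms / test_interval_ms;
}

/* Days of normal operation simulated by test_hours of accelerated running. */
double simulated_days(double test_hours, uint32_t factor)
{
    return test_hours * (double)factor / 24.0;
}
```

Going from one write every five minutes (300,000ms) to one every 100ms gives a factor of 3,000. At that rate 48 hours of running simulates 6,000 days, a little over 16 years, so 15 years fits comfortably inside a weekend.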
Another interesting problem that arises in a system that runs for a long time is that repeated mathematical operations can accumulate floating-point errors. The following statement introduces a very small error to the value of x:
x = x + (1.0/3.0);
Since the value of 1/3 cannot be represented precisely by the system's floating-point representation, a small error will be added to the value of x. As a percentage, this error is minuscule (unless the previous value of x was in the region of -1/3). But if this statement is repeated 100 times per second on a system that runs for days, the error will accumulate and may eventually become significant. This kind of accumulation is what happened in the Patriot missile system, where a time increment that could not be represented exactly in binary was accumulated over many hours, causing the system to misjudge the position of its target.4 How badly it misjudged depended on how long the system had been running. In one case the system had been running for about 100 hours before use, leading to an accumulated error that represented more than half a kilometer of flying time.
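The effect is easy to reproduce on a desktop. The sketch below, with hypothetical function names, repeats the addition in single and in double precision and measures how far each total drifts from the ideal value of n/3. It assumes IEEE 754 arithmetic.

```c
/* Add one third to a single-precision total n times. */
float accumulate_float(long n)
{
    /* volatile forces the total to be rounded to float on every step */
    volatile float x = 0.0f;
    const float third = 1.0f / 3.0f;

    for (long i = 0; i < n; i++) {
        x = x + third;
    }
    return x;
}

/* The same loop in double precision, for comparison. */
double accumulate_double(long n)
{
    volatile double x = 0.0;
    const double third = 1.0 / 3.0;

    for (long i = 0; i < n; i++) {
        x = x + third;
    }
    return x;
}

static double magnitude(double v)
{
    return v < 0.0 ? -v : v;
}

/* Absolute difference between the accumulated total and the ideal n/3. */
double drift_float(long n)
{
    return magnitude((double)accumulate_float(n) - (double)n / 3.0);
}

double drift_double(long n)
{
    return magnitude(accumulate_double(n) - (double)n / 3.0);
}
```

Calling `drift_float(8640000L)` simulates one day at 100 additions per second. In single precision the drift is far larger than the error in any one addition would suggest: once x passes about two million, adjacent representable values are 0.25 apart, so each addition of one third is rounded to 0.25. Double precision postpones the problem rather than removing it.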
Natural gas is odorless. By the time it's piped to you as domestic heating gas, a smelly additive called mercaptan has been added to it. Mercaptan does nothing to reduce the frequency of gas leaks but it does mean that the leak can be detected. If the gas remained odorless, many leaks would go undetected in the early stages and lead to far greater problems later. In the case of natural gas, people recognized the danger of invisibility and solved it by taking something invisible and making it detectable.
Many embedded systems deal with invisible forces. The failures of the Therac-25 irradiation system are well documented.5,6 This device killed and injured a number of patients undergoing cancer treatment. At the time of the incident, no one even recognized that the damage had occurred--many patients didn't suffer adverse symptoms until a number of days later. Because radiation is not perceivable by any human sense, the user had no obvious way to detect the fault as one could if, for example, a faucet delivered too much water or a conveyor belt moved too fast.
In software, you should look for ways to make invisible properties detectable to the testers. Many values are maintained internally to the system but are invisible to the user. If the value is wrong, it cannot be seen, unless some invasive tool such as a debugger is used. Sometimes one value is maintained internally, and then it is modified for display to the user. Consider the case where the measured oxygen percentage is maintained internally. Before display, that value is clipped to ensure that it's in the range of 21% to 100%, while the original calculated value remains invisible.
But the calculated value is a good indicator of how much the sensor has drifted. While the user may only need to see a maximum of 100, the tester wants to see the calculated value. If the interface is graphical, test builds could display both values; it's far more effective than allowing the value to be queried via a serial port or accessed at a breakpoint, because the value is then seen in all tests.
Making such a change to your system can be as simple as taking a few of the internal values of the system and making them "visible" on a serial port. If the user interface is flexible enough, you may be able to make several internal values permanently visible. Such a tactic is often trivial to accomplish on a graphical interface without having much impact on the rest of the interface.
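As an illustration using the oxygen example, a test build might format the display line so that the calculated value appears next to the clipped one. This is a sketch only; the function names, the `test_build` parameter, and the layout of the text are all hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

#define O2_MIN_PERCENT 21.0
#define O2_MAX_PERCENT 100.0

/* Clip the calculated oxygen percentage to the range shown to the user. */
double clip_oxygen(double calculated)
{
    if (calculated < O2_MIN_PERCENT) {
        return O2_MIN_PERCENT;
    }
    if (calculated > O2_MAX_PERCENT) {
        return O2_MAX_PERCENT;
    }
    return calculated;
}

/* Format the oxygen line for the display. In a test build the calculated
   value is shown beside the clipped one. Returns what snprintf returns:
   the length the full text needs, not counting the terminator. */
int format_oxygen(char *buf, size_t size, double calculated, int test_build)
{
    double shown = clip_oxygen(calculated);

    if (test_build) {
        return snprintf(buf, size, "O2 %.0f%% (raw %.1f)", shown, calculated);
    }
    return snprintf(buf, size, "O2 %.0f%%", shown);
}
```

With a calculated value of 103.4 the user's line reads "O2 100%" while the test build shows "O2 100% (raw 103.4)", so sensor drift is visible in every test rather than only when someone thinks to query it.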
One of the most frequent failure modes for electromechanical systems comes after they've been serviced and the service technician fails to reconnect all cables. So as part of the testing of the system, it's wise to open the box and try leaving cables or mechanical interconnects disconnected. In the lung-ventilation systems I work on this test often means disconnecting tubes that carry a flow of air, oxygen, or water. In each case the system should fail in such a way that the cause can be identified. The results of such a test provide valuable feedback for the service manual.
In an ideal world this test would be a subset of testing all possible system failures, but in practice it's rarely going to be possible to simulate the failure of each individual resistor and solder point. Specifying a test for each interconnection is a compromise where you catch a number of cases that have a high probability of happening in the field.
A list of every character string that the system can display provides the basis for a coverage test. Design a test that causes each of those messages to appear on the display. The list may be trivial to create if all of the strings are stored in a single file, as is often the case for software that needs to be translated into other languages. As each string is displayed, the tester checks it off the list. It may be that a cross section of other tests will cause them to be displayed anyway, but there is value in a single test dedicated to this purpose.
If during testing you're checking the strings, you should also check for spelling, grammar, and consistency errors. For more advice on reviewing strings, see the column I wrote on wordsmithery.7 Depending on the type of display you may also need to check the size of the string. A string that's too long may partially vanish off the right side of the display or overlap another string. Columns or tables of text may lose their alignment if one of the strings exceeds its boundary.
Once this test is written, the same test can be reused to validate each language. In some cases I've automated the key sequences for the translators. Using a special build of the software, the user may press one key to trigger the next sequence of "faked" keys. Exact sequences can be reproduced with minimal effort for the tester. The tester only needs to press a single key to generate the sequence of keys and events that lead to the next screen for review. The translator or tester gets to view each screen without concerning himself with how the screen was generated. This approach is good when the concern is solely with the appearance and meaning of the strings; for more general testing it's better if the tester performs each step manually.
Some strings are formatted to include numbers, and so their length can vary at runtime. Such strings should be displayed in their shortest and longest forms, which you can achieve by constructing tests where the maximum and minimum numeric values are displayed. Similar logic applies if another string is inserted into the string. Consider a string that inserts the month, for example "Current Month: January" where "January" could be replaced by any other month. Include a test to display "May" and "September" to exercise the shortest and longest versions of this string. Be aware that if the test is reused in a different language, the longest month may be quite different.
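The worst case can be found mechanically rather than by inspection, which matters once the month names come from a translation table. The sketch below uses hypothetical names and measures length in bytes, which matches display columns only for single-byte encodings and fixed-width fonts.

```c
#include <stdio.h>

const char *const MONTHS_EN[12] = {
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
};

/* Length of the formatted line for one month, without printing it. */
int rendered_length(const char *month)
{
    return snprintf(NULL, 0, "Current Month: %s", month);
}

/* Longest formatted line across a table of month names. */
int worst_case_length(const char *const months[], int count)
{
    int longest = 0;

    for (int i = 0; i < count; i++) {
        int len = rendered_length(months[i]);
        if (len > longest) {
            longest = len;
        }
    }
    return longest;
}
```

Comparing the result against the width of the display field for each language's table flags any translation that will not fit before a tester has to find it on screen.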
This test will usually discover if there are strings compiled into the system that can never be displayed. These strings should be eliminated to save memory and reduce the translation effort.
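In a test build the checklist itself can be kept by the software. A minimal sketch, assuming each string has a numeric identifier and that the display code can call a hook whenever it draws one; the identifiers and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical string identifiers; a real table would be far longer. */
enum { STR_READY, STR_LOW_BATTERY, STR_SENSOR_FAULT, STR_COUNT };

static bool shown[STR_COUNT];

/* Called by the display code each time a string is drawn. */
void mark_displayed(int id)
{
    if (id >= 0 && id < STR_COUNT) {
        shown[id] = true;
    }
}

/* Number of strings that no test has yet caused to appear. */
size_t count_never_displayed(void)
{
    size_t missing = 0;

    for (int i = 0; i < STR_COUNT; i++) {
        if (!shown[i]) {
            missing++;
        }
    }
    return missing;
}
```

Strings still unmarked at the end of the test run are either untested or impossible to display, and each one deserves an explanation.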
Strings that are used to report obscure errors may not be trivial to display, and so the tester will have to construct a method of forcing those error conditions in order to see the strings. These test methods exercise many paths that rarely occur during normal use, and the mechanism of inducing or faking those failures here will be useful in other test cases.
A related user-interface test is to press all invalid keys in each state of the user interface. Often the invalid key causes a beep and no change to the user interface, or maybe a message appears to tell the user why that key is not currently valid.
Because most of the tests are written to test the valid key actions, a bug that's caused by pressing an invalid key may go undetected for a long time. Sometimes the key actions are defined in a table-driven state machine in the software. A single incorrect entry in that table could lead to an invalid key acting like one of the other valid keys, or maybe leading to a nonsense transition.
The full set of states of the user interface may multiply to a huge number of states, especially for a GUI. It may be impossible to enumerate every invalid key sequence. It should be possible, however, to cover enough cases to shake out any bugs. If the key actions are table-driven, it may be possible to use that table as a checklist to ensure that each row of each table has been exercised at least once.
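When the key actions are table-driven, the table can also be walked by a test. The sketch below uses a hypothetical three-state interface. It checks that every entry is either a real state or the "not valid" marker, and that every invalid key leaves the state unchanged.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { ST_HOME, ST_MENU, ST_EDIT, ST_COUNT } ui_state_t;
typedef enum { KEY_MENU, KEY_UP, KEY_ENTER, KEY_BACK, KEY_COUNT } ui_key_t;

/* Marks a key that is not valid in a given state. */
#define KEY_NOT_VALID ST_COUNT

/* Columns are in key order: MENU, UP, ENTER, BACK. */
static const ui_state_t transitions[ST_COUNT][KEY_COUNT] = {
    [ST_HOME] = { ST_MENU,       KEY_NOT_VALID, KEY_NOT_VALID, KEY_NOT_VALID },
    [ST_MENU] = { KEY_NOT_VALID, ST_MENU,       ST_EDIT,       ST_HOME },
    [ST_EDIT] = { KEY_NOT_VALID, ST_EDIT,       ST_MENU,       ST_MENU },
};

/* An invalid key leaves the state unchanged. */
ui_state_t handle_key(ui_state_t state, ui_key_t key)
{
    ui_state_t next = transitions[state][key];
    return (next == KEY_NOT_VALID) ? state : next;
}

/* Number of (state, key) pairs the invalid-key test must exercise. */
size_t count_invalid_pairs(void)
{
    size_t n = 0;

    for (int s = 0; s < ST_COUNT; s++) {
        for (int k = 0; k < KEY_COUNT; k++) {
            if (transitions[s][k] == KEY_NOT_VALID) {
                n++;
            }
        }
    }
    return n;
}

/* Walk the whole table: no nonsense entries, and invalid keys are ignored. */
bool table_is_sane(void)
{
    for (int s = 0; s < ST_COUNT; s++) {
        for (int k = 0; k < KEY_COUNT; k++) {
            ui_state_t next = transitions[s][k];
            if ((int)next > (int)ST_COUNT) {
                return false;
            }
            if (next == KEY_NOT_VALID &&
                handle_key((ui_state_t)s, (ui_key_t)k) != (ui_state_t)s) {
                return false;
            }
        }
    }
    return true;
}
```

The count of invalid pairs gives the size of the manual checklist. A walk like this cannot tell whether an entry names the wrong valid state, so it supplements pressing the keys on a real unit rather than replacing it.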
Some devices require calibration at manufacture or at regular intervals throughout their lives. A nuclear reactor in a research center in Grenoble, France, was shut down when it was discovered that it had been running 10% above its rated limit for 20 years.8 The instrument used to measure the output was calibrated using ordinary water, but nuclear reactors use heavy water.
When used on lung ventilators some sensors require calibration with oxygen, and others with air, so similar issues often arise. Using the wrong gas introduces an error large enough to be significant but small enough that it doesn't cause the entire calibration to fail.
It's a useful test to deliberately fail the calibration. If the user is asked to enter a pressure reading from an external source, entering a value higher than the value displayed on the external source will introduce an error to the calibration. It's useful to know the worst-case performance that can be caused by bad calibration. On some systems cross-checking means that one badly calibrated sensor will be detected by other parts of the system that are still taking accurate readings.
If the system uses a watchdog, it must strobe that watchdog at some regular interval to ensure that the system doesn't reset itself. The watchdog hardware should be tested by the system as part of its start-up sequence. I've covered the implementation of watchdogs in previous articles.9,10
When it comes to testing whether the software is indeed strobing the watchdog at an appropriate rate, my favorite test is to decrease the period of the watchdog timer. This will detect any places in the code where the watchdog was strobed close to its deadline. For example, consider a system that has been designed with a watchdog period of 200ms and has a path in the code that typically takes 190ms. If there are nondeterministic elements in the code (say, interrupts or context switches), the code path that normally takes 190ms may sometimes stretch to more than 200ms, leading to an occasional watchdog reset. Another issue is the accuracy of the watchdog timer. Many microprocessors use an RC circuit to generate the watchdog clock instead of basing it on the crystal driving the rest of the system. The RC circuit can have a wide tolerance and may be temperature dependent as well. The 200ms timeout may be a nominal value, but I've seen systems where it could vary between 150ms and 250ms. Our 190ms path may lead to a reset on some systems and not on others. Or it might reset on a cold day, but not when it's warm.
In a system where the watchdog varies, the real limit for strobing the watchdog is the minimum possible interval, because that is the value that will work on all systems at all temperatures. If the watchdog has a range of 150ms to 250ms, then the ideal test is to set the watchdog to 100ms, which, assuming the same spread of 50ms either side, would give a range of 50ms to 150ms. The longest timeout under test then matches the shortest timeout in the field, so any path long enough to cause a reset in the field will cause one under test. Now our 190ms path will be guaranteed to lead to a watchdog reset, and so will be identified as a problematic path, which possibly needs more strobes.
Rather than running specific tests with the watchdog set to a shorter interval, I like to shorten the interval for all systems during development, and then closer to the release, reset it to the desired design interval. In this way most testing is run at the shorter interval, maximizing the chances of catching those troublesome longer paths in the code.
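The reasoning can be captured in two comparisons. This sketch uses hypothetical names, and the figures in the usage note are the ones from the example above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Earliest and latest moments at which the watchdog might expire. */
typedef struct {
    uint32_t min_ms;
    uint32_t max_ms;
} timeout_range_t;

/* A path is safe in the field only if it finishes before the earliest
   possible timeout. */
bool path_is_safe(uint32_t path_ms, timeout_range_t design)
{
    return path_ms < design.min_ms;
}

/* A shortened test timeout is guaranteed to catch a path only if the
   path outlasts even the latest possible test timeout. */
bool test_will_catch(uint32_t path_ms, timeout_range_t test)
{
    return path_ms > test.max_ms;
}
```

With a design range of 150ms to 250ms and a test range of 50ms to 150ms, the 190ms path is unsafe in the field and certain to be caught on test. A 120ms path is safe in the field but may or may not reset under test, which errs on the side of investigation.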
Take it slow
Watch out for opportunities to widen race-condition windows. One system I worked on phased in new operator settings gradually over a period that varied between two and five seconds. The operator was allowed to enter new settings during this short period, but doing so used a different path through the code. We altered the software to phase in the settings more slowly, taking up to 30 seconds. This widened the window in which the alternate path was used, giving the tester a greater chance of finding bugs.
In another case a graphical application updated the screen faster than the eye could follow. We slowed it down so that each shape could be seen individually before the next one appeared. We discovered that occasionally some parts of the display were drawn, erased, and drawn again. The end result looked the same, but obviously it was more efficient to render each area just once.
When you get to that final build--the one you think is actually going to ship--make sure you check that any back doors or debugging hooks you added for test and debug purposes are removed from the product. It's not unknown for a product to go to market with streams of debugging information being printed to a serial port.
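One common way to make that check mechanical is to route all debug output through a single macro that compiles to nothing unless a build switch is defined. The sketch below assumes a hypothetical switch named `ENABLE_DEBUG_TRACE` that only test builds define.

```c
#include <stdio.h>

/* Hypothetical build switch: define ENABLE_DEBUG_TRACE only in test builds. */
#ifdef ENABLE_DEBUG_TRACE
#define DEBUG_TRACE(...)    fprintf(stderr, __VA_ARGS__)
#define DEBUG_HOOKS_PRESENT 1
#else
#define DEBUG_TRACE(...)    ((void)0)
#define DEBUG_HOOKS_PRESENT 0
#endif

/* Lets a release checklist, or a start-up self-check, confirm that the
   debugging hooks were compiled out of this build. */
int debug_hooks_present(void)
{
    return DEBUG_HOOKS_PRESENT;
}
```

A release test can then assert that `debug_hooks_present()` returns 0, instead of relying on someone remembering to watch the serial port.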
A tester's work is only finished when he can't find any more bugs, no matter how hard he tries to plumb previously unexplored areas of behavior. Although every product has unique functionality and mis-functionality, perhaps some of the tests described here will help your product be more reliable and arrive in the market with fewer bugs.
Niall Murphy has been designing user interfaces for 14 years. He is the author of Front Panel: Designing Software for Embedded User Interfaces. Murphy teaches and consults on building better user interfaces. He welcomes feedback and can be reached at firstname.lastname@example.org. Reader feedback to his articles can be found at www.panelsoft.com/murphyslaw.
1. Murphy, Niall. "Safe Memory Utilization," Embedded Systems Programming, April 2000, p. 110.
2. Murphy, Niall. "Flushing Out Memory Leaks," Embedded Systems Programming, March 2002, p. 37.
3. Murphy, Niall. "More on Memory Leaks," Embedded Systems Programming, April 2002, p. 31.
4. Skeel, Robert. "Roundoff Error and the Patriot Missile," SIAM News, July 1992, Volume 25, Number 4, p. 11.
5. Leveson, Nancy. Safeware: System Safety and Computers. Addison-Wesley Publishing, Reading, MA, 1995.
6. Leveson, Nancy and Clark S. Turner. "An Investigation of the Therac-25 Accidents," IEEE Computer, Vol. 26, No. 7, July 1993, pp. 18-41.
7. Murphy, Niall. "The Programmer as Wordsmith," Embedded Systems Programming, December 2004, p. 9.
8. New Scientist, 17 February 1990, p. 18.
9. Murphy, Niall. "Watchdog Timers," Embedded Systems Programming, November 2000, p. 112.
10. Murphy, Niall and Michael Barr. "Beginner's Corner: Watchdog Timers," Embedded Systems Programming, October 2001, p. 79.