A product can be of low quality for several reasons: it may have been shoddily manufactured, its components improperly designed, its architecture poorly conceived, or its requirements poorly understood.
Quality must be designed in. You can't test out enough bugs to deliver a high-quality product. The quality assurance (QA) process is vital for the delivery of a satisfactory system. In this last part of the series, we concentrate on the portions of the methodology aimed particularly at improving the quality of the resulting system.
The software testing techniques described earlier in this series constitute one component of quality assurance, but the pursuit of quality extends throughout the design flow. For example, settling on the proper requirements and specification cannot be overlooked as an important determinant of quality. If the system is too difficult to design, it will probably be difficult to keep it working properly.
Customers may desire features that sound nice but in fact don't add much to the overall usefulness of the system. In many cases, having too many features only makes the design more complicated and the final device more prone to breakage.
To help us understand the importance of QA, the Application Example below describes serious safety problems in one computer-controlled medical system. Medical equipment, like aviation electronics, is a safety-critical application; unfortunately, this medical equipment caused deaths before its design errors were properly understood.
This example also allows us to use specification techniques to understand software design problems. In the rest of the section, we look at several ways of improving quality: design reviews, measurement-based QA, and techniques for debugging large systems.
Application Example. The Therac-25 radiation therapy machine. The Therac-25 radiation therapy machine caused what Leveson and Turner called “the most serious computer-related accidents to date (at least nonmilitary and admitted).”
In the course of six known accidents, these machines delivered massive radiation overdoses, causing deaths and serious injuries. Leveson and Turner analyzed the Therac-25 system and the causes for these accidents.
The Therac-25 was controlled by a PDP-11 minicomputer. The computer was responsible for controlling a radiation gun that delivered a dose of radiation to the patient; it also ran a terminal that presented the main user interface. The machine's software was developed by a single programmer in PDP-11 assembly language over several years. The software included four major components: stored data, a scheduler, a set of tasks, and interrupt services. The three major critical tasks in the system were as follows:
1) A treatment monitor controls and monitors the setup and delivery of the treatment in eight phases.
2) A servo task controls the radiation gun, machine motions, and so on.
3) A housekeeper task takes care of system status interlocks and limit checks. (A limit check determines whether some system parameter has gone beyond preset limits.)
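The parenthetical definition of a limit check can be made concrete with a few lines of code. The sketch below is an invented illustration in Python, not actual Therac-25 code; the parameter name and bounds are hypothetical.

```python
# Invented illustration of a limit check (NOT actual Therac-25 code):
# verify that a system parameter lies within preset bounds before use,
# and flag a fault if it does not.

def limit_check(name, value, low, high):
    """Return True if value is within [low, high]; report a fault otherwise."""
    ok = low <= value <= high
    if not ok:
        print(f"FAULT: {name}={value} outside [{low}, {high}]")
    return ok

print(limit_check("beam_energy_MeV", 25.0, 0.0, 25.0))   # True
print(limit_check("beam_energy_MeV", 31.5, 0.0, 25.0))   # False, fault flagged
```

A housekeeper-style task would run checks like this periodically against every safety-relevant parameter.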
The code was relatively crude—the software allowed several processes access to shared memory, there was no synchronization mechanism aside from shared variables, and test-and-set operations on shared variables were not indivisible. Let's examine the software problems responsible for one series of accidents. Leveson and Turner reverse-engineered a specification for the relevant software, described below.
Treat is the treatment monitor task, divided into eight subroutines (Reset, Datent, and so on). Tphase is a variable that controls which of these subroutines is currently executing. Treat reschedules itself after the execution of each subroutine. The Datent subroutine communicates with the keyboard entry task via the data entry completion flag, which is a shared variable.
Datent looks at this flag to determine when it should leave the data entry mode and go to the Setup test mode. The Mode/energy offset variable is a shared variable: the high-order byte holds offset parameters used by the Datent subroutine, and the low-order byte holds the mode and energy used by the Hand task.
When the machine is run, the operator is forced to enter the mode and energy (there is one mode in which the energy is set to a default), but the operator can later edit the mode and energy separately.
The software's behavior is timing dependent. If the keyboard handler sets the completion variable before the operator changes the Mode/energy data, the Datent task will not detect the change—once Treat leaves Datent, it will not enter that subroutine again during the treatment. However, the Hand task, which runs concurrently, will see the new Mode/energy information. Apparently, the software included no checks to detect the incompatible data.
After the Mode/energy data are set, the software sends parameters to a digital-to-analog converter and then calls a Magnet subroutine to set the bending magnets. Setting the magnets takes about 8 seconds, and a subroutine called Ptime is used to introduce the necessary time delay.
Due to the way that Datent, Magnet, and Ptime are written, it is possible that changes to the parameters made by the user can be shown on the screen but will not be sensed by Datent. One accident occurred when the operator initially entered Mode/energy, went to the command line, changed Mode/energy, and returned to the command line within 8 seconds.
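The failure mode can be sketched in a few lines of code. The following Python is a hypothetical reconstruction for illustration only (the real code was PDP-11 assembly); the task names follow Leveson and Turner's description, but the data values and function signatures are invented.

```python
# Hypothetical sketch of the flag-based handoff Leveson and Turner
# describe -- NOT the actual Therac-25 code. Once the keyboard task sets
# the data-entry completion flag, Datent latches the shared Mode/energy
# value and never re-reads it, so an edit made afterward is seen by the
# concurrently running Hand task but not by Datent.

class SharedState:
    def __init__(self):
        self.mode_energy = None       # shared variable, no locking
        self.entry_complete = False   # data-entry completion flag

def keyboard_task(state, action, value):
    """Keyboard handler: writes the shared variable, sets the flag on entry."""
    state.mode_energy = value
    if action == "enter":
        state.entry_complete = True

def datent_task(state):
    """Datent: once the flag is set, latch the value and leave data entry."""
    if state.entry_complete:
        return state.mode_energy      # value used to set up the treatment

def hand_task(state):
    """Hand: runs concurrently and always reads the current shared value."""
    return state.mode_energy

state = SharedState()
keyboard_task(state, "enter", "25 MeV electron mode")
treatment_setup = datent_task(state)          # Datent latches and moves on
keyboard_task(state, "edit", "X-ray mode")    # operator edits within ~8 s
print(treatment_setup)    # '25 MeV electron mode' -- stale
print(hand_task(state))   # 'X-ray mode' -- inconsistent with Datent
```

The two tasks now disagree about the treatment parameters, and nothing in the system checks for the inconsistency.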
The error therefore depended on the typing speed of the operator. Since operators become faster and more skillful with the machine over time, this error is more likely to occur with experienced operators. Leveson and Turner emphasize that the following poor design methodologies and flawed architectures were at the root of the particular bugs that led to the accidents:
1) The designers performed a very limited safety analysis. For example, low probabilities were assigned to certain errors with no apparent justification.
2) Mechanical backups were not used to check the operation of the machine (such as testing beam energy), even though such backups were employed in earlier models of the machine.
3) Programmers created overly complex programs based on unreliable coding styles.
In summary, the designers of the Therac-25 relied on system testing with insufficient module testing or formal analysis.
Quality Assurance Techniques
The International Organization for Standardization (ISO) has created a set of quality standards known as ISO 9000, designed to apply to a broad range of industries, including embedded hardware and software.
A standard developed for a particular product, such as wooden construction beams, could specify criteria particular to that product, such as the load that a beam must be able to carry. However, a wide-ranging standard such as ISO 9000 cannot specify the detailed standards for every industry.
Consequently, ISO 9000 concentrates on processes used to create the product or service. The processes used to satisfy ISO 9000 affect the entire organization as well as the individual steps taken during design and manufacturing.
A detailed description of ISO 9000 is beyond the scope of this series. I can, however, make the following observations about quality management based on ISO 9000:
Process is crucial: Haphazard development leads to haphazard products and low quality. Knowing which steps are to be followed to create a high-quality product is essential to ensuring that all the necessary steps are in fact followed.
Documentation is important: Documentation serves several roles. The creation of the documents describing processes helps those involved understand the processes; documentation helps internal quality monitoring groups ensure that the required processes are actually being followed; and documentation also helps outside groups (customers, auditors, etc.) understand the processes and how they are being implemented.
Communication is important: Quality ultimately relies on people. Good documentation is an aid for helping people understand the total quality process. The people in the organization should understand not only their specific tasks but also how their jobs can affect overall system quality.
Many types of techniques can be used to verify system designs and ensure quality. Techniques can be either manual or tool based. Manual techniques are surprisingly effective in practice.
Later I discuss design reviews, which are simply meetings at which the design is discussed; they are very successful in identifying bugs. Many of the software testing techniques described earlier in this series can be applied manually by tracing through the program to determine the required tests.
Tool-based verification helps considerably in managing large quantities of information that may be generated in a complex design. Test generation programs can automate much of the drudgery of creating test sets for programs. Tracking tools can help ensure that various steps have been performed. Design flow tools automate the process of running design data through other tools.
Metrics are important to the quality control process. To know whether we have achieved high levels of quality, we must be able to measure aspects of the system and our design process.
We can measure certain aspects of the system itself, such as the execution speed of programs or the coverage of test patterns. We can also measure aspects of the design process, such as the rate at which bugs are found. Later in this article there is a description of the ways in which measurements can be used in the QA process.
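One measurement-based QA signal is how the bug discovery rate changes over the course of testing. The sketch below is a minimal illustration with made-up data; a falling rate is one (imperfect) indication that testing is reaching diminishing returns, while a steady or rising rate suggests more bugs remain.

```python
# Minimal sketch of a measurement-based QA metric: the per-week bug
# discovery rate, normalized by the cumulative number of bugs found.
# The weekly counts are invented for illustration.

def discovery_rates(bugs_found_per_week):
    """Return, for each week, bugs found that week / cumulative bugs found."""
    rates, total = [], 0
    for found in bugs_found_per_week:
        total += found
        rates.append(found / total)
    return rates

weekly = [12, 9, 5, 2, 1]          # bugs found in each week of testing
for week, rate in enumerate(discovery_rates(weekly), start=1):
    print(f"week {week}: discovery rate {rate:.2f}")
```

A curve like this, tracked project to project, also helps an organization see where in the flow its bugs are introduced and caught.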
Tool and manual techniques must fit into an overall process. The details of that process will be determined by several factors, including the type of product being designed (e.g., video game, laser printer, air traffic control system), the number of units to be manufactured and the time allowed for design, the existing practices in the company into which any new processes must be integrated, and many other factors.
An important role of ISO 9000 is to help organizations study their total process, not just particular segments that may appear to be important at a particular time.
One well-known way of measuring the quality of an organization's software development process is the Capability Maturity Model (CMM) developed by Carnegie Mellon University's Software Engineering Institute. The CMM provides a model for judging an organization. It defines the following five levels of maturity:
1. Initial: A poorly organized process with few well-defined steps. The success of a project depends on the efforts of individuals, not of the organization itself.
2. Repeatable: This level provides basic tracking mechanisms that allow management to understand cost, scheduling, and how well the systems under development meet their goals.
3. Defined: The management and engineering processes are documented and standardized. All projects make use of documented and approved standard methods.
4. Managed: This phase makes detailed measurements of the development process and product quality.
5. Optimizing: At the highest level, feedback from detailed measurements is used to continually improve the organization's processes.
The Software Engineering Institute has found very few organizations anywhere in the world that meet the highest level of continuous improvement and quite a few organizations that operate under the chaotic processes of the initial level.
However, the CMM provides a benchmark by which organizations can judge themselves and use that information for improvement.
Verifying the Specification
The requirements and specification are generated very early in the design process. Verifying the requirements and specification is very important for the simple reason that bugs in the requirements or specification can be extremely expensive to fix later on.
Figure 9.11 below shows how the cost of fixing bugs grows over the course of the design process (we use the waterfall model as a simple example, but the same holds for any design flow).
Figure 9.11. Long-lived bugs are more expensive to fix.
The longer a bug survives in the system, the more expensive it will be to fix. A coding bug, if not found until after system deployment, will cost money to recall and reprogram existing systems, among other things. But a bug introduced earlier in the flow and not discovered until that same point will accrue all those costs and more.
A bug introduced in the requirements or specification and left until maintenance could force an entire redesign of the product, not just the replacement of a ROM. Discovering bugs early is crucial because it prevents bugs from being released to customers, minimizes design costs, and reduces design time.
While some requirements and specification bugs will become apparent in the detailed design stages—for example, as the consequences of certain requirements are better understood—it is possible and desirable to weed out many bugs during the generation of the requirements and spec.
The goal of validating the requirements and specification is to ensure that they satisfy the criteria we originally applied earlier in this series to create the specification, including correctness, completeness, consistency, and so on. Validation is in fact part of the effort of generating the requirements and specification.
Some techniques can be applied while the requirements and specification are being created, to help you understand them; others are applied to a draft, with the results used to modify the documents.
Since requirements come from the customer and are inherently somewhat informal, it may seem like a challenge to validate them. However, there are many things that can be done to ensure that the customer and the person actually writing the requirements are communicating.
Prototypes are a very useful tool when dealing with end users—rather than simply describe the system to them in broad, technical terms, a prototype can let them see, hear, and touch at least some of the important aspects of the system.
Of course, the prototype will not be fully functional since the design work has not yet been done. However, user interfaces in particular are well suited to prototyping and user testing. Canned or randomly generated data can be used to simulate the internal operation of the system.
A prototype can help the end user critique numerous functional and nonfunctional requirements, such as data displays, speed of operation, size, weight, and so forth.
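The "canned data" idea can be sketched concretely. The snippet below is an invented illustration (the names `fake_sensor_readings` and `render_display` are hypothetical): the display logic of a prototype front panel is exercised with canned and randomly generated values, so end users can critique it before any real signal chain exists.

```python
# Invented sketch of a throwaway UI prototype driven by canned data.
# The device, readings, and display format are all hypothetical.

import random

fake_sensor_readings = [72.4, 73.1, 71.8, 74.0]   # canned data

def random_reading():
    """Stand-in for the real sensor: a random value in a plausible range."""
    return round(random.uniform(70.0, 75.0), 1)

def render_display(reading, units="F"):
    """Format a reading the way the prototype front panel would show it."""
    return f"TEMP {reading:5.1f} {units}"

for r in fake_sensor_readings:
    print(render_display(r))
print(render_display(random_reading()))
```

Users reacting to output like this can flag display-format or update-rate problems while the requirements are still cheap to change.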
Certain programming languages, sometimes called prototyping languages or specification languages, are especially well suited to prototyping. Very high-level languages (such as Matlab in the signal processing domain) may be able to express functional attributes, such as the mathematical function to be performed, but not nonfunctional attributes such as the speed of execution.
Preexisting systems can also be used to help the end user articulate his or her needs. Specifying what someone does or doesn't like about an existing machine is much easier than having them talk about the new system in the abstract. In some cases, it may be possible to construct a prototype of the new system from the preexisting system.
Particularly when designing cyber-physical systems that use real-time computers for physical control, simulation is an important technique for validating requirements. Requirements for cyber-physical systems depend in part on the physical properties of the plant being controlled. Simulators that model the physical plant can help system designers understand the requirements on the cyber side of the system.
The techniques used to validate requirements are also useful in verifying that the specifications are correct. Building prototypes, specification languages, and comparisons to preexisting systems are as useful to system analysis and designers as they are to end users.
Auditing tools may be useful in verifying consistency, completeness, and so forth. Working through usage scenarios often helps designers fill out the details of a specification and ensure its completeness and correctness. In some cases, formal techniques (that is, design techniques that make use of mathematical proofs) may be useful.
Proofs may be done either manually or automatically. In some cases, proving that a particular condition can or cannot occur according to the specification is important. Automated proofs are particularly useful in certain types of complex systems that can be specified succinctly but whose behavior over time is complex. For example, complex protocols have been successfully formally verified.
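A toy example gives the flavor of automated verification by exhaustive state exploration. The protocol below is invented, and real model checkers are far more sophisticated, but the core idea is the same: enumerate every reachable state of the specification and check that a safety property holds in each one.

```python
# Toy illustration of automated safety verification by exhaustive
# state-space search. The request/ack protocol and its state encoding
# are invented; the safety property checked is that the responder never
# acknowledges twice for a single request.

from collections import deque

def successors(state):
    """All next states of the invented protocol from the given state."""
    req, resp, acks = state   # requester phase, responder phase, acks so far
    nxt = []
    if req == "idle" and resp == "idle":
        nxt.append(("waiting", "idle", 0))        # new request, reset ack count
    if req == "waiting" and resp == "idle" and acks == 0:
        nxt.append(("waiting", "acking", acks))   # responder accepts request
    if resp == "acking":
        nxt.append(("idle", "idle", acks + 1))    # ack delivered, both go idle
    return nxt

def check_safety(initial, prop, limit=10000):
    """Breadth-first search of reachable states; return a violating state or None."""
    seen, frontier = {initial}, deque([initial])
    while frontier and len(seen) < limit:
        s = frontier.popleft()
        if not prop(s):
            return s
        for n in successors(s):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return None

violation = check_safety(("idle", "idle", 0), lambda s: s[2] <= 1)
print(violation)   # None -> the property holds in every reachable state
```

Because the specification is succinct, the search visits only a handful of states, yet it proves the property for every behavior the specification allows; this is why protocols are such a good match for automated verification.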
The design review is a critical component of any QA process. The design review is a simple, low-cost way to catch bugs early in the design process. A design review is simply a meeting in which team members discuss a design, reviewing how a component of the system works.
Some bugs are caught simply by preparing for the meeting, as the designer is forced to think through the design in detail. Other bugs are caught by people attending the meeting, who will notice problems that may not be caught by the unit's designer.
By catching bugs early and not allowing them to propagate into the implementation, we reduce the time required to get a working system. We can also use the design review to improve the quality of the implementation and make future changes easier to implement.
A design review is held to review a particular component of the system. A design review team has the following members:
1) The designers of the component being reviewed are, of course, central to the design process. They present their design to the rest of the team for review and analysis.
2) The review leader coordinates the pre-meeting activities, the design review itself, and the post-meeting follow-up.
3) The review scribe records the minutes of the meeting so that designers and others know which problems need to be fixed.
4) The review audience studies the component. Audience members will naturally include other members of the project for which this component is being designed. Audience members from other projects often add valuable perspective and may notice problems that team members have missed.
The design review process begins before the meeting itself. The design team prepares a set of documents (code listings, flowcharts, specifications, etc.) that will be used to describe the component. These documents are distributed to other members of the review team in advance of the meeting, so that everyone has time to become familiar with the material. The review leader coordinates the meeting time, distribution of handouts, and so forth.
During the meeting, the leader is responsible for ensuring that the meeting runs smoothly, while the scribe takes notes about what happens. The designers are responsible for presenting the component design.
A top-down presentation often works well, beginning with the requirements and interface description, followed by the overall structure of the component, the details, and then the testing strategy. The audience should look for all types of problems at every level of detail, including the problems listed below.
1) Is the design team's view of the component's specification consistent with the overall system specification, or has the team misinterpreted something?
2) Is the interface specification correct?
3) Does the component's internal architecture work well?
4) Are there coding errors in the component?
5) Is the testing strategy adequate?
The notes taken by the scribe are used in meeting follow-up. The design team should correct bugs and address concerns raised at the meeting. While doing so, the team should keep notes describing what they did.
The design review leader coordinates with the design team, both to make sure that the changes are made and to distribute the change results to the audience. If the changes are straightforward, a written report of them is probably adequate.
If the errors found during the review caused a major reworking of the component, a new design review meeting for the new implementation, using as many of the original team members as possible, may be useful.
System design takes a comprehensive view of the application and the system under design. To ensure that we design an acceptable system, we must understand the application and its requirements. Numerous techniques, such as object-oriented design, can be used to create useful architectures from the system's original requirements.
Along the way, by measuring our design processes, we can gain a clearer understanding of where bugs are introduced, how to fix them, and how to avoid introducing them in the future.
This series of three articles is based on material printed with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2008 from “Computers as Components, Second Edition” by Wayne Wolf. For more information about this title and other similar books, please visit www.elsevierdirect.com.
Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.