In addition to all of the usual testing concerns, interactive graphical applications present some complicated issues. But testing around these complexities is possible, as shown here.
The automated testing of a traditional sequential program, such as a compiler, is well understood and practiced in many areas of the computing industry. By contrast, the regular automated testing of interactive programs, especially their presentational aspects, requires additional techniques and rigors to assure operational quality and “correct” behavior and output over the program's lifetime. This article introduces the concepts of automated regression testing and its advantages over traditional methods, illustrated with a specific development and testing regime.
We are all familiar with the mantra that testing is a good thing. But what does this entail when the software in question is a graphical interactive program running on a range of embedded platforms? How does testing fit in, not only with the quality assurance process, but also with the simple day-to-day engineering activities? How can frameworks be constructed to anticipate future testing needs as well as those known and understood when a testing framework is introduced?
This article uses a hypothetical embedded graphical application to explore many of the principles embodied within this testing regime. With this one example, it is possible to explore a range of different issues, some or all of which might be directly relevant to the reader. Both the philosophical abstractions of such testing and the practicalities of implementing this approach are covered.
What sort of software are we testing?
Software portability is a major consideration for many software engineering departments in these days of Internet appliances. Portability is often accomplished by employing a software philosophy and structure wherein the application executes with the assistance of a target-specific portability library. When addressing a new target platform, only the implementation of the portability layer needs specific attention; the application and user interface are compiled with the appropriate toolchain and do not require modification.
All the code is written in ANSI C for further portability. This means that varying only the implementation of the portability layer permits the same application and user interface code to execute on different platforms.
This has obvious benefits from a software development viewpoint, but it also gives rise to a powerful testing mechanism to separate the behavior of an application and user interface from a target platform. Significantly, it permits us to take advantage of the greater speed of desktop workstations, which are typically about an order of magnitude faster than the majority of embedded target platforms. By developing implementations of the portability layer for Unix and Win32, we are able to perform the development and testing of applications and their user interfaces on fast workstations, rather than on (potentially) slow target platforms. Workstation testing enables significantly more testing to be performed in a given time period than direct testing on the target embedded platform.
With suitably defined rigorous abstractions between a portable application and the target-specific implementation of the portability layer, it is possible to make powerful, reasoned statements about how much software needs testing on a given target platform. Further, if the main application itself fails on a particular target platform but not on a development workstation, the list of potential causes can be drastically reduced through some simple logical thinking. Typically, such failures are caused by incorrect implementations of the portability layer, toolchain faults, and hardware faults. As can readily be seen, such scope reduction is incredibly valuable.
To help illustrate the range of ideas and practices we have explored, this article uses a hypothetical program called GVLize. This is a program for visualising graphs of connected nodes. It takes data sets describing a large number of nodes and their interconnections, and produces a graphical representation with which the user can interact. The complexity of the layout task, both in terms of the number of possible combinations and the ability to produce a visually pleasing layout, means that there is no simple test for “correctness.” Further, the user can “nudge” the layout to perform small modifications to the chosen layout. GVLize is intended to run across a range of embedded platforms, with the main development performed on workstations. The majority of the code is implemented on top of a small portability library.
GVLize typically runs as a “full screen” application on the target embedded platforms. Data sets are obtained from both local capabilities and through one or more network connections.
GVLize thus illustrates a range of characteristics typical of many real-world applications, even if most real applications only possess a subset of these characteristics. This presents a number of additional testing challenges beyond those associated with standard workstation programs such as compilers:
- Primary output is graphical
- Interactive in nature (possesses a main event loop)
- “Correctness” is difficult to fully determine programmatically
- Cross-platform nature introduces a range of additional hazards
What is to be tested?
All aspects of behavior, including interactivity and presentation, are to be tested. These include:
- Machine checks (otherwise known as exceptions, traps, segmentation faults, or bus errors)
- Internal assertion checks
- Internal data structure consistency checks
- Operation completion without indefinite loops or excessive execution time
- Heap integrity
- Heap exhaustion behavior
- Error recovery behavior
- User input simulation
- Correct screen contents
Additional factors that must be taken into account due to the nature of both the applications and the target platforms upon which they execute are:
- An interactive program has no obvious termination point
- The complexity of GVLize's interactions means that correct behavior can require operator verification
- Network activity can introduce varying behavior
- The potential “remoteness” caused by executing on a target platform must be addressed
- Long term heap fragmentation and leakage
Having identified a number of very specific issues relating to the testing to be performed, the next few sections build up the structure of the testing harness. This harness is designed to provide flexible and long-term testing. As such, its design is well suited to many other testing requirements.
Note that we assume you're using version control to track the specific source code that goes into specific builds and releases of your software. Without that, this sort of automated testing will only help you detect bugs, and fixing them will be considerably more difficult.
What's a test?
The fundamental purpose of testing is to determine whether something specific works or not. This gives rise to the first attribute of a test: its status. The range of different values that can be recorded by the status is deliberately limited to the absolute minimum. The four status values that must be distinguishable are shown in Table 1.
|Table 1: Four test status values that must be distinguishable|
|Test status||Ascribed meaning|
|Never||The test has never been performed|
|Passed||Testing did not find any reasons to fault the test|
|Failed||One or more reasons to fault the test were found|
|Broken||Testing could not be performed (for example, equipment absence)|
It is important to distinguish between not having obtained a test result and either the passed or failed status values. Consider the situation where the target platform necessary to perform a test is not powered up. It would be inaccurate to say that being unable to perform such a test equated to that test either failing or passing. Not having the necessary equipment to perform a test is simply not a predictor of whether a test will pass or fail. In a similar fashion, the fact that a test has never been attempted is an important piece of information to be able to distinguish.
The next characteristic of a test is identification. We use two different ways to identify a test: a unique integer test number (the test ID) and a unique textual name (the test name). Over time it is to be expected that new tests will arrive and old tests will be retired. Thus, test ID values should be allocated monotonically, with no reuse of values, with the expectation that the sequence may contain holes. Test names are by convention constructed in a hierarchical fashion to encapsulate progressively more specific aspects of a test. Consider an example test name:

/gvlize/x86-linux-gcc/xgvlize/1shot/gis/cambs/dset1203
The standard forward-slash-separator approach is used. The leftmost component is the most significant. The meaning of this test name is deconstructed in Table 2.
|Table 2: Deconstruction of a test name|
|Test name: /gvlize/x86-linux-gcc/xgvlize/1shot/gis/cambs/dset1203|
|Name component||Component meaning|
|gvlize||The product being tested is GVLize|
|x86-linux-gcc||The executable is targeted at Linux running on an x86 processor and has been compiled with GCC|
|xgvlize||The particular variant of GVLize is the reference X Window System implementation|
|1shot||The test category is the “1-shot” category|
|gis||The sub-category is GIS node data set|
|cambs||The sub-sub-category is data sets of Cambridge|
|dset1203||The particular test data set|
The final characteristic of a test represents a practical compromise between maintaining a clean abstraction within the core of the test harness and the many assorted tests and their variants that it controls in a real-world environment. This is a list of test attributes, stored as a semi-colon separated list of strings. This provides a mechanism whereby the test harness records and preserves test-specific information without the harness itself examining or (necessarily) being directly influenced by such stored attributes. These attributes are used for such things as indicating which tests have unusually long timeout values or are expected to generate error situations, and so on.
See Table 3 for a complete list of test characteristics.
|Table 3: Test characteristics|
|Status||One of passed, failed, broken, and never|
|ID||Unique integer identifier|
|Name||Unique textual identifier|
|Attributes||Semi-colon separated list of strings|
Adding a test database
Practical use of a testing mechanism within day-to-day development activities and the quality assurance process requires some history to be associated with each test. Thus, the details of all tests known to the test harness are recorded in a database. To permit historical observations to be made, a number of dates are recorded for each test, as shown in Table 4.
|Table 4: Test database|
|Database field||Information recorded|
|Created||Date test was first entered in the database|
|Last passed||Date the test last passed|
|Last failed||Date the test last failed|
|Last broken||Date the test was last broken|
The date a test was last run is readily determined from the latest date recorded in the passed, failed, and broken fields. A short aside: when catering for tests that take significant amounts of time to execute, one might create a separate last run field in which to record the time at which testing started. The passed, failed, and broken fields would then record the time when the new status was actually determined and testing finished. We have not yet found a significant requirement for this level of subtle timing.
How is a test conceptually implemented?
Having sketched out a testing framework (or test harness), let's look at how the tests themselves interact with the framework. A test presents itself to the test harness as a function pointer or function reference. This function is referred to as the implementor function. The basic contract between the test harness and the implementor function is as follows:
- A test registers itself with the test harness to become part of the testing system, supplying an implementor function, a test name, and an initial set of test attributes
- The test harness is responsible for permanently recording all relevant information
- The implementor function is responsible for all aspects of executing the actual test, ensuring a tidy state after this testing, and yielding a new status value
- The implementor function is also responsible for reporting the detailed results of a previous test
The textual name of a test is supplied when the test registers itself. The numeric identifier of a test is generated and maintained by the test harness core. Implementor function references are not recorded in the database and are free to vary from invocation to invocation of the test harness. The operator is free to decide whether a test ID or test name is the appropriate method of specifying a test when interacting with the test harness.
Outline of a testing framework
The core operation of the test harness can now be sketched:
- Load database
- Register tests
- Determine stale and new tests
- Perform invocation-related operations using implementor functions
- Save updated database
When invoked, the test harness initializes its state from the test database. The known tests are then registered. New tests are identified as tests without a recorded identifier number. Old tests are those with no registered implementor function. Such tests are classified as stale and are awaiting manual retirement by the user. Attempting to perform any operation on a stale test results in an appropriate warning message and no further action.
The fundamental operations the test harness can perform can now also be enumerated, as shown in Table 5. This provides a generic framework for a testing mechanism with tremendous flexibility. It provides the basic necessary operations. It addresses changing testing requirements. It can handle real-world situations such as the precious development PCB having been “borrowed.” No particular implementation language is required. No particular host platform is required. A minimal set of operations is required of individual tests so that they can participate within the framework.
An initial measure of the effectiveness of this abstraction is the absence of any concrete details on what is to be tested within the framework discussed so far.
|Table 5: Test harness operations|
|Initialize||Initialize the database|
|Register||Ensure all known tests are registered in the database|
|Report||Report on test results within the database|
|Test||Actually execute a test, generating a status value|
|Retire||Delete a test in a controlled fashion|
|Verify||Verify the integrity of the test database|
An abstract model such as the one just described is a useful starting point, but it does not translate into instant practicality. The remaining sections of this article illustrate how we have taken the above framework and constructed an entire testing regime around it. From the abstract model to many of the finer practical details, the structure of this regime combines the knowledge of a range of seasoned industry experts, with experiences in such diverse areas as compiler construction and 3-dimensional modeling.
Implementing the test harness
The starting point for implementation is the test harness itself. For this, we chose the Python interpreted programming language. This provides a range of benefits:
- Easy string handling
- Moves easily between Unix and Win32 platforms
- High level nature of Python makes for easier development
- Modular nature of Python permits a powerful structuring
The extent to which the test harness is aware of the different tests to be executed is that it initializes a number of code modules. These modules, in turn, register their tests through a specific test harness call where they supply the implementor functions. A library file provides common functionality that is standardized across all in-house testing but still external to the core test harness. A simple command line argument syntax provides access to the facilities provided by these implementor functions:
report reporting-mode test-pattern
The reporting-mode controls the verbosity of reporting, ranging from a simple count of the number of tests matching the test-pattern, to displaying the complete log files of each matching test. The test-pattern provides a number of ways to match a subset of the test database. Referring back to the test name example, */1shot/*/dset1203 provides a wildcard method to specify all 1-shot tests using the dset1203 data set as input across all platforms.
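Although the harness itself is written in Python, the wildcard matching it performs can be sketched in C with the POSIX fnmatch routine. This is an illustrative sketch, not the harness's actual implementation; note that the FNM_PATHNAME flag is deliberately omitted so that a * may span the / separators used in hierarchical test names:

```c
#include <fnmatch.h>
#include <stddef.h>

/* Return 1 if the test name matches the wildcard pattern, 0 otherwise.
 * Passing 0 (rather than FNM_PATHNAME) as the flags lets '*' match
 * across the '/' separators of a hierarchical test name. */
int test_name_matches(const char *pattern, const char *name)
{
    return fnmatch(pattern, name, 0) == 0;
}
```

With this, a pattern such as */1shot/*/dset1203 selects the dset1203 data set in the 1-shot category across every platform and variant.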
Basic areas to test
The most basic testing involves feeding data sets to GVLize and checking whether it completes their decoding and presentation. At this level of testing, no judgement is made about the correctness of the displayed graphics. Nonetheless, a significant amount of checking can be performed.
The most blunt thing to check for during this sort of testing is the generation of machine-level traps. Such traps (or exceptions, bus errors, segmentation faults, and so on) are always incorrect and indicate a significant problem. Of course, not all target platforms can actually raise such exceptions. This is where workstation testing is particularly useful, as workstations do reliably raise such exceptions.
Assertions are a common tool for consistency checking. We use three levels of consistency checking: production-level assertions, debug-level assertions, and specific debug-level consistency checking functions.
Two levels of assertion macro are provided, one of which is present in all builds and the other of which is only present in debug builds. The latter provides for more formal and time-consuming checking. This is also useful during the development of new features. Debug-level assertions might well be present within inner loops. Production-level assertions are never present in such a potentially time-consuming code location.
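A minimal sketch of such a two-level assertion scheme follows. The macro and function names, the DEBUG_BUILD symbol, and the failure counter are illustrative assumptions, not the names from the original codebase; a production implementation would typically report the failure in a controlled fashion and then trap or abort rather than merely counting.

```c
#include <stdio.h>

/* Count of assertion failures.  Illustrative only: a real implementation
 * would typically abort or trap here after reporting, rather than count. */
int assert_failure_count = 0;

/* Called on any assertion failure: report in a controlled fashion. */
void assert_failed(const char *expr, const char *file, int line)
{
    fprintf(stderr, "assertion failed: %s (%s:%d)\n", expr, file, line);
    assert_failure_count++;
}

/* Production-level assertion: present in every build, so it must be cheap
 * and must never sit inside a time-critical inner loop. */
#define PROD_ASSERT(e) \
    ((e) ? (void)0 : assert_failed(#e, __FILE__, __LINE__))

/* Debug-level assertion: compiled away entirely in release builds, so it
 * may be expensive and may appear inside inner loops. */
#ifdef DEBUG_BUILD
#define DEBUG_ASSERT(e) PROD_ASSERT(e)
#else
#define DEBUG_ASSERT(e) ((void)0)
#endif
```

Because the debug-level macro expands to nothing in release builds, its argument must never have side effects the program relies on.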
Aside: one can debate whether assertions should be present in production builds. I am firmly of the opinion that they should be, provided they do not have a significant code size or execution performance impact. If a situation arises that a production assertion can trap, then the choice is quite simple: is the incorrectness caught early and reported in a controlled fashion to the user or is the situation permitted to have unpredictable side effects and not be reported to the user? Although we obviously do our best to ensure an assertion failure does not occur in released code, we would rather receive immediate notification of problems than experience errant code behavior.
The third level of consistency checking is also only present in debug builds. This involves specific functions whose purpose is the examination and verification of the internal data structures to ensure that all the relevant design rules are adhered to. An example of this sort of checking is to ensure that linked lists that are meant to be NULL terminated are indeed NULL terminated and not circularly closed. Consider a situation with many interacting complex data structures and a number of linked list traversal mechanisms. It is quite possible to have a linked list that has been incorrectly linked back to itself, but the standard run-time behavior still manages to terminate under most circumstances (such cases have been caught in the past). Specific consistency checking functions can detect and report this early on, during development and testing.
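The NULL-termination check described above can be implemented without any extra memory using the classic two-pointer cycle-detection technique. This is an illustrative sketch with assumed type and function names, not code from GVLize itself:

```c
#include <stddef.h>

typedef struct node {
    struct node *next;
    int payload;
} node;

/* Consistency check: return 1 if the list starting at head is properly
 * NULL terminated, 0 if it has been incorrectly linked back on itself.
 * The "tortoise and hare" pointers advance at different rates, so on a
 * circular list they must eventually meet. */
int list_is_null_terminated(const node *head)
{
    const node *slow = head;
    const node *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return 0;   /* the two pointers met: the list is circular */
    }
    return 1;           /* fast ran off the end: properly terminated */
}
```

A debug-build consistency function would call a check like this on every list the design requires to be NULL terminated.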
Memory allocation (heap) debugging has long been notorious for its difficulty. In addition to using workstation-specific tools, such as Great Circle, we perform cross-platform debugging by placing a set of wrappers around all heap operations. Although the facilities that are provided through these wrappers are not unknown in other similar tools, their cross-platform nature is unusual. These wrappers modify the behavior of heap operations to enlarge the blocks allocated to contain additional information. Guard words are placed immediately before and after the block returned to the caller. A linked list of all current allocations is maintained. The heap allocation functions are extended, through C macros, to include the information provided by the __FILE__ and __LINE__ pre-processor values. Thus, a simple heap trace function can list all current heap allocations, including their size and from where they were allocated. Additionally, guard-word checking can readily detect simple over-run and under-run situations.
As a further aid, in debug builds, all heap allocations that return non-zeroed memory (malloc vs. calloc) actually initialize the allocated memory with a specific byte value. This makes it more likely that attempts to use un-initialized memory will be detected. We use the value 0x69 as a readily recognized value. When manipulating pointers, this gives 0x69696969 if initialization has not been performed. This value is an invalid address on every system we have encountered so far. As a numeric value, 0x69696969 is an unusually large value, liable to reveal its use very rapidly. As a character string, this yields an unterminated string consisting entirely of lowercase “i” characters, which is also readily observable.
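A simplified sketch of such a heap wrapper, combining the guard words, the allocation list, the __FILE__/__LINE__ capture, and the 0x69 fill described above. All names and the guard value are illustrative assumptions, and the matching free/realloc wrappers are omitted for brevity:

```c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define GUARD_WORD 0xDEADBEEFu   /* illustrative guard value */
#define FILL_BYTE  0x69          /* fill for non-zeroed allocations */

/* Header prefixed to every wrapped allocation; a second guard word is
 * placed immediately after the caller's block. */
typedef struct alloc_header {
    struct alloc_header *next;   /* linked list of live allocations */
    size_t size;                 /* size requested by the caller */
    const char *file;            /* __FILE__ of the allocation site */
    int line;                    /* __LINE__ of the allocation site */
    unsigned guard;              /* leading guard word */
} alloc_header;

static alloc_header *live_allocs = NULL;

void *debug_malloc(size_t size, const char *file, int line)
{
    alloc_header *h = malloc(sizeof *h + size + sizeof(unsigned));
    unsigned guard = GUARD_WORD;
    char *user;

    if (h == NULL)
        return NULL;
    h->next  = live_allocs;
    h->size  = size;
    h->file  = file;
    h->line  = line;
    h->guard = GUARD_WORD;
    live_allocs = h;

    user = (char *)(h + 1);
    memset(user, FILL_BYTE, size);               /* catch uninitialized use */
    memcpy(user + size, &guard, sizeof guard);   /* trailing guard word */
    return user;
}

/* Return 1 if both guard words around the block are intact, 0 otherwise. */
int debug_check_guards(const void *p)
{
    const alloc_header *h = (const alloc_header *)p - 1;
    unsigned trailing;

    memcpy(&trailing, (const char *)p + h->size, sizeof trailing);
    return h->guard == GUARD_WORD && trailing == GUARD_WORD;
}

/* A simple heap trace: list all current allocations and their origins. */
void debug_heap_trace(void)
{
    const alloc_header *h;
    for (h = live_allocs; h != NULL; h = h->next)
        printf("%lu bytes from %s:%d\n",
               (unsigned long)h->size, h->file, h->line);
}

/* Allocation sites use the macro so that file and line are captured. */
#define DEBUG_MALLOC(n) debug_malloc((n), __FILE__, __LINE__)
```

Guard-word checking then detects simple over-runs and under-runs, and the allocation list supports both the heap trace and the artificial heap-size limit discussed below.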
One final benefit of such wrappered heap behavior is that an artificially fixed size may be readily imposed on the heap. This permits observation of behavior in memory exhaustion situations on workstations when they would otherwise have sufficient memory to satisfy most allocations (typically due to virtual memory).
During both development and debugging, it is useful to observe a trace of program execution. Such debug (or trace) messages are often considerably more useful than simply setting a few breakpoints in a debugger, because they can rapidly provide a much larger view of program behavior.
Debug tracing is performed through a set of macros. In release-level builds these macros expand to empty contents and have no effect. In debug-level builds these macros expand into calls to a debug formatting routine somewhat like printf. A number of different debug macros can be provided to address different categories of tracing and the current debug nesting level can be maintained.
Functions to increment and decrement this nesting level are provided for use by specific data structure display routines. Before printing a debug message, each line is indented according to the current debug nesting level. Other notable characteristics of the debug routines are:
- All formatting is done by the same code to ensure consistent behavior
- Extended range of formatters to address common requirements
- Fully NULL-protected for ease of use
- Dynamic control of address printing (%p formatter)
- Final printing goes through a single function pointer
A number of important points can be observed here. The completeness of sprintf implementations varies across embedded platforms. For example, a number of implementations do not recognize a %p formatter for pointer printing. By performing the formatting with the same code on each platform, consistent results are obtained. Extending the range of data types that can be formatted greatly eases more involved tracing. For example, being able to format four integers as the coordinates of a rectangle with a simple %B increases the ease and reliability with which windowing operations may be traced. By having control over %p formatting, it is possible to have a dynamic format control that replaces the output from %p with a fixed placeholder string.
This is useful for capturing and comparing debug traces where slight variations in memory allocation would otherwise cause different addresses to be allocated, but where all other aspects of the trace should be identical. The means for observing a debug stream varies from platform to platform. By providing a single function responsible for getting debug tracing to the user, the variations between different platforms may be quickly addressed.
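The tracing behavior described above might be sketched as follows. The formatter set is reduced to %s, %d, and %p for brevity, and the fixed string substituted for %p output ("<ptr>") is an illustrative placeholder, not the string used by the original implementation:

```c
#include <stdarg.h>
#include <stdio.h>

static int debug_nest = 0;       /* current nesting level */
static int debug_hide_ptrs = 0;  /* when set, %p prints a fixed token */

void debug_nest_push(void) { debug_nest++; }
void debug_nest_pop(void)  { if (debug_nest > 0) debug_nest--; }

/* Format one debug line into buf: indent by nesting level, then handle a
 * small formatter set ourselves so behavior is identical on every
 * platform, including those whose sprintf lacks %p. */
void debug_format(char *buf, size_t buflen, const char *fmt, ...)
{
    va_list ap;
    size_t n = 0;
    int i;

    va_start(ap, fmt);
    for (i = 0; i < debug_nest && n + 2 < buflen; i++) {
        buf[n++] = ' ';
        buf[n++] = ' ';
    }
    for (; *fmt != '\0' && n + 1 < buflen; fmt++) {
        if (*fmt != '%') { buf[n++] = *fmt; continue; }
        fmt++;
        if (*fmt == '\0')
            break;                        /* lone trailing '%' */
        switch (*fmt) {
        case 's': {
            const char *s = va_arg(ap, const char *);
            if (s == NULL)
                s = "(null)";             /* NULL-protected for ease of use */
            n += snprintf(buf + n, buflen - n, "%s", s);
            break;
        }
        case 'd':
            n += snprintf(buf + n, buflen - n, "%d", va_arg(ap, int));
            break;
        case 'p': {
            void *p = va_arg(ap, void *);
            if (debug_hide_ptrs)          /* dynamic %p suppression */
                n += snprintf(buf + n, buflen - n, "<ptr>");
            else
                n += snprintf(buf + n, buflen - n, "%p", p);
            break;
        }
        default:
            break;
        }
    }
    buf[n < buflen ? n : buflen - 1] = '\0';
    va_end(ap);
}
```

In a debug build the trace macros would expand into calls to a routine like this, with the final text handed to a platform-specific output function through a single function pointer.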
Debug tracing is not only useful for specific development activities. When investigating a bug, especially a bug that is not 100% repeatable, a corresponding debug trace can be invaluable for gaining a better understanding of what did (or did not) happen. For this reason, debug traces are captured to file whenever possible.
Another use of debug traces is to compare them (as with the standard diff utility) against traces from previous executions of a test. Differences among such debug traces can be indicative of subtle problems or changes of behavior. The %p behavior referenced previously is particularly useful here.
Interactive debug control
As stated in the introduction, this testing regime is concerned with interactive programs. The debugging and testing mechanisms presented so far have not concentrated on interactive aspects. Interactive debugging is performed through a mechanism we know as SDEBUG, or socket debugging.
The SDEBUG mechanism buries a very small telnet server within the application to be debugged. Once started, an operator can telnet to and interact with the application. A simple command line interpreter decodes and dispatches commands. Two tables of commands are recognized: one associated with the portability layer and the other with the application utilizing the portability layer. Whether the client end of the telnet connection is a programmer typing commands manually or a send/expect script producing automatic responses is unknown to the program being debugged. When used within the test harness, the client is a send/expect script.
Basic SDEBUG operation involves sending a command to the program that is being debugged and waiting for a response. To simplify the construction of send/expect scripts, all responses start with a three-digit response code, similar to those used by FTP and HTTP. Some commands are termed immediate commands and others are termed long-term commands. The distinction is whether the command has completed when the server sends the next command completion response code. An informational-type response code is used to prefix lines sent when responding to commands with intermediate results. For example, a command can be sent to display a listing of all current heap allocations. This will produce a series of text lines, each tagged as being an information line, followed by an immediate command completion code. Prior to receiving this immediate command completion code, the server (program being debugged) is busy and unable to respond immediately to further commands. Long-term commands, such as network fetches (which are non-blocking), produce two responses. The first indicates that the command itself has been started and the server is ready for further commands. The second indicates completion of the fetch itself.
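The command/response discipline can be illustrated with a toy dispatcher. The three-digit codes and the command names below are assumptions in the spirit of FTP-style reply codes; the article does not give SDEBUG's actual values:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative three-digit response codes, FTP/HTTP style.  These are
 * assumed values, not the codes used by the real SDEBUG protocol. */
enum {
    RSP_INFO    = 150,  /* intermediate informational line          */
    RSP_OK      = 200,  /* immediate command completed              */
    RSP_STARTED = 250,  /* long-term command accepted and started   */
    RSP_UNKNOWN = 500   /* command not recognized                   */
};

/* Decode one command line and return the first response code that would
 * be sent.  "heaptrace" stands in for an immediate command producing
 * information lines; "fetch" for a long-term (non-blocking) command
 * whose completion is signalled by a later, second response. */
int sdebug_dispatch(const char *line, char *reply, size_t replylen)
{
    if (strncmp(line, "heaptrace", 9) == 0) {
        snprintf(reply, replylen, "%d heap trace follows", RSP_INFO);
        return RSP_OK;      /* completion code sent after the listing */
    }
    if (strncmp(line, "fetch ", 6) == 0) {
        snprintf(reply, replylen, "%d fetch started", RSP_STARTED);
        return RSP_STARTED; /* a second response follows on completion */
    }
    snprintf(reply, replylen, "%d unknown command", RSP_UNKNOWN);
    return RSP_UNKNOWN;
}
```

Because every reply begins with a fixed-width numeric code, a send/expect script only needs to pattern-match the first three characters to decide how to proceed.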
The SDEBUG mechanism can also inject user input events into the portability layer. This permits the simulation of keypress, mouse movement, and mouse click events. In turn, this permits the exercising of the portability layer, the front end, the core application (such as GVLize), and the replication of a sequence of previously troublesome user actions. By injecting the event information at what is normally the boundary between the portability layer and the device driver, the maximum amount of code is tested.
Examining the graphical output
The correctness of the screen images is the remaining program feature we need to examine. Correctness is by far the most subtle and complex characteristic to test, because of the large number of ways in which input data nodes for GVLize may be combined, an issue that is further complicated by the user's ability to nudge particular visualizations.
The fundamental problem with entirely automated testing of such complex behavior is that there is a significant probability of errors in the test code itself. Relying on entirely automated testing could lead to test results with little or no value. Rather than grappling with such a large and potentially insoluble problem, the approach we have chosen avoids it by using some manual intervention.
The SDEBUG mechanism can capture the current screen frame buffer to a file, typically on the main machine running the test harness itself. On the first execution of a presentation-related test, the screen is captured. An operator then needs to examine this screen capture and manually qualify it. This qualification is a subjective “I think the results are correct” decision. This promotes to the operator all the issues regarding the correctness of the complex interactions leading to screen generation. Subsequent testing compares newly generated screen captures against the previously qualified capture. If they are the same, the test is deemed to have succeeded. If the screen images differ, the test has failed. For this type of testing, such a failure means that an operator (typically a suitably experienced programmer) must examine the two screen captures and determine whether the differences constitute an error or some other change. One example of a difference that is not an error is when the default font size has changed. When a difference is not viewed as an error, the qualified screen image needs updating.
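Once captures are ordinary files, the comparison step itself is straightforward: a byte-for-byte comparison, with any read failure treated as a difference requiring operator attention. A minimal sketch; the function name is illustrative:

```c
#include <stdio.h>

/* Compare a freshly generated screen capture against the previously
 * qualified one, byte for byte.  Returns 1 if identical (test passes),
 * 0 if they differ or either file cannot be read, in which case an
 * operator must examine the captures. */
int captures_identical(const char *qualified_path, const char *new_path)
{
    FILE *a = fopen(qualified_path, "rb");
    FILE *b = fopen(new_path, "rb");
    int ca, cb, same = 0;

    if (a != NULL && b != NULL) {
        do {
            ca = getc(a);
            cb = getc(b);
        } while (ca == cb && ca != EOF);
        same = (ca == cb);   /* both hit EOF at the same point */
    }
    if (a != NULL) fclose(a);
    if (b != NULL) fclose(b);
    return same;
}
```

Note that the comparison is deliberately exact: any pixel difference fails the test, and the human judgement about whether that difference matters happens afterward, during requalification.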
Control required over tests in the real world
In addition to the various requirements identified so far, a short list of additional requirements for actual tests is as follows:
- Must support test execution on a variety of target platforms
- Must cater to different operating systems, executable-loading mechanisms, and ICE, JTAG, or other such debugging interfaces
- Must be able to catch and (ideally) abort excessively long execution
- Must be able to capture debugging trace output to file, for later examination
- Must support screen image capture to file, for comparison purposes
An executor program addresses the divide between the machine hosting the test harness and the machine executing the test. Supplied with a range of information, the executor arranges for execution of the test on the desired target platform, in a controlled environment. It collects a range of statistics and any debug tracing produced into files stored on the machine running the test harness. Where applicable, it enforces execution time limits. The implementation of the executor varies from target platform to target platform.
Starting with a generic test harness structure, we have constructed a framework for the execution of tests. The presence of executors addresses cross-platform issues by restricting runaway tests and capturing various informational files from testing. The ability to use workstations to perform a large amount of testing provides for faster testing and permits reliable exception capturing. A number of levels of assertions provide for programmed consistency checks.
Wrappers around heap operations supply the ability to observe detailed allocation behavior, probe for potential memory initialization bugs, and simulate a range of different fixed size memory capacities to observe exhaustion behavior. Debug tracing facilities provide not only convenient mechanisms during development but the ability to retrospectively determine what happened during a testing run, and the ability to compare the internal operations of multiple test runs. Many interactivity issues are addressed through the SDEBUG mechanism coupled with send/expect scripts. User input is addressed through the ability to inject relevant events into the portability layer. Output correctness checking is addressed through a combination of screen image captures and comparisons, supplemented with manual operator qualification to ensure that complexity does not compromise the validity of results obtained.
Each of these mechanisms is useful in itself. Indeed, most development teams are probably already using some of these techniques during their development.
However, it is not the usefulness of these mechanisms individually that is relevant here. Rather, it is the benefits yielded by combining them together into a single testing and development regime. Combining these techniques permits comprehensive testing to be performed through simple invocations of the test harness program.
To deny the presence of bugs is like King Canute denying the influence of the moon. A robust software development process must recognize this and provide for the earliest possible detection and continued elimination of bugs. This is where regression testing fits in.
Each night, a testing sequence is run and the results e-mailed to each member of the development and QA teams. This summary focuses on identifying the tests that have failed or broken status values and only briefly summarizes the number of tests with a passed status.
As a precursor to the bulk of testing, a current source tree is retrieved from the source repository and a set of compilations is performed to generate the executables to be tested. These compilations are also performed as tests within the test harness. This means that the compilations are themselves a set of tests for the test harness, yielding captured output and a status value like any other test. A simplified executor without an imposed time limit is used for such compilations. The success, or otherwise, of such compilations can then be used as a dependency for performing subsequent testing.
As new features are implemented during development, corresponding tests are added to the test suite. As bug reports are received, tests to replicate them are also added to the test suite. Tests are only removed from the test suite when they are no longer applicable, for example because they test a feature that is no longer supported. Tests are definitely not removed just because they have been passing for a while.
The value of regression testing to non-interactive command line programs has long been recognized. Through the regime described in this article, we can see how this powerful quality assurance technique can be extended to cover programs that are interactive, graphical, and cross-platform in nature.
David Fell is the chief scientist at ANT Limited, which he founded in 1993. He has more than 15 years of experience in software design and development. He is also the author of numerous technical articles for trade publications.
Return to January 2001 ESP Index