GUI testing: exposing visual bugs -

GUI testing: exposing visual bugs

My approach to testing a graphical user interface (GUI) has always been to find the most appropriate access point to manually inject test cases. This article will discuss the challenges of trying to make GUI tests repeatable, and we'll look at a homegrown framework that allows test input to be managed.

To test a GUI, we need a framework that gives us the ability to inject test cases easily and to observe the output. Ideally, in a test system, the output can be stored, and then a subsequent test run can be compared with a previous run to provide regression testing. By regression testing, we mean that we have proven that the new version under test has not broken anything that used to work for the previous version.

If we consider a non-GUI issue like message processing, you might introduce a test set of messages and then the application's message store, or inbox, will contain those messages. If the inbox from one run is compared with the next, there are two possible outcomes. The inboxes may be identical–this means that the new functionality has not broken previous features. In some cases, you might have expected changes to result from new functionality in the system under test–in other words, the output should have changed but didn't, telling us that the new features are not working.

The second possibility is that the outputs are different. If the inbox messages can be expressed in text format, some text difference tool can show those inconsistencies and the engineer can examine them to decide if the changes match the new functionality. For example if an urgency field has been added to the messages, a message identifier might change from:

Message index:17, message title: test_185_title  

to: Message index:17, message title: test_185_title, urgency: normal .

If the only changes in the output are consistent with this, we have successfully added the urgency feature, and we have not broken anything else.

It would be great to apply the same principle to graphics, but the challenge is that the appearance of the display can not usually be expressed in a humanly readable text format. This means that it is tricky to examine differences. The output of a GUI test is the appearance of the screen, which is a large number of pixels, and each pixel has a color value. While a human being can view the screen and establish if the text is readable and the layout conforms to requirements, the job of checking if each pixel has the correct numerical value (or color) is a nontrivial task.

A screen-capture tool could record the screen and a comparison tool could simply point out if the screenshot has changed, and then allow the tester to view the old and new image. In theory this is a good approach, but it can be very labor intensive. The reason is that the number of screenshots for a GUI can be very high. Usually it is desirable to capture the screen after every user event (such as button press), or external event (some alarm condition occurred). Unfortunately one global change can cause a change to all of those screenshots. For example, changing the default font, or background color, or making the margin at the left of the screen a little wider, will cause a change in every screenshot that has been gathered.

Commercial tools are available to help automate the process just described, and while I would not discourage their use, be aware that in this context, automated testing does not mean labor-free testing. The other restriction of these tools is that they need access to the GUI's frame buffer, so a certain amount of integration between your system and the test tool is required. The amount of effort required to get the tool up and running will depend on your operating system and hardware platform.

At the other extreme, all tests could be manual, where the test document instructs the tester to press certain buttons in a certain order, and then observe the results. While this approach does not require any code to be written, it is extremely weak for a number of reasons. One is that it is very time consuming for the tester to read instructions before each button press. If the tester makes a mistake, he will most likely have to restart a sequence. At the end of a sequence, if the output is not correct the tester will be left wondering if the test failed because of a bug or because they might have pressed the wrong button without realizing it, so they may run the sequence again just to be sure.

An even bigger disadvantage is that you can not call specific functions using the method above. In some cases you may want the tester to press on a button, then simulate some external event that causes an alarm to appear on the display and then get the tester to release the on-screen button. These are the sort of scenarios that often expose bugs. But scenarios that have exact sequence or timing requirements may be impossible for a human to reproduce.

Similarly if the test requires that an on-screen event happens at an exact x,y location, the tester cannot guarantee that that acted on the exact required location, especially on a touchscreen with no visible mouse.

Finally, testing the system with no access to the code means that the test can not access internal data structures at the end of the test to check if they have the correct values. The tester might be able to observe on-screen or external state, but they have no visibility of the state changes internal to the software that occurred during the test.

Roll your own
A simple test framework can exercise test cases in your GUI. These test cases can not be fully automated, because a human observer must confirm the appearance of the GUI. They allow the tester to run them in a reasonably efficient fashion, and they make it straightforward to ensure that one test run is consistent with the previous run.

There are two levels of testing that are of concern when you build a GUI. One is the low-level graphical operations accessible via a function call, for example drawing a line in a particular color. The second type of testing is where you want to simulate events that occur on the finished product. Much of this involves simulating mouse/touch events on the screen, or simulating external events such as changes to the analog or digital input lines. We will look at each of these situations in turn.

Testing low-level graphics
In many modern GUIs, the low-level graphics primitives are provided by a graphics library, so the issue of testing this portion of the software may already be taken care of by the company providing the library. Even if you are using a well-established product, you may choose to write tests for some portions of it.

Of course, if you are creating and selling a library, then you need a test infrastructure to test the whole library, and being able to repeat the tests will be important at each release.

At you find an executable that runs a few simple tests of a button object. The executable runs on a Windows PC, and simulates the type of testing you can perform on an embedded GUI, using an RS232 port as a means of getting test information out of the system. Each step of the test performs some small change on the GUI and either the test code or the tester checks that the appropriate change has happened.

For almost all graphics projects, there are huge advantages to making the system work on a desktop computer as well as on the target. All commercial embedded GUI libraries support this arrangement. This applies to the final product code, but it also applies to the test code. The tests can be developed far more quickly on the PC with a simulated display. The completed tests are then run on the target system, which will uncover any problems that arise on the target that were not an issue on the PC. The PC simulation and the target display should be pixel-for-pixel identical in theory, but you can still get bugs that occur on the target, but not on the PC. Problems with running out of memory or timing issues are possible sources of trouble. Also the PC is likely to be driven by a mouse, while the target may use a touchscreen. This difference might disguise some problems on the PC.

Step by step
The demo test does a few very simple checks on the functionality of the button object in the system under development, but for the purposes of this example, we are not concerned with whether the button is being fully tested. We are more concerned with exploring the test harness that makes it possible to manage lists of these tests. Once the harness is in place, it provides a home for tests to be added, for new features, or in response to bugs.

How the harness is coded will depend on the communications mechanism available, but a serial port is typical. At each step some test code is run and the tester is prompted to check the display for certain properties. The example test code tests a button object, and Figure 1 shows the state at the end of the first step, which simply tests the construction of a button object.

View the full-size image

A step like this requires the tester to confirm that the button is visible and its text is readable. The test code will often check internal values. For example a call to the button's getText() function could be checked to see if it returned the string “Press me”. The advantage of these checks is that they do not require any human interaction and so do not add to the test time.

Pressing 'Return' on the RS232 interface (probably using a terminal emulator program on a PC) will advance to the next step. Figure 2 shows this display after the tester has advanced to the next step. This step modified the font used in the object that has already been created. Changing the appearance in each small step allows the tester to observe if an operation that changes the object has the desired effect.

View the full-size image

The level of detail of the instructions given to the tester will vary. One of the tricky things to check is the coordinate system. If this test places the button at position 52, 26, it's desirable to measure the distance from the top, or left, of the display to the button. Even if the pixels were large enough to be individually counted, checking 52 of them would be tedious and error prone. One approach is to get the test code to draw a horizontal line 52 pixels from the top of the screen and then observe if the line is aligned with the edge of the object under test, as shown in Figure 3 .

Of course drawing guidelines assumes that the lines will come out in the correct position. So this test would catch a positioning bug that is specific to the button, but it would not catch the case where all x positions were incorrect by 3 pixels. In practice, you will want to do some position checks, but it is too labor intensive to add guidelines for every positioned object, and so once the coordinate system is tested enough to be considered trustworthy, the testing effort can move elsewhere.

If the test is initially run on a PC, a number of options are available, that might not be possible on the target. One is to do a screen capture and paste the screen into a drawing tool. Most drawing tools will report the position of the cursor as an x,y position and so you can measure the distance, in pixels, between any two points in the captured screenshot. Figure 4 shows an example using Paint Shop Pro.

View the full-size image

Once you have done a screen capture, you also have the chance to zoom in to examine the details. This is especially useful if the screen is of high resolution and subtle details, like anti-aliasing at the edge of letters can not be easily seen with the naked (and sometimes aging) eye. Figure 5 shows how zooming in allows us an up close view of the anti-aliasing being applied. Part (a) shows no anti-aliasing. When this is rectified, part (b) shows the edges being anti-aliased. The color choices for the softening of the edges are not ideal, and this is due a limited color palette. If this is rectified in software, the changes to the edges of the letters would be quite subtle, and zooming would be vital to examine the change.

View the full-size image

A typical PC simulation copies the target screen pixel for pixel. On some projects, I have modified the simulation to allow an option of doubling the number of pixels in the horizontal and vertical directions. This effectively is a 200% zoomed view of the target and can be very useful when examining the details.

Code organization
A good test harness must make it easy to insert and remove tests. Each set of tests should be runnable independent of the last set, to allow them to be run in reasonable size chunks. For example, all of the button tests might be built as one executable, and so they can be run completely independent of the slider tests. This means that a problem with the button tests will not impact the slider tests. Within the suite of tests for button, sectioning the numbers as 1.1, 1.2, 1.3, and so forth, and then another section as 2.1, 2.2, etc., will allow a new test to be added without having to reorder every following test. In theory you could have used the rule that new tests must always be added at the end, but in practice, previous tests often set up the right state in which you can best run the new test, so the best place for a new test might be just after some closely related test that has set the right conditions.

There will often be a bit of code that must be run before or after each test–sometimes a function to refresh the display, or to flush all pending events is required. Also, during development, you want a lot of control over which tests are run and in what order. For example, if I am having trouble with test 4.16, I may want to run that test several times, making alterations each time, but I do not want to run every preceding test each time. To achieve this, I want to temporarily disable all the tests before 4.16. In some cases, I will want to keep any tests that set up the conditions required for test 4.16.

These requirements lend themselves to a structure where each test step is a function and a table of pointers to functions dictates the order in which they will be called. Commenting out some portions of the table allows a bunch of tests to be temporarily disabled. Part of the array might look like Listing 1 .

Each function name in the list represents a step of the test. Note that the ButtonClick part requires two entries in the table. This is because the harness progresses to the next step each time the tester presses return and some tests require more than one press of the return key to check all parts. In this case there is a setup step and a check-at-end step. Most of the time, it is a purely semantic issue whether such setup and check pairs are considered to be two parts of one test or whether they are considered to be two tests that have a dependency on each other.

Event checking
In the case of the button test, the reason two steps are required is that at the first step, the tester is prompted to click on (or touch) the button, and then on the second step the test code confirms if the tester did in fact click on the button. For this to work, the first step has to set up the button's event handlers in such a way that the click is recorded in some boolean flag. In the second step, the flag is checked. If it is still false, the event never registered. Either the tester did not follow the instructions, or the event handling has a bug.

If you run the demo executable, and if you do not click on the button when instructed, the next step will print a failure message. Because the event handler also prints a message to the serial port, the tester could have just observed that there was a response to the event. However checking a flag means that there is less reliance on the tester and therefore less opportunity for human error.

A similar approach should be taken to any other objects that the tester manipulates. If the tester is instructed to move a slider to its maximum, the tester can be instructed to check the on-screen value of the slider, and the test code can also query the internal stored value. Both checks are necessary to ensure the object is storing and displaying its value correctly.

The Big Picture
Most of the discussion so far has assumed that individual graphical objects are being tested. In many cases, the underlying objects are trustworthy, but the application logic needs to be tested. I generally use the same test harness to test the application, but instead of calling individual functions of the object's interface, I fake mouse/touch events or eternal events that have an impact on the appearance of the interface, such as an alarm. Other external stimulation might be a varying analog signal that is being graphed on the GUI.

Simulating the external data sometimes means replacing a function such as readAnalog() function with code that accesses a table of test data instead of the analog hardware.

Creating false button presses is sometimes more challenging than you would think. If there are 10 buttons on the display, the test code may not have pointers to those objects. The only pointer to them might be inside the window, or group, object that contains them, and since that window contains many other objects, it might not be trivial to query the window for the specific button that you require.

In some cases, the creation order of the objects is known. For example, the test author might know that the fifth object created within a window is the button the required for the test. So the test code iterates through the list of buttons contained by the window until it reaches the fifth one. This approach is not entirely satisfactory, since minor changes to the order of construction will break the test.

Some libraries will allow you to seek a button without knowing the pointer. For example the PEG library allow an identifier, which is an integer, to be associated with any object. A find() function is available in the top level window, which will recursively search all windows to locate the object with the a matching identifier. This is better than the previous approach, but still has weaknesses. One is that the identifiers are not necessarily unique, so two buttons in two different windows might have the same identifier. The second problem is that it only locates objects that are on the display. The test code may need to access an object before it becomes visible, in order to alter some of its properties.

Another approach that doesn't require pointers to the button objects, is to generate mouse/touch events at a low level, and the x, y, position of those events is set to correspond to the position of the required button. This has the drawback that changing the position of a button means that the test might not work as expected.

While I would not recommend this method for all testing, there are some cases where generating low-level mouse/touch events are ideal. If you want to test what happens when you touch the edge of the button, specifying the x,y position allows you to simulate a user's touch on an exact location that sits on the button's border. Note that this is something that would be impossible for a human tester to do on a touchscreen, since the human finger is just not that accurate. Another example of a test that suits this method is where you simulate a press-down event inside the button and a release event outside the button.

If you plan for it early in the development cycle, the buttons and other objects required for testing can be registered with the test harness, using some enumeration type to index them in a table, which will make them always accessible to the test code. This registration step is often conditionally compiled code within the application itself. Adding code to the application in order to allow the test harness to work is not ideal, since it can make the build process more complex, but in this case it may be justified if it means that we get a clean mechanism for faking button events.

At this point you might be thinking that faking button presses might be more trouble than it is worth. If the tester has to press return at the end of each step, why not let the tester press the button on the GUI and avoid the need to generate button events from the test code. In practice, it would be very difficult to get the tester to follow an exact sequence of button presses. In many cases the test may require multiple buttons for one step–the intermediate states in the GUI may already have been tested, so the goal is to check the final state that resulted from many user events.

Showing all strings
A test suite that displays all test strings is very useful, especially if the GUI is going to be translated into foreign languages. Each step of the test navigates to a different state, or changes some external condition to display a string that has not been seen before. Of course, there are many strings that get seen at multiple steps of the test. The 'OK' string in one of your buttons might be visible at almost every test step. The serial port output of each step should identify which of the strings on the display have not already been viewed on a previous step.

Once you have a test that displays all strings, you can be confident that you will catch any strings that are too long to fit in the space allocated to them on the screen. When the product has been translated into a foreign language, the same test can be run to ensure that the translated strings do not overflow the areas that were sufficient in the English version.

If all translated strings can be generated in a simple list, I usually print out that list and tick off each string as it is viewed in the test, so that at the end, I can tell at a glance if there is a translated string that did not get tested.

Memory management
Some GUI libraries use the heap and some do not. A major concern is whether use of the GUI library leads to any memory leaks. Letting a GUI run for days, and checking memory consumption occasionally, does not really stress the heap. If there is a memory leak, it is caused by the response to a particular event. So the key to testing the heap is to measure heap size and drive a sequence of events, and then at the end of that sequence, check if the heap has grown.

So the memory management test should start at some neutral state, where there are no outstanding events on the GUI. Then the test code should navigate through every screen, trigger every conceivable event, and then navigate back to its starting position. The external events may be things like alarms that cause a warning to popup on the display, or anything else that leads to GUI interaction.

When the test code arrives back at the starting state, there should be no partially processed events. Any windows that were opened should be closed. Any alarms that were raised should be cleared. At this point, the heap size should be exactly the same as it was at the start. If not, you should suspect a leak.

Be aware that in some designs objects will consume space the first time they are used. For example, the first time an alarm occurs, space might be allocated for the alarm message. All following occurrences then use that same piece of storage. So one sequence through the test will result in a heap that is bigger at the end than it was at the start. For this reason, I prefer to run this sequence twice. I ignore any heap growth on the first run of the test, but the second run should not cause any further growth.

If there is growth in the heap, this test may detect it, but it does not identify the exact cause. There is further discussion of how to pinpoint the cause of a memory leak in my “More on Memory Leaks” article.1

Logging and pacing
Many types of applications lend themselves to test-by-log-file, where a test is stimulated and as the application performs actions, they are also logged to a file, or transmitted on a serial port. Generating the log file depends on the application including a logging feature, which might be conditionally compiled. Following the test, examination of the log file will indicate if the correct actions took place. This is particularly useful if the actions are difficult to observe externally. On a GUI, by definition, almost everything is visible externally and log files are not used very much.

There are a few cases where I have found logging revealed things that would have been difficult to spot on the display. A log file that records each item drawn on the display makes it possible to see the order in which they were drawn, and also if an item was drawn twice. It is not unusual to have a bug in a GUI that leads to part of the display being updated twice. Such a bug is relatively harmless, but it is a waste of CPU cycles to draw the same items more than once. Since the final state of the GUI is the same, a tester might not spot this. However if a log file is examined and it states that certain items were drawn more than once then the problem can be identified.

Another way to detect the same style of problem is to deliberately slow down the GUI. Add a delay in some low-level drawing routine and the whole GUI will move in slow motion. You will see the display being gradually built up as different elements are drawn. If an item is drawn, then erased, and then drawn again, you will see this happen.

Another common flaw that is exposed by this technique is where one component is drawn but then completely covered by some overlapping objects. Since there was no benefit in drawing the elements that get covered, the software could be optimized to remove that step.

A second method for slowing down the GUI is to disable the clock that drives the on-screen events. Each time the tester presses return, the clock is advanced a few ticks. I have found this very useful when testing waveforms that are tracking some analog input, such as an oscilloscope display. Running at full speed, it is difficult to observe the changes from one tick to the next. By freezing the clock, the waveform can be observed as it advances from one sample to the next on each press of the return key.

Border crossings
The alignment of objects and whether they overlap is often only visible if the borders of the object are visible. In many cases those borders are deliberately invisible. For example a piece of text will often not show its bounding box. A bitmap can use the background color so that it floats on the background instead of appearing as a rectangle.

For test purposes, seeing those boundaries is important if we want to ensure objects do not overlap. Fields that vary in length, such as names and addresses entered by the user, might look OK when the data is short, but cause problems if the field is filled completely. Figure 6 illustrates this as one of the text fields will overlap with an icon if the text field is completely filled.

View the full-size image

Changing a few of the background colors in a test build will often make these issues visible. You do not want to scatter conditionally compiled background definitions throughout your code. However, if all of your color definitions are arranged in one header file, changing the definition of the most common window background may show up most of the objects of interest.

Bugs versus taste
One of the challenges of interpreting test failures is distinguishing between outright bugs and issues that might be open to interpretation or taste, such as usability issues. One tester might consider a screen to be perfect while another tester might consider it difficult to read because of font size or difficult to understand because of poorly worded text. Other usability issues such as size or position of buttons may also be open to debate. These subjective failures may be rejected by developer on the grounds that the software was still meeting its explicit requirements.

It is important to have a process in place that allows these issues to be resolved. The process might state that no usability issues are to be addressed at the test phase, or it might state that the project leader will arbitrate if the tester and developer cannot agree if a particular test has failed, or you could hand all power to the tester and allow him to fail any test where he thinks the look of the GUI might not meet customer expectations. As long as the process is understood by all sides, it can avoid endless debate between developers and testers.

First impressions, lasting impressions
Testing the visual elements of your GUI is vital because the GUI forms the customer's first impression of the product and any quality issues in the GUI will be interpreted as quality issues for the whole product. Different areas of software will always require some tweaking of the test methods employed to maximize the number of bugs caught, and hopefully some of the techniques shown here will help you track down a few extra GUI bugs.

Niall Murphy has been designing user interfaces for over 14 years. He is the author of Front Panel: Designing Software for Embedded User Interfaces. Murphy teaches and consults on building better user interfaces. He welcomes feedback and can be reached at nmurphy@ His web site is

1. Murphy, Niall. “More on Memory Leaks,” Embedded Systems Programming , April 2002, available on-line at

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.