Transporting bugs with virtual checkpoints

One of the most challenging problems in software debugging is to correctly and reliably reproduce a problem found by someone other than the software developer.

Typically, test departments and software users have to type long and brittle “instructions to reproduce the error” in bug tracking systems, along with a list of the version of the software that contained the bug and anything else in the software environment that seems relevant.

More often than not, the developer has to iterate a series of questions with the reporter to get more information about the system where the bug manifested itself and the precise steps taken to cause the bug to trigger.

Such iterations can take days for a globally distributed development effort. The questions are often hard for the reporter to answer in the way the developer wants. In some cases, developers can do remote logins to the failing system in order to get their hands on the precise failing setup; but more often than not this is not possible due to security restrictions or the fact that the failing system is no longer available or has been used for some other important test.

For embedded systems, the problem is compounded by the limited availability of the precise hardware needed to run the software and reproduce a bug. In the worst (but fairly common) case, both the reporter and the developer fail to reproduce the bug, making it a glitch that will never get properly fixed (quite possibly impacting an important customer as soon as the software is released).

Virtual platform checkpoints to the rescue
Developers can overcome the issues of reliable bug reproduction and communication of the target state using virtual platform checkpoints. By using a virtual platform to conduct software development and test [1], any bug can be captured, communicated, and reproduced any number of times, in any location.

As shown in Figure 1 below, the basic concept is as follows. The bug reporter runs the target software (operating system and custom applications) using a virtual platform rather than physical hardware.

When a bug occurs, the reporter saves a checkpoint (R) of the combined hardware and software state and sends the checkpoint to the software developer. Opening the checkpoint using the same virtual platform, the developer is able to reproduce the bug as well as investigate the target system state for clues as to what went wrong. There is no need for the developer to get back to the reporter to gather more facts, as everything is encapsulated within the checkpoint.

Figure 1: Following the bug with checkpoints
Virtual platforms are becoming a standard tool in the development of embedded systems. Modern virtual platforms are fast and scalable and can run the target system fast enough to replace or complement physical development platforms [2]. Virtual platforms reduce risks and shorten time-to-market for embedded systems. They decouple hardware and software development and offer debug, test, and development environments superior to physical hardware systems.

Bug Reporting with Checkpoints
A virtual platform with checkpoint capability can store the complete state of the virtual platform to the host computer disk [2]. When the checkpoint is later loaded into the virtual platform, the exact same target system state results. The checkpoint includes the hardware setup (boards, networks, plug-in cards, and other configuration aspects), hardware state, and software state.

Specifically, it contains the contents of memories and disks, the state of processor registers, memory management units (MMUs), peripheral devices, and network connections. It also stores some core platform states, such as the current time and events queued for later execution, allowing the virtual platform to continue its execution seamlessly from a checkpoint.
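To make the idea of "complete state" concrete, here is a minimal sketch in Python. It is a toy model, not the checkpoint format of any real virtual platform: the `ToyMachine` class, its fields, and the `save_checkpoint` and `load_checkpoint` helpers are all hypothetical names chosen for illustration. The point is only that once every piece of state (registers, memory, virtual time, queued events) is serialized, loading the result yields a machine that compares equal to the original.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class ToyMachine:
    """Hypothetical stand-in for the target state held by a virtual platform."""
    registers: dict = field(default_factory=lambda: {"pc": 0, "sp": 0x8000})
    memory: dict = field(default_factory=dict)           # address -> byte value
    time_cycles: int = 0                                  # current virtual time
    pending_events: list = field(default_factory=list)   # [fire_time, name] pairs


def save_checkpoint(machine: ToyMachine) -> str:
    """Serialize the complete machine state to a JSON string."""
    return json.dumps(asdict(machine), sort_keys=True)


def load_checkpoint(blob: str) -> ToyMachine:
    """Rebuild a machine whose state equals the one that was saved."""
    raw = json.loads(blob)
    # JSON object keys are always strings, so convert addresses back to int.
    raw["memory"] = {int(addr): val for addr, val in raw["memory"].items()}
    return ToyMachine(**raw)


# Reporter side: the system is in some interesting state when the bug shows up.
reporter = ToyMachine()
reporter.registers["pc"] = 0x1234
reporter.memory[0x2000] = 0xFF
reporter.time_cycles = 987_654
reporter.pending_events.append([1_000_000, "timer_irq"])

blob = save_checkpoint(reporter)     # what would be attached to the bug report
developer = load_checkpoint(blob)    # what the developer opens

print(developer == reporter)
```

Note that the queued event and the virtual time travel with the checkpoint. Leaving either out would give the developer a machine that looks right but continues differently.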

For this kind of use, it is best that the virtual platform chosen is designed to be a deterministic and repeatable simulator [3]. Each time a checkpoint is opened, the target system will execute in the exact same way. The execution remains perfectly repeatable even as a developer reruns the bug, adds debug probes and traces in the simulator, sets breakpoints, and stops, single-steps, and reverses the execution.

Repeatability applies equally to single-processor systems, shared-memory multi-core systems, and distributed multi-board systems. All parts of the virtual system stop and run in synchronization. Single-stepping interrupt routines and code in a multiprocessor does not affect the system execution.
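The repeatability property can be illustrated with a small discrete-event simulation in Python. This is a sketch under simplifying assumptions, not how any particular simulator is built: `ToySim`, its `post` and `step` methods, and the `run_from` helper are hypothetical. It relies on two design rules that a deterministic simulator plausibly needs: every event delay is derived from simulated state only (never from host time or host randomness), and events that fire at the same virtual time are ordered by a fixed tie-breaker. A probe can observe each step but cannot change the state.

```python
import copy
import heapq


class ToySim:
    """Hypothetical discrete-event simulator; all names are illustrative."""

    def __init__(self):
        self.now = 0
        self.queue = []     # heap of (fire_time, sequence_no, device_name)
        self.seq = 0        # tie-breaker: equal-time events keep a fixed order
        self.counter = 0    # stands in for target software state

    def post(self, delay, device):
        heapq.heappush(self.queue, (self.now + delay, self.seq, device))
        self.seq += 1

    def step(self, probe=None):
        """Run exactly one event. A probe may observe but never mutates state."""
        self.now, _, device = heapq.heappop(self.queue)
        self.counter = (self.counter * 31 + len(device) + self.now) % 65521
        if probe is not None:
            probe(self.now, device, self.counter)
        # Each device re-arms itself with a delay derived only from simulated state.
        self.post(1 + self.counter % 7, device)


def run_from(checkpoint, steps, probe=None):
    """Open a checkpoint (without altering it) and return the execution trace."""
    sim = copy.deepcopy(checkpoint)
    trace = []
    for _ in range(steps):
        sim.step(probe)
        trace.append((sim.now, sim.counter))
    return trace


# Build up some state, then take a checkpoint.
sim = ToySim()
sim.post(3, "uart")
sim.post(3, "timer")
for _ in range(10):
    sim.step()
checkpoint = copy.deepcopy(sim)

seen = []
first = run_from(checkpoint, 50)
second = run_from(checkpoint, 50, probe=lambda *event: seen.append(event))

print(first == second)   # same trace, with or without the probe attached
```

If `step` consulted the host clock or an unseeded random source, the two traces would diverge, which is the kind of nondeterminism a simulator intended for checkpoint-based bug transport has to avoid.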

Checkpointing and repeatable execution thus achieve perfect bug reproduction by any person, at any point in time, anywhere in the world, irrespective of the target hardware needed to run the software of interest. Additionally, virtual platforms make target hardware availability a nonissue.

There is a virtually infinite supply of every type of board, with no need to physically ship hardware around. The checkpoint contains the information necessary to give the developer a perfect copy of the hardware setup used by the bug reporter.

Automatic Testing and Checkpointing Bugs
Finally, let us walk through a more complex scenario where we put bug reporting into a larger context. The flow is illustrated in Figure 2 below.

Figure 2: Distributing checkpoints virtually
We start with a platform team that creates the fundamental software that is used by the developer’s software. This team configures a virtual platform and loads the platform software on it, boots it, and takes a checkpoint after the boot is finished (P). This checkpoint is distributed to developers, testers, and other users.

The developer can use this checkpoint to load and test software. The reporter (who in this case resides in the testing department) takes the developer’s software and adds some testing components and configures the target hardware some more. For example, the reporter might add some application-specific boards and specific test driver software to the target.

Once this setup is complete, another checkpoint is saved (R0). This is then used as the starting point for several parallel test runs of the system (including all software and hardware configuration performed by the platform team and the reporter). While the tests are running, checkpoints are regularly saved and stimuli recorded. When test Q hits a bug, we send the checkpoint of the failing system (RQn) and the recorded stimuli back to the developer.
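The test-run loop described above can be sketched in a few lines of Python. Everything here is a hypothetical toy: the `Target` class, its overflow "bug", and the `run_test` and `replay` helpers are assumptions made for illustration and do not correspond to any real tool. The sketch saves a checkpoint every few steps, records the stimuli applied since the most recent checkpoint, and packages both when the bug triggers. The developer side then opens the checkpoint and feeds it the recorded stimuli.

```python
import copy


class Target:
    """Hypothetical system under test: accumulates input and fails on overflow."""

    def __init__(self):
        self.step = 0
        self.buffer_level = 0

    def apply(self, stimulus):
        self.step += 1
        self.buffer_level = max(0, self.buffer_level + stimulus)
        if self.buffer_level > 20:
            raise OverflowError(f"buffer overflow at step {self.step}")


def run_test(start, stimuli, checkpoint_every=4):
    """Reporter side: run, checkpoint regularly, record stimuli since the last checkpoint."""
    target = copy.deepcopy(start)
    last_checkpoint, recorded = copy.deepcopy(target), []
    for stimulus in stimuli:
        recorded.append(stimulus)
        try:
            target.apply(stimulus)
        except OverflowError as bug:
            return {"checkpoint": last_checkpoint, "stimuli": recorded, "error": str(bug)}
        if target.step % checkpoint_every == 0:
            last_checkpoint, recorded = copy.deepcopy(target), []
    return None  # the test passed, nothing to report


def replay(report):
    """Developer side: open the checkpoint and feed it the recorded stimuli."""
    target = copy.deepcopy(report["checkpoint"])
    try:
        for stimulus in report["stimuli"]:
            target.apply(stimulus)
    except OverflowError as bug:
        return str(bug)
    return None


r0 = Target()   # plays the role of checkpoint R0
report = run_test(r0, [5, -2, 7, 3, 4, -1, 6, 2, 9])

print(report["error"])
print(replay(report) == report["error"])
```

Because the report starts from the last checkpoint rather than from R0, the developer only replays the final few steps before the failure instead of the whole test.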

Since checkpoints can be incremental, there can be quite a few checkpoints saved during the execution of test Q. To simplify the package to be sent back to the developer, we apply a checkpoint merge operation before sending the bug report.

The merge combines all the state changes in a chain of checkpoints into a single checkpoint. In this example, the combined checkpoint (R) would still depend on checkpoint (P), since the developer already has that from the platform team and there is no need to duplicate information.
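The merge operation can be sketched as follows, assuming a deliberately simple model in which each incremental checkpoint stores only the key/value state that changed since its parent. The `Checkpoint` class and the `full_state` and `merge` functions are hypothetical, and the sketch ignores state that is deleted between checkpoints, which a real implementation would have to handle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkpoint:
    name: str
    changes: dict                            # only the state changed since parent
    parent: Optional["Checkpoint"] = None


def full_state(cp):
    """Resolve a checkpoint by applying deltas from the root down to cp."""
    chain = []
    while cp is not None:
        chain.append(cp)
        cp = cp.parent
    state = {}
    for link in reversed(chain):
        state.update(link.changes)
    return state


def merge(tip, base, name):
    """Collapse every delta after base, up to and including tip, into one delta on base."""
    chain = []
    cp = tip
    while cp is not base:
        if cp is None:
            raise ValueError("base is not an ancestor of tip")
        chain.append(cp)
        cp = cp.parent
    merged = {}
    for link in reversed(chain):     # oldest first, so later changes win
        merged.update(link.changes)
    return Checkpoint(name, merged, parent=base)


# The chain from the scenario: P <- R0 <- RQ1 <- RQ2
P = Checkpoint("P", {"kernel": "booted", "disk": "platform-image"})
R0 = Checkpoint("R0", {"test_driver": "loaded", "boards": 3}, parent=P)
RQ1 = Checkpoint("RQ1", {"packets_sent": 100}, parent=R0)
RQ2 = Checkpoint("RQ2", {"packets_sent": 250, "error_flag": True}, parent=RQ1)

R = merge(RQ2, base=P, name="R")

print(R.parent.name)
print(full_state(R) == full_state(RQ2))
```

The merged checkpoint R carries only what changed after P, so the package sent to the developer stays small, yet resolving R on top of P gives the same full state as the tip of the original chain.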

Figure 3: Merging virtual checkpoint state changes
The final situation for the developer is shown in Figure 3 above. There is no need to redo the system boot or loading. The developer can replay the final steps of the system execution leading up to the bug and investigate the complete system state.

Jakob Engblom is Technical Marketing Manager for Wind River Simics, a full system simulator used by software developers to simulate any target hardware from a single processor to large, complex, and connected electronic systems. He has been working on Simics since 2002, and today works on product planning and how to apply Simics to customer problems. He holds a PhD in real-time systems from Uppsala University and an MSc in computer science, also from Uppsala University. He has written and presented more than 100 articles and talks on a variety of embedded systems topics since 1997.

[1] Full system simulation from embedded to high performance systems, by Jakob Engblom, Daniel Aarno, and Bengt Werner, in “Processor and SoC simulation,” Chapter 3, Rainer Leupers and Olivier Temam (eds.), Springer Verlag, 2010.
[2] Checkpoint and restore for SystemC models, by Màrius Monton, Jakob Engblom, Christian Schröder, Jordi Carrabina, and Mark Burton, publishing soon.
[3] Fixing an Intermittent Multi-core Bug with Wind River Simics, Wind River white paper.
