Debug and Test In Distributed Systems - Embedded.com

As embedded designs move from traditional, deeply embedded designs with little or no connection to the outside world to open environments with some degree of cooperation among embedded devices, a number of new challenges are emerging. The most serious relate to debugging and testing both hardware and software.

A connected environment is one in which, at the very least, there is a regular connection to an outside system that monitors the device and may even modify its code. At the far extreme are the distributed computing environments proposed for cooperating smart devices: Sun Microsystems' Java-based Jini and Microsoft Corp.'s Universal Plug and Play, which is based on Internet and Web standards. In these environments, several questions must be resolved concerning real-time operation, determinism, and the ability to debug and test not only each device in such a network but the network as a whole.

Consider an array of cooperating devices that share resources. First, how real-time can such a network be? Second, how deterministic and predictable can it be? In some of the articles I have read, the authors sidestep or downplay these issues by assuming that the environments where such networks would be used, such as home networks, will not require the determinism or real-time operation of traditional embedded applications.

But how does a developer monitor, test, and debug in such distributed computing environments? When a program or set of programs is distributed, debugging, monitoring, and testing grow more difficult.

At its heart a distributed system of cooperating devices is a collection of processes working together to accomplish a task. Each process is often a deterministic program able to execute separately from, and concurrently with, other processes that may or may not be distributed throughout the collection of cooperating devices.

From a testing and debugging point of view, one primary problem is that a distributed network of devices has many foci of control. Sequential monitoring and debugging techniques, such as tracing and breakpoints based on program counters and process states, must be extended and redefined if they are to remain useful.
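One well-known way to extend tracing to many foci of control is to stamp every traced event with a vector clock, so that traces collected from several devices can be merged into a causal (happened-before) order without any shared physical clock. The Python sketch below illustrates the idea; it is not from the article, the class and function names are hypothetical, and it assumes a fixed, known number of processes `n`.

```python
class VectorClock:
    """Vector clock for one of n processes (a sketch; n is fixed and known)."""

    def __init__(self, pid, n):
        self.pid = pid          # index of this process, 0 <= pid < n
        self.clock = [0] * n    # one logical counter per process

    def local_event(self):
        # Any traced event advances only this process's own entry.
        self.clock[self.pid] += 1
        return tuple(self.clock)

    def send(self):
        # Sending is itself an event; the returned stamp travels with the message.
        return self.local_event()

    def receive(self, msg_stamp):
        # Merge what the sender knew (element-wise max), then count the receive.
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_stamp)]
        return self.local_event()


def happened_before(a, b):
    """True if the event stamped a causally precedes the event stamped b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    """True if neither event could have influenced the other."""
    return not happened_before(a, b) and not happened_before(b, a)


# Two devices: P0 sends a message to P1, then does more local work.
p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
sent = p0.send()          # (1, 0)
got = p1.receive(sent)    # (1, 1)
later = p0.local_event()  # (2, 0)
print(happened_before(sent, got), concurrent(later, got))  # True True
```

The point of the example is the last line: the receive is provably ordered after the send, while P0's later work and P1's receive are genuinely concurrent, so no single "correct" interleaving of them exists for a debugger to display.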

A second problem is that communication delays among the cooperating devices make it difficult to determine the networked system's state at any given time. A third problem is that the kinds of confederations being proposed appear to be inherently asynchronous, and therefore nondeterministic. Two executions of the same program, even with the same inputs, can produce different orderings of events, each of which may be valid when it occurs. As a result, it is difficult to reproduce errors and to test situations that are possible but unlikely.
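This nondeterminism is why record-and-replay is a common debugging aid: each nondeterministic choice made during a run (here, which in-flight message is delivered next) is logged, and later runs are forced to make the same choices. The Python sketch below is a toy, single-process simulation under that assumption; the `deliver` function and its `(sender, payload)` message format are hypothetical, and a real system would also have to capture timing, interrupts, and other external inputs.

```python
import random


def deliver(messages, seed=None, schedule=None):
    """Deliver pending messages in an arbitrary order, or replay a recorded one.

    messages: list of (sender, payload) tuples in flight (hypothetical format).
    seed:     drives the pseudo-random choice of which message arrives next,
              standing in for real network timing during a "recorded" run.
    schedule: a list of choices logged by an earlier run; when given, those
              choices are replayed exactly and the random generator is unused.
    Returns (delivered_in_order, choices) so the choices can be saved.
    """
    pending = list(messages)
    delivered, choices = [], []
    rng = random.Random(seed)
    while pending:
        step = len(choices)
        if schedule is not None:
            idx = schedule[step]                 # replay: follow the log
        else:
            idx = rng.randrange(len(pending))    # record: any pending message
        choices.append(idx)
        delivered.append(pending.pop(idx))
    return delivered, choices


msgs = [("A", "config"), ("B", "start"), ("C", "stop")]
first, log = deliver(msgs, seed=42)      # a run whose order we want to keep
again, _ = deliver(msgs, schedule=log)   # forced to the same interleaving
print(again == first)  # True
```

Logging the choice indices rather than the messages themselves keeps the log small; the trade-off is that replay only works if the set of pending messages evolves identically, which is exactly what forcing the same choices guarantees in this toy model.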

Fourth, there is the “uncertainty principle” (often called the probe effect) as it applies to distributed systems. Introducing tools, methods, and monitors to determine whether a system is operating correctly changes the nature of the system and thus the results. A distributed system reacts to such outside disturbances very differently from a traditional embedded system. In the latter, a sequential program is not significantly affected by the elapsed time between two successive instructions, such as the delay a symbolic debugger introduces, and a debugger can halt a sequential process at a breakpoint without affecting its later execution. Things are much different in a distributed system, where stopping or slowing down one process may alter the behavior of the entire system.
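A common way to reduce, though not eliminate, this intrusion is to replace stop-the-world breakpoints with cheap event logging into a fixed-size buffer that is read out afterward. The Python sketch below shows the idea; `TraceBuffer` and its default capacity are illustrative assumptions, and on a real target the buffer would more likely be a preallocated array in RAM written by the firmware and drained over a debug link.

```python
import time
from collections import deque


class TraceBuffer:
    """Fixed-size in-memory trace; the oldest entries are overwritten.

    Recording an event is a cheap append rather than a halt, so it perturbs
    timing far less than a breakpoint would. It does not remove the probe
    effect entirely: the append itself still costs time.
    """

    def __init__(self, capacity=1024):
        self._events = deque(maxlen=capacity)

    def record(self, tag, value=None):
        # time.monotonic_ns() is unaffected by wall-clock adjustments.
        self._events.append((time.monotonic_ns(), tag, value))

    def dump(self):
        # Read out after the fact instead of stopping the process while it
        # is interacting with its peers.
        return list(self._events)


trace = TraceBuffer(capacity=3)
for i in range(5):
    trace.record("tick", i)
print([value for _, _, value in trace.dump()])  # [2, 3, 4]
```

Keeping only the most recent events bounds memory, at the cost of losing history; that is usually acceptable because the events just before a failure tend to be the interesting ones.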

One possible source of answers lies in the control networks used in many embedded industrial control applications. There, engineers have developed workarounds and ways of limiting such uncertainties, making it possible to run real-time, deterministic functions on token-ring networks designed for the purpose.
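Much of the determinism of token-passing networks comes from a bound that can be computed in advance: a station never waits longer than the time it takes every other station to use its maximum token-holding time, plus one trip of the token around the ring. The Python function below computes that bound under a deliberately simplified model; the function name, parameters, and model are assumptions for illustration, and real protocols add priority, frame-overhead, and error-recovery terms.

```python
def worst_case_token_wait_us(n_stations, max_hold_us, ring_latency_us):
    """Upper bound, in microseconds, on how long a station waits for the token.

    Simplified model (an assumption, not taken from any particular standard):
    the station has just passed the token on, each of the other n_stations - 1
    stations then holds it for the full max_hold_us, and the token needs
    ring_latency_us to travel once around the ring.
    """
    if n_stations < 1 or max_hold_us < 0 or ring_latency_us < 0:
        raise ValueError("invalid ring parameters")
    return (n_stations - 1) * max_hold_us + ring_latency_us


# 10 stations, 2,000 us maximum holding time, 100 us ring latency
print(worst_case_token_wait_us(10, 2000, 100))  # 18100
```

Because every term is fixed at configuration time, the bound holds regardless of traffic, which is the property a contention-based scheme with random backoff cannot offer in the same form.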

Even as more probabilistic and asynchronous protocols such as Ethernet and the Internet's packet-based protocols have come into wider use in traditional embedded markets, engineers have found that by changing the boundary conditions in which such networks operate they can be made much more deterministic and predictable.

But other questions arise. How well do the solutions developed for control networks, if any, scale to the new environment of net-centric computing? Even in a home network based on megabit-per-second wireless, power-line, or IEEE 1394 schemes, the data rates far exceed those on most control networks. Moreover, control networks are designed to transmit control information and small amounts of data accurately, whereas in the new home network environment, control transmissions and the small amounts of data they involve share the same connections as multimedia streams of audio and video. In 5-GHz wireless home networks, these problems are addressed with several techniques, including frequency hopping. However, such solutions are highly specific to a particular application and might not be broadly applicable.

At the other end of the spectrum, in the large, sophisticated multiprocessor systems that researchers have studied for many years, a whole range of techniques has been developed to ensure deterministic operation of a network of processors and to test and debug it reliably. These techniques, such as interactive control and state re-creation, runtime protocol checking, and event abstraction, depend on languages and tools that are unfamiliar to the average engineer.
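Of these techniques, runtime protocol checking is perhaps the easiest to picture: each message observed on the network is fed through a finite-state model of the protocol, and the first message the model does not allow is flagged. The Python sketch below uses a made-up three-message protocol (`OPEN`, `DATA`, `CLOSE`); the protocol, its state names, and `check_trace` are all hypothetical stand-ins for whatever the real devices exchange.

```python
# Hypothetical protocol: a session must be OPENed before DATA flows and
# must be CLOSEd before the trace ends. Any transition not listed is illegal.
TRANSITIONS = {
    ("idle", "OPEN"): "open",
    ("open", "DATA"): "open",
    ("open", "CLOSE"): "idle",
}


def check_trace(events, start="idle", accepting=("idle",)):
    """Run a captured message trace through the protocol state machine.

    Returns (ok, index, state). index is the position of the first illegal
    message, len(events) if the trace ended in a non-accepting state, or
    None if the trace conforms.
    """
    state = start
    for i, event in enumerate(events):
        nxt = TRANSITIONS.get((state, event))
        if nxt is None:
            return False, i, state   # violation: report where and in what state
        state = nxt
    if state not in accepting:
        return False, len(events), state  # e.g. a session left open
    return True, None, state


print(check_trace(["OPEN", "DATA", "CLOSE"]))  # (True, None, 'idle')
print(check_trace(["OPEN", "CLOSE", "DATA"]))  # (False, 2, 'idle')
```

Run against a trace captured by a passive network monitor rather than inside the devices themselves, a checker like this adds little intrusion of its own.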

Such questions must be answered if the federated networks of distributed smart devices that Sun and Microsoft propose are to work deterministically and reliably.