Coordinated debugging of distributed systems
Imagine a world without a global notion of time. Now try to work out the flight direction of an airplane from the following information: an e-mail from Alice says she saw the plane about two hours after sunrise, and another e-mail from Bob says he saw it about three hours after sunrise. So Alice and Bob tell us when they saw the plane, but only from their own point of view. If they are nice, they might give us some additional information, namely their location at the moment of the observation. Unfortunately, embedded systems are usually not that nice.
Now imagine a distributed system built of networked embedded nodes. When a problem arises with the distributed application, the designer invokes a debugger to track down the faulty behavior. Specifically, the designer traces the execution of two nodes, A and B, simultaneously. The situation is similar to the plane-tracking scenario: each node reports events against its own local clock. A systemwide notion of time would obviously be helpful, which leads us to an important aspect of distributed debugging.
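The effect can be illustrated with a minimal sketch, written in Python purely for illustration. Everything in it is an assumption made for the example: the `TraceEvent` type, the `merge_traces` helper, the timestamps, and the clock offsets are hypothetical and do not come from any particular debugger. It merges the traces of nodes A and B twice, first by raw local timestamps and then after mapping both onto a common time base.

```python
from typing import NamedTuple


class TraceEvent(NamedTuple):
    node: str           # name of the node that recorded the event
    local_time_us: int  # timestamp from the node's own free-running clock
    label: str          # what happened


def merge_traces(events, offsets_us):
    """Sort events on a common time base.

    offsets_us maps a node name to the (assumed known) offset of its
    local clock relative to a reference clock, in microseconds:
    global_time = local_time - offset.
    """
    return sorted(events, key=lambda e: e.local_time_us - offsets_us[e.node])


# Hypothetical traces: node A sends a request, node B receives it.
events = [
    TraceEvent("A", 1_000, "send request"),
    TraceEvent("B", 400, "receive request"),
]

# No common time base: every clock is treated as if it were the reference.
naive = merge_traces(events, {"A": 0, "B": 0})

# Assumed offsets: B's clock runs 900 us behind the reference clock.
corrected = merge_traces(events, {"A": 0, "B": -900})

for title, trace in (("naive", naive), ("corrected", corrected)):
    print(title, [f"{e.node}: {e.label}" for e in trace])
```

With raw local timestamps the receive on B appears to happen before the send on A, which is causally impossible. Once the assumed offsets are applied, the send precedes the receive. In a real system the offsets are not known in advance, which is exactly why a systemwide notion of time is needed.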
State-of-the-art embedded systems debugging
To ease the debugging of modern microcontrollers and complex systems-on-chip (SoCs), support for test and debugging is routinely built into silicon. Today, many debugging approaches rely on offline debugging based on trace buffers added to the CPU, which reduces the intrusiveness of the debug system. Leading processor-core vendors offer on-chip trace solutions.
However, existing debugging and test tools focus mainly on single nodes with one or more CPUs on-board or on-chip (SoC), accessed through auxiliary debug interfaces such as JTAG or a simple UART. The problem with these approaches is that they entirely neglect the distributed nature of many applications: connecting a monitoring computer directly to each node is impractical, especially if the nodes are already embedded in their place of installation (see Figure 1a).
Wouldn't it be nice to precisely coordinate debug, test, trace, and replay activities across the entire distributed system without any auxiliary interface or special cabling? Moreover, it would be helpful if only a single computer, acting as debugging master and monitor, had to be attached to the network to issue debugging, test, or monitoring actions. Such an approach, which greatly simplifies debugging and testing of distributed systems, is shown in Figure 1b.