Doing design and debug on real-time distributed applications

Real-time system designers and embedded software developers are very familiar with the tools and techniques for designing, developing and debugging standalone or loosely coupled embedded systems. UML may be used at the design stage, an IDE during development, and debuggers and logic analyzers (amongst other tools) at the integration and debug phases.

However, as connectivity between embedded systems becomes the norm, what used to be a few nodes connected together, with clear functional separation between the applications on each node, is now often tens or hundreds of nodes with logical applications spread across them.

In fact, such distributed systems are becoming increasingly heterogeneous in terms of both operating systems and executing processors, with tight connectivity between real-time and enterprise systems becoming the norm.

This article will identify the issues of real-time distributed system development and discuss how development platforms and tools have to evolve to address this challenging new environment.

The idea of a 'platform' for development has long pervaded the real-time embedded design space as a means to define the application development environment separately from the underlying (and often very complex) real-time hardware, protocol stacks and device drivers.

Much as the OS evolved to provide the fundamental building blocks of standalone system-development platforms, real-time middleware has evolved to address the distributed-systems development challenges of real-time network performance, scalability and heterogeneous processor and operating system support.

And as has already happened in the evolution of the standard real-time operating system, new tools are becoming available to support development, debug and maintenance of the target environment: in this case, real-time applications in large distributed systems.

The Distributed-System Development Platform
From the individual application developer's perspective, there are three basic capabilities which must be provided by an application development platform when a logical application spans multiple networked computers:

1. Communication between threads of execution
2. Synchronization of events
3. Controlled latency and efficient use of the network resources

Communication and synchronization are fairly obvious distributed platform service requirements and are analogous to the services provided by an OS. However, for distributed applications they have to run transparently across a network infrastructure of heterogeneous OSs and processors, with all that implies in terms of byte ordering and data representation formats.
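To make the byte-ordering point concrete, here is a minimal sketch of a sample serialized with an explicitly agreed byte order. The wire layout and the function names are assumptions made for illustration; this is not the encoding that DDS or any other particular middleware uses.

```python
import struct
from typing import Tuple

# Hypothetical wire layout, for illustration only: a 32-bit unsigned sensor id
# followed by a 64-bit IEEE 754 float. "!" selects big-endian ("network")
# byte order with no padding, regardless of the host processor.
SAMPLE_FORMAT = "!Id"


def encode_sample(sensor_id: int, value: float) -> bytes:
    """Serialize one sample into the agreed 12-byte wire layout."""
    return struct.pack(SAMPLE_FORMAT, sensor_id, value)


def decode_sample(payload: bytes) -> Tuple[int, float]:
    """Recover the sample; the result is the same on any receiving node."""
    sensor_id, value = struct.unpack(SAMPLE_FORMAT, payload)
    return sensor_id, value


wire = encode_sample(7, 21.5)

# The same id packed in little-endian order has its bytes reversed, which is
# the mismatch a raw memory copy between unlike processors would expose.
little_endian_id = struct.pack("<I", 7)
```

Middleware typically generates this kind of marshalling code for you, so the application never handles raw byte order directly.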

The platform should ideally use a mechanism that does not require the developer to have an explicit understanding of the location of the intended receiver of a message or synchronizing thread, so that the network can be treated as a single target system from an application development perspective.
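One common way to get this location independence is to address data by topic name rather than by receiver. The sketch below is an in-process stand-in, with a hypothetical `TopicBus` class and topic name; real middleware would perform the same matching across the network through a discovery mechanism.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class TopicBus:
    """Toy publish-subscribe bus: senders name a topic, never a receiver."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = (
            defaultdict(list)
        )

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        """Register interest in a topic; the publisher is never told who."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, sample: Any) -> int:
        """Deliver a sample to every current subscriber; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(sample)
        return len(callbacks)


bus = TopicBus()
received: List[Any] = []
bus.subscribe("engine/temperature", received.append)

delivered = bus.publish("engine/temperature", 88.5)
undelivered = bus.publish("engine/pressure", 2.1)  # nobody is listening
```

The publishing code is identical whether the subscriber is in the same process or, with real middleware, on another node.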

Typically a user will use commercial or home-grown middleware to provide these key capabilities. There are several middleware solutions which support this approach, such as JMS and the Data Distribution Service (DDS) from the Object Management Group (OMG).

Figure 1. DDS provides a framework for providing controlled latency and efficient use of target network resources.

But only solutions such as DDS (Figure 1, above) explicitly address the third point: controlled latency and efficient use of (target) network resources, which is a critical issue in real-time applications. DDS provides messaging and synchronization similar to JMS, but additionally incorporates a mechanism called Quality of Service (QoS).

QoS brings to the application level the means to explicitly define the level of service (priority, performance, reliability, etc.) required between an originator of a message or synchronization request and the recipient.

DDS treats the target network somewhat like a state machine, recognizing that real-time systems are data driven and it's the arrival, movement, transition and consumption of data that fundamentally defines the operation of a real-time system.

Some data is critical and needs to be obtained and processed within controlled/fixed latencies, most especially across the network. Moreover, some data needs to be persisted for defined periods of time so it can be used in computation; other data may need to be reliably delivered but is less time critical. QoS facilitates all these requirements and more.

Perhaps the greatest advantage of using middleware isn't often appreciated until late in the application development process: defining interfaces in a rich middleware format makes it much easier to integrate, debug and maintain a system. What good middleware does is allow you to completely specify the data interaction through quality of service, which forms a “contract” for the application.

DDS, for example, allows a data source to specify not only the data type but also whether the data is sent with a “send once” or “retry until” semantic, how big a history to store for late-arriving receivers, the priority of this source as compared to others, the minimum rate at which the data will be sent, and many more possibilities.
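The "history for late-arriving receivers" setting is easy to picture with a bounded cache. The sketch below is loosely modeled on that idea only; `HistoryWriter`, `attach` and `depth` are hypothetical names, not the DDS API.

```python
from collections import deque
from typing import Callable, Deque, List


class HistoryWriter:
    """Toy data source that keeps the last `depth` samples for late joiners."""

    def __init__(self, depth: int) -> None:
        # A deque with maxlen silently discards the oldest sample when full.
        self._history: Deque[str] = deque(maxlen=depth)
        self._readers: List[Callable[[str], None]] = []

    def write(self, sample: str) -> None:
        """Store the sample and forward it to every attached reader."""
        self._history.append(sample)
        for reader in self._readers:
            reader(sample)

    def attach(self, reader: Callable[[str], None]) -> None:
        """Replay the stored history to a new reader, then go live."""
        for sample in self._history:
            reader(sample)
        self._readers.append(reader)


writer = HistoryWriter(depth=2)
for sample in ("a", "b", "c"):
    writer.write(sample)

seen: List[str] = []
writer.attach(seen.append)  # joins late: receives only the last two samples
writer.write("d")           # then receives live data as usual
```

Choosing the depth is a contract decision: a larger history lets a late receiver reconstruct more state, at the cost of memory on the sending node.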

By setting these QoS parameters explicitly, many of the soft issues that creep up in integration can be addressed quickly by matching promised behavior to that requested. DDS middleware will even provide warnings at runtime when contracts aren't met.
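Matching promised behavior against requested behavior can be sketched as a simple comparison. The example assumes a heavily simplified model with three policies and hypothetical names (`Qos`, `find_mismatches`); the matching rules in a real DDS implementation cover many more policies and are defined by the specification.

```python
from dataclasses import dataclass
from typing import List

# Higher rank means a stronger guarantee.
RELIABILITY_RANK = {"BEST_EFFORT": 0, "RELIABLE": 1}


@dataclass(frozen=True)
class Qos:
    reliability: str   # "BEST_EFFORT" or "RELIABLE"
    deadline_s: float  # maximum time between samples, in seconds
    ownership: str     # "SHARED" or "EXCLUSIVE"


def find_mismatches(offered: Qos, requested: Qos) -> List[str]:
    """Return the names of policies where the offer falls short of the request."""
    problems: List[str] = []
    # The source must promise at least the reliability the receiver asks for.
    if RELIABILITY_RANK[offered.reliability] < RELIABILITY_RANK[requested.reliability]:
        problems.append("reliability")
    # The source must promise to send at least as often as the receiver expects.
    if offered.deadline_s > requested.deadline_s:
        problems.append("deadline")
    # Ownership has no ordering: both sides must agree exactly.
    if offered.ownership != requested.ownership:
        problems.append("ownership")
    return problems


source = Qos(reliability="BEST_EFFORT", deadline_s=0.1, ownership="SHARED")
receiver = Qos(reliability="RELIABLE", deadline_s=0.05, ownership="EXCLUSIVE")

mismatches = find_mismatches(source, receiver)
compatible = find_mismatches(
    Qos("RELIABLE", 0.01, "SHARED"), Qos("BEST_EFFORT", 0.05, "SHARED")
)
```

Running a check like this at design time, or reading the equivalent warning from the middleware at runtime, turns a vague "the data isn't arriving" into a named policy that two teams can agree to fix.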

The Distributed System Tools Challenges
A development platform isn't complete until it has the tools to support the environment throughout the application lifecycle. Ask any support or sustaining engineer and they will tell you that they need three things: good documentation, great tools, and code written to expose the state and event parameters as easily as possible.

Provided that a clear interface definition language between the networked application nodes is used, current toolchains that operate on a single node are still quite useful in running down memory, code correctness and performance problems and, in some cases, can be used for white box testing.

The new challenge for developers is the isolation, identification and correction of the problems that are exhibited at the integration stage, when individual distributed subcomponents are connected and the networked subcomponents start, for the first time, to execute as a large integrated application.

Most engineers are familiar with debugging within a single-board environment, and will have developed a high degree of debug competence in fixing “hard faults”, i.e. faults that halt or crash the process.

These are relatively easy to debug because you can normally work backwards from the state of the crash or, if you are really lucky, you can get it to crash in a debugger and you are home free!

The nastiest hard faults to debug are normally multithreading related, so it should come as no surprise that as we move to larger, more complex distributed systems you will see more and more of these types of faults; every node will have its own thread(s) of execution, potentially working at the same time on the same data received from across the distributed system architecture.

Distributed systems are also much more likely to be subject to numerous types of “soft faults”. In these cases, no application crashes, but the warning lights are flashing and the distributed application either performs poorly or not at all.

There are numerous types of soft faults, but many of them come down to the synchronization of data generation and processing across many machines. One example is the effect of a single dropped message: if that message is one sample of an update of data it might not be a big deal, but if it is a transitional event or command, you could suddenly have the system in an unexpected state.

Moreover, you may not be able to detect this until some time after the initial fault occurred, leading to a debugging nightmare. This is just one type of soft fault; many others occur regularly: high latencies (either sustained or periodic) which cause control loops to lose stability, self-reinforcing data dropouts, unexpectedly blocking applications, systems that work in the lab but fail when scaled up, data mismatches between what is provided and what is expected, and so on.
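For the dropped-message case, one widely used defence is to number each sample at the source and look for gaps at the receiver. The sketch below assumes a hypothetical per-source sequence number and in-order delivery; it reports gaps but does not reconcile samples that arrive late.

```python
from typing import Iterable, List


def find_missing_sequence_numbers(received: Iterable[int]) -> List[int]:
    """Return the sequence numbers skipped in an in-order stream."""
    missing: List[int] = []
    expected = None  # next sequence number we expect; unknown until first sample
    for seq in received:
        if expected is not None and seq > expected:
            # Everything between the expected number and this one was skipped.
            missing.extend(range(expected, seq))
        if expected is None or seq >= expected:
            expected = seq + 1
        # Samples older than `expected` (duplicates, late arrivals) are ignored.
    return missing


gaps = find_missing_sequence_numbers([1, 2, 3, 5, 6, 9])
no_gaps = find_missing_sequence_numbers([10, 11, 12])
```

Detecting the gap at the moment it happens, rather than when the system misbehaves later, is what keeps a single lost command from becoming a long debugging session.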

Thus for distributed systems, it is vital to be able to get at the state and event information without stopping or significantly slowing the system.

New Tools for Distributed App Development
Starting with the basics: the first thing that you need is a tool that allows you to generate common data types across all your boards and a process that keeps them in synchronization. If you are using middleware you will normally write your data types in a meta-language (IDL, XML, XDR) and autogenerate the code that handles the data types.

Some systems will allow you to create new types on the fly, but beware that this is potentially a source of error since it will be much harder to verify the usage contract on data if the programmer doesn't know its details.

Figure 2. Using an IDL file to define the data types, tools like 'rtiddsgen' can generate code that handles the defined data types. Extensions to rtiddsgen can be used to generate data types that are also compatible with CORBA.

The next tool you need allows you to design the applications and specify the data and QoS requirements. This class of tool should ideally be used to design as many of the applications as possible so that the QoS contract between senders and receivers is met at design time (much easier than debugging and fixing it later).

In an ideal world, this tool should integrate with your normal design methodology. For instance, UML users may wish to consider Sparx UML. This tool has interface description components for middleware such as DDS to make it easier to initially set these up.

Once your applications are deployed you need to make sure that the communications are happening as intended, QoS parameters are set properly and the system is running! One of the first questions you will need to answer at integration is “are these distributed application functions talking properly?”.

With an appropriate middleware interrogation tool such as RTI Analyzer you can determine that the middleware has “hooked up” the two applications, and you can make sure that the designers of the two application functions actually met the specification.

Figure 3. RTI Analyzer is a system level debugging tool that finds RTI Data Distribution Service objects in a running system, organizes them, and shows you their communication parameters. Correlating this information with your system design can quickly expose performance and reliability issues.

Such a tool also needs to show you which objects are exchanging data or, more importantly, not exchanging data, and if not, suggest why not. You can truly appreciate these tools when you have three different subcontractors (or even just free-willed developers) each building part of a distributed application and it comes time to integrate. The root cause of most configuration issues can be found quickly, accurately and with a minimum of debate.

Figure 4. RTI Analyzer showing the QoS mismatch error in 'Ownership' between a DataReader and DataWriter.

Three use-cases for debugging
You now have great up-front design and good interfaces that people are following, and yet it still isn't working. This is where distributed system-wide state and event analysis becomes key. Typically there are three use cases during debugging:

Use Case #1. Monitoring of overall distributed system health. In this case you might want to see the high-level behavior of most of the applications in the system. Tools such as RTView from SL Corporation allow you to build one or many Control Panel GUIs or Data Report views by listening to data put out by the middleware as well as your application.

Selectively instrumenting key variables in your application can be a great first step in isolating system issues and ensuring that your system is running properly. When taking advantage of data-centric middleware implementations such as DDS, tools like RTView can generate displays without detailed information about the source of the data.

Merely knowing that the data exists, in what format it is available (as defined by your data meta-language) and how it is made available (QoS) facilitates rapid assimilation of the information needed for such useful system overview displays.

Typically the applications leveraging this sort of tool have many different data sources, primarily at low time resolution, that need to be combined and displayed together to create a meaningful perspective of the system's health.

Tools like these are often deployed as part of the maintenance environment for the distributed system and as such include easy-to-use GUI builders that allow end-user-oriented displays of system data and health to be generated.

Figure 5. RTView provides virtual instruments for user views of the key distributed data.

Use Case #2. Getting into the guts of a faulty application. Once you've used the system health tool to isolate which nodes are having a problem, you may need to get more detailed and higher time resolution data from a few selected applications and their interaction across the network. Tools such as RTI Scope provide this functionality by allowing the user to look at the different data streams into and out of an application graphically, in real time, without pre-configuration.

Think of RTI Scope as an oscilloscope for the data coming out of an application from anywhere in the network, complete with negative time triggering, multiple plot types (vs time, x vs y), derived signals and the ability to save out the data for post processing. RTI Scope still operates at the defined data level, but is designed to capture fewer data sources, in a minimally intrusive manner.
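Negative time triggering means keeping samples from before the trigger fired, which is usually done with a small ring buffer. The following is a generic sketch of that technique under assumed names (`capture_around_trigger`, `pre`, `post`); it does not describe how RTI Scope is implemented.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional


def capture_around_trigger(
    samples: Iterable[float],
    trigger: Callable[[float], bool],
    pre: int,
    post: int,
) -> Optional[List[float]]:
    """Return up to `pre` samples before the first trigger, the triggering
    sample itself, and up to `post` samples after it; None if never triggered."""
    history: Deque[float] = deque(maxlen=pre)  # ring buffer of recent samples
    stream = iter(samples)
    for sample in stream:
        if trigger(sample):
            captured = list(history) + [sample]
            # Keep reading from the same stream for the post-trigger window.
            for _, later in zip(range(post), stream):
                captured.append(later)
            return captured
        history.append(sample)
    return None


signal = [0.0, 1.0, 2.0, 3.0, 10.0, 4.0, 5.0, 6.0]
window = capture_around_trigger(signal, lambda x: x > 8.0, pre=2, post=2)
quiet = capture_around_trigger(signal, lambda x: x > 100.0, pre=2, post=2)
```

Because only the last few samples are held in memory, this kind of capture can run continuously without disturbing the system being observed.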

RTI Scope is ideal for capturing data that runs out of bounds, or is delivered outside of its required throughput or performance objectives. Its full knowledge of the underlying middleware implementation means that it can 'discover' the data sources and recipients and connect to them across the network, leveraging the middleware to pull the data through for local analysis and visualization.

Figure 6. RTI Scope showing DDS Topic Data plotted against time with an 'oscilloscope-like' display.

Use Case #3. Network Analysis. Sometimes the middleware is attempting to perform the service requested of it by the application, but the underlying network implementation itself is not behaving as expected. Perhaps the router is dropping packets, or a wireless hop is providing lower bandwidth than needed, or a node periodically drops off the network for a second or two, or any one of a number of other problems occurs.

Drilling down to the wire
At this point you are left with no choice but to drill down to the wire and see what's happening. You reach for your protocol analyzer and it gives you all the UDP or other packet information you need. But it's meaningless unless you can correlate it back up to the application.

Well-constructed distributed middleware includes a standardized on-the-wire protocol; DDS, for example, uses the open standard RTPS (Real-Time Publish Subscribe). As you'd expect, such a platform includes the ability to monitor the wire traffic and pull out the associated middleware packets, dissecting them for correlation back to the application layer. RTI can help here too with a dedicated Protocol Analyzer, capable of providing a real-time display of all “on-the-wire” activity.
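As a small illustration of what "pulling out the middleware packets" involves, the sketch below picks RTPS messages out of captured UDP payloads by their header. It assumes the fixed 20-byte header layout given in the RTPS specification (a 4-byte `RTPS` marker, a 2-byte protocol version, a 2-byte vendor id and a 12-byte GUID prefix); check this against the version of the specification you are working with. The function name and the sample bytes are made up for the example.

```python
import struct
from typing import NamedTuple, Optional

# Assumed header layout: marker, version major, version minor, vendor id,
# GUID prefix. All fields are byte strings or single bytes, so byte order
# does not affect the result.
RTPS_HEADER = struct.Struct("!4sBB2s12s")


class RtpsHeader(NamedTuple):
    version: str
    vendor_id: bytes
    guid_prefix: bytes


def parse_rtps_header(udp_payload: bytes) -> Optional[RtpsHeader]:
    """Return the decoded header, or None if the payload is not RTPS."""
    if len(udp_payload) < RTPS_HEADER.size:
        return None
    magic, major, minor, vendor_id, guid_prefix = RTPS_HEADER.unpack_from(
        udp_payload
    )
    if magic != b"RTPS":
        return None
    return RtpsHeader(f"{major}.{minor}", vendor_id, guid_prefix)


# Arbitrary illustrative bytes, not taken from a real capture.
example = b"RTPS" + bytes([2, 1]) + bytes([0xAB, 0xCD]) + bytes(range(12)) + b"..."
header = parse_rtps_header(example)
not_rtps = parse_rtps_header(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")
```

The GUID prefix identifies the sending participant, which is the piece of information that lets a wire-level trace be tied back to a specific application.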

Figure 7. RTI Protocol Analyzer allows you to see the 'on wire' traffic.

As we have seen, the development of real-time applications operating across a large and complex network requires an innovative approach to deliver an effective tools strategy in the face of the multiple challenges posed by such a distributed environment. Without such a coherent and integrated strategy, both system performance and project development times can be severely compromised.

The fundamental requirements for an effective tools strategy are essentially two-fold: the ability to define and support a consistent and predictable real-time environment across heterogeneous operating systems, processors and network topologies; and a fully integrated toolchain that provides comprehensive debug information at each level (design, code, integration, debugging and maintenance) across the distributed system architecture on which the application under development runs.

Dr. Bob Kindel is Vice President of Engineering Services at Real-Time Innovations, Inc. He joined RTI in 2000 as an applications engineer with a strong background in control systems and distributed network engineering. He is an expert at the design and debugging of complex distributed applications and spent two years focused on embedded and network-system debugging. His past consulting work has included customer training, system design and integration debugging. He can be reached at
