
How to Build Complex Protocols in Embedded Software
by David Nix
A rafting expedition through Colorados Royal Gorge offers travelers a front-row seat to
some of the most breath-taking landscapes imaginable. The pristine river transports the rafters between narrow, thousand-foot-high canyon walls through some of the best whitewater around. Those who are prepared for the turbulent water and know how to navigate through it tend to come through drenched but safe, mastering the rapids. To the unprepared, however, the trip can be an incredibly harrowing ride. The initial tranquility of the river gives way to seven miles of churning, world-class rapids with names
like Sledgehammer and Boateater. The whitewater tosses the rafts around like so much flotsam, sending mountains of cold water washing over the shell-shocked rafters and ping-ponging the rafts from boulder to boulder. The unsuspecting often find themselves without a raft, scrambling for the safety of shore while frantically fishing boats, paddles, and equipment out of the river.
This scenario is played out in a more subtle manner in the world of software protocol development. Any protocol can be
characterized by tranquil cases, which describe the protocols expected and typical behavior, and turbulent cases, which describe the unexpected and atypical behavior. The developer of a complex protocol often enters the development process with an understanding of the tranquil cases, but is thoroughly unprepared for the turbulent. The tranquil development gradually gives way to the emergence of seemingly endless turbulent cases, forcing the developer from a state of total control into a completely reactionary
mode. The implementation degrades into a jumbled mass of unpredictable logic, and the developer is forced to bail out.
The key to implementing a complex protocol in embedded software is to look beyond the tranquil flow and identify the turbulence before pushing headlong into the implementation. This article details an effective strategy for navigating the turbulence.
Elements of a Development Approach
Several elements, when present, can bolster the software development strategy for the
implementation of complex protocols. These elements provide a solid foundation for a protocol software development process.
Visibility.
A picture really is worth a thousand words. Human beings typically gain understanding of complex problems more readily through visual symbols than through verbose explanations. To master a complex protocol, the developer must be able to visualize the most information in the smallest amount of space. A concise visual model of the protocol is more useful than reams
of protocol description.
Testability.
The developer must be able to systematically test every possible path through the protocol, rather than rely on stress testing or field testing to randomly exercise the more obscure scenarios. Although random luck is always welcome, it cannot be a planned element of the development process. The developer must have faith that all the bases have been covered by the time the software enters system testing. In addition, being able to automatically exercise the
model and see the model in motion greatly enhances the developers chances of achieving these testing goals.
Maintainability.
Maintenance is as popular as the plague for most software engineers because it requires them to make modifications to a poorly documented, difficult-to-understand implementation without introducing fundamental flaws. Maintenance of the protocol model and software starts the moment the developer begins designing, and continues for the life of the product. Throughout its
lifetime, a piece of protocol software may be touched by many engineers over a period of years. Its crucial that the developer and future maintainers have the ability to easily understand and modify the protocol so that the implementation doesnt degrade over time.
Example Protocol
To demonstrate the protocol development process, here is a rudimentary development case, complete with a simple communication protocol. This example case requires the developer to implement a protocol
for establishing, maintaining, and releasing a data link over a physical medium for a client application. You can assume that layering has already been applied, and protocol messaging has been defined to fit the layered scheme. Layering is simply a divide-and-conquer technique resulting in client-server layers. Each layer provides black-box services to the layer above through message pipelines termed service access points, or SAPs. The layering for the example appears in
Figure
1.
When the application must communicate with its peer application, it requests the data link layer to establish and manage a communication path to the peer. The data link layer, in turn, invokes the services of the physical layer to establish a physical connection. To limit the scope of this example, well focus our attention on the data link layer. The services of the data link layer include accessing physical link resources, maintaining the link while transferring application data in
segmented packets, and releasing the link resources when instructed to do so. In addition, if the peer data link layer initiates the link, the data link layer must accept the link initiation, maintain the link while accepting peer data packets, and release the link when instructed to do so.
Figure 2 illustrates the tranquil case message sequence charts, or ladder diagrams, which capture the expected and typical protocol messaging necessary to provide the data link services.
Developing the Protocol
The protocol software development process can be divided into several stages. Although the stages are successive, significant overlap and iteration can occur. The stages include:
* Tranquil case design modeling
* Turbulent case design modeling
* Model translation
* Maintenance
In a traditional software development process, very little testing or verification occurs until coding is well underway or even completed, and any
precoding verification is usually limited to design reviews or inspections. Numerous industry studies, however, have shown that the cost of repairing defects increases exponentially with each stage of development. A defect that is easily repaired in the design stage can become a runaway train if it escapes to the field. In order to detect defects as early as possible in the development process, each development stage must include a mechanism for formally and rigorously verifying the results of that stage.
Tranquil Case Design Modeling Stage
When starting the design process, the developer generally starts with at least an understanding of the tranquil cases. The purpose of the tranquil case design modeling stage is for the developer to accurately, completely, and exclusively reproduce the tranquil cases in a design model.
Identify the Finite State Machines
The foundation element of the design model is the finite state machine because its such a powerful way of representing
protocol behavior. A finite state machine consists of discrete states responding to events by taking some action and possibly transitioning to a new state. Events may be triggered by various occurrences, such as the receipt of a message, the expiration of a timer, or a hardware interrupt. There are numerous formalisms for representing state machines, but Harel statecharts are one of the most versatile and powerful representation methods.
1
Among its many capabilities, statecharts allow the developer to:
* Execute an action upon entry to a state (entry action)
* Execute an action upon exit from a state (exit action)
* Execute an action upon occurrence of an event (transition action)
* Allow the action to branch based on decision criteria (conditional connector)
* Allow sub-states within states (state hierarchy)
* Allow sub-states to inherit transitions from the parent state (group transitions)
Using statecharts, you can capture the behavior
of each layer in highly visual state diagrams. The behavior of a layer can often be expressed with a single state machine, although complex layers may consist of multiple independent state machines. If a layer contains multiple state machine objects, then the developer must determine which state machine handles which set of events.
In the example, the data link layer is simple and serial enough to be captured in a single state machine object, so all of the messaging associated with that layer must be
handled by that object.
Determine the States and Transitions
After identifying the scope of the finite state machines, the developer must determine the constituent states of each state machine by analyzing the protocol messaging. Since protocols tend to be serial, the receipt of each tranquil case message generally results in a change of state.
An analysis of the messaging from the example protocol produces the following set of states:
* Link Idle: the link isnt established
and no establishment is in progress
* Access Pending: the application has requested a link establishment, and access confirmation is pending
* Ready: the link has been established by the application or the peer application, but the application isnt sending any data; data packets can be received from the peer application in this state
* Segmenting: data is being sent to the peer application one packet at a time; data packets can also be received from the peer application
* Releasing: the application has requested a release, and release confirmation is pending
Adding the tranquil transitions to the appropriate states results in the diagram depicted in
Figure 3.
Assign Actions to All Action Points
Once the states and transitions have been identified, behavior must be associated with each of them. As I mentioned, statecharts allow actions to be taken upon exiting a state, entering a state, or during a transition.
In the example provided, an exit action that initializes all of the variables to be used during the life of the link could be assigned to the Link Idle state, and an entry action could be assigned that updates link metrics. When a dllEstablishReq message is received from the application or a phyAccessInd message is received from the physical layer, the exit action is executed. When the Link Idle state is entered for any reason, the entry action is executed.
An example of a transition action is the
behavior associated with the receipt of a dllEstablishReq from the application. Upon receiving this message, a phyAccessReq is sent to the physical layer, and the state machine transitions to the Access Pending state.
The diamond which appears between the Segmenting and Ready states is a conditional connector, which allows a transition to fork based on a decision. In the example, if the final data packet has been sent to the physical layer and then confirmation of the final data packet is received, the state
machine transitions to the Ready state; otherwise, the state machine sends the next data packet and remains in the Segmenting state.
Simplify the Model Using State Hierarchy
Assuming that simplicity is preferable to complexity, state hierarchy can be applied to substantially reduce the visual complexity and action redundancy in the model. Any time two or more states have at least one action in common, the potential to apply state hierarchy exists.
In the example state machine, the Segmenting
and Ready states react in the same way to the following three events:
* phyReleaseInd: the state machine cancels all timers, sends a dllReleaseInd to the application, a phyReleaseRsp to the physical layer, and transitions to the Link Idle state
* dllReleaseReq: the state machine cancels all timers, sends a phyReleaseReq to the physical layer, and transitions to the Releasing state
* phyPacketInd: the state machine appends the data packet from the physical layer to the
previously assembled packets and sends a phyPacketRsp to the physical layer; if its the final packet of the peer-to-peer message, the state machine forwards the assembled message to the application as a dllDataInd
State hierarchy can be employed by encapsulating both the Ready and Segmenting states in an Established state, as illustrated in
Figure 4. The transitions in the Established state are executed regardless of the sub-state occupied by the state machine at the
time the event occurs. Not only are three transition arrows removed from the model, but it becomes more obvious that the two sub-states are extremely similar, with the exception of the way they handle dllDataReq and phyPacketCfm messages. The Estab-lished state then becomes the parent state, and both the Active and Segmenting states inherit transition actions from it.
Verify the Model
To verify the tranquil case design model, a test plan must be constructed to exercise all possible paths
through the protocol. The construction of a message sequence chart per test case at first seems a logical approach, but as the turbulent cases are considered, the number of message sequence charts grows exponentially. An alternative approach is to attack the model state-by-state. Because states are self-contained objects and the state machine may occupy only one state at a time, verifying the actions of each state independently will cover most, if not all, of the test scenarios. A tranquil case test plan for the
example scenario would consist of the test cases shown in Table 1.
At minimum, the developer can use the test cases to visually inspect the model, or hold a peer inspection with the test cases as the basis. However, physically executing the test cases against the model using an automated tool set is more desirable. The capability to animate a design model by injecting messages and invoking events provides a means of verification which can be nearly 100% effective in uncovering protocol logic errors
* Capture the model in a formal representation which completely represents all states, transitions, and actions
* Cause the model to execute the actions by reacting to events, where the events can be artificially introduced
through an I/O interface
* Allow the results of the execution to be observed
Only a few commercially available design modeling tool sets support model execution, but the numbers are growing as developers demand increasingly sophisticated methods of capturing and verifying designs.
Table 1
State Event Expected Action
Link Idle
receipt of dllEstablishReq Send phyAccessReq, transition to
Access Pending
Link Idle
receipt of phyAccessInd Send dllEstablishInd and phyAccessRsp, transition to
Ready
Access Pending
receipt of phyAccessCfm Send dllEstablishCfm, transition to
Ready
Established
receipt of phyReleaseInd Send dllReleaseInd and phyReleaseRsp, transition to
Link Idle
Established
receipt of dllReleaseReq Send phyReleaseReq, transition to
Releasing
Established
receipt of phyPacketInd Send phyPacketRsp; if last packet in message, send dllDataInd
Ready
receipt of dllDataReq Segment, send first phyPacketReq, transition to
Segmenting
Segmenting
receipt of phyPacketCfm If more packets, send next phyPacketReq, else send dllDataCfm and transition to
Ready
Releasing
receipt of phyReleaseCfm Send dllReleaseCfm, transition to
Link Idle
|
Turbulent Case Design Modeling Stage
Once the tranquil model has been completed and verified for all possible tranquil scenarios, the turbulent cases can be addressed. The most serious problems associated with protocol development are due not to the expected difficulties, but to the unexpected ones. More specifically, its the failure to account for all
turbulent cases that can lead to catastrophic problems in the field. The maintenance to correct these problems often results in the creation of unexpected behaviors in the protocol, leading to yet further maintenance. After a few such maintenance episodes, the stability and reliability of the original implementation can be violated to the point that the maintainer or developer is unsure of the actual behavior of the state machines. The net result is usually customer unhappiness, which promotes much
finger-pointing and assignment of blame. The purpose of this stage is to discover all of the turbulent behaviors of the protocol and incorporate them into the design model.
Discover the Turbulent Cases
Protocol turbulence can be defined as anything unexpected or atypical in the behavior of the protocol. Turbulent behavior can be roughly divided into two categories. The first category is that of unexpected events, which are events that occur at an unexpected point in the protocol. These unexpected cases
can be handled in various ways. They can be:
* Ignoredtake no action
* Rejectedtake no action except for the return of an explicit denial to the sender
* Allowedtake action to process the event
A notable class of unexpected events is that of
glare
, which may be defined as the collision of activity between protocol entities, and it occurs most often at the beginning of a protocol exchange. For example, if adjacent layers simultaneously initiate the
start of a protocol exchange, the resulting collision of messages may cause both attempts to fail. In the example protocol, if a phyAccessReq is sent by the data link layer and the physical layer simultaneously sends a phyAccessInd, by the time the phyAccessInd arrives, the data link layer has already transitioned to a state which doesnt handle the event. If the physical layer has likewise transitioned to a state that doesnt handle the phyAccessReq, both layers will wait indefinitely for a
confirmation event which will never be sent, resulting in a stalemate. Because glare cases tend to be subtle, they are often overlooked during protocol development, with devastating results. Glare can be handled in much the same way as other unexpected events, but with a couple of caveats. In the example case, the data link layer, which has just sent a phyAccessReq, can react in the following ways to the receipt of a phyAccessInd:
* Ignore it and continue to wait for the phyAccessCfm
*
Reject it by sending an explicit reject message, such as a phyDenyRsp
* Allow the new event to take precedence by abandoning the wait for the phyAccessCfm, sending a dllEstablishInd to the application and a phyAccessRsp to the physical layer, and transitioning to the Ready state
* Perform explicit collision handling, such as having the data link layer back off for a random period of time before resending the phyAccessReq
In a real-world example of glare, if two vehicles arrive at a four-way
stop from adjacent streets and both decide to proceed, there may be a collision. If both defer, they may wait indefinitely for the other to proceed, or at least until someones patience runs out. To prevent this impasse, a traffic rule has been instituted which dictates that the vehicle to the right has right of way. Similar rules must be applied in the protocol world. If the two layers decide to react to the glare case in the same way, a stalemate or collision can still result. For example, if both
layers decide to allow the new event to take precedence over the previous event, both sides may wait indefinitely for the other to make the next move. To avoid a stalemate, the layers must agree ahead of time on a rule for handling the glare scenarios. In the example, the data link layer can defer by allowing the phyAccessInd, and the physical layer can assume right of way by ignoring the phyAccessReq, allowing the protocol to proceed as if the case were a peer origination. For every possible glare scenario,
then, its important that the adjacent layers have a pre-determined rule for handling the glare.
The second category of turbulent behavior is that of the absence of expected events. Failure to account for this category of behavior can cause the state machine to become locked in a state. The typical way to prevent the state machine from locking is to start a watchdog timer whenever a future event is expected. If the expected event arrives, the timer is halted. If, however, the expected event
doesnt arrive before the timer expires, the state machine is able to take corrective action. Corrective actions include:
* Resending the last message and restarting the timer. For example, if the data link layer sends a phyAccessReq and starts a timer, and the phyAccessCfm isnt received before the timer expires, the data link layer can resend the phyAccessReq and restart the timer
* Taking a new course of action. For example, if the data link layer sends a phyAccessReq and starts
a timer, and the phyAccessCfm isnt received before the timer expires, the data link layer can send a phyReleaseRsp to the physical layer, a dllReleaseInd to the application, and transition to the Link Idle state
* Combining actions. For example, the data link layer may resend the phyAccessReq some number of times before taking a new course of action
The widespread use of watchdog timers can vastly improve the ability of the state machine to recover from protocol anomalies, software
defects, or physical anomalies by allowing the state machine to eventually proceed or return to an idle state.
Figure 5 illustrates a method of representing timers in the message sequence charts. The chart indicates that on receipt of a dllEstablishReq, a phyAccessReq is sent to the physical layer and a timer of length
tAccess
is started. If a phyAccessCfm is received before the timer expires, the timer is halted; otherwise corrective action is taken.
To recap, the
procedure for identifying turbulent cases is to identify every possible tranquil or turbulent event and then determine the reaction to every possible event in every state. A straightforward approach for this procedure is to create a matrix of states and events, and fill in every square. Table 2 illustrates the state-event matrix for the example case.
Table 2
1:Link Idle 2:Access 3:Established 4:Established 5:Releasing
Pending Ready Segmenting
dllEstablishReq +T:2 -I -I -I -I
dllDataReq -I -I +T:4 -I -I
dllReleaseReq -I -AT:5 -AT:5 -AT:5 -I
dllReleaseRsp -I -A:1 -A:1 -A:1 -A:1
phyAccessInd +T:3 -GAT:3 -I -I -I
phyAccessCfm -I +T:3 -I -I -I
phyPacketInd -I
-I +T:3 +T:4 -I
phyPacketCfm -I -I -I +TC:3/4 -I
phyReleaseInd -I -A:1 -A:1 -A:1 -GA:1
phyReleaseCfm -I -A:1 -A:1 -A:1 +A:1
timeout -I -OTN:1 -A:1 -OTN:1 -A:1
Legend: + (expected), - (unexpected), I (ignore), R (reject), C (conditional), A (allow), T (start timer), G (glare), O (do over),
N (new course of action), :# (new state)
|
The state-event matrix includes a couple of new events: dllReleaseRsp, which allows the protocol exchange to be aborted by the application, and timeout, which accounts for timer expiration. The use of the matrix allows the developer the opportunity to make an explicit decision about how to handle each event in each state, leaving no details unspecified.
Add New States and Transitions
Once all of the
turbulent cases have been identified, the design model must be updated to include the new transitions, and if necessary, new states. In the example case, new states arent necessary, but the number of transitions must be tripled, effectively adding a new transition for each unexpected and non-ignored event in the state-event matrix.
Assign Actions To All Action Points
As with the tranquil cases, actions must be assigned to each new transition. The state-event matrix serves as a guideline
for determining the actions. For example, the receipt of a phyAccessInd event in the Access Pending state is a glare scenario (G) which is allowed (A), an activity timer is started (T), and the state machine transitions to the Ready state. A timeout event in the Segmenting state is handled by sending the previous phyPacketReq over (O), re-starting the activity timer (T), and staying in the same state. If the timeout happens again, a new course of
action is taken (N) by sending a phyReleaseRsp to the physical layer, a dllReleaseInd to the application, and transitioning to the Link Idle state.
Simplify the Model Using State Hierarchy
Hierarchy can be applied once again to simplify the design model. In the example state machine, all of the states, except the Link Idle state, react in the same way to the following transitions:
* phyReleaseInd: the state machine halts all timers, sends a dllReleaseInd to the
application and a phyReleaseRsp to the physical layer, and transitions to the Link Idle state
* phyReleaseCfm: the state machine halts all timers, sends a dllReleaseInd to the application, and transitions to the Link Idle state
* dllReleaseRsp: the state machine halts all timers, sends a phyReleaseRsp to the physical layer, and transitions to the Link Idle state
To simplify the model, all of these states can be enclosed in an Active state. Furthermore, since the phyReleaseInd and
phyReleaseCfm events have nearly identical transition actions, they can be combined into a single transition. The result of these simplifications is depicted in
Figure 6.
Without state hierarchy, the model would contain 30 transitions instead of 18, making the relatively simple model far more visually complex.
Verify the Model
Once the final model has been constructed, the developer can expand the tranquil case test plan to include all new states and transitions.
Again, the developer can use the test cases to visually inspect the model, hold a peer inspection based on the test cases, or animate the model to execute the test cases.
Model Translation Stage
After the design model has been completed and verified for all possible tranquil and turbulent scenarios, the developer can convert it to a software implementation. The purpose of this stage is to translate the design model to a target software architecture without introducing translation defects.
Translate the Design Model to a Target Architecture
Design models are tremendous tools, but much of their usefulness is sacrificed if the developer must make a large leap between the model and the implementation. Constructing a model of states and transitions is of little benefit if the target software architecture cannot cleanly and easily represent states and transitions. Force-fitting a model into an incompatible target architecture generally results in the systematic introduction of
translation errors. It is critical, therefore, that the target architecture represents as precisely as possible the elements present in the design model.
Normally the coding stage is the major contributor of defects to a software product and consumes a major portion of the schedule. Good translation practices from a working design model can make the coding stage nearly a non-factor in terms of development time and defect rates. A few design modeling tool sets are available which provide automated translation
(code generation) to a target architecture that exactly represents the model. The benefits and practicality of automated code generation are explored in the March 1998 issue of
Embedded Systems Programming
.
2
If automated code generation isnt an option, the most robust approach to achieving a seamless transition from the model to the implementation is to develop from the ground up a target architecture which precisely mirrors the elements of the design model. Quite often, though,
development schedule and resource pressures dictate that an existing architecture be adapted for the implementation. In any case, the more closely the architecture represents the modeling elements, the faster and more accurately the translation can occur. The major elements of the model, along with target architecture elements, are listed in Table 3.
Table 3
Model
Element Target Element
layers (state machine containers) task object, task shell, task context, processing thread
states, with hierarchy state objects (e.g. tables) specifying actions on a per- event and per-state basis, and associated state change per action; also data associated with the state
state exit and entry actions methods or subroutines belonging to or invoked by the state object, executed upon entry to and exit from a state messaging
intertask and intra-task message passing event transitions methods or subroutines belonging to or invoked by the state object, executed in response to the occur- rence of events (e.g. timeouts, receipt of messages)
timer handling typical timer handling, but the expiry of a timer causes an event to which the state objects react
I/O handling I/O interface; allows messages to be injected or events to be generated to exercise a state object (verification); allows state progress
to be monitored
|
Verify the Translation
The translation can be verified by applying the turbulent case test plan to the implementation. Successful results prove both that the implementation is logically correct and that the translation process has introduced no defects. An I/O interface can be used to exercise the implementation in the same manner that the model has been verified, by injecting messages and invoking events.
Maintenance Stage
As soon as
the tranquil case design modeling is underway, any change made to what has already been done is effectively maintenance. By this definition, the turbulent case modeling stage is simply pre-release maintenance. Post-release maintenance, then, is identical in practice to the turbulent case modeling stage.
When a problem is located, the design model is checked to determine if it is in error. Because of the visual nature of the model, a logical error is more readily identifiable in the model than it is in the
implementation. If the model is in error, it can be corrected by adding, deleting, or modifying states, transitions, or actions. Once corrected, the model can be verified in the normal manner.
After verification of the corrected model, the modification can be translated to the implementation, and the implementation verified. If the model is not in error, then a translation error has occurred or a weakness in the target software architecture has been exposed. In either case, the design model and
implementation remain in lock-step with one another.
The most valuable effect of maintenance at a design/model level is that the maintainer acquires a better understanding of the developers intent, and is therefore better equipped to preserve it.
Final Points
The protocol development methodology outlined here goes directly to the heart of managing the complexity of real-time protocols by providing a systematic approach to accurately and completely express a protocol in software. The approach
doesnt preclude using inspections, reviews, metrics gathering, integration stages, or iterative life-cycle management. Instead, it provides a technical complement to other good engineering practices. Protocol development is difficult enough without the added burden of an unfocused development approach. The use of this development approach can greatly enhance the ability of the protocol developer to withstand the turbulence, stay in the boat, and master the complex protocol.
David Nix is a
systems software engineer at AMD in Austin, TX, currently working on the development of a wireless local loop application. His experience includes 10 years of development in the telecommunications field with AT&T Wireless, Motorola, and NEC, with the last five years devoted to developing various wireless protocols. He can be reached at david.nix@amd.com.
References
1. Harel, David, Statecharts: A Visual Formalism for Complex Systems,
Science of Computer Programming
8
, North-Holland, 1987, p. 231.
2. Bell, Rodney, Code Generation from Object Models,
Embedded Systems Programming
, March 1998, p. 74.
|