Distributed Software Design: Challenges and Solutions
In contrast to centralized systems, distributed software systems add a new layer of complexity to the already difficult problem of software design. In spite of that and for a variety of reasons, more and more modern-day software systems are distributed.
In some cases, such as telecommunications systems, distribution is inherent in the problem domain and cannot be avoided. In other cases, we rely on physical distribution to achieve properties that cannot be effectively realized with centralized systems. For instance, we may want to use multiple physically distributed processors to achieve a high degree of availability. Thus, if one processor experiences a hard failure, another one can take over its functionality. Another reason for distribution is performance. By splitting the processing load across multiple processors, it is generally possible to achieve higher throughput and faster response than with just a single processor.
We define a distributed software system (see Figure 1) as a system with two or more independent processing sites that communicate with each other over a medium whose transmission delays may exceed the time between successive state changes. Note that this definition is broad enough to encompass logically distributed systems. These are software systems that are concentrated on a single processing node but, for one reason or another,1 exhibit the above properties. An example of such a system is one that spans multiple heavyweight processes running on the same processor and which interact by exchanging asynchronous messages. Since each process is an independent unit of failure, it fits our definition. (For this reason, our term distributed site should not necessarily be taken as a synonym for the term “processing node.”) While logically distributed systems cannot achieve the full benefits of physically distributed systems, the problems that they present to developers are practically the same.
The challenges of distributed software
Failures, faults, and errors
For example, a bit in memory that is stuck at “high” is a fault. This will result in an error when a “low” value is written to that bit. When the value of that bit is read and used in a calculation, the outcome will be a failure.
Of course, this classification is a relative one. A fault is typically a failure at some lower level of abstraction (that is, the stuck-high bit may be the result of a lower-level fault due to an impurity in the manufacturing process).
When a failure occurs, it is first necessary to detect it and then to perform some basic failure handling. The latter involves diagnosis (determining the underlying cause of the fault), fault removal, and failure recovery. Each of these activities can be quite complex.
Consider, for example, failure diagnosis. A single fault can often lead to many errors and many different cascading failures, each of which may be reported independently. A key difficulty lies in sorting through the possible flurry of consequent error reports, correlating them, and determining the basic underlying cause (the fault).
Processing site failures
Communication media failures
A different type of media failure is an intermittent failure. These are failures whereby messages travelling through a communication medium are lost, reordered, or duplicated. Note that these are not always due to hardware failures. For example, a message may be lost because the system may have temporarily run out of memory for buffering it. Message reordering may occur due to successive messages taking different paths through the communication medium. If the delays incurred on these paths are different, they may overtake each other. Duplication can occur in a number of ways. For instance, it may result from a retransmission due to an erroneous conclusion that the original message was lost in transit.
One of the central problems with unreliable communication media is that it is not always possible to positively ascertain that a message that was sent has actually been received by the intended remote destination. A common technique for dealing with this is to use some type of positive acknowledgement protocol. In such protocols, the receiver notifies the sender when it receives a message. Of course, there is the possibility that the acknowledgement message itself will be lost, so that such protocols are merely an optimization and not a solution. The most common technique for detecting lost messages is based on time-outs. Namely, if we do not get a positive acknowledgement that our message was received within some reasonable time interval, we conclude that it was dropped somewhere along the way. The difficulty of this approach is distinguishing whether a message (or its acknowledgement) is simply slow or actually lost. If we make the time-out interval too short, we risk duplicating messages and, in some cases, reordering. If we make the interval too long, the system may become unresponsive.
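The time-out-and-retransmit scheme described above can be sketched in a few lines. This is a minimal stop-and-wait simulation, not a prescription from the text: the simulated loss rate, the retry limit, and the alternating-bit sequence number used by the receiver to discard duplicates are all illustrative assumptions.

```python
import random

def send_with_ack(messages, loss_rate=0.3, max_retries=50, seed=42):
    """Stop-and-wait sketch: retransmit each message until a positive
    acknowledgement arrives. A 1-bit (alternating) sequence number lets
    the receiver drop the duplicates created when an acknowledgement,
    rather than the message itself, was lost. Losses are simulated."""
    rng = random.Random(seed)
    delivered = []          # what the receiver's application actually sees
    expected_seq = 0        # receiver's next expected sequence bit

    for i, msg in enumerate(messages):
        seq = i % 2
        for attempt in range(max_retries):
            if rng.random() < loss_rate:        # message lost in transit
                continue                        # sender times out, resends
            # Message arrived at the receiver.
            if seq == expected_seq:             # genuinely new message
                delivered.append(msg)
                expected_seq ^= 1
            # else: duplicate caused by a lost ACK; receiver just re-ACKs.
            if rng.random() < loss_rate:        # ACK lost on the way back
                continue                        # sender times out again
            break                               # ACK received: move on
        else:
            raise TimeoutError(f"gave up on {msg!r}")
    return delivered

print(send_with_ack(["on", "off", "on"]))
```

Note how the duplicate-filtering line is exactly the price of choosing a short time-out: without the sequence bit, a lost acknowledgement would cause the same message to be delivered twice.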
Two different types of problems are caused by message delays. One type of problem results from variable delays (jitter). That is, the time taken for messages to reach the destination may vary significantly. The delays depend on a number of factors, such as the route taken through the communication medium, congestion in the medium, congestion at the processing sites (for example, a busy receiver), intermittent hardware failures, and so on. If the transmission delay were constant, we could better assess when a message has been lost. For this reason, some communication networks are designed as synchronous networks, so that delay values are fixed and known in advance.
However, even if the transmission delay is constant, the problem of out-of-date information still exists. Since messages convey information about state changes between components of the distributed system, the information in these messages may be out of date if the delays experienced are greater than the time required to change from one state to the next. This can lead to unstable systems. Imagine trying to drive a car in a situation where the visual input to your eyes is delayed by several seconds.
Transmission delays also lead to a complex situation that we will refer to as the relativistic effect. This arises because transmission delays between different processing sites in a distributed system may be different. As a result, different sites could see the same set of messages but in a different order. In the scenario of Figure 2, distributed sites NotifierP and NotifierQ each send out a notification about an event to the two clients (ClientA and ClientB). Due to the different routes taken by the individual messages and the different delays along those routes, ClientB sees one sequence (event1 followed by event2), whereas ClientA sees another. As a consequence, the two clients may reach different conclusions about the state of the system. Note that the mismatch here is not the result of message overtaking (although this effect is compounded if overtaking occurs), but is merely a consequence of the different locations of the distributed agents relative to each other.
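The relativistic effect can be reproduced with a toy model: both notifiers send at the same instant, and each client orders events purely by local arrival time. The specific delay values below are hypothetical, chosen only so that the two clients observe different orders.

```python
def arrival_order(delays):
    """Given per-(event, client) transmission delays, return the order
    in which each client observes the two events. Both notifications are
    sent at t=0; a client knows nothing but its own arrival times."""
    order = {}
    for client in ("ClientA", "ClientB"):
        order[client] = sorted(["event1", "event2"],
                               key=lambda e: delays[(e, client)])
    return order

# Hypothetical delays: the path carrying event1 to ClientA is slow, so
# ClientA sees event2 first, even though no message overtook another.
delays = {
    ("event1", "ClientA"): 5, ("event2", "ClientA"): 2,
    ("event1", "ClientB"): 1, ("event2", "ClientB"): 3,
}
print(arrival_order(delays))
```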
Distributed agreement problems
We introduce this problem with the following apocryphal story. Consider the case of two army generals of ancient times, when communication between distant sites could only be achieved by physically carrying messages between the sites. The two generals, Amphiloheus and Basileus, are in a predicament whereby the enemy host lies between them. While neither has the strength to single-handedly defeat the enemy, their combined force is sufficient. Thus, they must commence their attack at precisely the same time or they risk being defeated in turn. Their problem then is to agree on a time of attack.
Unfortunately for them, a messenger going from one general to the other must pass through enemy lines with a high likelihood of being caught.
Assume then, that Amphiloheus sends his messenger to Basileus with a proposed time of attack. To ensure that the message was received, he demands that the messenger return with a confirmation from Basileus. (While this is going on, Basileus could be in the process of sending his own messenger with his own, possibly different, proposal for the time of the attack.)
The problem is obvious: if the messenger fails to get back to Amphiloheus, what conclusions can be reached? If the messenger succeeded in reaching the other side but was intercepted on the way back, there is a possibility that Basileus will attack at the proposed time but not Amphiloheus (since he did not get a confirmation). However, if the messenger was caught before he reached Basileus, then Amphiloheus is in danger of acting alone and suffering defeat. Furthermore, even if the messenger succeeds in getting back to Amphiloheus, there is still a possibility that Basileus will not attack, because he is unsure that his confirmation actually got through. To remedy this, Basileus may decide to send his own messenger to Amphiloheus to ensure that his confirmation got through. But the only way he can be certain of that is if he gets a confirmation of his confirmation. Since there is a possibility that neither messenger got through to Amphiloheus, Basileus is no better off than before if his second messenger does not return.
Clearly, while sending additional messengers can increase the likelihood that a confirmation will get through, it does not fundamentally solve the problem since there will always be a finite probability that messengers will get intercepted. It's a case of “does he know that I know that he knows that…?” and so on.
The Byzantine generals problem
A common paradigm for a particular form of the distributed agreement problem is the so-called Byzantine generals problem.4 In this problem it is assumed that at least one processing site is faulty. Furthermore, any faulty processing sites are assumed to be malicious in the sense that they are trying to subvert agreement by intentionally sending incorrect information. While the number of such sites is known, their identity is not. The objective is to find an algorithm whereby all the non-faulty sites can reach agreement despite the disruptive actions of the faulty sites.

This may seem like a rather exotic and needlessly paranoid problem of little practical value. However, it is actually quite important for fault-tolerant systems. The point is that any solution that can survive a malicious attack will be robust enough to survive practically any type of fault.
Distributed algorithms and techniques
A major difficulty with common knowledge is that all parties have to know about each other. In dynamic systems, where sites may come and go, this can add a significant amount of overhead. In particular, difficulties result when a new site joins while a distributed algorithm is in progress. In those cases, it is typical to refuse entry to the newcomer until the algorithm terminates. Conversely, a site may fail in the middle of executing a distributed algorithm, which means that all the other participants must be notified.
The principle is simple. In a rendezvous, for instance, one site performs a remote invocation while the receiver executes a “receive” operation. Unless the two operations are executed simultaneously, one of the parties will have to wait for the other (which is why it is called a “rendezvous”). When the rendezvous occurs, the invoking site is suspended until the invoked operation is completed at the remote site and a reply is received. The receiving site simply continues its execution.
Unfortunately, synchronous communication typically involves significant overhead (mainly because of the need to deal with failures) that reduces efficiency. Furthermore, while a synchronous communication is in progress, both participants are unreceptive to any other communications. In many reactive and real-time systems, such prolonged periods of unresponsiveness are unacceptable.

In contrast, asynchronous communication, or message passing, is simpler and more efficient, but it leaves the problem of synchronization to the application. Messages are simply sent to the receiver regardless of whether the receiver is ready to accept them.
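The contrast between the two styles can be sketched over a single message queue. This is an illustrative model only: the server loop, the reply box, and the function names `invoke` and `cast` are assumptions, not an API from the text. The synchronous caller blocks until the reply arrives, exactly as in a rendezvous; the asynchronous caller enqueues its message and continues.

```python
import queue
import threading

requests = queue.Queue()

def server():
    """Consumes messages; replies only when a reply box is attached."""
    while True:
        msg, reply_box = requests.get()
        if msg is None:                      # shutdown sentinel
            break
        if reply_box is not None:            # synchronous rendezvous
            reply_box.put(msg.upper())       # the reply unblocks the caller
        # Asynchronous messages are simply consumed; no reply is sent.

def invoke(msg):
    """Synchronous style: the caller is suspended until the reply arrives."""
    reply_box = queue.Queue(maxsize=1)
    requests.put((msg, reply_box))
    return reply_box.get()                   # caller blocks here

def cast(msg):
    """Asynchronous style: send the message and continue immediately."""
    requests.put((msg, None))

t = threading.Thread(target=server)
t.start()
result = invoke("ping")                      # caller waits for the reply
cast("log this")                             # returns at once
requests.put((None, None))                   # shut the server down
t.join()
print(result)
```

While `invoke` is blocked on `reply_box.get()`, the caller can process nothing else, which is precisely the unresponsiveness the text warns about.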
The Byzantine generals algorithm
Clearly, considering our previous discussions on the difficulty of achieving these conditions in distributed systems, these are rather significant constraints on the applicability of the algorithm. They are introduced to simplify the problem. Variants of the algorithm have been developed that relax some of these conditions.

In the Lamport-Shostak-Pease algorithm, one of the sites initiates the agreement process. We refer to this site as the coordinator, while the other sites are called followers. The algorithm generally proceeds as follows:
All non-faulty sites will correctly relay the value they received from the coordinator. Faulty sites, on the other hand, are free to do as they choose, including sending the wrong value or not sending a value at all.
Note that, if there is no majority, the algorithm cannot work. In fact, it can be shown that, in the case of n faulty sites, the algorithm only works if there are at least (3n + 1) sites. This means that the algorithm can only work if there are at least four sites. (A variant of the algorithm in which faulty followers are constrained to correctly relay the coordinator's value only requires (2n + 1) sites.)
This algorithm works even if the coordinator is faulty, because it guarantees that all non-faulty sites will agree on the same value. Further details, including a proof of correctness of the algorithm, can be found in Lamport, Shostak, and Pease.
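The four-site, one-fault case can be sketched directly. This is a toy rendering of one round of oral-message agreement, OM(1); the site names, the "bad-&lt;peer&gt;" model of malicious behavior, and the default value "retreat" used when no strict majority exists are all illustrative assumptions.

```python
from collections import Counter

def majority(votes, default="retreat"):
    """Strict majority of the votes, falling back to a common default
    on a tie; using the same default everywhere is what preserves
    agreement when the coordinator itself is the faulty site."""
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) // 2 else default

def om1(coordinator_value, faulty):
    """One round of oral-message agreement for four sites: coordinator
    C plus followers F1..F3, tolerating one faulty site. A faulty site
    is modeled as sending a different bogus value to each peer."""
    followers = ("F1", "F2", "F3")

    # Round 1: the coordinator sends its value to every follower.
    received = {f: (f"bad-{f}" if faulty == "C" else coordinator_value)
                for f in followers}

    # Round 2: each follower relays the value it received to its peers,
    # then decides by majority over its own copy plus the relayed copies.
    decisions = {}
    for f in followers:
        votes = [received[f]]
        for peer in followers:
            if peer == f:
                continue
            votes.append(f"bad-{f}" if peer == faulty else received[peer])
        decisions[f] = majority(votes)
    return decisions

print(om1("attack", faulty="F3"))   # non-faulty followers agree on "attack"
print(om1("attack", faulty="C"))    # no majority anywhere: all default
```

With a faulty follower, the two honest followers out-vote the liar; with a faulty coordinator, every follower sees three distinct values and all fall back to the same default, so agreement holds in both cases.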
Distributed mutual exclusion
When a site S receives a request from site R to enter a critical section, it will either give its permission immediately or defer the reply until later. The algorithm that it follows is:
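A decision rule of this permit-or-defer kind can be sketched as follows. The sketch assumes a Ricart-Agrawala-style scheme with logically timestamped requests (an assumption about the rule the text describes, not a quotation of it); ties between equal timestamps are broken by site identifier.

```python
def handle_request(site, req_time, req_site):
    """Permit-or-defer rule in the style of Ricart-Agrawala distributed
    mutual exclusion (assumed scheme): site S replies immediately unless
    it is inside the critical section, or it wants the critical section
    itself and its own request has priority (the earlier timestamp)."""
    if site["in_cs"]:
        site["deferred"].append(req_site)        # reply after we exit
        return "defer"
    if site["requesting"]:
        my = (site["req_time"], site["id"])
        theirs = (req_time, req_site)
        if my < theirs:                          # our request has priority
            site["deferred"].append(req_site)
            return "defer"
    return "permit"                              # reply immediately

s = {"id": 2, "in_cs": False, "requesting": True,
     "req_time": 5, "deferred": []}
print(handle_request(s, req_time=3, req_site=1))  # earlier request: permit
print(handle_request(s, req_time=7, req_site=3))  # ours is earlier: defer
```

Deferred replies are sent when the site leaves the critical section, which is what guarantees that every requester eventually gets in.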
A widely used election algorithm is called the Bully algorithm: a site entering the election procedure first sends its bid to all other sites. The bid is typically based on some quality that is unique to each site, such as a site identifier, so that there are no duplicate bids. If a leader already exists, it responds to the new site and the new site simply enters the monitoring state. In this state, the site monitors the operational status of the elected leader (for example, through a periodic poll).
If no leader exists, the candidate site waits to receive the bids of all other candidates that it receives in response to its own bid. Once these are all in, each site evaluates the bids. The site with the highest bid is selected as the leader (which explains the name of the algorithm). This site sends a confirmation message to all other sites.
If a site in the monitoring state determines that the current leader has failed, it re-enters the election procedure by sending its bid to all sites that had previously entered a higher bid than its own.
If none of these respond, the site elects itself as the leader and informs all lower-bid sites. However, should a higher-bidding site respond, it waits to receive a confirmation message from the new leader. If such a message does not arrive, the election process is re-entered.
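The happy path of the election described above, with perfectly reliable communication and a stable membership, reduces to a very small recursion. This sketch deliberately ignores all the practical difficulties discussed next; numeric site identifiers serve as the bids.

```python
def bully_election(initiator, alive_ids):
    """Bully election sketch under ideal conditions: the initiator sends
    its bid to all higher-bid sites; if none responds, it elects itself,
    otherwise the highest responding site takes over the election."""
    higher = [s for s in alive_ids if s > initiator]
    if not higher:
        return initiator          # nobody outbids us: we are the leader
    # A higher-bid site responded; it re-runs the election itself.
    return bully_election(max(higher), alive_ids)

# Site 2 starts an election; the former leader 7 and site 5 are down.
print(bully_election(2, alive_ids={1, 2, 3, 4, 6}))
```

The highest surviving bidder "bullies" its way to leadership regardless of who initiated the election, which is the property the algorithm's name reflects.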
This simple sounding algorithm is fraught with practical implementation difficulties. As sites come and go, it is difficult to keep track of who is present and what phase of the election process they are in. If the communication medium is not perfectly reliable, the difficulties are often insurmountable. Detection of leader failures can also be problematic if the polling is based on timeouts.
Fault-tolerant distributed broadcasting
The weakest form of fault-tolerant broadcast is called a reliable broadcast. Within a given group of sites, it guarantees three basic properties:
Note that reliable broadcast does not guarantee the order in which the messages will be received. The order of reception can be crucial in some cases. For example, if an “on” message is sent after an “off” message, it is important that they are received in that order.
For this reason a number of other types of fault-tolerant broadcasts are defined with progressively stronger restrictions on the delivery order of messages.
A FIFO broadcast is a reliable broadcast that guarantees that messages sent by the same sender are delivered in sending order. Note, however, that multiple sites could be broadcasting at the same time, so even with FIFO broadcasts different members of the group may receive messages from different senders in different orders. For this reason, we define atomic broadcasts, which guarantee that all non-faulty sites receive all messages in the same order.
Another type of reliable broadcast is causal broadcast. In this case, the delivery order is guaranteed to respect causal relationships. That is, it is not possible for a message to be delivered before another message that precedes it in the system cause-effect chain. Finally, it is sometimes useful to provide a causal atomic broadcast that combines the features of causal and atomic broadcasts.
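The FIFO guarantee, the weakest of these ordering constraints, is commonly implemented with per-sender sequence numbers on the receiving side. The class below is an illustrative sketch of that one mechanism only (atomic and causal ordering need more machinery, such as a sequencer or vector clocks); the names are assumptions.

```python
from collections import defaultdict

class FifoDelivery:
    """Receiver side of a FIFO broadcast sketch: each message carries a
    per-sender sequence number, and a message is delivered only after
    all earlier messages from that sender. Out-of-order arrivals are
    buffered until the gap is filled."""
    def __init__(self):
        self.next_seq = defaultdict(int)      # next expected, per sender
        self.buffer = defaultdict(dict)       # held-back messages
        self.delivered = []                   # application-visible order

    def receive(self, sender, seq, payload):
        self.buffer[sender][seq] = payload
        # Deliver every consecutive message now available from this sender.
        while self.next_seq[sender] in self.buffer[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, self.buffer[sender].pop(n)))
            self.next_seq[sender] += 1

f = FifoDelivery()
f.receive("P", 1, "off")      # arrives early: buffered, not delivered
f.receive("P", 0, "on")       # fills the gap: both delivered, in order
print(f.delivered)
```

This reproduces the on/off example from above: even though "off" arrives first, the application sees "on" before "off", as the sender intended.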
These distributed groups provide the basic services required by many different distributed algorithms and applications. A typical group facility provides the following: