A proliferation of communications networks, protocols, and access technologies has created a need for a wide array of communications systems and platforms designed to meet the increased demand. As more complex systems evolve, more and more interconnections between nodes exist within the worldwide communications infrastructure.
The issue of node reliability therefore becomes one of the most important aspects to consider. Node reliability is often specified in terms of a system mean time between failures (MTBF) [1] or a system availability.
When doing system MTBF analysis, the backplane of a communications node is often overlooked. Yet the backplane is one of the most important pieces of the system. In this paper, we use MTBF analysis to investigate the reliability of a communications node with specific regard to the bus topology. In particular, we examine both a multi-drop parallel bus (PB) and a point-to-point star bus (SB) topology.
A communications node consists of modules connected together (Figure 1). For example, a typical system consists of a chassis and backplane that allow various network interface modules (NMs) and channel interface modules (CMs) to be plugged in through a series of connectors. The channel cards accept payload from the user side, such as packet data, voice, audio, or video signals. The network interface module (NM) aggregates and formats the channel payload from the channel cards for presentation to the network. The transport of payload between the channel cards and the network module is performed by the physical backplane and its associated drivers.
In general, modules can be categorized as either common system modules (CSMs) or sub-system modules (SSMs). A CSM is a module that provides system resources (e.g., power, timing) that are common to, or shared by, all the other modules in the system. If a CSM fails, the entire system fails. Examples of CSMs are NMs and power supply modules.
A subsystem module (SSM), on the other hand, provides resources for only a subset of the system. Therefore, if an SSM fails, only the subsystem associated with it fails. Examples of SSMs are channel modules (CMs).
It should be noted that the bus itself is also a system module that can fail. Ultimately, we are motivated to select a bus topology that most easily and efficiently supports both module redundancy and bus redundancy. For high-availability nodes, the star topology is well suited to meeting these objectives. The motivation for using a star bus topology stems from its inherent robustness and its ability to provide module and bus redundancy effectively, typically at lower cost and complexity than a parallel multidrop bus. Below, we'll show why the star architecture provides the best approach to bus design.
A parallel multidrop bus, as shown in Figure 2a, consists of a set of data lines shared by all the modules in the system. Typically the bus bandwidth is partitioned so that each module is assigned a portion of the bandwidth and writes to and reads from the bus during its allocated time. An example of a q-wide parallel system is shown for the case of a single NM and multiple CMs.
A point-to-point star bus, as shown in Figure 2b, consists of point-to-point high-speed serial buses running from each CM to a common switch matrix (SMX), often located on the NM. The switch matrix can consist of a simple Layer-2 Ethernet switch for packet-based backplanes or a time-slot interchange (TSI) chip for TDM-based backplanes. The high-speed buses from each CM form the "spokes" of the star bus. The SMX can route data from one CM to another, or it can aggregate bandwidth from the CMs and route it to the network.
In the parallel bus topology the bus is a common system module (CSM) and is a single point of failure. A single misbehaving channel card can take down the entire system. Failures include a failed I/O driver, corrupted memory, a failed CPU, or a software failure that causes the bus to be written to at the wrong time.
CompactPCI provides a good example of a parallel system. PCI is a parallel bus technology initially designed for low-reliability computers. It has been adapted for telecom applications simply because it was available early on, but it is certainly a non-optimal solution for high-availability communications nodes. The implications of this are:
- Hot-swap (live insertion) of cards must be addressed carefully, since a glitch on the backplane caused by inserting a module live can cause errors in other modules' data.
- Since all the modules are on the bus, there can be up to n or more line drivers and line receivers on each line of the bus. This implies a large capacitive load due to all the drivers/receivers hanging on the bus, which in turn requires a large current (a larger driver) to charge the bus rapidly at high data rates. Higher current implies higher power, crosstalk, and EMI.
- Each module is required to have q drivers and receivers where q is typically 32 or 64 for PCI.
- Accommodating redundancy for equipment failure requires hardware duplication and bus duplication: the platform is composed of a primary node and an identical backup node that provides service if the primary fails. Such a solution can be prohibitively expensive, as everything must be duplicated, including the buses, which requires each CM to have 2q drivers and receivers.
In the star topology, each bus is an SSM as compared to the parallel bus where the bus is a CSM. This is an important distinction. Several benefits of the star topology are immediately obvious:
- A single channel module (CM) failure does not fail the entire system.
- Hot-swap is less of an issue with regard to the data and timing buses since live insertion of a CM cannot affect any bus other than the bus associated with the module being inserted. Replacing failed boards does not require a special sequencing for turning bus drivers on or off.
- Bus loading is light, since only a single driver and receiver are connected to each bus. This allows higher speeds to be attained with less current, which lowers power and minimizes crosstalk.
- The number of drivers a CM needs to support is reduced significantly.
- The number of CMs is not limited as in the case for PCI.
The systems in Figure 2 also show a redundant NM. In this configuration the network lines are input to a common switch (S1), which switches the lines between the primary and secondary NMs. Since the switch is usually significantly less complex than the NM itself, the MTBF of S1 can be made orders of magnitude larger than the MTBF of the NM. This is generally true and should be a system requirement.
Now that we've described the basic bus topologies, let's look at the concept of system MTBF and compare the performance of the parallel and star buses. To illustrate the concepts in a generic way, we define a reference system consisting of a single shelf with one or two NMs and n channel modules. We also define:
- m is the minimum MTBF (hours) of an NM or CM
- n is the number of CMs in the system
- MTR is the mean time to repair a failed module (typically 24 hours)
- Ts is the time required to switch over from primary to standby modules in a system with redundancy (typically 50 ms)
We also define the following function that returns the parallel value of two elements:
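Consistent with how it is used throughout the derivation below, the par(·) function is the standard reliability parallel combination (reconstructed here from the surrounding formulas):

```latex
\mathrm{par}(m_1, m_2) = \frac{m_1\, m_2}{m_1 + m_2}
```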
Parallel Bus without Redundancy
Let's now look at the MTBF for a parallel bus without NM redundancy, as shown in Figure 3 .
With regard to reliability, we say that the CMs and NMs are series-connected; that is, a failure of any one element in the series fails the system. It can be shown that the MTBF of two series-connected modules with MTBFs m1 and m2, respectively, is given by [1]:
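Since failure rates add for series-connected elements, the combined MTBF is the harmonic combination (a reconstruction consistent with the par(·) function defined above):

```latex
m_{series} = \mathrm{par}(m_1, m_2) = \frac{m_1\, m_2}{m_1 + m_2}
```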
The MTBF of n identical elements is:
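With each of the n elements having MTBF m, the failure rates add n times, giving (reconstructed from the surrounding derivation):

```latex
m_n = \frac{m}{n}
```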
If the MTBF of the bus-related failure mechanisms is known, then the MTBF of the CMs can be replaced with the bus-related MTBF, which will be higher than the MTBF of the entire module. We designate the bus-related MTBF of the CMs as km, where k is a scale factor relating the bus-related MTBF to the module MTBF. To keep the analysis generic, we assume that the NMs and CMs have the same minimum module-level MTBF, m.
In reality the MTBF of each card will be different, and the appropriate numbers can be substituted as needed. Typically, however, the module MTBFs will all be within an order of magnitude of each other. The parallel bus topology consists of n series-connected CMs which, in turn, are series-connected with the NM. Therefore the total MTBF of the system is:
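Combining the n series-connected CMs (each with bus-related MTBF km) with the single NM of MTBF m gives (reconstructed from the definitions above):

```latex
M_{PB} = \mathrm{par}\!\left(\frac{km}{n},\; m\right) = \frac{km}{n + k}
```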
In most cases the bus-related MTBF is not known, in which case the MTBF of the module must be used (i.e., k = 1). An advantage of the star bus topology is that it does not require knowledge of the bus-related MTBF; it requires only the module MTBF, which is known.
Parallel Bus with Redundancy
Next, let's investigate the MTBF for a parallel bus with NM redundancy, as shown in Figure 4.
For this case, although we have redundant NMs, the bus is still a single point of failure. It should be noted that even running redundant parallel buses does not solve the problem completely, since all cards are connected to both buses and a single card can still bring down both buses. This can happen, for example, if a failed CPU causes a module to write data to the bus at the wrong time or place. This type of failure can occur on both buses. Therefore, even with redundant buses, the bus remains a single point of failure.
In the redundant parallel bus, if a switchover occurs in less than Ts, the system is considered not to have failed. The value Ts = 50 ms is considered typical in the PSTN.
The NMs in this configuration are configured as parallel-redundant modules in hot standby with a switchover time (Ts) and a mean time to repair (MTR). It can be shown that the MTBF of two modules connected in hot standby, each with module MTBF m, is given by [1]:
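With failure rate 1/m and repair time MTR, the standard hot-standby result, keeping the dominant term (a reconstruction consistent with the exact expression used below), is:

```latex
m_{red} = \frac{m^2}{2\,\mathrm{MTR}}
```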
A typical value of MTR is 24 or 48 hours. The CMs are series-connected, and the equivalent MTBF of the series-connected CMs is:
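Applying the series-connection result to the n CMs with bus-related MTBF km (reconstructed from the derivation above):

```latex
m_{CM} = \frac{km}{n}
```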
The CMs and NMs are also series-connected with each other; therefore the total system MTBF is par(m1, m2), which yields:
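Substituting the hot-standby NM pair and the CM chain into par(·) (a reconstruction; note the 2k·MTR term referenced in the approximation that follows):

```latex
M_{PB,red} = \mathrm{par}\!\left(\frac{m^2}{2\,\mathrm{MTR}},\; \frac{km}{n}\right) = \frac{km^2}{nm + 2k\,\mathrm{MTR}}
```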
Since the mean time to repair is on the order of a few days, the term 2k·MTR in the denominator is almost always orders of magnitude smaller than the nm term, so we can approximate the system MTBF as:
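Dropping the 2k·MTR term leaves:

```latex
M_{PB,red} \approx \frac{km}{n}
```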
We can compare the approximation with the exact result for the specific case of k = 1, m = 10^6 hours, and MTR = 24 hours, from which:
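As a numeric check, the exact and approximate expressions can be compared with a short sketch (not from the original article; the formulas follow the reconstruction of the derivation above):

```python
def par(m1, m2):
    """Combined MTBF per the par() function: m1*m2/(m1+m2)."""
    return m1 * m2 / (m1 + m2)

def mtbf_exact(m, n, k=1.0, mtr=24.0):
    """Redundant parallel bus: hot-standby NM pair in series with n CMs."""
    nm_pair = m * m / (2 * mtr)   # redundant NMs in hot standby
    cm_chain = k * m / n          # n series-connected CMs, bus-related MTBF k*m
    return par(nm_pair, cm_chain)

def mtbf_approx(m, n, k=1.0):
    """Approximation k*m/n, valid when n*m >> 2*k*MTR."""
    return k * m / n

m, mtr = 1e6, 24.0
for n in (4, 16, 32):
    exact, approx = mtbf_exact(m, n, mtr=mtr), mtbf_approx(m, n)
    print(f"n={n:2d}  exact={exact:12.1f} h  approx={approx:12.1f} h")
```

For these values the two expressions agree to within a few parts per million, confirming that the repair term is negligible.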
Star Bus without NM Redundancy
In this section, we'll investigate the reliability of a basic non-redundant star bus, as shown in Figure 5. For this configuration, each module is connected directly to the NM; therefore the bus in this topology is no longer a single point of failure. A shorted bus or misconfigured channel card cannot bring the entire system down. This topology is inherently more reliable than the parallel bus structure and, as we will see, it allows the system to take full advantage of NM redundancy.
Since a channel module can fail and be replaced without taking the system down, the channel modules can be considered as connected in hot standby with module MTBF km. The MTBF of a single pair of channel cards is:
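Applying the hot-standby result with module MTBF km (reconstructed from the derivation above):

```latex
m_{pair} = \frac{(km)^2}{2\,\mathrm{MTR}}
```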
An upper bound on the number of channel cards is assumed to be 32, although the number will typically be lower. For 32 channel cards (16 hot-standby pairs in series), the MTBF is:
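Series-connecting the 16 pairs (a reconstruction consistent with the pair MTBF above):

```latex
m_{32} = \frac{m_{pair}}{16} = \frac{(km)^2}{32\,\mathrm{MTR}}
```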
We can use m32 as a lower bound on the MTBF of the channel cards. The channel cards are series-connected with the network card; therefore, the total MTBF for the star bus without redundancy is par(m32, m), which is:
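Substituting into par(·) gives (reconstructed; the approximation holds because k²m is far larger than 32·MTR for typical values):

```latex
M_{SB} = \mathrm{par}\!\left(\frac{(km)^2}{32\,\mathrm{MTR}},\; m\right) = \frac{k^2 m^2}{k^2 m + 32\,\mathrm{MTR}} \approx m
```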
One thing to notice is that the system MTBF is essentially not a function of the channel module MTBF or the number of CMs installed. Figure 6 compares the star bus without redundancy against the parallel bus with and without redundancy for m = 10^6 hours and MTR = 24 hours.
As Figure 6 shows, NM redundancy improves the MTBF of the parallel bus system when the number of CMs installed is small. However, even for a moderate number of CMs, the system reliability is improved only marginally. The reason is that the bus is a single point of failure, causing the CMs to be series-connected with each other. This in turn causes the collective MTBF of the CMs to swamp out any benefit that might otherwise be achieved by the redundant NMs.
Given k = 5, the results above indicate that for a system with more than five CMs, the non-redundant star bus has a higher MTBF than the redundant parallel bus. If we do not know the bus-related MTBF, which is typically the case, we use k = 1.
In the case of k =1, the non-redundant star bus is better than the parallel bus for any number of CMs. We will see in the next section that the star bus with NM redundancy is orders of magnitude better than any of these previous scenarios.
Star Bus with NM Redundancy
The last configuration we want to investigate is a system with a star bus and NM redundancy, as shown in Figure 7. For this configuration there are two independent star buses: one primary and one secondary. The primary star bus connects each channel module to the primary NM, while the secondary bus connects each channel module to the secondary (standby) NM. Often a heartbeat or keep-alive signal is sent from the primary NM to the standby NM and CMs. If the primary NM fails, the keep-alive disappears and the system switches to the standby bus and standby NM.
For a system with two redundant NMs, the MTBF of the redundant NMs is:
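Applying the hot-standby result to the NM pair (reconstructed from the derivation above):

```latex
m_{NM,red} = \frac{m^2}{2\,\mathrm{MTR}}
```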
Therefore the total system MTBF is:
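Series-connecting the channel-card lower bound m32 with the redundant NM pair gives (a reconstruction consistent with the par(·) definition above):

```latex
M_{SB,red} = \mathrm{par}\!\left(\frac{(km)^2}{32\,\mathrm{MTR}},\; \frac{m^2}{2\,\mathrm{MTR}}\right) = \frac{k^2 m^2}{2\,\mathrm{MTR}\,(k^2 + 16)}
```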
Figure 8 shows the MTBF for the redundant parallel bus and for both the redundant and non-redundant star topologies with m = 10^6 hours and MTR = 24 hours. We see from Figure 8 that the MTBF of the star bus with NM redundancy is orders of magnitude better than that of the parallel bus with NM redundancy.
Now we can translate the system MTBF results into an alternate metric referred to as system availability. The availability of a system is the ratio of up-time to total time, expressed as a percentage. On average, a system failure occurs once every MTBF hours, and during each failure the system is down for MTR hours. Therefore the system availability is:
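In terms of the quantities defined above (reconstructed from the surrounding text):

```latex
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTR}}
```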
Table 1 summarizes the availability of the redundant parallel bus, star bus, and redundant star bus systems for m = 10^6 hours and MTR = 24 hours.
An often-used objective for high-reliability nodes is to meet so-called five-nines availability, meaning that the system is available 99.999% of the time or better. The results above indicate that for this scenario (m = 10^6 hours, MTR = 24 hours), the star bus with redundancy is available more than 99.999998% of the time, which far exceeds five-nines.
The parallel bus system, on the other hand, cannot meet five-nines and is typically at three or four nines. Thus, from the results in Table 1, we see that the star bus with NM redundancy is orders of magnitude better than the parallel bus topology for this specific case, and this will generally be true.
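The availability figures discussed above can be reproduced with a short sketch (assumptions: n = 16 CMs, k = 1, and the MTBF expressions as reconstructed from the derivation in this article):

```python
def par(m1, m2):
    """Combined MTBF per the par() function: m1*m2/(m1+m2)."""
    return m1 * m2 / (m1 + m2)

def availability(mtbf, mtr=24.0):
    """Fraction of up-time: MTBF / (MTBF + MTR)."""
    return mtbf / (mtbf + mtr)

m, mtr, n, k = 1e6, 24.0, 16, 1.0

# Redundant parallel bus: the bus itself is still a single point of failure.
pb_red = par(m * m / (2 * mtr), k * m / n)
# Non-redundant star bus: 32-card lower bound m32 in series with one NM.
sb = par((k * m) ** 2 / (32 * mtr), m)
# Redundant star bus: m32 in series with a hot-standby NM pair.
sb_red = par((k * m) ** 2 / (32 * mtr), m * m / (2 * mtr))

for name, mtbf in [("parallel+red", pb_red), ("star", sb), ("star+red", sb_red)]:
    print(f"{name:12s}  MTBF={mtbf:14.0f} h  A={availability(mtbf, mtr):.8f}")
```

Running this shows the redundant parallel bus stuck near three nines while the redundant star bus exceeds seven nines, consistent with Table 1.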
As a last example, consider a system with redundant NMs and 16 CMs. Assume we want a five-nines system, so we must determine the minimum module MTBF required to satisfy this requirement. Solving the availability expression for the parallel bus with redundancy and n = 16 gives:
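Setting A = 0.99999 requires a system MTBF of at least MTR·A/(1 − A) ≈ 2.4 × 10^6 hours; applying the approximation M ≈ km/n (a reconstruction consistent with the figures quoted below):

```latex
\frac{km}{16} \ge \mathrm{MTR}\,\frac{A}{1-A} \approx 2.4\times 10^{6}
\;\Rightarrow\;
m \ge \frac{3.84\times 10^{7}}{k}
```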
The result indicates that the module MTBF of the channel cards must be 3.84 × 10^7 hours (for k = 1) to 7.68 × 10^6 hours (for k = 5). This is generally not feasible; more typical numbers for module-level MTBF are between 500,000 and 800,000 hours at 25°C, and typically an order of magnitude lower at 50°C.
Next, solving the availability expression for the star bus with redundancy yields:
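Using the 32-card lower bound m32 together with the redundant NM pair, with k = 1 and MTR = 24 hours (a reconstruction consistent with the figure quoted below):

```latex
\frac{k^2 m^2}{2\,\mathrm{MTR}\,(k^2+16)} \ge 2.4\times 10^{6}
\;\Rightarrow\;
m \ge \sqrt{816 \times 2.4\times 10^{6}} \approx 44{,}253 \text{ hours}
```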
The result indicates that the module MTBF of the channel cards need only be 44,253 hours. This is generally very easy to meet, even at 50°C.
In this article, we investigated the reliability of a communications node with regard specifically to the bus topology. In particular, we examined both a parallel bus and a point-to-point star bus topology.
In an effort to meet five-nines reliability, we are motivated to select a bus topology that most easily and efficiently provides both module and bus redundancy. For high-availability nodes the star topology is particularly well suited to meeting these objectives. The results in this article indicate that, for the typical scenarios described, the star bus with NM redundancy is orders of magnitude better than the parallel bus topology and is able to achieve five-nines reliability.
- [1] Tutorial on Analyzing High Reliability: Part 1, http://www.commsdesign.com/showArticle.jhtml?articleID=18311424
- [2] Tutorial on Analyzing High Availability: Part 2, http://www.commsdesign.com/showArticle.jhtml?articleID=18311631
About the Author
Jeffrey S. Pattavina is the chief system engineer for Harris Corporation's Intraplex access products group. A member of IEEE, Jeff holds a Masters in Electrical Engineering from Northeastern University, Boston, MA. Jeff can be reached at .