Software Considerations For Host Processor Hot Swap

Compact PCI hardware now provides the capability to switch control between redundant host system processors. To fully take advantage of the hardware capability, operating systems, device drivers, and applications software must be configured to handle the implications of a host processor switch.

The ability to replace I/O boards in a system without shutting off power (hot swap) provides a tremendous boost to the maintainability and availability of a system. It simplifies the process of replacement for failed boards, minimizes system downtime, and eliminates the need to reboot a system after board replacement. Extending hot swappability to system boards and providing redundant system boards can provide the further benefit of allowing the system to be tolerant of both system software and system board failures. If the active system board fails, the replacement board simply gets swapped in, and the system continues operation with minimal interruption.

Compact PCI systems already support the hot swapping of non-system cards, power supplies, and peripheral components. Both the hardware needs and the board software algorithms required for the hot swapping of non-system slots are well described by the PICMG Compact PCI Hot Swap Extensions standard. To allow system processor slots to hot swap, several facilities must work in concert. For one, the hardware must allow the Compact PCI bus domains to have their control transferred from one processor to another without disrupting the bus operation. The software must also allow the transfer, and must do so at all levels of the system, from the system controller through each of the system I/O boards.

The hardware requirements have been solved. System chassis such as the Motorola Computer Group CPX8216 and CPX8221 have the hardware necessary to allow this bus takeover. However, successfully performing a domain takeover requires some adjustment to the system software.

Controlling the Bus

Before swapping host controllers, you must either first halt activity on the system bus, or ensure that the post-swap activity will not cause failure of the new host. You can halt bus activity by changing the functional configuration of boards in the system or by using slot control signals as defined by the High-Availability Hot Swap Standard. It is important to understand the effect of these on a board in order to apply them properly.

Adjusting each function's configuration space is one way to stop bus activity. Whether a board presents a bridge or a device as the single PCI load in the slot, you have control over that function's PCI mastering and target response capabilities. Specifically, you can use the master enable bit in the command register to disable origination of PCI bus cycles by a slot. Disabling a function's bus mastering capability, however, may result in overruns or underruns for that function. System software must therefore account for those possibilities anytime a function or bus has traffic suspended for longer than a few microseconds.
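The master enable bit can be illustrated with a minimal Python sketch. It operates on a simulated 256-byte configuration space held in a bytearray, not on real hardware. The offset (0x04) and bit position (bit 2) follow the standard PCI configuration header; the function names and the sample register value are assumptions made for illustration only.

```python
import struct

PCI_COMMAND_OFFSET = 0x04        # 16-bit command register in the PCI config header
PCI_COMMAND_BUS_MASTER = 0x0004  # bit 2: bus master enable


def read_command(config: bytearray) -> int:
    """Return the little-endian 16-bit command register value."""
    return struct.unpack_from("<H", config, PCI_COMMAND_OFFSET)[0]


def set_bus_master(config: bytearray, enable: bool) -> int:
    """Set or clear the master enable bit; return the previous register value."""
    old = read_command(config)
    if enable:
        new = old | PCI_COMMAND_BUS_MASTER
    else:
        new = old & ~PCI_COMMAND_BUS_MASTER & 0xFFFF
    struct.pack_into("<H", config, PCI_COMMAND_OFFSET, new)
    return old


# Simulated function with I/O space, memory space, and bus mastering enabled.
config = bytearray(256)
struct.pack_into("<H", config, PCI_COMMAND_OFFSET, 0x0007)

# Quiet the function before the switchover, keeping the old value for later.
saved = set_bus_master(config, False)
```

Returning the previous value lets the caller restore the original setting after the switchover. Target response (the I/O and memory space enable bits) is left untouched, so the host can still reach the device registers while mastering is off.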

You can also stop bus activity by using one of two slot control signals specified by the High-Availability Hot Swap Standard: BdSel or PCIReset. The choice of signal has significant impact on how the system recovers following a hot swap.

Negating the BdSel signal removes back-end power from a Compact PCI board. This moves a board into the H0 state of the hardware connection as described in the standard. In this state, the board is effectively powered off. Only early power, which is used to stabilize the connection to the PCI bus signals in the floating condition, remains active. It should be noted that the time to enter the H0 state following the negation of BdSel for a slot is unspecified and will be determined by the hardware implementation of the slot payload.

Negating BdSel to a slot has the disadvantage of requiring that the board go through a power-up sequence prior to returning to service.

Asserting the PCIReset signal to a slot causes the PCI interface for that slot to reset and float its electrical connections for the duration of the reset. PCIReset will propagate onto the board's PCI bus in accordance with the PCI specification, and may reset the entire board or only the PCI bus, depending on hardware implementation. The time from signal assertion until the PCI interface is reset and floating is not specified and will be determined by the hardware implementation.

Negating the reset allows the board to progress to the H2/S0 state. When the new host releases the board from reset, the normal PICMG hot swap enumeration process begins. This process allows a device driver to be configured and PCI resource allocations to be made for the I/O board.

Using PCIReset to halt bus activity allows the board to maintain its power, so volatile memory should not be lost. Whether the board software can recover from PCIReset without complete initialization, however, is a matter for its software designer to determine.

Processor Hot Swap Classifications

Once the bus traffic is quieted, host processor hot swap can proceed. There are several ways to go. Processor hot swaps can be classified along two orthogonal criteria: the relationship of the two processors during the switchover and the maintenance of state within the payload and its associated driver.

There are two possibilities for processor relationship during a hot swap: cooperative and pre-emptive. If both system processors are capable of participating in the bus domain switchover, then the switchover is considered a cooperative switchover. Otherwise, the switchover is considered to be pre-emptive.

In a cooperative switchover the claiming processor notifies the current domain owner of the intent to switch and waits for the owner's consent before claiming the bus domain. A pre-emptive switchover is initiated in the same manner. However, if the claiming system processor determines that the time allotted for the cooperative switchover has elapsed prior to receiving the current owner's consent, the bus domain is forcibly switched to the new processor.
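The request, wait, and forced fallback described here can be sketched in Python as follows. This is only an outline under stated assumptions: the three callbacks (notify_owner, owner_consented, take_domain) are hypothetical stand-ins for whatever mechanism a given chassis and operating system provide, and polling is used purely to keep the example short.

```python
import time
from typing import Callable


def claim_domain(
    notify_owner: Callable[[], None],
    owner_consented: Callable[[], bool],
    take_domain: Callable[[bool], None],
    timeout_s: float,
    poll_s: float = 0.01,
) -> str:
    """Try a cooperative switchover; fall back to pre-emptive on timeout."""
    notify_owner()  # both kinds of switchover start with the same request
    deadline = time.monotonic() + timeout_s
    while True:
        if owner_consented():
            take_domain(False)  # owner agreed: claim the domain normally
            return "cooperative"
        if time.monotonic() >= deadline:
            break
        time.sleep(poll_s)
    take_domain(True)  # allotted time elapsed: force the switch
    return "pre-emptive"


# Example with stub callbacks: the owner never answers, so the claim is forced.
events = []
outcome = claim_domain(
    notify_owner=lambda: events.append("request"),
    owner_consented=lambda: False,
    take_domain=lambda forced: events.append(("take", forced)),
    timeout_s=0.02,
    poll_s=0.005,
)
```

The timeout value is a system design decision. It has to be long enough for the owner to finish its checkpointing, yet short enough that a hung owner does not stall the system.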

Cooperative switchovers are desired where possible. Certain types of software faults, however, can cause the current owner to not notice a simple request. To maximize the probability that the current owner will take notice, even in the face of software faults, the switchover request should trigger an interrupt.

A cooperative switchover procedure will attempt to notify all I/O functions of the switchover and allow them to halt bus activity before proceeding. Intelligent I/O functions may be allowed to complete checkpoint transfers. Additionally, the current domain owner may attempt to complete state checkpointing of drivers and other items before consenting to the takeover.

By performing the notifications and checkpointing, the switchover procedure is most likely to preserve the system state and halt bus activity. Preserving the system state and halting the bus maximizes the probability of a clean takeover and the subsequent recovery and continuation of the system function.

Certain hardware or software faults may interfere with a cooperative takeover. For example, the checkpoint link between processors may have failed, preventing a clean checkpoint from being established. Another possibility is that the current owner may have established an interrupt-inhibited environment, causing it to fail to recognize the takeover request. Other types of software or hardware faults may have similar effects. The result is a pre-emptive switchover.

A pre-emptive switchover is simply any switchover that did not satisfy the conditions for a cooperative switchover. In a pre-emptive switchover, the most recent checkpoint may be stale, the I/O functions may not have been notified of the change, the bus may not have halted, or any combination of the foregoing conditions may be in effect.

Payload and Driver State
There are three levels of domain switchover related to payload and driver state maintenance, designated cold, warm, and hot. In the cold switchover, the I/O devices and their associated new drivers do not maintain any state from before the switchover. In the warm switchover, I/O devices maintain at least some state from before the switchover and will be notified in some manner that a switch has occurred. In the hot switchover, the I/O devices are unaware that a switch has occurred.

Cold switchovers are accomplished by either using the PCIReset for each board, or by using the BdSel. Because of this, cold vs. warm/hot strategies can be mixed on a slot-by-slot basis. Following the cold switchover, boards are sequenced through the normal I/O hot swap sequences, allowing the standard enumeration procedures to work.

Since there is no state maintained across a cold switchover, very little needs to be done beyond the standard I/O Hot Swap steps. Only the protocols for causing the processor to swap need be added. Additionally, the lack of state maintenance means that there is little advantage of a cooperative switchover vs. a pre-emptive switchover. While applications may benefit from a cooperative switchover, the non-system payload gains no benefit from cooperation.

Warm switchovers are accomplished by disabling the I/O payload's bus mastering capabilities following the bus exchange. The primary mechanism for disabling the bus master capabilities is the PCI configuration header command register. Additional mechanisms, such as device CSRs, may be available on a device-dependent basis. The primary requirement for warm switchover is that both the device and its driver are capable of communications regarding the device's state and usage of system resources. This communication must be possible without the I/O device requiring bus mastership.

A communication and potential reconfiguration of PCI resources takes place before the new driver permits the payload to again perform bus master operation. The mastership hiatus permits any necessary PCI reconfiguration to occur. Resources such as bus numbers, PCI memory and I/O space allocations, and DMA buffer allocations are done anew by the new device driver. Device-to-driver communications protocols can be resynchronized, and then bus mastership capabilities can be re-enabled.

Cooperative switchovers have an advantage over pre-emptive switchovers in the warm switchover mode. Cooperative switchovers allow extant device status to be checkpointed to the new system processor and the device to halt activity prior to switchover. Devices may thereby avoid unexpected over/underruns.

Warm switchovers have the advantage over cold switchovers of enabling system continuation without interrupting payload states. This is quite desirable in systems where the payload intelligence is a large part of the system intelligence, such as call switching or cellular applications. In these applications, the existing calls can be maintained.

Warm switchovers maintain state with little support from the host operating systems, since the burden of managing the switchover falls on the device intelligence and its associated driver. However, this is also the drawback to warm switchover. The protocols and checkpointing required to reallocate resources and resynchronize driver and payload may be quite complex. It is unlikely that standard payload downloads will be capable of such operations.

Hot switchovers are accomplished by quickly switching a domain into an identically configured system processor. The I/O devices then resume operation without reconfiguration. While the devices may be notified of the switchover as an aid to recovering from potential under/overruns, basic operation of the device payload remains undisturbed.

In a hot switchover, cooperative switchovers have an advantage over pre-emptive switchovers. Cooperative switchovers allow extant device status to be checkpointed to the new system processor and the device to halt activity prior to switchover. Devices may thereby avoid unexpected over/underruns.

To perform a successful hot switchover, the new system processor must maintain a resource configuration identical to that of the original system processor. This requires careful checkpointing of system resource allocations such as PCI bus numbers, PCI I/O and memory space addresses, and DMA buffer physical addresses. Most operating systems will need modification to support this form of system processor switchover. Additionally, the system processor device drivers must be capable of configuration and checkpointing without access to real hardware.

The primary advantage of a hot switchover is that it may be implemented without modification to the payload devices' downloads. Only the drivers for the host processor require modifications. These drivers typically implement simple backplane packet interfaces, rather than the complex protocols of the I/O devices, and will deal only with status, service control and encapsulated data packets. In an environment where complex protocols acquired from third parties run on the payload devices, and the source code is not available, the hot switchover may be a necessity.

Processor Hot Swap System Resource Management

The two processor relationships and three driver maintenance levels yield six possible implementations for processor hot swap, as shown in Figure 1. Each implementation must go through a sequence of configuring the system, making the switchover, and reconfiguring the system. The sequences for each implementation are given in the Figure 1 links. Following the sequence is not all there is to implementing a successful hot swap. You may also need to carefully manage system resources.

[Figure 1 links: cold_co, warm_co, hot_co, hot_pre, warm_pre, cold_pre]

Figure 1: The six examples of possible domain switchover sequences for a given system are application, device, and driver dependent. Detection of when a switchover should be performed is not considered in these sequences. The examples assume that the drivers, operating systems, and payloads have the requisite capabilities to handle each class of switchover.
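The six classes are simply the cross product of the two classification criteria. A short Python sketch makes that explicit. The short labels mirror the names used for the Figure 1 sequences; the constant names are illustrative.

```python
from itertools import product

# Processor relationship during the switchover (short label -> meaning).
RELATIONSHIPS = {"co": "cooperative", "pre": "pre-emptive"}

# Level of payload and driver state maintained across the switchover.
STATE_LEVELS = ("cold", "warm", "hot")

# One label per combination, e.g. "warm_co" for a warm, cooperative switchover.
implementations = [
    f"{level}_{rel}" for level, rel in product(STATE_LEVELS, RELATIONSHIPS)
]

for name in implementations:
    level, rel = name.split("_")
    print(f"{name}: {level} switchover, {RELATIONSHIPS[rel]}")
```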

Cold and warm domain switchovers require little in the way of special resource management. This is because they allow PCI reconfiguration between the switchover and I/O resumption. The same cannot be said for hot switchovers. Because the device I/O is allowed to continue without reconfiguration, every resource related to I/O operations must be carefully managed in a hot switchover. These resources include, but may not be limited to, PCI bus numbers, PCI I/O space, PCI memory-mapped I/O space, PCI prefetchable memory space, PCI interrupts, and DMA physical buffer and control addresses. Additionally, device driver configuration must be managed in the absence of physical hardware.

PCI Resources
Hot switchovers require considerable resource management. The obvious management need is for the collective set of PCI resources. These resources must be identical on both processors participating in the hot switchover, yet most operating systems supporting PCI Hot Swap have dynamic allocation mechanisms. For example, PCI bus numbers are allocated as PCI-to-PCI bridges are encountered in the enumeration process. Typically, bus numbers for I/O hot swap are allocated in blocks to allow for subordinate bridges. The CPX8216 chassis, for instance, contains two domain bridges. After a small allocation to allow for PMC bridges on the system processor, the remaining bus numbers are divided equally between the two domains.

Typically, operating systems enumerate the PCI bus either automatically through the receipt of the ENUM signal or on demand by the system management interface. In either case, the results may not be identical each time. When configuring for the hot switchover of the system processor, the system not owning the domain must have a means of tracking the allocations made by the owning domain, as it cannot make its own allocations and have them match.

The key for performing bus number allocations for hot switchover is to make sure that the domain bridges have identical allocations based on the domain rather than based upon the PCI BDF (bus, device, function) triple. This is a requirement that is not accommodated by most currently available operating systems, which generally just allocate in order as bridges are discovered, and the discovery process normally proceeds based on the BDF triple.
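One way to picture domain-based allocation is a fixed table computed from the domain index alone, so that both system processors derive the same ranges regardless of discovery order. The Python sketch below assumes 256 PCI bus numbers in total and a reservation of 16 for buses local to the system processor; the reservation size and the function name are illustrative assumptions, not values taken from any particular chassis.

```python
def domain_bus_ranges(reserved_local: int, domain_count: int = 2, total: int = 256):
    """Return {domain index: (first bus, last bus)} for each domain bridge.

    The ranges depend only on the domain index, never on the order in which
    bridges happen to be discovered during enumeration.
    """
    per_domain = (total - reserved_local) // domain_count
    ranges = {}
    for domain in range(domain_count):
        first = reserved_local + domain * per_domain
        ranges[domain] = (first, first + per_domain - 1)
    return ranges


# Buses 0-15 stay with the system processor (for example, for PMC bridges);
# the remaining 240 bus numbers are split equally between the two domains.
ranges = domain_bus_ranges(reserved_local=16)
```

Because the table is a pure function of the chassis layout, the standby processor can compute it without touching the hardware it does not yet own.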

PCI I/O and Memory Allocations
PCI-to-PCI bridges used as domain bridges currently have only one window for each of the three PCI windows: I/O, memory-mapped I/O, and prefetchable memory spaces. This single window means that the available address pool for each must be divided among the domain bridges. The current recommendation is to expose the entire resource pool through each domain bridge window. The effect of dynamically changing the window size to accommodate insertion and extraction is undetermined, and dependent on the bridge implementation.

When subordinate allocations are made for devices downstream of the domain bridges, the same allocation must be made in the other host's virtual resource pool. This may be done by checkpointing the allocations to the other processor as they are made. This requirement is not yet accommodated by most available operating systems, as normal strategy is to only make allocations when physical hardware is discovered. The operating system concept of resource allocation must be extended to apply to virtual devices not yet physically present.
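The checkpointing idea can be sketched as a pair of pools: the active host allocates, and every allocation is replayed into the standby host's virtual pool. The Python below is a minimal model under stated assumptions. The class name, the slot labels, the window base, and the sizes are hypothetical, and the checkpoint record is passed directly rather than over a real inter-processor link. Sizes are assumed to be powers of two, since PCI base address regions are naturally aligned to their size.

```python
class WindowPool:
    """Address pool behind one domain bridge window (illustrative model)."""

    def __init__(self, base: int, size: int):
        self.limit = base + size
        self.next_free = base
        self.allocations = {}  # slot label -> (start address, size)

    def allocate(self, slot: str, size: int):
        """Allocate on the active host and return a checkpoint record."""
        start = (self.next_free + size - 1) & ~(size - 1)  # align to size
        if start + size > self.limit:
            raise MemoryError("window exhausted")
        self.allocations[slot] = (start, size)
        self.next_free = start + size
        return (slot, start, size)

    def apply_checkpoint(self, record):
        """Mirror an allocation on the standby host; no hardware is touched."""
        slot, start, size = record
        self.allocations[slot] = (start, size)
        self.next_free = max(self.next_free, start + size)


active = WindowPool(base=0x8000_0000, size=0x1000_0000)
standby = WindowPool(base=0x8000_0000, size=0x1000_0000)

# Each allocation on the active host is checkpointed to the standby at once.
for slot, size in (("slot3", 0x1_0000), ("slot4", 0x10_0000)):
    standby.apply_checkpoint(active.allocate(slot, size))
```

The standby pool ends up with entries for devices it has never seen, which is the sense in which resource allocation has to extend to virtual devices.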

PCI interrupts are allocated according to the hardware wiring for a given chassis. When an interrupt is allocated in the system currently owning a domain, the logically equivalent interrupt lines must be configured on the non-owning processor.

DMA Buffers
Because I/O devices may have pending DMA requests at the time of domain hot switchover, it is necessary that the physical addresses used for DMA by the active domain be similarly allocated in the standby domain. This requirement is not normally met by current operating systems.

Additionally, in order to manage allocations for multiple domains, the available DMA memory pool must either be divided and allocated into segments for each domain, or an MP-safe allocation algorithm must be used to allow the two processors to communicate their allocations as they occur. In any event, the DMA allocations must be checkpointed to the standby system and device drivers.
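The first option, dividing the pool into per-domain segments, can be sketched in a few lines of Python. The pool base, pool size, page alignment, and domain names below are assumptions chosen for illustration; the point is only that each segment's physical range follows from static parameters that both system processors share.

```python
def split_dma_pool(base: int, size: int, domains, align: int = 4096):
    """Divide a physical DMA pool into equal, aligned segments per domain.

    Returns {domain name: (segment base, segment size)}.
    """
    segment = (size // len(domains)) // align * align  # round down to alignment
    return {name: (base + i * segment, segment) for i, name in enumerate(domains)}


# A 4 MB pool at physical address 0x1000_0000 shared by two domains.
segments = split_dma_pool(0x1000_0000, 0x0040_0000, ["domain_a", "domain_b"])
```

Static segments avoid any need for the processors to negotiate at allocation time, at the cost of flexibility when one domain needs more buffer space than the other. Individual buffer allocations within a segment would still need to be checkpointed to the standby.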

The physical addresses of the DMA pools must be the same on both system processors, even if the amount of memory on the two processors differs. If virtual addresses are used in any of the packet or control data exchanged between the device and the driver, then the virtual address of such structures must also be identical on
