Hitless I/O: Overcoming challenges in high availability systems

High availability systems such as servers, communication gateways and base stations need to be continuously operational. Once they are installed in the field, software upgrades deliver feature enhancements and bug fixes. These systems are therefore designed so that their functionality can be updated without interrupting normal operation. Programmable Logic Devices (PLDs) in such systems are commonly required to support in-system design updates as well. Design convenience and strong performance at low cost make PLDs well suited as board hardware management devices, where they manage on-board DC-DC converters, monitor and control critical signals, aggregate serial communications, and perform other housekeeping functions.

The Indispensable PLD
A PLD consists of a number of programmable function units. These units are configured and interconnected to implement board-specific hardware management functions. Typically, a software design tool converts a given logic function, such as a board hardware management function, into a PLD-specific configuration bit stream, which configures the programmable function units and interconnects them. The configuration bit stream is stored in the PLD’s on-chip configuration Flash memory. When the board is powered on, the contents of the configuration Flash memory are automatically transferred to the on-chip configuration SRAM, which in turn configures the programmable function units to perform the desired hardware management task. To update the hardware management functionality, a different bit stream is loaded into the configuration Flash memory in the background, at any time, without interrupting the hardware management functions the PLD is performing. To transfer the newly stored configuration from Flash memory to the on-chip SRAM, however, the board must be power cycled, which interrupts normal system operation (Fig. 1).


Figure 1. Most PLDs must be power cycled to reconfigure (Source: Lattice Semiconductor)  
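
The Flash and SRAM relationship described above can be sketched as a small model. This is only an illustration: the class `PLD`, its method names and the bit stream labels `mgmt_v1` and `mgmt_v2` are hypothetical and do not correspond to any vendor API.

```python
class PLD:
    """Toy model of a PLD with on-chip configuration Flash and SRAM."""

    def __init__(self, bitstream):
        self.flash = bitstream   # non-volatile configuration Flash
        self.sram = None         # volatile configuration SRAM (drives the logic)
        self.power_on()

    def power_on(self):
        # At power-on the Flash contents are copied into the SRAM.
        self.sram = self.flash

    def background_update(self, bitstream):
        # Writing Flash does not disturb the design running from SRAM.
        self.flash = bitstream

    def power_cycle(self):
        # Removing power clears the SRAM; power-on reloads it from Flash.
        self.sram = None
        self.power_on()


pld = PLD("mgmt_v1")
pld.background_update("mgmt_v2")
running_before = pld.sram   # still the old design
pld.power_cycle()
running_after = pld.sram    # the new design, but only after an interruption
```

In this model the new image becomes active only through `power_cycle()`, which is the interruption a high availability system needs to avoid.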

Maintaining steady outputs during configuration
High availability systems cannot tolerate the interruption caused by power cycling. Because the PLD I/Os enable the DC-DC converters and control the reset signals for the main ASICs and CPUs on the board, the PLD outputs must not toggle during reconfiguration. Holding the outputs steady during PLD reconfiguration poses several challenges.


Figure 2: PLD reconfiguration steps using MachXO2/MachXO3 without Hitless I/O (Source: Lattice Semiconductor)  

Lattice’s MachXO2 and MachXO3 PLD families include features aimed at zero-downtime updates (Fig. 2). First, the PLD undergoes a “background update”, which loads new configuration data into its configuration Flash memory via JTAG, SPI or I2C. Once the upload is complete, a “TransFR” command is issued to move the new PLD image from configuration Flash memory to the PLD's configuration SRAM. Invoking the “TransFR” command also triggers a “Leave Alone” function, which holds all I/Os at their last known values during the transfer. Lastly, during the “logic initialization” step, the new state machines restart the power management and reset distribution functions. Without further measures, this restart turns the supplies off and forces the board through a power cycle.
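
To make the failure mode concrete, the three steps can be traced for a single DC-DC enable pin. This is a behavioral sketch only. The function name is hypothetical, and it assumes the new state machine drives the enable low while it restarts, which matches the supply shutdown described above.

```python
def update_without_hitless(enable_before):
    """Return (step, DC-DC enable pin value) for each step of the update."""
    trace = []
    pin = enable_before

    # Step 1: Flash is written in the background; the old design keeps running.
    trace.append(("background update", pin))

    # Step 2: TransFR triggers Leave Alone, which holds the pin at its last value.
    held = pin
    trace.append(("TransFR / Leave Alone", held))

    # Step 3: the new state machine starts from reset and (by assumption)
    # drives every enable low until its power-up sequence runs again.
    pin = 0
    trace.append(("logic initialization", pin))
    return trace


trace = update_without_hitless(enable_before=1)
glitched = any(value != 1 for _, value in trace)
```

Here `glitched` evaluates to `True`: the pin survives the transfer itself but drops as soon as the new logic begins to initialize.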

Hitless I/O
To support zero-downtime updates, the PLD must hold the outputs that control the supplies and other logic control signals unchanged while the state machines created by the new image initialize. Once the new state machines have initialized, they should take over control of the supplies and other logic signals.

To hold the critical I/Os unchanged during the initialization process, ‘Hitless I/O’ elements are added to the user design. As shown in Figure 3, this involves adding a MUX-Latch to every critical output. The MUX-Latch holds the output at its last known value during state machine initialization and hands control of the output back to the state machine once initialization is complete. A separate “Hitless_IO_Enable” input lets the circuit distinguish a normal (power-on) start-up from a start-up that follows a reconfiguration event, which prevents critical outputs from being locked during a normal power-on sequence.

Figure 3 illustrates the role of Hitless I/O in the state machine’s initialization process, after the new configuration is loaded into the MachXO2/MachXO3 device configuration SRAM.


Figure 3: Hitless I/O holding the critical I/O states in their last known state during initialization (Source: Lattice Semiconductor)

A MUX-Latch is added to every output that needs to remain unchanged. It holds the output at its current value for as long as its MUX control input is at “0”. This means that the DC-DC converter remains “on” (if it was previously on) regardless of the state machine output. When the control input is at logic “1”, the DC-DC converter is controlled directly by the state machine. The state machine drives the MUX control input through the ‘Normal Operation’ node. An external input signal, ‘Hitless_IO_Enable’, is added to the design to differentiate between a normal “power on” configuration (when the state machine controls the DC-DC converter outputs during its initialization) and a hitless reconfiguration (when the DC-DC converter outputs are left unchanged during state machine initialization).

Let’s assume that the “Hitless_IO_Enable” signal controlling the hitless process is set to “1”. Before initialization, the state machine resets the ‘Normal Operation’ signal to “0”. The MUX-Latch ignores the outputs from the state machine, and the DC-DC converter “Enable” signals are left unchanged. When the PLD's logic is ready to resume normal operation, it sets the ‘Normal Operation’ signal to logic “1” (high), which allows the state machine to assert control over the DC-DC converters. The board's DC-DC converters and resets are now controlled by the updated power and reset control state machine.
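
The MUX-Latch behavior can be captured in a few lines of behavioral code. This is a sketch, not the vendor's reference design: the class name `MuxLatch` is hypothetical, and the way ‘Hitless_IO_Enable’ is combined with ‘Normal Operation’ (pass-through when either ‘Normal Operation’ is 1 or ‘Hitless_IO_Enable’ is 0) is an assumption inferred from the description above.

```python
class MuxLatch:
    """Behavioral model of one Hitless I/O MUX-Latch on a critical output."""

    def __init__(self, initial_pin):
        # On wake-up the latch samples the level currently present on the pin.
        self.held = initial_pin

    def output(self, sm_out, normal_operation, hitless_io_enable):
        # Assumed gating: the state machine drives the pin once it asserts
        # 'Normal Operation', or whenever hitless mode is disabled (power-on).
        select = normal_operation or not hitless_io_enable
        if select:
            self.held = sm_out   # follow the state machine and track its value
        return self.held         # otherwise keep the last known value


# Hitless reconfiguration: the converter was on before the swap.
latch = MuxLatch(initial_pin=1)
during_init = latch.output(sm_out=0, normal_operation=0, hitless_io_enable=1)
after_init = latch.output(sm_out=1, normal_operation=1, hitless_io_enable=1)

# Normal power-on: the state machine controls the pin from the start.
cold = MuxLatch(initial_pin=0)
power_on_pin = cold.output(sm_out=1, normal_operation=0, hitless_io_enable=0)
```

With ‘Hitless_IO_Enable’ at 1, `during_init` stays at 1 even though the initializing state machine drives 0. With it at 0, the state machine output passes straight through, so a normal power-on sequence is not locked.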


Device phases
There are four phases to the MachXO2/MachXO3 configuration swap over (Figure 4).


Figure 4. The sequence of events that takes place within the MachXO2/MachXO3 device during configuration swap over. (Source: Lattice Semiconductor)

Phase 1: The MachXO2/MachXO3 device’s on-chip configuration Flash memory is programmed in the background by the external controller via I2C/JTAG. During background programming, the device operates normally, without interruption. After programming is complete and it is time to switch to the newly programmed configuration, the logic resets the ‘Normal Operation’ node to 0. This transfers control of the Hitless I/O outputs from the internal state machine to the MUX-Latch. The external controller drives the ‘Hitless_IO_Enable’ pin high.

Phase 2: The external controller issues a ‘TransFR’ command via the I2C/JTAG port to transfer the contents of the on-chip configuration Flash memory to the on-chip configuration SRAM. In response to the command, the boundary scan ‘LeaveAlone’ feature samples the Hitless I/O outputs and locks them at their current values.

Phase 3: The on-chip Flash contents are transferred to the SRAM. At this point the logic in the device is not functional, but the Hitless I/O outputs are held stable by the boundary scan ‘LeaveAlone’ feature.

Phase 4: The device wakes up with the new logic in the configuration SRAM. The outputs are still held by the boundary scan ‘LeaveAlone’ I/O feature. The ‘Normal Operation’ node is held at logic 0 by the MachXO2/MachXO3 device’s ‘Global Set-Reset’ node. At this point, the MUX-Latch in the new configuration logic samples the output and holds it at the input of the tristate output gate. At the end of phase 4, the internal logic releases the ‘Global Output Enable’ (GOE) signal and the internal ‘Global Set-Reset’ node, and the ‘LeaveAlone’ I/O feature releases output control to the logic in the fabric.

‘Normal Operation’: The new state machine begins its initialization process. Its outputs do not yet control the Hitless I/O outputs, which the MUX-Latch holds at their previous values. When initialization is complete, the state machine sets the ‘Normal Operation’ node to logic 1, and the MUX-Latch hands control of the Hitless I/O outputs over to the state machine.
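
The four phases and the final hand-over can be walked through for one critical output pin. The sketch below is a simplified software trace, not a timing-accurate model. The function names and the list of values the new state machine drives while initializing are hypothetical.

```python
def mux(normal_operation, sm_out, held):
    # MUX-Latch with hitless mode enabled: pass-through only when
    # 'Normal Operation' is 1, otherwise keep the held value.
    return sm_out if normal_operation else held


def hitless_swap(pin_before, sm_init_outputs, sm_final):
    """Trace one critical output through a hitless configuration swap."""
    trace = []

    # Phase 1: background Flash programming; old logic then clears
    # 'Normal Operation', so the old MUX-Latch holds the pin.
    pin = mux(0, pin_before, pin_before)
    trace.append(("phase 1", pin))

    # Phase 2: TransFR issued; LeaveAlone samples and locks the pin.
    leave_alone = pin
    trace.append(("phase 2", leave_alone))

    # Phase 3: Flash-to-SRAM transfer; fabric logic is not functional.
    trace.append(("phase 3", leave_alone))

    # Phase 4: the new MUX-Latch samples the held level, then LeaveAlone
    # releases output control to the fabric.
    held = leave_alone
    trace.append(("phase 4", held))

    # Initialization: 'Normal Operation' is 0, so sm_out is ignored.
    for sm_out in sm_init_outputs:
        trace.append(("initializing", mux(0, sm_out, held)))

    # Hand-over: the state machine sets 'Normal Operation' to 1.
    trace.append(("normal operation", mux(1, sm_final, held)))
    return trace


trace = hitless_swap(pin_before=1, sm_init_outputs=[0, 0, 1], sm_final=1)
steady = all(value == 1 for _, value in trace)
```

For this input `steady` is `True`: the enable never drops, even though the initializing state machine drives 0 for part of the sequence. The hand-over is glitch-free only if the state machine already drives the held value when it raises ‘Normal Operation’, which is why that node is set after initialization completes.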

A Real-world Example
The block diagram in Figure 5 illustrates the usage of a PLD to power, monitor and manage the cluster of CPUs and other board-level subsystems in a Rack Server, including a Platform Controller Hub (PCH), Baseboard Management Controller (BMC) and Host Bus Adapter (HBA). In this role, the PLD is primarily responsible for sequencing the board's point-of-load voltage regulators during power-up and power-down, and holding resets and control signals in the proper states during power cycling. During normal operation, the PLD monitors the subsystems for alarm conditions (temperature, voltage, memory and I/O faults, etc.) or status changes, while holding control signals static in the proper state.


Figure 5. Control / Housekeeping functions for a Rack Server integrated into a PLD (Source: Lattice Semiconductor)

The BMC updates the server's control PLD in the background and issues the “TransFR” command to swap to the updated PLD configuration. Without the Hitless I/O feature, the control and reset signals or the voltage regulator (VR) signals will toggle during the initialization procedure. For example, if a “Reset” signal on the CPU or one of its peripherals toggles during reconfiguration, the CPU will reinitialize and start to reboot regardless of the function it was performing. Likewise, if “Power Enable” toggles, the VR or Point of Load (PoL) power supply will shut off, putting the device it powers into an unexpected state. This can cause the board to halt operation, lose or scramble data, or even physically damage the board's electronic components.

Adding a Hitless I/O mechanism to critical signals enables the PLD to freeze its external output control signals during the reconfiguration process. As a result, the server's mission-critical functions are not interrupted during routine maintenance and upgrades of the PLD. This capability is also valuable during product development, where it provides fast turnaround during debug, and during rack installation, where it allows specialized product variants to be created.

Conclusions
PLDs can serve as a flexible, cost-effective solution for controlling DC-DC converters, bridging I/O channels and performing other board-level hardware management functions in complex electronic systems. The devices’ ability to accept live updates gives manufacturers the flexibility to make on-the-fly configuration changes that correct design errors or add new capabilities to their products. With the recent introduction of the Hitless I/O architecture, PLDs can now be reconfigured in a glitch-free, deterministic manner. The architecture typically adds less than 1% to a design's gate count and can be implemented without any external components. By enabling reliable configuration changes without power cycling, hitless update logic makes PLDs a strong choice for hardware management in networking, data center storage equipment and other mission-critical applications.
