Making IPMI Work in ATCA Designs

The move to PICMG 2.16 backplanes in CompactPCI designs sparked a big change in the way communication equipment designers looked at standards-based hardware platforms. Support for the Intelligent Platform Management Interface (IPMI) is one of the key reasons that designers and manufacturers embraced PICMG 2.16. Through IPMI, designers have a means of managing system health and, in turn, helping CompactPCI platforms reach the coveted 99.999 percent high availability (HA) mark.

As the communication equipment market shifts its attention to the AdvancedTCA (ATCA) architecture, designers are once again turning to IPMI as a key element for managing system resources. In this article, we'll look at the basic elements that make up versions 1.5 and 2.0 of the IPMI spec. We'll then show how designers tap into IPMI in an ATCA design.

IPMI Overview
Introduced in 1998, IPMI is a messaging protocol that defines how to monitor system hardware, control system components, and retrieve hardware event logs, among other tasks. IPMI messages flow “in-the-box” (on channels like the I2C-based Intelligent Platform Management Bus (IPMB)) or “out-of-the-box” (on channels like TCP/IP/Ethernet, RS-232/PPP/TCP/IP, etc.). IPMI even describes how multiple embedded management controllers collaborate. The latest revision, IPMI v2.0, added standardized console access (called serial-over-LAN (SOL) redirection), stronger security (via AES encryption, etc.), and enhanced support for bladed/modular systems.
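
To make the message flow concrete, the sketch below frames a request the way it travels on the IPMB: responder address, netFn/LUN byte, command, request data, and two two's-complement checksums, following the field order in the IPMB v1.0 spec. The slave addresses in the example are illustrative, and error handling is omitted.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Two's-complement checksum used on the IPMB: the covered bytes plus
 * the checksum byte must sum to zero (mod 256). */
static uint8_t ipmb_checksum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return (uint8_t)(-sum);
}

/* Frame an IPMB request: rsSA | netFn/rsLUN | chk1 | rqSA | rqSeq/rqLUN
 * | cmd | data... | chk2 (field order per the IPMB v1.0 spec). */
static size_t ipmb_frame_request(uint8_t *out, uint8_t rs_sa, uint8_t netfn,
                                 uint8_t rq_sa, uint8_t rq_seq, uint8_t cmd,
                                 const uint8_t *data, size_t data_len)
{
    size_t n = 0;
    out[n++] = rs_sa;                   /* responder slave address */
    out[n++] = (uint8_t)(netfn << 2);   /* netFn, responder LUN 0  */
    out[n++] = ipmb_checksum(out, 2);   /* header checksum         */
    out[n++] = rq_sa;                   /* requester slave address */
    out[n++] = (uint8_t)(rq_seq << 2);  /* sequence, requester LUN */
    out[n++] = cmd;
    if (data_len) {
        memcpy(&out[n], data, data_len);
        n += data_len;
    }
    out[n] = ipmb_checksum(&out[3], n - 3); /* body checksum */
    return n + 1;
}

int main(void)
{
    /* Get Device ID (netFn App = 0x06, cmd 0x01) sent to a controller
     * at slave address 0x20; both addresses are illustrative. */
    uint8_t msg[32];
    size_t len = ipmb_frame_request(msg, 0x20, 0x06, 0x82, 1, 0x01, NULL, 0);
    for (size_t i = 0; i < len; i++)
        printf("%02X ", msg[i]);
    printf("\n");
    return 0;
}
```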

There are significant customer benefits to using an autonomous management subsystem in an ATCA shelf. Because the management subsystem is unaffected by failures in the main CPU or O/S, it delivers a higher level of system manageability. Consider how IPMI complements ATCA with respect to thermal stress.

ATCA itself builds on extensive thermal modeling: it specifies a wider board pitch and explicit airflow requirements, giving component and card designers more generous design margins. Even these margins will be pushed, however. By monitoring thermal sensors and issuing platform event traps (PETs) on threshold violations, an IPMI subsystem can alert remote management systems to impending thermal faults. This is just one example of predictive, proactive failure management; if failures can be headed off in this way, significant progress can be made on overall system reliability.
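
The sketch below illustrates the threshold mechanism behind such alerts. IPMI threshold sensors pair each limit with a hysteresis value so that a reading hovering near the limit does not generate a storm of assert/deassert events; the queue_thermal_event() hook is a hypothetical stand-in for the firmware's event path.

```c
#include <stdbool.h>
#include <stdint.h>

/* One upper-critical threshold with positive-going hysteresis. The
 * threshold and hysteresis values would come from the sensor's SDR;
 * queue_thermal_event() is a hypothetical firmware hook. */
struct thresh_sensor {
    uint8_t reading;        /* latest raw reading       */
    uint8_t upper_critical; /* assertion threshold      */
    uint8_t hysteresis;     /* re-arm margin            */
    bool    asserted;       /* event currently asserted */
};

extern void queue_thermal_event(bool assertion);

void check_upper_critical(struct thresh_sensor *s)
{
    if (!s->asserted && s->reading >= s->upper_critical) {
        s->asserted = true;
        queue_thermal_event(true);   /* would become a PET trap */
    } else if (s->asserted &&
               s->reading + s->hysteresis < s->upper_critical) {
        s->asserted = false;
        queue_thermal_event(false);  /* deassertion event */
    }
}
```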

Even better, by retrieving event logs and component information, IPMI enables additional troubleshooting tools and processes. IPMI delivers the field replaceable unit (FRU) information that uniquely identifies the components in question. In other words, IPMI enables “diagnose-before-dispatch” automation. This improves serviceability and so reduces field maintenance costs.

While reliability, availability and serviceability (RAS) were early design goals for IPMI, IPMI v2.0 also addresses operational costs. For instance, during a system restart, IPMI's serial-over-LAN (SOL) feature enables a remote system manager to watch various components (like storage cards) go through their power-on self test (POST) processing. In many cases, remote managers can interact with these processes to make configuration adjustments, run additional diagnostics, etc. These features ensure that unattended, lights-out remote system management is always possible.

Inside IPMI
Before we discuss specific IPMI features within ATCA, we need to understand some of IPMI's general principles and functions on which an IPMI/ATCA implementation is based. Figure 1 shows a generic IPMI system.


Figure 1: Diagram of a generic IPMI subsystem. (Note: Image courtesy of Intel).

At the heart of any IPMI-based subsystem lies the baseboard management controller (BMC). BMCs often leverage commodity Intel Architecture features. For example, they might use the low pin count (LPC) bus as their system interface (via the KCS, SMIC or BT protocols). For contexts where LPC might not be available (e.g. mezzanine cards), IPMI v2.0 defines a frugal I2C/SMBus-based alternative called the SMBus system interface (SSIF).
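
As a rough illustration of the simplest of these, here is the write phase of a KCS transfer, assuming the common 0xCA2/0xCA3 I/O base on x86 and the control codes defined in the IPMI spec. It is deliberately simplified: the read phase, OBF handling, timeouts and error recovery that a production driver needs are all omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/io.h>   /* inb/outb; needs ioperm/iopl and root on Linux/x86 */

#define KCS_DATA    0xCA2  /* data in/out register (common default base) */
#define KCS_STATUS  0xCA3  /* status (read) / command (write) register   */

#define KCS_IBF         0x02  /* input buffer full                */
#define KCS_WRITE_START 0x61  /* control codes from the IPMI spec */
#define KCS_WRITE_END   0x62

static void kcs_wait_ibf_clear(void)
{
    while (inb(KCS_STATUS) & KCS_IBF)
        ; /* a production driver would add a timeout here */
}

/* Write phase of a KCS transfer: the message is the netFn/LUN byte,
 * the command byte, then any request data. */
static void kcs_write_request(const uint8_t *msg, size_t len)
{
    kcs_wait_ibf_clear();
    outb(KCS_WRITE_START, KCS_STATUS);   /* begin the transfer     */
    for (size_t i = 0; i + 1 < len; i++) {
        kcs_wait_ibf_clear();
        outb(msg[i], KCS_DATA);          /* all but the last byte  */
    }
    kcs_wait_ibf_clear();
    outb(KCS_WRITE_END, KCS_STATUS);     /* announce the last byte */
    kcs_wait_ibf_clear();
    outb(msg[len - 1], KCS_DATA);
}
```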

There are even instances where a peripheral management controller (PMC) is dedicated to a specific device (like a power supply). In such cases, very low-cost 8051-type controllers are often used (via ASICs that include parallel address/data bus support).

BMCs typically integrate a number of analog-to-digital converters (ADCs) for voltage monitoring, counters for fan speed monitoring, pulse width modulation (PWM) or digital-to-analog (D/A) outputs for driving fans, general-purpose I/Os, serial ports, and I2C buses for interfacing to external sensors and expansion components. BMCs and their external interface components (NICs, RS-232 transceivers, etc.) are powered by a standby voltage rail, allowing BMCs to provide manageability functions regardless of the system's power state – a unique advantage of IPMI over O/S-resident agents.
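
A minimal sketch of how those PWM outputs and tach counters come together follows, assuming stand-in register accessors (real BMCs expose these blocks in vendor-specific ways): a table-driven fan curve run from a periodic task.

```c
#include <stdint.h>

/* Stand-ins for the vendor-specific PWM and tach blocks; every BMC
 * exposes these registers differently. */
static uint8_t pwm_duty[4];
static void pwm_set_duty(unsigned fan, uint8_t percent) { pwm_duty[fan] = percent; }
static uint16_t tach_read_rpm(unsigned fan) { (void)fan; return 4200; }

/* Table-driven fan curve: map a board temperature in degrees C to a
 * PWM duty cycle. A real BMC runs this from a periodic task. */
static uint8_t duty_for_temp(int temp_c)
{
    if (temp_c < 30) return 30;   /* quiet floor       */
    if (temp_c < 45) return 50;
    if (temp_c < 60) return 75;
    return 100;                   /* thermal emergency */
}

void fan_control_tick(unsigned fan, int temp_c)
{
    pwm_set_duty(fan, duty_for_temp(temp_c));
    if (tach_read_rpm(fan) == 0) {
        /* fan stalled: log a SEL event and assert the fan sensor */
    }
}
```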

Showing Its Greatest Value
IPMI shows its greatest value when the main system interface becomes unavailable. During such scenarios, the BMC provides alternative management paths into the server, typically referred to as out-of-band (OOB) access. These paths can be implemented via a NIC or an RS-232 interface into the BMC.

Through the BMC's serial port, an administrator can use a terminal program to manage the system via a command-line interface. The BMC's serial port can also be bridged to the system serial port, on which the BIOS and O/S output their consoles (UNIX/Linux consoles or Microsoft emergency management services). The BMC can then redirect these consoles over the LAN.

One of the crucial functions a BMC provides is recording events for abnormal platform behavior, like CPU overheating, fan failure, chassis intrusion, etc. These events are stored in centralized non-volatile memory as the system event log (SEL). BMC firmware implements a logical SEL device that gives the outside world an interface for accessing the SEL, so administrators can retrieve entries for troubleshooting.
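
For reference, each standard SEL entry is a fixed 16-byte record. A C view of the layout, per the IPMI spec's type-02h system event record, might look like this:

```c
#include <stdint.h>

/* The standard 16-byte SEL event record (record type 02h) from the
 * IPMI spec, packed so it can be copied straight out of SEL storage. */
#pragma pack(push, 1)
struct sel_event_record {
    uint16_t record_id;      /* assigned by the SEL device            */
    uint8_t  record_type;    /* 0x02 = system event record            */
    uint32_t timestamp;      /* seconds since 1970, per the SEL clock */
    uint16_t generator_id;   /* slave address / software ID of source */
    uint8_t  evm_rev;        /* event message format revision (0x04)  */
    uint8_t  sensor_type;    /* e.g. 0x01 = temperature               */
    uint8_t  sensor_number;
    uint8_t  event_dir_type; /* bit 7 assert/deassert, bits 6:0 type  */
    uint8_t  event_data[3];
};
#pragma pack(pop)
```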

IPMI also provides a framework for storing and accessing inventory information of FRUs. Each major subsystem can be associated with an EEPROM containing its FRU information, aiding in simple identification and replacement.
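
The FRU data behind this starts with an 8-byte common header whose offsets (in 8-byte multiples) locate the chassis, board and product areas. A sketch of validating that header, following the IPMI FRU Information Storage Definition, is shown below.

```c
#include <stdint.h>

/* The 8-byte FRU common header from the IPMI FRU Information Storage
 * Definition; each area offset is in multiples of 8 bytes. */
struct fru_common_header {
    uint8_t format_version;     /* 0x01 for the current format        */
    uint8_t internal_offset;
    uint8_t chassis_offset;
    uint8_t board_offset;
    uint8_t product_offset;
    uint8_t multirecord_offset;
    uint8_t pad;
    uint8_t checksum;           /* zero checksum over all eight bytes */
};

/* Validate the header before trusting any of the offsets. */
int fru_header_valid(const struct fru_common_header *h)
{
    const uint8_t *p = (const uint8_t *)h;
    uint8_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += p[i];
    return (h->format_version & 0x0F) == 0x01 && sum == 0;
}
```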

Adding Unique Features
Hardware designers have found IPMI an ideal framework for unique features. For example, the BMC's watchdog timer can be used to detect and recover from BIOS and operating-system hangs. The front-panel buttons can be routed through the BMC, allowing lock-out of local access. Another example is predictive failure analysis for DRAM: the BMC can add a custom sensor that tracks single-bit memory errors as a predictor of DRAM failures. The BMC can also implement logic that interprets sub-normal fan-speed readings, useful in scheduling replacement before fans fail outright.
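
The watchdog flow is simple enough to sketch. The example below arms the BMC watchdog with the standard Set Watchdog Timer and Reset Watchdog Timer commands (NetFn App, commands 24h and 22h); the ipmi_send() transport hook is a hypothetical stand-in for whatever system interface is in use.

```c
#include <stdint.h>

/* Hypothetical transport hook: delivers an IPMI request to the BMC
 * via whatever system interface is in use (KCS, SSIF, ...). */
extern int ipmi_send(uint8_t netfn, uint8_t cmd,
                     const uint8_t *data, unsigned len);

#define NETFN_APP          0x06
#define CMD_SET_WATCHDOG   0x24
#define CMD_RESET_WATCHDOG 0x22

/* Arm the BMC watchdog for an OS-load timeout with a hard-reset
 * action. The countdown is in 100 ms units (0x0258 = 60 seconds). */
void watchdog_arm(void)
{
    uint8_t req[6] = {
        0x03,       /* timer use: OS load               */
        0x01,       /* timeout action: hard reset       */
        0x00,       /* no pre-timeout interrupt         */
        0x08,       /* clear OS-load expiration flag    */
        0x58, 0x02  /* initial countdown, LS byte first */
    };
    ipmi_send(NETFN_APP, CMD_SET_WATCHDOG, req, sizeof(req));
}

/* The OS agent must "pet" the timer periodically or the BMC resets
 * the board - this is how hangs are detected. */
void watchdog_pet(void)
{
    ipmi_send(NETFN_APP, CMD_RESET_WATCHDOG, NULL, 0);
}
```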

IPMI also enables multiple vendors' management software to work across different firmware and hardware platforms, and vice versa. For example, IPMI abstracts sensor-specific characteristics from management software. Management software queries raw readings from sensors via the BMC. Factors for converting to actual measurement units are stored in sensor data records (SDRs). SDRs are stored separately in non-volatile memory. Such abstraction enables management software to work across platforms, without platform-specific knowledge of how to interpret the sensors.
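
Concretely, for linear sensors the SDR supplies multiplier and offset factors, and the reading in real units is (M * raw + B * 10^Bexp) * 10^Rexp. A small sketch, with illustrative factor values for a temperature sensor:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Linearization factors pulled from a full sensor data record. M and
 * B are 10-bit signed fields in the SDR; the exponents are 4-bit. */
struct sdr_factors {
    int m;     /* multiplier      */
    int b;     /* offset          */
    int b_exp; /* offset exponent */
    int r_exp; /* result exponent */
};

/* reading = (M * raw + B * 10^Bexp) * 10^Rexp for linear sensors;
 * non-linear sensors apply a further linearization formula. */
double sdr_convert(const struct sdr_factors *f, uint8_t raw)
{
    return (f->m * raw + f->b * pow(10, f->b_exp)) * pow(10, f->r_exp);
}

int main(void)
{
    /* Illustrative temperature sensor: 0.5 degC per count with a
     * -20 degC offset, so raw 0x80 reads as 44.0 degC. */
    struct sdr_factors temp = { .m = 5, .b = -200, .b_exp = 0, .r_exp = -1 };
    printf("raw 0x80 -> %.1f degC\n", sdr_convert(&temp, 0x80));
    return 0;
}
```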

Essentially, IPMI is a self-describing, message-based hardware instrumentation standard. As you implement IPMI within a system, you can add external RAM/ROM, connect additional I2C buses, employ additional general purpose I/O (GPIO) pins, change sensors, resize the system event log, etc. You can even restructure the hierarchy of components that comprise your system. Yet, IPMI still enables your remote systems management software (SMS) to generically adapt to all such instrumentation changes without any platform-specific instrumentation knowledge.

Module Management Overview
When discussing IPMI within ATCA, there are some terminology changes relating to the BMC, the intelligent management chip that hosts IPMI firmware. Depending on its location, this controller is referred to as a system manager, a shelf management controller (ShMC), an IPM controller (IPMC) on an ATCA board, or a modular management controller (MMC) residing on an advanced mezzanine card (AMC) (Figure 2).


Figure 2: Diagram showing how IPMs relate to one another in an ATCA design.

The system manager, usually accessed via a systems management software (SMS) console, resides outside the ATCA chassis/shelf and controls one or more systems/shelves. The SMS console typically uses a TCP/IP LAN to connect to the various ATCA shelf managers (although emergency options, like POTS dial-up, are supported via IPMI). Irrespective of the transport media and transport protocol, IPMI always uses the remote management control protocol (RMCP) to carry IPMI messages. This is the most fundamental way to communicate with an ATCA shelf manager.
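
Every IPMI-over-LAN packet begins with the small RMCP wrapper defined by the DMTF's ASF specification, carried over UDP port 623. A C view of that header follows; IPMI v2.0/RMCP+ layers authenticated, encrypted payloads on top of this same wrapper.

```c
#include <stdint.h>

/* RMCP wrapper that precedes every IPMI-over-LAN message. IPMI v1.5
 * sessions follow this with an IPMI session header; IPMI v2.0/RMCP+
 * adds authenticated, encrypted payloads behind the same wrapper. */
#pragma pack(push, 1)
struct rmcp_header {
    uint8_t version;   /* 0x06 = RMCP v1.0                    */
    uint8_t reserved;
    uint8_t sequence;  /* 0xFF = no RMCP ACK requested        */
    uint8_t msg_class; /* 0x07 = IPMI; bit 7 set marks an ACK */
};
#pragma pack(pop)

/* RMCP listens on the well-known UDP port 623. */
#define RMCP_UDP_PORT 623
```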

The ATCA shelf manager is responsible for overall shelf health, for communicating with remote system management software (SMS), and for taking certain corrective actions. Hot-swap events (e.g. component insertion and removal) are also handled and reported by the ShMC. This can include latch/lock management, power budgeting, in-rush current sequencing, and electronic keying (E-Keying).

ATCA shelf managers are usually connected to at least two independent I2C/IPMB buses. While left to vendor discretion, a good rule of thumb is to use redundant shelf managers whenever a shelf has more than seven ATCA (line) cards. While ATCA specifies an active-active I2C/IPMB failover scheme, ShMC coherency is left to vendor implementations.

IPM Devices
ATCA (line) cards, or boards, provide the system's services; different line/carrier cards can do everything from call processing (SS7) to storage management. The ATCA specification defines long, short and hybrid form factors for these cards. On each, the ATCA specification refers to the management controller as an IPM controller (IPMC).

The IPMC chip, and its hosted IPMI firmware, might be the same on many different boards. The IPMC chip/firmware can also differ from board to board within a single ATCA chassis.

The AMC specification defines common elements (mechanical, power, management, etc.) for a relatively small card (sometimes called a module). These mezzanine cards are designed to fit onto a full-sized ATCA card, adding functionality in a modular way. ATCA mezzanine cards/modules lie parallel to their ATCA carrier board. Single- and double-width (and full/half-height) modules are supported.

ATCA's IPMI-based instrumentation technology extends to these mezzanine cards. Each mezzanine card has its own management controller, often called an MMC.

As a side note, the AMC specification is still in the final PICMG approval process. AMC based products are expected soon. See the PICMG website for further release details.

Radial vs. Bussed IPMB Topologies
There are two related IPMI/IPMB topologies: radial and bussed. The advantages and disadvantages of each relate to the overall security, performance, design complexity and build costs that the telecom equipment manufacturer (TEM) is targeting. The best of both worlds, of course, is to combine them into a hybrid, or mesh, topology. In any case, ATCA dictates redundant IPMB channels to provide higher availability during IPMB failure conditions.

In Figure 2 above, IPMB-0 refers to the aggregation of the two separate I2C/IPMB buses. Each IPM device connects to, and supports, IPMB-0. IPMB-0 allows up to 24 different management controllers (xMCs) to collaborate. Another IPMB (a separate address space) is referenced as IPMB-L (L for local).

IPMB-L is typically supported by an ATCA MMC. IPMB-L is electrically isolated from IPMB-0. An MMC, therefore, cannot interface directly with IPMB-0. Each MMC relies on the carrier card's IPMC to communicate within the shelf (via IPMI-defined message “bridging”).
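
Bridging uses the standard Send Message command (NetFn App, command 34h): the fully framed inner request is wrapped with a channel byte and handed to the IPMC for forwarding. A minimal sketch of building that wrapper, assuming the inner IPMB message has already been framed:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NETFN_APP        0x06
#define CMD_SEND_MESSAGE 0x34

/* Wrap an already-framed IPMB request for delivery onto another bus
 * via the IPMI Send Message command. An SMS wanting to reach an MMC
 * asks the carrier's IPMC to forward the request onto IPMB-L this way. */
size_t bridge_request(uint8_t *out, uint8_t channel,
                      const uint8_t *inner, size_t inner_len)
{
    out[0] = (uint8_t)(0x40 | (channel & 0x0F)); /* track request + channel */
    memcpy(&out[1], inner, inner_len);           /* embedded IPMB message   */
    return inner_len + 1;
}
```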

Popular Options and/or Extensions
Improved serviceability of the shelf and/or individual carriers is highly desirable. While the IPMI v2.0 spec is not mandatory for ATCA, its serial-over-LAN (SOL) support redirects the various boot and OS consoles that are typically output to a local serial port on headless systems, making them accessible over the LAN.

The SOL protocol would be supported within the ShMC, IPMC, BIOS and the system manager as an addition to the base IPMI v1.5 firmware and software. This OOB management allows administrators to manage or power-cycle servers remotely, resolving service and application outages even when the OS is not present or is unresponsive.

SOL extends access to BIOS, boot, Windows emergency management services (EMS)/special administration console (SAC), and Linux consoles over the LAN. SOL greatly reduces the time-consuming and costly task of monitoring, diagnosing, and repairing remote system outages by allowing control over the pre- and post-OS boot process.

Within the IPMI context, in-band refers to management access to a ShMC, IPMC, or MMC via a software agent running in the OS. In the previous examples, all access to the IPMC/MMC was done OOB, meaning the management controller was accessed directly, with no OS intervention. In-band management standards and architecture vary between vendors and markets. Traditionally, CORBA and SNMP have been used to satisfy the management requirements of telecommunications equipment.

RMCP is the protocol used when connecting externally over a LAN using IPMI. This requires either direct support within a console application (system manager) or a management access point (proxy) placed in between that bridges between the native management console protocol and IPMI. Command-line access is a useful user interface for in-band management, to access and control the various MCs in a shelf. This command line protocol (CLP) can be implemented directly within the ShMC, externally within the system manager, or using software proxy technologies. The latest developments within the DMTF's SMASH working group may easily be leveraged for use within ATCA.

Insourcing vs. Outsourcing
After selecting an IPMC and/or MMC and a hardware platform, and any optional extensions, designers must decide how to add IPMI firmware and software to their systems. Designers have many options at this point, including developing their own firmware or utilizing an MC vendor's firmware. Another possible solution would be to commission a third party that delivers IPMI firmware/software building blocks.

References

  1. IPMI: Intelligent Platform Management Bus Communications Protocol Specification, v1.0, Document Revision 1.0, November 15, 1999. Copyright 1998, 1999 Intel Corporation, Hewlett-Packard Company, NEC Corporation, Dell Computer Corporation. All rights reserved.
  2. IPMI: Intelligent Platform Management Interface Specification, v1.5, Document Revision 1.1, February 20, 2002. Copyright 1999-2002 Intel Corporation, Hewlett-Packard Company, NEC Corporation, Dell Computer Corporation. All rights reserved.
  3. PICMG: AdvancedTCA Base Specification (PICMG 3.0 R1.0), as amended by ECN 3.0-1.0-001.
  4. PICMG: AMC.0, Advanced Mezzanine Card Short Form Specification, Version D0.9a, June 15, 2004.
  5. DMTF: CIM Operations over HTTP, v1.1, DSP0200.
  6. SMASH: "Standard Command Line Protocol for Managing Servers," presented by Arvind Kumar and Perry Vincent at IDF, Fall 2004.
  7. Service Availability Forum: Hardware Platform Interface, SAF HPI B.01.01.

About the Author
Steve Rokov is the technical director at OSA Technologies, an Avocent company. Steve holds a B.Sc. in computer science from the University of Brighton, UK and can be reached at .
