Embedding an IPMI platform management subsystem to monitor server system health - Embedded.com

Used primarily in enterprise systems, platform management monitors and reports on the health of the system hardware through isolated hardware and software that does not depend on the operational state of the system’s own hardware or software.

The platform management hardware typically resides on the same board as the system hardware; however, because it is isolated it can remain functional even if the system hardware is non-operational. The platform management hardware is usually powered by a separate power supply.

Servers make up the vast majority of these enterprise systems and they are the backbone of the Internet. There are thousands upon thousands of these servers in server farms all over the world. When a server fails or is about to fail, it is important for the technical caretakers to find, fix, or replace the system quickly.

Platform management has become an essential part of enterprise-class systems. Why? How can adding cost and complexity to a system save money? There are two answers. One is the system operational ratio: up-time divided by total time. The other is total cost of ownership (TCO).

Platform management hardware and software enable failing systems to be located, taken offline if necessary, and serviced quickly and efficiently. This ability reduces TCO while, at the same time, improving the overall reliability of the server farm.
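As a rough illustration of the system operational ratio defined above (up-time divided by total time), the following sketch compares two repair scenarios. The downtime figures are hypothetical and chosen only to show the arithmetic; they are not measurements from any real installation.

```python
def operational_ratio(uptime_hours: float, total_hours: float) -> float:
    """Up-time divided by total time, as defined in the text."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= uptime_hours <= total_hours:
        raise ValueError("uptime_hours must lie between 0 and total_hours")
    return uptime_hours / total_hours


HOURS_PER_YEAR = 365 * 24  # ignoring leap years

# Hypothetical figures: 20 hours of downtime per year when failures are
# found late, versus 2 hours when they are located and serviced quickly.
slow_repair = operational_ratio(HOURS_PER_YEAR - 20, HOURS_PER_YEAR)
fast_repair = operational_ratio(HOURS_PER_YEAR - 2, HOURS_PER_YEAR)

print(f"slow repair: {slow_repair:.4%}")
print(f"fast repair: {fast_repair:.4%}")
```

Across a farm of many servers, even a small difference in this ratio translates into fewer spare systems needed to cover outages.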

 

Figure 1 – Board Management Architecture

System Monitoring

The platform management subsystem monitors the health of the system: temperature, air flow, voltage levels, and software (Figure 1 above).

The temperature of specific devices is monitored by the platform management subsystem using on-die temperature circuits or temperature-measuring devices placed near the hotter devices. An alarm can be activated if the temperature exceeds a threshold, and the system can be shut down if the temperature exceeds a second, higher threshold.
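The two-threshold scheme described above can be sketched as follows. The threshold values (85 °C and 100 °C) and the names `ThermalAction` and `classify_temperature` are assumptions made for illustration; real limits depend on the device being monitored.

```python
from enum import Enum


class ThermalAction(Enum):
    NONE = "none"
    ALARM = "alarm"
    SHUTDOWN = "shutdown"


def classify_temperature(temp_c: float,
                         alarm_c: float = 85.0,      # hypothetical alarm threshold
                         shutdown_c: float = 100.0   # hypothetical shut-down threshold
                         ) -> ThermalAction:
    """Map a temperature reading to the action the text describes."""
    if shutdown_c <= alarm_c:
        raise ValueError("shutdown threshold must be above the alarm threshold")
    if temp_c > shutdown_c:
        return ThermalAction.SHUTDOWN
    if temp_c > alarm_c:
        return ThermalAction.ALARM
    return ThermalAction.NONE


for reading in (60.0, 90.0, 105.0):
    print(reading, classify_temperature(reading).value)
```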

Air flow is not directly monitored; however, the fans in the system can be monitored by measuring their rotational speed in revolutions per minute (RPM). If the RPMs fall below a threshold, an alarm can be activated. A fan failure is not typically a shut-down event, but a shut down can occur if the failed fan causes the temperature of a device to exceed the shut-down threshold.
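As a minimal sketch of fan monitoring, the code below converts tachometer pulses counted over a time window into RPM and compares the result with a minimum-speed threshold. The two-pulses-per-revolution default and the 1500 RPM threshold are assumptions for illustration; the fan's datasheet gives the real values.

```python
def fan_rpm(pulse_count: int, window_s: float, pulses_per_rev: int = 2) -> float:
    """Convert tachometer pulses counted during window_s seconds into RPM.

    pulses_per_rev=2 is an assumed default, not a value from the article.
    """
    if window_s <= 0 or pulses_per_rev <= 0:
        raise ValueError("window_s and pulses_per_rev must be positive")
    revolutions = pulse_count / pulses_per_rev
    return revolutions * (60.0 / window_s)


def fan_alarm(rpm: float, min_rpm: float = 1500.0) -> bool:
    """True when the fan speed has fallen below the (hypothetical) threshold."""
    return rpm < min_rpm


healthy = fan_rpm(pulse_count=100, window_s=1.0)
failing = fan_rpm(pulse_count=40, window_s=1.0)
print(healthy, fan_alarm(healthy))
print(failing, fan_alarm(failing))
```

Consistent with the text, a low reading here only raises an alarm; any shut down would come from the separate temperature check.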

The system voltages can be monitored using analog-to-digital converters. If a voltage drifts too far below or above its nominal level, an alarm or system shut down can be initiated.
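A voltage check of this kind might look like the sketch below, which converts a raw converter reading to volts and then tests it against a window around the nominal rail voltage. The 12-bit resolution, 3.3 V reference, and the 5% and 10% limits are all assumptions chosen for illustration.

```python
def adc_to_volts(counts: int, vref: float = 3.3, bits: int = 12,
                 divider: float = 1.0) -> float:
    """Convert a raw ADC reading to volts.

    divider is the scaling of any resistor divider placed in front of the
    ADC input (1.0 means the rail is measured directly).
    """
    full_scale = (1 << bits) - 1
    if not 0 <= counts <= full_scale:
        raise ValueError("counts out of range for this ADC resolution")
    return counts / full_scale * vref * divider


def check_rail(measured_v: float, nominal_v: float,
               alarm_pct: float = 5.0, shutdown_pct: float = 10.0) -> str:
    """Return 'ok', 'alarm', or 'shutdown' based on deviation from nominal."""
    deviation_pct = abs(measured_v - nominal_v) / nominal_v * 100.0
    if deviation_pct > shutdown_pct:
        return "shutdown"
    if deviation_pct > alarm_pct:
        return "alarm"
    return "ok"


print(check_rail(adc_to_volts(4095), nominal_v=3.3))
print(check_rail(3.0, nominal_v=3.3))
print(check_rail(3.7, nominal_v=3.3))
```

Using the absolute deviation means the same limits catch both under-voltage and over-voltage conditions.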

Lastly, the system software can be monitored through the use of a watchdog timer. The system software must reset the watchdog timer hardware within a specified time period or the timer will indicate that the system software has crashed. The platform management subsystem logs the failure, notifies the system administrator, and resets the system. If the system fails to come up, the system administrator is notified of the failure.
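The watchdog behavior described above can be modeled in software as a rough sketch. A real watchdog is a hardware timer; the `Watchdog` class, the `poll_watchdog` function, and the callback names below are hypothetical stand-ins used only to show the sequence of log, notify, and reset.

```python
import time


class Watchdog:
    """Software model of a hardware watchdog timer (illustrative only)."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        if timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_reset = clock()

    def reset(self) -> None:
        """Called by the system software to show it is still alive."""
        self._last_reset = self._clock()

    def expired(self) -> bool:
        return self._clock() - self._last_reset > self.timeout_s


def poll_watchdog(watchdog, log, notify_admin, reset_system) -> bool:
    """Platform-management side: act on an expired watchdog.

    Returns True if a failure was handled, False if the software is alive.
    """
    if not watchdog.expired():
        return False
    log("watchdog expired: system software presumed crashed")
    notify_admin("watchdog expired; resetting system")
    reset_system()
    watchdog.reset()  # restart the countdown for the rebooted system
    return True
```

The clock is injectable so the logic can be exercised without waiting for a real timeout. If the countdown expires again after the reset, the same path runs and the administrator is notified that the system failed to come up.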

For each of the monitored functions, the amount of monitoring hardware added is up to the system designer. The more of the system that is monitored, the more likely a failure or pending failure is to be discovered early, which increases the system operational ratio.

System platform management capability can be sized to fit the needs of the installation. In general, there are three levels of platform management sophistication. In all of these levels, the ability to monitor is the same. It is the reporting sophistication that is different:

· Local – Unit observable

· Remote – Limited observable

· Remote – Highly observable

At the local level, the system administrator must physically walk up to the system and inspect it. The platform management subsystem can report status through indicators such as LEDs, a small chassis-mounted display, an I/O port (USB or serial, for example), or a combination of these (Figure 2 below).

Assuming there are several systems in the server room, the most effective signal would be a bright LED, which when lit indicates that the system has a failure or is about to have one. The system administrator could then get the error condition by reading the small chassis display or connecting a laptop computer to the platform management subsystem’s status/error I/O port.
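The local reporting path can be sketched as a function that turns the set of active faults into an LED state and a few lines of text for the chassis display or status port. The function name and the fault descriptions are hypothetical; they only illustrate the idea of one summary indicator plus readable detail.

```python
def local_report(active_faults: dict) -> tuple:
    """Summarize active faults for unit-observable reporting.

    active_faults maps a sensor name to a short description of its fault.
    Returns (led_on, display_lines).
    """
    led_on = bool(active_faults)  # one bright LED: lit if anything is wrong
    display_lines = [f"{name}: {desc}"
                     for name, desc in sorted(active_faults.items())]
    return led_on, display_lines


led, lines = local_report({"fan2": "speed below threshold",
                           "cpu0_temp": "above alarm threshold"})
print("FAULT LED:", "on" if led else "off")
for line in lines:
    print(line)
```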

 

Figure 2 – Unit-observable Platform Management

The hardware for this level fits easily into a field-programmable gate array (FPGA). A microcontroller could serve as the brains of the platform management subsystem, but it is not required, because state-machine-driven hardware is capable of controlling the platform management function. If the platform management hardware is implemented in an FPGA, it can be easily modified and even updated in the field. AMD’s latest embedded server reference designs use an FPGA for control, and the source code is available as well.

Remote – Limited-observable – Medium to Large Installations

The platform management system can send system status and alarms to a remote location via an Ethernet connection (Figure 3 below). With this type of connection, the system administrator can observe the platform status from anywhere in the world.

The Intelligent Platform Management Interface (IPMI) is an open specification that describes the structure and format of the interfaces necessary to enable these platform management services. It does not specify a particular solution. With a platform management solution compatible with the IPMI standard, system health can be monitored; if something has failed or is about to fail, an alarm and/or system status can be observed remotely by the system administrator. If the system needs maintenance, such as a fan replacement, it can be scheduled before the failure actually occurs.

 

Figure 3 – Limited-observable Platform Management

The platform management solution can also assert system reset and power on/off. This means the system can be powered on or off and/or reset remotely, or if a severe failure occurs, the platform management solution can power off the system automatically and report the failure to the system administrator.
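From the administrator's side, remote power control of an IPMI-compatible system is commonly done with a command-line client. The sketch below builds an argument list for the open-source `ipmitool` utility; it assumes `ipmitool` is installed, that the platform management controller accepts IPMI-over-LAN sessions, and that the password is supplied through the `IPMI_PASSWORD` environment variable (the `-E` option). The host and user names are placeholders, and the article itself does not prescribe any particular client.

```python
ALLOWED_ACTIONS = ("status", "on", "off", "cycle", "reset", "soft")


def chassis_power_command(host: str, user: str, action: str) -> list:
    """Build an ipmitool argument list for a chassis power action.

    The command is only constructed here, not executed.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action!r}")
    return ["ipmitool",
            "-I", "lanplus",   # IPMI v2.0 LAN interface
            "-H", host,
            "-U", user,
            "-E",              # read password from IPMI_PASSWORD
            "chassis", "power", action]


# Placeholder host and user, for illustration only.
cmd = chassis_power_command("bmc.example.net", "admin", "reset")
print(" ".join(cmd))
```

On a management workstation, the resulting list could be passed to `subprocess.run(cmd, check=True)` to perform the action.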

This type of sophistication requires that the platform management system be controlled by a microcontroller.  

Remote – Highly observable – Generally Large Installations

In addition to the capabilities of the “remote – limited-observable” level, the “remote – highly-observable” platform management solution provides remote control of the keyboard and mouse and remote visibility of the display contents (Figure 4 below).

This level of observation and control is made possible by a feature called keyboard, video, and mouse over Internet Protocol (KVMIP). The system administrator sees exactly what the system is displaying.

The system’s display output is captured by the platform management hardware, converted into IP packets, and sent to the system administrator’s system where the IP packets are reassembled into the display output for the system administrator to view. The same is done with the keyboard and mouse input, except in the other direction.
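The capture, packetize, and reassemble flow can be sketched as follows. The header layout (frame id, chunk index, chunk count) is a hypothetical format invented for this illustration and is not the wire format of any actual KVM-over-IP product; real solutions also compress the video and handle lost packets.

```python
import struct

# Hypothetical header: frame id (uint32), chunk index (uint16), chunk count (uint16)
HEADER = struct.Struct("!IHH")


def packetize(frame_id: int, frame: bytes, payload_size: int = 1024) -> list:
    """Split one captured display frame into header-prefixed packets."""
    chunks = [frame[i:i + payload_size]
              for i in range(0, len(frame), payload_size)] or [b""]
    if len(chunks) > 0xFFFF:
        raise ValueError("frame too large for a 16-bit chunk count")
    return [HEADER.pack(frame_id, index, len(chunks)) + chunk
            for index, chunk in enumerate(chunks)]


def reassemble(packets: list) -> bytes:
    """Rebuild a frame from its packets, which may arrive out of order."""
    parts = {}
    expected = None
    for packet in packets:
        _frame_id, index, count = HEADER.unpack_from(packet)
        expected = count
        parts[index] = packet[HEADER.size:]
    if expected is None or len(parts) != expected:
        raise ValueError("frame is incomplete")
    return b"".join(parts[i] for i in range(expected))


frame = bytes(range(256)) * 10           # stand-in for captured display data
packets = packetize(frame_id=1, frame=frame)
restored = reassemble(list(reversed(packets)))  # simulate out-of-order arrival
print(len(packets), restored == frame)
```

Keyboard and mouse events would travel the other way using the same idea, with much smaller payloads.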

 

Figure 4 – Highly-observable Platform Management

To facilitate the development of this type of platform management, AMD, in partnership with other companies, has created an open, royalty-free connector and pinout standard that allows third-party developers to create sophisticated platform management solutions as standard products.

The standard is called Open Platform Management Architecture (OPMA). OPMA leverages the IPMI specification to provide for the basic platform management solution and adds the KVMIP capability to achieve the premium platform management solution that the large enterprise server installations demand.

Conclusion

Platform management is vital for enterprise-class systems. There are three sophistication levels to platform management design. Each level can have the same degree of monitoring hardware. It is the difference in reporting capability that distinguishes each level.

Platform management’s ability to improve system operational ratio is critical to the reliable operation of large server farms and individual installations. It also lowers TCO by automating failure reporting. Fewer spare systems are needed because the systems in use are up more of the time.

James Edwards is Senior Technical Marketing Manager for the AMD Embedded Solutions group in Ft. Collins, Colorado. James previously worked at Compaq for 11 years, where he was responsible for the main system board design for several portable computers, and then at Cyrix and National Semiconductor, where he worked on the Geode (x86-compatible) processor.
