Embedding an IPMI platform management subsystem to monitor server system health

James Edwards

September 14, 2010

Used primarily in enterprise systems, platform management monitors and reports on the health of the system hardware through isolated hardware and software that do not depend on the operational state of the system’s own hardware or software.

The platform management hardware typically resides on the same board as the system hardware; however, because it is isolated it can remain functional even if the system hardware is non-operational. The platform management hardware is usually powered by a separate power supply.

Servers make up the vast majority of these enterprise systems and they are the backbone of the Internet. There are thousands upon thousands of these servers in server farms all over the world. When a server fails or is about to fail, it is important for the technical caretakers to find, fix, or replace the system quickly.

Platform management has become an essential part of enterprise-class systems. Why? How can adding cost and complexity to a system save money? There are two answers. One is the system operational ratio: up-time divided by total time. The other is total cost of ownership (TCO).
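As a rough illustration of the first answer, the operational ratio can be computed directly from its definition. The function name and the sample downtime figure below are hypothetical, chosen only to show the arithmetic.

```python
def operational_ratio(uptime_hours: float, total_hours: float) -> float:
    """Return the system operational ratio: up-time divided by total time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= uptime_hours <= total_hours:
        raise ValueError("uptime_hours must lie between 0 and total_hours")
    return uptime_hours / total_hours


# Hypothetical example: a server that was down for 8 hours in a 365-day year.
HOURS_PER_YEAR = 365 * 24
ratio = operational_ratio(HOURS_PER_YEAR - 8, HOURS_PER_YEAR)
print(f"Operational ratio: {ratio:.4%}")
```

Every hour of downtime avoided raises this ratio, which is where faster fault location pays back the added cost.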

Platform management hardware and software enable failing systems to be located, taken offline if necessary, and serviced quickly and efficiently. This ability reduces TCO while, at the same time, improving the overall reliability of the server farm.

 

Figure 1 – Board Management Architecture

System Monitoring

The platform management subsystem monitors the health of the system: temperature, air flow, voltage levels, and software (Figure 1 above).

The platform management subsystem monitors the temperature of specific devices using on-die temperature circuits or temperature-measuring devices placed near the hotter devices. An alarm can be activated if the temperature exceeds a threshold, and the system can be shut down if the temperature exceeds a higher threshold.
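The two-threshold scheme described above might be sketched as follows. The names and the default threshold values (85 and 100 degrees Celsius) are assumptions for illustration only; real limits depend on the device being monitored.

```python
from enum import Enum


class ThermalAction(Enum):
    NONE = "none"
    ALARM = "alarm"
    SHUTDOWN = "shutdown"


def check_temperature(temp_c: float,
                      alarm_c: float = 85.0,       # assumed alarm threshold
                      shutdown_c: float = 100.0    # assumed shut-down threshold
                      ) -> ThermalAction:
    """Classify a temperature reading against the alarm and shut-down thresholds."""
    if shutdown_c <= alarm_c:
        raise ValueError("shutdown_c must be higher than alarm_c")
    if temp_c > shutdown_c:
        return ThermalAction.SHUTDOWN
    if temp_c > alarm_c:
        return ThermalAction.ALARM
    return ThermalAction.NONE
```

Checking the higher threshold first ensures that a reading above both limits triggers the shut down rather than only the alarm.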

Air flow is not directly monitored; however, the fans in the system can be monitored by measuring their rotational speed in revolutions per minute (RPM). If the RPM falls below a threshold, an alarm can be activated. A fan failure is not typically a shut-down event, but a shut down can occur if the failed fan causes the temperature of a device to exceed the shut-down threshold.
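A minimal sketch of the fan check, assuming the fan exposes a tachometer output that is counted over a fixed interval. The two-pulses-per-revolution default and the 1200 RPM minimum are assumptions; both vary by fan model.

```python
def fan_rpm(pulse_count: int, interval_s: float, pulses_per_rev: int = 2) -> float:
    """Convert tachometer pulses counted over interval_s seconds into RPM."""
    if interval_s <= 0 or pulses_per_rev <= 0:
        raise ValueError("interval_s and pulses_per_rev must be positive")
    revolutions = pulse_count / pulses_per_rev
    return revolutions * (60.0 / interval_s)


def fan_alarm(rpm: float, min_rpm: float = 1200.0) -> bool:
    """Return True when the fan speed has fallen below the alarm threshold."""
    return rpm < min_rpm
```

Note that this raises only an alarm; any shut down is left to the temperature check, matching the behavior described above.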

The system voltages can be monitored using analog-to-digital converters. If a voltage drifts too far below or above its nominal level, an alarm or system shut down can be initiated.
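Voltage monitoring can be sketched as a conversion from the raw converter reading followed by a window comparison. The 10-bit resolution, 3.3 V reference, divider ratio, and 5 percent tolerance are all assumed values for illustration.

```python
def adc_to_volts(raw: int, vref: float = 3.3, bits: int = 10,
                 divider: float = 1.0) -> float:
    """Scale a raw ADC code to volts; divider undoes any input resistor divider."""
    full_scale = (1 << bits) - 1
    if not 0 <= raw <= full_scale:
        raise ValueError("raw reading outside ADC range")
    return raw / full_scale * vref * divider


def check_voltage(volts: float, nominal: float, tolerance: float = 0.05) -> str:
    """Return 'under', 'over', or 'ok' for a rail relative to its nominal value."""
    low = nominal * (1.0 - tolerance)
    high = nominal * (1.0 + tolerance)
    if volts < low:
        return "under"
    if volts > high:
        return "over"
    return "ok"
```

Whether an out-of-window result leads to an alarm or a shut down is a policy decision left to the system designer.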

Lastly, the system software can be monitored through the use of a watchdog timer. The system software must reset the watchdog timer hardware within a specified time period or the timer will indicate that the system software has crashed. The platform management subsystem logs the failure, notifies the system administrator, and resets the system. If the system fails to come up, the system administrator is notified of the failure.
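The watchdog behavior can be modeled in a few lines. This is a software simulation with caller-supplied timestamps, not a driver for real watchdog hardware; the class name and the timeout value in the usage lines are hypothetical.

```python
class WatchdogTimer:
    """Software model of a watchdog: expires if not reset within timeout_s."""

    def __init__(self, timeout_s: float, now: float = 0.0) -> None:
        if timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        self.timeout_s = timeout_s
        self.last_reset = now

    def reset(self, now: float) -> None:
        """Called by the system software to show that it is still running."""
        self.last_reset = now

    def expired(self, now: float) -> bool:
        """Polled by the management subsystem; True means the software missed its deadline."""
        return (now - self.last_reset) > self.timeout_s


# Usage: system software resets at t=3 s, then stops responding.
wd = WatchdogTimer(timeout_s=5.0)
wd.reset(now=3.0)
if wd.expired(now=8.5):
    print("Watchdog expired: log failure, notify administrator, reset system")
```

Passing the time in explicitly keeps the model deterministic; a real implementation would rely on a hardware counter instead.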

For each of the monitored functions, the amount of monitoring hardware added is up to the system designer. The more extensive the monitoring, the more likely a failure or pending failure is to be discovered early, which increases the system operational ratio.

System platform management capability can be sized to fit the needs of the installation. In general, there are three levels of platform management sophistication. In all of these levels, the ability to monitor is the same. It is the reporting sophistication that is different:

· Local – Unit observable

· Remote – Limited observable

· Remote – Highly observable
