Checking your system health

You can use your system's baseboard management controller (BMC) to monitor the system's vital signs. The BMC also logs potential health issues in the System Event Log (SEL). You can view the vital signs by using the ipmitool Sensor Data Record (SDR) command.

To see all available SDRs in your system, run the following command:

# ipmitool sdr list

This example focuses on CPU temperature readings. In the IPMI 2.0 environment that runs RHEL version 5.2, the temperature readings are listed as CPU * Temp records. To see all temperature readings, including those for the CPUs, run the following Sensor Data Record (SDR) command:

# ipmitool sdr list | grep Temp

Ambient Temp | 24 degrees C  | ok
CPU 1 Temp   | 42 degrees C  | ok
CPU 2 Temp   | disabled    | ns
CPU 3 Temp   | disabled    | ns
CPU 4 Temp   | disabled    | ns

The first column is the sensorid name, which is also used to reference the sensor in other commands. The second column is the sensor reading, and the last column is the sensor status, which reflects the reading relative to the sensor's threshold values. In this example, the ok status for Ambient Temp and CPU 1 Temp indicates a healthy system. The disabled value with a status of ns for the other CPUs indicates that these CPU sockets are empty.
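The column layout described above lends itself to simple filtering. The following is a minimal sketch, not part of ipmitool, that assumes the three-column, pipe-separated layout shown in the sample output. The function name report_unhealthy is hypothetical, and the last sample line (a 95 degrees C reading with a cr status) is invented for illustration so that the filter has something to report.

```shell
#!/bin/sh
# report_unhealthy (hypothetical helper): reads "name | reading | status"
# lines on stdin and prints every sensor whose status is neither ok nor ns.
report_unhealthy() {
    awk -F'|' '
        NF >= 3 {
            # Trim surrounding blanks from each column.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($3 != "ok" && $3 != "ns") print $1 ": " $2 " (" $3 ")"
        }'
}

# The first three lines are copied from the sample output above.
# The last line is an invented example of a sensor that is not ok.
sample='Ambient Temp | 24 degrees C  | ok
CPU 1 Temp   | 42 degrees C  | ok
CPU 2 Temp   | disabled    | ns
CPU 3 Temp   | 95 degrees C  | cr'

printf '%s\n' "$sample" | report_unhealthy
```

On a live system, the same function could be fed from the real command, for example by piping the output of ipmitool sdr list into report_unhealthy.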

You can find more information about the possible CPU temperature Sensor States by examining the results of CPU 1. You can run a command by using the sensorid name (CPU 1 Temp in this example). When the sensorid name contains blanks, it must be surrounded by double quotation marks. The following command lists all the possible states of the CPU 1 temperature sensor:


# ipmitool event "CPU 1 Temp" list 
Finding sensor CPU 1 Temp... ok 
Sensor States: 
  lnr : Lower Non-Recoverable 
  lcr : Lower Critical 
  lnc : Lower Non-Critical 
  unc : Upper Non-Critical 
  ucr : Upper Critical 
  unr : Upper Non-Recoverable 
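
The quoting rule for sensorid names that contain blanks follows from how the shell splits words. The following sketch illustrates it without calling ipmitool. The count_args function is a hypothetical helper that only reports how many arguments it received.

```shell
#!/bin/sh
# count_args (hypothetical helper): prints how many arguments it received.
count_args() {
    echo "$#"
}

sensor="CPU 1 Temp"

# Quoted: the name is passed as a single argument.
count_args "$sensor"

# Unquoted: the shell splits the name on blanks into separate arguments.
count_args $sensor
```

With the quotation marks, the name arrives as one argument. Without them, a POSIX shell splits it into three, so the command would not receive the sensorid name as a single value.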

If a CPU temperature drops too low, a new record is created in the System Event Log (SEL). You can simulate this condition by passing a sensorid name and a sensor state name to the event command. The following example uses CPU 1 Temp and lnc : Lower Non-Critical to simulate the CPU 1 temperature falling below its lower non-critical threshold:

# ipmitool event "CPU 1 Temp" "lnc : Lower Non-Critical" 
Finding sensor CPU 1 Temp... ok 
   0 | Pre-Init Time-stamp | Temperature CPU 1 Temp | Lower Non-critical going low | Reading -128 < Threshold -128 degrees C

This command creates a record in the System Event Log (SEL) that reports a simulated reading of -128°C against a threshold of -128°C, even though the actual CPU 1 reading from the sdr list command was 42°C. You can confirm that the event is logged by viewing the last entry in the SEL:

# ipmitool sel list | tail -1 
1c0 | 11/19/2008 | 21:38:22 | Temperature #0x98 | Lower Non-critical going low

The first column is a unique record number in hexadecimal format. The next two columns are the date and time stamp. The fourth column shows the sensor that the event relates to. The final column describes the event.
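
The five columns described above can be separated with standard tools. The following is a minimal sketch that assumes the pipe-separated layout of the sample entry. The function name parse_sel_line and the field labels are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# parse_sel_line (hypothetical helper): splits one SEL list line into
# labeled fields, one per output line.
parse_sel_line() {
    printf '%s\n' "$1" | awk -F'|' '
        NF >= 5 {
            # Trim surrounding blanks from each column.
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            print "id=" $1
            print "date=" $2
            print "time=" $3
            print "sensor=" $4
            print "event=" $5
        }'
}

# Sample entry copied from the output above.
line='1c0 | 11/19/2008 | 21:38:22 | Temperature #0x98 | Lower Non-critical going low'
parse_sel_line "$line"

# The record number is hexadecimal; printf converts it to decimal.
printf '%d\n' 0x1c0
```

For the sample entry, the record number 1c0 corresponds to decimal 448.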

You can also check whether your system has logged any health events by viewing the entire System Event Log:

# ipmitool sel list

The ipmievd event daemon, which is packaged with IPMItool, works with the SEL. It listens for events that the BMC sends to the SEL and logs the corresponding messages to a system log file. The daemon can run in either of two modes: it can use the Event Message Buffer and asynchronous event notification from the OpenIPMI kernel driver, or it can actively poll the contents of the SEL for new events. For more information about the ipmievd daemon, see the ipmievd man page.