
A day with RAS for PowerLinux Sysadmin

Technical Blog Post



By: Aravinda Prasad.

Problem determination is a key part of a system administrator's (sysadmin's) job. Sysadmins often spend hours debugging a system to find out what is wrong with it. Engineers at the IBM Linux Technology Center around the world are working on ways to simplify the task of managing IBM systems.

One outcome of this effort is a set of upcoming facilities in IBM PowerLinux, such as Light Path Diagnostics and improved diagnostic tools, which integrate well with the existing Reliability, Availability, and Serviceability (RAS) capabilities. IBM believes that these facilities will help sysadmins perform administration tasks easily and quickly.

In this article, we walk through the PowerLinux RAS advantage in determining and resolving a problem, from the administrator's point of view.

 
An example.

A PowerLinux sysadmin receives a notification of a serviceable event. Knowing that the service log infrastructure of RAS on PowerLinux sends such notifications, the sysadmin logs in to the system and uses the servicelog tool to get more details about the event. The detailed servicelog output reports that one of the Ethernet cards has gone bad, and it gives additional information such as the location code and device serial number.

[root@ras.ibm.com ~]# servicelog --dump

Servicelog ID:      27
Log Timestamp:      Fri Nov 30 10:44:02 2012
Event Timestamp:    Fri Nov 30 10:44:02 2012
Update Timestamp:   Fri Nov 30 10:44:02 2012
Type:               Operating System Event
Severity:           6 (ERROR)
Platform:           ppc64
Model/Serial:       8406-71Y/108B7AA
Node Name:          ras.ibm.com
Reference Code:     BF778E00
Serviceable Event:  Yes
Predictive Event:   No
Disposition:        1 (Unrecoverable)
Call Home Status:   1 (Call Home Candidate)
Status:             Open
Kernel Version:     #1 SMP Wed Jun 13 18:19:27 EDT 2012
Subsystem:          net
Driver:             e1000e
Device:             0001:00:01.0

Description:
Message forwarded from syslog:
Fri Nov 30 10:44:02  ras kernel: e1000e 0001:00:01.0: Invalid MAC Address

Description: The MAC address read from the adapter's EEPROM is not a valid Ethernet
 address.

Action: 1. Execute diagnostics on the adapter, using "ethtool -t".
2. Check the EEPROM level on the failing adapter.
3. Replace the adapter.

<< Callout 1 >>
Priority            L
Type                32
Procedure Id:       see explain_syslog
Location:           U78A5.001.WIH8464-P1
FRU:                
Serial:             
CCIN:
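
A record like the one above is mostly "Key: value" lines, so it is easy to pick out the fields that matter for the next steps. The following is a minimal Python sketch, assuming the dump text has already been captured (for example from the output of servicelog --dump); the function name parse_servicelog_record and the shortened SAMPLE_RECORD text are illustrative and not part of the servicelog tooling.

```python
import re

# Shortened copy of the record shown above (illustrative input only).
SAMPLE_RECORD = """\
Servicelog ID:      27
Severity:           6 (ERROR)
Model/Serial:       8406-71Y/108B7AA
Serviceable Event:  Yes
Status:             Open
Subsystem:          net
Driver:             e1000e
Device:             0001:00:01.0

<< Callout 1 >>
Priority            L
Type                32
Procedure Id:       see explain_syslog
Location:           U78A5.001.WIH8464-P1
"""

# A key made of letters, spaces and slashes, then a colon and a value.
FIELD = re.compile(r"^([A-Za-z][A-Za-z /]*?):\s+(\S.*)$")


def parse_servicelog_record(text):
    """Return (header_fields, callouts) for one dumped record.

    Lines without a "Key: value" shape (such as "Priority  L") are skipped.
    """
    header, callouts = {}, []
    current = header
    for line in text.splitlines():
        if line.startswith("<< Callout"):
            # Fields after a callout marker belong to that callout.
            current = {}
            callouts.append(current)
            continue
        match = FIELD.match(line)
        if match:
            current[match.group(1)] = match.group(2).strip()
    return header, callouts


header, callouts = parse_servicelog_record(SAMPLE_RECORD)
is_open_serviceable = (
    header.get("Serviceable Event") == "Yes" and header.get("Status") == "Open"
)
print(header["Driver"], header["Device"], callouts[0]["Location"], is_open_serviceable)
```

The location code taken from the first callout is the value that the sysadmin later passes to lscfg.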


The clever sysadmin, knowing the RAS capabilities behind this setup, recalls the chain of events: the OS running on the PowerLinux server detected the bad Ethernet card and logged an error in /var/log/messages; the syslog_to_svclog tool converted that message into a service log event by logging it in the service log database; and the service log database, on receiving a serviceable event, sent the notification. The sysadmin also recalls that serviceable events are not restricted to Ethernet devices. They are also supported for SCSI enclosures, whose events are logged to the service log database by the diag_encl tool, and for RTAS-related events, which are logged by rtas_errd.
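
To make the syslog-to-service-log step concrete, here is a small Python sketch of the general idea: recognize a known kernel message and turn it into a structured event. It is not how syslog_to_svclog is implemented. The regular expression, the one-entry CATALOG, and the to_event function are all hypothetical and exist only for illustration.

```python
import re

# The syslog line quoted in the servicelog record above.
SYSLOG_LINE = (
    "Fri Nov 30 10:44:02  ras kernel: e1000e 0001:00:01.0: Invalid MAC Address"
)

# Hypothetical pattern: "kernel: <driver> <PCI address>: <message>".
KERNEL_MSG = re.compile(
    r"kernel:\s+(?P<driver>\S+)\s+"
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"(?P<message>.+)$"
)

# Hypothetical one-entry catalog of messages that deserve a serviceable event.
CATALOG = {
    "Invalid MAC Address": {"severity": "ERROR", "serviceable": True},
}


def to_event(line):
    """Return an event dict for a recognized message, or None otherwise."""
    match = KERNEL_MSG.search(line)
    if match is None:
        return None
    entry = CATALOG.get(match.group("message"))
    if entry is None:
        return None  # Not a message the catalog knows about.
    return {
        "driver": match.group("driver"),
        "device": match.group("device"),
        "message": match.group("message"),
        **entry,
    }


event = to_event(SYSLOG_LINE)
print(event)
```

Messages that do not match the catalog produce no event, which mirrors the point that only selected syslog entries become serviceable events.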

The sysadmin collects the information needed to order a new Ethernet card, such as the model and serial number of the bad card, with the help of the new -l flag of the lscfg command. The flag takes the location code that was logged in the service log as input and prints the VPD (Vital Product Data) information for that location. The sysadmin then orders the replacement card.


[root@ras.ibm.com ~]# lscfg -vl U78A5.001.WIH8464-P1
  0001:00:01.0 eth1 ethernet U78A5.001.WIH8464-P1
                                         Port 2 - IBM 2 PORT 10/100/1000
                                         Base-TX PCI-X Adapter (14107910)
        Manufacturer Name.........IBM
        Machine Type-Model........82546GB Gigabit Ethernet Controller
        Network Address...........001a64a81bb7
        Device Specific.(YC)......3
        Location Code.(YL)........U78A5.001.WIH8464-P1
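
The VPD lines above follow a "Label.....Value" layout, so the details needed for a replacement order can be pulled out with a few lines of code. The following Python sketch assumes the lscfg output has already been captured as text; parse_vpd and SAMPLE_VPD are illustrative names, and the pattern is based only on the sample shown here.

```python
import re

# Copy of the lscfg -vl output shown above (illustrative input only).
SAMPLE_VPD = """\
  0001:00:01.0 eth1 ethernet U78A5.001.WIH8464-P1
                                         Port 2 - IBM 2 PORT 10/100/1000
                                         Base-TX PCI-X Adapter (14107910)
        Manufacturer Name.........IBM
        Machine Type-Model........82546GB Gigabit Ethernet Controller
        Network Address...........001a64a81bb7
        Device Specific.(YC)......3
        Location Code.(YL)........U78A5.001.WIH8464-P1
"""

# Indented label, a run of at least two dots, then the value.
VPD_LINE = re.compile(r"^\s+(.+?)\.{2,}(\S.*)$")


def parse_vpd(text):
    """Collect "Label.....Value" lines into a dict keyed by label."""
    fields = {}
    for line in text.splitlines():
        match = VPD_LINE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


vpd = parse_vpd(SAMPLE_VPD)
for label, value in vpd.items():
    print(f"{label}: {value}")
```

The device summary and description lines contain no run of dots, so they are skipped and only the labeled VPD fields are kept.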


The sysadmin is also very happy to know that, with the new light path diagnostics facility (coming in future RHEL and SLES service pack updates), the service log notifier would have notified the light path diagnostics subsystem, lp_diag, about the bad Ethernet card. The light path infrastructure would then have enabled the fault indicator for the slot of the bad Ethernet card, making it easy to identify the card's physical location.

The sysadmin identifies the slot with the help of the fault indicator LEDs and replaces the bad part. The hot-plug facility automatically identifies and initializes the newly plugged card. The sysadmin closes the serviceable event using the log_repair_action tool; on closure of the event, the service log facility notifies the light path infrastructure to turn off the fault indicators. Finally, the sysadmin updates the VPD using the vpdupdate tool to reflect the hardware change.

The sysadmin appreciates the RAS capabilities of PowerLinux and its seamless integration with the OS, which helped in quickly identifying and resolving the problem. The sysadmin checks for new notifications, knowing that PowerLinux RAS is not restricted to identifying faulty devices: it is capable of a lot more and provides many service and productivity tools.

 
For more information about service and productivity (aka RAS) tools for your PowerLinux system, see the related article in the Linux Information Center.
 
