A day with RAS for PowerLinux Sysadmin
By: Aravinda Prasad.
Problem determination is a key part of a systems administrator's (sysadmin's) job. Sysadmins often spend hours debugging a system to find out what is wrong with it. Engineers at the IBM Linux Technology Center around the world are working on ways to simplify the sysadmin's experience of managing IBM systems.
One outcome of this effort is a set of upcoming facilities in IBM PowerLinux, such as Light Path Diagnostics and improved diagnostic tools, which integrate well with the existing Reliability, Availability, and Serviceability (RAS) capabilities. IBM believes that these facilities will help sysadmins perform administration tasks easily and quickly.
In this article, we look at the PowerLinux RAS advantage from the administrator's point of view by following a sysadmin who determines and resolves a problem.
A PowerLinux sysadmin receives a notification of a serviceable event. Knowing that the service log infrastructure of RAS on PowerLinux is capable of sending such notifications, the sysadmin logs in to the system and uses the servicelog tool to get more details about the event. The detailed servicelog entry reports that one of the Ethernet cards has gone bad and gives additional information such as the location code and the device serial number.
    # servicelog --dump
    Servicelog ID: 27
    Log Timestamp: Fri Nov 30 10:44:02 2012
    Event Timestamp: Fri Nov 30 10:44:02 2012
    Update Timestamp: Fri Nov 30 10:44:02 2012
    Type: Operating System Event
    Severity: 6 (ERROR)
    Platform: ppc64
    Model/Serial: 8406-71Y/108B7AA
    Node Name: ras.ibm.com
    Reference Code: BF778E00
    Serviceable Event: Yes
    Predictive Event: No
    Disposition: 1 (Unrecoverable)
    Call Home Status: 1 (Call Home Candidate)
    Status: Open
    Kernel Version: #1 SMP Wed Jun 13 18:19:27 EDT 2012
    Subsystem: net
    Driver: e1000e
    Device: 0001:00:01.0
    Description: Message forwarded from syslog: Fri Nov 30 10:44:02 ras kernel: e1000e 0001:00:01.0: Invalid MAC Address
    Description: The MAC address read from the adapter's EEPROM is not a valid Ethernet address.
    Action:
      1. Execute diagnostics on the adapter, using "ethtool -t".
      2. Check the EEPROM level on the failing adapter.
      3. Replace the adapter.
    << Callout 1 >>
    Priority L
    Type 32
    Procedure Id: see explain_syslog
    Location: U78A
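A sysadmin who wants to act on such entries from a script could pull out the individual fields. The following is only a minimal sketch: it assumes the dump prints one "Key: value" field per line, as in a subset of the entry above, and it does not handle multi-line fields such as Action. The names SAMPLE_DUMP and parse_event are made up for this illustration and are not part of servicelog.

```python
# A subset of the fields from the servicelog entry in this article,
# assumed here to be printed one "Key: value" pair per line.
SAMPLE_DUMP = """\
Servicelog ID: 27
Log Timestamp: Fri Nov 30 10:44:02 2012
Type: Operating System Event
Severity: 6 (ERROR)
Reference Code: BF778E00
Serviceable Event: Yes
Status: Open
Subsystem: net
Driver: e1000e
Device: 0001:00:01.0
"""


def parse_event(text):
    """Split 'Key: value' lines into a dict (hypothetical helper).

    Only the first colon separates key from value, so timestamps and
    PCI addresses that contain colons stay intact in the value.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


event = parse_event(SAMPLE_DUMP)

# Report only events that are serviceable and still open.
if event.get("Serviceable Event") == "Yes" and event.get("Status") == "Open":
    print(f"Open serviceable event {event['Servicelog ID']}: "
          f"{event['Driver']} at {event['Device']}")
```

With the sample above, the script reports event 27 for the e1000e driver at device 0001:00:01.0.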
The clever sysadmin, knowing the RAS capabilities behind this setup, recalls how the notification came about. The OS running on the PowerLinux server detected the bad Ethernet card and logged an error in /var/log/messages. The syslog_to_svclog tool converted that error into a service log event and logged it in the service log database, which sent the notification on receiving the serviceable event. The sysadmin also recalls that serviceable events are not restricted to Ethernet devices: the diag_encl tool logs events for SCSI enclosures, and rtas_errd logs RTAS-related events, to the same service log database.
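To make the conversion step more concrete, here is a rough sketch of the kind of matching a syslog-to-service-log converter has to do. It is not how syslog_to_svclog is implemented. The regular expression, the KNOWN_MESSAGES table, and the to_service_event function are assumptions made for this illustration, and only the syslog line and the description text come from the entry shown earlier.

```python
import re

# Assumed layout: "<timestamp> <host> kernel: <driver> <pci-address>: <message>"
SYSLOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3} \w{3} +\d+ [\d:]{8}) "
    r"(?P<host>\S+) kernel: "
    r"(?P<driver>\S+) (?P<device>[0-9a-fA-F:.]+): "
    r"(?P<message>.+)$"
)

# Hypothetical lookup table from (driver, message) to a longer description.
KNOWN_MESSAGES = {
    ("e1000e", "Invalid MAC Address"):
        "The MAC address read from the adapter's EEPROM is not a valid "
        "Ethernet address.",
}


def to_service_event(line):
    """Return a dict for a recognized kernel message, or None otherwise."""
    match = SYSLOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    description = KNOWN_MESSAGES.get((fields["driver"], fields["message"]))
    if description is None:
        return None  # not a message this sketch knows how to report
    fields["description"] = description
    return fields


line = ("Fri Nov 30 10:44:02 ras kernel: "
        "e1000e 0001:00:01.0: Invalid MAC Address")
service_event = to_service_event(line)
print(service_event["driver"], service_event["device"])
```

Messages that do not match a known driver and text are ignored, which mirrors the idea that only selected syslog messages become serviceable events.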
Before ordering a new Ethernet card, the sysadmin collects the required information, such as the model and serial number of the bad card, with the help of the new -l flag of the lscfg command. The flag takes a location code as input, here the one logged in the service log, and prints the Vital Product Data (VPD) for that device.
    # lscfg -vl U78A
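The same lookup can be scripted, for example to attach the VPD to a replacement order. This is a small sketch under stated assumptions: lscfg_command and read_vpd are hypothetical helper names, the location code is passed exactly as it appears in the service log, and the function simply returns None on machines where lscfg is not installed.

```python
import shutil
import subprocess


def lscfg_command(location_code):
    """Build the argument list for the lscfg call shown in this article."""
    return ["lscfg", "-vl", location_code]


def read_vpd(location_code):
    """Return lscfg's raw output for a location code, or None if lscfg is absent."""
    if shutil.which("lscfg") is None:
        return None
    result = subprocess.run(
        lscfg_command(location_code),
        capture_output=True,
        text=True,
        check=False,  # leave error handling to the caller
    )
    return result.stdout


# "U78A" is the location code as it appears in the article's example.
print(lscfg_command("U78A"))
```

Passing the arguments as a list avoids going through a shell, so the location code is never interpreted as shell syntax.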
The sysadmin is also very happy to know about the new light path diagnostics facility (coming in future RHEL and SLES service pack updates). With it, the service log notifier would have notified the light path diagnostics subsystem, lp_diag, about the bad Ethernet card, and the light path infrastructure would have enabled the fault indicator for the card's slot, making it easy to identify the physical location of the card.
The sysadmin identifies the slot with the help of the fault indicator LEDs and replaces the bad part. The hot-plug facility automatically identifies and initializes the newly plugged card. The sysadmin closes the serviceable event using the log_repair_action tool, and upon closure of the event the service log facility notifies the light path infrastructure to turn off the fault indicators. The sysadmin then updates the VPD using the vpdupdate tool to reflect the change in the hardware.
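The interplay between the service log and the fault indicators can be pictured as a simple notification pattern. The toy model below is purely illustrative and assumes nothing about how servicelog or lp_diag work internally. The ServiceLog and LightPath classes and their methods are invented for this sketch, in which opening an event turns a slot's fault LED on and closing the last open event for that slot turns it off.

```python
class LightPath:
    """Toy stand-in for the light path subsystem: tracks slots with the fault LED on."""

    def __init__(self):
        self.fault_leds = set()

    def event_opened(self, location):
        self.fault_leds.add(location)

    def event_closed(self, location):
        self.fault_leds.discard(location)


class ServiceLog:
    """Toy stand-in for the service log database with notification hooks."""

    def __init__(self):
        self.events = {}     # event id -> {"location": ..., "status": ...}
        self.listeners = []  # objects notified when events open or close

    def subscribe(self, listener):
        self.listeners.append(listener)

    def log_event(self, location):
        event_id = len(self.events) + 1
        self.events[event_id] = {"location": location, "status": "Open"}
        for listener in self.listeners:
            listener.event_opened(location)
        return event_id

    def close_event(self, event_id):
        event = self.events[event_id]
        event["status"] = "Closed"
        # Keep the LED on while another open event points at the same slot.
        still_open = any(
            e["status"] == "Open" and e["location"] == event["location"]
            for e in self.events.values()
        )
        if not still_open:
            for listener in self.listeners:
                listener.event_closed(event["location"])


service_log = ServiceLog()
light_path = LightPath()
service_log.subscribe(light_path)

event_id = service_log.log_event("U78A")  # bad card detected
print(sorted(light_path.fault_leds))      # fault LED is on for the slot
service_log.close_event(event_id)         # repair action logged
print(sorted(light_path.fault_leds))      # fault LED is off again
```

Keeping the indicator lit while any event for the slot remains open is a design choice of this sketch, made so that one repair does not hide another outstanding fault.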
The sysadmin appreciates the RAS capabilities of PowerLinux and their seamless integration with the OS, which helped in quickly identifying and resolving the problem. The sysadmin then checks for new notifications, knowing that PowerLinux RAS is not restricted to identifying faulty devices but is capable of a lot more and provides many service and productivity tools.
For more information about service and productivity (aka RAS) tools for your PowerLinux system, see the related article in the Linux Information Center.