RAS
Added by baublys, last edited by wburos on Aug 04, 2008

Warning

This page still needs heavy tweaking!

Problem determination is definitely a key area when it comes to system administration. System administrators tend to spend hours debugging and trying to find out what is wrong with the system. In this chapter, we discuss how to do centralized system logging using the syslog and the evlog. We also discuss Linux rescue methods, and provide tips that allow you to fix boot partition errors, corrupt file systems, and configuration problems.

We also discuss the ppc64 Run Time Abstraction Service (RTAS), which allows you to query and extract information from the pSeries firmware. We then introduce the tools developed by the IBM Linux Technology Center, which mirror the commands used in AIX to extract hardware information and update system firmware.

Finally, we discuss common tuning methods in Linux, and explain which tools you can use to tune Linux to allow it to do more for you.

Linux on pSeries RAS

The IBM pSeries hardware has been known for its RAS capabilities due to IBM's knowledge and experience in developing mainframes and mission-critical servers. Much of the RAS design has been developed to analyze failures within the Central Electronic Complex (CEC) to either eliminate the errors or to contain and reduce them to avoid bringing the entire server down. Some of the RAS features that you see available for Linux on pSeries are:

  • Persistent deallocation for memory and processor during boot-time
  • Automatic First Failure Data Capture and diagnostics capability
  • ECC and chipkill correction in the real memory
  • Fault tolerance with N+1 redundancy of power and cooling
  • Dual line cords
  • Predictive failure analysis and diagnostics
  • Robust journaled file system using JFS, reiserfs and others

Many of the pSeries server RAS features have been designed to be operating system-independent. This means that you do not need the AIX operating system to exploit most of the RAS capability inside the hardware.

During the boot process, the Built-In Self Test (BIST) and Power-on Self Test (POST) check the processors, caches, and memory before the operating system is loaded. If a critical error is detected, the system tries to deallocate the component and continue the boot process. In this way, your system is not at risk of running with a faulty component. Detected errors are logged into the system non-volatile RAM (NVRAM). Refer to 4.1.2, "IBM diagnostics tools" on page 168 for more information on the nvram.

Surveillance of the system operation is provided by the service processor. The service processor basically records and automatically checks for heartbeats from the operating systems. It can be configured to automatically reboot the system if the service processor does not detect any heartbeat within a default time interval. If the system is unable to come up successfully, the service processor logs the error and leaves the system powered on. The service processor is also designed to report errors to the Service Focal Point. In environments where the system is attached to a Hardware Management Console (HMC), the errors are logged and reported to the Service Focal Point application running in the HMC.

In addition, the IBM diagnostic tools for Linux on pSeries record and analyze pSeries-specific messages, and log them into the Linux system log facility. Generic software and hardware errors are also recorded and analyzed by the Linux error log analysis (LELA).

Refer to 4.1.2, "IBM diagnostics tools" on page 168 for the description of the tools packages inside the IBM diagnostics tools.

4.1.1 RunTime abstraction service in PPC64

Specific to the PowerPC kernel, /proc/ppc64/rtas/* gives you an interface to interact with the service processor directly. The RunTime Abstraction Service (RTAS) is enabled by default in SuSE Linux Enterprise Server (SLES) 8; alternatively, you can recompile the kernel by hand with the CONFIG_PPC_RTAS option. In Figure 4-1, you can see how the RTAS in Linux interacts with the pSeries firmware.

The open source community is continuously improving RTAS in the PowerPC kernel; here are some of the RTAS services in the /proc file system that we can use today:

  • /proc/ppc64/rtas/progress (read/write)
    The "progress" file allows you to write to the LCD operator panel. For an HMC-attached system, the text is shown in the Operator Panel Value.

This is very useful for displaying uptime or system performance of that LPAR or server:

echo "this is a testing string" > /proc/ppc64/rtas/progress

One way to make use of the progress indicator is to display information about the health of the system. The script shown in Example 4-1 extracts the system load from the output of the uptime command and displays it on the operator panel.

Example 4-1 Shell script to echo system load to the pSeries operator panel

#!/bin/sh
## Script to echo system load to Operator Panel
##
while true
do
    # Keep only the load averages from the uptime output
    UPTIME=$(uptime | sed 's/^.*verage: //')
    # Blank the panel before writing the new value
    echo "                      " > /proc/ppc64/rtas/progress
    echo "$UPTIME" > /proc/ppc64/rtas/progress
    sleep 1
done

After you run the script shown in Example 4-1 in the background, the operator panel in the HMC displays the load averages, as shown in Figure 4-2 on page 167.

Such a script can be run by the cron daemon based on a certain interval. In this way, the administrator knows about the system load without having to log on to the system to check.
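
For use with cron, a one-shot variant of Example 4-1 may be more convenient than an endless loop. The following is a minimal sketch, not part of the original tools: the PANEL_FILE and PANEL_WIDTH variables and the function names are assumptions made for illustration, and the 16-character width is a guess that you should adjust to your operator panel.

```shell
#!/bin/sh
# One-shot variant for cron: write the load averages to the operator panel once.
# PANEL_FILE and PANEL_WIDTH are assumptions for this sketch; adjust as needed.
PANEL_FILE="${PANEL_FILE:-/proc/ppc64/rtas/progress}"
PANEL_WIDTH="${PANEL_WIDTH:-16}"

# Reduce "uptime" output (read from stdin) to the load averages
# and cut the result to the panel width.
format_load() {
    sed 's/^.*verage: //' | cut -c "1-$PANEL_WIDTH"
}

# Write one line of text to the panel file.
update_panel() {
    printf '%s\n' "$1" > "$PANEL_FILE"
}

# Only touch the panel when the RTAS interface is present and writable.
if [ -w "$PANEL_FILE" ]; then
    update_panel "$(uptime | format_load)"
fi
```

Saved under a name of your choice, the script could then be called from a crontab entry at whatever interval suits you. On systems without /proc/ppc64/rtas/progress it does nothing.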

  • /proc/ppc64/rtas/clock (read/write)
    The date command in Linux for pSeries currently only changes the date/time for that particular session. If you need to change the time permanently, you need to update this file. The file uses the same format as the output of the command date +%s (seconds since the epoch):
    # echo "1068155156" > /proc/ppc64/rtas/clock
    
  • /proc/ppc64/rtas/sensors (read)
    The sensors file allows you to be aware of the hardware operations of the server; the file presents you with a list of the sensors detected in the hardware, and gives you its environmental performance.

Example 4-2 on page 168 shows the available sensors detected in a standalone p630 when you run the command cat /proc/ppc64/rtas/sensors.

Example 4-2 Sensors for standalone servers

# cat /proc/ppc64/rtas/sensors
RTAS (RunTime Abstraction Services) Sensor Information
Sensor          Value           Condition       Location
********************************************************
Key switch:     Normal          (read ok)       ---
Power source:   AC              (read ok)       ---
Interrupt queue:        Enabled (read ok)       ---
Surveillance:            0      (read ok)       ---

Example 4-3 shows sensors of an LPAR in a p650 server.

Example 4-3 Sensors for an LPAR server

# cat /proc/ppc64/rtas/sensors
RTAS (RunTime Abstraction Services) Sensor Information
Sensor          Value           Condition       Location
********************************************************
Key switch:     Normal          (read ok)       ---
Power source:   AC              (read ok)       ---
EPOW Sensor:    EPOW Reset      (read ok)       ---
Interrupt queue:        Enabled (read ok)       ---

The available sensors differ from one server model to another, and also differ when the system runs in LPAR mode. In LPAR mode, most of the hardware monitoring is done by the Hardware Management Console (HMC).
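
Because every sensor line in Examples 4-2 and 4-3 carries a condition field, a monitoring script can simply report the lines whose condition is not "(read ok)". The following is a minimal sketch under that assumption; the SENSORS_FILE variable and the failed_sensors function are names introduced here for illustration, and the sketch makes no claim about which other condition strings the firmware may report.

```shell
#!/bin/sh
# SENSORS_FILE is overridable so the function can be tried on a saved copy.
SENSORS_FILE="${SENSORS_FILE:-/proc/ppc64/rtas/sensors}"

# Print every sensor line whose Condition column is not "(read ok)".
# Sensor lines are recognized by the colon after the sensor name;
# the three header lines contain no colon and are skipped.
failed_sensors() {
    awk '/:/ && !/\(read ok\)/ { print }' "$1"
}

# Run the check only where the RTAS sensors file exists.
if [ -r "$SENSORS_FILE" ]; then
    failed_sensors "$SENSORS_FILE"
fi
```

An empty result means that all listed sensors were read successfully. The output could be mailed to the administrator or written to the system log.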

  • /proc/ppc64/rtas/poweron (read/write)
    The file poweron allows you to set the date and the time to power on the system. This is very useful for a development server that needs to be shut down at night.
    date -d 'tomorrow 9:00' +%s > /proc/ppc64/rtas/poweron
    
  • /proc/ppc64/rtas/frequency & /proc/ppc64/rtas/volume (read/write)
    Frequency and volume allow you to manage the speaker in older RS/6000® hardware.
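
The clock and poweron files both take a time expressed in seconds since the epoch, so a small helper can convert a readable date and check the result before writing it. The following is a minimal sketch that assumes GNU date; the RTAS_DIR variable and the rtas_set_time function are hypothetical names used only for illustration.

```shell
#!/bin/sh
# RTAS_DIR is overridable so the helper can be tried against a scratch directory.
RTAS_DIR="${RTAS_DIR:-/proc/ppc64/rtas}"

# Convert a date string understood by GNU date into seconds since the epoch
# and write it to the named RTAS file ("clock" or "poweron").
# Returns 1 without writing anything if the date cannot be parsed.
rtas_set_time() {
    name=$1
    when=$2
    secs=$(date -d "$when" +%s) || return 1
    case $secs in
        ''|*[!0-9]*)
            echo "not a valid epoch value: $secs" >&2
            return 1
            ;;
    esac
    echo "$secs" > "$RTAS_DIR/$name"
}
```

For example, rtas_set_time poweron 'tomorrow 9:00' schedules the power-on time used in the poweron example above, and the same call with clock as the first argument sets the system time.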

4.1.2 IBM diagnostics tools

IBM has recently released the Linux for pSeries Service aids for hardware diagnostics. The service aids allow system administrators to extract valuable information from the robust pSeries service processor for problem determination and servicing. Many of the commands packaged inside the service aids are very similar to the commands that you may find in AIX. The service aid can be downloaded from the Web site:

http://techsupport.services.ibm.com/server/Linux_on_pSeries

The IBM diagnostics tools require POWER4-based systems and a supported release of Linux on pSeries (SuSE SLES8 SP3 or Red Hat Advance Server 3).

To install the packages, run:

# rpm -ivh ppc64-utils-0.4-77.rpm
# rpm -ivh lsvpd-0.9.2-1.ppc.rpm
# rpm -ivh diagela-1.1.0.1-2.ppc.rpm
# rpm -ivh IBMinvscout-2.1-1.ppc.rpm
# rpm -ivh devices.chrp.base.ServiceRM-2.1.0.0-2.ppc.rpm

You need to initialize the lsvpd if you are running it for the first time:

# /etc/init.d/lsvpd start

Make sure that the lsvpd service is started at boot time. The following command creates symbolic links inside /etc/rc.d/rc3.d and /etc/rc.d/rc5.d:

# chkconfig lsvpd 35 # (this will start it at runlevel 3 & 5)

After installing the packages, you will find the following commands available to extract information from your pSeries server using Linux. These commands are installed into the /usr/sbin/ibmras or /usr/sbin/ directory. Some of these commands require root access.

  • nvram
    This command is used to query and print the data stored in the NVRAM of the PowerPC-64 system. Example 4-4 shows the output of the nvram command. Its options are:
      • --print-config: prints out all the variables in the open firmware
      • --print-vpd: prints out all the VPD
      • --print-all-vpd: prints out all the VPD including vendor specific data
      • --print-event-scan: displays the event scan log
      • --print-err-log: displays error log information

Example 4-4 nvram command option to extract event logs

# nvram --print-event-scan
Number of Logs: 7
Log Entry 0: flags: 0x00 type: 0x52
Severity: WARNING
Disposition: FULLY RECOVERED
Extended error type: 3
c6 00 00 08 21 32 58 00 20 03 10 23 00 00 00 00 |....!2X....#....|
Log Entry 1: flags: 0x00 type: 0x52
Severity: WARNING
Disposition: FULLY RECOVERED
Extended error type: 3
c6 00 00 08 22 05 02 00 20 03 10 16 00 00 00 00 |...."...........|
Log Entry 2: flags: 0x00 type: 0x52
Severity: WARNING
Disposition: FULLY RECOVERED
Extended error type: 3
c6 00 00 08 23 42 52 00 20 03 10 08 00 00 00 00 |....#BR.........|
Log Entry 3: flags: 0x00 type: 0x52
Severity: NO ERROR
Disposition: FULLY RECOVERED
Extended error type: 0
Log Entry 4: flags: 0x00 type: 0x52
Severity: NO ERROR
Disposition: FULLY RECOVERED
Extended error type: 0
Log Entry 5: flags: 0x00 type: 0x52
Severity: NO ERROR
Disposition: FULLY RECOVERED
Extended error type: 0
Log Entry 6: flags: 0x00 type: 0x52
Severity: NO ERROR
Disposition: FULLY RECOVERED
Extended error type: 0
  • snap
    The snap command in Linux for pSeries provides functionality that is similar to that of the snap command in AIX. For example, it captures your /var/log/messages, your device-tree inside the /proc file system, /proc tuning, /dev/nvram and your yaboot.conf. It gzips the file for IBM technical support to analyze.
  • update_flash
    This command allows you to update the pSeries firmware directly from Linux. If you are updating the firmware for an LPAR, the respective LPAR requires "service authority" capability. There are cases where you need to download the latest firmware to give your service processor more intelligence and more capability. You can specify it with the option -f and the firmware file that you have just downloaded.

Remember to run the checksum (using the command sum) on the firmware file prior to installation. This verifies that you have downloaded the correct and complete file. Incomplete downloads can be disastrous and could corrupt your server. You can download the latest firmware from the Web site:
https://techsupport.services.ibm.com/server/mdownload
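
The checksum comparison can be scripted so that the firmware update only runs when the values match. The following is a minimal sketch; the verify_sum function is a name introduced here for illustration, and the expected value must come from wherever the checksum for your firmware file is published.

```shell
#!/bin/sh
# Compare the checksum reported by "sum" with the expected value for the file.
# Returns 0 on a match and non-zero otherwise.
verify_sum() {
    file=$1
    expected=$2
    # "sum" prints the checksum first, followed by the block count.
    actual=$(sum "$file" | awk '{ print $1 }')
    [ -n "$actual" ] || return 1
    # Compare numerically so leading zeros in either value do not matter.
    [ "$actual" -eq "$expected" ]
}
```

With a hypothetical file name and checksum, the check and the update could be chained as verify_sum firmware.img 12345 && update_flash -f firmware.img, so that update_flash is never started on a file that fails the comparison.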

While you are updating your firmware, you will see that the operator panel of LPAR will indicate that it is in the process of flashing. Figure 4-3 shows you what you will see after the update_flash command is run.

Refer to the IBM Redbook Effective System Management Using the IBM Hardware Management Console for pSeries, SG24-7038, for information on how to set service authority.

Corresponding to the above screen, you will notice that the Operator Panel of the LPAR will show as "Flashing". Figure 4-4 on page 172 shows the Operator Panel.

  • lscfg
    This command lists all the hardware information that is available in the system. Its options are:
      • -v: prints out in verbose mode
      • -vp: prints out in detail, including vendor-specific information
    The following example shows how to query the number of processors:
    # lscfg -v | grep proc
    proc0 U0.1-P1-C1 Processor
    proc1 U0.1-P1-C2 Processor
    
  • diagela
    This command is part of the error log analysis tool that provides automatic analysis and notification of errors reported by the RunTime Abstraction Service on the pSeries hardware. When an error is detected and corrective actions are required, notification is automatically sent to the Service Focal Point on the Hardware Management Console, or to users specified in the /etc/diagela/mail_list configuration file. At the same time, the results of the analysis are logged in the /var/log/messages file.

Figure 4-5 shows what happens in the background of the diagela application.

Whenever the rtas_errd background daemon scans and detects any error reported by the system firmware, it basically activates the analysis program to deduce what kind of problem it is facing.

After analysis, it reports it back to the system logs and the respective mechanism that you have configured for it to use. Example 4-5 on page 174 shows the analysis of power failure output from the diagela daemon.

Example 4-5 /var/log/messages showing analysis of power failure of the system

Diagela for Linux for pSeries
Oct 28 14:29:41 lpar1 diagela: 10/28/2003 14:29:40
Oct 28 14:29:41 lpar1 diagela: Automatic Error Log Analysis reports the following:
Oct 28 14:29:41 lpar1 diagela:
Oct 28 14:29:41 lpar1 diagela: 651204 ANALYZING SYSTEM ERROR LOG
Oct 28 14:29:41 lpar1 diagela: A loss of redundancy on input power was detected.
Oct 28 14:29:41 lpar1 diagela:
Oct 28 14:29:41 lpar1 diagela: Check for the following:
Oct 28 14:29:41 lpar1 diagela: 1. Loose or disconnected power source connections.
Oct 28 14:29:41 lpar1 diagela: 2. Loss of the power source.
Oct 28 14:29:41 lpar1 diagela: 3. For multiple enclosure systems, loose or
Oct 28 14:29:41 lpar1 diagela: disconnected power and/or signal connections
Oct 28 14:29:41 lpar1 diagela: between enclosures.
Oct 28 14:29:41 lpar1 diagela:
Oct 28 14:29:41 lpar1 diagela: Supporting data:
Oct 28 14:29:41 lpar1 diagela: Ref. Code: 10111520
Oct 28 14:29:41 lpar1 diagela:
Oct 28 14:29:41 lpar1 diagela: Analysis of /var/log/platform sequence number: 3
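
Since the analysis ends up in /var/log/messages, the reference codes can be pulled out of the log for a quick overview. The following is a minimal sketch based only on the line layout shown in Example 4-5; the LOG variable and the diagela_ref_codes function are names introduced here for illustration.

```shell
#!/bin/sh
# LOG is overridable so the function can be tried on a saved copy of the log.
LOG="${LOG:-/var/log/messages}"

# Print "month day time host refcode" for every diagela "Ref. Code" line.
diagela_ref_codes() {
    awk '/diagela: Ref\. Code:/ { print $1, $2, $3, $4, $NF }' "$1"
}

# Run the extraction only when the log file is readable.
if [ -r "$LOG" ]; then
    diagela_ref_codes "$LOG"
fi
```

Each output line pairs the time of the report with its reference code, which can then be looked up in the service documentation.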

5.1 Service Aids

This section describes the hardware service diagnostic aids and productivity tools available for IBM servers running Linux operating systems on Power4 and Power5 processors.

The latest versions of these tools can be found at http://www14.software.ibm.com/webapp/set2/sas/f/lopdiags/home.html.

The distributions now ship these utilities, but under different package names. RHEL5.2 ships a ppc64-utils package, while SLES10 SP2 ships a powerpc-utils package and an lsvpd package. The RHEL ppc64-utils package contains all the utilities that are split between the two SLES packages (powerpc-utils and lsvpd).

You can have problems with the dependencies when installing the upstream IBM packages, especially if installing the IBM packages over (or instead of) your distribution's packages.
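
Before installing the IBM packages, it can help to see which of the distribution packages are already present. The following is a minimal sketch that relies only on rpm -q returning a non-zero status for a package that is not installed; the installed_packages function is a name introduced here for illustration.

```shell
#!/bin/sh
# Print the name of each listed package that rpm reports as installed.
installed_packages() {
    for pkg in "$@"; do
        if rpm -q "$pkg" >/dev/null 2>&1; then
            echo "$pkg"
        fi
    done
}

# Check the distribution packages named above, if rpm is available.
if command -v rpm >/dev/null 2>&1; then
    installed_packages ppc64-utils powerpc-utils lsvpd
fi
```

Any package name that is printed is a candidate for a dependency or file conflict with the corresponding IBM package.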

These service and productivity tools perform the following functions.

Base tools

  • Access Power platform features that are provided by the system's firmware
  • Gather hardware inventory and microcode level
  • Access system boot lists, LEDs, reboot policies, etc. from the operating system command line
  • Communicate with an attached Hardware Management Console

Service tools

  • Analyze errors or events and perform actions to increase system availability or protect data integrity
  • Communicate errors to an attached Hardware Management Console or to IBM Service
  • Survey hardware and communicate the results to the IBM Machine Reported Product Data database

Productivity tools

  • Hotplug add, remove, or replace PCI devices
  • Dynamically add or remove processors or I/O slots from a running partition using an attached Hardware Management Console

SLES9 tools

To install and use the following tools under SLES9, ensure that rdist-6.1.5-792.1 and compat-2004.7.1-1.2 are installed from the SLES9 media.

Title: Download

  • Platform Enablement Library: librtas-1.1-17.ppc64.rpm
  • SRC: src-1.2.1.0-0.ppc.rpm
  • RSCT utilities: rsct.core.utils-2.3.3.4-0.ppc.rpm
  • RSCT core: rsct.core-2.3.3.4-0.ppc.rpm
  • CSM core: csm.core-1.3.3.2-69.ppc.rpm
  • CSM client: csm.client-1.3.3.2-69.ppc.rpm
  • ServiceRM: devices.chrp.base.ServiceRM-2.2.0.0-1.ppc.rpm
  • DynamicRM: DynamicRM-1.1-2.ppc.rpm
  • Service Aids: ppc64-utils-2.1-0.ppc64.rpm
  • Hardware Inventory: lsvpd-0.12.7-1.ppc.rpm
  • Error Log Analysis: diagela-1.3.0.0-5.ppc64.rpm
  • PCI Hotplug Tools: rpa-pci-hotplug-1.0-10.ppc.rpm
  • Dynamic Reconfiguration Tools: rpa-dlpar-1.0-12.ppc.rpm
  • Inventory Scout: IBMinvscout-2.2-5.ppc.rpm
  • I/O Error Log Analysis: evlog-drv-tmpl-0.8-1.ppc64.rpm

Platform Enablement Library (base tool)

The librtas package contains a library that allows applications to access certain functionality provided by platform firmware. This functionality is required by many of the other higher-level service and productivity tools.

SRC

SRC is a facility for managing daemons on a system. It provides a standard command interface for defining, undefining, starting, stopping, querying status and controlling trace for daemons.

Reliable scalable cluster technology (RSCT) core and utilities

The RSCT packages provide the Resource Monitoring and Control (RMC) functions and infrastructure needed to monitor and manage one or more Linux systems. RMC provides a flexible and extensible system for monitoring numerous aspects of a system. It also allows customized responses to detected events.

Cluster Systems Management (CSM) core and client

The CSM packages provide for the exchange of host-based authentication security keys. These tools also set up distributed RMC features on the Hardware Management Console (HMC).

Service Resource Manager (ServiceRM)

Service Resource Manager is a Reliable, Scalable, Cluster Technology (RSCT) resource manager that creates the Serviceable Events from the output of the Error Log Analysis Tool (diagela). ServiceRM then sends these events to the Service Focal Point on the Hardware Management Console (HMC).

DynamicRM (Productivity tool)

Dynamic Resource Manager is a Reliable, Scalable, Cluster Technology (RSCT) resource manager that allows a Hardware Management Console (HMC) to do the following:

  • Dynamically add or remove processors or I/O slots from a running partition
  • Concurrently update system firmware
  • Perform certain shutdown operations on a partition

Service Aids (base tool) (formerly called "update_flash, snap commands")

The utilities in the ppc64-utils package enable a number of RAS (Reliability, Availability, and Serviceability) features. Among others, these utilities include the update_flash command for installing system firmware updates; the serv_config command for modifying various serviceability policies; the usysident and usysattn utilities for manipulating system LEDs; the bootlist command for updating the list of devices from which the system will boot; and the snap command for capturing extended error data to aid analysis of intermittent errors.

Hardware Inventory (base tool - formerly called "lsvpd, lscfg commands")

The lsvpd package contains the lsvpd, lscfg, and lsmcode commands. These commands, along with a boot-time scanning script called update-lsvpd-db, constitute a hardware inventory system. The lsvpd command provides Vital Product Data (VPD) about hardware components to higher-level serviceability tools. The lscfg command provides a more human-readable format of the VPD, as well as some system-specific information.

The information these tools provide is only correct if vpdupdate has been run since the last hardware change. If you are unsure whether any changes have been made, run vpdupdate manually and then use the lsmcode, lsvpd, and related tools.

Error Log Analysis (service tool)

The Error Log Analysis tool provides automatic analysis and notification of errors reported by the platform firmware on IBM eServer pSeries systems. This RPM analyzes errors written to /var/log/platform. If a corrective action is required, notification is sent to the Service Focal Point on the Hardware Management Console (HMC), if so equipped, or to users subscribed for notification via the file /etc/diagela/mail_list. The Serviceable Event sent to the Service Focal Point and listed in the e-mail notification may contain a Service Request Number. This number is listed in the "Diagnostics Information for Multiple Bus Systems" manual.

PCI Hotplug Tools (Productivity tool)

The rpa-pci-hotplug package contains two tools to allow PCI devices to be added, removed, or replaced while the system is in operation: lsslot, which lists the current status of the system's PCI slots, and drslot_chrp_pci, an interactive tool for performing hotplug operations.

Dynamic Reconfiguration Tools (Productivity tool)

The rpa-dlpar package contains a collection of tools allowing the addition and removal of processors and I/O slots from a running partition. These tools are invoked automatically when a dynamic reconfiguration operation is initiated from the attached Hardware Management Console (HMC).

IBM Inventory Scout (Service tool)

The Inventory Scout package provides an application to gather hardware inventory for a system, including but not limited to:

  • Features installed
  • EC levels of hardware
  • Microcode

IBM uses this information to determine required repair parts and to assist in configuring system upgrades. If you have an attached Hardware Management Console (HMC), you can initiate Inventory Scout functionality on a partition by using the "Service Applications" panel of the HMC.

I/O Error Log Analysis (Service tool)

The I/O Error Log Analysis package provides automatic analysis and notification of I/O errors on IBM POWER-based systems. I/O errors will be written to evlog, and notification will be sent to the Service Focal Point on the Hardware Management Console (HMC), if so equipped. The Serviceable Event sent to the Service Focal Point may contain a System Reference Code (SRC). These codes are documented in the "eServer Hardware Information Center."

The evlog-drv-tmpl tool requires evlog-1.6.0-xx (shipped with SLES9). This RPM will:

  • Install the driver templates for bcm5700, e100, e1000, emulex, ipr, olympic, and pcnet32
  • Update evlog ELA scripts
  • Update the evlog startup script to load or unload ELA rules during boot and shutdown

After installation, you must restart evlog to load these new ELA rules. To restart evlog, run the following command: /etc/init.d/evlog restart

Here's a little script for automatic download/installation of the tools. The example is for SLES9 in an HMC-managed environment.
This script can run on any system with a connection to the Internet (like my laptop) and has the advantage that you can download/install the tools quickly and easily (without knowing the exact version numbers, etc.):

#! /bin/bash

# The page which holds all images
DOWNLOAD_URI="http://www14.software.ibm.com/webapp/set2/sas/f/lopdiags/images"

# Adapt the list as appropriate for your environment
LIST="librtas src rsct.core.utils rsct.core csm.core csm.client devices.chrp.base.ServiceRM DynamicRM ppc64-utils lsvpd diagela rpa-pci-hotplug rpa-dlpar IBMinvscout evlog-drv-tmpl"

# Fetch the index page that links to the current package versions
wget http://www14.software.ibm.com/webapp/set2/sas/f/lopdiags/suselinux/hmcmanaged/sles9.html -O /tmp/lop

for SLOT in $LIST
do
	# Extract the package file name that follows "images/" in the link
	PACKAGE=$(grep -F -- "$SLOT-" /tmp/lop | sed 's/^.*images\///' | sed 's/".*//' | head -n 1)
	# Skip names that do not appear on the index page
	[ -n "$PACKAGE" ] || continue
	wget "$DOWNLOAD_URI/$PACKAGE"
	# As an alternative and if your LPAR/System has an Internet connection
	# you can rpm the packages directly:
	# rpm -Uvh "$DOWNLOAD_URI/$PACKAGE"
done

rm -f /tmp/lop
Posted by pjuerss at Mar 21, 2006 04:44

 