POWER7 information

General diagnostic information

Use the general diagnostic information to view logs, to run tests, and to use diagnostic service utilities that can help a service provider.

For more information about working with Linux, see the Linux Information Center.

AIX operating system message files

English is the default language displayed by the diagnostic programs when run from disk. If you want to run the diagnostic programs in a language other than English, you must install the AIX® message locale file for the desired language on the system.

Firmware and microcode

There are several types of firmware that are used by the system:

Power subsystem firmware (if applicable)
Service power control network (SPCN) firmware (if applicable)
Service processor firmware (if applicable)
System firmware

The following types of microcode are used by the system:

Adapter microcode
Device microcode

If a management console is attached to the server, the management console must be used to manage the firmware and microcode levels on the server.

If a management console is not attached to the server, diagnostic tasks can be used to display device and adapter microcode levels. The tasks can also be used to update device and adapter microcode. Diagnostic tasks also provide the capability to update firmware.

To determine the level of server firmware, and device and adapter microcode, use the Display Microcode Level task in diagnostic service aids. This task presents a list of resources that are currently installed and supported by this task. You then select the resource whose microcode level you wish to check. If you are using the AIX operating system and using online diagnostics, the lsmcode command and the diag command can also be used to display the firmware and microcode levels of individual entities in the system from the command line. For more information, see Display Microcode Level. For adapters and devices not supported by this task, refer to the instructions provided by the manufacturer to determine the microcode levels.
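On AIX with online diagnostics, the command-line check mentioned above might look like the following sketch. It assumes the lsmcode flags -c (display without menus) and -A (all supported devices); the wrapper name show_microcode_levels is hypothetical, and the guard only reports when lsmcode is not installed.

```shell
# Hypothetical wrapper: show firmware and microcode levels without menus.
show_microcode_levels() {
    # lsmcode ships with the AIX diagnostics; report politely elsewhere.
    if ! command -v lsmcode >/dev/null 2>&1; then
        echo "lsmcode not found; run this on AIX with diagnostics installed" >&2
        return 1
    fi
    lsmcode -c    # system firmware level, no menus
    lsmcode -A    # microcode level of every supported device
}
```

Adapters and devices that lsmcode does not support still need the manufacturer's instructions.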

Use the Update and manage system flash task to update firmware on the server. When the flash update is complete, the server automatically reboots. See Updates for detailed scenarios that explain how to use the update and manage system flash task.

Use the Download microcode service aid on systems running AIX 5.2.0.30 or later to update the microcode on adapters and devices. For details on updating adapter and device microcode, see Updates.

If your system is running the Linux operating system, you can use the service aids in the stand-alone diagnostics to update most system flash, adapter, and device microcode.

CEREADME file

A CEREADME (CE readme file) is available on all diagnostic media. This file might contain information such as:

Errata information for the service information
Service hints for problems
Diagnostic information that might not be included in service information
Other pertinent (release-specific) information

The CEREADME file is helpful in describing differences in diagnostics between the current version and the preceding version.

You can view the CEREADME file by using the Service Hints service aid after the diagnostics are loaded. Also, you can read the file directly from the disk using the pg command to display /usr/lpp/diagnostics/CEREADME. The CEREADME file can be copied or printed using the normal commands. For information about using the service hints, see Display Service Hints.

Print the CEREADME file from disk

You can print the CEREADME file from disk using the cat command. The path to this file is as follows: /usr/lpp/diagnostics/CEREADME

A copy of this file should be printed and stored with the Service Information. lp0 is normally the printer attached to the parallel port. If a printer is attached to the parallel port and is considered as lp0, the command for printing the file is as follows:

cat /usr/lpp/diagnostics/CEREADME > /dev/lp0

Print the CEREADME file from a source other than disk

The CEREADME file cannot be printed while diagnostics are being run from a source other than from the disk. The file can be printed on a system when the operating system is running in a normal user environment. The procedure involves copying the file from the diagnostic media to a temporary file on disk, printing the file, and then deleting the file from disk. Check for directory /tmp/diag. To determine whether this directory exists, enter:

cd /tmp/diag

If the directory does not exist, the message /tmp/diag: not found displays. Do not attempt to print the CEREADME file if this message is not displayed. To print the CEREADME file, choose the appropriate section below and follow the steps listed.
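The check above relies on the error message from cd. As a minimal sketch, the same condition can be tested without changing directory; the function name diag_dir_is_free is hypothetical and is not part of the diagnostics package.

```shell
# Hypothetical helper: succeed only when the mount point does not exist yet,
# which is the condition under which the printing steps may proceed.
diag_dir_is_free() {
    dir=${1:-/tmp/diag}
    if [ -e "$dir" ]; then
        echo "$dir already exists; do not continue" >&2
        return 1
    fi
    echo "$dir not found; safe to continue"
}

# Example use before the printing steps:
#   diag_dir_is_free /tmp/diag || exit 1
```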

Print the CEREADME file from CD-ROM

Insert the diagnostic CD-ROM disc into the CD-ROM drive, and then enter the following commands:

mkdir /tmp/diag
mount -o ro -v cdrfs /dev/cd0  /tmp/diag
cd /tmp/diag/usr/lpp/diagnostics
cat CEREADME > /dev/lp0
cd /tmp
unmount /dev/cd0

The CEREADME file prints on lp0, which is the printer normally attached to the parallel port. If this file is not the same as the CEREADME file on the disk, a copy of this file should be printed and stored with the Service Information.
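Whether the copy on the diagnostic media differs from the copy on disk can be checked with cmp before the CD-ROM is unmounted. This is a sketch only: the function name cereadme_differs is hypothetical, and the paths in the usage comment assume the CD-ROM is still mounted on /tmp/diag as in the steps above.

```shell
# Hypothetical helper: exit status 0 when the two files differ.
cereadme_differs() {
    # cmp -s prints nothing and returns non-zero when the files differ.
    if cmp -s "$1" "$2"; then
        return 1    # identical: no extra printout needed
    fi
    return 0        # different (or one of the files is unreadable)
}

# Example use while the CD-ROM is still mounted:
#   cereadme_differs /tmp/diag/usr/lpp/diagnostics/CEREADME \
#       /usr/lpp/diagnostics/CEREADME && echo "print and store the media copy"
```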

CE login

CE login enables a user to perform operating system commands that are required to service the system without being logged in as a root user. CE login must have a role of RunDiagnostics and a primary group of system. This login enables the user to:

Run the diagnostics including the service aids, such as hot plug tasks, certify, and format.
Run all the operating system commands run by system group users.
Configure and unconfigure devices that are not busy.

In addition, CE login can have the shutdown group enabled to allow:

Use of the Update System Microcode service aid.
Use of shutdown and reboot operations.

To use CE login, ask the customer to create a unique user name and configure these characteristics for that name. After the user name is set up, you will need to obtain the user name and password from the customer to log in with these capabilities. The recommended CE login user name is qserv.

Automatic diagnostic tests

All automatic diagnostic tests run after the system unit is turned on and before the AIX operating system is loaded.

The automatic diagnostic tests display progress indicators (or checkpoints) to track test progress. If a test stops or hangs, the checkpoint for that test remains in the display to identify the unsuccessful test. The descriptions of these tests are contained in Reference code finder.

Power-on self-test

Power-On Self-Test (POST) programs check the devices needed to accomplish an initial program load. The POST also checks the memory, and portions of the central electronics complex, common interrupt handler, and the direct memory access (DMA) handler.

Configuration program

The configuration program determines which features, adapters, and devices are present on the system. The configuration program, which is part of the AIX operating system, builds a configuration list that is used by the diagnostic programs. This list is used to control which tests are run during system checkout.

On systems running AIX, the configuration program displays numbers from 2E6 through 9FF and from 2300 through 27FF in the operator panel display (if present). See Reference code finder for a listing of program actions associated with displayed numbers. On systems running logical partitions, LPAR displays in the operator panel (if present) after the hypervisor (the system firmware that controls the allocation of resources) is loaded. When a partition running AIX is then booted, the configuration codes display in the Reference code column in the management console Contents area.

Devices attached to serial and parallel ports are not configured. The Dials and Lighted Program Function Keys (LPFKs) can be tested from online diagnostics after they are manually configured. No other device attached to the serial and parallel ports is supported by the diagnostics.

CPU and memory testing and error log analysis

Except for the floating-point tests, all CPU and memory testing on the system units is done by the power-on self-test (POST) and the built-in self-test (BIST). Memory is tested entirely by the POST, which provides an error-free memory map. If the POST cannot find enough good memory to boot, it halts and displays an error message. If the POST finds enough good memory, the memory problems are logged and the system continues to boot.

If any memory errors were logged, they are reported by the base system or memory diagnostics, which must be run to analyze the POST results.

The CPU and memory cannot be tested after the diagnostics are loaded; however, they are monitored for correct operation by various checkers such as processor runtime diagnostics.

Single-bit memory errors are corrected by ECC (Error Checking and Correction) on systems equipped with ECC memory.

Diagnostic programs

This section provides an overview of the various diagnostic programs.

Diagnostic controller

The diagnostic controller runs as an application program on the AIX operating system. The diagnostic controller carries out the following functions:

Displays diagnostic menus
Checks availability of needed resources
Checks error log entries under certain conditions
Loads diagnostic application programs
Loads task and service aid programs
Displays test results

To test an adapter or device, select the device or adapter from the diagnostic selection menu. The diagnostic controller then loads the diagnostic application program for the selected device or adapter.

The diagnostic application program loads and runs test units to check the functions of the device or adapter.

The diagnostic controller checks the results of the tests done by the diagnostic application and determines the action needed to continue the testing.

The amount of testing that the diagnostic application does depends on the mode (service, maintenance, or concurrent) under which the diagnostic programs are running.

Error log analysis

If you are running the stand-alone diagnostics, error log analysis occurs on errors logged while booting the stand-alone diagnostics CD, or while running the stand-alone diagnostics.

When you select the diagnostics or advanced diagnostics option, the diagnostic selection menu is displayed (other menus might be shown before this menu). You can select the purpose for running diagnostics by using this menu.

When you select the problem determination option, the diagnostic programs read and analyze the contents of the error log.

Note: Most hardware errors in the operating system error log contain sysplanar0 as the resource name. The resource name identifies the resource that detected the error; it does not indicate that the resource is faulty or should be replaced. Use the resource name to determine the appropriate diagnostic to analyze the error.

If the error log contains recent errors (approximately the last seven days), the diagnostic programs automatically select the diagnostic application program to test the adapter or device that the error was logged against.

If there are no recent errors logged or the diagnostic application program runs without detecting an error, the diagnostic selection menu is displayed. You can select a resource for testing by using this menu.

If an error is detected while the diagnostic application program is running, the A PROBLEM WAS DETECTED screen displays a service request number (SRN).

Note: After a FRU is replaced based on an error log analysis program, the error log entries for the problem device must be removed or the program might continue to indicate a problem with the device. To accomplish this task, run the errclear command from the command line. Alternatively, you can use the System Management Interface Tool (SMIT) to select Problem Determination/Error Log/Clear the Error Log. Fill out the appropriate menu items.
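As a sketch of the command-line route, the entries logged against the replaced device can be removed by resource name. This assumes the errclear -N flag (resource name list) and a trailing 0 meaning entries of any age; the wrapper name clear_errors_for is hypothetical, and the device name in the usage comment is only an example.

```shell
# Hypothetical helper: remove error log entries logged against one resource.
clear_errors_for() {
    resource=$1
    if [ -z "$resource" ]; then
        echo "usage: clear_errors_for ResourceName" >&2
        return 2
    fi
    if ! command -v errclear >/dev/null 2>&1; then
        echo "errclear not found; run this on AIX as root" >&2
        return 1
    fi
    # -N limits deletion to the named resource; 0 means entries of any age.
    errclear -N "$resource" 0
}

# Example use after replacing a FRU (device name is illustrative):
#   clear_errors_for hdisk0
```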

Enhanced FRU isolation

The diagnostics provide enhanced field replaceable unit (FRU) isolation by automatically selecting associated resources. The typical way in which diagnostics select a resource is to present a list of system resources, and you are then asked to select one. Diagnostics begin with that same type of selection.

If the diagnostic application for the selected resource detects a problem with that resource, the diagnostic controller checks for an associated resource. For example, if the test of a disk drive detects a problem, the diagnostic controller tests a sibling device on the same controller. This test determines whether the drive or the controller is failing. This extra FRU isolation is apparent when you test a resource and notice that the diagnostic controller continues to test another resource that you did not select.

Advanced diagnostics function

The advanced diagnostics function is normally used by a service representative. These diagnostics might ask you to disconnect a cable and install a wrap plug.

The advanced diagnostics run in the same modes as the diagnostics used for normal hardware problem determination. The advanced diagnostics provide additional testing by allowing the service representative to do the following tasks:

Use wrap plugs for testing.
Loop on a test (not available in concurrent mode) and display the results of the testing.

Task and service aid functions

If a device does not show in the test list, or a diagnostic package is not loaded for a device, check it by using the display configuration and resource list task. If the device you want to test has a plus (+) sign or a minus (-) sign preceding its name, the diagnostic package is loaded. If the device has an asterisk (*) preceding its name, the diagnostic package for the device is not loaded or is not available.

Tasks and service aids provide a means to display data, check media, and check functions without being directed by the hardware problem determination procedure. For more information about tasks and service aids, see Tasks and service aids.

System checkout

The system checkout program uses the configuration list generated by the configuration procedure to determine which devices and features to test. These tests run without interaction. To use system checkout, select All Resources on the resource selection menu.

Missing resource description

In diagnostics versions earlier than 5.2.0, missing devices are presented on a missing resource screen. This happens as a result of running diag -a or of booting online diagnostics in service mode.

In diagnostics version 5.2.0 and later, missing devices are identified on the diagnostic selection screen by an uppercase M preceding the name of the device that is missing. The diagnostic selection menu is displayed anytime you run the diagnostic routines or the advanced diagnostics routines. The diagnostic selection menu can also be entered by running diag -a when there are missing devices or missing paths to a device.

When a missing device is selected for processing, the missing resource menu checks several items. It checks whether the device is turned off, removed from the system, moved to a different physical location, or if it is still present.

When a single device is missing, the fault is probably with that device. When multiple devices with a common parent are missing, the fault is most likely related to a problem with the parent device.

The diagnostic procedure might include testing the parent of the device, analyzing which devices are missing, and any manual procedures that are required to isolate the problem.
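The single-device versus common-parent reasoning above can be illustrated with a short sketch. It is not how the diagnostics implement the check: the function name suspect_missing is hypothetical, the input format (one "device parent" pair per line) is an assumption, and how the list of missing devices is obtained is left open.

```shell
# Hypothetical helper: read "device parent" pairs for missing devices on
# standard input and name the most likely suspect under each parent.
suspect_missing() {
    awk '
        { count[$2]++; last[$2] = $1 }
        END {
            for (p in count) {
                if (count[p] > 1)
                    print "suspect parent " p " (" count[p] " children missing)"
                else
                    print "suspect device " last[p]
            }
        }' | sort
}

# Example with made-up device names:
#   printf 'hdisk1 scsi0\nhdisk2 scsi0\nrmt0 scsi1\n' | suspect_missing
# prints:
#   suspect device rmt0
#   suspect parent scsi0 (2 children missing)
```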

Missing path resolution for MPIO resources

Diagnostics also identifies a multipath I/O device that has multiple configured paths, all of which are missing, as a missing device. If some, but not all, paths to a multipath I/O device are missing, then diagnostics identifies those paths as missing. In such an instance, an uppercase P displays in front of the multipath I/O device.

When a device with missing paths is selected from the diagnostic selection menu, the missing path selection menu displays showing the missing paths for the device. The menu requests the user to select a missing path for processing. If the device has only one missing path, then the selection menu is bypassed. In either case, a menu is displayed showing the selected missing path and other available paths to the device (which might be missing or available). Use the menu to check whether the missing path has been removed, has not been removed, or should be ignored. The procedures are as follows:

If the Path Has Been Removed option is selected, diagnostics removes the path from the database.
If the Path Has Not Been Removed option is selected, diagnostics determines why the path is missing.
If the Run Diagnostics on the Selected Device option is selected, diagnostics runs on the device and does not change the system configuration.

Automatic error log analysis (diagela)

Automatic error log analysis (diagela) is supported only when running the online diagnostics. When enabled, the diagela program performs error log analysis whenever a permanent hardware error is logged. The program can be enabled on all platforms.

Note: If you are using the Linux operating system, the ppc64-diag service aid is used for error log analysis. See Obtaining service and productivity tools for Linux.

The diagela program determines whether the error should be analyzed by the diagnostics. If the error should be analyzed, a diagnostic application is invoked and the error is analyzed. No testing is done if the diagnostics determine that the error requires a service action. Instead, the diagnostics send a message to your console and to either the Service Management applications (on systems with a management console) or all system groups. The message contains the SRN.

Running diagnostics in this mode is similar to using the diag -c -e -d Device command.

Notification can also be customized by adding a stanza to the PDiagAtt object class. The following example illustrates how a program can be invoked in place of the normal mail message. The example also shows that you can send the message to the Service Management application when there is no HMC.

PDiagAtt:
     DClass = " "
     DSClass = " "
     DType = " "
     attribute = "diag_notify"
     value = "/usr/bin/customer_notify_program $1 $2 $3 $4 $5"
     rep = "s"

If DClass, DSClass, and DType are blank, then the customer_notify_program applies for all devices. If you enter specifics in the DClass, DSClass, and DType the customer_notify_program is invoked only for that device type.
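A stanza like the one above is normally kept in a file before it is loaded into the ODM. The following sketch generates such a file; the function name notify_stanza and the output file name are hypothetical. Loading the file with odmadd, as root and with ODMDIR pointing at the repository that holds PDiagAtt, is an assumption to confirm on the system.

```shell
# Hypothetical helper: print a diag_notify stanza for PDiagAtt.
# Arguments: program path, then optional DClass, DSClass, DType.
# Fields left out stay blank, so the program applies to all devices.
notify_stanza() {
    program=$1
    printf 'PDiagAtt:\n'
    printf '     DClass = "%s"\n'  "${2:- }"
    printf '     DSClass = "%s"\n' "${3:- }"
    printf '     DType = "%s"\n'   "${4:- }"
    printf '     attribute = "diag_notify"\n'
    # $1..$5 stay literal here; they are expanded when the program is invoked.
    printf '     value = "%s $1 $2 $3 $4 $5"\n' "$program"
    printf '     rep = "s"\n'
}

# Example use (file name is illustrative):
#   notify_stanza /usr/bin/customer_notify_program > /tmp/diag_notify.add
#   odmadd /tmp/diag_notify.add
```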

After the above stanza is added to the ODM database, problems are displayed on the system console. Then, the program specified in the value field of the diag_notify predefined attribute is invoked. The following keywords are expanded automatically as arguments to the notify program:

$1 the keyword diag_notify
$2 the resource name that reported the problem
$3 the Service Request Number
$4 the device type
$5 the error label from the error log entry
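A notify program receives these five values as its positional arguments. The following is a minimal sketch of the body of such a program, written as a shell function; the name customer_notify and the message format are hypothetical, and a real program might send mail or a page instead of printing a line.

```shell
# Hypothetical notify program body. The arguments arrive in the order
# listed above: keyword, resource name, SRN, device type, error label.
customer_notify() {
    keyword=$1
    resource=$2
    srn=$3
    devtype=$4
    label=$5
    if [ "$keyword" != "diag_notify" ]; then
        echo "unexpected keyword: $keyword" >&2
        return 1
    fi
    echo "SRN $srn reported by $resource (type $devtype, error label $label)"
}
```

Checking the first argument guards against the program being called from a stanza with a different attribute.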

If no diagnostic program is found to analyze the error log entry, or analysis is done but no error was reported, a separate program can be specified to be invoked. This is accomplished by adding a stanza to the PDiagAtt object class with an attribute = diag_analyze. The following example illustrates how a customer's program can be invoked for this condition:

PDiagAtt:
     DClass = " "
     DSClass = " "
     DType = " "
     attribute = "diag_analyze"
     value = "/usr/bin/customer_analyzer_program $1 $2 $3 $4 $5"
     rep = "s"

If DClass, DSClass, and DType are blank, then the customer_analyzer_program applies for all devices. Specifying the DClass, DSClass, and DType with details causes the customer_analyzer_program to be invoked only for that device type.

After the above stanza is added to the ODM database, the program specified is invoked if there is no diagnostic program specified for the error, or if analysis was done but no error was found. The following keywords expand automatically as arguments to the analyzer program:

$1 the keyword diag_analyze
$2 the resource name that reported the problem
$3 the error label from the error log entry if from ELA, the keyword PERIODIC if from Periodic Diagnostics, or the keyword REMINDER if from a Diagnostic Reminder.
$4 the device type
$5 the keywords:
- no_trouble_found if the analyzer was run, but no trouble was found.
- no_analyzer if the analyzer is not available.
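An analyzer program can branch on the third and fifth arguments to tell these cases apart. The following is a minimal sketch written as a shell function; the name customer_analyzer and the message wording are hypothetical.

```shell
# Hypothetical analyzer program body. Arguments follow the list above:
# keyword, resource name, error label or PERIODIC or REMINDER,
# device type, and no_trouble_found or no_analyzer.
customer_analyzer() {
    keyword=$1
    resource=$2
    source_label=$3
    devtype=$4
    outcome=$5
    if [ "$keyword" != "diag_analyze" ]; then
        echo "unexpected keyword: $keyword" >&2
        return 1
    fi
    case $source_label in
        PERIODIC) origin="periodic diagnostics" ;;
        REMINDER) origin="diagnostic reminder" ;;
        *)        origin="error log entry $source_label" ;;
    esac
    case $outcome in
        no_trouble_found)
            echo "$resource ($devtype): analyzed, no trouble found [$origin]" ;;
        no_analyzer)
            echo "$resource ($devtype): no analyzer available [$origin]" ;;
        *)
            echo "unknown outcome: $outcome" >&2
            return 1 ;;
    esac
}
```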

To activate the automatic error log analysis feature, log in as root user (or use the CE login) and type the following command:

/usr/lpp/diagnostics/bin/diagela ENABLE

To disable the automatic error log analysis feature, log in as root user (or use the CE login) and type the following command:

/usr/lpp/diagnostics/bin/diagela DISABLE

The diagela program can also be enabled and disabled using the periodic diagnostic service aid.

Log repair action

Note: The log repair action is only supported when using online diagnostics.

The diagnostics perform error log analysis on most resources. The default time for error log analysis is seven days; however, this time can be changed from 1 to 60 days by using the display or change diagnostic run time options task. To prevent false problems from being reported when error log analysis is run, repair actions need to be logged whenever a FRU is replaced. A repair action can be logged by using the log repair action task or by running advanced diagnostics in system verification mode.

The log repair action task lists all resources. Replaced resources can be selected from the list, and when commit (F7 key) is selected, a repair action is logged for each selected resource.

Updates

Learn about obtaining machine code updates for your management console, server firmware, and I/O adapters and devices, as well as operating system updates.

Updates provide changes to your software, Licensed Internal Code, or machine code that fix known problems, add new function, and keep your server or management console operating efficiently. For example, you might install updates for your operating system in the form of a program temporary fix (PTF). Or, you might install a server firmware update with code changes that are needed to support new hardware or new functions of the existing hardware.

A good update strategy is an important part of maintaining and managing your server. If you have a dynamic environment that changes frequently, install updates on a regular basis. If you have a stable environment, you do not have to install updates as frequently. However, you should consider installing updates whenever you make any major software or hardware changes in your environment.

You can get updates using various methods, depending on your service environment. For example, if you use an HMC to manage your server, you can use the HMC interface to download, install, and manage your HMC and firmware updates. If you do not use an HMC to manage your server, you can use the functions specific to your operating system to get your updates. In addition, you can download or order many updates through Internet websites.

You must manage several types of updates to maintain your hardware. The following figure shows the different types of hardware and software that might require updates.

Figure 1. This diagram shows the hardware and software that might require updates.

HMC user interface

Learn about the Hardware Management Console (HMC) graphical user interface.

The HMC provides a menu (also called the context menu) for quick access to menu choices. The menu lists the actions found in the Selected and Object menus for the current object or objects.

The user interface provided with the Hardware Management Console (HMC) uses navigation that provides hierarchical views of system resources and tasks. This user interface is made up of several major components: the banner, the navigation pane, the work pane, the task bar, and the status bar. The following sections describe each of these components.

System fault indicator and system identify indicator

Some systems support the system identify indicator, the system fault indicator, or both.

The system identify indicator is used to help physically identify a particular system in a room. The system fault indicator is used to help physically identify a particular system that has a fault condition.

On a system that supports the system fault indicator, the indicator is set to the fault condition when a fault is detected. After the problem with the system is fixed, the system fault indicator must be set back to normal. This is done by using the log repair action task. For more information, see Log repair action.

Note: This action keeps the system fault indicator from being set to the fault state because of a previous error in the error log that has already been serviced.

Both of these indicator functions can be managed by using the system identify indicator and system fault indicator tasks. For more information, see System Fault Indicator or System Identify Indicator.

Array bit steering

An advanced feature of many systems is array bit steering. The processors in these systems have internal cache arrays with extra memory capacity that can be configured to correct certain types of array faults.

This reconfiguration can be used to correct arrays for faults detected at IPL or run time. If a fault is detected during run time, the recoverable fault is reported with a Repair Disposition Pending Reboot indicator set. This setting allows diagnostics to call out a service request number that identifies the array and directs the service representative to a MAP for problem resolution that uses array bit steering. If array bit steering cannot be used for the reported fault, then the FRU with that array is replaced.

Enhanced I/O error handling

Enhanced I/O Error Handling (EEH) is an error recovery strategy for errors that can occur during I/O operations on the PCI bus. Not all systems support EEH; if you get an SRN involving an EEH error, follow the action listed.


Last updated: Tue, September 23, 2014