Use the general diagnostic information to view logs, to run tests, and to use diagnostic service utilities that can help a service provider.
For more information about working with Linux, see the Linux Information Center.
English is the default language displayed by the diagnostic programs when run from disk. To run the diagnostic programs in a language other than English, you must install the AIX® message locale file for the desired language on the system.
If a management console is attached to the server, the management console must be used to manage the firmware and microcode levels on the server.
If a management console is not attached to the server, diagnostic tasks can be used to display device and adapter microcode levels. The tasks can also be used to update device and adapter microcode. Diagnostic tasks also provide the capability to update firmware.
To determine the level of server firmware, and device and adapter microcode, use the Display Microcode Level task in diagnostic service aids. This task presents a list of resources that are currently installed and supported by this task. You then select the resource whose microcode level you wish to check. If you are using the AIX operating system and online diagnostics, you can also use the lsmcode command and the diag command to display the firmware and microcode levels of individual entities in the system from the command line. For more information, see Display Microcode Level. For adapters and devices not supported by this task, refer to the instructions provided by the manufacturer to determine the microcode levels.
Use the Update and manage system flash task to update firmware on the server. When the flash update is complete, the server automatically reboots. See Updates for detailed scenarios that explain how to use the update and manage system flash task.
Use the Download microcode service aid on systems running AIX 5.2.0.30 or later to update the microcode on adapters and devices. For details on updating adapter and device microcode, see Updates.
If your system is running the Linux operating system, you can use the service aids in the stand-alone diagnostics to update most system flash, adapter, and device microcode.
The CEREADME file is helpful in describing differences in diagnostics between the current version and the preceding version.
You can view the CEREADME file by using the Service Hints service aid after the diagnostics are loaded. Also, you can read the file directly from the disk using the pg command to display /usr/lpp/diagnostics/CEREADME. The CEREADME file can be copied or printed using the normal commands. For information about using the service hints, see Display Service Hints.
You can print the CEREADME file from disk using the cat command. The path to this file is as follows: /usr/lpp/diagnostics/CEREADME
cat /usr/lpp/diagnostics/CEREADME > /dev/lp0
cd /tmp/diag
If the directory does not exist, the message /tmp/diag: not found displays. Do not attempt to print the CEREADME file if this message is not displayed. To print the CEREADME file, choose the appropriate section below and follow the steps listed.
mkdir /tmp/diag
mount -o ro -v cdrfs /dev/cd0 /tmp/diag
cd /tmp/diag/usr/lpp/diagnostics
cat CEREADME > /dev/lp0
cd /tmp
unmount /dev/cd0
The CEREADME file prints on lp0, which is the printer normally attached to the parallel port. If this file is not the same as the CEREADME file on the disk, a copy of this file should be printed and stored with the Service Information.
To use CE login, ask the customer to create a unique user name and configure these characteristics for that name. After the user name is set up, you will need to obtain the user name and password from the customer to log in with these capabilities. The recommended CE login user name is qserv.
All automatic diagnostic tests run after the system unit is turned on and before the AIX operating system is loaded.
The automatic diagnostic tests display progress indicators (or checkpoints) to track test progress. If a test stops or hangs, the checkpoint for that test remains in the display to identify the unsuccessful test. The descriptions of these tests are contained in Reference code finder.
Power-On Self-Test (POST) programs check the devices needed to accomplish an initial program load. The POST also checks the memory, and portions of the central electronics complex, common interrupt handler, and the direct memory access (DMA) handler.
The configuration program determines which features, adapters, and devices are present on the system. The configuration program, which is part of the AIX operating system, builds a configuration list that is used by the diagnostic programs. This list is used to control which tests are run during system checkout.
On systems running AIX, the configuration program displays numbers in the ranges 2E6 through 9FF and 2300 through 27FF in the operator panel display (if present). See Reference code finder for a listing of program actions associated with displayed numbers. On systems running logical partitions, LPAR displays in the operator panel (if present) after the hypervisor (the system firmware that controls the allocation of resources) is loaded. When a partition running AIX is then booted, the configuration codes are displayed in the Reference code column in the management console Contents area.
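The two checkpoint ranges quoted above lend themselves to a simple range test. The following Python sketch is illustrative only and is not part of any AIX tool: the function name is hypothetical, and treating both endpoints of each range as inclusive is an assumption.

```python
# Ranges quoted in the text; inclusive endpoints are an assumption.
CONFIG_RANGES = ((0x2E6, 0x9FF), (0x2300, 0x27FF))


def is_config_checkpoint(code: str) -> bool:
    """Return True if a hexadecimal panel code falls in a configuration-program range."""
    try:
        value = int(code, 16)
    except ValueError:
        # Non-hexadecimal panel text (for example "LPAR") is not a checkpoint number.
        return False
    return any(low <= value <= high for low, high in CONFIG_RANGES)
```

A helper like this only tells you that a displayed number belongs to the configuration program. The meaning of an individual code still has to be looked up in Reference code finder.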
Devices attached to serial and parallel ports are not configured. The Dials and Lighted Program Function Keys (LPFKs) can be tested from online diagnostics after they are manually configured. No other device attached to the serial and parallel ports is supported by the diagnostics.
Except for the floating-point tests, all CPU and memory testing on the system units is done by the POST and the built-in self-test (BIST). Memory is tested entirely by the POST. The POST provides an error-free memory map. If POST cannot find enough good memory to boot, it halts and displays an error message. If POST finds enough good memory, the memory problems are logged and the system continues to boot.
If any memory errors were logged, they are reported by the base system or memory diagnostics, which must be run to analyze the POST results.
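The POST memory behavior described above can be summarized as a small decision function. This is a minimal sketch, not firmware logic: the function name, the megabyte units, and the idea of a single "required" threshold are all assumptions made for illustration.

```python
from typing import Dict, List


def post_memory_outcome(good_mb: int, required_mb: int, faults: List[str]) -> Dict[str, object]:
    """Model the POST decision: halt without enough good memory, else log faults and boot."""
    if good_mb < required_mb:
        # Not enough good memory to boot: POST halts and displays an error message.
        return {"action": "halt", "logged": [], "run_memory_diagnostics": False}
    # Enough good memory: problems are logged and the boot continues.
    logged = list(faults)
    return {
        "action": "boot",
        "logged": logged,
        # Logged errors are only reported once diagnostics analyze the POST results.
        "run_memory_diagnostics": bool(logged),
    }
```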
The CPU and memory cannot be tested after the diagnostics are loaded; however, they are monitored for correct operation by various checkers such as processor runtime diagnostics.
Single-bit memory errors are corrected by ECC (Error Checking and Correction) on systems equipped with ECC memory.
This section provides an overview of the various diagnostic programs.
To test an adapter or device, select the device or adapter from the diagnostic selection menu. The diagnostic controller then loads the diagnostic application program for the selected device or adapter.
The diagnostic application program loads and runs test units to check the functions of the device or adapter.
The diagnostic controller checks the results of the tests done by the diagnostic application and determines the action needed to continue the testing.
The amount of testing that the diagnostic application does depends on the mode (service, maintenance, or concurrent) under which the diagnostic programs are running.
If you are running the stand-alone diagnostics, error log analysis occurs on errors logged while booting the stand-alone diagnostics CD, or while running the stand-alone diagnostics.
When you select the diagnostics or advanced diagnostics option, the diagnostic selection menu is displayed (other menus might be shown before this menu). You can select the purpose for running diagnostics by using this menu.
If the error log contains recent errors (approximately the last seven days), the diagnostic programs automatically select the diagnostic application program to test the adapter or device that the error was logged against.
If there are no recent errors logged or the diagnostic application program runs without detecting an error, the diagnostic selection menu is displayed. You can select a resource for testing by using this menu.
If an error is detected while the diagnostic application program is running, the A PROBLEM WAS DETECTED screen displays a service request number (SRN).
The diagnostics provide enhanced field replaceable unit (FRU) isolation by automatically selecting associated resources. The typical way in which diagnostics select a resource is to present a list of system resources, and you are then asked to select one. Diagnostics begin with that same type of selection.
If the diagnostic application for the selected resource detects a problem with that resource, the diagnostic controller checks for an associated resource. For example, if the test of a disk drive detects a problem, the diagnostic controller tests a sibling device on the same controller. This test determines whether the drive or the controller is failing. This extra FRU isolation is apparent when you test a resource and notice that the diagnostic controller continues to test another resource that you did not select.
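The sibling test described above can be read as a two-input decision. The sketch below is one plausible interpretation, not the diagnostic controller's actual logic: the function name and return strings are hypothetical, and the rule that a failing sibling implicates the shared controller is an assumption drawn from the disk drive example.

```python
def suspect_fru(selected_failed: bool, sibling_failed: bool) -> str:
    """Suggest which FRU to suspect from the selected and sibling test results."""
    if not selected_failed:
        # No problem detected on the selected resource, so no sibling test is needed.
        return "none"
    # Assumption: if a sibling on the same controller also fails, the shared
    # controller is the likelier fault; if the sibling passes, suspect the device.
    return "controller" if sibling_failed else "device"
```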
The advanced diagnostics functions are normally used by a service representative. These diagnostics might ask you to disconnect a cable and install a wrap plug.
If a device does not show in the test list, or a diagnostic package is not loaded for a device, check it by using the display configuration and resource list task. If the device you want to test has a plus (+) sign or a minus (-) sign preceding its name, the diagnostic package is loaded. If the device has an asterisk (*) preceding its name, the diagnostic package for the device is not loaded or is not available.
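The marker convention above is easy to misread, so here it is as a lookup. This is a sketch under assumptions: the function name is hypothetical, and it assumes the marker is the first non-blank character of a resource list entry, which the text implies but does not state.

```python
from typing import Optional


def diagnostic_package_loaded(entry: str) -> Optional[bool]:
    """Interpret the marker that precedes a device name in the resource list."""
    marker = entry.lstrip()[:1]
    if marker in ("+", "-"):
        return True   # plus or minus: the diagnostic package is loaded
    if marker == "*":
        return False  # asterisk: the package is not loaded or not available
    return None       # no recognized marker on this entry
```

For example, `diagnostic_package_loaded("* hdisk0")` returns `False`, which means the diagnostic package for that entry is not loaded or is not available.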
Tasks and service aids provide a means to display data, check media, and check functions without being directed by the hardware problem determination procedure. For more information about tasks and service aids, see Tasks and service aids.
The system checkout program uses the configuration list generated by the configuration procedure to determine which devices and features to test. These tests run without interaction. To use system checkout, select All Resources on the resource selection menu.
In diagnostics versions earlier than 5.2.0, missing devices are presented on a missing resource screen. This happens as a result of running diag -a or booting online diagnostics in service mode.
In diagnostics version 5.2.0 and later, missing devices are identified on the diagnostic selection screen by an uppercase M preceding the name of the device that is missing. The diagnostic selection menu is displayed anytime you run the diagnostic routines or the advanced diagnostics routines. The diagnostic selection menu can also be entered by running diag -a when there are missing devices or missing paths to a device.
When a missing device is selected for processing, the missing resource menu checks several items. It checks whether the device is turned off, removed from the system, moved to a different physical location, or if it is still present.
When a single device is missing, the fault is probably with that device. When multiple devices with a common parent are missing, the fault is most likely related to a problem with the parent device.
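The rule of thumb above amounts to grouping missing devices by parent. The following sketch shows that grouping; the function name and the device-to-parent mapping are hypothetical inputs, and the result is a list of likely suspects rather than a diagnosis.

```python
from collections import defaultdict
from typing import Dict, List


def likely_fault_sources(missing: Dict[str, str]) -> List[str]:
    """Map missing devices (device -> parent) to the most likely fault sources."""
    by_parent = defaultdict(list)
    for device, parent in missing.items():
        by_parent[parent].append(device)

    suspects = []
    for parent, devices in by_parent.items():
        if len(devices) > 1:
            # Several missing devices share this parent: suspect the parent.
            suspects.append(parent)
        else:
            # A lone missing device is probably the fault itself.
            suspects.extend(devices)
    return sorted(suspects)
```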
The diagnostic procedure might include testing the parent of the device, analyzing which devices are missing, and any manual procedures that are required to isolate the problem.
Diagnostics also identifies a multipath I/O device as a missing device when it has multiple configured paths and all of them are missing. If some, but not all, paths to a multipath I/O device are missing, then diagnostics identifies those paths as missing. In that case, an uppercase P is displayed in front of the multipath I/O device.
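The M and P markers for multipath I/O devices follow directly from how many paths are missing. This sketch is illustrative: the function name is hypothetical, and it deliberately models only devices with two or more configured paths, because the text describes only that case.

```python
def multipath_marker(configured_paths: int, missing_paths: int) -> str:
    """Return the marker shown for a multipath I/O device: "M", "P", or ""."""
    if configured_paths < 2:
        raise ValueError("this sketch only models devices with multiple configured paths")
    if not 0 <= missing_paths <= configured_paths:
        raise ValueError("missing_paths must be between 0 and configured_paths")
    if missing_paths == configured_paths:
        return "M"  # every path is missing: the device itself is treated as missing
    if missing_paths > 0:
        return "P"  # only some paths are missing
    return ""       # nothing is missing, so no marker
```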
The diagela program determines whether the error should be analyzed by the diagnostics. If the error should be analyzed, a diagnostic application is invoked and the error is analyzed. No testing is done if the diagnostics determine that the error requires a service action. Instead, the program sends a message to your console and to either the Service Management applications (on systems with a management console) or all system groups. The message contains the SRN.
Running diagnostics in this mode is similar to using the diag -c -e -d Device command.
Notification can also be customized by adding a stanza to the PDiagAtt object class. The following example illustrates how a program can be invoked in place of the normal mail message. The example also shows that you can send the message to the Service Management application when there is no HMC.
PDiagAtt:
DClass = " "
DSClass = " "
DType = " "
attribute = "diag_notify"
value = "/usr/bin/customer_notify_program $1 $2 $3 $4 $5"
rep = "s"
If DClass, DSClass, and DType are blank, then the customer_notify_program applies for all devices. If you enter specifics in the DClass, DSClass, and DType the customer_notify_program is invoked only for that device type.
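The matching rule for the DClass, DSClass, and DType fields can be sketched as a wildcard comparison. This is an illustration, not ODM code: the function name and the dictionary representation are hypothetical, and treating each blank field as an independent wildcard is an assumption, since the text only covers the cases where all three fields are blank or all three are filled in.

```python
from typing import Dict

MATCH_FIELDS = ("DClass", "DSClass", "DType")


def stanza_applies(stanza: Dict[str, str], device: Dict[str, str]) -> bool:
    """Return True if a PDiagAtt-style stanza applies to the given device."""
    for field in MATCH_FIELDS:
        wanted = stanza.get(field, "").strip()
        # A blank (empty or whitespace-only) field matches any device.
        if wanted and wanted != device.get(field, ""):
            return False
    return True
```

With all three fields set to `" "`, as in the stanza above, `stanza_applies` returns `True` for every device, which corresponds to the program being invoked for all devices.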
If no diagnostic program is found to analyze the error log entry, or analysis is done but no error was reported, a separate program can be specified to be invoked. This is accomplished by adding a stanza to the PDiagAtt object class with an attribute = diag_analyze. The following example illustrates how a customer's program can be invoked for this condition:
PDiagAtt:
DClass = " "
DSClass = " "
DType = " "
attribute = "diag_analyze"
value = "/usr/bin/customer_analyzer_program $1 $2 $3 $4 $5"
rep = "s"
If DClass, DSClass, and DType are blank, then the customer_analyzer_program applies for all devices. Specifying the DClass, DSClass, and DType with details causes the customer_analyzer_program to be invoked only for that device type.
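To enable the diagela program, run the following command:
/usr/lpp/diagnostics/bin/diagela ENABLE
To disable the diagela program, run the following command:
/usr/lpp/diagnostics/bin/diagela DISABLE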
The diagela program can also be enabled and disabled using the periodic diagnostic service aid.
The diagnostics perform error log analysis on most resources. The default time for error log analysis is seven days; however, this time can be changed from 1 to 60 days by using the display or change diagnostic run time options task. To prevent false problems from being reported when error log analysis is run, repair actions need to be logged whenever a FRU is replaced. A repair action can be logged by using the log repair action task or by running advanced diagnostics in system verification mode.
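The effect of the analysis window can be shown with a small filter. This is a sketch under assumptions, not how the diagnostics read the error log: the function name is hypothetical, and error log entries are reduced to bare timestamps.

```python
from datetime import datetime, timedelta
from typing import List

DEFAULT_WINDOW_DAYS = 7  # default error log analysis time stated in the text


def errors_in_window(timestamps: List[datetime], now: datetime,
                     window_days: int = DEFAULT_WINDOW_DAYS) -> List[datetime]:
    """Return the error timestamps that fall inside the analysis window."""
    if not 1 <= window_days <= 60:
        # The text allows the window to be changed from 1 to 60 days.
        raise ValueError("window_days must be between 1 and 60")
    cutoff = now - timedelta(days=window_days)
    return [t for t in timestamps if cutoff <= t <= now]
```

With the default window, an error logged 11 days ago is ignored. Widening the window to 60 days brings it back into scope.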
The log repair action task lists all resources. Replaced resources can be selected from the list, and when commit (F7 key) is selected, a repair action is logged for each selected resource.
Learn about obtaining machine code updates for your management console, server firmware, and I/O adapters and devices, as well as operating system updates.
Updates provide changes to your software, Licensed Internal Code, or machine code that fix known problems, add new function, and keep your server or management console operating efficiently. For example, you might install updates for your operating system in the form of a program temporary fix (PTF). Or, you might install a server firmware update with code changes that are needed to support new hardware or new functions of the existing hardware.
A good update strategy is an important part of maintaining and managing your server. If you have a dynamic environment that changes frequently, install updates on a regular basis. If you have a stable environment, you do not have to install updates as frequently. However, you should consider installing updates whenever you make any major software or hardware changes in your environment.
You can get updates using various methods, depending on your service environment. For example, if you use an HMC to manage your server, you can use the HMC interface to download, install, and manage your HMC and firmware updates. If you do not use an HMC to manage your server, you can use the functions specific to your operating system to get your updates. In addition, you can download or order many updates through Internet websites.
You must manage several types of updates to maintain your hardware. The following figure shows the different types of hardware and software that might require updates.

Learn about the Hardware Management Console (HMC) graphical user interface.
The HMC provides a menu (also called the context menu) for quick access to menu choices. The menu lists the actions found in the Selected and Object menus for the current object or objects.
The user interface provided with the Hardware Management Console (HMC) uses navigation that provides hierarchical views of system resources and tasks. This user interface is made up of several major components: the banner, the navigation pane, the work pane, the task bar, and the status bar. The following sections describe each of these components.
Some systems support the system identify indicator, the system fault indicator, or both.
The system identify indicator is used to help physically identify a particular system in a room. The system fault indicator is used to help physically identify a particular system that has a fault condition.
On a system that supports the system fault indicator, the indicator is set to the fault condition when a fault is detected. After the problem with the system is fixed, the system fault indicator must be set back to normal. This is done by using the log repair action task. For more information, see Log repair action.
Both of these indicator functions can be managed by using the system identify indicator and system fault indicator tasks. For more information, see System Fault Indicator or System Identify Indicator.
An advanced feature of many systems is array bit steering. The processors in these systems have internal cache arrays with extra memory capacity that can be configured to correct certain types of array faults.
This reconfiguration can be used to correct arrays for faults detected at IPL or run time. If a fault is detected during run time, the recoverable fault is reported with a Repair Disposition Pending Reboot indicator set. This setting allows diagnostics to call out a service request number that identifies the array and directs the service representative to a MAP for problem resolution that uses array bit steering. If array bit steering cannot be used for the reported fault, then the FRU with that array is replaced.
Enhanced I/O Error Handling (EEH) is an error recovery strategy for errors that can occur during I/O operations on the PCI bus. Not all systems support EEH; if you get an SRN involving an EEH error, follow the action listed.