AIX and Linux problem analysis

You can use this procedure to find information about a problem with your server hardware when service is managed by the AIX® or Linux operating system.

Remember the following points while troubleshooting problems:
  • Has an external power outage or momentary power loss occurred?
  • Has the hardware configuration changed?
  • Has system software been added?
  • Have any new programs or program updates (including PTFs) been installed recently?

Before you use this procedure, ensure that you performed the steps in Beginning problem analysis.

After reviewing these considerations, follow these steps:

  1. Is the operating system operational?
    • Yes: Continue with the next step.
    • No: Go to step 14.
  2. Are any messages related to this problem (for example, a message that a device is not available or is reporting errors) displayed on the system console, or sent to you in email, that provide a reference code?
    Note: A reference code can be an 8-character system reference code (SRC) or a service request number (SRN) of 5, 6, or 7 characters, with or without a hyphen.
    • Yes: Continue with the next step.
    • No: Go to step 4.
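
The note in step 2 describes reference codes only by length, so a quick check of a code copied from a message can be sketched as follows. This is an illustrative sketch, not an IBM tool: the function name `classify_reference_code` is hypothetical, the characters are assumed to be alphanumeric, and the hyphen is assumed not to count toward the 5, 6, or 7 characters of an SRN.

```python
import re


def classify_reference_code(code: str) -> str:
    """Return 'SRC', 'SRN', or 'unknown' based only on the length rules in the note."""
    text = code.strip()

    # An SRC is described as 8 characters (assumed alphanumeric, no hyphen).
    if re.fullmatch(r"[0-9A-Za-z]{8}", text):
        return "SRC"

    # An SRN is described as 5, 6, or 7 characters, with or without a hyphen.
    # Assumption: at most one hyphen, not at either end, and not counted in the length.
    if text.count("-") <= 1 and not text.startswith("-") and not text.endswith("-"):
        body = text.replace("-", "")
        if re.fullmatch(r"[0-9A-Za-z]{5,7}", body):
            return "SRN"

    return "unknown"
```

A result of "unknown" means only that the text does not fit the lengths in the note; it does not mean that no reference code was reported.
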
  3. The reference code description might provide information or an action that you can take to correct the failure.
    Use the search function of IBM® Knowledge Center to find the reference code details. The search function is located in the upper-left corner of IBM Knowledge Center. Read the reference code description and return here. Do not take any other action at this time.

    For more information about reference codes, see Reference codes.

    If the reference code description provides information to resolve the problem without replacing FRUs in the failing item list, perform the steps that the description provides.

    Were you able to resolve the problem?

    • Yes: This ends the procedure.
    • No: Continue with the next step.
  4. Are you running Linux?
    • Yes: Continue with the next step.
    • No: Go to step 9.
  5. Do you suspect a problem with a 3D graphics adapter?
  6. Do you suspect a problem with a PCIe3 1.6 TB NVMe Flash adapter (FC EC54 and EC55; CCIN 58CB) or a PCIe3 3.2 TB NVMe Flash adapter (FC EC56 and EC57; CCIN 58CC)?
  7. Do you suspect a problem with a PCIe3 1.92 TB CAPI NVMe Flash accelerator adapter (FC EJ1K; CCIN 58CD)?
  8. To locate the error information in a system or logical partition running the Linux operating system, complete these steps:
    Note: Before proceeding with this step, ensure that the diagnostics package is installed on the system.
    1. Log in as root user.
    2. At the command line, type grep RTAS /var/log/platform and press Enter.
    3. Look for the most recent entry that contains a reference code.

    Continue with step 11.
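
The search in step 8 can also be sketched in Python, for example when the log has been copied to another machine for review. This is a minimal sketch under two assumptions: newer entries are appended at the end of the file, and a matching line is any line that contains the text RTAS (the same case-sensitive match that grep performs). The function names are hypothetical.

```python
from typing import Iterable, List, Optional


def rtas_entries(lines: Iterable[str]) -> List[str]:
    """Return every line that contains 'RTAS', in file order (like grep RTAS)."""
    return [line.rstrip("\n") for line in lines if "RTAS" in line]


def most_recent_rtas(lines: Iterable[str]) -> Optional[str]:
    """Return the last matching line, assuming the log is written oldest to newest."""
    entries = rtas_entries(lines)
    return entries[-1] if entries else None


def most_recent_rtas_in_file(path: str = "/var/log/platform") -> Optional[str]:
    """Read the log at the given path (root access is normally required) and search it."""
    with open(path, errors="replace") as handle:
        return most_recent_rtas(handle)
```

The returned line still has to be read by a person to find the reference code; the sketch does not parse the entry.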

  9. To locate the error information in a system or logical partition running AIX, complete these steps:
    1. Log in to the AIX operating system as root user, or use CE login. If you need help, contact the system administrator.
    2. Type diag to load the diagnostic controller, and display the online diagnostic menus.
    3. From the Function selection menu, select Task selection.
    4. From the Task selection list menu, select Display previous diagnostic results.
    5. From the Previous diagnostic results menu, select Display diagnostic log summary.
    Continue with the next step.
  10. The diagnostic log is displayed as a time-ordered table of events from the error log.

    Look in the T column for the most recent entry that has an S entry. Press Enter to select the row in the table and then select Commit.

    The details of the selected entry are displayed. Look for the SRN entry near the end of the details and record the information shown.

    Continue with the next step.

  11. Did you find a serviceable event or an open problem near the time of the failure?
    • Yes: Continue with the next step.
    • No: Contact your hardware service provider. This ends the procedure.
  12. The reference code description might provide information or an action that you can take to correct the failure.
    Use the search function of IBM Knowledge Center to find the reference code details. The search function is located in the upper-left corner of IBM Knowledge Center. Read the reference code description and return here. Do not take any other action at this time.

    For more information about reference codes, see Reference codes.

    Was there a reference code description that enabled you to resolve the problem?

    • Yes: This ends the procedure.
    • No: Continue with the next step.
  13. Service is required to resolve the error. Collect as much error data as possible and record it. You and your service provider will develop a corrective action to resolve the problem based on the following guidelines:
    • If a field-replaceable unit (FRU) location code is provided in the serviceable event view or control panel, use that location to determine which FRU to replace.
    • If an isolation procedure is listed for the reference code in the reference code lookup information, include the isolation procedure as a corrective action even if it is not listed in the serviceable event view or control panel.
    • If any FRUs are marked for block replacement, replace all FRUs in the block replacement group at the same time.
    From the Error Event Log view, complete the following steps:
    1. Record the reference code.
    2. Record the error details.
    3. Contact your service provider.

    This ends the procedure.

  14. Details about errors that occur when the operating system is not running, or when the operating system is no longer accessible, can be found in the control panel or in the Advanced System Management Interface (ASMI).

    Do you want to look for error details by using the ASMI?

    • Yes: Go to step 16.
    • No: Continue with the next step.
  15. At the control panel, complete the following steps:
    1. Press the increment or decrement button until the number 11 is displayed in the upper-left corner of the display.
    2. Press Enter to display the contents of function 11.
    3. Look for a reference code in the upper-right corner.

    Is a reference code displayed on the control panel in function 11?

    • Yes: Go to step 17.
    • No: Contact your hardware service provider. This ends the procedure.
  16. On the console connected to the ASMI, complete the following steps:
    Note: If you are unable to locate the reported problem, and there is more than one open problem near the time of the reported failure, use the earliest problem in the log.
    1. Log in with a user ID that has an authority level of general, administrator, or authorized service provider.
    2. In the navigation area, expand System Service Aids and click Error/Event Logs. If log entries exist, a list of error and event log entries is displayed in a summary view.
    3. Scroll through the log under Serviceable Customer Attention Events and verify that there is a problem that corresponds with the failure.

    For information about the ASMI, see Managing the Advanced System Management Interface.

    Did you find a serviceable event or an open problem near the time of the failure?

    • Yes: Continue with the next step.
    • No: Contact your hardware service provider. This ends the procedure.
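
The note in step 16 says to use the earliest problem when more than one open problem is logged near the time of the failure. That rule can be sketched as follows. This is an illustration only: the ASMI does not export data in this shape, the (timestamp, reference code) tuple is a hypothetical record, and the one-hour window is an assumed meaning of "near the time of the failure".

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Hypothetical record shape: (time the entry was logged, reference code text).
LogEntry = Tuple[datetime, str]


def earliest_problem_near(
    entries: List[LogEntry],
    failure_time: datetime,
    window: timedelta = timedelta(hours=1),  # assumed definition of "near"
) -> Optional[LogEntry]:
    """Return the earliest entry logged within the window around the failure, if any."""
    near = [entry for entry in entries if abs(entry[0] - failure_time) <= window]
    if not near:
        return None
    # When several problems qualify, the note says to use the earliest one.
    return min(near, key=lambda entry: entry[0])
```

If the function returns None, no entry falls inside the chosen window, which corresponds to the "No" answer of step 16.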
  17. The reference code description might provide information or an action that you can take to correct the failure.
    Use the search function of IBM Knowledge Center to find the reference code details. The search function is located in the upper-left corner of IBM Knowledge Center. Read the reference code description and return here. Do not take any other action at this time.

    For more information about reference codes, see Reference codes.

    Was there a reference code description that enabled you to resolve the problem?

    • Yes: This ends the procedure.
    • No: Continue with the next step.
  18. Service is required to resolve the error. Collect as much error data as possible and record it. You and your service provider will develop a corrective action to resolve the problem based on the following guidelines:
    • If a field-replaceable unit (FRU) location code is provided in the serviceable event view or control panel, use that location to determine which FRU to replace.
    • If an isolation procedure is listed for the reference code in the reference code lookup information, include the isolation procedure as a corrective action even if it is not listed in the serviceable event view or control panel.
    • If any FRUs are marked for block replacement, replace all FRUs in the block replacement group at the same time.

    To find error details on the control panel, complete the following steps:

    1. Press Enter to display the contents of function 14. If data is available in function 14, the reference code has a FRU list.
    2. Record the information in functions 11 through 20 on the control panel.
    3. Contact your service provider and report the reference code and other information.

    To find error details on the ASMI, complete the following steps from the Error Event Log view:

    1. Record the reference code.
    2. Select the corresponding check box on the log and click Show details.
    3. Record the error details.
    4. Contact your service provider.

    This ends the procedure.




Last updated: Tue, October 17, 2017