Linux problem isolation procedure

Use this procedure when servicing a Linux® partition or a server that has Linux as its only operating system.

About this task

DANGER
When working on or around the system, observe the following precautions:

Electrical voltage and current from power, telephone, and communication cables are hazardous. To avoid a shock hazard:

  • If IBM supplied the power cord(s), connect power to this unit only with the IBM provided power cord. Do not use the IBM provided power cord for any other product.
  • Do not open or service any power supply assembly.
  • Do not connect or disconnect any cables or perform installation, maintenance, or reconfiguration of this product during an electrical storm.
  • The product might be equipped with multiple power cords. To remove all hazardous voltages, disconnect all power cords. For AC power, disconnect all power cords from their AC power source. For racks with a DC power distribution panel (PDP), disconnect the customer’s DC power source to the PDP.
  • When connecting power to the product ensure all power cables are properly connected. For racks with AC power, connect all power cords to a properly wired and grounded electrical outlet. Ensure that the outlet supplies proper voltage and phase rotation according to the system rating plate. For racks with a DC power distribution panel (PDP), connect the customer’s DC power source to the PDP. Ensure that the proper polarity is used when attaching the DC power and DC power return wiring.
  • Connect any equipment that will be attached to this product to properly wired outlets.
  • When possible, use one hand only to connect or disconnect signal cables.
  • Never turn on any equipment when there is evidence of fire, water, or structural damage.
  • Do not attempt to switch on power to the machine until all possible unsafe conditions are corrected.
  • When performing a machine inspection:
    • Assume that an electrical safety hazard is present.
    • Perform all continuity, grounding, and power checks specified during the subsystem installation procedures to ensure that the machine meets safety requirements.
    • Do not attempt to switch power to the machine until all possible unsafe conditions are corrected.
    • Before you open the device covers, unless instructed otherwise in the installation and configuration procedures: Disconnect the attached AC power cords, turn off the applicable circuit breakers located in the rack power distribution panel (PDP), and disconnect any telecommunications systems, networks, and modems.
  • Connect and disconnect cables as described in the following procedures when installing, moving, or opening covers on this product or attached devices.

    To Disconnect:
      1. Turn off everything (unless instructed otherwise).
      2. For AC power, remove the power cords from the outlets.
      3. For racks with a DC power distribution panel (PDP), turn off the circuit breakers located in the PDP and remove the power from the Customer's DC power source.
      4. Remove the signal cables from the connectors.
      5. Remove all cables from the devices.

    To Connect:
      1. Turn off everything (unless instructed otherwise).
      2. Attach all cables to the devices.
      3. Attach the signal cables to the connectors.
      4. For AC power, attach the power cords to the outlets.
      5. For racks with a DC power distribution panel (PDP), restore the power from the Customer's DC power source and turn on the circuit breakers located in the PDP.
      6. Turn on the devices.

  • Sharp edges, corners and joints may be present in and around the system. Use care when handling equipment to avoid cuts, scrapes and pinching. (D005)

These procedures define the steps to take when servicing a Linux partition or a server that has Linux as its only operating system.

Before continuing with this procedure, review the additional software available to enhance your Linux solutions. See Service and productivity tools for PowerLinux servers.

Note: If the server is attached to a management console, the various codes that might display on the management console are all listed as reference codes by Service Focal Point (SFP). Use the following table to help you identify the type of error information that might be displayed when you are using this procedure.

Number of digits in reference code   Reference code             Name or code type
Any                                  Contains # (number sign)   Menu goal
Any                                  Contains - (hyphen)        Service request number (SRN)
5                                    Does not contain # or -    SRN
8                                    Does not contain # or -    System reference code (SRC)
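
The table's rules can be sketched as a small shell function. This is an illustrative sketch only: classify_ref_code is a hypothetical helper name, not an IBM-provided tool, and it assumes the # test takes priority when a code contains both # and -, which the table does not specify.

```shell
# classify_ref_code: hypothetical helper that applies the table's rules.
# Prints "menu goal", "SRN", "SRC", or "unknown" for the code given as $1.
classify_ref_code() {
    code=$1
    case $code in
        *'#'*) echo "menu goal" ;;   # any length, contains a number sign
        *-*)   echo "SRN" ;;         # any length, contains a hyphen
        *)
            # No # or -: decide by the number of characters.
            case ${#code} in
                5) echo "SRN" ;;
                8) echo "SRC" ;;
                *) echo "unknown" ;; # length not covered by the table
            esac
            ;;
    esac
}
```

For example, calling classify_ref_code with any 8-character string that contains neither # nor - prints SRC. Lengths that the table does not cover are reported as unknown rather than guessed.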

Procedure

  1. Is the server managed by a management console that is running Service Focal Point (SFP)?
    No
    Go to step 3.
    Yes
    Go to step 2.
  2. Servers with Service Focal Point

    Look at the service action event log in SFP for errors. Focus on those errors with a timestamp near the time at which the error occurred. Follow the steps indicated in the error log entry to resolve the problem. If the problem is not resolved, continue with step 3.

  3. Look for and record all reference code information or software messages on the operator panel and in the service processor error log (which is accessible by viewing the ASMI menus).
  4. Choose a Linux partition that is running correctly (preferably the partition with the problem).

    Is Linux usable in any partition with Linux installed?

    No
    Go to step 10.
    Yes
    Go to step 5.
  5. Diagnose the RTAS events. For instructions, see Diagnosing RTAS events.
  6. Record any RTAS events found in the Linux system log.

    If the system is configured with more than one logical partition with Linux installed, repeat step 5 and step 6 for all logical partitions that have Linux installed.

  7. Examine the Linux boot (IPL) log by logging in to the system as the root user and entering the following command:

    grep RTAS /var/log/boot.msg | more

    Linux boot (IPL) error messages are logged in the boot.msg file under /var/log. The following is an example of the Linux boot error log:
    RTAS daemon started
    RTAS: -------- event-scan begin --------
    RTAS: Location Code: U0.1-F3
    RTAS: WARNING: (FULLY RECOVERED) type: SENSOR
    RTAS: initiator: UNKNOWN target: UNKNOWN
    RTAS: Status: bypassed new
    RTAS: Date/Time: 20020830 14404000
    RTAS: Environment and Power Warning
    RTAS: EPOW Sensor Value: 0x00000001
    RTAS: EPOW caused by fan failure
    RTAS: -------- event-scan end ----------
  8. Record any RTAS events found in the Linux boot (IPL) log in step 7.
    Ignore all other events in the Linux boot (IPL) log. If the system is configured with more than one logical partition with Linux installed, repeat step 7 and step 8 for all logical partitions that have Linux installed.
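
To record only the RTAS events from step 7 and step 8, the event-scan blocks can be isolated from the rest of the boot log. The following is a minimal sketch, assuming each event is delimited by the event-scan begin and event-scan end marker lines shown in the example log. The name rtas_event_blocks is hypothetical.

```shell
# rtas_event_blocks: hypothetical helper that prints every line from an
# "event-scan begin" marker through the matching "event-scan end" marker.
# $1 is the log file; it defaults to /var/log/boot.msg.
rtas_event_blocks() {
    awk '/RTAS: -------- event-scan begin/, /RTAS: -------- event-scan end/' \
        "${1:-/var/log/boot.msg}"
}
```

Running rtas_event_blocks with no argument reads /var/log/boot.msg. Redirecting the output to a file gives a record of the events for each partition.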
  9. Record any extended data found in the Linux system log in step 5 or the Linux boot (IPL) log in step 7.
    Note: The lines in the Linux extended data that begin with <4>RTAS: Log Debug: 04 contain the reference code in the next 8 hexadecimal characters. For example, if those characters are 4b27 26fb, the reference code is 4b27 26fb. The reference code is also known as word 11. Each 4 bytes after the reference code in the Linux extended data is another word (for example, 04a0 0011 is word 12, 702c 0014 is word 13, and so on).

    If the system is configured with more than one logical partition with Linux installed, repeat step 9 for all logical partitions that have Linux installed.
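
The word numbering described in the note for step 9 can be sketched as a filter that splits the hexadecimal data into 8-character words, starting at word 11. This is an illustration under stated assumptions: rtas_words is a hypothetical name, and the sketch assumes that all words of interest follow the Log Debug: 04 prefix on the same line.

```shell
# rtas_words: hypothetical helper. Reads extended data lines on stdin,
# keeps only lines containing "RTAS: Log Debug: 04", and prints each
# group of 8 hexadecimal characters as "word N: ...", starting at 11.
rtas_words() {
    sed -n 's/^.*RTAS: Log Debug: 04 *//p' | tr -d ' \t' | awk '
        {
            n = 11                        # the reference code is word 11
            for (i = 1; i + 7 <= length($0); i += 8) {
                printf "word %d: %s\n", n, substr($0, i, 8)
                n++
            }
        }'
}
```

Feeding it a line built from the values quoted in the note (an assembled sample, not a captured log), such as <4>RTAS: Log Debug: 04 4b2726fb04a00011702c0014, prints 4b2726fb as word 11, 04a00011 as word 12, and 702c0014 as word 13.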

  10. Were any reference codes or checkpoints recorded in steps 3, 6, 8, or 9?
    No
    Go to step 11.
    Yes
    Go to the Linux fast-path problem isolation with each reference code that was recorded. Perform the indicated actions one at a time for each reference code until the problem has been corrected. If all recorded reference codes have been processed and the problem has not been corrected, go to step 11.
  11. If no additional error information is available and the problem has not been corrected, complete the following steps:
    1. Shut down the system.
    2. If a management console is not attached, see Managing your server using the Advanced System Management Interface for instructions to access the ASMI.
      Note: The ASMI functions can also be accessed by using a personal computer connected to system port 1.

      You need a personal computer capable of connecting to system port 1 on the system unit. (The Linux login prompt cannot be seen on a personal computer connected to system port 1.) If the ASMI functions are not otherwise available, use the following procedure:

      1. Attach the personal computer and cable to system port 1 on the system unit.
      2. With 01 displayed in the operator panel, press a key on the virtual terminal on the personal computer. The service ASMI menus are available on the attached personal computer.
      3. If the service processor menus are not available on the personal computer, perform the following steps:
        1. Examine and correct all connections to the service processor.
        2. Replace the service processor.
          Note: The service processor might be contained on a separate card or board; in some systems, the service processor is built into the system backplane. Contact your next level of support for help before replacing a system backplane.
    3. Examine the service processor error log.
      Record all reference codes and messages written to the service processor error log. Go to step 12.
  12. Were any reference codes recorded in step 11?
    No
    Go to step 20.
    Yes
    Go to the Linux fast-path problem isolation with each reference code or symptom you have recorded. Perform the indicated actions, one at a time, until the problem has been corrected. If all recorded reference codes have been processed and the problem has not been corrected, go to step 20.
  13. Reboot the system and bring all partitions to the login prompt.
    If Linux is not usable in all partitions, go to step 17.
  14. Use the lscfg command to list all resources assigned to all partitions.
    Record the adapter and the partition for each resource.
  15. To determine whether any devices or adapters are missing, compare the list of partition assignments, and resources found, to the customer's known configuration. Record the location of any missing devices.
    Also record any differences in the descriptions or the locations of devices.

    You may also compare this list of resources that were found to an earlier version of the device tree as follows:

    Note: At the Linux command prompt, type vpdupdate, and press Enter. The device tree is stored in the /var/lib/lsvpd/ directory in a file with the file name device-tree-YYYY-MM-DD-HH:MM:SS, where YYYY is the year, MM is the month, DD is the day, and HH, MM, and SS are the hour, minute and second, respectively, of the date of creation.
    • At the command line, type the following:
      cd /var/lib/lsvpd/
    • At the command line, type the following:
      lscfg -vpz /var/lib/lsvpd/<file_name>

      where <file_name> is the name of the .gz file that contains the database archive.

    The diff command offers a way to compare the output from a current lscfg command to the output from an older lscfg command. If the file names for the current and old device trees are current.out and old.out, respectively, type: diff old.out current.out. Lines that exist in the old file but not in the current file are listed, preceded by a less-than symbol (<). Lines that exist in the current file but not in the old file are listed, preceded by a greater-than symbol (>). Lines that are the same in both files are not listed; for example, identical files produce no output from the diff command. If a location or description has changed, the old line is listed with < and the new line is listed with >.

    If the system is configured with more than one logical partition with Linux installed, repeat step 14 and step 15 for all logical partitions that have Linux installed.
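
The diff comparison described in step 15 can be wrapped in a small function that labels each difference. This is a sketch, not part of the documented procedure: compare_lscfg is a hypothetical name, and it assumes both files were captured with the same lscfg options so that only real configuration changes appear.

```shell
# compare_lscfg: hypothetical helper. $1 is the older lscfg output and
# $2 is the current lscfg output (for example, old.out and current.out).
# Lines that diff marks with "<" exist only in the old file; lines that
# diff marks with ">" exist only in the current file.
compare_lscfg() {
    diff "$1" "$2" | awk '
        /^</ { print "old only: " substr($0, 3) }
        /^>/ { print "current only: " substr($0, 3) }'
}
```

Calling compare_lscfg old.out current.out prints nothing when the two listings are identical. A device that has disappeared shows up as an old only line, and a changed location or description shows up as one old only line and one current only line.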

  16. Was the location of one and only one device recorded in step 15?
    No
    If you previously answered Yes to step 16, return the system to its original configuration and go to MAP 0410: Repair checkout. This ends the procedure.

    If you did not previously answer Yes to step 16, go to step 17.

    Yes
    Complete the following steps one at a time. Power off the system before each step. After each step, power on the system and go to step 13.
    1. Check all connections from the system to the device.
    2. Replace the device (for example, tape or DASD).
    3. If applicable, replace the device backplane.
    4. Replace the device cable.
    5. Replace the adapter.
      • If the adapter resides in an I/O drawer, replace the I/O backplane.
      • If the device adapter resides in the CEC, replace the I/O riser card, or the CEC backplane in which the adapter is plugged.
    6. Call service support. Do not go to step 13.
  17. Does the system appear to stop or hang before reaching the login prompt, or did you record any problems with resources in step 15?
    Note: If the system console or VTERM window is always blank, answer No. If you are sure the console or VTERM is operational and connected correctly, answer the question for this step.
    No
    Go to step 18.
    Yes
    There might be a problem with an I/O device. Go to PFW1542: I/O problem isolation procedure. When instructed to boot the system, boot a full system partition.
  18. Boot the eServer™ standalone diagnostics. For instructions, see Running the online and stand-alone diagnostics.
    Run diagnostics in problem determination mode on all resources. Be sure to boot a full system partition. Ensure that diagnostics were run on all known resources. You may need to select each resource individually and run diagnostics on each resource one at a time.
    Did standalone diagnostics find a problem?
    No
    Go to step 22.
    Yes
    Go to the Reference codes and perform the actions for each reference code you have recorded. For each reference code not already processed in step 16, repeat this action until the problem has been corrected. Perform the indicated actions, one at a time. If all recorded reference codes have been processed and the problem has not been corrected, go to step 22.
  19. Does the system have Linux installed on one or more partitions?
    No
    Return to the Beginning problem analysis.
    Yes
    Go to step 3.
  20. Were any location codes recorded in steps 3, 6, 8, 9, 10, or 11?
    No
    Go to step 13.
    Yes
    Replace, one at a time, all parts whose location code was recorded in steps 3, 6, 8, 9, 10, or 11 that have not been replaced. Power off the system before replacing a part. After replacing the part, power on the system to check if the problem has been corrected. Go to step 21 when the problem has been corrected, or all parts in the location codes list have been replaced.
  21. Was the problem corrected in step 20?
    No
    Go to step 13.
    Yes
    Return the system to its original configuration and go to MAP 0410: Repair checkout. This ends the procedure.

  22. Were any other symptoms recorded in step 3?
    No
    Call support.
    Yes
    Go to the Beginning problem analysis with each symptom you have recorded. Perform the indicated actions for all recorded symptoms, one at a time, until the problem has been corrected. If all recorded symptoms have been processed and the problem has not been corrected, call your next level of support.