Use this procedure when servicing a Linux partition
or a server that has Linux as its only operating system.
DANGER
When working on or around the system,
observe the following precautions:
Electrical voltage and current
from power, telephone, and communication cables are hazardous. To
avoid a shock hazard:
- Connect power to this unit only with the IBM provided
power cord. Do not use the IBM provided power
cord for any other product.
- Do not open or service any power supply assembly.
- Do not connect or disconnect any cables or perform installation,
maintenance, or reconfiguration of this product during an electrical
storm.
- The product might be equipped with multiple power cords. To remove
all hazardous voltages, disconnect all power cords.
- Connect all power cords to a properly wired and grounded electrical
outlet. Ensure that the outlet supplies proper voltage and phase
rotation according to the system rating plate.
- Connect any equipment that will be attached to this product to
properly wired outlets.
- When possible, use one hand only to connect or disconnect signal
cables.
- Never turn on any equipment when there is evidence of fire, water,
or structural damage.
- Disconnect the attached power cords, telecommunications systems,
networks, and modems before you open the device covers, unless instructed
otherwise in the installation and configuration procedures.
- Connect and disconnect cables as described in the following procedures
when installing, moving, or opening covers on this product or attached
devices.
To Disconnect:
- Turn off everything (unless instructed otherwise).
- Remove the power cords from the outlets.
- Remove the signal cables from the connectors.
- Remove all cables from the devices.
To Connect:
- Turn off everything (unless instructed otherwise).
- Attach all cables to the devices.
- Attach the signal cables to the connectors.
- Attach the power cords to the outlets.
- Turn on the devices.
(D005)
These procedures define the steps to take when servicing a Linux partition or a server that has Linux as its only operating system.
Before continuing with this procedure, review the additional software available to enhance your Linux solutions. This software is available at the Linux on POWER® website (http://www14.software.ibm.com/webapp/set2/sas/f/lopdiags).
Note: If the server is attached to
a management console, the various
codes that might display on the management console are all listed as
reference codes by Service Focal Point™ (SFP). Use the following table to help you identify the
type of error information that might be displayed when you are using
this procedure.
| Number of digits in reference code | Reference code | Name or code type |
|---|---|---|
| Any | Contains # (number sign) | Menu goal |
| Any | Contains - (hyphen) | Service request number (SRN) |
| 5 | Does not contain # or - | SRN |
| 8 | Does not contain # or - | System reference code (SRC) |
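The table's rules can be sketched as a small shell function. This is only an illustration, not part of any IBM tool: the function name `classify_refcode` is hypothetical, "number of digits" is taken to mean the total character count of the code, and the number sign is checked before the hyphen because the table does not say which rule wins when a code contains both.

```shell
# classify_refcode CODE
# Prints the code type from the table: "Menu goal", "SRN", or "SRC".
# Prints "Unknown" when no table row matches.
# Assumption: "number of digits" means the total character count.
classify_refcode() {
    code=$1
    case $code in
        *'#'*) echo "Menu goal" ;;   # any length, contains a number sign
        *-*)   echo "SRN" ;;         # any length, contains a hyphen
        *)
            case ${#code} in
                5) echo "SRN" ;;     # 5 characters, no # or -
                8) echo "SRC" ;;     # 8 characters, no # or -
                *) echo "Unknown" ;;
            esac
            ;;
    esac
}
```

For example, `classify_refcode 4b2726fb` prints `SRC`, because the code has 8 characters and contains neither a number sign nor a hyphen.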
1. Is the server managed by a management console that is running Service Focal Point (SFP)?
   - No: Go to step 3.
   - Yes: Go to step 2.
2. Servers with Service Focal Point: Look at the service action event log in SFP for errors. Focus on those errors with a timestamp near the time at which the problem occurred. Follow the steps indicated in the error log entry to resolve the problem. If the problem is not resolved, continue with step 3.
3. Look for and record all reference code information or software messages on the operator panel and in the service processor error log (which is accessible by viewing the ASMI menus).
4. Choose a Linux partition that is running correctly (preferably the partition with the problem). Is Linux usable in any partition with Linux installed?
   - No: Go to step 10.
   - Yes: Go to step 5.
5. Diagnose the RTAS events. For instructions, see Diagnosing RTAS events.
6. Record any RTAS events found in the Linux system log. If the system is configured with more than one logical partition with Linux installed, repeat step 5 and step 6 for all logical partitions that have Linux installed.
7. Examine the Linux boot (IPL) log by logging in to the system as the root user and entering the following command:
   cat /var/log/boot.msg | grep RTAS | more
   Linux boot (IPL) error messages are logged in the boot.msg file under /var/log. An example of the Linux boot error log:
RTAS daemon started
RTAS: -------- event-scan begin --------
RTAS: Location Code: U0.1-F3
RTAS: WARNING: (FULLY RECOVERED) type: SENSOR
RTAS: initiator: UNKNOWN target: UNKNOWN
RTAS: Status: bypassed new
RTAS: Date/Time: 20020830 14404000
RTAS: Environment and Power Warning
RTAS: EPOW Sensor Value: 0x00000001
RTAS: EPOW caused by fan failure
RTAS: -------- event-scan end ----------
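When the boot log is long, it can help to print only the complete RTAS event blocks rather than every line that mentions RTAS. The following is a minimal sketch, assuming the events are delimited by `event-scan begin` and `event-scan end` lines exactly as in the example log above. The function names `rtas_events` and `rtas_event_count` are hypothetical helpers, not commands shipped with Linux.

```shell
# rtas_events FILE
# Prints each RTAS event block, from its "event-scan begin" line through
# its "event-scan end" line, and skips all other log lines.
rtas_events() {
    sed -n '/RTAS: -* event-scan begin/,/RTAS: -* event-scan end/p' "$1"
}

# rtas_event_count FILE
# Prints the number of "event-scan begin" lines found in FILE.
rtas_event_count() {
    grep -c 'RTAS: -* event-scan begin' "$1"
}

# Usage on a system that keeps the boot log at the path named in step 7:
#   rtas_events /var/log/boot.msg | more
```

The plain `grep RTAS` command from the procedure remains the reference method; this variant only narrows the output to the lines between the begin and end markers.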
8. Record any RTAS events found in the Linux boot (IPL) log in step 7. Ignore all other events in the Linux boot (IPL) log. If the system is configured with more than one logical partition with Linux installed, repeat step 7 and step 8 for all logical partitions that have Linux installed.
9. Record any extended data found in the Linux system log in step 5 or the Linux boot (IPL) log in step 7.
   Note: The lines in the Linux extended data that begin with <4>RTAS: Log Debug: 04 contain the reference code in the next 8 hexadecimal characters. For example, if those characters are 4b27 26fb, then 4b27 26fb is the reference code. The reference code is also known as word 11. Each 4 bytes after the reference code in the Linux extended data is another word (for example, 04a0 0011 is word 12, 702c 0014 is word 13, and so on).
   If the system is configured with more than one logical partition with Linux installed, repeat step 9 for all logical partitions that have Linux installed.
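The word numbering described in the note can be illustrated with a short shell function. This is a sketch under stated assumptions: the hexadecimal digits are assumed to follow the `RTAS: Log Debug: 04 ` prefix as one unbroken string, and the function name `rtas_words` is hypothetical. The sample line in the usage comment is built only from the three word values quoted in the note and is not a real log entry.

```shell
# rtas_words LINE
# Prints "word N: XXXX XXXX" for each complete 8-digit group that follows
# the "RTAS: Log Debug: 04 " prefix. The first group is word 11 (the
# reference code). Prints nothing if LINE does not contain the prefix.
# Assumption: the hex digits after the prefix contain no spaces.
rtas_words() {
    hex=$(printf '%s\n' "$1" |
        sed -n 's/^.*RTAS: Log Debug: 04 \([0-9A-Fa-f]*\).*$/\1/p')
    n=11
    while [ "${#hex}" -ge 8 ]; do
        word=$(printf '%s' "$hex" | cut -c1-8)
        printf 'word %d: %s %s\n' "$n" \
            "$(printf '%s' "$word" | cut -c1-4)" \
            "$(printf '%s' "$word" | cut -c5-8)"
        hex=$(printf '%s' "$hex" | cut -c9-)   # drop the group just printed
        n=$((n + 1))
    done
}

# Usage with an illustrative (not real) line:
#   rtas_words '<4>RTAS: Log Debug: 04 4b2726fb04a00011702c0014'
```

Any trailing digits that do not fill a complete 8-digit group are ignored rather than reported as a partial word.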
10. Were any reference codes or checkpoints recorded in steps 3, 6, 8, or 9?
    - No: Go to step 11.
    - Yes: Go to the Linux fast-path problem isolation with each reference code that was recorded. Perform the indicated actions one at a time for each reference code until the problem has been corrected. If all recorded reference codes have been processed and the problem has not been corrected, go to step 11.
11. If no additional error information is available and the problem has not been corrected, do the following:
    - Shut down the system.
    - If a management console is not attached, see Accessing the Advanced System Management Interface (ASMI) for instructions to access the ASMI.
      Note: The ASMI functions can also be accessed by using a personal computer connected to system port 1.
      You need a personal computer (and cable, part number 62H4857) capable of connecting to system port 1 on the system unit. (The Linux login prompt cannot be seen on a personal computer connected to system port 1.) If the ASMI functions are not otherwise available, use the following procedure:
      - Attach the personal computer and cable to system port 1 on the system unit.
      - With 01 displayed in the operator panel, press a key on the virtual terminal on the personal computer. The ASMI menus are then available on the attached personal computer.
      - If the service processor menus are not available on the personal computer, perform the following steps:
        - Examine and correct all connections to the service processor.
        - Replace the service processor.
          Note: The service processor might be contained on a separate card or board; in some systems, the service processor is built into the system backplane. Contact your next level of support for help before replacing a system backplane.
    - Examine the service processor error log. Record all reference codes and messages written to the service processor error log. Go to step 12.
12. Were any reference codes recorded in step 11?
    - No: Go to step 20.
    - Yes: Go to the Linux fast-path problem isolation with each reference code or symptom you have recorded. Perform the indicated actions, one at a time, until the problem has been corrected. If all recorded reference codes have been processed and the problem has not been corrected, go to step 20.
13. Reboot the system and bring all partitions to the login prompt. If Linux is not usable in all partitions, go to step 17.
14. Use the lscfg command to list all resources assigned to all partitions. Record the adapter and the partition for each resource.
15. To determine whether any devices or adapters are missing, compare the list of partition assignments and the resources found to the customer's known configuration. Record the location of any missing devices. Also record any differences in the descriptions or the locations of devices.
    You can also compare this list of resources that were found to an earlier version of the device tree as follows:
    Note: At the Linux command prompt, type vpdupdate, and press Enter. The device tree is stored in the /var/lib/lsvpd/ directory in a file with the file name device-tree-YYYY-MM-DD-HH:MM:SS, where YYYY is the year, MM is the month, DD is the day, and HH, MM, and SS are the hour, minute, and second, respectively, of the date of creation.
    - At the command line, type the following:
      cd /var/lib/lsvpd/
    - At the command line, type the following:
      lscfg -vpz /var/lib/lsvpd/<file_name>
      where <file_name> is the .gz file name that contains the database archive.
    The diff command offers a way to compare the output from a current lscfg command to the output from an older lscfg command. If the file names for the current and old device trees are current.out and old.out, respectively, type: diff old.out current.out. Any lines that exist in the old file but not in the current file are listed and preceded by a less-than symbol (<). Any lines that exist in the current file but not in the old file are listed and preceded by a greater-than symbol (>). Lines that are the same in both files are not listed; for example, files that are identical produce no output from the diff command. If a location or description changes, the old line is listed with < and the new line is listed with >.
    If the system is configured with more than one logical partition with Linux installed, repeat step 14 and step 15 for all logical partitions that have Linux installed.
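The diff comparison described in step 15 can be wrapped in a small helper that labels each difference. This is a minimal sketch: the function name `compare_device_lists` is hypothetical, and it assumes both files are plain-text listings saved from lscfg, such as the old.out and current.out files named above.

```shell
# compare_device_lists OLD NEW
# Prints "missing: <line>" for each line found only in OLD and
# "added: <line>" for each line found only in NEW.
# Prints nothing when the two files are identical.
compare_device_lists() {
    # diff exits with status 1 when the files differ; treat that as normal.
    { diff "$1" "$2" || [ $? -eq 1 ]; } |
        sed -n -e 's/^< /missing: /p' -e 's/^> /added: /p'
}

# Usage, after saving a current listing next to an older one:
#   lscfg > current.out
#   compare_device_lists old.out current.out
```

A device whose location or description changed appears twice, once as `missing:` with the old text and once as `added:` with the new text.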
16. Was the location of one and only one device recorded in step 15?
    - No: If you previously answered Yes to step 16, return the system to its original configuration. This ends the procedure. Go to MAP 0410: Repair checkout.
      If you did not previously answer Yes to step 16, go to step 17.
    - Yes: Complete the following steps one at a time. Power off the system before each step. After each step, power on the system and go to step 13.
      - Check all connections from the system to the device.
      - Replace the device (for example, tape or DASD).
      - If applicable, replace the device backplane.
      - Replace the device cable.
      - Replace the adapter.
      - If the adapter resides in an I/O drawer, replace the I/O backplane.
      - If the device adapter resides in the CEC, replace the I/O riser card, or the CEC backplane in which the adapter is plugged.
      - Call service support. Do not go to step 13.
17. Does the system appear to stop or hang before reaching the login prompt, or did you record any problems with resources in step 15?
    Note: If the system console or VTERM window is always blank, choose No. If you are sure the console or VTERM is operational and connected correctly, answer the question for this step.
    - No: Go to step 18.
    - Yes: There might be a problem with an I/O device. Go to PFW1542: I/O problem isolation procedure. When instructed to boot the system, boot a full system partition.
18. Boot the eServer™ standalone diagnostics (see Running the online and stand-alone diagnostics). Run diagnostics in problem determination mode on all resources. Be sure to boot a full system partition. Ensure that diagnostics were run on all known resources. You might need to select each resource individually and run diagnostics on each resource one at a time.
    Did standalone diagnostics find a problem?
    - No: Go to step 22.
    - Yes: Go to the Reference codes and perform the actions for each reference code you have recorded. For each reference code not already processed in step 16, repeat this action until the problem has been corrected. Perform the indicated actions, one at a time. If all recorded reference codes have been processed and the problem has not been corrected, go to step 22.
19. Does the system have Linux installed on one or more partitions?
    - No: Return to Start a repair action.
    - Yes: Go to step 3.
20. Were any location codes recorded in steps 3, 6, 8, 9, 10, or 11?
    - No: Go to step 13.
    - Yes: Replace, one at a time, all parts whose location code was recorded in steps 3, 6, 8, 9, 10, or 11 that have not been replaced. Power off the system before replacing a part. After replacing the part, power on the system to check whether the problem has been corrected. Go to step 21 when the problem has been corrected, or when all parts in the location codes list have been replaced.
21. Was the problem corrected in step 20?
    - No: Go to step 13.
    - Yes: Return the system to its original configuration. This ends the procedure. Go to MAP 0410: Repair checkout.
22. Were any other symptoms recorded in step 3?
    - No: Call support.
    - Yes: Go to Start a repair action with each symptom you have recorded. Perform the indicated actions for all recorded symptoms, one at a time, until the problem has been corrected. If all recorded symptoms have been processed and the problem has not been corrected, call your next level of support.