Use the following procedure to correct intermittent problems.
Performing these steps removes the known causes of
most intermittent problems.
- Discuss the problem with the customer. Look for the
following symptoms:
- A reference code that goes away when you power off and then power on the
system.
- Repeated failure patterns that you cannot explain. For example, the problem
occurs at the same time of day or on the same day of the week.
- Failures that started after system relocation.
- Failures that occurred while specific jobs or software were
running.
- Failures that started after recent service or customer actions, system
upgrade, addition of I/O devices, new software, or program temporary fix (PTF)
installation.
- Failures occurring only during high system usage.
- Failures that occur when people are close to the system or when machines
are attached to the system.
- Recommend that the customer install the latest cumulative PTF package,
because code PTFs have corrected many problems that appear to be hardware failures.
The customer can order the latest cumulative PTF package electronically
through Electronic Customer Support or by calling the Software Support Center.
- If you have not already done so, use the maintenance package to
see the indicated actions for the symptom described by the customer.
Attempt to perform the online problem analysis procedure first. If
this is not possible, such as when the system is down, go to the Start of call procedure.
Use additional diagnostic tools, if necessary, and attempt to recreate the problem.
Note: Ensure that the service information you are using is at the same level
as the operating system.
- Check the site for the following environmental conditions:
- Any electrical noise that matches the start of the intermittent
problems. Ask the customer such questions as:
- Have any external changes or additions, such as building wiring, air conditioning,
or elevators, been made to the site?
- Has any arc welding occurred in the area?
- Has any heavy industrial equipment, such as cranes, been operating in
the area?
- Have there been any thunderstorms in the area?
- Have the building lights become dim?
- Has any equipment been relocated, especially computer equipment?
If there was any electrical noise, find its source and prevent the
noise from getting into the system.
- Site temperature and humidity conditions. Confirm that both are within the
system specifications. See Temperature and humidity design criteria in
the Planning topic.
- Poor air quality in the computer room:
- Look for dust on top of objects. Dust particles in the air cause poor
electrical connections and may cause disk unit failures.
- Smell for unusual odors in the air. Some gases can corrode electrical
connections.
- Any large vibration (caused by thunder, an earthquake, an explosion,
or road construction) that occurred in the area at the time of the failure.
Note: A failure that is caused by vibration is more probable if the
server is on a raised floor.
- Ensure that all ground connections are tight. Tight ground
connections reduce the effects of electrical noise. Check the ground connections
by measuring the resistance between a conductive place on the frame and building
ground or earth ground. The resistance must be 1.0 ohm or less.
- Ensure that the cable retention provided with the system is used. If
no retention is provided, strap the cable to the frame to relieve
tension on the cable connections.
Pull the cable ties tight enough to fasten the cable securely to the
frame bar. A loose cable can be
accidentally pulled with enough force to unseat the logic card in the frame
to which the cable is attached. If the system is powered on, the logic card
could be destroyed.
- Ensure that all workstation and communications cables meet hardware
specifications:
- All connections are tight.
- Any twinaxial cables that are not attached to devices must be removed.
- The lengths and numbers of connections in the cables must be correct.
- Ensure that lightning protection is installed on any twinaxial cables
that enter or leave the building.
- Perform the following:
- Review recent service calls. Contact your next level
of support for assistance.
- Review entries in the problem log (WRKPRB). Look
for problems that were reported to the user.
- Review entries in the product activity log (PAL), service action log (SAL),
and service processor log. Look for a pattern:
- System reference codes (SRCs) on multiple adapters occurring at the same time
- SRCs that have a common time-of-day or day-of-week pattern
- A log that is wrapping (hundreds of recent entries and no older entries).
Check the PAL sizes and increase them if they are smaller than
recommended.
- Review entries in the history log (Display Log (DSPLOG)). Look for a change that matches the start of the intermittent problems.
- Ensure that the latest engineering changes are installed on
the system and on all system I/O devices.
- Ensure that the hardware configuration is correct and that the
model configuration rules have been followed. Use the Display
hardware configuration service function (under SST or DST) to
check for any missing or failed hardware.
- Was a system upgrade, a feature, or any other field bill of material
(including a feature field bill of material) installed just before the
intermittent problems started?
- No: Continue with the next step.
- Yes: Review the installation instructions to ensure
that each step was performed correctly. Then continue with the next step of
this procedure.
- Is the problem associated with a removable media storage device?
- No: Continue with the next step.
- Yes: Ensure that the customer is using the correct
removable media storage device cleaning procedures and good storage media.
Then continue with the next step of this procedure.
- Perform the following to help prevent intermittent thermal checks:
- Ensure that the air moving devices (AMDs) are working.
- Exchange all air filters as recommended.
- If necessary, review the intermittent problems with your next level
of support and installation planning representative. Ensure that
all installation planning checks were made on the system. Because external
conditions are constantly changing, the site may need to be checked again.
This ends the procedure.
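
The log review step in this procedure asks you to look for reference codes that cluster at a common time of day or day of week, or that appear on multiple adapters at the same time. When log entries have been copied off the system, that check might be sketched as shown below. This is a minimal illustration, not a service tool: the function name `summarize_patterns`, the `(timestamp, code, resource)` tuple layout, and the `min_share` threshold are all assumptions made for the example, and no system log format or interface is implied.

```python
from collections import Counter, defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def summarize_patterns(entries, min_share=0.5):
    """Flag time-of-day, day-of-week, and same-minute clusters.

    entries: list of (iso_timestamp, code, resource) tuples.
             This layout is hypothetical, chosen only for the sketch.
    min_share: fraction of all entries that must fall in one hour or
               weekday before it is reported (an arbitrary assumption).
    """
    findings = {"hours": [], "weekdays": [], "simultaneous": []}
    stamps = [datetime.fromisoformat(ts) for ts, _, _ in entries]
    total = len(stamps)
    if total == 0:
        return findings

    # Count entries per hour of day and per weekday.
    by_hour = Counter(s.hour for s in stamps)
    by_day = Counter(s.weekday() for s in stamps)
    findings["hours"] = sorted(
        h for h, n in by_hour.items() if n / total >= min_share)
    findings["weekdays"] = [
        WEEKDAYS[d] for d in sorted(by_day) if by_day[d] / total >= min_share]

    # Group by minute to find entries logged on several resources at once.
    per_minute = defaultdict(set)
    for (_, _, resource), s in zip(entries, stamps):
        per_minute[s.replace(second=0, microsecond=0)].add(resource)
    findings["simultaneous"] = sorted(
        m.isoformat() for m, res in per_minute.items() if len(res) > 1)
    return findings
```

A result that names a single hour or weekday, or lists minutes in which several resources logged entries together, suggests an environmental or scheduling cause to compare against the customer's answers about site changes and job schedules.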