General intermittent problem checklist
Use the following procedure to correct intermittent problems.
Procedure
- Discuss the problem with the customer. Look for the following symptoms:
- A reference code that goes away when you power off and then power on the system.
- Repeated failure patterns that you cannot explain. For example, the problem occurs at the same time of day or on the same day of the week.
- Failures that started after system relocation.
- Failures that occurred while specific jobs or software were running.
- Failures that started after recent service or customer actions, system upgrade, addition of I/O devices, new software, or program temporary fix (PTF) installation.
- Failures occurring only during high system usage.
- Failures that occur when people are close to the system or when machines are attached to the system.
- Recommend that the customer install the latest cumulative PTF package, since code PTFs have corrected many problems that seem to be hardware failures. The customer can order the latest cumulative PTF package electronically through Electronic Customer Support or by calling the Software Support Center.
- If you have not already done so, use the maintenance package to see the indicated actions for the symptom described by the customer.
Attempt to perform the online problem analysis procedure first. If this is not possible, such as when the system is down, go to Beginning problem analysis.
Use additional diagnostic tools, if necessary, and attempt to recreate the problem.
Note: Ensure that the service information you are using is at the same level as the operating system.
- Check the site for the following environmental conditions:
- Any electrical noise that matches the start of the intermittent problems. Ask the customer such questions as:
- Have any external changes or additions, such as building wiring, air conditioning, or elevators been made to the site?
- Has any arc welding occurred in the area?
- Has any heavy industrial equipment, such as cranes, been operating in the area?
- Have there been any thunderstorms in the area?
- Have the building lights become dim?
- Has any equipment been relocated, especially computer equipment?
If there was any electrical noise, find its source and prevent the noise from getting into the system.
- Site temperature and humidity conditions that are within the system specifications.
See temperature and humidity design criteria in the Planning for the system topic relevant for your system.
- Poor air quality in the computer room:
- Look for dust on top of objects. Dust particles in the air cause poor electrical connections and may cause disk unit failures.
- Smell for unusual odors in the air. Some gases can corrode electrical connections.
- Any large vibration (caused by thunder, an earthquake, an explosion, or road construction) that occurred in the area at the time of the failure.
Note: A failure that is caused by vibration is more probable if the server is on a raised floor.
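The environmental checks above come down to matching site events against the time the intermittent problems started. A minimal sketch of that comparison follows, assuming the events were collected by interviewing the customer and entered by hand. The event descriptions, the dates, the function name `events_near_onset`, and the 24-hour window are all hypothetical and are not part of any service tool.

```python
from datetime import datetime, timedelta

# Hypothetical site events gathered from the customer interview.
SITE_EVENTS = [
    ("arc welding in adjacent bay", datetime(2024, 5, 6, 9, 30)),
    ("thunderstorm", datetime(2024, 5, 2, 17, 0)),
    ("air conditioning unit replaced", datetime(2024, 4, 20, 8, 0)),
]


def events_near_onset(events, first_failure, window=timedelta(hours=24)):
    """Return events that happened no more than `window` before the
    first failure, most recent first. The window size is an assumption."""
    near = [
        (description, when)
        for description, when in events
        if timedelta(0) <= first_failure - when <= window
    ]
    return sorted(near, key=lambda event: event[1], reverse=True)


if __name__ == "__main__":
    first_failure = datetime(2024, 5, 6, 11, 0)  # hypothetical onset time
    for description, when in events_near_onset(SITE_EVENTS, first_failure):
        print(f"{when:%Y-%m-%d %H:%M}  {description}")
```

An event that falls inside the window is only a candidate cause. The source of the noise or vibration still has to be confirmed on site.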
- Ensure that all ground connections are tight. Tight ground connections reduce the effects of electrical noise. Check the ground connections by measuring the resistance between a conductive place on the frame and building ground or earth ground. The resistance must be 1.0 ohm or less.
- Ensure proper cable retention is used, as provided.
If no retention is provided, the cable should be strapped to the frame to release tension on cable connections.
Ensure that you pull the cable ties tight enough to fasten the cable to the frame bar tightly. A loose cable can be accidentally pulled with enough force to unseat the logic card in the frame to which the cable is attached. If the system is powered on, the logic card could be destroyed.
- Ensure that all workstation and communications cables meet hardware specifications:
- All connections are tight.
- Any twinaxial cables that are not attached to devices must be removed.
- The lengths and numbers of connections in the cables must be correct.
- Ensure that lightning protection is installed on any twinaxial cables that enter or leave the building.
- Perform the following:
- Review recent repair actions. Contact your next level of support for assistance.
- Review entries in the problem log (WRKPRB). Look for problems that were reported to the user.
- Review entries in the product activity log (PAL), service action log (SAL), and service processor log. Look for a pattern:
- SRCs on multiple adapters occurring at the same time
- SRCs that have a common time-of-day or day-of-week pattern
- Log is wrapping (hundreds of recent entries and no older entries)
Check the PAL sizes and increase them if they are smaller than recommended.
- Review entries in the history log (Display Log (DSPLOG)).
Look for a change that matches the start of the intermittent problems.
- Ensure that the latest engineering changes are installed on the system and on all system I/O devices.
- Ensure that the hardware configuration is correct and that the model configuration rules have been followed. Use the Display hardware configuration service function (under SST or DST) to check for any missing or failed hardware.
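The log review in this step looks for three patterns: SRCs on several adapters at the same time, SRCs that share a time of day or day of the week, and a log that is wrapping. The sketch below shows one way to check for them once the entries have been transcribed into a list. It does not read the PAL, SAL, or service processor log itself. The entry layout, resource names, SRC values, function names, and thresholds are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical transcribed entries: (timestamp, resource, SRC placeholder).
ENTRIES = [
    (datetime(2024, 3, 4, 2, 15, 10), "ADAPTER_A", "SRC_1"),
    (datetime(2024, 3, 4, 2, 15, 40), "ADAPTER_B", "SRC_2"),
    (datetime(2024, 3, 11, 2, 14, 55), "ADAPTER_A", "SRC_1"),
    (datetime(2024, 3, 18, 2, 16, 5), "ADAPTER_C", "SRC_3"),
    (datetime(2024, 3, 20, 14, 0, 0), "ADAPTER_A", "SRC_1"),
]


def simultaneous_groups(entries, window=timedelta(minutes=1)):
    """Group entries logged within `window` of a group's first entry and
    keep only the groups that involve more than one resource."""
    groups, current = [], []
    for entry in sorted(entries):
        if current and entry[0] - current[0][0] > window:
            if len({e[1] for e in current}) > 1:
                groups.append(current)
            current = []
        current.append(entry)
    if len({e[1] for e in current}) > 1:
        groups.append(current)
    return groups


def time_patterns(entries):
    """Count entries per hour of day and per day of week."""
    by_hour = Counter(ts.hour for ts, _, _ in entries)
    by_weekday = Counter(WEEKDAYS[ts.weekday()] for ts, _, _ in entries)
    return by_hour, by_weekday


def log_may_be_wrapping(entries, min_count=200, max_span=timedelta(days=7)):
    """Flag many entries packed into a short span with nothing older.
    Both thresholds are assumptions, not recommended values."""
    if len(entries) < min_count:
        return False
    stamps = [ts for ts, _, _ in entries]
    return max(stamps) - min(stamps) <= max_span


if __name__ == "__main__":
    print(simultaneous_groups(ENTRIES))
    print(time_patterns(ENTRIES))
    print(log_may_be_wrapping(ENTRIES))
```

With the sample data, the two entries logged seconds apart on different adapters form one group, and the hour and weekday counts make the recurring early-morning, same-weekday pattern easy to see.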
- Was a system upgrade, feature, or any other field bill of material or feature field bill of material installed just before the intermittent problems started occurring?
- No: Continue with the next step.
- Yes: Review the installation instructions to ensure that each step was performed correctly. Then, continue with the next step of this procedure.
- Is the problem associated with a removable media storage device?
- No: Continue with the next step.
- Yes: Ensure that the customer is using the correct removable media storage device cleaning procedures and good storage media. Then, continue with the next step of this procedure.
- Perform the following to help prevent intermittent thermal checks:
- Ensure that the air moving devices (AMDs) are working.
- Exchange all air filters as recommended.
- If necessary, review the intermittent problems with your next level of support and installation planning representative.
Ensure that all installation planning checks were made on the system. Because external conditions are constantly changing, the site may need to be checked again.
This ends the procedure.