Diagnosing link errors

Use this procedure to isolate link errors to a field replacement unit (FRU).

Symptoms that lead to this procedure include:

Symptom	Reporting mechanism
Link down message; HCA resource (logical switch, logical HCA, end node) disappearance reported	QLogic log, or Cluster Systems Management/Management Server (CSM/MS) log containing QLogic logs: /var/log/csm/errorlog/[CSM/MS hostname]
HCA resource (logical switch, logical HCA, node) disappearance reported	FastFabric health checking with .diff file
LED on switch or HCA showing link down	LEDs; Chassis Viewer; Fabric Viewer

Use the following procedure to isolate a link error to a FRU. Record the steps that you take in case you need to contact your next level of support or QLogic.

The basic flow of the procedure is:

Determine if the link errors might be symptoms caused by a user action (such as a restart) or another component failing (such as a switch, or a server).
Determine the physical location of both ends of the cable.
Isolate to the FRU
Repair the FRU
Verify that the link is fixed
Verify that the configuration was not inadvertently changed
If a switch component or HCA was replaced, perform a new health check baseline
Exit the procedure

Notes:

During this procedure, you might need to swap the ports to which a cable end is connected. Do not swap ports with a link that is connected to a fabric management server. Doing so can jeopardize fabric performance and the capability to perform some verification procedures.
After you have fixed the problem, or if you cannot find a problem after disturbing the cable, HCA, or switch components associated with the link, perform the FastFabric health check described in step 16 to ensure that you have returned the cluster fabric to the intended configuration. The only changes in configuration should be VPD information from replaced parts.
If you replace the managed spine for the switch chassis, you need to redo the switch chassis setup for the switch as described in Installing and configuring vendor InfiniBand switches.

If the failing link is a switch-to-switch link, use the troubleshooting guide from QLogic. Contact QLogic service and exit this procedure.
If the failing link is an IBM® HCA-to-switch link, continue to the next step.
Map the IBM HCA GUID and port information to a physical location, and determine the switch physical location by using the procedure in Mapping fabric devices.
Before proceeding, check for other link problems in the CSM Event Management Log.
If there is an appearance notification after a disappearance notification for the link, it is possible that the HCA link bounced, or the node has rebooted.
If every link attached to a server is reported as down, or all of them have been reported disappearing and then appearing, perform the following steps:
1. Check to see if the server is powered-off or had been restarted. If the server has been powered-off or restarted, the link error is not a serviceable event; therefore, you can end this procedure.
2. If the server is not powered off and was not restarted, the problem is with the HCA. Replace the HCA by using the Serviceability task on the Hardware Management Console (HMC) that manages the server in which the HCA is installed, and exit this procedure.
If every link attached to the switch chassis has gone down, or all of them have been reported disappearing and then appearing, perform the following steps:
1. Check to see if the switch chassis is powered-off or was powered-off at the time of the error. If this is true, the link error is not a serviceable event; therefore, you can end this procedure.
2. If the switch chassis is not powered-off nor was it powered-off at the time of the error, the problem is in the switch chassis. Contact QLogic service and exit this procedure.
If more than two links attached to a switch chassis have gone down, but not all the links with cables have gone down or been reported disappearing and then appearing, the problem is in the switch chassis. Contact QLogic service and exit this procedure.
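The scope checks in the preceding steps can be summarized as a small decision function. The following Python sketch is illustrative only: the function name, its parameters, and the returned strings are hypothetical and are not part of any IBM or QLogic tool. It assumes that you have already counted, from the logs, how many links on the server and on the switch chassis are down or have bounced.

```python
def triage_link_scope(server_links_total, server_links_down,
                      switch_links_total, switch_links_down,
                      server_was_off_or_restarted, switch_was_off):
    """Suggest the next action from the number of affected links.

    All names are hypothetical. The counts come from your own reading of
    the logs, not from an API.
    """
    # Every link attached to the server is down or has bounced.
    if server_links_total > 0 and server_links_down == server_links_total:
        if server_was_off_or_restarted:
            return "not serviceable: server was powered off or restarted"
        return "replace HCA by using the HMC Serviceability task"

    # Every cabled link attached to the switch chassis is down or has bounced.
    if switch_links_total > 0 and switch_links_down == switch_links_total:
        if switch_was_off:
            return "not serviceable: switch chassis was powered off"
        return "contact QLogic service: switch chassis problem"

    # More than two, but not all, links on the chassis are down.
    if switch_links_down > 2:
        return "contact QLogic service: switch chassis problem"

    return "continue with single-link isolation"
```

For example, calling `triage_link_scope(2, 1, 24, 1, False, False)` describes a single failing link on a two-port server and falls through to the single-link isolation steps that follow.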
Check the HMC for serviceable events against the HCA. If the HCA was reported as part of a FRU list in a serviceable event, this link error is not a serviceable event; therefore, no repair is required in this procedure. If you replace the HCA or a switch component based on the serviceable event, go to step 16 in this procedure. Otherwise, you can exit this procedure.
Check the LEDs of the HCA and switch port comprising the link. Use the IBM system Manual to determine if the HCA LED is in a valid state and use the QLogic switch Users Guide to determine if the switch port is in a valid state. In each case, the LED should be lit if the link is up and unlit if the link is down.
Check the seating of the cable on the HCA and the switch port. If it appears unseated, reseat the cable and do the following steps. Otherwise go to the next step.
1. Check the LEDs.
2. If the LEDs light, the problem is resolved. Go to step 16.
3. If the LEDs do not light, go to the next step.
Check the cable for damage. If the cable is damaged, perform the following procedure. Otherwise, proceed to the next step.
1. Replace the cable. Before replacing the cable, check the manufacturer and part number to ensure that it is an approved cable. Approved cables are available in the IBM clusters with the InfiniBand switch Web site.
2. Perform the procedure in Verifying link FRU replacements.
3. If the problem is fixed, go to step 16. If the problem is not fixed, go to the next step.
If there are open ports on the switch, do the following steps. Otherwise, go to step 14.
1. Move the cable connector from the failing switch port to the open switch port.
2. In order to see if the problem has been resolved, or it has moved to the new switch port, use the procedure in Verifying link FRU replacements.
3. If the problem was “fixed”, then the failing FRU is on the switch. Engage QLogic for repair. Once the repair has been made, go to step 16. If the problem was not fixed by swapping ports, proceed to the next step.
4. If the problem was not “fixed” by swapping ports, then the failing FRU is either the cable or the HCA. Return the switch port end of the cable to the original switch port.
5. If a known good HCA port is available, move the cable end from the failing HCA port to the known good HCA port, and then do the following steps. Otherwise, proceed to the next step.
  1. Use the procedure in Verifying link FRU replacements.
  2. If the problem was “fixed”, replace the HCA by using the Repair and Verify procedures for the server and HCA. Once the HCA is replaced, go to step 16.
  3. If the problem was not “fixed”, the problem is the cable. Engage QLogic for repair. Once the repair has been made, go to step 16.
6. If there is not a known good HCA port available for use, and the problem has been determined to be the HCA or the cable, replace the FRUs in the following order:
  1. Engage QLogic to replace the cable, and verify the fix by using the procedure in Verifying link FRU replacements. If the problem is fixed, go to step 16.
    Note: Before replacing the cable, check the manufacturer and part number to ensure that it is an approved cable. Approved cables are available in the IBM Clusters with the InfiniBand Switch Web site referenced in Table 2: General Cluster Information Resources, on page 16.
  2. If the cable does not fix the problem, replace the HCA, and verify the fix by using the procedure in Verifying link FRU replacements. If the problem is fixed, go to step 16.
  3. If the problem is still not fixed, contact your next level of support. If any repairs are made under direction from support, go to step 16 once they have been made.
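The swap tests in this step amount to a small truth table. The following Python sketch restates that table under stated assumptions: the function name, its arguments, and the returned strings are hypothetical, and the two inputs are simply the outcomes that you observe after running Verifying link FRU replacements for each swap.

```python
def isolate_by_swap(fixed_on_open_switch_port, fixed_on_good_hca_port=None):
    """Map swap-test outcomes to the suspected FRU (hypothetical helper).

    fixed_on_open_switch_port: True if the link works after the cable is
        moved to an open switch port.
    fixed_on_good_hca_port: True or False after moving the cable to a known
        good HCA port, or None if no known good HCA port is available.
    """
    if fixed_on_open_switch_port:
        # The problem stayed with the original switch port.
        return "switch"
    if fixed_on_good_hca_port is None:
        # No second test is possible: replace in order, cable first.
        return "cable, then HCA"
    if fixed_on_good_hca_port:
        # The problem stayed with the original HCA port.
        return "HCA"
    # The problem followed the cable through both swaps.
    return "cable"
```

For example, `isolate_by_swap(False, True)` returns `"HCA"`, which matches the case where moving the cable to a new switch port did not help but moving it to a known good HCA port did.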
If there are open ports or known good ports on the HCA, perform the following steps. Otherwise, go to the next step.
1. Move the cable connector from the failing HCA port to the open or known good HCA port.
2. To see whether the problem has been resolved or has moved to the new HCA port, use the procedure in Verifying link FRU replacements.
3. If the problem was “fixed”, then the failing FRU is the HCA. Replace the HCA by using the Repair and Verify procedures for the server and HCA. After the HCA has been replaced, go to step 16.
4. If the problem was not “fixed”, then the failing FRU is the cable or the switch. Engage QLogic for repair. Once the problem is fixed, go to step 16.
If there are no open or available ports in the fabric, or the problem has not been isolated yet, do the following:
1. Engage QLogic to replace the cable, and verify the fix by using the procedure in Verifying link FRU replacements. If the problem is fixed, go to step 16.
2. If the cable does not fix the problem, replace the HCA, and verify the fix by using the procedure in Verifying link FRU replacements. If the problem is fixed, go to step 16.
3. If the HCA does not fix the problem, engage QLogic to work on the switch. Once the problem is fixed, go to step 16.
If the problem has been fixed, run the FastFabric health check and check for .diff files. Be aware of any inadvertent swapping of cables. For instructions on interpreting health check results, see Health checking.
1. If the only difference between the latest cluster configuration and the baseline configuration is new part numbers or serial numbers related to the repair action, run a new Health Check baseline to account for the changes.
2. If there are other differences between the latest cluster configuration and the baseline configuration, perform the procedure in Reestablishing a health check baseline. That procedure creates a new baseline so that future health checks do not show configuration changes.
3. If there were link errors reported in the health check, you need to go back to step 1 of this procedure and isolate the problem.
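The follow-up after the health check can be sketched as a short decision function. This Python example is a sketch under stated assumptions, not a parser for FastFabric output: the function name, the field labels `part_number` and `serial_number`, and the returned strings are hypothetical. It expects you to have read the .diff files yourself and listed the kinds of fields that changed. Checking for link errors first is a design choice of this sketch, and the result for an empty list of changes is an assumption that the procedure does not state.

```python
# Hypothetical labels for changes that a repair action is expected to cause.
REPAIR_RELATED = {"part_number", "serial_number"}


def after_health_check(link_errors_reported, changed_fields):
    """Suggest the follow-up after the health check (hypothetical helper)."""
    if link_errors_reported:
        # New link errors send you back to the start of the procedure.
        return "return to step 1 and isolate the problem"

    changed = set(changed_fields)
    if not changed:
        # Assumption: with no differences, the existing baseline still holds.
        return "no baseline change needed"
    if changed <= REPAIR_RELATED:
        # Only the VPD of replaced parts differs from the baseline.
        return "run a new health check baseline"
    # Anything else may indicate an inadvertent change, such as swapped cables.
    return "follow Reestablishing a health check baseline"
```

For example, `after_health_check(False, ["serial_number"])` corresponds to the case where the only difference is the serial number of a replaced part.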