Troubleshooting
Problem
This procedure is written to help BladeCenter users debug problems involving connectivity between blade servers and external fibre channel devices, switches, disk controllers, etc.
Resolving The Problem
| Source |
|---|
RETAIN tip: H19935
| Symptom |
|---|
This procedure is written to help BladeCenter users debug problems involving connectivity between blade servers and external fibre channel devices, switches, disk controllers, etc.
| Affected configurations |
|---|
The system may be any of the following IBM servers:
- BladeCenter Chassis, Type 8677, any model
Notes:
This Tip is not software specific.
This Tip is not option specific.
| Solution |
|---|
This procedure is written to help BladeCenter users debug problems involving connectivity between blade servers and external fibre channel devices, switches, disk controllers, etc.
REFERENCES:
- IBM BladeCenter Deployment Guide, IBM whitepaper WP100564
- Remote Storage Area Network (SAN) Boot - IBM eServer BladeCenter HS20 and HS40, IBM doc MIGR-57563
- IBM eServer BladeCenter Remote SAN Boot - IBM eServer BladeCenter JS20, IBM doc MIGR-57235
- IBM eServer BladeCenter Remote SAN Boot - IBM eServer BladeCenter JS21, IBM doc MIGR-64763
- Fibre Channel Switch Interoperability Guide - IBM BladeCenter, IBM doc MIGR-58206
- SANsurfer Switch Manager Version 5.02.05 for Linux, IBM doc MIGR-60532
- SANsurfer Switch Manager Version 5.02.05 for Microsoft Windows, IBM doc MIGR-60498
- Qlogic 6-Port Enterprise Fibre Channel Switch Module Management Guide, IBM doc MIGR-58700
- McDATA 6-Port Fibre Channel Switch Module - IBM eServer BladeCenter Management Guide, IBM doc MIGR-59910
- Brocade SAN Switch Modules: Design, Deployment and Management (DDM) Guide, IBM doc MIGR-55327
- DS4000 Best Practices and Performance Tuning Guide, IBM Redbook sg246363
- Brocade Fabric Manager Version 5, IBM site http://www-03.IBM.com/servers/storage/san/b_type/fabric_manager/ver5/
- McData Management Software, EFCM 9.0, McData site http://www.mcdata.com/products/software/sanmf/index.html
- Emulex HBA Management Suite, HBAnyware, Emulex site http://www.emulex.com/products.html
| TECHNICAL OVERVIEW |
|---|
The original BladeCenter chassis (Type 8677) has four bays for I/O connectivity modules, accessible from the rear of the chassis. The I/O modules plug directly into the chassis midplane, where they connect to DC power, to communications ports from all 14 blades, and to control/status signals from the Management Module (MM). All blade servers come with two built-in Ethernet ports which connect to the midplane and route to ports on I/O switch modules plugged into chassis Bays 1 and 2. Blades also have a connector for attaching an I/O expansion card, which then connects to two more I/O switch modules in chassis Bays 3 and 4. The alternative BladeCenter chassis models, the BladeCenter H and BladeCenter T, have different switch bay configurations but the same principles apply.
A blade server configuration for connection to a fibre channel network requires a fibre channel expansion card in the blade and at least one fibre channel switch module or Optical Pass-through Module (OPM) in chassis Bay 3 or 4. IBM offers both 2 Gb and 4 Gb Host Bus Adapters (HBAs) for the blades. Make sure the blade fibre card port configuration matches the I/O module port. For instance, the OPM ports cannot operate at speeds greater than 2 Gb, so a 4 Gb HBA installed in a blade in this chassis would have to have its ports configured down to 2 Gb.
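The speed-matching rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and speeds are given as plain Gb/s integers, not read from any real HBA or switch interface.

```python
# Hypothetical helpers; speeds are link rates in Gb/s.
def required_port_speed(hba_max_gb: int, module_max_gb: int) -> int:
    """Return the highest speed that both ends of the link support."""
    return min(hba_max_gb, module_max_gb)


def needs_reconfiguration(hba_configured_gb: int, module_max_gb: int) -> bool:
    """True when the HBA port is set faster than the I/O module can run."""
    return hba_configured_gb > module_max_gb


# A 4 Gb HBA behind a 2 Gb OPM must be configured down to 2 Gb.
print(required_port_speed(4, 2))    # 2
print(needs_reconfiguration(4, 2))  # True
```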
In order to properly present a logical drive behind a SAN Storage Controller to a blade host inside a BladeCenter chassis, five general steps must be validated. Most fibre connection problems can be narrowed down to a problem with one of these five steps:
- The blade HBA port must log in to the fabric.
- The HBA port should be added to an individual fabric zone in an active zone set, paired with the SAN Storage Controller port. For example, a fully configured chassis of 14 blades and two Fibre Channel Switch Modules (FSMs) would have two fabrics and 14 zones per fabric.
Note: Zones may be port zones, WWN zones, or mixed, depending on the FSM manufacturer. Usually, if the fabric route between the host and the SAN Storage Controller includes fibre channel switches from different manufacturers, then the switches need to be configured for interoperability mode and WWN zoning must be used. See the Fibre Channel Switch Interoperability Guide listed above for more details.
- One or more active paths must exist between the blade HBA port and the SAN Storage Controller through the fabric; the number of paths depends on the multipath driver in use.
The blade HBA controller actually has two ports hardwired to two different fibre switch modules in the chassis. The two fibre switch modules are usually tied to two separate fabrics, each fabric connected to one of the two SAN Storage Controllers to preserve path redundancy. It may be necessary to reduce the connections to the SAN Storage Controllers to a single path for debug purposes by removing the fabric connection to one of the fibre switch modules.
- A logical drive or LUN must be mapped to the blade host using the SAN Storage Controller management software.
- If the blade is to be booted from the remote logical drive, the HBA port must be enabled for boot mode and an operating system must be installed on the logical drive.
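To make the zoning arithmetic in the second step concrete, here is a minimal Python sketch that builds one zone per blade HBA port for a single fabric. The WWPNs and zone names are made-up placeholders, and `build_zones` is a hypothetical helper, not part of any switch management tool.

```python
from typing import Dict, List


def build_zones(hba_wwpns: List[str], controller_wwpn: str) -> Dict[str, List[str]]:
    """Create one zone per blade HBA port, each paired with one controller port."""
    zones = {}
    for slot, wwpn in enumerate(hba_wwpns, start=1):
        zones[f"blade{slot:02d}_zone"] = [wwpn, controller_wwpn]
    return zones


# Placeholder WWPNs for a fully populated chassis (14 blades), one fabric only.
fabric_a_hbas = [f"21:00:00:e0:8b:00:00:{slot:02x}" for slot in range(1, 15)]
fabric_a = build_zones(fabric_a_hbas, "20:04:00:a0:b8:00:00:01")
print(len(fabric_a))  # 14
```

Repeating the same call with the second port of each HBA and the other controller port would give the second fabric its own 14 zones.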
PROBLEM DETERMINATION FLOW:
A. Known Solutions:
- Host Bus Adapter expansion card fails to connect to Cisco MDS 9000. Cannot see Logical Unit Number (LUN) when using IBM Fibre Channel Expansion card or IBM Small Form Factor (SFF) Fibre Channel Expansion card when connecting to Cisco MDS 9000 Family switches. Manually add the company IDs (OUI) for the Fibre Channel Expansion cards into the configuration of Cisco MDS 9000 family switches. See MIGR-59374.
- Controlled power on/off sequencing with Fibre Channel - IBM eServer BladeCenter. When applying or removing power to the IBM eServer BladeCenter system with an attached Fibre Channel storage system, you must follow a specific power on/off sequence. See MIGR-46377.
- Blades running Red Hat 3 Update 2 crash during Fibre Channel switch module failover - IBM eServer BladeCenter JS20. This behavior is corrected in Red Hat 3 Update 3 for PowerPC systems. See MIGR-56581.
- SAN devices connected through Optical Pass-through Module (OPM) not seen in hardware - IBM BladeCenter JS20 running RHEL 3 update 2. This behavior is corrected in Red Hat Linux Version 3 Update 3 (or later). See MIGR-56769.
- Cannot boot from SAN with Fibre Channel Host Bus Adapter in slot-1 - IBM eServer BladeCenter HS40. If only one HBA is to be installed, the Fibre Channel HBA must be installed in Slot-2. See MIGR-58396.
- Fibre HBA causes PCI configuration errors. After adding the BladeCenter Fibre Channel Expansion Adapter you will receive the following error messages: " Errors; PCI Configuration Error". The adapter must be re-flashed with the Fibre Channel Expansion Card Update Utility. See MIGR-58494.
- Cannot boot to SAN after RDM basic scan - IBM eServer BladeCenter HS20 and BladeCenter HS40. After performing an RDM 4.20 Basic Scan, the target system may lose its SAN connectivity. Apply RDM 4.20, Update 1 or newer. See MIGR-58345.
- When attempting to merge the McDATA 6-Port Fibre Channel Switch Module for IBM eServer BladeCenter into an existing McDATA fabric running SANtegrity, the merge fails. Install the SANtegrity option for the McDATA 6-Port Fibre Channel Switch Module for IBM eServer BladeCenter. See MIGR-60282.
- Fibre Channel Switch Interoperability Guide - IBM BladeCenter. Often when connecting fibre channel switches together from different vendors, the switches must be configured for interoperability mode. See MIGR-58206.
- Blade cannot connect to SAN after installing new Fibre Channel Host Bus Adapter. The HBA port does not log in to the switch. See MIGR-58440.
B. Initial Checkout:
- Check firmware levels for the blade HBA and the BladeCenter Fibre Channel Switch Module (FSM). Find the firmware download sites under www.ibm.com/systems/support and check the change history for the latest firmware to see if any fixes match your failure symptom.
- Was this blade and SAN configuration ever working? If so, what changed in the BladeCenter or SAN system just before the failure? Reverse the change condition if possible to see if that fixes the problem. If this is a new installation then review the document Remote Storage Area Network (SAN) Boot - IBM eServer BladeCenter HS20 and HS40 (or similar document for the BladeCenter JS20 and BladeCenter JS21 Power blades) to get an understanding of the steps involved. There are many steps and the order in which they are performed is very important.
- Log in to the chassis MM. Check System Status for error conditions. View the MM Event log and look for Warning or Error messages related to the blade or the FSM, or for any power, voltage or temperature messages.
- Are multiple blade servers in this chassis attached to the same fabric and are all working except one? If so, then problem determination should start with the blade and the HBA.
- Is more than one blade server failing in the same way, i.e. more than one server cannot see its remote LUN? If so, then focus on anything common between the blades that can affect the fibre connections, i.e. operating system fibre adapter and multipath drivers, FSMs, fibre cables, shared external fibre switch, shared SAN Storage Controller hardware, SAN Storage Controller software and configuration.
- Log in to each FSM in chassis Bays 3 and 4 and verify the link status is 'logged-in' for each internal port with an active blade attached to it. If an OPM is used, the link light on it only illuminates if a login has occurred. The port login can also be verified by checking the port status of the external FSM that the HBA port is connected to.
Note: When verifying the login status of an HBA port, bear in mind that the port is only active when one of the following is true:
a) The Ctrl-Q BIOS or Alt-E BIOS menus are active.
b) The Boot BIOS is loaded.
c) An operating system driver is loaded.
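The note above matters when reading the FSM port status: a missing login is only a fault if the HBA port should be active. A minimal Python sketch of that reasoning follows. The function name and the returned strings are hypothetical, and the inputs are assumed to be gathered by hand from the blade and the FSM.

```python
def interpret_login_status(logged_in: bool,
                           bios_menu_active: bool,
                           boot_bios_loaded: bool,
                           os_driver_loaded: bool) -> str:
    """Classify an FSM internal port reading using the three activity conditions."""
    port_active = bios_menu_active or boot_bios_loaded or os_driver_loaded
    if logged_in:
        return "ok"
    if not port_active:
        return "port inactive: no login expected yet"
    return "port active but not logged in: investigate"


# Blade booted from a local drive with no HBA driver loaded: no login is expected.
print(interpret_login_status(False, False, False, False))
```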
C. HBA port not logging in to the fabric:
- Start here if you've logged in to the FSM and you can see that the blade HBA is not logged in to the corresponding internal port of the switch.
- If this is a new installation and it is a BladeCenter HS40 (Type 8839), see MIGR-58396.
- Is the blade powered up? Is the HBA port BIOS enabled? If not, is the blade booted into the operating system? The HBA port BIOS only needs to be enabled if the blade server is booting to a remote logical drive on the SAN. This is a configuration setting on the HBA card that can be checked by booting the blade to the Qlogic BIOS setup utility using Ctrl-Q or the Emulex BIOS setup utility using Alt-E, or by using a client-server application such as SANsurfer to interrogate the card. You can also use the BIOS setup utility to verify the HBA firmware level is up to date. If this blade server is booting from a local drive, so that the HBA port BIOS is not enabled, then the HBA driver must be loaded under the operating system for the HBA port to log in.
- If the blade is powered on and its HBA port BIOS is enabled, then suspect the HBA or the corresponding FSM internal port. Is the FSM internal port connected to this blade enabled? If only one blade is failing, then the problem is most likely with the HBA. Verify there is not a speed mismatch between the HBA and the fibre switch port connected to it. For instance, if the chassis contains a 4 Gb FSM and the blade contains a 2 Gb HBA, make sure the FSM port speed is configured properly.
- If multiple blades in the same chassis are having port login problems and they all have their HBA ports enabled, then the problem is most likely with the FSM. Verify each fibre channel switch port attached to an HBA is enabled. Verify the external switch ports connected to the fabric are enabled. Verify the FSM has a valid license key to enable the appropriate blade switch ports. For instance, a 10-port Qlogic FSM needs to have a separate license key added to connect to more than 7 blades with HBAs in the chassis.
- If the blade is powered on and running the operating system, then verify the correct driver version is loaded. If a multipath driver is installed, such as the IBM RDAC driver, then remove the driver and reboot the blade to simplify the configuration. If the HBA firmware and the operating system driver are up to date and the driver is loaded properly, then the HBA card is probably defective. You should see fibre device related errors in the operating system log that confirm this. Again, if multiple blades with HBAs are all failing to log in, then suspect the FSM.
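The rule of thumb running through this section (one failing blade points at the HBA, several failing blades point at the FSM) can be written down as a tiny triage helper. This is a sketch only: `suspect_component` is a hypothetical name and the counts are assumed to come from a manual check of the FSM port list.

```python
def suspect_component(failing_blades: int) -> str:
    """Suggest where to start, given how many blades fail to log in on one FSM."""
    if failing_blades <= 0:
        return "none"
    if failing_blades == 1:
        return "HBA"  # a single failure is most likely local to that blade
    return "FSM"      # several failures share the switch module


print(suspect_component(1))  # HBA
print(suspect_component(5))  # FSM
```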
D. The blade cannot see any remote logical drives:
- Check that the storage controller is available by viewing the "Name Server" list in the FSM in the BladeCenter, or nearest to it if an OPM is in use. This will show the port on which it is visible.
- Verify the zoning configuration is set up as recommended in the overview above. If the HBA port is in a zone with one or more additional host ports, this can cause this symptom.
- Check that the LUN(s) involved are on the expected controller of the redundant pair in the SAN storage controller, particularly if this issue occurs during setup when only one path is enabled. Run the management software for the SAN Storage Controller (Storage Manager for DS4000 products) and verify the WWN for the host HBA port is visible. Verify that a logical drive is mapped to the host. If the SAN Storage Controller is a DS4000 product then check which controller path to the host is active for this logical drive.
- Boot the blade to the HBA BIOS setup utility using Ctrl-Q or Alt-E. Select the host adapter address corresponding to the active path and select the scan for devices option. Is a logical drive with a LUN ID seen by the port? If not, suspect the HBA. Verify the HBA firmware is at the latest level found on the IBM support site. If no blades attached to the same SAN Storage Controller are seeing their remote drives, then this might be a SAN Storage Controller hardware or firmware problem. Verify the SAN Storage Controller is running the latest code as specified on the vendor's support site.
- If a redundant path has been configured to the SAN Storage Controller, or if both HBA ports have an active connection to it, verify that the appropriate multipath driver is loaded. Having multiple instances of the same LUN available to an operating system can have unexpected consequences. If the blade is set up to boot from SAN, then check the setting for Automatic Volume Transfer (AVT) in Storage Manager for a DS4000 type SAN and make sure it is enabled. If the blade is a local operating system install running Windows, then AVT should be disabled to allow RDAC/MPIO to control the DS4000 storage array.
- If you cannot get one path to the SAN Storage Controller to work, then try the other path. You'll need to find out how to activate the second SAN Storage Controller path for this blade's assigned logical drive using the storage management software. After assigning the logical drive to the other path to the host, repeat the HBA BIOS scan described earlier in this section (the fourth item) and select the other host adapter address.
- If multiple blades are failing then re-check the entire fabric including all switches in the paths between the hosts and the SAN Storage Controllers. Collect the fibre switch logs. Check the logs for link problems. Check the SAN Storage Controller management logs for link problems. Try simplifying the path between the blade host and the SAN Storage Controller, i.e. insert only one fibre switch in the path, to help isolate a possible fibre channel link issue.
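The zoning check in this section (an HBA port sharing a zone with other host ports) is easy to automate once the zone membership has been exported. The Python sketch below assumes the zones are available as a plain dictionary of zone name to WWPN list; the zone names, WWPNs, and the `zones_with_extra_hosts` helper are all hypothetical.

```python
from typing import Dict, List


def zones_with_extra_hosts(zones: Dict[str, List[str]],
                           host_wwpns: List[str]) -> Dict[str, List[str]]:
    """Return the zones that contain more than one host (initiator) port."""
    hosts = set(host_wwpns)
    flagged = {}
    for name, members in zones.items():
        hosts_in_zone = sorted(set(members) & hosts)
        if len(hosts_in_zone) > 1:
            flagged[name] = hosts_in_zone
    return flagged


# Placeholder data: zone_b wrongly holds two blade HBA ports.
hosts = ["21:00:00:00:00:00:00:01", "21:00:00:00:00:00:00:02"]
zones = {
    "zone_a": ["21:00:00:00:00:00:00:01", "20:04:00:00:00:00:00:aa"],
    "zone_b": hosts + ["20:04:00:00:00:00:00:aa"],
}
print(zones_with_extra_hosts(zones, hosts))
```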
E. The blade can see the remote logical drive but cannot boot to it:
- Some HBAs with old part numbers will work in the blade but will not support booting from SAN. If this is a new installation and it is a Type 8842 blade, see MIGR-62662.
- If this is a new installation and it's an 8839 type blade, see MIGR-58396.
- Ensure that the HBA BIOS has been enabled in both the Host Adapter settings and the Selectable Boot settings in the HBA BIOS setup utility menus.
Note: IBM best practices for setting up a remote boot environment say to enable HBA port BIOS for only one of the HBA ports and to connect only that one port to the fabric during operating system installation and until the failover driver, i.e. RDAC, is installed. The port chosen is usually the blade port that connects to the switch in chassis Bay 3.
- Check that the HBA has first priority in the system BIOS for boot.
- Verify the HBA port with BIOS enabled is on the preferred, active path to the SAN Storage Controller. Use the storage management software to verify that the active path for the logical drive mapped to this host is connected to the HBA port with BIOS enabled.
- Check that the Start Options have Hard Drive 0 as the first non-removable bootable device option. The HBA BIOS can also install itself as Hard Disk 1, but this can have unexpected consequences on operating system load, particularly in Linux.
Note: For the JSxx pSeries blades, the name of the remote boot device is Hard Drive 2.
- Re-install the operating system image on the logical drive, or map a different logical drive to the blade host that already has a known good operating system image installed on it.
F. The blade was running fine but was rebooted and now it won't boot to the remote logical drive:
- Ensure that the SAN Storage Controller can respond to requests for data from the LUN by transferring ownership of the LUN between its internal controllers, e.g. by enabling AVT on a DS4000 series controller.
- Ensure that both HBA ports are enabled for boot in the Host Adapter settings and Selectable Boot settings (requires Qlogic HBA BIOS later than 1.29).
Note: IBM installation guidelines for boot from SAN setup say to enable only one path between the host and the SAN Storage Controller for booting during operating system installation. After the operating system has been installed on the remote logical drive and the blade boot sequence has been verified, the user may go back in to the HBA firmware setup utility and enable BIOS for the second port. Many IT administrators forget to do this, especially since it is not needed during normal operation of the blade server. IBM DS4000 series storage units can automatically switch a logical drive from its preferred to its secondary path for performance reasons. If the blade is already running, the multipath driver running on the blade will automatically switch traffic to the secondary path and the blade will not go down. If the blade is rebooted, however, the HBA port enabled for booting will not see the logical drive because it has been moved to the secondary path, and the blade will not boot to the operating system.
- Check that an invalid boot device has not been configured in the Selectable Boot settings. Since HBA BIOS 1.29 it has not been necessary to specify any entries to boot from; the first available LUN is used. If a recent change has been made to the number of LUNs or storage controllers available to the blade, a non-bootable LUN may now be the "first available" and a particular boot LUN may need to be specified.
- Check the setup from "first principles", i.e. switch zoning, LUN mapping, multipath driver, fibre path and boot settings as any of these can be changed while the operating system on the server is running and only surface as a problem when the system is restarted.
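The reboot failure described in the note above comes down to one condition: the logical drive's active path must end on an HBA port that has boot BIOS enabled. A minimal Python sketch of that check follows; the port labels and the `can_boot_from_san` helper are hypothetical, and the inputs are assumed to be read from the HBA BIOS setup utility and the storage management software.

```python
from typing import Set


def can_boot_from_san(bios_enabled_ports: Set[str], lun_active_port: str) -> bool:
    """The blade can boot only if the LUN's active path ends on a boot-enabled port."""
    return lun_active_port in bios_enabled_ports


# Only port 1 was enabled during installation; the LUN later moved to the path on port 2.
print(can_boot_from_san({"port1"}, "port2"))           # False
print(can_boot_from_san({"port1", "port2"}, "port2"))  # True
```

Enabling boot BIOS on both ports after installation makes the check pass whichever path the storage unit has selected.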
G. The fibre HBA was replaced and now the blade doesn't connect to the fabric or boot to the operating system:
- Check the firmware level on the HBA and update if necessary.
Note: HBA firmware settings may change back to defaults after updating the HBA firmware. Boot the blade to the HBA BIOS setup utility to verify.
- Check the replacement part number for the new HBA and verify it is compatible with the blade.
- Check that the switch and SAN storage controller have had the new WWNs configured into the zone or mapping where necessary, i.e. WWN zoning on switches, host port mapping on storage controllers.
- Check that the replacement HBA is of the same type as the original. Although the same driver is used, not all HBAs have the same Plug and Play ID, and the OS, particularly Windows, may not recognize it as the boot controller.
- Check for any speed mismatch between the HBA and FSM or OPM.
H. The BladeCenter FSM is not merging into the existing fabric:
- Log in to the MM and check under I/O Module Tasks > Admin/Power/Restart > I/O Module Advanced Setup to verify that External ports are Enabled.
- If a heterogeneous mix of switches is in use, ensure Interop mode is enabled on all FSMs, whether internal or external to the BladeCenter (see the Fibre Channel Switch Interoperability Guide, MIGR-58206).
- The BladeCenter FSM should not be powered on before it is connected to the fabric. If it was, power off the BladeCenter FSM, ensure that a valid connection exists to the SAN, and power it up again. Re-check the fabric status.
- Ensure that a compatible level of operating system exists on the BladeCenter FSMs and the external FSMs, particularly those directly connected to the BladeCenter FSM. Where possible, check the interoperability matrices for the products.
a. Back up the current switch configuration.
b. Verify that the correct version of switch firmware is installed on each switch.
c. Ensure that each switch has a unique Domain ID and that it falls within the proper range.
d. Set all switches to the appropriate timeout values.
e. Ensure that all Zonesets and Zone names are unique.
f. Ensure that all zone members are specified by WWPN (Interoperability mode).
g. Read the Zone merging section in the user's guide for the BladeCenter FSM and the external fibre switches for more information.
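Several of the checks in steps c, e and f can be run against exported switch configurations before attempting the merge. The Python sketch below assumes each switch is described by a simple dictionary; that layout, the sample values, and the `merge_problems` helper are illustrative assumptions rather than the format of any real switch tool. The valid Domain ID range is vendor specific and is deliberately not checked here.

```python
import re
from collections import Counter
from typing import Dict, List

# A WWPN written as eight colon-separated hex bytes.
WWPN_RE = re.compile(r"^([0-9a-f]{2}:){7}[0-9a-f]{2}$", re.IGNORECASE)


def merge_problems(switches: List[Dict]) -> List[str]:
    """List conflicts likely to block a fabric merge."""
    problems = []
    # Step c: every switch needs its own Domain ID.
    domain_counts = Counter(sw["domain_id"] for sw in switches)
    for domain_id, count in sorted(domain_counts.items()):
        if count > 1:
            problems.append(f"duplicate domain ID {domain_id}")
    zone_owner = {}
    for sw in switches:
        for zone_name, members in sw["zones"].items():
            # Step e: zone names must be unique across the switches.
            if zone_name in zone_owner:
                problems.append(
                    f"zone {zone_name} defined on {zone_owner[zone_name]} and {sw['name']}")
            else:
                zone_owner[zone_name] = sw["name"]
            # Step f: interoperability mode expects WWPN zone members.
            for member in members:
                if not WWPN_RE.match(member):
                    problems.append(f"zone {zone_name} member {member} is not a WWPN")
    return problems


switches = [
    {"name": "fsm_bay3", "domain_id": 1, "zones": {"z1": ["21:00:00:00:00:00:00:01"]}},
    {"name": "external", "domain_id": 1, "zones": {"z1": ["1,4"]}},
]
for problem in merge_problems(switches):
    print(problem)
```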
I. Multi-path failover is not working:
- Check each path for functionality as per the five steps in the Technical Overview above. For example, if the preferred path uses HBA port 1, the FSM in Bay 3 and SAN Storage Controller A, then verify that the second HBA port, the FSM in Bay 4 and SAN Storage Controller B are all connected and zoned together. Disconnect the preferred path at the BladeCenter FSM and then proceed to debug the secondary path. Verify the storage controller software has moved the logical drive to the secondary path.
- Ensure any failing path has valid FSM zone and storage LUN mapping.
- Ensure a valid multipath driver is installed and that it is compatible with the HBA driver used. Usually the HBA and multipath drivers come in one package. For IBM DS4000 series SANs, the multipath driver version must be in sync with the version of Storage Manager running on the SAN Storage Controllers. It must also be compatible with the level of firmware installed on the SAN Storage Controllers. Every release of Storage Manager comes with its own version of the RDAC multipath driver. Check the IBM TotalStorage support web site for more information.
- If the chassis is connected to a non-IBM storage controller, check the vendor's support documentation to verify you are using the correct multi-path driver.
- If MS MPIO is in use, ensure the correct hardware specific DSM is used. Although the MPIO module is hardware independent, it requires the correct DSM which is hardware specific to be loaded.
J. Other fabric issues and intermittent connectivity issues:
- IBM recommends a simple path A and path B configuration between the BladeCenter and the SAN Storage Controllers, where every blade host has its own zoneset and every HBA port is zoned to only one SAN Storage Controller port per storage system. Some customers like to create a mesh configuration for added redundancy, where one HBA port might be zoned to both SAN Storage Controllers in one storage system or one SAN Storage Controller might be zoned to both HBA ports in one blade. The best way to debug a configuration like this is to disconnect some of the links and edit some of the zonesets until you have a completely separate path A and path B configuration. Focus on getting path A to work, then move on to path B. Once you've verified that both of the simple paths are functioning, add in the links and zones that combine the paths.
- Use the fibre switch logs to track down link stability or link performance issues. Look for commands that show port statistics for the switch. Compare the statistics for a known good link to those for a suspect link. If the port statistics for the suspect link show an order of magnitude more link failures and bad frames, then look for faulty fibre cables or switch transceiver modules. Any fibre cable that has been coiled tighter than the diameter of a baseball is likely to have defects and should be replaced. A high number of flow errors (buffer full type errors) might indicate a switch configuration problem. Check zonesets to make sure each host has its own zone.
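The port statistics comparison above can be sketched as a small Python helper. The counter names (`link_failures`, `bad_frames`) and the sample numbers are assumptions for illustration, since each switch vendor labels and exports its statistics differently. This sketch also flags a link when either counter is ten times worse than the known good link, which is a slightly stricter reading than requiring both.

```python
from typing import Dict


def link_is_suspect(good: Dict[str, int], suspect: Dict[str, int], factor: int = 10) -> bool:
    """Flag a link whose error counters are an order of magnitude above a known good link."""
    for counter in ("link_failures", "bad_frames"):
        # Treat a zero baseline as 1 so that a clean reference link still gives a usable threshold.
        baseline = max(good.get(counter, 0), 1)
        if suspect.get(counter, 0) >= factor * baseline:
            return True
    return False


good_port = {"link_failures": 2, "bad_frames": 5}
bad_port = {"link_failures": 40, "bad_frames": 7}
print(link_is_suspect(good_port, bad_port))   # True
print(link_is_suspect(good_port, good_port))  # False
```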
| Workaround |
|---|
None.
| Additional Information |
|---|
None.
Document Location
Worldwide
Document Information
Modified date:
29 January 2019
UID
ibm1MIGR-5071370