Troubleshooting
Problem
Enclosure management software running in the Baseboard Management Controller (BMC) of each canister in an ESS 3500 5141-FN2 enclosure can exhibit the following conditions and symptoms:
1) Enclosure management software can incorrectly detect that the 3.3V Electronic Circuit Breaker in a power supply assembly is open and reset that power supply. In rare conditions, both power supplies are reset at the same time, which causes a momentary loss of power to both canisters and reboots them simultaneously.
2) Enclosure management software can incorrectly read the state of the NVMe drive hot plug register and leave the drive in a stuck or unallocated state as seen from that canister. This can result in one or more missing paths to an NVMe drive.
3) Enclosure management software can intermittently read incorrect power supply and enclosure fan operational values (such as power supply input/output voltage/current and fan rpm) and presence. In rare cases, health monitoring tools such as mmhealth and mmlsenclosure can sample the incorrect values presented by the enclosure management software and generate a false alert, including a call home event.
Symptom
ESS 3500 can show one or more of the following symptoms:
1) Canister reboot
• System can report that both canisters, including their BMCs, rebooted at the same time, as seen from the uptime. In most cases, no vmcore is stored for the reboot.
2) Missing NVMe
• System can show NVMe drive missing or path missing in the pdisk list:
mmvdisk pdisk list -L --rg <rg> --da <nvme DA> --not-ok
Where <rg> is the recoverygroup name and <nvme DA> is the NVMe declustered array name such as DA1.
• The dmesg output on one canister can show the NVMe drive being powered off:
[Thu Jan 26 09:49:14 2022] pcieport 0000:00:01.7: Slot(12): Powering off due to button press
[Thu Jan 26 09:49:19 2022] pci 0000:07:00.0: Removing from iommu group 42
[Thu Jan 26 09:49:20 2022] pcieport 0000:00:01.7: Slot(12): Power fault
3) Intermittent false alerts
• May show intermittent NATIVE_RAID DEGRADED in mmhealth node show and in mmhealth node eventlog:
2023-02-10 04:56:55.889416 EST enclosure_needsservice WARNING Enclosure 78XXXXX needs service.
2023-02-10 04:56:55.892811 EST power_supply_failed WARNING Power supply psu2_right_id1 is FAILED.
2023-02-10 04:56:55.894619 EST voltage_sensor_failed WARNING Voltage sensor psu2_v_out_id38 is FAILED.
2023-02-10 05:01:55.252427 EST enclosure_ok INFO Enclosure 78XXXXX is OK.
2023-02-10 05:01:55.255970 EST power_supply_ok INFO Power supply psu2_right_id1 is OK.
2023-02-10 05:01:55.257572 EST voltage_sensor_ok INFO Voltage sensor psu2_v_out_id38 is OK
• May show intermittent "enclosure needs service" in mmlsenclosure.
• May show intermittent out-of-range power supply parameters in ipmitool sel elist, as shown below:
01/21/2023 | 00:43:34 | Voltage PSU1_V_IN | Lower Critical going low | Asserted | Reading 0 < Threshold 180 Volts
01/21/2023 | 00:43:34 | Voltage PSU1_V_IN | Lower Non-recoverable going low | Asserted | Reading 0 < Threshold 170 Volts
01/21/2023 | 00:43:34 | Voltage PSU1_V_OUT | Lower Non-critical going low | Asserted | Reading 0 < Threshold 10.50 Volts
01/21/2023 | 00:43:34 | Voltage PSU1_V_OUT | Lower Critical going low | Asserted | Reading 0 < Threshold 10 Volts
01/21/2023 | 00:43:34 | Voltage PSU1_V_OUT | Lower Non-recoverable going low | Asserted | Reading 0 < Threshold 9 Volts
01/21/2023 | 00:43:34 | Fan PSU1_FAN_TACH | Lower Non-critical going low | Asserted | Reading 0 < Threshold 2500 RPM
01/21/2023 | 00:43:34 | Fan PSU1_FAN_TACH | Lower Critical going low | Asserted | Reading 0 < Threshold 2000 RPM
01/21/2023 | 00:43:34 | Fan PSU1_FAN_TACH | Lower Non-recoverable going low | Asserted | Reading 0 < Threshold 1000 RPM
• May show intermittent power fault sensor in ipmitool sensor:
PSU1_FAULT_SEN | 0x0 | discrete | 0x80c0| na | na | na | na | na | na
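The log patterns shown above can be searched for with a couple of small filters. The following is a minimal sketch, not an IBM-provided tool: the function names zero_reading_sensors and hotplug_poweroff_slots are hypothetical, and the match strings are taken only from the sample output in this document, so other firmware or kernel levels may word these messages differently. A reading of 0 can also come from a real hardware fault, so a match only marks an entry for closer inspection.

```shell
#!/bin/bash
# Hypothetical helpers; the patterns come from the sample logs in this document.

# Reads `ipmitool sel elist` output on stdin and prints, per sensor, how many
# asserted events carry a reading of exactly 0 (format: "<count> <sensor>").
zero_reading_sensors() {
    awk -F' *[|] *' '{
        # The sensor name sits two fields before the "Asserted" field, whether
        # or not the line starts with a SEL record ID.
        for (i = 3; i < NF; i++)
            if ($i == "Asserted" && $(i + 1) ~ /^Reading 0 </)
                count[$(i - 2)]++
    }
    END { for (s in count) print count[s], s }' | LC_ALL=C sort -k2
}

# Reads dmesg output on stdin and prints the unique PCIe slot numbers that
# logged "Powering off due to button press".
hotplug_poweroff_slots() {
    sed -n 's/.*Slot(\([0-9][0-9]*\)): Powering off due to button press.*/\1/p' |
        sort -un
}
```

On a canister, these could be used as `ipmitool sel elist | zero_reading_sensors` and `dmesg | hotplug_poweroff_slots`.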
Cause
Environment
ESS 3500 enclosure solution running any ESS release prior to ESS release 6.1.6.0. The ESS release 6.1.6.0 incorporates updated BMC software version 12.63 with improved I2C bus error handling and recovery.
The BMC firmware level can be obtained from the ipmitool mc info command:
# ipmitool mc info | grep ^Firmware
Firmware Revision : 12.63
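To decide whether a canister is below the fixed level, the revision reported by ipmitool can be compared with 12.63. The sketch below is illustrative only: bmc_revision, version_lt, and bmc_needs_update are hypothetical helper names, and the comparison assumes that the revision is a dot-separated list of numbers, as in the sample output above.

```shell
#!/bin/bash
# Hypothetical helpers; 12.63 is the BMC level named in this document.
REQUIRED_BMC="12.63"

# Reads `ipmitool mc info` output on stdin and prints the Firmware Revision value.
bmc_revision() {
    awk -F': *' '/^Firmware Revision/ {
        gsub(/[ \t\r]+$/, "", $2)   # drop trailing whitespace
        print $2
        exit
    }'
}

# Succeeds (exit 0) when version $1 is strictly lower than version $2.
# Fields are compared numerically, left to right; missing fields count as 0.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1                      # equal versions are not "lower"
    }'
}

# Reads `ipmitool mc info` output on stdin.
# Exit 0: BMC is below REQUIRED_BMC; 1: at or above it; 2: no revision found.
bmc_needs_update() {
    local rev
    rev=$(bmc_revision)
    [ -n "$rev" ] || return 2
    version_lt "$rev" "$REQUIRED_BMC"
}
```

For example, `ipmitool mc info | bmc_needs_update && echo "BMC is below $REQUIRED_BMC"` prints the message only on a canister that still needs the update.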
Resolving The Problem
Affected customers need to upgrade to ESS 6.1.6.0 or later. In certain instances, the missing NVMe issue might persist. Please contact IBM Support if you experience this.
See the following link for more details of ESS 6.1.6.0.
While IBM recommends a full system upgrade, the BMC firmware can be upgraded to the 12.63 level without a full ESS upgrade. To do so, download firmware release gpfs.ess.firmware-6.1.6.0-8.x86_64.rpm from Fix Central and then run mmchfirmware on the ESS 3500 nodes as follows:
1) Prepare for upgrade and follow all necessary preparation and checks for upgrade as stated in the Deployment and Upgrade procedure.
Download the 6.1.6.0-8 ESS firmware from the following link:
Note: if only the firmware upgrade is performed without a full system upgrade to release 6.1.6.0 and if a subsequent system upgrade is performed to a release level below 6.1.6.0, the BMC firmware will be rolled back to that release level.
2) Install the firmware rpm in each canister.
cd to the directory where the rpm is saved and run:
Verify that the rpm is upgraded:
# rpm -qa | grep gpfs.ess.firmware
3) Update the canister firmware using mmchfirmware. Run the command from each canister, one at a time:
# mmchfirmware --type storage-enclosure --serial-number <serial num of enclosure> -N canisterA_node_name
Where <serial num of enclosure> is the serial number of the ESS 3500 enclosure.
Example:
# mmchfirmware --type storage-enclosure --serial-number 78EXXXX -N ess3500a2-hs
mmchfirmware: Processing node ess3500a2-hs
ess3500a2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
ess3500a2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.0-8.
ess3500a2-hs: Updating enclosure firmware ESM_A.
ess3500a2-hs: Found storage-enclosure 5141-FN2 78E4004, update-id
/usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500fw.111Y.tar.
ess3500a2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
ess3500a2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.0-8.
ess3500a2-hs: Updating enclosure firmware ESM_B.
# mmchfirmware --type storage-enclosure --serial-number <serial num of enclosure> -N canisterB_node_name
Where <serial num of enclosure> is the serial number of the ESS 3500 enclosure.
Example:
# mmchfirmware --type storage-enclosure --serial-number 78E4004 -N ess3500b2-emsvm-hs
mmchfirmware: Processing node ess3500b2-emsvm-hs.test.net
ess3500b2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
ess3500b2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.0-8.
ess3500b2-hs: Updating enclosure firmware ESM_A.
ess3500b2-hs: Found storage-enclosure 5141-FN2 78E4004, update-id
/usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500fw.111Y.tar.
ess3500b2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
ess3500b2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.0-8.
ess3500b2-hs: Updating enclosure firmware ESM_B.
4) On each canister node, run the following command to verify that the BMC firmware is updated to 12.63:
# /usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500lsfw.sh -v BMC
BMC: 12.63
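The two verification points in the procedure above (the installed rpm in step 2 and the BMC level in step 4) can be checked from captured command output. This is a rough sketch under stated assumptions: rpm_is_expected and lsfw_bmc_level are hypothetical names, the package string is derived from the rpm file name given earlier, and the "BMC: <level>" line format is taken from the sample output of ess3500lsfw.sh. A later ESS release ships a different package level, so the expected values would need to be adjusted.

```shell
#!/bin/bash
# Hypothetical helpers; expected values are the levels named in this document.
EXPECTED_RPM="gpfs.ess.firmware-6.1.6.0-8"
EXPECTED_BMC="12.63"

# Reads `rpm -qa | grep gpfs.ess.firmware` output on stdin.
# Succeeds when a line equals EXPECTED_RPM, optionally followed by ".<arch>".
rpm_is_expected() {
    local pkg
    while IFS= read -r pkg; do
        case "$pkg" in
            "$EXPECTED_RPM" | "$EXPECTED_RPM".*) return 0 ;;
        esac
    done
    return 1
}

# Reads `ess3500lsfw.sh -v BMC` output on stdin and prints the BMC level.
lsfw_bmc_level() {
    awk -F': *' '$1 == "BMC" { print $2; exit }'
}
```

A possible use on each canister is `rpm -qa | grep gpfs.ess.firmware | rpm_is_expected`, followed by comparing the output of `/usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500lsfw.sh -v BMC | lsfw_bmc_level` with `$EXPECTED_BMC`.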
Document Location
Worldwide
Document Information
Modified date:
21 March 2023
UID
ibm16964206