IBM Support

IBM ESS Alert: Simultaneous canister power cycle, missing NVMe drives and out of range sensor data in ESS 3500 5141-FN2 storage enclosures

Troubleshooting


Problem

Enclosure management software running in the Baseboard Management Controller (BMC) of each canister in an ESS 3500 5141-FN2 enclosure can exhibit the following conditions and symptoms:

1) Enclosure management software can incorrectly detect that the 3.3V Electronic Circuit Breaker in a power supply assembly is open and reset that power supply. In rare conditions, both power supplies are reset at the same time, which causes a momentary loss of power to both canisters and reboots them simultaneously.

2) Enclosure management software can incorrectly read the state of the NVMe drive hot plug register and can leave the drive in a stuck or unallocated state as seen from a canister. This can result in one or more missing paths to an NVMe drive.

3) Enclosure management software can intermittently read incorrect power supply and enclosure fan operational values (such as power supply input/output voltage and current, and fan rpm) and presence. In rare cases, health monitoring commands such as mmhealth and mmlsenclosure can sample the incorrect values presented by the enclosure management software and generate a false alert, including a call home.

Symptom

ESS 3500 can show one or more of the following symptoms:

1) Canister reboot
        • System can report the reboot of both canisters including the BMC at the same time as seen from the uptime. In most cases, the reboot will not show any stored vmcore for the reboot.

2) Missing NVMe
        • System can show NVMe drive missing or path missing in the pdisk list:
              mmvdisk pdisk list -L --rg <rg> --da <nvme DA> --not-ok
             
Where <rg> is the recoverygroup name and <nvme DA> is the NVMe declustered array name such as DA1.

        • The NVMe can show the drive is powered off from one canister in the dmesg output:
              [Thu Jan 26 09:49:14 2022] pcieport 0000:00:01.7: Slot(12): Powering off due to button press
              [Thu Jan 26 09:49:19 2022] pci 0000:07:00.0: Removing from iommu group 42
              [Thu Jan 26 09:49:20 2022] pcieport 0000:00:01.7: Slot(12): Power fault

3) Intermittent false alerts

       • May show intermittent NATIVE_RAID DEGRADED in mmhealth node show and in mmhealth node eventlog
      2023-02-10 04:56:55.889416 EST    enclosure_needsservice                 WARNING   Enclosure 78XXXXX needs service.
      2023-02-10 04:56:55.892811 EST    power_supply_failed                    WARNING   Power supply psu2_right_id1 is FAILED.
      2023-02-10 04:56:55.894619 EST    voltage_sensor_failed                  WARNING   Voltage sensor psu2_v_out_id38 is FAILED.
      2023-02-10 05:01:55.252427 EST    enclosure_ok                           INFO      Enclosure 78XXXXX is OK.
      2023-02-10 05:01:55.255970 EST    power_supply_ok                        INFO      Power supply psu2_right_id1 is OK.
      2023-02-10 05:01:55.257572 EST    voltage_sensor_ok                      INFO      Voltage sensor psu2_v_out_id38 is OK

      • May show intermittent "enclosure needs service" in mmlsenclosure.

      • May show intermittent power supply parameters in ipmitool sel elist out of range as shown below:
       01/21/2023 | 00:43:34 | Voltage PSU1_V_IN | Lower Critical going low  | Asserted | Reading 0 < Threshold 180 Volts
       01/21/2023 | 00:43:34 | Voltage PSU1_V_IN | Lower Non-recoverable going low  | Asserted | Reading 0 < Threshold 170 Volts
       01/21/2023 | 00:43:34 | Voltage PSU1_V_OUT | Lower Non-critical going low  | Asserted | Reading 0 < Threshold 10.50 Volts
       01/21/2023 | 00:43:34 | Voltage PSU1_V_OUT | Lower Critical going low  | Asserted | Reading 0 < Threshold 10 Volts
       01/21/2023 | 00:43:34 | Voltage PSU1_V_OUT | Lower Non-recoverable going low  | Asserted | Reading 0 < Threshold 9 Volts
       01/21/2023 | 00:43:34 | Fan PSU1_FAN_TACH | Lower Non-critical going low  | Asserted | Reading 0 < Threshold 2500 RPM
       01/21/2023 | 00:43:34 | Fan PSU1_FAN_TACH | Lower Critical going low  | Asserted | Reading 0 < Threshold 2000 RPM
       01/21/2023 | 00:43:34 | Fan PSU1_FAN_TACH | Lower Non-recoverable going low  | Asserted | Reading 0 < Threshold 1000 RPM

      • May show intermittent power fault sensor in ipmitool sensor:
         PSU1_FAULT_SEN   | 0x0   | discrete | 0x80c0| na  | na  | na   | na   | na  | na
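
The log signatures listed above can be searched for with a small filter. The following is a minimal sketch, not an IBM-provided tool: the function names hotplug_events and zero_sel_readings are hypothetical, and the patterns are taken only from the excerpts shown in this document. A match indicates that the log resembles these symptoms. It does not prove that the BMC defect is the cause, because a genuine power supply fault also produces zero readings.

```shell
# hotplug_events: read kernel log text on stdin and keep only the hot-plug
# messages quoted in the dmesg excerpt (slot powered off, power fault).
hotplug_events() {
    grep -E 'Slot\([0-9]+\): (Powering off due to button press|Power fault)'
}

# zero_sel_readings: read `ipmitool sel elist` text on stdin and print, once
# each, the sensors that asserted a threshold event with a reading of 0.
# The sensor name is taken as the field two positions before "Asserted", so
# the filter works with or without a leading record ID column.
zero_sel_readings() {
    awk -F'|' '/Reading 0 </ {
        for (i = 3; i <= NF; i++) {
            f = $i
            gsub(/^[ \t]+|[ \t]+$/, "", f)
            if (f == "Asserted") {
                name = $(i - 2)
                gsub(/^[ \t]+|[ \t]+$/, "", name)
                if (!(name in seen)) { seen[name] = 1; print name }
                break
            }
        }
    }'
}

# Example use on a canister (commented out so that sourcing has no effect):
#   dmesg -T | hotplug_events
#   ipmitool sel elist | zero_sel_readings
```

Both functions read from standard input, so they can also be run against saved log files from a support data collection.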

Cause

Enclosure management software in the BMC reads power supply status, enclosure fan status, and drive hot plug registers over the Power Management Bus (PMBus) and I2C bus. The BMC software had a defect in its PMBus and I2C bus error handling and recovery, which resulted in intermittent incorrect readings.

Environment

ESS 3500 enclosure solution running any ESS release prior to ESS release 6.1.6.0. The ESS release 6.1.6.0 incorporates updated BMC software version 12.63 with improved I2C bus error handling and recovery.

The BMC firmware level can be obtained from the ipmitool mc info command:

# ipmitool mc info | grep ^Firmware
Firmware Revision         : 12.63
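
The firmware check above can be turned into a pass/fail test. The following is a minimal sketch under stated assumptions: the function names bmc_version, version_lt, and check_bmc are hypothetical, and the comparison treats the revision as dot-separated numbers compared component by component, which is an assumption about the BMC version format rather than something this document specifies.

```shell
# BMC firmware level that this alert names as containing the fix.
REQUIRED_BMC="12.63"

# bmc_version: read `ipmitool mc info` output on stdin, print the revision.
bmc_version() {
    awk -F': *' '/^Firmware Revision/ {
        gsub(/[ \t\r]+$/, "", $2)
        print $2
        exit
    }'
}

# version_lt A B: succeed when dotted version A is numerically lower than B.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1
    }'
}

# check_bmc: read `ipmitool mc info` output on stdin and report the result.
# Returns 0 at or above the required level, 1 below it, 2 if not found.
check_bmc() {
    v="$(bmc_version)"
    if [ -z "$v" ]; then
        echo "BMC firmware revision not found"
        return 2
    fi
    if version_lt "$v" "$REQUIRED_BMC"; then
        echo "BMC $v is below $REQUIRED_BMC: update recommended"
        return 1
    fi
    echo "BMC $v is at or above $REQUIRED_BMC"
}

# Example use on a canister (commented out so that sourcing has no effect):
#   ipmitool mc info | check_bmc
```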

Resolving The Problem

Affected customers need to upgrade to ESS 6.1.6.0 or later. In certain instances, the missing NVMe issue might persist after the upgrade. If it does, contact IBM Support.

See the following link for more details of ESS 6.1.6.0.

https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.6&platform=All&function=all

While IBM recommends a full system upgrade, the BMC firmware can be upgraded to the 12.63 level without a full ESS upgrade. To do so, download firmware release gpfs.ess.firmware-6.1.6.0-8.x86_64.rpm from Fix Central and then run mmchfirmware on the ESS 3500 nodes as follows:

1) Prepare for the upgrade by completing all preparation steps and checks stated in the Deployment and Upgrade procedure.

Download the 6.1.6.0-8 ESS firmware from the following link:

https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.6&platform=All&function=all#ESS%20Firmware

Note: If only the firmware is upgraded, without a full system upgrade to release 6.1.6.0, and the system is later upgraded to a release level below 6.1.6.0, the BMC firmware is rolled back to the level included in that release.

2) Install the firmware rpm in each canister.

     cd to the directory where the rpm is saved and run:

     # yum upgrade ./gpfs.ess.firmware-6.1.6.0-8.x86_64.rpm

     Verify that the rpm is upgraded:

     # rpm -qa | grep gpfs.ess.firmware
     gpfs.ess.firmware-6.1.6.0-8.x86_64

3) Update the canister firmware using mmchfirmware. Run the command from each canister, one at a time:

  # mmchfirmware --type storage-enclosure --serial-number <serial num of enclosure> -N canisterA_node_name
 
   Where <serial num of enclosure> is the serial number of the ESS 3500 enclosure.

     Example:

     # mmchfirmware --type storage-enclosure --serial-number 78EXXXX -N ess3500a2-hs

  mmchfirmware: Processing node ess3500a2-hs
  ess3500a2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
  ess3500a2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.0-8.
  ess3500a2-hs: Updating enclosure firmware ESM_A.
  ess3500a2-hs: Found storage-enclosure 5141-FN2 78E4004, update-id
  /usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500fw.111Y.tar.
  ess3500a2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
  ess3500a2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.0-8.
  ess3500a2-hs: Updating enclosure firmware ESM_B.
 

  # mmchfirmware --type storage-enclosure --serial-number <serial num of enclosure> -N canisterB_node_name
 
   Where <serial num of enclosure> is the serial number of the ESS 3500 enclosure.

     Example:

    # mmchfirmware --type storage-enclosure --serial-number 78E4004 -N ess3500b2-emsvm-hs

  mmchfirmware: Processing node ess3500b2-emsvm-hs.test.net
  ess3500b2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
  ess3500b2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.0-8.
  ess3500b2-hs: Updating enclosure firmware ESM_A.
  ess3500b2-hs: Found storage-enclosure 5141-FN2 78E4004, update-id
  /usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500fw.111Y.tar.
  ess3500b2-hs: update-directory /usr/lpp/mmfs/updates/latest/firmware/enclosure/.
  ess3500b2-hs: [I]Found storage-enclosure firmware update-id firmwareTable version 6.1.6.0-8.
  ess3500b2-hs: Updating enclosure firmware ESM_B.
 

4) On each canister node, run the following command to verify that the BMC firmware is updated to 12.63:

     # /usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500lsfw.sh -v BMC 

  BMC: 12.63
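
Because the update in step 3 is run for one canister at a time, it can help to write out the exact command sequence before starting. The following is a minimal sketch that only prints the commands and does not run them. The function name print_update_plan is hypothetical, and the serial number and node names in the usage comment are the placeholders from the examples above. It uses only the mmchfirmware options and the verification script path shown in this document.

```shell
# print_update_plan SERIAL NODE [NODE...]
# Print (do not execute) one mmchfirmware command per canister node in the
# order given, followed by the step 4 verification command for each node.
print_update_plan() {
    if [ "$#" -lt 2 ]; then
        echo "usage: print_update_plan SERIAL NODE [NODE...]" >&2
        return 1
    fi
    serial="$1"
    shift
    for node in "$@"; do
        echo "mmchfirmware --type storage-enclosure --serial-number $serial -N $node"
    done
    for node in "$@"; do
        echo "# on $node: /usr/lpp/mmfs/updates/latest/firmware/enclosure/ess3500lsfw.sh -v BMC"
    done
}

# Example (commented out so that sourcing has no effect):
#   print_update_plan 78E4004 ess3500a2-hs ess3500b2-hs
```

Printing rather than executing keeps the operator in control of when each canister is updated, which matches the one-at-a-time instruction in step 3.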

Note: Internal reference D.307419

Document Location

Worldwide


Document Information

Modified date:
21 March 2023

UID

ibm16964206