IBM Support

IBM Elastic Storage Server (ESS) 5000 Alert: NVDIMMs and Recovery Group fails to start

Flashes (Alerts)


Abstract

Failure analysis and repair actions for ESS5000 NVDIMM errors

Content

Symptom:
Customers may encounter this issue through either or both of the following symptoms:
  • During IPL, SRC BC23352C posts, calling out all four NVDIMMs, or two NVDIMMs behind the same processor, at mandatory priority; the system may still IPL with the NVDIMMs deconfigured.
    OR
  • The log tip pdisk state in the ESS Recovery Group of the corresponding ESS 5000 I/O Server node may become missing, and the corresponding Recovery Group may fail to start. The log tip devices may contain key metadata of that Recovery Group; if sufficient replicas of the metadata cannot be accessed, the Recovery Group containing user disks for the file system cannot be activated.
All models of ESS 5000 with software versions ESS 6.0.1.0 through 6.0.2.0 and ESS 6.1.0.0 are impacted.
Problem Isolation Aids:
If Symptom 1 is encountered, check the detailed data in the error log for SRC BC23352C for the following hex word 3 value:
Hex Words 2-5: 000000E0 00003500 00000000 03200000
Hex Words 6-9: 00000002 0003001E 00000000 00000000
After decoding the error, confirm that the reason code matches NVDIMM_CSAVE_ERROR. If so, this flash is applicable.
If Symptom 2 is encountered, check the error log for entries that report SRC BC23352C. If any are found, check the word 3 value as noted above; if it matches, this flash is applicable.
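As a rough sketch, a captured text dump of the error log (the file name here is hypothetical) could be scanned for SRC BC23352C entries whose hex word 3 carries the 00003500 value, using standard tools:

```shell
# Sketch only: scan a saved error-log dump for SRC BC23352C entries whose
# detailed data includes hex word 3 = 00003500 (the NVDIMM_CSAVE_ERROR
# signature described in this flash). The dump file name is an assumption;
# it is whatever file the platform error log was exported to.
scan_elog() {
    grep -B1 -A4 'BC23352C' "$1" | grep -q '00003500' \
        && echo "word 3 matches: flash applies" \
        || echo "no matching entries"
}
```

For example, `scan_elog eventlog.txt` would print a one-line verdict. Formal decoding of the SRC should still be done through the normal service tools; this only pre-filters a text dump.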
IBM Spectrum Scale RAID Log:
The following entries may be observed in the GPFS logs when a single ESS 5000 I/O Server node is power cycled and all four NVDIMM devices are in a deconfigured state:
Fri Sep 25 12:08:12.155 2020 dsk2c-io2 ST [W] Log tip DA NVR of RG rg_dsk2c-io2: insufficient spare space to complete rebalance. Unavailable disks in this DA may cause performance degradation.

Fri Sep 25 12:08:12.149 2020 dsk2c-io2 ST [I] Start rebalance of DA NVR in RG rg_dsk2c-io2.

Fri Sep 25 12:08:12.148 2020 dsk2c-io2 ST [D] Pdisk n003v002 of RG rg_dsk2c-io2 state changed from missing/00048.0c0 to missing/undrainable/00048.0d0.

Fri Sep 25 12:08:12.148 2020 dsk2c-io2 ST [W] Log tip DA NVR of RG rg_dsk2c-io2: insufficient spare space to complete rebuild. Unavailable disks in this DA may cause performance degradation.

Fri Sep 25 12:08:12.129 2020 dsk2c-io2 ST [I] Finished repairing RGD/VCD in RG rg_dsk2c-io2.

Fri Sep 25 12:08:12.011 2020 dsk2c-io2 ST [I] Start repairing RGD/VCD in RG rg_dsk2c-io2.

Fri Sep 25 12:08:11.689 2020 dsk2c-io2 ST [D] Pdisk n003v002 of RG rg_dsk2c-io2 state changed from diagnosing/00020.0c0 to missing/00048.0c0.
In this situation, two NVDIMM devices (/dev/pmem0 and /dev/pmem1) from the ESS 5000 I/O node would be missing. When this happens, the RG would recover, and file system access can be established; however, the missing devices must be recovered as soon as possible to prevent loss of access to the file system. The RG name and pdisk name are shown as examples.
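As a quick check on an affected I/O node, the visible pmem block devices can be counted; a minimal sketch, assuming a Linux I/O node where /dev/pmem0 and /dev/pmem1 back the NVR log tip pdisks:

```shell
# Sketch: count the pmem block devices visible on an I/O node. Fewer than
# expected suggests deconfigured NVDIMMs behind the missing log tip pdisks.
count_pmem() {
    # $1: directory to scan (defaults to /dev; the parameter exists only so
    # this sketch can be exercised against a test directory)
    ls "${1:-/dev}"/pmem* 2>/dev/null | wc -l
}
```

Running `count_pmem` on a healthy node would report the expected number of pmem devices; a lower count corroborates the missing/undrainable pdisk states shown in the logs above.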
The following entries may be observed in the GPFS logs when both ESS 5000 I/O Server nodes of the building block are power cycled and all eight NVDIMM devices (four in each node) are in a deconfigured state:
2020-11-11_11:33:55.926+0100: [I] Beginning log tip recovery for LG root of RG rg_nsdibm13g.

2020-11-11_11:33:55.929+0100: [E] Unable to read logTip vdisk rg_nsdibm13g_logtip track 1 due to fatal pdisk IO errors!

2020-11-11_11:33:55.929+0100: [E] Unable to read logTip vdisk rg_nsdibm13g_logtip track 3 due to fatal pdisk IO errors!

2020-11-11_11:33:55.929+0100: [E] Unable to read logTip vdisk rg_nsdibm13g_logtip track 2 due to fatal pdisk IO errors!

2020-11-11_11:33:55.938+0100: [E] Unable to read logTip vdisk rg_nsdibm13g_logtip track 0 due to fatal pdisk IO errors!

2020-11-11_11:33:55.938+0100: [E] Beginning to resign log group root in recovery group rg_nsdibm13g due to "recovery
In this situation, all four NVDIMM devices (/dev/pmem0 and /dev/pmem1 on both I/O nodes of the building block) from the ESS 5000 I/O nodes would be missing. When this happens, the RG would fail to recover, and file system access cannot be established. The missing devices must be reconstructed and added to the RG.
IBM Service must be contacted to complete the recovery of the RG and file system. The RG name and pdisk name are shown here as examples.
Identify the pdisk associated with the logtip vdisk:
mmvdisk pdisk list --rg all --da NVR --not-ok

                              declustered
recovery group  pdisk            array     paths  capacity  free space  FRU (type)       state
--------------  ------------  -----------  -----  --------  ----------  ---------------  -----
ess5k_7894DBA   n001v002      NVR              0    31 GiB      31 GiB  34GB NVRAM       missing/undrainable
ess5k_7894E4A   n001v001      NVR              0    31 GiB      31 GiB  34GB NVRAM       missing/undrainable
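When this output has been saved to a file (the file name below is illustrative), the affected recovery group and pdisk names can be pulled out with awk; this sketch assumes the column layout shown above:

```shell
# Sketch: from saved `mmvdisk pdisk list --rg all --da NVR --not-ok` output,
# print the recovery group and pdisk name of each NVR pdisk in a missing
# state. Assumes the column layout shown in this flash (array in column 3,
# state in the last column).
missing_nvr_pdisks() {
    awk '$3 == "NVR" && $NF ~ /missing/ { print $1, $2 }' "$1"
}
```

For example, `missing_nvr_pdisks mmvdisk-out.txt` would print one `recovery-group pdisk` pair per affected device.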
The mmhealth command can be run to verify the health of the system and the log tip devices. You will see output similar to the following:
mmhealth node show NATIVE_RAID PHYSICALDISK

  ess5k_7894E4A/e3s105      HEALTHY       3 days ago        -
  ess5k_7894E4A/n001v001    DEGRADED      3 days ago        gnr_pdisk_missing(ess5k_7894E4A/n001v001)
  ess5k_7894E4A/n002v001    HEALTHY       3 days ago        -

Event                     Parameter                  Severity    Active Since      Event Message
--------------------------------------------------------------------------------------------------------------------------------
gnr_pdisk_missing         ess5k_7894DBA/n001v002     WARNING     3 days ago        GNR pdisk ess5k_7894DBA/n001v002 is missing
gnr_pdisk_replaceable     ess5k_7894E4A/e3s005       ERROR       3 days ago        GNR pdisk ess5k_7894E4A/e3s005 is replaceable
gnr_pdisk_missing         ess5k_7894E4A/n001v001     WARNING     3 days ago        GNR pdisk ess5k_7894E4A/n001v001 is missing
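The gnr_pdisk_missing events can likewise be filtered out of saved mmhealth output; a minimal sketch, assuming the event-table layout shown above and an illustrative file name:

```shell
# Sketch: from saved `mmhealth node show NATIVE_RAID PHYSICALDISK` output,
# print the recoverygroup/pdisk identifier of each gnr_pdisk_missing event.
# Assumes the event name is in column 1 and the parameter in column 2, as in
# the sample output in this flash.
missing_pdisk_events() {
    awk '$1 == "gnr_pdisk_missing" { print $2 }' "$1"
}
```

Cross-checking this list against the mmvdisk output above helps confirm that the missing pdisks are the NVR log tip devices covered by this flash rather than ordinary drive failures.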
Recommendation:
Users running IBM ESS 5000 V6.0.1.0 through V6.0.2.0 code levels should apply IBM ESS 5000 V6.0.2.1 or later, available from IBM Fix Central at:
Users running IBM ESS 5000 V6.1.0.0 code levels should apply IBM ESS 5000 V6.1.0.1 or later, available from IBM Fix Central at:

If you cannot apply the above PTF level, contact IBM service.

Updating the NVDIMM firmware takes approximately eight minutes per NVDIMM on ESS I/O servers 5105-22e. If an NVDIMM FW update is required, then this would be incurred only on the initial system boot when updating system firmware or replacing an NVDIMM. There are four NVDIMMs per server, so up to an extra 32 minutes might be needed to complete the system boot in these cases.

Note:

If the error recurs on the same NVDIMMs within a few days or weeks of performing the procedures in the workaround, do not reperform the procedure or replace the NVDIMM; contact IBM Service.


Document Information

Modified date:
21 May 2021

UID

ibm16450863