IBM Support

IBM Elastic Storage Server 3200: Pems reboot and HAL, BMC false positive known issues for Elastic Storage System (ESS) 3200

Troubleshooting


Problem

Intermittent false-positive events can be reported in the mmhealth node event log, possibly resulting in unnecessary Call Home events. The issue occurs in ESS levels 6.1.1.0 through 6.1.2.2 on ESS 3200 only.

Additionally, an IBM ESS 3200 system can reboot unexpectedly because of pemsmod module issues.

Summary of known issues fixed in 6.1.2.3:

  1. Weekly boot drive sync leads to boot drive degraded event.
  2. PSU fan speed calculation error leads to PSU failed event.
  3. Down-level BMC firmware affects mmlsenclosure accuracy.
  4. PSU Voltage In threshold increased to allow for more flexibility in power network design.
  5. Canister reboot.

Summary of known issues fixed in ESS 6.1.2.5 and 6.1.5.0:

  1. Coin_battery_failed or missing. Intermittently, a reading of zero volts for the coin cell battery leads to a coin cell battery failure event.

Environment

IBM ESS 3200 system with ESS V6.1.1.0 through V6.1.2.2.
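
As a quick way to check whether a given ESS level falls in the affected range stated above, the comparison can be sketched as follows. This is only a minimal sketch: the function names are hypothetical, and it assumes the level is supplied as a four-part dotted string such as 6.1.2.2. How that string is obtained from the system is not covered here.

```python
# Hypothetical helper names; the range bounds come from the Environment statement.
AFFECTED_FIRST = (6, 1, 1, 0)
AFFECTED_LAST = (6, 1, 2, 2)


def parse_level(level):
    """Turn '6.1.2.2' or 'V6.1.2.2' into a tuple of ints for ordered comparison."""
    return tuple(int(part) for part in level.strip().lstrip("Vv").split("."))


def in_affected_range(level):
    """True if the level is within 6.1.1.0 through 6.1.2.2, inclusive."""
    return AFFECTED_FIRST <= parse_level(level) <= AFFECTED_LAST


if __name__ == "__main__":
    for level in ("6.1.1.0", "6.1.2.2", "6.1.2.3"):
        print(level, in_affected_range(level))
```

Comparing tuples of integers avoids the string-comparison pitfall where "6.1.10.0" would sort before "6.1.2.0".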

Diagnosing The Problem

1. Weekly boot drive sync leads to boot drive degraded event.

Example: A customer would see mmhealth node event log entries like these:

2021-12-05 01:02:16.386541 IST        bootdrive_mirror_degraded WARNING    The bootdrive mirroring is degraded

2021-12-05 01:37:16.403387 IST        bootdrive_mirror_ok       INFO       The bootdrive mirroring is ok

What to do if you see this problem:

If the time between bootdrive_mirror_degraded and bootdrive_mirror_ok is less than one hour, you can safely ignore it.

Applying IBM ESS 3200 V6.1.2.3 will also fix it - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all

2. PSU fan speed calculation error leads to PSU failed event.

Example: A customer would see mmhealth node event log entries like these:

2021-08-24 06:16:49.164810 EDT        power_supply_failed       WARNING    Power supply psu1_left_id0 is failed

2021-08-24 06:21:49.374650 EDT        power_supply_ok           INFO       Power supply psu1_left_id0 is ok

What to do if you see this problem:

If the time between power_supply_failed and power_supply_ok is less than 10 minutes, you can safely ignore it.

Applying IBM ESS 3200 V6.1.2.3 will also fix it - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all
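
The two timing rules above (under one hour for the boot drive mirror events, under 10 minutes for the power supply events) can be checked mechanically against saved event log text. The following is a minimal sketch, not an IBM tool: the function names are hypothetical, the line layout is assumed to match the samples shown above (date, time, time zone, event name, severity, message), and events are paired by event name only, so it does not distinguish between individual power supplies.

```python
from datetime import datetime, timedelta

# Thresholds and event names are taken from the guidance and sample logs above.
TRANSIENT_RULES = {
    ("bootdrive_mirror_degraded", "bootdrive_mirror_ok"): timedelta(hours=1),
    ("power_supply_failed", "power_supply_ok"): timedelta(minutes=10),
}


def parse_event(line):
    """Return (timestamp, event_name), or None if the line is not an event line."""
    fields = line.split()
    if len(fields) < 4:
        return None
    try:
        # The time zone field (fields[2]) is ignored; both events of a pair
        # are assumed to be logged in the same zone.
        stamp = datetime.strptime(f"{fields[0]} {fields[1]}", "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None
    return stamp, fields[3]


def find_transient_pairs(lines):
    """Pair each problem event with the next matching ok event.

    Returns a list of (problem_event, ok_event, elapsed, ignorable) tuples,
    where ignorable is True when elapsed is below the documented threshold.
    """
    pending = {}
    results = []
    for line in lines:
        parsed = parse_event(line)
        if parsed is None:
            continue
        stamp, name = parsed
        for (bad, ok), limit in TRANSIENT_RULES.items():
            if name == bad:
                pending[bad] = stamp
            elif name == ok and bad in pending:
                elapsed = stamp - pending.pop(bad)
                results.append((bad, ok, elapsed, elapsed < limit))
    return results


if __name__ == "__main__":
    sample = [
        "2021-12-05 01:02:16.386541 IST        bootdrive_mirror_degraded WARNING    The bootdrive mirroring is degraded",
        "2021-12-05 01:37:16.403387 IST        bootdrive_mirror_ok       INFO       The bootdrive mirroring is ok",
    ]
    for bad, ok, elapsed, ignorable in find_transient_pairs(sample):
        print(bad, "->", ok, elapsed, "ignorable" if ignorable else "investigate")
```

A pair that is not flagged as ignorable, or a problem event with no following ok event, should be investigated as a real fault.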

3. Down-level BMC firmware affects mmlsenclosure accuracy.

Example: A customer could see something like the following in the mmlsenclosure all -L output:

component type  serial number     component id    failed value   unit   properties  fru     location

--------------  -------------     ------------    ------ -----   ----   ----------  ---     --------

intraComm       78E401Y           0               yes                   OFFLINE NON_CRIT         canister2

intraComm       78E401Y           1               yes                   OFFLINE NOTAVAIL         canister1

What to do if you see this problem:

If the intraComm issue is encountered apply IBM ESS 3200 V6.1.2.3 - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all

4. "PSU Voltage In" threshold increased to allow for more flexibility in power network design.

Example: A customer would see something like the following in the mmlsenclosure all -L output:

voltageSensor   78E40AH           39              yes    240.00  V      FAILED      01ll737 psu2_v_in

What to do if you see this problem:

If a FAILED PSU Voltage In sensor such as psu2_v_in is reported, apply IBM ESS 3200 V6.1.2.3 - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all
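
For items 3 and 4, saved mmlsenclosure all -L output can be scanned for the rows described above. The following is a rough sketch under stated assumptions: the function names are hypothetical, the first four whitespace-separated fields are assumed to be component type, serial number, component id, and failed, and the last field is assumed to be the location, as in the sample rows. Columns in between can be empty, so they are not parsed.

```python
def failed_components(output):
    """Return one dict per row whose 'failed' column reads 'yes'."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Header, separator, and blank lines never have 'yes' in the 4th field.
        if len(fields) < 5 or fields[3] != "yes":
            continue
        rows.append({
            "type": fields[0],
            "serial": fields[1],
            "id": fields[2],
            "location": fields[-1],
        })
    return rows


def matches_known_issue(row):
    """True for the two signatures shown above: intraComm rows and *_v_in voltage sensors."""
    if row["type"] == "intraComm":
        return True
    return row["type"] == "voltageSensor" and row["location"].endswith("_v_in")


if __name__ == "__main__":
    sample = (
        "intraComm       78E401Y           0               yes                   OFFLINE NON_CRIT         canister2\n"
        "voltageSensor   78E40AH           39              yes    240.00  V      FAILED      01ll737 psu2_v_in\n"
    )
    for row in failed_components(sample):
        print(row["type"], row["location"], matches_known_issue(row))
```

A match only indicates that the row resembles the known false positives on an affected level. Failed rows that do not match should be treated as genuine until shown otherwise.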

5. Canister reboot. 

For versions prior to ESS 6.1.2.0, the pemsmod module may log a stack trace containing “BUG: scheduling while atomic” together with a pemsmod function such as pemsMsgRequestMsg before the kernel crash. The stack trace of the kernel crash itself may not contain a pemsmod function.

For versions prior to ESS 6.1.2.2, the pemsmod module can hit a kernel crash with the following stack trace:

[48094.184305] BUG: unable to handle kernel paging request at 0000000000002d04

[48094.191262] PGD 0 P4D 0

[48094.193793] Oops: 0000 [#1] SMP NOPTI

[48094.197450] CPU: 47 PID: 2942 Comm: pemsIpmiRecvWor Kdump: loaded Tainted: G           OE    --------- -t - 4.18.0-193.60.2.el8_2.x86_64 #1

[48094.209946] Hardware name: IBM ESS 3200 : -[5141FN1]-/ESS 3200 : -[5141FN1]-, BIOS RWH1-12.16.00 04/28/2021

[48094.219678] RIP: 0010:kmem_cache_alloc_trace+0x7f/0x1c0

[48094.224899] Code: 01 00 00 4d 8b 01 65 49 8b 50 08 65 4c 03 05 b0 ae 17 47 4d 8b 38 4d 85 ff 0f 84 f0 00 00 00 41 8b 41 20 49 8b 39 48 8d 4a 01 <49> 8b 1c 07 4c 89 f8 65 48 0f c7 0f 0f 94 c0 84 c0 74 c6 41 8b 41

[48094.243634] RSP: 0018:ffffacda0f34fe48 EFLAGS: 00010202

[48094.248851] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000153a79

[48094.255977] RDX: 0000000000153a78 RSI: 00000000006000c0 RDI: 000000000002e080

[48094.263099] RBP: ffff909347c07800 R08: ffff9110aedee080 R09: ffff909347c07800

[48094.270223] R10: ffff91109bed8f60 R11: ffff9110aede8b01 R12: 00000000006000c0

[48094.277346] R13: 0000000000000018 R14: ffffffffc0904298 R15: 0000000000002d04

[48094.284474] FS:  0000000000000000(0000) GS:ffff9110aedc0000(0000) knlGS:0000000000000000

[48094.292549] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[48094.298287] CR2: 0000000000002d04 CR3: 000000498980a000 CR4: 0000000000340ee0

[48094.305411] Call Trace:

[48094.307876]  pemsListEnqueue+0x28/0xd0 [pemsmod]

[48094.312496]  pemsIpmiRecvWorker+0x573/0x5a0 [pemsmod]

[48094.317549]  ? finish_wait+0x80/0x80

[48094.321127]  ? pemsIpmiProcessReply+0x190/0x190 [pemsmod]

[48094.326528]  kthread+0x112/0x130

[48094.329757]  ? kthread_flush_work_fn+0x10/0x10

[48094.334196]  ret_from_fork+0x22/0x40

For versions prior to ESS 6.1.2.3, the pemsmod module can hit a kernel crash with the following stack trace:

[6547875.303351] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008

[6547875.311345] PGD 0 P4D 0

[6547875.314056] Oops: 0002 [#1] SMP NOPTI

[6547875.317888] CPU: 30 PID: 192 Comm: ksoftirqd/30 Kdump: loaded Tainted: G        W  OE    --------- -t - 4.18.0-193.60.2.el8_2.x86_64 #1

[6547875.330211] Hardware name: IBM ESS 3200 : -[5141FN1]-/ESS 3200 : -[5141FN1]-, BIOS RWH1-12.16.00 04/28/2021

[6547875.340130] RIP: 0010:pemsListDequeueFirst+0x28/0x60 [pemsmod]

[6547875.346133] Code: 00 00 0f 1f 44 00 00 53 48 85 ff 74 33 48 89 fa 48 8b 7f 30 48 85 ff 74 27 48 8b 1f 48 3b 7a 38 74 25 48 8b 47 10 48 89 42 30 <48> c7 40 08 00 00 00 00 83 6a 28 01 e8 e7 a9 d0 ea 48 89 d8 5b c3

[6547875.365043] RSP: 0018:ffffacb58cc5fcf8 EFLAGS: 00010202

[6547875.370434] RAX: 0000000000000000 RBX: ffff8fa1ee847b40 RCX: 0000000000000298

[6547875.377731] RDX: ffff8fa1edc3af00 RSI: ffffffffc07caf18 RDI: ffff8f8da447f960

[6547875.385029] RBP: ffff8fa2135a0000 R08: ffffffffc07c6c22 R09: 0000000000000032

[6547875.392327] R10: ffffd6c27bb12d00 R11: 0000000000000064 R12: 0000000000000081

[6547875.399623] R13: 000000000000002d R14: 0000000000000000 R15: ffff8f47855dc000

[6547875.406922] FS:  0000000000000000(0000) GS:ffff8fa2ae780000(0000) knlGS:0000000000000000

[6547875.415172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[6547875.421083] CR2: 0000000000000008 CR3: 00000001cff66000 CR4: 0000000000340ee0

[6547875.428381] Call Trace:

[6547875.431002]  pemsIpmiRecvMsg+0x16f/0x480 [pemsmod]

[6547875.435973]  deliver_response+0x61/0xd0 [ipmi_msghandler]

[6547875.441538]  deliver_local_response+0xe/0x30 [ipmi_msghandler]

[6547875.447544]  handle_one_recv_msg+0x178/0xe00 [ipmi_msghandler]

What to do if you see this problem:

If the canister reboot issue is encountered apply IBM ESS 3200 V6.1.2.3 - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all
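
When reviewing saved kernel log text after a canister reboot, the signatures quoted above can be searched for automatically. The sketch below is illustrative only: the function name and result keys are hypothetical, the BUG strings are copied from the traces in this document, and the frame pattern assumes the symbol+0xoffset/0xsize [pemsmod] layout seen in those traces.

```python
import re

# BUG messages quoted in the traces above.
BUG_PATTERNS = (
    "BUG: scheduling while atomic",
    "BUG: unable to handle kernel paging request",
    "BUG: unable to handle kernel NULL pointer dereference",
)

# Matches frames such as "pemsListEnqueue+0x28/0xd0 [pemsmod]".
PEMS_FRAME = re.compile(r"\b(pems\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[pemsmod\]")


def scan_kernel_log(text):
    """Report which known BUG messages and pemsmod functions appear in the text."""
    bugs = [pattern for pattern in BUG_PATTERNS if pattern in text]
    functions = sorted(set(PEMS_FRAME.findall(text)))
    return {
        "bugs": bugs,
        "pemsmod_functions": functions,
        # Both a known BUG message and a pemsmod frame are present.
        "suspect": bool(bugs and functions),
    }


if __name__ == "__main__":
    excerpt = (
        "[48094.184305] BUG: unable to handle kernel paging request at 0000000000002d04\n"
        "[48094.307876]  pemsListEnqueue+0x28/0xd0 [pemsmod]\n"
        "[48094.312496]  pemsIpmiRecvWorker+0x573/0x5a0 [pemsmod]\n"
    )
    print(scan_kernel_log(excerpt))
```

Because the crash trace for the “BUG: scheduling while atomic” case may not contain a pemsmod function, a result without pemsmod frames does not rule out this issue.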

6. Coin_battery_failed or missing.  Intermittently, a reading of zero volts for the coin cell battery leads to a coin cell battery failure event. 

Example: A customer could see mmhealth node event log entries like these:

2021-09-16 13:28:48.200561 PDT coin_battery_missing WARNING The coin battery is absent

2021-09-16 13:31:21.763169 PDT coin_battery_ok INFO The coin battery is ok

What to do if you see this problem:

If the coin_battery_missing issue is encountered, apply IBM ESS 3200 V6.1.2.5 or later, or V6.1.5.0 or later - https://www.ibm.com/support/pages/node/6844125

Document Location

Worldwide


Document Information

Modified date:
15 June 2023

UID

ibm16591433