Troubleshooting
Problem
Intermittent false positive events can be reported in the mmhealth node event log, possibly resulting in unnecessary Call Home events. The issue is found in ESS levels 6.1.1.0 through 6.1.2.2 on ESS 3200 only.
Additionally, IBM ESS 3200 systems can reboot unexpectedly because of pemsmod module issues.
Summary of known issues fixed in 6.1.2.3:
- Weekly boot drive sync leads to boot drive degraded event.
- PSU fan speed calculation error leads to PSU failed event.
- Down level BMC firmware affects mmlsenclosure accuracy.
- PSU Voltage In threshold increased to allow for more flexibility in power network design.
- Canister reboot.
Summary of known issues fixed in ESS 6.1.2.5 and 6.1.5.0:
- Coin_battery_failed or missing. Intermittently, a reading of zero volts for the coin cell battery leads to a coin cell battery failure event.
Environment
IBM ESS 3200 system with ESS V6.1.1.0 through V6.1.2.2.
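As a rough illustration of the affected range, the following Python sketch checks whether a given ESS level falls between 6.1.1.0 and 6.1.2.2 inclusive. How the level string is obtained from a system is not covered here; the function names are hypothetical and the level is simply passed in as text.

```python
def parse_level(level):
    """Turn a level string such as 'V6.1.2.2' or '6.1.2.2' into a tuple of ints."""
    return tuple(int(part) for part in level.lstrip("Vv").split("."))


# Range taken from the Environment statement above.
AFFECTED_FIRST = parse_level("6.1.1.0")
AFFECTED_LAST = parse_level("6.1.2.2")


def in_affected_range(level):
    """Return True if the level lies within the affected range, inclusive."""
    return AFFECTED_FIRST <= parse_level(level) <= AFFECTED_LAST


for candidate in ("6.1.1.0", "6.1.2.2", "V6.1.2.3"):
    print(candidate, in_affected_range(candidate))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character.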
Diagnosing The Problem
1. Weekly boot drive sync leads to boot drive degraded event.
Example: A customer would see mmhealth node event log entries like these:
2021-12-05 01:02:16.386541 IST bootdrive_mirror_degraded WARNING The bootdrive mirroring is degraded
2021-12-05 01:37:16.403387 IST bootdrive_mirror_ok INFO The bootdrive mirroring is ok
What to do if you see this problem:
If the time between bootdrive_mirror_degraded and bootdrive_mirror_ok is less than one hour, you can safely ignore it.
Applying IBM ESS 3200 V6.1.2.3 will also fix it - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all
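The one-hour check for the boot drive events can be sketched in Python as follows. This is only an illustration that assumes the event lines have the layout shown in the example (date, time, timezone, event name, severity, message); the function names are hypothetical and the sample text is copied from the example.

```python
from datetime import datetime, timedelta

# Sample lines copied from the example above. The timezone token (IST) is
# ignored because both events come from the same node.
SAMPLE_LOG = """\
2021-12-05 01:02:16.386541 IST bootdrive_mirror_degraded WARNING The bootdrive mirroring is degraded
2021-12-05 01:37:16.403387 IST bootdrive_mirror_ok INFO The bootdrive mirroring is ok
"""


def parse_event(line):
    """Split one event line into (timestamp, event_name)."""
    date, time, _tz, event, *_rest = line.split()
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S.%f")
    return stamp, event


def degraded_intervals(log_text):
    """Yield the duration of each degraded -> ok episode found in the log."""
    started = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        stamp, event = parse_event(line)
        if event == "bootdrive_mirror_degraded":
            started = stamp
        elif event == "bootdrive_mirror_ok" and started is not None:
            yield stamp - started
            started = None


def is_ignorable(interval, limit=timedelta(hours=1)):
    """Apply the rule above: an episode shorter than one hour can be ignored."""
    return interval < limit


intervals = list(degraded_intervals(SAMPLE_LOG))
for interval in intervals:
    print(interval, "ignorable" if is_ignorable(interval) else "investigate")
```

For the sample entries, the degraded state lasts about 35 minutes, which is under the one-hour limit.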
2. PSU fan speed calculation error leads to PSU failed event.
Example: A customer would see mmhealth node event log entries like these:
2021-08-24 06:16:49.164810 EDT power_supply_failed WARNING Power supply psu1_left_id0 is failed
2021-08-24 06:21:49.374650 EDT power_supply_ok INFO Power supply psu1_left_id0 is ok
What to do if you see this problem:
If the time between power_supply_failed and power_supply_ok is less than 10 minutes, you can safely ignore it.
Applying IBM ESS 3200 V6.1.2.3 will also fix it - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all
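Because a node has more than one power supply, the 10-minute check is best done per supply name. The Python sketch below pairs each power_supply_failed event with the next power_supply_ok event for the same supply. It assumes the message text has the form "Power supply <name> is ..." as in the example; the regular expression and function name are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Sample lines copied from the example above.
SAMPLE_LOG = """\
2021-08-24 06:16:49.164810 EDT power_supply_failed WARNING Power supply psu1_left_id0 is failed
2021-08-24 06:21:49.374650 EDT power_supply_ok INFO Power supply psu1_left_id0 is ok
"""

# Assumed line layout: date time timezone event severity "Power supply <name> is ..."
EVENT_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+ "
    r"(?P<event>power_supply_(?:failed|ok)) \S+ "
    r"Power supply (?P<psu>\S+) is "
)


def short_psu_failures(log_text, limit=timedelta(minutes=10)):
    """Return {psu_name: [durations]} for failed -> ok episodes shorter than limit."""
    open_failures = {}  # psu name -> time of the first unmatched failed event
    short = {}
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match["stamp"], "%Y-%m-%d %H:%M:%S.%f")
        psu = match["psu"]
        if match["event"] == "power_supply_failed":
            open_failures.setdefault(psu, stamp)
        elif psu in open_failures:
            duration = stamp - open_failures.pop(psu)
            if duration < limit:
                short.setdefault(psu, []).append(duration)
    return short


result = short_psu_failures(SAMPLE_LOG)
print(result)
```

A supply that appears in the result recovered within the limit; a supply with a failed event and no matching ok event does not appear and should be investigated.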
3. Down level BMC firmware affects mmlsenclosure accuracy.
Example: A customer could see something like the following in the mmlsenclosure all -L output:
component type  serial number  component id  failed  value  unit  properties        fru  location
--------------  -------------  ------------  ------  -----  ----  ----------------  ---  ---------
intraComm       78E401Y        0             yes                  OFFLINE NON_CRIT       canister2
intraComm       78E401Y        1             yes                  OFFLINE NOTAVAIL       canister1
What to do if you see this problem:
If the intraComm issue is encountered, apply IBM ESS 3200 V6.1.2.3 - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all
4. "PSU Voltage In" threshold increased to allow for more flexibility in power network design.
Example: A customer would see something like the following in the mmlsenclosure all -L output:
voltageSensor 78E40AH 39 yes 240.00 V FAILED 01ll737 psu2_v_in
What to do if you see this problem:
If a failed voltageSensor is reported for a PSU input voltage location such as psu2_v_in, apply IBM ESS 3200 V6.1.2.3 - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all
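To spot this condition in saved mmlsenclosure all -L output, a row can be split into its columns and tested. The Python sketch below assumes the column order given by the header in item 3 (component type, serial number, component id, failed, value, unit, properties, fru, location) and that all nine columns are populated for voltageSensor rows. The function names are hypothetical.

```python
# Sample row copied from the example above.
SAMPLE_ROW = "voltageSensor 78E40AH 39 yes 240.00 V FAILED 01ll737 psu2_v_in"


def parse_voltage_row(row):
    """Return a dict for a fully populated voltageSensor row, else None."""
    fields = row.split()
    if len(fields) != 9 or fields[0] != "voltageSensor":
        return None
    _ctype, serial, comp_id, failed, value, unit, props, fru, location = fields
    return {
        "serial": serial,
        "component_id": int(comp_id),
        "failed": failed == "yes",
        "value": float(value),
        "unit": unit,
        "properties": props,
        "fru": fru,
        "location": location,
    }


def is_psu_input_voltage_failure(row):
    """True if the row is a failed voltage sensor at a psu*_v_in location."""
    record = parse_voltage_row(row)
    if record is None:
        return False
    location = record["location"]
    return record["failed"] and location.startswith("psu") and location.endswith("_v_in")


print(is_psu_input_voltage_failure(SAMPLE_ROW))
```

Rows with empty columns, such as the intraComm rows in item 3, do not split into nine fields and are deliberately skipped by this sketch.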
5. Canister reboot.
For versions prior to ESS 6.1.2.0, the pemsmod module may log stack traces containing “BUG: scheduling while atomic” together with a pemsmod function such as pemsMsgRequestMsg before the kernel crash. The stack trace of the kernel crash itself may not include a pemsmod function.
For versions prior to ESS 6.1.2.2, the pemsmod module can hit a kernel crash with the following stack trace:
[48094.184305] BUG: unable to handle kernel paging request at 0000000000002d04
[48094.191262] PGD 0 P4D 0
[48094.193793] Oops: 0000 [#1] SMP NOPTI
[48094.197450] CPU: 47 PID: 2942 Comm: pemsIpmiRecvWor Kdump: loaded Tainted: G OE --------- -t - 4.18.0-193.60.2.el8_2.x86_64 #1
[48094.209946] Hardware name: IBM ESS 3200 : -[5141FN1]-/ESS 3200 : -[5141FN1]-, BIOS RWH1-12.16.00 04/28/2021
[48094.219678] RIP: 0010:kmem_cache_alloc_trace+0x7f/0x1c0
[48094.224899] Code: 01 00 00 4d 8b 01 65 49 8b 50 08 65 4c 03 05 b0 ae 17 47 4d 8b 38 4d 85 ff 0f 84 f0 00 00 00 41 8b 41 20 49 8b 39 48 8d 4a 01 <49> 8b 1c 07 4c 89 f8 65 48 0f c7 0f 0f 94 c0 84 c0 74 c6 41 8b 41
[48094.243634] RSP: 0018:ffffacda0f34fe48 EFLAGS: 00010202
[48094.248851] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000153a79
[48094.255977] RDX: 0000000000153a78 RSI: 00000000006000c0 RDI: 000000000002e080
[48094.263099] RBP: ffff909347c07800 R08: ffff9110aedee080 R09: ffff909347c07800
[48094.270223] R10: ffff91109bed8f60 R11: ffff9110aede8b01 R12: 00000000006000c0
[48094.277346] R13: 0000000000000018 R14: ffffffffc0904298 R15: 0000000000002d04
[48094.284474] FS: 0000000000000000(0000) GS:ffff9110aedc0000(0000) knlGS:0000000000000000
[48094.292549] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[48094.298287] CR2: 0000000000002d04 CR3: 000000498980a000 CR4: 0000000000340ee0
[48094.305411] Call Trace:
[48094.307876] pemsListEnqueue+0x28/0xd0 [pemsmod]
[48094.312496] pemsIpmiRecvWorker+0x573/0x5a0 [pemsmod]
[48094.317549] ? finish_wait+0x80/0x80
[48094.321127] ? pemsIpmiProcessReply+0x190/0x190 [pemsmod]
[48094.326528] kthread+0x112/0x130
[48094.329757] ? kthread_flush_work_fn+0x10/0x10
[48094.334196] ret_from_fork+0x22/0x40
For versions prior to ESS 6.1.2.3, the pemsmod module can hit a kernel crash with the following stack trace:
[6547875.303351] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[6547875.311345] PGD 0 P4D 0
[6547875.314056] Oops: 0002 [#1] SMP NOPTI
[6547875.317888] CPU: 30 PID: 192 Comm: ksoftirqd/30 Kdump: loaded Tainted: G W OE --------- -t - 4.18.0-193.60.2.el8_2.x86_64 #1
[6547875.330211] Hardware name: IBM ESS 3200 : -[5141FN1]-/ESS 3200 : -[5141FN1]-, BIOS RWH1-12.16.00 04/28/2021
[6547875.340130] RIP: 0010:pemsListDequeueFirst+0x28/0x60 [pemsmod]
[6547875.346133] Code: 00 00 0f 1f 44 00 00 53 48 85 ff 74 33 48 89 fa 48 8b 7f 30 48 85 ff 74 27 48 8b 1f 48 3b 7a 38 74 25 48 8b 47 10 48 89 42 30 <48> c7 40 08 00 00 00 00 83 6a 28 01 e8 e7 a9 d0 ea 48 89 d8 5b c3
[6547875.365043] RSP: 0018:ffffacb58cc5fcf8 EFLAGS: 00010202
[6547875.370434] RAX: 0000000000000000 RBX: ffff8fa1ee847b40 RCX: 0000000000000298
[6547875.377731] RDX: ffff8fa1edc3af00 RSI: ffffffffc07caf18 RDI: ffff8f8da447f960
[6547875.385029] RBP: ffff8fa2135a0000 R08: ffffffffc07c6c22 R09: 0000000000000032
[6547875.392327] R10: ffffd6c27bb12d00 R11: 0000000000000064 R12: 0000000000000081
[6547875.399623] R13: 000000000000002d R14: 0000000000000000 R15: ffff8f47855dc000
[6547875.406922] FS: 0000000000000000(0000) GS:ffff8fa2ae780000(0000) knlGS:0000000000000000
[6547875.415172] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[6547875.421083] CR2: 0000000000000008 CR3: 00000001cff66000 CR4: 0000000000340ee0
[6547875.428381] Call Trace:
[6547875.431002] pemsIpmiRecvMsg+0x16f/0x480 [pemsmod]
[6547875.435973] deliver_response+0x61/0xd0 [ipmi_msghandler]
[6547875.441538] deliver_local_response+0xe/0x30 [ipmi_msghandler]
[6547875.447544] handle_one_recv_msg+0x178/0xe00 [ipmi_msghandler]
What to do if you see this problem:
If the canister reboot issue is encountered, apply IBM ESS 3200 V6.1.2.3 - https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=Software%20defined%20storage&product=ibm/StorageSoftware/IBM+Elastic+Storage+Server+(ESS)&release=6.1.0&platform=All&function=all
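When reviewing a saved kernel log, the three pemsmod patterns described in this item can be told apart by simple text matching. The Python sketch below is a heuristic only: the substrings are taken from the traces shown in this item, the assumption that the module name pemsmod appears in the "scheduling while atomic" trace is not confirmed by this document, and a match is a hint rather than a diagnosis.

```python
# Each entry: (description, substrings that must all be present,
#              ESS level before which the pattern is reported).
SIGNATURES = [
    ("scheduling while atomic in pemsmod",
     ("BUG: scheduling while atomic", "pemsmod"),
     "6.1.2.0"),
    ("paging request fault in pemsListEnqueue",
     ("unable to handle kernel paging request", "pemsListEnqueue"),
     "6.1.2.2"),
    ("NULL pointer dereference in pemsListDequeueFirst",
     ("NULL pointer dereference", "pemsListDequeueFirst"),
     "6.1.2.3"),
]


def classify_pemsmod_trace(kernel_log):
    """Return [(description, affects_levels_before)] for every matching pattern."""
    return [
        (description, before)
        for description, needles, before in SIGNATURES
        if all(needle in kernel_log for needle in needles)
    ]


# Two lines excerpted from the second stack trace in this item.
EXCERPT = (
    "[6547875.303351] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008\n"
    "[6547875.340130] RIP: 0010:pemsListDequeueFirst+0x28/0x60 [pemsmod]\n"
)

matches = classify_pemsmod_trace(EXCERPT)
print(matches)
```

An empty result means none of the three patterns was found, which does not rule out other causes of a canister reboot.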
6. Coin_battery_failed or missing. Intermittently, a reading of zero volts for the coin cell battery leads to a coin cell battery failure event.
Example: A customer could see mmhealth node event log entries like these:
2021-09-16 13:28:48.200561 PDT coin_battery_missing WARNING The coin battery is absent
2021-09-16 13:31:21.763169 PDT coin_battery_ok INFO The coin battery is ok
What to do if you see this problem:
If the coin_battery_missing issue is encountered, apply IBM ESS 3200 V6.1.2.5 or later, or V6.1.5.0 or later - https://www.ibm.com/support/pages/node/6844125
Document Location
Worldwide
Document Information
Modified date:
15 June 2023
UID
ibm16591433