IBM Support

SEU ERROR MESSAGES

Troubleshooting


Problem

SEU error messages in sysmgr.log

Symptom

Example of a single-bit SEU error:

2013-12-05 18:37:06.358233 EST Error: DAC [dac hwid=2080 sn="1232F58250345" SPA=8 Parent=1796 Position=1][fpga-index=3] on SPU [spu hwid=1796 sn="06Y5616" SPA=8 Parent=1012 Position=9 spuName= spu0809] reported - SEU single bit error:

Example of a multi-bit SEU error:

2013-11-17 00:59:38.007859 PST Error: DAC [dac hwid=1590 sn="YF10JB2AS772" SPA=1 Parent=1588 Position=2][fpga-index=1] on SPU [spu hwid=1588 sn="Y012BG2BK073" SPA=1 Parent=1002 Position=13 spuName= spu0113 DesignatedSpu] reported - SEU multi bit error:

Cause

Single-event upsets (SEUs) occur when radiation, in the form of particles or rays, changes the contents of the FPGA programming logic.

Environment

N1000-1, N2000-1, N100-1, N3001

Diagnosing The Problem

To diagnose this issue, first search all sysmgr logs for the string SEU by using the grep command:

grep -i SEU /nz/kit/log/sysmgr/*.log
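
When the grep output is long, a per-SPU tally can make it easier to read. The following is a minimal sketch, not an IBM-supplied tool. The function name seu_summary is hypothetical, and the parsing assumes that every message follows the layout of the two examples in the Symptom section (a "spuName= <name>" field followed by "SEU single bit error" or "SEU multi bit error").

```shell
# Hypothetical helper: count SEU messages per SPU and per error type.
# Arguments: one or more sysmgr log files.
# Assumes the message layout shown in the Symptom examples.
seu_summary() {
    # Keep only SEU error lines; -h suppresses file name prefixes.
    grep -ih 'SEU .* bit error' "$@" |
        # Reduce each line to "<spuName> <single|multi>".
        sed -n 's/.*spuName= *\([^] ]*\).*SEU \([a-z]*\) bit error.*/\1 \2/p' |
        # Count identical pairs, then print "<spuName> <type> <count>".
        sort | uniq -c | awk '{print $2, $3, $1}'
}
```

For example, seu_summary /nz/kit/log/sysmgr/*.log prints one line per SPU and error type, with the message count in the last column.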


Resolving The Problem

Work with support to fail over the affected SPU and schedule the FPGA replacement.

Fail the affected SPU if more than two SEU error messages have been logged for it within 24 hours, or if a clear trend has been established.

Fail the affected SPU if even a single multi-bit SEU error appears in the logs.


Document Information

Modified date:
17 October 2019

UID

swg21680171