IBM Support

Motherboard RAID Controller ServeRAID M5110e might reset during heavy I/O - System x3650 (7915)

Troubleshooting


Problem

A small percentage of IBM System x3650 M4 onboard Redundant Array of Independent Disks (RAID) controllers (ServeRAID M5110e) might experience early-life online resets during heavy Input/Output (I/O). On susceptible controllers, the frequency of online resets varies with I/O throughput. The issue has been observed during a virtual disk consistency check and during a hard drive patrol read. A momentary loss of performance lasting a few seconds can be observed while the controller resets itself. The firmware level does not contribute to the online resets. This is a recoverable event and has no impact on data. Data is stored in the controller's flash-based memory, and is off-loaded when the reset completes. However, the system board needs to be replaced under the conditions listed in the Symptom section.

Resolving The Problem

Source

RETAIN tip: H211741

Symptom

A small percentage of IBM System x3650 M4 onboard Redundant Array of Independent Disks (RAID) controllers (ServeRAID M5110e) might experience early-life online resets during heavy Input/Output (I/O). On susceptible controllers, the frequency of online resets varies with I/O throughput.

The issue has been observed during a virtual disk consistency check and during a hard drive patrol read. A momentary loss of performance lasting a few seconds can be observed while the controller resets itself. The firmware level does not contribute to the online resets.

This is a recoverable event and has no impact on data. Data is stored in the controller's flash-based memory, and is off-loaded when the reset completes.

However, the system board needs to be replaced if either of the following conditions has occurred:

  1. It is normal to see a controller reset for some tasks. If more than five unexplained controller resets occur per hour, with no 'Pmu Msg' fault code logged, the system board needs to be replaced.
  2. Users can identify the error from the following MegaRAID Storage Manager event:

      Controller encountered a fatal error and reset

    In Linux, look for the following messages in the /var/log/kernel log file:

      kernel megasas: Found firmware in FAULT state, will reset adapter.
      kernel megaraid_sas: resetting fusion adapter.
      kernel megasas: Waiting for firmware to come to ready state
      kernel megasas: firmware now in Ready state
      kernel megasas: IOC Init cmd success
      kernel megaraid_sas: Reset successful

    If the previous event is observed, view the RAID controller's firmware log to see if it contains one of the following messages:

      - Pmu Msg Fault!!! faultcode 00002651
      - Pmu Msg Fault!!! faultcode 00002653
      - Pmu Msg Fault!!! faultcode 00002656
      - Pmu Msg Fault!!! faultcode 0000265D
      - Pmu Msg Fault!!! faultcode 00000615
      - Pmu Msg Fault!!! faultcode 00001900
      - Pmu Msg Fault!!! faultcode 00002665

    To check the controller's firmware log, download the IBM MegaRAID Command Line Interface (MegaCLI) and run the following command:

      MegaCLI -FwTermLog -Dsply -aALL

    MegaCLI for Microsoft Windows:

    MIGR-5082326.html

    MegaCLI for Linux:

    MIGR-5082327.html
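The firmware log check in condition 2 can be sketched in Python as follows. This is a minimal illustration, not an IBM tool. It assumes that the output of the MegaCLI command above has been saved to a text file (the file name fwtermlog.txt in the usage note is hypothetical) and that the fault messages appear in the log exactly as quoted in this tip. The function name is illustrative only.

```python
import re

# Fault codes listed in this tip (condition 2).
LISTED_FAULT_CODES = {
    "00002651", "00002653", "00002656", "0000265D",
    "00000615", "00001900", "00002665",
}

# Assumption: the firmware log contains the message text exactly as quoted
# in this tip, followed by an 8-digit hexadecimal fault code.
FAULT_RE = re.compile(r"Pmu Msg Fault!!! faultcode ([0-9A-Fa-f]{8})")


def listed_fault_codes(log_text):
    """Return the sorted fault codes in log_text that this tip lists."""
    found = {code.upper() for code in FAULT_RE.findall(log_text)}
    return sorted(found & LISTED_FAULT_CODES)
```

For example, after redirecting the MegaCLI output to fwtermlog.txt, calling listed_fault_codes(open("fwtermlog.txt").read()) returns the listed codes that were logged. An empty result means only that none of the seven listed codes was found; it does not rule out other firmware log events.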

Under some conditions, an event such as the following might be logged in the firmware log instead of the 'Pmu Msg' fault:

  - To understand..pmsg:c130adf8 lmid:
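The threshold in condition 1 (more than five resets per hour) can be checked against a Linux kernel log with a short Python sketch such as the one below. It is an illustration under stated assumptions, not an IBM utility. It assumes that each log line begins with a classic syslog timestamp such as "Jan 30 14:02:11" (which carries no year, so the year is passed in), that month names are in English, and that each reset is marked by the "Found firmware in FAULT state" message quoted above. The function names are illustrative only.

```python
import re
from datetime import datetime, timedelta

# Marker taken from the kernel messages quoted in this tip.
RESET_MARKER = "Found firmware in FAULT state"

# Assumption: lines start with a syslog timestamp such as "Jan 30 14:02:11".
STAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s")


def reset_times(lines, year):
    """Return sorted timestamps of the lines that report a controller reset."""
    times = []
    for line in lines:
        if RESET_MARKER not in line:
            continue
        match = STAMP_RE.match(line)
        if match:
            # Collapse padding such as "Jan  5" before parsing.
            stamp = " ".join(match.group(1).split())
            times.append(
                datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            )
    return sorted(times)


def max_resets_per_hour(times):
    """Return the largest number of resets inside any 60-minute window."""
    window = timedelta(hours=1)
    best = 0
    start = 0
    for end, current in enumerate(times):
        # Move the window start forward until it is within one hour.
        while current - times[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best
```

A result greater than 5 from max_resets_per_hour meets the numeric part of condition 1. The sketch counts every reset it finds. Whether the resets are unexplained, and whether a 'Pmu Msg' fault code was logged, still has to be judged from the firmware log.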


Affected configurations

The system can be any of the following IBM servers:

  - System x3650 M4, type 7915

This tip is not software specific.

This tip is not option specific.

The system has the symptom described above.

Solution

If the listed symptoms are observed, replace the system board, which has the controller embedded, using one of the following Field Replaceable Unit (FRU) part numbers: 00AM209 (for Intel V2 CPU) or 00Y8457 (for Intel CPU).

Additional information

An online reset occurs when the controller resets its firmware. This process takes only a few seconds. "Early life" means that the event occurs right away on a susceptible controller; controller age has no impact.

If RAID controller M51XX adapters have any failure symptoms that meet the listed description, refer to RETAIN Tip H21381 (ServeRAID M51XX CONTROLLERS MAY RESET DURING HEAVY I/O).

Document Location

Worldwide

Operating System

System x: Operating system independent / None

System x Hardware Options: Operating system independent / None

Lenovo x86 servers: Operating system independent / None

Document Information

Modified date:
30 January 2019

UID

ibm1MIGR-5094459