
IBM Elastic Storage Server: Technical tip for RAID monitoring script

Troubleshooting


Problem

In ESS, GNR write operations that are smaller than nsdRAIDFastWriteFSDataLimit are written to the log tip, which is a VDisk in a special declustered array (DA) called NVR. The log tip is a two-way replicated VDisk configured on disk partitions carved from the internal hard disk drives of each IO node. ESS IO nodes have a RAID controller with a write cache of 1800 MB. The key point is that these log tip VDisks are small, so all the I/Os can be satisfied by the cache and the internal drives do not have to absorb the I/Os.
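If needed, the current value of the small-write threshold mentioned above can be checked with mmlsconfig (output formatting may vary by release):

/usr/lpp/mmfs/bin/mmlsconfig nsdRAIDFastWriteFSDataLimit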

If there is a hardware issue in a RAID controller, small I/Os are no longer absorbed by the cache and are written directly to the internal drives of the IO servers, which significantly impacts the performance of small writes. Other types of I/Os, such as full track writes, see no performance impact, which makes this problem difficult to diagnose.

Symptom

Significant impact on the performance of small writes. Full track, promoted full track, and medium track writes see little impact on performance. The problem can manifest in different ways depending on the customer workload (for example, Sybase IQ write operations running slowly, backups running slowly because metadata operations tend to be small writes, or protocol performance degrading).

Environment

MTM impact:
5148-22L

Code Level impacted:
ESS V5.3.7.1, 6.0.2.0, 6.1.0.0, and all prior versions.

Log Entry:
Currently, this hardware issue is not monitored by the Power Server hardware monitoring or by the GNR software RAS component (mmhealth).

Diagnosing The Problem

To facilitate monitoring of the RAID adapter card, a sample script is provided at /opt/ibm/gss/tools/samples/ipraid_monitor.py. The script creates two custom events: ev_001 (“Cache failure of RAID Adapter card”) and ev_002 (“Disk failure of mirror RAID 10 disk used by root partition”). For more information on custom events, refer to the Spectrum Scale documentation topic Creating, raising, and finding custom defined events.
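As an illustration of the monitoring approach only (this is not the contents of ipraid_monitor.py; the iprconfig subcommand and the strings matched below are assumptions for the sketch), a script of this kind can parse the RAID adapter status from iprconfig and raise the custom events through mmsysmonc:

#!/usr/bin/env python
# Hypothetical sketch of the RAID monitoring approach; the shipped
# ipraid_monitor.py implements its own checks.
import logging
import subprocess

logging.basicConfig(filename='/var/log/ipraid_monitor.log', level=logging.INFO)

def raise_event(event):
    # Deliver a custom event to mmhealth (same mechanism as the test
    # command shown later in this document).
    subprocess.call(['/usr/lpp/mmfs/bin/mmsysmonc', 'event', 'custom', event])

def main():
    # Query the RAID adapter status; 'show-config' and the strings
    # matched below are illustrative assumptions, not the real checks.
    output = subprocess.check_output(['/usr/sbin/iprconfig', '-c', 'show-config'],
                                     universal_newlines=True)
    if 'Cache' in output and 'Failed' in output:
        logging.error('RAID adapter write cache failure detected')
        raise_event('ev_001')  # Cache failure of RAID Adapter card
    if 'RAID 10' in output and 'Degraded' in output:
        logging.error('Mirrored (RAID 10) root partition disk failure detected')
        raise_event('ev_002')  # Disk failure of mirrored RAID 10 root disk

if __name__ == '__main__':
    main()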

Usage:
1.) Use this script only on Power8 ESS IO nodes and the EMS (or on protocol nodes if they are Power8 servers with the same type of RAID adapter card).
2.) The location of the script is /opt/ibm/gss/tools/samples/ipraid_monitor.py. Run the script once manually.
3.) If no issue is reported, the script ran correctly. The script logs are stored in /var/log/ipraid_monitor.log. Open a case with IBM Support if any errors are reported.
4.) Check whether the events are created by issuing the commands 'mmhealth event show ev_001' and 'mmhealth event show ev_002'.
5.) Add a cron job entry to run the script every 30 minutes (issue the command crontab -e and add the following line):
*/30 * * * * /opt/ibm/gss/tools/samples/ipraid_monitor.py
6.) Repeat steps 2.)-5.) on all ESS IO nodes and the EMS (see the example after this list).
7.) Restart the Spectrum Scale GUI service after the script has run once on all the nodes:
systemctl restart gpfsgui
8.) No changes are required after an upgrade of the ESS deployment or GPFS RPMs; the script handles upgrades automatically.
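The one-time manual run in steps 2.)-3.) can be issued on all nodes from a single command with mmdsh, which ships with Spectrum Scale. The hostnames below (essio1, essio2, ems1) are placeholders for your own node names; the cron entry from step 5.) still has to be added on each node individually:

/usr/lpp/mmfs/bin/mmdsh -N essio1,essio2,ems1 /opt/ibm/gss/tools/samples/ipraid_monitor.py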

If a hardware issue occurs and an event is generated, it is shown in the GUI and in the eventlog.

To test the script, you can artificially trigger the event from one of the IO nodes by issuing the following command: /usr/lpp/mmfs/bin/mmsysmonc event custom ev_001

Once you verify that the event is seen in the GUI and in the eventlog for that node, you can clear the event by issuing:  
mmhealth node eventlog --clear 



[root@ems1 ~]# mmhealth node eventlog
Node name: ems1-10g.gpfs.net

Timestamp                      Event Name       Severity   Details
2021-04-18 00:59:37.140221 IST eventlog_cleared INFO       On the node ems1-10g.gpfs.net the eventlog was cleared.
2021-04-18 00:59:40.155770 IST root_disk_event  ERROR      One of the disks of Mirrored (RAID 10) Root Partition failed

Fixing the hardware issue:
Collect the output of /usr/sbin/iprconfig -c dump from the impacted node (the same output is available in the GNR directory of a snap). Refer to the Power server documentation IBM SAS RAID controller - Problem determination and recovery to resolve the issue. In most cases, the adapter card has to be replaced.

Document Location

Worldwide


[{"Type":"SW","Line of Business":{"code":"LOB26","label":"Storage"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STHMCM","label":"IBM Elastic Storage Server"},"ARM Category":[{"code":"a8m50000000Kze2AAC","label":"GPFS Raid"}],"ARM Case Number":"","Platform":[{"code":"PF016","label":"Linux"}],"Version":"All Version(s)"}]

Document Information

More support for:
IBM Elastic Storage Server

Component:
GPFS Raid

Software version:
All Version(s)

Operating system(s):
Linux

Document number:
6452593

Modified date:
21 May 2021

UID

ibm16452593
