Troubleshooting
Problem
In ESS, GNR write operations that are smaller than nsdRAIDFastWriteFSDataLimit are written to the log tip, a VDisk in a special declustered array (DA) called NVR. The log tip is a two-way replicated VDisk configured on disk partitions carved from the internal hard disk drives of each IO node. ESS IO nodes have a RAID controller with an 1800 MB write cache. The key point is that these log tip VDisks are small enough that all of their I/Os can be satisfied by the cache; the internal drives do not have to service the I/Os.
A hardware issue in the RAID controller can cause small I/Os to be written not to the cache but directly to the internal drives of the IO servers, which significantly degrades the performance of small writes. Other types of I/O, such as full-track writes, see no performance impact, which makes this problem difficult to diagnose.
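The routing decision described above can be sketched conceptually. This is a simplified illustration only; the function name, threshold value, and return strings are assumptions for the sake of the example, not GNR source code:

```python
# Conceptual sketch of how GNR routes a write, per the description above.
# NOT GNR source code; the names and the threshold value are illustrative.

NSD_RAID_FAST_WRITE_FS_DATA_LIMIT = 256 * 1024  # assumed threshold, in bytes

def route_write(size_bytes: int) -> str:
    """Return the destination of a write of the given size."""
    if size_bytes < NSD_RAID_FAST_WRITE_FS_DATA_LIMIT:
        # Small writes go to the log tip VDisk in the NVR declustered
        # array; a healthy RAID controller absorbs them in its 1800 MB
        # write cache, so the internal drives never see the I/O.
        return "log-tip (NVR, RAID controller write cache)"
    # Larger writes (for example, full-track writes) bypass the log tip.
    return "data VDisk"
```

When the controller's write cache fails, the first branch still fires, but the I/O lands on the internal drives instead of the cache, which is why only small writes slow down.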
Symptom
Significant degradation in the performance of small writes. Full-track, promoted full-track, and medium-track writes see little impact. The symptom can manifest in different ways depending on the customer workload (for example, Sybase IQ write operations running slowly, backups running slowly because metadata operations tend to be small writes, or protocol performance being slow).
Environment
MTM impact:
5148-22L
Code Level impacted:
ESS V5.3.7.1, 6.0.2.0, 6.1.0.0, and all prior versions.
Log Entry:
Currently, this hardware issue is not monitored by the Power Server hardware monitoring or by the GNR software RAS component (mmhealth).
Diagnosing The Problem
To facilitate monitoring of the RAID adapter card, a sample script is provided at /opt/ibm/gss/tools/samples/ipraid_monitor.py. The script creates two custom events: ev_001 ("Cache failure of RAID Adapter card") and ev_002 ("Disk failure of mirror RAID 10 disk used by root partition"). For more information on custom events, refer to the Spectrum Scale documentation topic Creating, raising, and finding custom defined events.
Usage:
1.) Use this script only on Power8 ESS IO nodes and the EMS (or on a protocol node if it is a Power8 server with the same type of RAID adapter card).
2.) The location of the script is /opt/ibm/gss/tools/samples/ipraid_monitor.py. Run the script once manually.
3.) If no issue is reported, the script ran correctly. The script logs are stored in /var/log/ipraid_monitor.log. Open an issue if the log contains any errors.
4.) Check whether the events were created by issuing the commands 'mmhealth event show ev_001' and 'mmhealth event show ev_002'.
5.) Add a cron job entry to run the script every 30 minutes (issue the command crontab -e and add the following line):
*/30 * * * * /opt/ibm/gss/tools/samples/ipraid_monitor.py
6.) Repeat steps 2-5 on all ESS IO nodes and EMS.
7.) Restart the Spectrum Scale GUI service after the script has run once on all the nodes:
systemctl restart gpfsgui
8.) After an upgrade of the ESS deployment or GPFS RPMs, no changes are required; the script handles the upgrade automatically.

[root@ems1 ~]# mmhealth node eventlog
Node name: ems1-10g.gpfs.net
Timestamp                       Event Name        Severity  Details
2021-04-18 00:59:37.140221 IST  eventlog_cleared  INFO      On the node ems1-10g.gpfs.net the eventlog was cleared.
2021-04-18 00:59:40.155770 IST  root_disk_event   ERROR     One of the disks of Mirrored (RAID 10) Root Partition failed
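For illustration, the kind of check a monitor like ipraid_monitor.py performs can be sketched in a few lines of Python. Everything below is a hypothetical sketch: the marker strings, function names, and the use of 'iprconfig -c show-config' are assumptions for the example, not the shipped script's implementation:

```python
# Hypothetical sketch of a RAID-cache monitor. The marker strings and the
# event mapping are illustrative assumptions, NOT the actual
# ipraid_monitor.py implementation.
import subprocess

# Assumed substrings that would indicate a failed/disabled write cache.
CACHE_FAILURE_MARKERS = ("Write cache disabled", "Cache failure")

def check_cache(iprconfig_output: str) -> list:
    """Return the custom event IDs to raise for the given iprconfig output."""
    events = []
    if any(marker in iprconfig_output for marker in CACHE_FAILURE_MARKERS):
        events.append("ev_001")  # Cache failure of RAID Adapter card
    return events

def run_check() -> list:
    """Run iprconfig on the local node and evaluate its output."""
    out = subprocess.run(["/usr/sbin/iprconfig", "-c", "show-config"],
                         capture_output=True, text=True).stdout
    return check_cache(out)
```

A cron-driven monitor like the one configured in step 5 would call run_check() periodically and raise the returned events through the mmhealth custom-event mechanism described above.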
Fixing the hardware issue:
Collect the output of /usr/sbin/iprconfig -c dump from the impacted node (the same output is available in the GNR directory of a snap). Refer to the Power server documentation IBM SAS RAID controller - Problem determination and recovery to resolve the issue. In most cases, the adapter card must be replaced.
Document Location
Worldwide
Document Information
More support for:
IBM Elastic Storage Server
Component:
GPFS Raid
Software version:
All Version(s)
Operating system(s):
Linux
Document number:
6452593
Modified date:
21 May 2021
UID
ibm16452593