gnrhealthcheck script

Checks the general health of an ESS configuration.

Synopsis


gnrhealthcheck [--topology] [--enclosure] [--rg] [--pdisk] [--perf-dd]
               [--ipr] [--nvme-ctrl] [--ssd-endurance-percentage] [--local]

Availability

Available on all IBM Storage Scale editions.

Description

The gnrhealthcheck script checks the general health of an ESS configuration.

Parameters

--topology
Checks the operating system topology. Runs mmgetpdisktopology and topsummary to look for cabling and path issues.
--enclosure
Checks enclosures. Runs mmlsenclosure to look for failures.
--rg
Checks recovery groups. Runs mmlsrecoverygroup to check whether all recovery groups are active and whether the active server is the primary server. Also checks for any recovery groups that need service.
--pdisk
Checks pdisks. Runs mmlspdisk to check that each pdisk has two paths.
--perf-dd
Checks basic disk performance. Runs a dd read of 1 GB from each potential IBM Storage Scale RAID disk drive and reports basic performance statistics. Reads are done six disks at a time. These statistics are meaningful only if the system is idle when the check runs. Available on Linux® only.
--ipr
Checks IBM® Power® RAID array status. Runs iprconfig to check whether the local RAID adapter is running in "Optimized" or "Degraded" mode. The ESS NVR pdisks are created on a RAID 10 array on this adapter. If one of the drives has failed, performance is affected and the drive should be replaced.
--nvme-ctrl
Checks NVMe controllers. Runs mmlsnvmestatus to find the status of the NVMe controllers.
--ssd-endurance-percentage
Checks SSD endurance. Runs mmlspdisk to check for SSDs that have used more than 90% of their rated endurance.
--local
Runs tests only on the invoking node.

By default, the script runs all checks except --perf-dd, on all NSD server nodes.
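
The --perf-dd measurement is essentially a timed sequential dd read per drive. The following sketch shows the idea only; a scratch file stands in for a real drive device so the sketch can run without raw-device access, whereas the real check reads 1 GB from each drive, six drives at a time.

```shell
# Rough sketch of the kind of sequential read that --perf-dd times per
# drive. /tmp/fake_drive.img is a scratch file standing in for a real
# drive device; the real check reads 1 GB per drive, six at a time.
disk=/tmp/fake_drive.img
dd if=/dev/zero of="$disk" bs=1M count=8 2>/dev/null    # stand-in "drive"
# Time a sequential read and keep dd's summary line (throughput figures):
dd if="$disk" of=/dev/null bs=1M 2>&1 | tail -n 1 > /tmp/perfdd.stats
cat /tmp/perfdd.stats
```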

Exit status

0
No problems were found.
1
Problems were found and information was displayed.
Note: Output is written to standard output by default. The amount of output can be large, so it is recommended that you pipe it to a file.
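
A caller can branch on the exit status directly. A minimal sketch follows; the stub function stands in for the real gnrhealthcheck script so the example is self-contained.

```shell
# Hedged sketch of acting on the documented exit status (0 = clean,
# 1 = problems found). The stub function below stands in for the real
# gnrhealthcheck script so the example is runnable anywhere.
gnrhealthcheck() { echo "Topology checks successful."; return 0; }   # stub

if gnrhealthcheck --local > /tmp/gnrhealthcheck.out 2>&1; then
    echo "healthcheck: no problems found"
else
    echo "healthcheck: problems found, review /tmp/gnrhealthcheck.out"
fi
```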

Security

You must have root authority to run the gnrhealthcheck script.

The node on which the script is issued must be able to execute remote shell commands on any other node in the cluster without the use of a password and without producing any extraneous messages. For more details, see the following IBM Storage Scale RAID: Administration topic: Requirements for administering IBM Storage Scale RAID.
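
The passwordless-remote-shell requirement can be spot-checked with a trivial remote command. A hedged sketch, where "peernode" is a placeholder host name and a real check would loop over every node in the cluster:

```shell
# Hedged spot-check of the passwordless-remote-shell requirement.
# "peernode" is a placeholder host name. BatchMode forbids password
# prompts, so a node that would prompt reports FAILED instead of hanging.
nodes="peernode"
: > /tmp/rsh_check.out
for node in $nodes; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true 2>/dev/null; then
        echo "$node: remote shell OK" | tee -a /tmp/rsh_check.out
    else
        echo "$node: remote shell FAILED" | tee -a /tmp/rsh_check.out
    fi
done
```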

Examples

  1. In this example, all checks are successful.
    To run a health check on the local server nodes and place output in /tmp/gnrhealthcheck.out, issue the following command:
       gnrhealthcheck --local | tee /tmp/gnrhealthcheck.out
    The system displays information similar to this:
    ################################################################
    # Beginning topology checks.
    ################################################################
    Topology checks successful.
    
    ################################################################
    # Beginning enclosure checks.
    ################################################################
    Enclosure checks successful.
    
    ################################################################
    # Beginning recovery group checks.
    ################################################################
    Recovery group checks successful.
    
    ################################################################
    # Beginning pdisk checks.
    ################################################################
    Pdisk group checks successful.
    
    ################################################################
    # Beginning IBM Power RAID checks.
    ################################################################
    IBM Power RAID checks successful.
    
    ################################################################
    # Beginning the NVMe Controller checks.
    ################################################################
    The NVMe Controller checks are successful.
    
    ################################################################
    # Beginning SSD endurance checks.
    ################################################################
    The SSD endurance checks are successful.
  2. In this example, several issues need to be investigated.
    To run a health check on the local server nodes and place output in /tmp/gnrhealthcheck.out, issue the following command:
    gnrhealthcheck --local | tee /tmp/gnrhealthcheck.out
    The system displays information similar to this:
    
    ################################################################
    # Beginning topology checks.
    ################################################################
    Found topology problems on node c45f01n01-ib0.gpfs.net
    
    DCS3700 enclosures found: 0123456789AB SV11812206 SV12616296 SV13306129
    Enclosure 0123456789AB (number 1):
    Enclosure 0123456789AB ESM A sg244[0379][scsi8 port 4] ESM B sg4[0379][scsi7 port 4]
    Enclosure 0123456789AB Drawer 1 ESM sg244 12 disks diskset "19968" ESM sg4 12 disks diskset "19968"
    Enclosure 0123456789AB Drawer 2 ESM sg244 12 disks diskset "11294" ESM sg4 12 disks diskset "11294"
    Enclosure 0123456789AB Drawer 3 ESM sg244 12 disks diskset "60155" ESM sg4 12 disks diskset "60155"
    Enclosure 0123456789AB Drawer 4 ESM sg244 12 disks diskset "03345" ESM sg4 12 disks diskset "03345"
    Enclosure 0123456789AB Drawer 5 ESM sg244 11 disks diskset "33625" ESM sg4 11 disks diskset "33625"
    Enclosure 0123456789AB sees 59 disks
    
    Enclosure SV12616296 (number 2):
    Enclosure SV12616296 ESM A sg63[0379][scsi7 port 3] ESM B sg3[0379][scsi9 port 4]
    Enclosure SV12616296 Drawer 1 ESM sg63 11 disks diskset "51519" ESM sg3 11 disks diskset "51519"
    Enclosure SV12616296 Drawer 2 ESM sg63 12 disks diskset "36246" ESM sg3 12 disks diskset "36246"
    Enclosure SV12616296 Drawer 3 ESM sg63 12 disks diskset "53750" ESM sg3 12 disks diskset "53750"
    Enclosure SV12616296 Drawer 4 ESM sg63 12 disks diskset "07471" ESM sg3 12 disks diskset "07471"
    Enclosure SV12616296 Drawer 5 ESM sg63 11 disks diskset "16033" ESM sg3 11 disks diskset "16033"
    Enclosure SV12616296 sees 58 disks
    
    Enclosure SV11812206 (number 3):
    Enclosure SV11812206 ESM A sg66[0379][scsi9 port 3] ESM B sg6[0379][scsi8 port 3]
    Enclosure SV11812206 Drawer 1 ESM sg66 11 disks diskset "23334" ESM sg6 11 disks diskset "23334"
    Enclosure SV11812206 Drawer 2 ESM sg66 12 disks diskset "16332" ESM sg6 12 disks diskset "16332"
    Enclosure SV11812206 Drawer 3 ESM sg66 12 disks diskset "52806" ESM sg6 12 disks diskset "52806"
    Enclosure SV11812206 Drawer 4 ESM sg66 12 disks diskset "28492" ESM sg6 12 disks diskset "28492"
    Enclosure SV11812206 Drawer 5 ESM sg66 11 disks diskset "24964" ESM sg6 11 disks diskset "24964"
    Enclosure SV11812206 sees 58 disks
    
    Enclosure SV13306129 (number 4):
    Enclosure SV13306129 ESM A sg64[0379][scsi8 port 2] ESM B sg353[0379][scsi7 port 2]
    Enclosure SV13306129 Drawer 1 ESM sg64 11 disks diskset "47887" ESM sg353 11 disks diskset "47887"
    Enclosure SV13306129 Drawer 2 ESM sg64 12 disks diskset "53906" ESM sg353 12 disks diskset "53906"
    Enclosure SV13306129 Drawer 3 ESM sg64 12 disks diskset "35322" ESM sg353 12 disks diskset "35322"
    Enclosure SV13306129 Drawer 4 ESM sg64 12 disks diskset "37055" ESM sg353 12 disks diskset "37055"
    Enclosure SV13306129 Drawer 5 ESM sg64 11 disks diskset "16025" ESM sg353 11 disks diskset "16025"
    Enclosure SV13306129 sees 58 disks
    
    DCS3700 configuration: 4 enclosures, 1 SSD, 7 empty slots, 233 disks total
    Location 0123456789AB-5-12 appears empty but should have an SSD
    Location SV12616296-1-3 appears empty but should have an SSD
    Location SV12616296-5-12 appears empty but should have an SSD
    Location SV11812206-1-3 appears empty but should have an SSD
    Location SV11812206-5-12 appears empty but should have an SSD
    
    scsi7[07.00.00.00] 0000:11:00.0 [P2 SV13306129 ESM B (sg353)] [P3 SV12616296 ESM A (sg63)] [P4 0123456789AB ESM B (sg4)]
    scsi8[07.00.00.00] 0000:8b:00.0 [P2 SV13306129 ESM A (sg64)] [P3 SV11812206 ESM B (sg6)] [P4 0123456789AB ESM A (sg244)]
    scsi9[07.00.00.00] 0000:90:00.0 [P3 SV11812206 ESM A (sg66)] [P4 SV12616296 ESM B (sg3)]
    
    ################################################################
    # Beginning enclosure checks.
    ################################################################
    Enclosure checks successful.
    
    ################################################################
    # Beginning recovery group checks.
    ################################################################
    Found recovery group BB1RGR, primary server is not the active server.
    
    ################################################################
    # Beginning pdisk checks.
    ################################################################
    Found recovery group BB1RGL pdisk e4d5s06 has 0 paths.
    
    ################################################################
    # Beginning IBM Power RAID checks.
    ################################################################
    IBM Power RAID Array is running in degraded mode.  
    
    Name   PCI/SCSI Location          Description               Status 
    ------ -------------------------  ------------------------- -----------------        
           0007:90:00.0/0:            PCI-E SAS RAID Adapter    Operational        
           0007:90:00.0/0:0:1:0       Advanced Function Disk    Failed        
           0007:90:00.0/0:0:2:0       Advanced Function Disk    Active sda    
           0007:90:00.0/0:2:0:0       RAID 10  Disk Array       Degraded        
           0007:90:00.0/0:0:0:0       RAID 10  Array Member     Active        
           0007:90:00.0/0:0:3:0       RAID 10  Array Member     Failed        
           0007:90:00.0/0:0:4:0       Enclosure                 Active        
           0007:90:00.0/0:0:6:0       Enclosure                 Active        
           0007:90:00.0/0:0:7:0       Enclosure                 Active
    
    ################################################################
    # Beginning the NVMe Controller checks.
    ################################################################
    The NVMe Controller checks are successful.
    
    ################################################################
    # Beginning SSD endurance checks.
    ################################################################
    The SSD endurance checks are successful.
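
When the saved report is large, a quick keyword filter can surface the problem lines shown in example 2. A hedged sketch follows; the here-document stands in for a real saved report, and the keyword list is an assumption for illustration, not something the script itself defines.

```shell
# Hedged post-processing sketch: filter a saved report for the kinds of
# problem indicators shown in example 2. The sample lines below stand in
# for a real /tmp/gnrhealthcheck.out.
cat > /tmp/gnrhealthcheck.out <<'EOF'
Enclosure checks successful.
Found recovery group BB1RGL pdisk e4d5s06 has 0 paths.
IBM Power RAID Array is running in degraded mode.
EOF
grep -iE 'found|failed|degraded' /tmp/gnrhealthcheck.out > /tmp/gnr_problems.out
cat /tmp/gnr_problems.out
```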

See also

See also the following Elastic Storage Server: Problem Determination Guide topic:
  • Checking the health of an ESS configuration: a sample scenario

Location

/usr/lpp/mmfs/samples/vdisk