
PureData System for Analytics (Netezza): Hardware performance diagnostics

Troubleshooting


Problem

How do I diagnose a hardware problem that is causing me a performance issue?

Symptom

The PDA appliance is designed with a high degree of resilience and hardware redundancy. In most cases the system will deal with hardware issues itself, so the first step is to review the current system status and understand which hardware problems can cause performance issues.

Environment

The following information is applicable to Mako (N3001), Striper (N2001, N2002) and TwinFin (N1001) PureData System for Analytics (Netezza) systems. Command output here is for release 7.2.1 of the database software; earlier versions may show slightly different output, but the content will be the same.

Resolving The Problem

System managed hardware issues

The administration tool and the performance portal both give you a graphical view of the system hardware and highlight any issues with it. The most informative approach (and the easiest to add to a ticket) is the output of nzhw -issues, which lists any hardware components that are not currently active.

[nz@nz80409-h2 ~]$ nzhw -issues
Description HW ID Location            Role     State Security
----------- ----- --------------------- -------- ----- --------
Blower      1017  spa1.blower2          Inactive Down  N/A
Disk        1060  spa1.diskEncl1.disk4  Failed   Ok    N/A
Disk        1067  spa1.diskEncl1.disk11 Failed   Ok    N/A  
Here, for example, we see two failed data disks and one blower (fan) that is down. If the system had no hardware problems the output would look like this:

[nz@nz80409-h2 ~]$ nzhw -issues
No entries found
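
If you want a regular check rather than an ad hoc one, something along the lines of the following shell sketch can be scheduled from cron on the host as the nz user. It only uses the nzhw command shown above; the mail command and recipient address are illustrative assumptions, so adapt the alerting to whatever your site uses.

#!/bin/sh
# Hedged sketch: raise an alert if nzhw reports any hardware issues.
# Assumes a working 'mail' setup; the recipient address is a placeholder.
if ! nzhw -issues | grep -q "No entries found"; then
    nzhw -issues | mail -s "PDA hardware issues detected" dba-team@example.com
fi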

Hardware problems do not always cause performance problems - the appliance redundancy will shield you from the majority of them, and they should be reported to IBM support to be dealt with as a business-as-usual (BAU) activity. There are, however, cases where performance problems can be caused by hardware - the following are the most likely scenarios.

1. Active disk regen in progress.

Disks, as mechanical components, do fail. The appliances ship with spare drives ready to replace any failed disk as required. A disk failing does not interrupt 'in flight' queries at all, but once a disk has failed it must be rebuilt from the data mirrors (a 'regen').

You can tell if a regen is in progress with the 'nzds -regenstatus' command.

[nz@nz80409-h2 ~]$ nzds show -regenstatus
Data Slice SPU  Source Destination Start Time              % Done
---------- ---- ------ ----------- ----------------------- --------
5          1092 1035   1014        09-Apr-09, 07:24:55 EDT     0.01
6          1092 1035   1014        09-Apr-09, 07:24:55 EDT     0.01

If a regen is in progress, it will have an impact on query performance. The system is designed to prioritize active queries and will throttle back the rate at which the disk regen progresses, but you could still see around a 20% performance hit until the regen finishes.

2. More than one blade failed.

Systems larger than a mini-Mako/Skimmer are designed with an additional 'hot spare' blade per rack. A full Striper or Mako rack has seven blades, but is capable of running at 100% performance with only six active blades. If we lose more blades than that, data slices will be redistributed among the remaining blades and the system will still run, but at lower performance (roughly 20% per blade lost beyond the first).

A single blade down is nothing to worry about, and blade failures are not common. You should arrange for hardware service as soon as you can to restore resilience.
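
If you want to confirm the state of the blades (SPUs) yourself, nzhw can list them in the same way it lists disks later in this document. The -type spu filter below is a sketch based on that pattern; check the nzhw usage on your release if the type name differs.

# List the SPUs (blades) with their Role and State, then count how many are Active
[nz@nz80409-h2 ~]$ nzhw -type spu
[nz@nz80409-h2 ~]$ nzhw -type spu | grep -c Active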

Lower priority hardware issues

The system will only fail components when they have actually failed or are causing a significant problem. We also provide tools that allow you to be more proactive about potential issues, or to detect issues that the system has not yet flagged. The majority of these are not performance impacting; the main area to review is disk performance. There are two main things to review:

1. nzhealthcheck

nzhealthcheck is a tool provided with your appliance that evaluates system health against a large catalog of rules. These typically cover preventative maintenance items that are not currently causing you issues, but could do so if not dealt with. Examples of such issues include:

Disks that are exhibiting behaviour we associate with imminent failure

Non-critical hardware issues such as failed memory DIMMs or internal networking issues.

Typically these items will not cause you problems if they are dealt with, but resolving any outstanding hardware issues is often a sensible first step when investigating performance.
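
A basic nzhealthcheck run can be started without arguments as the nz user on the active host, as sketched below. The report content varies by system and rule catalog, so no sample output is reproduced here; review the generated report and raise anything it flags with IBM support.

# Run the health check rule set and review the report it produces
[nz@nz80409-h2 ~]$ nzhealthcheck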

2. nz_check_disk_scan_speeds

Sometimes there will be issues in the system that are not easily flagged. nz_check_disk_scan_speeds is a process which exercises the end-to-end software and hardware stack against a known data volume (which is dynamically generated). Because the data volume and expected runtimes are known, performance can be benchmarked against expected values.

[nz@striper ~]$ nz_check_disk_scan_speeds

Reusing existing table 'NZ_CHECK_DISK_SCAN_SPEEDS' for these tests

Running the scan test now ...

Iteration        Elapsed Seconds          Computed MB per SECOND per dataslice
------------     ---------------          ------------------------------------
     1               29.79                         134.27324
     2               29.41                         136.00816
     3               29.42                         135.96193
     4               29.55                         135.36379
     5               29.73                         134.54423
============     ===============          ====================================

   slowest            29.79                         134.27324
  AVERAGE            29.58                         135.22650
  fastest            29.41                         136.00816


################################################################################

The generated NZ_CHECK_DISK_SCAN_SPEEDS table in SYSTEM can be dropped, or left in place for the next time you run this.

It is important that the system is quiet when this is run! We can only infer the scan speeds accurately when the system is not processing other workload (a quick way to check this is shown after the thresholds below). Typically we would expect to see values above the following:

TwinFin - read speed over 90 MB/sec

Striper/Mako - read speed over 120 MB/sec
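
Before starting the test, it is worth confirming that the system is online and that no other sessions are running work. A quick sketch using the standard nzstate and nzsession tools (run on the host as the nz user) is shown below; if anything other than your own session is active, wait for a quieter point.

# Confirm the system is online and check for other active sessions
[nz@nz80409-h2 ~]$ nzstate
[nz@nz80409-h2 ~]$ nzsession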

When you run nz_check_disk_scan_speeds it is advisable to start a second session and run nz_responders. nz_responders monitors running queries and tells you how many data slices are participating in each query - from this you can determine whether there is any processing skew. For a normal workload this can give you mixed signals - after all, how do you know whether the skew is down to processing skew, data skew, or hardware issues for a given data slice? In this specific case we know that our query and data are perfectly distributed, so any discrepancy is almost certainly down to hardware issues. Take the following example:

[nz@nz80409-h2 ~]$ nz_check_disk_scan_speeds

Reusing existing table 'NZ_CHECK_DISK_SCAN_SPEEDS' for these tests

Running the scan test now ...

Iteration        Elapsed Seconds          Computed MB per SECOND per dataslice
------------     ---------------          ------------------------------------
     1               71.32                          56.08524
     2               71.37                          56.04595

This is a TwinFin-3 (N1001-005) system. Here we can clearly see that the scan rates of ~56 MB/sec are well below the ~90 MB/sec we would expect. If we review the nz_responders output we see:

[nz@nz80409-h2 ~]$ nz_responders

20160509 Plan #  Snippet   Time S/P  State  Busy Dataslices ...       SQL                                      Username/Database
======== ======= ========= ========= ======= ==== ==================== ======================================== ====================
05:51:20
05:51:30
05:51:40  262224     (2/2)        10 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:51:51  262224     (2/2)        20 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:01  262224     (2/2)        30 RUNNING 18                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:11  262224     (2/2)        40 RUNNING 8                         select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:21  262224     (2/2)        50 RUNNING 4    2 4 17 18            select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:31  262224     (2/2)        61 RUNNING 2    2 4                  select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:41  262224     (2/2)        71 RUNNING 1    2                    select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:51  262239     (2/2)        10 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:01  262239     (2/2)        20 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:12  262239     (2/2)        30 RUNNING 18                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:22  262239     (2/2)        40 RUNNING 8                         select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:32  262239     (2/2)        50 RUNNING 4    2 4 17 18            select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:42  262239     (2/2)        60 RUNNING 1    2                    select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:52  262239     (2/2)        70 RUNNING 1    2                    select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:02  262255     (2/2)        10 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:12  262255     (2/2)        20 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:22  262255     (2/2)        30 RUNNING 18                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:32  262255     (2/2)        40 RUNNING 8                         select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:43  262255     (2/2)        50 RUNNING 4    2 4 17 18            select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:53  262255     (2/2)        60 RUNNING 2    2 4                  select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM

The Busy column indicates how many dataslices are participating in the query - we would want this to be 22 (the total number of dataslices on the system) for as much of the query runtime as possible. As the number of busy dataslices gets low, the tool starts to record which dataslices are still running (in the 'Dataslices...' column). Here we see that dataslice 2 is consistently running for longer than all the other dataslices, so its underlying disk should be reviewed and possibly failed. We may also want to look at dataslice 4.

We can determine which disk we need to fail by looking at the nzds output. This will tell us which disk is the primary for the dataslice (and therefore the one most likely causing the issue):

[nz@nz80409-h2 ~]$ nzds -dsid 2 -detail

Data Slice Status SPU  Partition Size (GiB) % Used Supporting Disks Supporting Disks Locations                Primary Storage
---------- ------- ---- --------- ---------- ------ ---------------- ----------------------------------------- ---------------
2          Healthy 1697 1         195        75.70  1053,1188        spa1.diskEncl1.disk1,spa1.diskEncl4.disk1 1188  

At this point, if you have any doubts, log a call with IBM support and report all the supporting information you have gathered. The system will typically detect and deal with problem disks itself, so having to proactively fail a disk is fairly uncommon. If you do want to proceed, you would do this with the nzhw command. First, ensure you have the correct primary storage hardware ID.

[nz@nz80409-h2 ~]$ nzhw -id 1188

Description HW ID Location             Role   State Security
----------- ----- -------------------- ------ ----- --------
Disk        1188  spa1.diskEncl4.disk1 Active Ok    N/A  

Then validate that you have at least one spare disk that the system can regenerate to. If not, then STOP and log a ticket with support to review why, if you are not aware already.

[nz@nz80409-h2 ~]$ nzhw -type disk | grep -i spare

Disk        1062  spa1.diskEncl1.disk10  Spare  Ok    N/A
Disk        1067  spa1.diskEncl1.disk15  Spare  Ok    N/A  

So we have spares and we have the correct disk ID. Check that the system is not too busy: failing a disk will not cause an outage to any running SQL, but you want to control when the regen happens, ideally at a quiet point. We use the nzhw command to do this:

[nz@nz80409-h2 ~]$ nzhw failover -id 1188

Are you sure you want to proceed (y|n)? [n] y

We can then monitor the status. We want to ensure a disk has been assigned for the regen:

[nz@nz80409-h2 ~]$ nzhw -issues

Description HW ID Location             Role     State Security
----------- ----- -------------------- -------- ----- --------
Disk        1188  spa1.diskEncl4.disk1 Failed   Ok    N/A
Disk        1606  spa1.diskEncl4.disk4 Assigned Ok    N/A

We then want to monitor regen progress:

[nz@nz80409-h2 ~]$ nzds -regenstatus
Data Slice Status   SPU  Partition Size (GiB) % Used Supporting Disks Start Time          % Done
---------- --------- ---- --------- ---------- ------ ---------------- ------------------- -------
1          Repairing 1697 0         195        75.85  1053,1606        2016-05-17 11:30:49    0.00
2          Repairing 1697 1         195        75.70  1053,1606        2016-05-17 11:30:49    4.46
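
If you want to keep an eye on the regen without re-running the command by hand, the standard Linux watch utility on the host works well; the 60-second interval below is just an example.

# Refresh the regen status every 60 seconds until the repair completes
[nz@nz80409-h2 ~]$ watch -n 60 'nzds -regenstatus'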

Other issues

Of course this is not an exhaustive list, but it outlines the most common hardware issues that can cause performance problems. Overall, the nz_check_disk_scan_speeds script is useful for identifying whether anything in the path between disk and host is causing a performance drag. Disks and blades are the most likely components, but not the only ones.

[{"Product":{"code":"SSULQD","label":"IBM PureData System"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":"Blade","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.0.0","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
17 October 2019

UID

swg21983496