
PureData System for Analytics (Netezza): Hardware performance diagnostics

Troubleshooting


Problem

How do I diagnose a hardware problem that is causing me a performance issue?

Symptom

The PDA appliance is designed with a high degree of resilience and hardware redundancy. In most cases the system will deal with hardware issues itself, so the first step is to review the current system status and understand which hardware problems can cause performance issues.

Environment

The following information is applicable to Mako (N3001), Striper (N2001, N2002) and TwinFin (N1001) PureData System for Analytics (Netezza) systems. Command output here is for release 7.2.1 of the database software; earlier versions may show slightly different output, but the content will be the same.

Resolving The Problem

System managed hardware issues

The administration tool and the performance portal both give you a graphical view of the system hardware and highlight any issues with it. The most informative approach (and the easiest to add to a ticket) is the output of nzhw -issues, which lists any hardware components that are not currently active.

[nz@nz80409-h2 ~]$ nzhw -issues
Description HW ID Location            Role     State Security
----------- ----- --------------------- -------- ----- --------
Blower      1017  spa1.blower2          Inactive Down  N/A
Disk        1060  spa1.diskEncl1.disk4  Failed   Ok    N/A
Disk        1067  spa1.diskEncl1.disk11 Failed   Ok    N/A  
Here, for example, we see two failed data disks and one blower (fan) that is down. If the system had no hardware problems the output would look like this:

[nz@nz80409-h2 ~]$ nzhw -issues
No entries found
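
If you want a regular check rather than an ad hoc one, something along the lines of the following shell sketch can be scheduled from cron on the host as the nz user. It only uses the nzhw command shown above; the mail command and recipient address are illustrative assumptions, so adapt the alerting to whatever your site uses.

#!/bin/sh
# Hedged sketch: raise an alert if nzhw reports any hardware issues.
# Assumes a working 'mail' setup; the recipient address is a placeholder.
if ! nzhw -issues | grep -q "No entries found"; then
    nzhw -issues | mail -s "PDA hardware issues detected" dba-team@example.com
fi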

Hardware problems do not always cause performance problems - the appliance redundancy will shield you from the majority of them, and they should be reported to IBM support to be dealt with as a business-as-usual (BAU) activity. There are, however, cases where performance problems can be caused by hardware - the following are the most likely scenarios.

1. Active disk regen in progress.

Disks, as mechanical components, do fail. The appliances ship with spare drives ready to replace any failed disk as required. A disk failing does not interrupt 'in flight' queries at all, but once a disk has failed it must be rebuilt from the data mirrors (a 'regen').

You can tell if a regen is in progress with the 'nzds -regenstatus' command.

[nz@nz80409-h2 ~]$ nzds show -regenstatus
Data Slice SPU  Source Destination Start Time              % Done
---------- ---- ------ ----------- ----------------------- --------
5          1092 1035   1014        09-Apr-09, 07:24:55 EDT     0.01
6          1092 1035   1014        09-Apr-09, 07:24:55 EDT     0.01

If a regen is in progress, it will have an impact on query performance. The system is designed to prioritize active queries and will throttle back the rate at which the disk regen progresses, but you could still see around a 20% performance hit until the regen finishes.

2. More than one blade failed.

Systems larger than a mini-Mako/Skimmer are designed with an additional 'hot spare' blade per rack. A full Striper or Mako rack has seven blades, but is capable of running at 100% performance with only six active blades. If we lose more blades than that, data slices will be redistributed among the remaining blades and the system will still run, but at lower performance (roughly 20% per blade lost beyond the first).

A single blade down is nothing to worry about, and blade failures are not common. You should arrange for hardware service as soon as you can to restore resilience.
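
If you want to confirm the state of the blades (SPUs) yourself, nzhw can list them in the same way it lists disks later in this document. The -type spu filter below is a sketch based on that pattern; check the nzhw usage on your release if the type name differs.

# List the SPUs (blades) with their Role and State, then count how many are Active
[nz@nz80409-h2 ~]$ nzhw -type spu
[nz@nz80409-h2 ~]$ nzhw -type spu | grep -c Active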

Lower priority hardware issues

The system will only fail components when they have actually failed or are causing a significant problem. We also provide tools that allow you to be more proactive about potential issues, or to detect issues that the system has not yet flagged. The majority of these are not performance impacting; the main area to review is disk performance. There are two main things to review:

1. nzhealthcheck

nzhealthcheck is a tool provided with your appliance that evaluates system health against a large catalog of rules. These typically cover preventative maintenance items that are not currently causing you issues, but could do so if not dealt with. Examples of such issues include:

Disks that are exhibiting behaviour we associate with imminent failure

Non-critical hardware issues such as failed memory DIMMs or internal networking issues.

Typically these items will not cause you problems if they are dealt with, but resolving any outstanding hardware issues is often a sensible first step when investigating performance.
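
A basic nzhealthcheck run can be started without arguments as the nz user on the active host, as sketched below. The report content varies by system and rule catalog, so no sample output is reproduced here; review the generated report and raise anything it flags with IBM support.

# Run the health check rule set and review the report it produces
[nz@nz80409-h2 ~]$ nzhealthcheck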

2. nz_check_disk_scan_speeds

Sometimes there will be issues in the system that are not easily flagged. nz_check_disk_scan_speeds is a process which exercises the end-to-end software and hardware stack against a known data volume (which is dynamically generated). Because the data volume and expected runtimes are known, performance can be benchmarked against expected values.

[nz@striper ~]$ nz_check_disk_scan_speeds

Reusing existing table 'NZ_CHECK_DISK_SCAN_SPEEDS' for these tests

Running the scan test now ...

Iteration        Elapsed Seconds          Computed MB per SECOND per dataslice
------------     ---------------          ------------------------------------
     1               29.79                         134.27324
     2               29.41                         136.00816
     3               29.42                         135.96193
     4               29.55                         135.36379
     5               29.73                         134.54423
============     ===============          ====================================

   slowest            29.79                         134.27324
  AVERAGE            29.58                         135.22650
  fastest            29.41                         136.00816


################################################################################

The generated NZ_CHECK_DISK_SCAN_SPEEDS table in SYSTEM can be dropped, or left in place for the next time you run this.

It is important that the system is quiet when this is run! We can only infer the scan speeds accurately when the system is not processing other workload (a quick way to check this is shown after the thresholds below). Typically we would expect to see values above the following:

TwinFin - read speed over 90 MB/sec

Striper/Mako - read speed over 120 MB/sec
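
Before starting the test, it is worth confirming that the system is online and that no other sessions are running work. A quick sketch using the standard nzstate and nzsession tools (run on the host as the nz user) is shown below; if anything other than your own session is active, wait for a quieter point.

# Confirm the system is online and check for other active sessions
[nz@nz80409-h2 ~]$ nzstate
[nz@nz80409-h2 ~]$ nzsession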

When you run nz_check_disk_scan_speeds it is advisable to start a second session and run nz_responders. nz_responders monitors running queries and tells you how many data slices are participating in each query - from this you can determine whether there is any processing skew. For a normal workload this can give you mixed signals - after all, how do you know whether the skew is down to processing skew, data skew, or hardware issues for a given data slice? In this specific case we know that our query and data are perfectly distributed, so any discrepancy is almost certainly down to hardware issues. Take the following example:

[nz@nz80409-h2 ~]$ nz_check_disk_scan_speeds

Reusing existing table 'NZ_CHECK_DISK_SCAN_SPEEDS' for these tests

Running the scan test now ...

Iteration        Elapsed Seconds          Computed MB per SECOND per dataslice
------------     ---------------          ------------------------------------
     1               71.32                          56.08524
     2               71.37                          56.04595

This is a TwinFin-3 (N1001-005) system. Here we can clearly see that the scan rates of ~56 MB/sec are well below the ~90 MB/sec we would expect. If we review the nz_responders output we see:

[nz@nz80409-h2 ~]$ nz_responders

20160509 Plan #  Snippet   Time S/P  State  Busy Dataslices ...       SQL                                      Username/Database
======== ======= ========= ========= ======= ==== ==================== ======================================== ====================
05:51:20
05:51:30
05:51:40  262224     (2/2)        10 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:51:51  262224     (2/2)        20 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:01  262224     (2/2)        30 RUNNING 18                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:11  262224     (2/2)        40 RUNNING 8                         select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:21  262224     (2/2)        50 RUNNING 4    2 4 17 18            select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:31  262224     (2/2)        61 RUNNING 2    2 4                  select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:41  262224     (2/2)        71 RUNNING 1    2                    select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:52:51  262239     (2/2)        10 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:01  262239     (2/2)        20 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:12  262239     (2/2)        30 RUNNING 18                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:22  262239     (2/2)        40 RUNNING 8                         select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:32  262239     (2/2)        50 RUNNING 4    2 4 17 18            select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:42  262239     (2/2)        60 RUNNING 1    2                    select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:53:52  262239     (2/2)        70 RUNNING 1    2                    select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:02  262255     (2/2)        10 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:12  262255     (2/2)        20 RUNNING 22                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:22  262255     (2/2)        30 RUNNING 18                        select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:32  262255     (2/2)        40 RUNNING 8                         select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:43  262255     (2/2)        50 RUNNING 4    2 4 17 18            select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM
05:54:53  262255     (2/2)        60 RUNNING 2    2 4                  select count(*) as "nz_check_disk_scan_s ADMIN/SYSTEM

The Busy column indicates how many dataslices are participating in the query - we would want this to be 22 (the total number of dataslices on the system) for as much of the query runtime as possible. As the number of busy dataslices gets low, the tool starts to record which dataslices are still running (in the 'Dataslices...' column). Here we see that dataslice 2 is consistently running for longer than all the other dataslices, so its underlying disk should be reviewed and possibly failed. We may also want to look at dataslice 4.

We can determine which disk we need to fail by looking at the nzds output. This will tell us which disk is the primary for the dataslice (and therefore the one most likely causing the issue):

[nz@nz80409-h2 ~]$ nzds -dsid 2 -detail

Data Slice Status SPU  Partition Size (GiB) % Used Supporting Disks Supporting Disks Locations                Primary Storage
---------- ------- ---- --------- ---------- ------ ---------------- ----------------------------------------- ---------------
2          Healthy 1697 1         195        75.70  1053,1188        spa1.diskEncl1.disk1,spa1.diskEncl4.disk1 1188  

At this point, if you have any doubts, log a call with IBM support and report all the supporting information you have gathered. The system will typically detect and deal with problem disks itself, so having to proactively fail a disk is fairly uncommon. If you do want to proceed, you would do this with the nzhw command. First, ensure you have the correct primary storage hardware ID.

[nz@nz80409-h2 ~]$ nzhw -id 1188

Description HW ID Location             Role   State Security
----------- ----- -------------------- ------ ----- --------
Disk        1188  spa1.diskEncl4.disk1 Active Ok    N/A  

Then validate that you have at least one spare disk that the system can regenerate to. If not, then STOP and log a ticket with support to review why, if you are not aware already.

[nz@nz80409-h2 ~]$ nzhw -type disk | grep -i spare

Disk        1062  spa1.diskEncl1.disk10  Spare  Ok    N/A
Disk        1067  spa1.diskEncl1.disk15  Spare  Ok    N/A  

So we have spares and we have the correct disk ID. Check that the system is not too busy: failing a disk will not cause an outage to any running SQL, but you want to control when the regen happens, ideally at a quiet point. We use the nzhw command to do this:

[nz@nz80409-h2 ~]$ nzhw failover -id 1188

Are you sure you want to proceed (y|n)? [n] y

We can then monitor the status. We want to ensure a disk has been assigned for the regen:

[nz@nz80409-h2 ~]$ nzhw -issues

Description HW ID Location             Role     State Security
----------- ----- -------------------- -------- ----- --------
Disk        1188  spa1.diskEncl4.disk1 Failed   Ok    N/A
Disk        1606  spa1.diskEncl4.disk4 Assigned Ok    N/A

We then want to monitor regen progress:

[nz@nz80409-h2 ~]$ nzds -regenstatus
Data Slice Status   SPU  Partition Size (GiB) % Used Supporting Disks Start Time          % Done
---------- --------- ---- --------- ---------- ------ ---------------- ------------------- -------
1          Repairing 1697 0         195        75.85  1053,1606        2016-05-17 11:30:49    0.00
2          Repairing 1697 1         195        75.70  1053,1606        2016-05-17 11:30:49    4.46
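
If you want to keep an eye on the regen without re-running the command by hand, the standard Linux watch utility on the host works well; the 60-second interval below is just an example.

# Refresh the regen status every 60 seconds until the repair completes
[nz@nz80409-h2 ~]$ watch -n 60 'nzds -regenstatus'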

Other issues

Of course this is not an exhaustive list, but it outlines the most common hardware issues that can cause performance problems. Overall, the nz_check_disk_scan_speeds script is useful for identifying whether anything in the path between disk and host is causing a performance drag. Disks and blades are the most likely components, but not the only ones.

[{"Product":{"code":"SSULQD","label":"IBM PureData System"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":"Blade","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.0.0","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
17 October 2019

UID

swg21983496