Description of SmartCollect

Question & Answer

Question

What is SmartCollect?

Answer

1) What is SmartCollect?

SmartCollect is a script supplied by Netezza Support which checks for the following SPU attributes:

Topology issues
Drives exceeding S.M.A.R.T thresholds
Drives with potential disk timing issues
Recent hardware-specific nzevents

2) What does S.M.A.R.T stand for?

S.M.A.R.T stands for Self-Monitoring, Analysis and Reporting Technology. See also "What is S.M.A.R.T. as used in SmartCollect?". The following Wikipedia page describes some of the attributes that disk hardware manufacturers monitor:

http://en.wikipedia.org/wiki/S.M.A.R.T

3) What does SmartCollect output look like?

** Detail of all devices under SMART monitoring (verbose notify only)
** Loc | HwId | SPU Serial | Drive | Cur(Wrst)/Thres | Delta | Raw | STATUS
------+------+---------------+-----------------+-----------------+-------+-----+---------------------------------------
6-13 | 1069 | 802S53704041 | WD-WCANU2351463 | 197 / 140 | 57 | 22 | NOTICE: Retired Sector Count Quantity
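
For scripted post-processing, a detail row like the one above can be split on the "|" separator. The following is a minimal sketch only: the function name parse_smart_row is hypothetical, the column order is taken from the header row shown above, and the real report may contain variations (for example a worst value in parentheses) that this sketch does not handle beyond reading the leading number.

```shell
#!/bin/sh
# Sketch: split one SmartCollect detail row on "|" and print named fields.
# parse_smart_row is a hypothetical helper, not part of the Support tools.
parse_smart_row() {
    printf '%s\n' "$1" | awk -F'|' '
    {
        # Trim surrounding whitespace from every column.
        for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        # Column 5 holds "Cur / Thres"; split it and read the leading numbers.
        split($5, ct, "/")
        printf "loc=%s hwid=%s drive=%s cur=%d thres=%d delta=%s raw=%s\n",
               $1, $2, $4, ct[1] + 0, ct[2] + 0, $6, $7
        printf "status=%s\n", $8
    }'
}

parse_smart_row '6-13 | 1069 | 802S53704041 | WD-WCANU2351463 | 197 / 140 | 57 | 22 | NOTICE: Retired Sector Count Quantity'
```

In the sample row the Delta column (57) equals the current value minus the threshold (197 - 140); whether that holds for every attribute is not stated in this document.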

4) From where can I get the latest SmartCollect script?

Download the latest Support tools (nz-support.tgz) from ntzftp.netezza.com
Untar the file (tar -xvzf nz-support.tgz)
Execute the “unpack” script

5) How do I invoke SmartCollect?

The script is typically run once a day from the crontab of user “nz”, scheduled before or after an ETL run, using an entry such as:

0 1 * * * /nz/support/bin/SmartCollect

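One way to install the crontab entry without creating duplicates is to filter the existing crontab and append the line only if it is missing. This is a sketch under assumptions: add_smartcollect_entry is a hypothetical helper name, and the final pipeline is left commented out so that nothing is changed until it is reviewed and run as user nz.

```shell
#!/bin/sh
# Sketch: append the SmartCollect cron entry to a crontab listing unless present.
CRON_ENTRY='0 1 * * * /nz/support/bin/SmartCollect'

# Reads an existing crontab on stdin and writes the updated crontab to stdout.
# add_smartcollect_entry is a hypothetical helper, not part of the Support tools.
add_smartcollect_entry() {
    awk -v entry="$CRON_ENTRY" '
        { print }                          # pass every existing line through
        $0 == entry { found = 1 }          # remember an exact match
        END { if (!found) print entry }    # append only when missing
    '
}

# Typical use as user nz (commented out so the sketch has no side effects):
# crontab -l 2>/dev/null | add_smartcollect_entry | crontab -
```

Because the helper only transforms text, its output can be inspected before it is piped back into crontab.
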
6) How do I specify the e-mail distribution of the output?

You can specify it at the command line using the following optional arguments:

$ SmartCollect [ --notify -cc ] [ --verboseNotify ]

--notify is used to generate summary reports that are intended for customers
--verboseNotify is used to generate a more detailed report that is usually meant for the TAMs or Support; however, customers can subscribe to it as well.

7) Which nzevents are examined by SmartCollect?

SmartCollect does not look at all events; it looks at the most recent events involving the following types of SPU errors:

ECC Errors: Correctable memory errors at the SPU level, usually caused by memory degrading over time. A correctable error means that a bad bit was encountered in the memory chip and disabled. This is not fatal in itself, though multiple correctable errors may indicate that the memory is about to fail. If a device reports correctable ECC errors in multiple recurring yet distinct 15-minute periods, it should be removed.
FPGA Errors: These errors generally indicate that the FPGA could not interpret a block on the disk; they are usually accompanied by a block number and SPU ID. The root cause could be a bad disk, corrupt table data, an FPGA data engine bug, or a program error.
UEC Errors: A variant of the FPGA error; UEC stands for Unexpected Error Condition. UEC errors are caused by exception conditions and are typically reported some time after they occur.
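
The ECC rule above (correctable errors in multiple distinct 15-minute periods) can be approximated by bucketing event timestamps. The sketch below rests on several assumptions that are not from this document: the input is one epoch-seconds timestamp per line for a single device, periods are fixed 900-second windows aligned to the epoch, and count_ecc_periods is a hypothetical helper name. How SmartCollect itself defines a period is not described here.

```shell
#!/bin/sh
# Sketch: count distinct 15-minute windows that contain at least one event.
# Input: one epoch-seconds timestamp per line (assumed format).
count_ecc_periods() {
    awk '
        /^[0-9]+$/ {
            bucket = int($1 / 900)           # 900 s = 15 minutes
            if (!(bucket in seen)) { seen[bucket] = 1; n++ }
        }
        END { print n + 0 }                  # prints 0 when there is no input
    '
}

# Example: two events in one window and a third an hour later.
periods=$(printf '%s\n' 1571270400 1571270460 1571274000 | count_ecc_periods)
if [ "$periods" -ge 2 ]; then
    echo "correctable ECC errors seen in $periods distinct 15-minute periods"
fi
```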

8) How do I utilize the reported SPU status readings?

The current health of each drive is assessed against observed ‘normal’ values on ‘new’ drives, the manufacturer’s end-of-life recommendations, and established field test results. Statuses are reported in the following order of increasing severity:

Debug -> INFO -> Notice -> Monitoring -> Monitor Watch -> Monitor WARNING -> FAIL OVER and REPLACE SPU NOW!

Hence, drives whose values have moved from ‘new’ toward end-of-life are flagged as being ‘monitored’. Devices approaching end-of-life are recommended for replacement as soon as possible. You should replace an SPU when its status changes from “NOTICE” to “Monitor Watch” or higher.
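
The severity ordering and the replacement rule can be expressed as a small lookup. This is an illustrative sketch: the helper names severity_rank and should_replace_spu are hypothetical, and matching on the leading words of the STATUS column (case-insensitively) is an assumption about how the status strings appear in the report.

```shell
#!/bin/sh
# Sketch: map a STATUS string to a numeric rank in the order listed above.
severity_rank() {
    status=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$status" in
        DEBUG*)             echo 0 ;;
        INFO*)              echo 1 ;;
        NOTICE*)            echo 2 ;;
        MONITORING*)        echo 3 ;;
        "MONITOR WATCH"*)   echo 4 ;;
        "MONITOR WARNING"*) echo 5 ;;
        "FAIL OVER"*)       echo 6 ;;
        *)                  echo -1 ;;   # unrecognized status
    esac
}

# Succeeds when the status is "Monitor Watch" or higher.
should_replace_spu() {
    [ "$(severity_rank "$1")" -ge 4 ]
}

if should_replace_spu 'Monitor WARNING'; then
    echo "replace SPU"
fi
```

An unrecognized status ranks as -1, so it never triggers a replacement recommendation on its own and should be reviewed manually.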

Historical Number

NZ101587

Document Information

Modified date:
17 October 2019

UID

swg21572930
