Question & Answer
Question
What is SmartCollect?
Answer
1) What is SmartCollect?
SmartCollect is a script supplied by Netezza Support which checks for the following SPU attributes:
- Topology issues
- Drives exceeding S.M.A.R.T thresholds
- Drives with potential disk timing issues
- Recent hardware-specific nzevents
2) What does S.M.A.R.T. stand for?
S.M.A.R.T. stands for Self-Monitoring, Analysis and Reporting Technology. See also: What is S.M.A.R.T. as used in SmartCollect? The following Wikipedia page describes some of the attributes that disk hardware manufacturers monitor:
http://en.wikipedia.org/wiki/S.M.A.R.T
3) What does SmartCollect output look like?
** Detail of all devices under SMART monitoring (verbose notify only)
** Loc | HwId | SPU Serial | Drive | Cur(Wrst)/Thres | Delta | Raw | STATUS
------+------+---------------+-----------------+-----------------+-------+-----+---------------------------------------
6-13 | 1069 | 802S53704041 | WD-WCANU2351463 | 197 / 140 | 57 | 22 | NOTICE: Retired Sector Count Quantity
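If the verbose report needs to be post-processed, a detail row like the one above can be split on its pipe separators. The following is a minimal Python sketch, not part of SmartCollect itself: the `SmartRow` class, its field names, and `parse_row` are hypothetical, and it assumes every detail row has the same eight columns as the sample and that the first number in the Cur(Wrst)/Thres column is the current value. In the sample row, Delta equals current minus threshold (197 - 140 = 57); the sketch checks that relationship for this row only and does not assume it holds in general.

```python
from dataclasses import dataclass


@dataclass
class SmartRow:
    """One detail row of the verbose report (field names are hypothetical)."""
    loc: str
    hw_id: int
    spu_serial: str
    drive: str
    current: int
    threshold: int
    delta: int
    raw: int
    status: str


def parse_row(line: str) -> SmartRow:
    """Parse one pipe-delimited detail row, assuming the 8-column sample layout."""
    # Split on at most 7 pipes so a status message containing '|' stays intact.
    fields = [field.strip() for field in line.split("|", 7)]
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}: {line!r}")
    loc, hw_id, serial, drive, cur_thres, delta, raw, status = fields
    # The sample shows the column as "197 / 140"; int() tolerates the padding.
    current, threshold = (int(part) for part in cur_thres.split("/"))
    return SmartRow(loc, int(hw_id), serial, drive,
                    current, threshold, int(delta), int(raw), status)


sample = ("6-13 | 1069 | 802S53704041 | WD-WCANU2351463 | 197 / 140 "
          "| 57 | 22 | NOTICE: Retired Sector Count Quantity")
row = parse_row(sample)
print(row.loc, row.drive, row.status)
print(row.current - row.threshold == row.delta)
```

The header and separator lines of the report would need to be skipped before calling `parse_row`, since they do not contain numeric columns.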
4) From where can I get the latest SmartCollect script?
- Download the latest Support tools (nz-support.tgz) from ntzftp.netezza.com
- Untar the file (tar -xvzf nz-support.tgz)
- Execute the “unpack” script
5) How do I invoke SmartCollect?
The script is typically invoked daily from crontab as user “nz”, scheduled before or after an ETL run, with an entry such as:
0 1 * * * /nz/support/bin/SmartCollect
6) How do I specify the e-mail distribution of the output?
You can specify it at the command line using the following optional arguments:
$ SmartCollect [ --notify -cc ] [ --verboseNotify ]
- --notify is used to generate summary reports that are intended for customers
- --verboseNotify is used to generate a more detailed report that is usually meant for the TAMs or Support; however, customers can subscribe to it as well.
7) Which nzevents are examined by SmartCollect?
SmartCollect does not look at all events; it looks at the most recent events involving the following types of SPU errors:
- ECC Errors: Correctable memory errors at the SPU level, usually caused by memory going bad over time. A correctable error means that a bad bit was encountered in the memory chip and disabled. This is not a fatal error in and of itself, though multiple correctable errors may indicate that the memory is about to fail. If a device has correctable ECC errors in multiple recurring yet distinct 15-minute periods, it should be removed.
- FPGA Errors: These errors generally indicate that the FPGA could not interpret a block on the disk, and they are usually accompanied by a block number and SPU ID. The root cause could be a bad disk, corrupt table data, an FPGA data engine bug, or a program error.
- UEC Errors: A variant of the FPGA error; UEC stands for Unexpected Error Condition. UEC errors are caused by exception conditions and are typically reported some time after they actually occur.
8) How do I utilize the reported SPU status readings?
Utilizing observed ‘normal’ values on ‘new’ drives, the manufacturer’s recommendations for end-of-life as well as established field test results, the current health of the drive is assessed in the following order of increasing severity:
Debug -> INFO -> Notice -> Monitoring -> Monitor Watch -> Monitor WARNING -> FAIL OVER and REPLACE SPU NOW!
Hence, drives which have moved from ‘new’ values toward end-of-life are flagged as being ‘monitored’. Devices approaching end-of-life will be recommended for replacement as soon as possible. You should replace an SPU when the status changes from “NOTICE” to “Monitor Watch” or higher.
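The replacement rule above can be expressed as a comparison against the severity ordering. This is a minimal Python sketch rather than SmartCollect's own logic: `SEVERITY_ORDER`, `severity_rank`, and `should_replace` are hypothetical names, the sketch treats “Monitoring” as a level in its own right because it appears in the chain, and it assumes a reported status begins with the level name, as the NOTICE row in the sample output does. The exact status strings printed at the higher levels are not shown in this document, so the matching is case-insensitive and prefix-based.

```python
# Severity levels in increasing order, as listed in this document.
SEVERITY_ORDER = [
    "Debug",
    "INFO",
    "Notice",
    "Monitoring",
    "Monitor Watch",
    "Monitor WARNING",
    "FAIL OVER and REPLACE SPU NOW!",
]

# Case-insensitive lookup from level name to its position in the ordering.
_RANK = {name.lower(): index for index, name in enumerate(SEVERITY_ORDER)}


def severity_rank(status: str) -> int:
    """Return the position of the level that the status text starts with."""
    text = status.strip().lower()
    matches = [name for name in _RANK if text.startswith(name)]
    if not matches:
        raise ValueError(f"unrecognized status: {status!r}")
    # Prefer the longest matching level name in case one name prefixes another.
    return _RANK[max(matches, key=len)]


def should_replace(status: str) -> bool:
    """True when the status is "Monitor Watch" or higher."""
    return severity_rank(status) >= _RANK["monitor watch"]


print(should_replace("NOTICE: Retired Sector Count Quantity"))
print(should_replace("Monitor Watch"))
```

With the sample row's status the function returns False, since NOTICE sits below Monitor Watch in the ordering; any status at Monitor Watch, Monitor WARNING, or the fail-over level returns True.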
Historical Number
NZ101587
Document Information
Modified date:
17 October 2019
UID
swg21572930