IBM Support

Description of SmartCollect

Question & Answer


Question

What is SmartCollect?

Answer

1) What is SmartCollect?

SmartCollect is a script supplied by Netezza Support that checks SPUs for the following conditions:

  • Topology issues
  • Drives exceeding S.M.A.R.T thresholds
  • Drives with potential disk timing issues
  • Recent hardware-specific nzevents

2) What does S.M.A.R.T stand for?

S.M.A.R.T stands for Self-Monitoring, Analysis and Reporting Technology. See also “What is S.M.A.R.T. as used in SmartCollect?”. The following Wikipedia page describes some of the attributes that disk hardware manufacturers monitor:

http://en.wikipedia.org/wiki/S.M.A.R.T

3) What does SmartCollect output look like?


** Detail of all devices under SMART monitoring (verbose notify only)

** Loc | HwId | SPU Serial    | Drive           | Cur(Wrst)/Thres | Delta | Raw | STATUS
------+------+---------------+-----------------+-----------------+-------+-----+---------------------------------------
6-13  | 1069 | 802S53704041  | WD-WCANU2351463 | 197 / 140       | 57    | 22  | NOTICE: Retired Sector Count Quantity
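
For anyone post-processing the verbose report, the sample row above can be split into named fields. The following is a minimal sketch, not part of SmartCollect: the `SmartRow` class and `parse_detail_row` function are hypothetical names, and the code assumes every detail row is pipe-delimited with exactly the eight columns shown in the sample, with the "Cur(Wrst)/Thres" cell in the `current / threshold` form.

```python
from dataclasses import dataclass


@dataclass
class SmartRow:
    """One detail row of the verbose report (hypothetical container)."""
    loc: str
    hw_id: int
    spu_serial: str
    drive: str
    current: int
    threshold: int
    delta: int
    raw: int
    status: str


def parse_detail_row(line: str) -> SmartRow:
    # Split on the first seven pipes only, so the free-text STATUS stays whole.
    cells = [cell.strip() for cell in line.split("|", 7)]
    if len(cells) != 8:
        raise ValueError(f"expected 8 columns, got {len(cells)}")
    loc, hw_id, serial, drive, cur_thres, delta, raw, status = cells
    # Assumed cell layout: "<current> / <threshold>", as in the sample row.
    current, threshold = (int(part) for part in cur_thres.split("/"))
    return SmartRow(loc, int(hw_id), serial, drive,
                    current, threshold, int(delta), int(raw), status)


row = parse_detail_row(
    "6-13 | 1069 | 802S53704041 | WD-WCANU2351463 | 197 / 140 | 57 | 22 | "
    "NOTICE: Retired Sector Count Quantity"
)
print(row.drive, row.current - row.threshold, row.status)
```

In the sample row the Delta column equals the current value minus the threshold (197 - 140 = 57), which the parsed fields make easy to check.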


4) Where can I get the latest SmartCollect script?

  • Download the latest Support tools (nz-support.tgz) from ntzftp.netezza.com b
  • Untar the file (tar -xvzf nz-support.tgz)
  • Execute the “unpack” script

5) How do I invoke SmartCollect?

The script is typically invoked daily from the crontab of user “nz”, scheduled before or after ETL processing. The following entry runs it every day at 01:00:

0 1 * * *           /nz/support/bin/SmartCollect

6) How do I specify the e-mail distribution of the output?

You can specify it at the command line using the following optional arguments:

           $ SmartCollect  [ --notify -cc ] [ --verboseNotify ]

  • --notify generates summary reports intended for customers.
  • --verboseNotify generates a more detailed report, usually meant for TAMs or Support; customers can subscribe to it as well.

7) Which nzevents are examined by SmartCollect?

SmartCollect does not look at all events; it looks at the most recent events involving the following types of SPU errors:

  • ECC Errors: Correctable memory errors at the SPU level, usually caused by memory degrading over time. A correctable error means that a bad bit was encountered in the memory chip and disabled. This is not fatal in itself, though multiple correctable errors may indicate that the memory is about to fail. If a device reports correctable ECC errors in multiple recurring yet distinct 15-minute periods, it should be removed.
  • FPGA Errors: These errors generally indicate that the FPGA could not interpret a block on the disk; they are usually accompanied by a block number and SPU ID. The root cause could be a bad disk, corrupt table data, an FPGA data engine bug, or a program error.
  • UEC Errors: A variant of the FPGA error; UEC stands for Unexpected Error Condition. UEC errors are caused by exception conditions and are typically reported some time after they occur.
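
The ECC guideline in the list above (correctable errors in multiple distinct 15-minute periods) can be illustrated with a short sketch. This is an assumption-laden example, not how SmartCollect itself evaluates events: the function names are hypothetical, the periods are taken to be fixed clock-aligned 15-minute windows, and "multiple" is read as two or more.

```python
from datetime import datetime


def distinct_15min_periods(event_times):
    """Count the clock-aligned 15-minute windows containing at least one event."""
    return len({(t.date(), (t.hour * 60 + t.minute) // 15) for t in event_times})


def ecc_removal_candidate(event_times, min_periods=2):
    # min_periods=2 is an assumed reading of "multiple"; adjust to site policy.
    return distinct_15min_periods(event_times) >= min_periods


# Hypothetical correctable-ECC event times for one device.
events = [
    datetime(2019, 10, 17, 1, 2),
    datetime(2019, 10, 17, 1, 10),   # same window as 01:02
    datetime(2019, 10, 17, 3, 40),   # a second, distinct window
]
print(distinct_15min_periods(events), ecc_removal_candidate(events))
```

Two events inside the same window count once, so a single burst of errors does not by itself flag the device under this reading.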

8) How do I use the reported SPU status readings?

Using observed ‘normal’ values on ‘new’ drives, the manufacturer’s end-of-life recommendations, and established field test results, SmartCollect assesses the current health of each drive on the following scale of increasing severity:

Debug -> INFO -> Notice -> Monitoring -> Monitor Watch -> Monitor WARNING -> FAIL OVER and REPLACE SPU NOW!

Hence, drives that have moved from ‘new’ values toward end-of-life are flagged as being ‘monitored’. Devices approaching end-of-life are recommended for replacement as soon as possible. You should replace a SPU when its status changes from “NOTICE” to “Monitor Watch” or higher.
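
The replacement rule above amounts to an ordered comparison against the severity scale. The sketch below shows one way to encode it; the names `SEVERITY`, `severity_rank`, and `should_replace` are hypothetical, and the code assumes that a reported status begins with one of the scale labels (matched case-insensitively), as “NOTICE: ...” does in the sample output.

```python
# Severity scale from the article, lowest to highest.
SEVERITY = [
    "Debug",
    "INFO",
    "Notice",
    "Monitoring",
    "Monitor Watch",
    "Monitor WARNING",
    "FAIL OVER and REPLACE SPU NOW!",
]

# Replacement is advised from this level upward.
REPLACE_FROM = SEVERITY.index("Monitor Watch")


def severity_rank(status: str) -> int:
    """Return the position of a status on the scale (assumes prefix match)."""
    text = status.strip().lower()
    for rank, label in enumerate(SEVERITY):
        if text.startswith(label.lower()):
            return rank
    raise ValueError(f"unrecognized status: {status!r}")


def should_replace(status: str) -> bool:
    return severity_rank(status) >= REPLACE_FROM


for status in ("NOTICE: Retired Sector Count Quantity", "Monitor WARNING"):
    print(status, "->", "replace" if should_replace(status) else "keep watching")
```

Keeping the scale as a single ordered list means the threshold can be moved by changing one label rather than a chain of comparisons.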

Historical Number

NZ101587

Document Information

Modified date:
17 October 2019

UID

swg21572930