IBM Support

Host Disk Alert in IBM PureData System for Analytics

Question & Answer


Question

I got this alert output on my IBM PureData System for Analytics appliance. What should I do?

[nz@TEST-H1 ~]$ nzhw -issues
Description HW ID Location              Role   State
----------- ----- --------------------- ------ -----------
HostDisk    1025  rack1.host1.hostDisk7 Failed Warning
[nz@TEST-H1 ~]$

Cause

Either of two situations can cause this alert:

1 - The disk is good but is not set as a hot spare, as it should be on this machine.
2 - The disk is actually defective and needs to be replaced.

Answer

Work through step 1 first. Depending on the results, continue with step 2 to fix the situation.

1 - How do I know that the disk is good and that I can set it as a hot spare?

As root, run the following command to find out which type of host your appliance has:

[root@TEST-H1 hts]# dmidecode -t1
# dmidecode 2.11
SMBIOS 2.7 present.

Handle 0x0024, DMI type 1, 27 bytes
System Information
Manufacturer: IBM
Product Name: System x3650 M4 : -[7915AC1]-
Version: 0B
Serial Number: XXXXXXX
UUID: 27E6089C-D127-3323-ADB8-CF199BB0DCF5
Wake-up Type: Power Switch
SKU Number: Not Specified
Family: System X

** Most hosts use MegaCli to interact with the disk controller. The only exception is machine type (MT) 7979 hosts, which need a different program to interact with the controller. **

Here is a comparison of the host machine types and the models they correspond to. All of them use MegaCli except where noted:

Machine type    Host model
7145, 7143      x3850-X5
7233            x3850-M2
7945, 7947      x3650-M3
7947            x3650-M2
7979            x3650-M1 (this one uses arcconf to interact with host disks)
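
As a minimal sketch of the note and table above, the machine type can be pulled out of the dmidecode "Product Name" line and mapped to the controller tool. The function names machine_type and controller_tool are hypothetical helpers, not part of the appliance software, and the mapping simply assumes, as the note says, that 7979 is the only host that does not use MegaCli.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not shipped with the appliance.

# Extract the 4-digit machine type from a dmidecode "Product Name" line,
# for example "Product Name: System x3650 M4 : -[7915AC1]-" gives 7915.
machine_type() {
    printf '%s\n' "$1" | sed -n 's/.*-\[\([0-9]\{4\}\).*/\1/p'
}

# Map a machine type to the disk-controller tool.
# Assumption taken from the note above: 7979 is the only arcconf host.
controller_tool() {
    case "$1" in
        7979) echo "arcconf" ;;
        *)    echo "MegaCli" ;;
    esac
}

# Example use on the appliance, as root:
#   mt=$(machine_type "$(dmidecode -t1 | grep 'Product Name')")
#   echo "Machine type $mt uses $(controller_tool "$mt")"
```

If machine_type prints nothing, the Product Name line did not contain a bracketed machine type and the host should be identified by hand.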

** Now that you know which type of host you have, you can check which of the two cases applies.

Run the following on your appliance:

[root@TEST-H1 hts]# /opt/nz-hwsupport/hts/mega_check.pl -r

You should see a menu like this example :

MegaCli Checks - Version 2.3
Fri Nov 14 08:48:01 EST 2014
Fail/Rebuild process

Disks and State
disk in slot Slot Number: 0 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 1 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 2 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 3 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 4 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 5 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 6 Firmware state: Unconfigured(good), Spun Up
Drive is predicted to fail? NO

0: Exit
1: Manually Fail drive / prep for removal
2: Turn drive Unconfigured(good) to Hotspare
3: Turn drive from bad to good
4: Spin up drive (undo a prep for removal)
5: Manually start a Copyback process
6: Turn on/off LED of drive to locate
7: Monitor drive that is in the process of rebuild or copyback



** Verify that slot 6 (position 7) shows "Firmware state: Unconfigured(good), Spun Up" and "Drive is predicted to fail? NO".
** The above is an example, you may have other disk/slots with different status
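
The check above can also be done mechanically on a saved copy of the disk listing. The following is a rough sketch, assuming the listing has the same two-line layout per disk as the example; hotspare_candidates is a hypothetical helper name and mega_check_listing.txt is a hypothetical file holding the text copied from the screen.

```shell
#!/bin/sh
# Sketch only: prints the slot numbers that are Unconfigured(good)
# and not predicted to fail. Reads the saved listing on standard input.
hotspare_candidates() {
    awk '
        index($0, "Slot Number:") > 0 {
            # Remember the slot number and whether this disk is Unconfigured(good).
            slot = ""
            for (i = 1; i < NF; i++)
                if ($i == "Number:") slot = $(i + 1)
            good = (index($0, "Firmware state: Unconfigured(good)") > 0)
        }
        index($0, "predicted to fail? NO") > 0 {
            # Report the disk only if both conditions hold.
            if (good && slot != "") print slot
            good = 0
        }
    '
}

# Example use (file name is an assumption):
#   hotspare_candidates < mega_check_listing.txt
```

An empty result means no disk in the listing meets both conditions, in which case the detailed check that follows is still the place to confirm the state of the disk.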

To make sure that the drive is really good, run the following :

[root@TEST-H1 hts]# /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv '[252:6]' -aALL

** 252:6 is the enclosure device ID and slot number of the disk in this example; your values may differ.
** The output will look something like this:

Enclosure Device ID: 252
Slot Number: 6
Enclosure position: 0
Device Id: 22
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 279.396 GB [0x22ecb25c Sectors]
Non Coerced Size: 278.896 GB [0x22dcb25c Sectors]
Coerced Size: 278.464 GB [0x22cee000 Sectors]
Firmware state: Unconfigured(good), Spun Up
SAS Address(0): 0x5000cca016762bb9
SAS Address(1): 0x0
Connected Port Number: 6(path0)
Inquiry Data: IBM-ESXSHUC109030CSS60 J2E8KLJ2ZP5FJ2E8J2E8J2E8
IBM FRU/CRU: 90Y8878
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :27C (80.60 F)




Exit Code: 0x00
[root@TEST-H1 hts]#


The important information here is the following :

Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Unconfigured(good), Spun Up

** If any of these counters is not 0, or the firmware state is not Unconfigured(good), go to step 2 **
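
The four fields listed above can be checked in one pass. This is a minimal sketch, assuming the -PDInfo output uses the "Name: value" layout shown in the example; pd_is_healthy is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: reads MegaCli -PDInfo output on standard input and succeeds
# (exit status 0) when the three error counters are 0 and the firmware
# state starts with Unconfigured(good). Any missing field counts as a failure.
pd_is_healthy() {
    awk -F': *' '
        $1 == "Media Error Count"        { media = $2; has_media = 1 }
        $1 == "Other Error Count"        { other = $2; has_other = 1 }
        $1 == "Predictive Failure Count" { pred  = $2; has_pred  = 1 }
        $1 == "Firmware state"           { state = $2; has_state = 1 }
        END {
            ok = (has_media && has_other && has_pred && has_state &&
                  media == 0 && other == 0 && pred == 0 &&
                  index(state, "Unconfigured(good)") == 1)
            exit (ok ? 0 : 1)
        }
    '
}

# Example use on the appliance, as root (252:6 is the disk from the example):
#   /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv '[252:6]' -aALL | pd_is_healthy \
#       && echo "disk looks good" || echo "go to step 2"
```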

Once you have confirmed that the disk is in good condition, run the mega_check.pl script again:

/opt/nz-hwsupport/hts/mega_check.pl -r

Choose option 2 ( Turn drive Unconfigured(good) to Hotspare ).

Follow the options as shown :

Would You like to try and clear and foreign states?[y/n]y
Plese Input slot number of Unconfigured(good) drive you want to turn into a hotspare:6 ( slot 6, position 7, is usually the hot spare )

The output in this case is:


Succesfully changed state to global spare
Fail/Rebuild process

Disks and State
disk in slot Slot Number: 0 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 1 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 2 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 3 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 4 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 5 Firmware state: Online, Spun Up
Drive is predicted to fail? NO

disk in slot Slot Number: 6 Firmware state: Hotspare, Spun Up
Drive is predicted to fail? NO

0: Exit
1: Manually Fail drive / prep for removal
2: Turn drive Unconfigured(good) to Hotspare
3: Turn drive from bad to good
4: Spin up drive (undo a prep for removal)
5: Manually start a Copyback process
6: Turn on/off LED of drive to locate
7: Monitor drive that is in the process of rebuild or copyback


Choose option 0 ( Exit )

Wait 1 or 2 minutes, log in as the nz user, and check with the command nzhw -issues. You should no longer see an alert for this host disk:

[nz@TEST-H1 ~]$ nzhw -issues
No entries found
[nz@TEST-H1 ~]$


2 - I have gone through step 1 and found that the disk has errors and needs to be replaced. What do I need to do?

First, run DSA as root (the binary version differs between N200X and N100X appliances):

cd /opt/nz-hwsupport/install_files/IBM
cat /etc/redhat-release
** Depending on the result, run the DSA binary built for Red Hat version 5 or 6.

chmod +x ibm_utl_dsa_dsytb31-9.30_portable_rhelX_x86-64.bin ( replace X with your Red Hat major version )
[root@TEST-H1 IBM]# ./ibm_utl_dsa_dsytb31-9.30_portable_rhel6_x86-64.bin -diags -text
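
As a small sketch of the selection described above, the Red Hat major version can be read from the release string and substituted for X in the file name. The helper names rhel_major and dsa_binary are hypothetical, and the file name pattern is copied from this example only; the binary on your appliance may carry a different version string, so check the directory listing before relying on it.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative.

# Extract the major version from a redhat-release string, for example
# "Red Hat Enterprise Linux Server release 6.4 (Santiago)" gives 6.
rhel_major() {
    printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\).*/\1/p'
}

# Build the DSA file name for a given major version.
# Assumption: the name follows the pattern shown in this document.
dsa_binary() {
    echo "ibm_utl_dsa_dsytb31-9.30_portable_rhel${1}_x86-64.bin"
}

# Example use on the appliance, as root:
#   cd /opt/nz-hwsupport/install_files/IBM
#   bin=$(dsa_binary "$(rhel_major "$(cat /etc/redhat-release)")")
#   ls -l "$bin"    # confirm the file exists before running chmod +x on it
```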

** DSA prints a large amount of output while it runs. The most important piece of information is the location where it writes its logs when it finishes:

.
.
.
Content above suppressed

Running DSA analyzer plug-ins pass 2.
liblpanal: Light Path Analysis
Running Diagnostics.


Optical: Verify Media Installed: /dev/sr0

Percent Complete: 0%
Aborted

Optical: Read Error Test: /dev/sr0

Percent Complete: 0%
Aborted

Optical: Self Test: /dev/sr0

Percent Complete: 100%
Pass

Adding DSA log entries to XML file.
Writing XML data to file /var/log/IBM_Support/7915AC1_KQ3CN82_20141114-092953.xml.gz
Writing Text report file /var/log/IBM_Support/7915AC1_KQ3CN82_20141114-092953.txt

DSA capture completed successfully.

Please press ANY key to continue ...

[root@TEST-H1 IBM]#


** As the example above shows, you need to collect the file /var/log/IBM_Support/*.xml.gz and open a PMR with IBM.
** To speed up the PMR process, also send the output of the nzstats command and confirm the address where the appliance is located.
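
If DSA has been run more than once, several archives may be present. The following sketch picks the most recently written one, assuming the default log directory shown in the example output; latest_dsa_archive is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: prints the newest *.xml.gz file in the DSA log directory.
# The directory defaults to the one shown in the example output above.
latest_dsa_archive() {
    dir=${1:-/var/log/IBM_Support}
    # ls -t sorts by modification time, newest first.
    ls -t "$dir"/*.xml.gz 2>/dev/null | head -n 1
}

# Example use on the appliance, as root:
#   latest_dsa_archive
```

The function prints nothing when no archive is found, which usually means the DSA run did not complete.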





Document Information

Modified date:
17 October 2019

UID

swg21690261