How to properly prepare your system and replace a SAS or SSD disk unit with Linux on IBM Power Systems

Summary

This document provides information for preparing IBM Power Systems running Linux for concurrent or nonconcurrent SAS or SSD disk replacements. It is important to first determine the disk array configuration, including the protection level, and then follow the appropriate procedure.

Objective

Determine the failing disk unit, what level of protection it has, and successfully replace the disk unit.

Environment

Linux on Power. This document does not apply to Power LC systems.

Steps

Step 1: Log analysis to determine disk errors

Example 01 - The console log and messages file report a problem with a disk unit. The following message was displayed on the console when the disk failed or went missing.

ipr 0000:00:01.0: 9030: Array no longer protected due to missing or failed disk unit
ipr: ----------------------------------------------------------
ipr: RAID 10 Array Configuration: 0:255:0:0
ipr: ----------------------------------------------------------
ipr: Exposed Array Member 0:
ipr: Vendor/Product ID: IBM     MBE2073RC
ipr:     Serial Number: D3A04P4V
ipr:     WWN: 500000E1173E8AA0
ipr: Current Location: 0:0:5:0
ipr: Expected Location:
ipr: ----------------------------------------------------------
ipr: Array Member 1:
ipr: Vendor/Product ID: IBM     MBE2073RC
ipr:     Serial Number: D3A04LW1
ipr:     WWN: 500000E116E13430
ipr: Current Location: 0:0:4:0
ipr: Expected Location: 0:0:4:0
ipr: ----------------------------------------------------------
ipr: 0:0:5:0: 3002: Addressed device failed to respond to selection
ipr: 00000000: 04050000 00000500 00955180 11050E3C
ipr: 00000010: 70000200 00000028 00000000 04010000

image-20240503130433-1
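The two details that matter most in the console message above are the RAID level and the address of the device that stopped responding. The following is a minimal sketch for pulling them out of saved ipr messages. The function names `ipr_raid_level` and `ipr_unresponsive_device` are hypothetical helpers written for this example, not part of any IBM tool, and the patterns assume the message wording shown above.

```shell
# Hypothetical helpers: both read ipr driver messages on standard input.

# Print the RAID level from a "RAID <n> Array Configuration" line.
ipr_raid_level() {
    sed -n 's/^.*ipr: RAID \([0-9][0-9]*\) Array Configuration.*$/\1/p'
}

# Print the SCSI address from a "3002: Addressed device failed to respond" line.
ipr_unresponsive_device() {
    sed -n 's/^.*ipr: \([0-9:]*\): 3002: .*$/\1/p'
}
```

For the message above, `ipr_raid_level` prints 10 and `ipr_unresponsive_device` prints 0:0:5:0.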

Example 1.1 Check the /var/log/messages file for any array-related errors by running "cat /var/log/messages | grep -i 'Array'".

[root@localhost ~]# cat /var/log/messages | grep -i 'Array'
May  3 11:13:30 localhost kernel: ipr 0000:00:01.0: 9030: Array no longer protected due to missing or failed disk unit
May  3 11:13:30 localhost kernel: ipr: RAID 10 Array Configuration: 0:255:0:0
May  3 11:13:30 localhost kernel: ipr: Exposed Array Member 0:
May  3 11:13:30 localhost kernel: ipr: Array Member 1:
[root@localhost ~]#
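The same search can be narrowed to lines logged by the ipr driver, so that unrelated messages containing the word "array" are left out. This is a minimal sketch. The function name `ipr_array_events` is hypothetical, and the default path assumes your distribution writes kernel messages to /var/log/messages as in this example.

```shell
# Hypothetical helper: print ipr kernel messages that mention an array.
# Usage: ipr_array_events [logfile]
ipr_array_events() {
    logfile="${1:-/var/log/messages}"
    # First keep only kernel lines from the ipr driver, then only those mentioning an array.
    grep 'kernel: ipr' "$logfile" | grep -i 'array'
}
```

Run without an argument it reads /var/log/messages. Pass a different file name to search a rotated or copied log.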

Example 1.2 Use the less command to find the details of this error, using the timestamp from the previous command:

[root@localhost ~]# less /var/log/messages

Example 1.3 Press "/" and then enter the timestamp "11:13:30" to display this section of the messages file.

image-20240503134134-4
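If you prefer not to page through the file interactively, the lines for one timestamp can be printed directly. This is a minimal sketch. The function name `ipr_lines_at` is hypothetical, and it assumes the syslog line format shown in Example 1.1.

```shell
# Hypothetical helper: print the ipr driver lines logged at a given HH:MM:SS.
# Usage: ipr_lines_at 11:13:30 [logfile]
ipr_lines_at() {
    stamp="$1"
    logfile="${2:-/var/log/messages}"
    # -F treats the timestamp as a literal string rather than a pattern.
    grep -F " $stamp " "$logfile" | grep 'kernel: ipr'
}
```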

Example 1.4 From the output below you can see the error information. The array is no longer protected due to a missing or failed disk unit. The disk unit details are provided, and the output states that this is a RAID 10 Array Configuration.

image-20240503134751-5

Example 1.5 To proceed with the concurrent disk replacement go to "Procedure 4: Replacing a RAID10 disk unit concurrently with no hot spare using iprconfig"

Step 2: Check whether the service tools that provide iprconfig are installed

Type "iprconfig" at the command line. This should bring up the "IBM Power RAID Configuration Utility". If you do not get this menu, you will need to install the "IBM Power Systems service and productivity tools".

https://www.ibm.com/support/pages/service-and-productivity-tools
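Before starting the interactive utility, a script can check whether the command is on the PATH at all. This is a minimal sketch; `have_tool` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the named command is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool iprconfig; then
    echo "iprconfig found: $(command -v iprconfig)"
else
    echo "iprconfig not found; install the IBM Power Systems service and productivity tools"
fi
```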

Step 3: Determine your level of protection for concurrent or nonconcurrent procedures

3.1 With no protection, your system will be down when a disk unit fails and the replacement is nonconcurrent. Refer to Procedures 1 through 3.

3.2 To display the status of your array and hot spare, enter "iprconfig" at the command line and select 2 for "Work with disk arrays".

image-20240807130025-2

3.3 Select 1 to display the hot spare and disk array status.

image-20240807130322-3

3.4 The following screen shot shows a healthy array with an active hot spare. The array is listed twice because there are redundant SAS adapters, sda and sdb.

image-20240806153533-2

3.5 Example of a failed RAID10 disk with no defined "Hot Spare". Proceed to Procedure 4 for disk replacement procedures.
image-20240807160422-1
3.6 Example of a failed RAID5 or RAID6 disk with no defined "Hot Spare". Notice that the primary controller sda shows the failed disk unit, although each controller shows the array as degraded. Make note of the "PCI/SCSI Location" for the failed disk, which may be needed later during the disk replacement. Proceed to Procedure 5 for disk replacement procedures.
image-20240826140058-1
3.7 Example of a failed RAID10 disk with an active "Hot Spare". The failed disk may not show in the "Display Disk Array Status" menu, since no hot spare will be available. Press "q" to go back to the main menu and select 1 for "Display Hardware Status". Proceed to Procedure 6 for disk replacement procedures.
image-20240806162406-7
3.8 Example of a failed RAID5 or RAID6 disk with an active "Hot Spare". Proceed to Procedure 7 for disk replacement procedures. The first screen shot is taken from the main menu by selecting option 1 for "Display hardware status". The second screen shot is taken from the main menu by selecting option 2 for "Work with disk arrays" and then option 1 for "Display disk array status".
image-20240910152017-2
image-20240910151837-1
3.9 Example of a failed "Hot Spare" disk unit. Proceed to Procedure 8 for disk replacement procedures.

image-20240807131238-5

image-20240807130607-4
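The cases in Step 3 reduce to a small lookup from what failed, and whether a hot spare was active, to a procedure number. This sketch only restates the mapping given in this document. The function name `pick_procedure` and its argument keywords are hypothetical. Procedure 3 is not returned because it is the standalone-diagnostics alternative for a protected array.

```shell
# Hypothetical helper: map the Step 3 cases to a procedure number.
# Usage: pick_procedure <jbod|raid0|raid5|raid6|raid10|hotspare> <yes|no>
#   first argument:  what failed
#   second argument: whether an active hot spare was defined
pick_procedure() {
    case "$1:$2" in
        jbod:*)               echo 1 ;;   # unprotected JBOD disk, nonconcurrent
        raid0:*)              echo 2 ;;   # unprotected RAID0 array, nonconcurrent
        hotspare:*)           echo 8 ;;   # the hot spare itself failed
        raid10:no)            echo 4 ;;
        raid5:no|raid6:no)    echo 5 ;;
        raid10:yes)           echo 6 ;;
        raid5:yes|raid6:yes)  echo 7 ;;
        *) echo "unknown combination: $1 $2" >&2; return 1 ;;
    esac
}
```

For example, `pick_procedure raid10 no` prints 4, matching Example 1.5.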

Nonconcurrent Procedures for "Unprotected" arrays or JBOD disks

Procedure 1: Replacing a failed JBOD disk with the Operating System loaded on it

1.1 This is an example of a JBOD (Just a Bunch of Disks) hdisk. From the command line, type "iprconfig" and then select 1 for "Display hardware status". The screen shot shows one "Physical Disk" on redundant SAS controllers, sda and sdb.

image-20240828181124-1

1.2 This is an example of a JBOD (Just a Bunch of Disks) hdisk when booting from standalone diagnostics. For info on booting standalone diagnostics see Appendix A1 then return here.

image-20240828162806-1

1.3 This is an example of a failed hdisk when booting from standalone diagnostics. The disk may show up normally or not at all, even though it is defective and you are unable to boot your OS. For info on booting standalone diagnostics see Appendix A1 then return here. The only way to confirm that the disk is defective and needs to be replaced is to run Advanced Diagnostics and Certify on the disk unit; refer to Appendix A3 for this procedure.

image-20240828165151-2

1.4 If you have a failed JBOD disk unit with the operating system on it, your only option is to boot from standalone diagnostics. You can display what your current configuration is and then replicate it once you install the new disk unit.

1.5 Power off the LPAR or standalone system and physically replace the disk unit. Use Appendix A1 to boot standalone diagnostics and access the SAS RAID Manager menu then return here.

1.6 Select "List SAS Disk Array Configuration".

image-20240923151702-1

1.7 New disk units come from IBM formatted as a pdisk array candidate. This screen shot shows my new pdisk which is available to format to a JBOD hdisk. If for some reason your disk comes as an hdisk then you can skip to step 1.13.

image-20240828150925-5

1.8 Press F3 to get to the "IBM SAS Disk Array Manager" and then select "Change/Show SAS pdisk status".

image-20240828172310-3

1.9 Move the cursor to "Delete an Array Candidate pdisk and Format to JBOD block size" then press enter.

image-20240828172437-4

1.10 Move your cursor to your new pdisk and press enter.

image-20240828172604-5

1.11 You will receive a warning, press enter to continue.

image-20240828172716-6

1.12 The "Format in Progress" screen is displayed and will progress until the format is complete. Once complete press enter to continue.

image-20240828172836-7

image-20240828173908-1

1.13 If you want to see your new JBOD hdisk, back out one menu by pressing F3 and then select "List SAS Disk Array Configuration" and then select your primary SAS controller.

image-20240828174027-2

1.14 You can now shut down your LPAR and reinstall your operating system, either as a scratch install or from a backup. This procedure is now complete.

Procedure 2: Replacing a failed disk unit in a single-disk or multiple-disk RAID0 array with the Operating System loaded on it

2.1 This is an example of an "Optimized" RAID0 array with a single disk running the Linux operating system. From the command line, type "iprconfig", then select 2 for "Work with disk arrays", then 1 for "Display disk array status". The screen shot shows one RAID0 array on redundant SAS controllers, sda and sdb.

image-20240828135234-3

2.2 This is an example of an "Optimal" RAID0 array with a single disk when booting from standalone diagnostics. For info on booting standalone diagnostics see Appendix A1 then return here.

image-20240828124900-1

2.3 This is an example of a RAID0 array with a failed disk when booting from standalone diagnostics. For info on booting standalone diagnostics see Appendix A1 then return here. If you want to test this disk unit go to Appendix A4 for the procedures then return here.

image-20240828125144-2

2.4 If you have a failed disk unit with the operating system on it and no RAID protection, your only option is to boot from standalone diagnostics. You can display what your current configuration is and then replicate it once you install the new disk unit.

2.5 Power off the LPAR or standalone system and physically replace the disk unit. Use Appendix A1 to boot standalone diagnostics and access the SAS RAID Manager menu then return here.

2.6 Select "List SAS Disk Array Configuration".

image-20240923151750-2

2.7 This screen shot shows my new pdisk which is available to create a new RAID0 array. If your display shows the previous RAID0 array that is in a broken state you need to back up a menu and select the option to delete the array.

image-20240828150925-5

2.8 Press F3 to get to the "IBM SAS Disk Array Manager" and then select "Create a SAS Disk Array".

image-20240828151301-6

2.9 Select your primary SAS controller then select your RAID level of "0" and press enter.

image-20240828151429-7

2.10 Select your stripe size. The "recommended" value is usually best. Press enter to continue.

image-20240828151602-9

2.11 Move the cursor down to your new pdisk and press enter.

image-20240828151805-11

2.12 A summary of your selection is displayed, press enter to continue and your array will be created.

image-20240828151911-12

2.13 If you want to see your new array, back out one menu by pressing F3, then select "List SAS Disk Array Configuration" and your primary SAS controller.

image-20240828152144-13

2.14 You can now shut down your LPAR and reinstall your operating system, either as a scratch install or from a backup. This procedure is now complete.

Procedure 3: Replacing a disk unit in a "Protected" array using standalone diagnostics

3.1 Boot your system to Standalone Diagnostics. If you need assistance with booting to Standalone Diagnostics, go to Appendix A1 and follow that procedure, then return here. Select the "Task Selection" menu by selecting 3 and pressing enter.

image-20240917110056-1

3.2 You will be prompted to set your terminal type. vt320 emulation is a good choice since the function keys work.

image-20240917110153-2

3.3 Cursor to the bottom and select "RAID Array Manager".

image-20240430102609-6

3.4 Select "IBM SAS Disk Array Manager".

image-20240430102719-7

3.5 Cursor to "List SAS Disk Array Configuration" and press enter, then select the "Primary" SAS adapter if you have redundant adapters.

image-20240430103242-8

3.6 This is a typical example of what a healthy RAID5 array looks like when it is set up for a Linux image.

image-20240910161351-3

3.7 This is an example of a degraded RAID5 array with one failed disk unit.

image-20240910164146-4

3.8 Press F3 twice to back out to the "Task Selection List". Cursor up to "Hot Plug Task" and press enter.

image-20240910164443-5

3.9 Cursor to "SCSI and SCSI RAID Hot Plug Manager" and press enter.

image-20240910164610-6

3.10 Cursor to "Replace/Remove a Device Attached to an SCSI Hot Swap Enclosure Device" and press enter.

image-20240910164717-7

3.11 Cursor to the disk you want to replace and press enter. Note: the disk can display as populated, or it can show the pdisk number that failed. In step 3.7 it was pdisk2 that failed, so you could see it listed here as pdisk2 instead of populated. We know that pdisk0 and pdisk1 are the active disks in this "degraded" array.

image-20240910164855-8

3.12 The LED will be in the remove state, solid amber. Physically remove the disk and put the new one in then press enter to continue.

image-20240910165242-9

3.13 The array will not start to rebuild automatically. At this point you can either power down, start the Operating System and start the rebuild from the "iprconfig" utility, or start the rebuild here. To start the rebuild here, press F3 twice to get to the main "Tasks Selection List", cursor down to "RAID Array Manager" and press enter.

image-20240910165807-10

3.14 Select "IBM SAS Disk Array Manager".

image-20240910165923-11

3.15 Cursor down to "Reconstruct a SAS Disk Array" and press enter.

image-20240910170042-12

3.16 The next step depends on which of the following three situations applies, since standalone diagnostics may not configure the new disk.

  • If you receive an error like the screen shot below, you can elect to power down, boot the operating system and perform the reconstruct from “iprconfig”. If you choose this option, skip to step 3.22.
  • If you receive an error message like the screen shot below, you can shut down and boot standalone diagnostics again, repeat steps 3.1 to 3.4, then return here to step 3.17.
  • If you do not receive an error message and the screen is displayed with the newly replaced pdisk, skip to step 3.18.

image-20240910170431-13

3.17 After the reboot I am now back at the "IBM SAS Disk Array Manager" main menu. Select "Reconstruct a SAS Disk Array".

image-20240910171517-14

3.18 Select the new pdisk and press enter.

image-20240910171901-15

3.19 A summary screen will be displayed, press enter to continue the reconstruct. The reconstruct will start and bring you back to the main menu.

image-20240910172108-16

3.20 At this point the procedure is complete. You can wait for the rebuild to complete before powering down, or you can power down and start the Operating System. To be extra safe, it is preferable to wait for the rebuild to complete, although it is not necessary. The rebuild will continue when the Operating System is started. If you choose to wait for the rebuild to complete, you can display the progress by selecting "List SAS Disk Array Configuration" on the main menu.

image-20240910172409-17

3.21 For example, I rebooted to the Operating System to show how the rebuild continues. 

image-20240910173046-18

3.22 To start the reconstruct from the Operating System after a reboot, type "iprconfig" at the command line. Select option 3 for "Work with disk unit recovery".

image-20240910174130-19

3.23 Select 5 to "Rebuild disk unit data".
image-20240826145227-15
3.24 Select 1 beside the failed disk location and press enter to start the rebuild process. Note: new disks come formatted as array candidates. If your disk is not an array candidate, or it is RWProtected and requires initializing, go to Appendix A2 then return here.
image-20240826145414-16
 
3.25 Press enter to confirm your selection.
image-20240826145510-17
3.26 The screen returns to the "Work with Disk Unit Recovery" menu.
image-20240826145624-18
3.27 To display the rebuild status press q to get to the main menu "IBM Power RAID Configuration Utility" or from command line type in "iprconfig". Select 2 for "Work with disk arrays" and then 1 for "Display disk array status". You can see the percentage of the rebuild for the primary controller sda. The secondary controller sdb only shows the "Rebuilding" status.
image-20240826150313-19
3.28 You can press "r" to refresh the screen to see the percentage of rebuild increase.
image-20240826150632-20
3.29 Once the rebuild completes, the array will be in an "Optimized" state for the primary controller. If "HA asymmetric access" is enabled, both controllers show as "Optimized". In this mode, all read/write operations go through the primary adapter sda. The procedure is now complete.

image-20240826155550-2
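Instead of pressing "r" repeatedly, the rebuild can be polled from a script. The following is a minimal sketch under stated assumptions. `rebuild_progress` and `wait_for_rebuild` are hypothetical helper names. The status command to poll is supplied by the caller, and the sketch assumes that its output mentions "Rebuilding" or a rebuild percentage while the rebuild is running, as the screens in this document do. The exact text printed by your iprconfig release may differ, so check it before relying on this.

```shell
# Hypothetical helper: read array status text on standard input and print each
# "NN%" value found on lines that mention a rebuild.
rebuild_progress() {
    grep -i 'rebuil' | grep -oE '[0-9]+%' || true
}

# Hypothetical helper: run the given status command every INTERVAL seconds
# (default 60) until its output no longer mentions a rebuild.
# Usage: wait_for_rebuild <status command> [args...]
wait_for_rebuild() {
    interval="${INTERVAL:-60}"
    while :; do
        status_text=$("$@") || return 1
        # Stop as soon as no line mentions a rebuild.
        printf '%s\n' "$status_text" | grep -qi 'rebuil' || break
        printf '%s\n' "$status_text" | rebuild_progress
        sleep "$interval"
    done
    echo "no rebuild in progress"
}
```

A possible invocation is `wait_for_rebuild iprconfig -c show-arrays`, assuming the command-line mode of iprconfig is available on your system.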

Concurrent Procedures for "Protected" arrays

Procedure 4: Replacing a RAID10 disk unit concurrently with no hot spare using iprconfig

4.1 This is normally a concurrent procedure, so there is no need to shut down the operating system. From the command line, type "iprconfig" to access the "IBM Power RAID Configuration Utility".

image-20240503120657-13

4.2 Select Option 2 to Work with Disk arrays.

image-20240503120955-14

4.3 Select Option 1 to Display Disk Array Status.

image-20240503121051-15

4.4 This first example is what you would see if there was NO disk issue and your array was optimal.

image-20240503121225-16

4.5 This second example is when it has been determined that you have a failed or missing disk that needs to be replaced.

Disk "0000:00:01.0/0:0:5:0         RAID 10 Array Member       Failed" needs replacing.

Put a 1 next to the "Failed" disk and press enter to display the disk location information.

image-20240503152330-1

image-20240503153446-4
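If the status screen has been saved as text, the PCI/SCSI location of each failed row can be extracted rather than copied by hand. This is a minimal sketch. `failed_locations` is a hypothetical helper name, and it assumes rows formatted like the "Failed" line quoted in step 4.5, with the status as the last column.

```shell
# Hypothetical helper: read status text on standard input and print the
# PCI/SCSI location of every row whose Status column is "Failed".
failed_locations() {
    grep -E 'Failed[[:space:]]*$' |
        grep -oE '[0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]/[0-9]+:[0-9]+:[0-9]+:[0-9]+'
}
```

For the row shown in step 4.5 this prints 0000:00:01.0/0:0:5:0.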

4.6 This will give you the disk serial number for confirmation and the physical disk location that you will need to record for the removal procedure. In this example it is "U78AB.001.WZSGCAZ-P3-D2".

image-20240503153723-5
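When recording the physical location, it can help to separate the enclosure part of the code from the drive slot. This is a minimal sketch. `split_location_code` is a hypothetical helper name, and it assumes that the text before the first dash identifies the unit and the text after the last dash identifies the drive slot, as in the example code above.

```shell
# Hypothetical helper: split a physical location code such as
# U78AB.001.WZSGCAZ-P3-D2 into its unit and slot parts.
split_location_code() {
    code="$1"
    unit="${code%%-*}"   # text before the first dash
    slot="${code##*-}"   # text after the last dash
    echo "unit=$unit slot=$slot"
}
```

Running `split_location_code U78AB.001.WZSGCAZ-P3-D2` prints "unit=U78AB.001.WZSGCAZ slot=D2".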

4.7 Press 'q' three times to back out to the main menu "IBM Power RAID Configuration Utility" and select 3 for "Work with disk unit recovery".

image-20240503152735-2

4.8 Select option 2 for "Concurrent remove device".

image-20240503153010-3

4.9 Put a 1 next to the disk location you recorded earlier and press enter. Note: This example is from a test environment. To create the disk error I had to pull the disk, which is why the slot is shown as empty.

image-20240503154442-6

4.10 Verify the information is correct and press "enter" to continue.

image-20240503154552-7

4.11 Physically pull the disk unit out of the system then press enter.

image-20240503155102-8

4.12 This will return you to the "Work with Disk Unit Recovery" menu. Select option 1 to "Concurrent add device".

image-20240503155330-9

4.13 Put a 1 by the same disk location recorded earlier that you will be inserting the new disk in.

image-20240503155629-10

4.14 Verify the information is correct then press enter to continue.

image-20240503155741-11

4.15 Insert the new disk unit then press enter to continue.

image-20240503155859-12

4.16 Select 5 to "Rebuild disk unit data".

image-20240503160613-13

4.17 Enter 1 to select the Failed RAID 10 array member to be rebuilt.

image-20240503160804-14

4.18 Confirm your selection and press enter to continue.

image-20240503160927-15

4.19 This will bring you to the "Work with Disk Unit Recovery" menu. Press "q" once to return to the "IBM Power RAID Configuration Utility" main menu, then press "1" for "Display Hardware Status". You can see that the array is starting to rebuild at 4%. To monitor the rebuild progress, press the "r" key to refresh.

image-20240503161441-16

4.20 Note: An alternate way of showing the rebuild status is to select 2 for "Work with disk arrays" from the "IBM Power RAID Configuration Utility" main menu, then select 1 for "Display disk array status".

image-20240503162058-17

4.21 Once the rebuild is complete the status will show "Optimized". This procedure is now complete; your disk has been successfully replaced concurrently.

image-20240503171845-18

Procedure 5: Replacing a RAID5 or RAID6 disk unit concurrently with no hot spare using iprconfig

5.1 The following screenshot shows what a protected RAID5/6 array with NO hot spare and a failed disk unit looks like. This can be accessed by typing "iprconfig" at the command line and selecting option 2 "Work with disk arrays" and then 1 to "Display disk array status".

image-20240826140058-1

5.2 Press "q" twice to get back to the main menu or, if you are at the command line, type "iprconfig" and press enter. Select 3 for "Work with disk unit recovery".

image-20240826141509-2

5.3 Select option 2 for "Concurrent remove device".

image-20240826141721-3

5.4 This shows the "Failed" disk, and the location should match what you recorded earlier. Select "1" beside the disk to be replaced and press enter. Note that you can use the "t" key to display different location formats. The second screen shot shows the physical disk location.

image-20240826142542-4

image-20240826142824-5

5.4 Press 1 and enter to start the disk removal procedure.

image-20240826143218-6

5.5 Verify the disk and press enter to continue.

image-20240826143345-7

5.6 Remove the selected disk unit from the system. There should be an amber LED on for the disk to be removed.

image-20240826143508-8

5.7 Once the disk has been removed, press enter and the menu will go back to the "Work with Disk Unit Recovery" menu.

image-20240826143640-9

5.8 Select "Option 1" by pressing 1 and enter to access the "Concurrent add device" task.

image-20240826144011-10

5.9 Use the "t" key to toggle the location format and ensure that you select the correct location to add the new disk unit. See both examples of this screen below. Select the slot to add the disk to by putting a "1" beside the slot and pressing enter.

image-20240826144319-11

image-20240826144429-12

5.10 Verify the concurrent add and press enter to continue.

image-20240826144539-13

5.11 Insert the new disk and press enter to continue.

image-20240826144733-14

5.12 This will return you to the "Work with Disk Unit Recovery" menu.

image-20240826143640-9

5.13 Select 5 to "Rebuild disk unit data".
image-20240826145227-15
5.14 Select 1 beside the failed disk location and press enter to start the rebuild process. Note: new disks come formatted as array candidates. If your disk is not an array candidate, or it is RWProtected and requires initializing, go to Appendix A2 then return here.
image-20240826145414-16
5.15 Press enter to confirm your selection.
image-20240826145510-17
5.16 The screen returns to the "Work with Disk Unit Recovery" menu.
image-20240826145624-18
5.17 To display the rebuild status press q to get to the main menu "IBM Power RAID Configuration Utility" or from command line type in "iprconfig". Select 2 for "Work with disk arrays" and then 1 for "Display disk array status". You can see the percentage of the rebuild for the primary controller sda. The secondary controller sdb only shows the "Rebuilding" status.
image-20240826150313-19
5.18 You can press "r" to refresh the screen to see the percentage of the rebuild increase.
image-20240826150632-20
5.19 Once the rebuild completes, the array will be in an "Optimized" state for the primary controller. If "HA asymmetric access" is enabled, both controllers show as "Optimized". In this mode, all read/write operations go through the primary adapter sda. The procedure is now complete.

image-20240826155550-2

Procedure 6: Replacing a failed disk unit in a RAID10 protected array with an active hot spare using iprconfig

6.1 The following two screenshots show what a protected array with a hot spare looks like. They can be accessed by typing "iprconfig" at the command line and selecting option 1 "Display hardware status". The second method is selecting 2 at the main menu, "Work with disk arrays" and then 1 to "Display disk array status".

image-20240806153431-1

image-20240806153533-2

6.2 When a disk in the array fails the hot spare will join the array and start the rebuild process. If you are quick enough after the disk fails you will be able to see the progress. Type "iprconfig" from the command line and select option 2 "Work with disk arrays" then option 1 "Display disk array status". The example below shows the status when the hot spare joins and the array starts to rebuild.

image-20240806154321-3

6.3 Press "r" to refresh the screen and show the progress of the rebuild status.

image-20240806154508-4

The failed disk is shown from the "Display Hardware Status" screen.

6.4 Once the rebuild is complete the array will be optimized from the "Display Disk Array Status" screen.

image-20240806162128-6

From "Display Hardware Status" you see the failed disk and the Optimized array.

image-20240806162406-7 

6.5 To replace the disk, type "iprconfig" at the command line and then select option 3 for "Work with disk unit recovery".

image-20240806162724-8

6.6 Select 2 for "Concurrent remove device".

image-20240806162957-9

6.7 Enter "1" in the row displaying the failed disk unit and press enter. Record the "PCI/SCSI Location" for the failed disk. This will be used later when installing the new disk unit.
image-20240806163417-10
6.8 This displays the "Verify Device Concurrent Remove" screen, and the amber "Identify LED" on your disk unit will flash so that you can locate it. A failed disk shows a status of "Failed"; the example below shows "R/W Protected" because this is a test system and the disk in question has not actually failed.
image-20240806163937-11
6.9 Press enter to continue and the disk will be in the "remove" state.
image-20240806164141-12
6.10 Physically remove the disk unit and then press enter. To install the new disk, go to the "Concurrent add device" menu: from the iprconfig main menu select 3 for "Work with disk unit recovery" and then select 1 for "Concurrent add device".
image-20240806164901-13
6.11 Enter 1 to select the disk slot, which will match the location of the failed disk previously recorded. Press enter and the "Verify Device Concurrent Add" menu is displayed. Press enter again.
image-20240806165251-14
6.12 Insert the new disk unit and then press enter. When complete it will take you back to the "Work with disk unit recovery" menu.
image-20240806170543-15
6.13 Press "q" to back out to the main menu and then select 2 to "Work with disk arrays" then select 7 to "Work with hot spares".
image-20240806172002-16
6.14 Select 1 and press enter to "Create a hot spare".
image-20240806172227-17
6.15 Press 1 and enter to select your primary SAS adapter.
image-20240806172419-18
6.16 Confirm the location of the disk you previously recorded, then press 1 and enter to create the Hot Spare disk.
image-20240806172635-19
6.17 The "Confirm Create Hot Spare" screen is displayed, which is your final warning. Press enter to confirm your selection and the Hot Spare disk will be created. This ends the procedure.
image-20240806172821-20

Procedure 7: Replacing a failed disk unit in a RAID5 or RAID6 protected array with an active hot spare using iprconfig

7.1 To display the status of your array and hot spare, enter "iprconfig" at the command line and then select 2 for "Work with disk arrays".

image-20240807130025-2

7.2 Select 1 to "Display disk array status".

image-20240807130322-3

7.3 The following screen shot shows a healthy array with an active hot spare. It shows sda and sdb since there are redundant SAS controllers.

image-20240826162904-1

7.4 When a disk in the array fails, it will look like the following screen shot if you catch it before the rebuild completes. You can see the progress of the automatic rebuild. Record the "PCI/SCSI Location" of the failed disk. This will be used later when replacing the disk unit.

image-20240826163526-4

7.5 You can also display the failed disk from the 'iprconfig' main menu by selecting 1 to "Display hardware status". Make note of the "PCI/SCSI Location". It will be used later when replacing the disk unit.

image-20240826163740-5

Once the array completes the rebuild from the "Hot Spare" it will look like the following screen shot.

image-20240826172336-6

7.6 To replace the disk, type "iprconfig" at the command line and then select option 3 for "Work with disk unit recovery".

image-20240806162724-8

7.7 Select 2 for "Concurrent remove device".

image-20240806162957-9

7.8 Enter "1" in the row displaying the failed disk unit and press enter. Record the "PCI/SCSI Location" for the failed disk. This will be used later when installing the new disk unit.
image-20240806163417-10
7.9 This displays the "Verify Device Concurrent Remove" screen, and the amber "Identify LED" on your disk unit will flash so that you can locate it. A failed disk shows a status of "Failed"; the example below shows "R/W Protected" because this is a test system and the disk in question has not actually failed.
image-20240806163937-11
7.10 Press enter to continue and the disk will be in the "remove" state.
image-20240806164141-12
7.11 Physically remove the disk unit and then press enter. To install the new disk, go to the "Concurrent add device" menu: from the iprconfig main menu select 3 for "Work with disk unit recovery" and then select 1 for "Concurrent add device".
image-20240806164901-13
7.12 Enter 1 to select the disk slot, which will match the location of the failed disk previously recorded. Press enter and the "Verify Device Concurrent Add" menu is displayed. Press enter again.
image-20240806165251-14
7.13 Insert the new disk unit and then press enter. When complete it will take you back to the "Work with disk unit recovery" menu.
image-20240806170543-15
7.14 Press "q" to back out to the main menu and then select 2 to "Work with disk arrays" then select 7 to "Work with hot spares".
image-20240806172002-16
7.15 Select 1 and press enter to "Create a hot spare".
image-20240806172227-17
7.16 Press 1 and enter to select your primary SAS adapter.
image-20240806172419-18
7.17 Confirm the location of the disk you previously recorded, then press 1 and enter to create the Hot Spare disk.
image-20240806172635-19
7.18 The "Confirm Create Hot Spare" screen is displayed, which is your final warning. Press enter to confirm your selection and the Hot Spare disk will be created. This ends the procedure.
image-20240806172821-20

Procedure 8: Replacing a failed hot spare using iprconfig

8.1 To display the status of your array and hot spare, enter "iprconfig" at the command line and select 2 for "Work with disk arrays".

image-20240807130025-2

8.2 Select 1 to display the hot spare and disk array status.

image-20240807130322-3

8.3 The following screen shot shows a healthy array with an active hot spare. The array is listed twice because there are redundant SAS adapters, sda and sdb.

image-20240807130607-4

8.4 When the hot spare fails it will look like the following screen shot. Record the "PCI/SCSI Location". This will be used later when replacing the disk unit. Note that the arrays are unaffected and stay optimized.

image-20240807131238-5

8.5 You can also display the failed hot spare from the 'iprconfig' main menu by selecting 1 to "Display hardware status". Make note of the "PCI/SCSI Location", it will be used later when replacing the disk unit.

image-20240807131651-6

8.6 To replace the disk, type in "iprconfig" from the command line and then select option 3 for "Work with disk unit recovery".

image-20240806162724-8

8.7 Select 2 for "Concurrent remove device".

image-20240806162957-9

8.8 Enter "1" in the row displaying the failed disk unit and press enter. Record the "PCI/SCSI Location" for the failed disk, this will be used later when installing the new disk unit.
image-20240806163417-10
8.9 This displays the "Verify Device Concurrent Remove" screen and your disk unit will have the amber "Identify LED" in a flashing state so that you can locate the disk unit. In the example below it will show a status of "Failed" instead of "R/W Protected" since this is a test system and the disk in question has not actually failed.
image-20240806163937-11
8.10 Press enter to continue and the disk will be in the "remove" state.
image-20240806164141-12
8.11 Physically remove the disk unit and then press enter. To install the new disk, go to the "Concurrent add device" menu: from the command line type "iprconfig", select 3 for "Work with disk unit recovery", and then select 1 for "Concurrent add device".
image-20240806164901-13
8.12 Enter 1 to select the location of the disk slot, which will match the location of the failed disk previously recorded. Press enter and the "Verify Device Concurrent Add" menu is displayed, then press enter again.
image-20240806165251-14
8.13 Insert the new disk unit and then press enter. When complete, it will take you back to the "Work with disk unit recovery" menu.
image-20240806170543-15
8.14 Press "q" to back out to the main menu and then select 2 to "Work with disk arrays" then select 7 to "Work with hot spares".
image-20240806172002-16
8.15 Select 1 and press enter to "Create a hot spare".
image-20240806172227-17
8.16 Press 1 and enter to select your primary SAS adapter.
image-20240806172419-18
8.17 Confirm the location of the disk you previously recorded, then press 1 and enter to create the Hot Spare disk.
image-20240806172635-19
8.18 The "Confirm Create Hot Spare" screen is displayed which is your final warning. Press enter to confirm your selection and the Hot Spare disk will be created. This ends the procedure.
image-20240806172821-20
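
The hot spare part of this procedure (steps 8.14 to 8.18) may also be possible without the menus, because iprconfig accepts commands through its "-c" option. The following Python sketch only builds the command lines for review and does not run them. It assumes the "show-hot-spares", "query-hot-spare-create" and "hot-spare-create" commands documented in the iprconfig man page. The function name and the "sg6" device name are hypothetical; confirm the correct device and command syntax with "man iprconfig" on your system before running anything.

```python
import shlex


def hot_spare_commands(sg_device):
    """Return iprconfig command lines for reviewing and creating a hot spare.

    sg_device is a placeholder such as "sg6" (an assumption); use the device
    that "query-hot-spare-create" reports as a valid candidate on your system.
    """
    if not sg_device.startswith("sg"):
        raise ValueError("expected an sg device name such as 'sg6'")
    return [
        ["iprconfig", "-c", "show-hot-spares"],          # current hot spares
        ["iprconfig", "-c", "query-hot-spare-create"],   # candidate disks
        ["iprconfig", "-c", "hot-spare-create", sg_device],
        ["iprconfig", "-c", "show-hot-spares"],          # verify the result
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for argv in hot_spare_commands("sg6"):
        print(shlex.join(argv))
```

Printing the commands rather than executing them keeps the sketch safe to try on any machine; the menu procedure above remains the documented path.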
 

Appendix

Appendix A1:  Booting IBM Standalone Diagnostics from DVD or USB

Note: The following procedures are from an HMC-managed system using a Linux full system LPAR. The procedures for a stand-alone (non-HMC-managed) system are the same, except that you use the console to press the "1" key when POST is presented, which gives you access to SMS. Diagnostic images can be downloaded from here and can be burned to DVD or USB media.

A1.1 The target LPAR must not be activated; deactivate it if it is. Before activating the LPAR, put your standalone diagnostic media in the system. Put a check mark in the Profile.

image-20240430093818-1

A1.2 Select the Actions tab and then "Activate".

image-20240430094003-2

A1.3 Expand Advanced Settings and change the Boot Mode to "System Management Services".

image-20240430094130-3

A1.4 You can put a check mark in the open VTERM check box if you want to open a console terminal window here. The method I prefer is to ssh into the HMC and use "vtmenu".

image-20240430094535-4

A1.5 Click on Finish and the LPAR will activate. Once this process completes, you can close the window and go to your console window.

image-20240430094739-5

A1.6 If you chose to use vtmenu from the HMC command line, ssh into the HMC as hscroot and type "vtmenu".

image-20240430094954-6

A1.7 Select the LPAR

image-20240430095111-7

A1.8 You will see the POST menu. If you are on a standalone system, this is where you would press "1" to force it to go to SMS. In this example we set the LPAR to activate to SMS.

image-20240430101139-2

A1.9 The console will come to the SMS main menu, select 5 for "Select Boot Options"

image-20240430095251-8

A1.10 Select 1 for "Select Install/Boot Device"

image-20240430095418-9

A1.11 Select 7 for "List all Devices".

image-20240430095535-10

A1.12 This will display a list of devices you can boot from. Often the network or fiber devices show up first, so you have to press the "N" key to go to the next page. You will see the DVD or USB listed depending on what media you have loaded. Select the number of the diagnostic media and press enter. This starts the IPL.

image-20240430101331-3

A1.13 Select 2 for a Normal Mode Boot

image-20240923132554-2

A1.14 Select 1 to confirm it is okay to exit SMS

image-20240923132650-3

A1.15 Press 1 and enter to make this the console.

image-20240923133124-4

A1.16  Press Enter to continue.

image-20240923133812-5

A1.17 This will bring you to the Diagnostics Main Menu. See the "Note" at the bottom: once you make a selection you will have to enter the terminal type. vt320 is a good option since the function keys work with this emulation.

image-20240923133846-6

A1.18 This ends the Standalone Diagnostics IPL. Return to the step that brought you here.
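
Steps A1.1 to A1.7 use the HMC graphical interface and "vtmenu". As a rough alternative, the same activation to SMS can likely be done from the HMC command line. The Python sketch below only builds the two command strings to run after you ssh into the HMC. It assumes the HMC "chsysstate" command with "-b sms" for the boot mode and "mkvterm" to open the console. The function name, managed system name, LPAR name and profile name are all hypothetical placeholders.

```python
import shlex


def sms_activation_commands(managed_system, lpar, profile):
    """Build HMC command lines to activate an LPAR into SMS and open its console.

    All three arguments are placeholders (assumptions); substitute the names
    shown on your own HMC.
    """
    activate = [
        "chsysstate", "-m", managed_system, "-r", "lpar",
        "-o", "on", "-n", lpar, "-f", profile, "-b", "sms",
    ]
    console = ["mkvterm", "-m", managed_system, "-p", lpar]
    # shlex.join quotes any name that contains spaces or shell metacharacters.
    return [shlex.join(activate), shlex.join(console)]


if __name__ == "__main__":
    for line in sms_activation_commands("Server-9009", "linux lpar", "default"):
        print(line)
```

Once the console opens, continue at step A1.8. The strings are printed rather than executed so that they can be checked against the HMC command reference for your HMC level first.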

Appendix A2: How to format a disk unit if it is R/W Protected

A2.1 Type in iprconfig to get to the "IBM Power RAID Configuration Utility" then select 1 to "Display hardware status".

image-20240923135335-7

image-20240923135450-8

A2.2 A new disk from IBM stock does not normally arrive in this state. A disk that was moved from another system or was previously part of a RAID array will show up like this and will need to be formatted.

image-20240923135721-9

A2.3 Press "e" to exit then press 3 to "Work with disk unit recovery".

image-20240923135816-10

A2.4 Select 3 to "Initialize and format disk".

image-20240923135910-11

A2.5 Select 1 and enter to "Initialize and Format" the disk, then press "c" to continue.

image-20240923140006-12

image-20240923140118-13

A2.6 The progress of the format is displayed.

image-20240923140217-14

A2.7 Once the format is complete, it will bring you to the "Work with Disk Unit Recovery" menu. This ends the procedure. Return to the step that brought you here.
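
To spot R/W Protected disks without paging through the menu, a status listing can be filtered on its Status column. The following Python sketch parses a fixed-width table by using the row of dashes under the header to find the column boundaries. The sample text is hypothetical and not captured output: its columns are modeled on the "Name / PCI/SCSI Location / Description / Status" header shown earlier in this document, and it is only an assumption that output such as that of "iprconfig -c show-config" follows the same layout on your system.

```python
import re

# Hypothetical sample, NOT captured output (layout is an assumption).
SAMPLE = """\
Name   PCI/SCSI Location         Description               Status
------ ------------------------- ------------------------- -----------------
sda    0000:00:01.0/0:255:0:0    RAID 10 Disk Array        Optimized
       0000:00:01.0/0:0:4:0      RAID 10 Array Member      Active
       0000:00:01.0/0:0:5:0      RAID 10 Array Member      Active
       0000:00:01.0/0:0:6:0      Advanced Function Disk    R/W Protected
"""


def parse_status_table(text):
    """Split a fixed-width status listing into one dict per device row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The separator row contains only dashes and spaces; the header is above it.
    sep_idx = next(i for i, ln in enumerate(lines) if set(ln) <= {"-", " "})
    spans = [m.span() for m in re.finditer(r"-+", lines[sep_idx])]
    spans[-1] = (spans[-1][0], None)  # let the last column run to end of line
    names = [lines[sep_idx - 1][a:b].strip() for a, b in spans]
    return [
        {name: ln[a:b].strip() for name, (a, b) in zip(names, spans)}
        for ln in lines[sep_idx + 1:]
    ]


def locations_with_status(text, status):
    """Return the PCI/SCSI Location of every row whose Status equals status."""
    return [
        row["PCI/SCSI Location"]
        for row in parse_status_table(text)
        if row["Status"] == status
    ]


if __name__ == "__main__":
    print(locations_with_status(SAMPLE, "R/W Protected"))
```

Slicing on the dash row rather than splitting on whitespace matters here because the Name column can be empty and a status such as "R/W Protected" contains a space.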

Appendix A3: Booting IBM Standalone Diagnostics to run disk diagnostics on a JBOD hdisk

A3.1 Follow Appendix A1 to perform a diagnostics IPL. Move your cursor to "Advanced Diagnostics Routines" and press enter.

image-20240829114615-1

A3.2 Set your terminal type. vt320 is a good choice since the function keys work with this emulation.

image-20240910091109-1

A3.3 Select "System Verification" and press enter.

image-20240910091341-2

A3.4 Move the cursor to the hdisk that you want to test and press enter to select it, then press "F7" to start the diagnostics. Note: if you have more than one hdisk and want to test several of them, select each one by pressing enter on it, then press "F7" to start the diagnostics. Once a disk is selected you will see a "+" symbol beside the hdisk.

image-20240910091826-3

A3.5 Testing begins.

image-20240910092013-5

A3.6 Testing results are displayed and you receive an option to perform a certification. It is recommended to perform certification on a suspect disk. Move the cursor to yes and press enter.

image-20240910092308-6

A3.7 The certify operation starts and you can see its progress.

image-20240910092449-7

image-20240910092832-9

A3.8 Once testing completes, it will list any errors the disk may have. If you receive errors, report them to your IBM Support channels. If you receive no errors, as in the screenshot below, no further action is needed.

image-20240910093325-10

A3.9 Press enter and the SAS controller will also perform a test.

image-20240910093842-11

A3.10 The final test results are displayed. If no trouble is found, press enter to continue. If trouble was found, report it to your IBM service channels.

image-20240910094339-12

A3.11 This brings you to your device selection list. Note: there is an "*" beside each device tested. You can continue testing other devices; otherwise this procedure is complete and you can power off the LPAR.

image-20240910094721-13

Appendix A4: Booting IBM Standalone Diagnostics to run disk diagnostics on a pdisk that is part of an array

A4.1 Follow Appendix A1 to perform a diagnostics IPL. 

A4.2 Enter "3" and press enter to go to the "Task Selection" menu.

image-20240910115419-1

A4.3 Set your terminal type. vt320 is a good choice since the function keys work with this emulation.

image-20240910091109-1

A4.4 Move the cursor to the "Raid Array Manager", which is typically at or near the bottom of the list, and press enter.

image-20240910115603-2

A4.5 Move the cursor to "IBM SAS Disk Array Manager" and press enter.

image-20240910115748-3

A4.6 Move the cursor to "Diagnostics and Recovery Options" and press enter.

image-20240910121847-4

A4.7 Move the cursor to "Certify Physical Disk Media" and press enter.

image-20240910121953-5

A4.8 All "pdisks" will be displayed. Move your cursor to the "pdisk" you want to test and press enter to select it. Multiple disks can be selected. Once all are selected, press "F7" to commit and start the test.

image-20240910122324-6

A4.9 Testing begins at 0 percent complete and will progress through the test.

image-20240910122435-7

A4.10 Testing completes and displays the results. If any errors are displayed, report them to your IBM Support channel. Press enter to continue. This completes the procedure.

image-20240910130108-8

Document Location

Worldwide

Operating System

Cross Brand:Linux

Document Information

Modified date:
23 September 2024

UID

ibm17144244