Troubleshooting
Problem
There are known issues with Repair and Verify (R&V) related to EMX0. Some have fixes, while others have only a workaround. Note the details in the failure error message, specifically which OS or component reported the problem:
Reporting partition type: IBM i
Reporting partition type: PHYP
Reporting partition: 7*9119-MHE*PPSSSSS (AIX and VIOS)
If the error is already closed, the error details can be found in the pedbg. Depending on the HMC code version, the trace files can be found in RVTrace.tar and/or in RepairAndVerifyOperation.tar or RepairAndVerify.tar.
Typically, there is one trace file for each R&V process attempted, in the folder: dump/RVTrace.tar_unpack/opt/ccfw/data/service/RVTrace.
However, there are cases where these files are missing.
In that case, look in RepairAndVerifyOperation and RepairAndVerify. In these files, multiple R&V operations are recorded together in one large trace file, so you have to look through each file to find the one that contains the relevant Repair and Verify operation.
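Looking through each archive can be partly automated once the pedbg has been unpacked. The following is a minimal sketch, not an IBM tool: the function name and the search string are hypothetical, and it assumes the .tar files named above are ordinary tar archives readable by Python's standard tarfile module.

```python
import tarfile


def find_trace_members(tar_path, needle):
    """Return the names of archive members whose contents contain `needle`."""
    needle_bytes = needle.encode("utf-8")
    matches = []
    with tarfile.open(tar_path) as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            handle = archive.extractfile(member)
            if handle is None:
                continue
            with handle:
                # Reads the whole member into memory; acceptable for a sketch.
                if needle_bytes in handle.read():
                    matches.append(member.name)
    return matches
```

For example, `find_trace_members("RepairAndVerifyOperation.tar", "0x0300")` would list the trace files that mention that return code, which narrows down which file to read in full.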
Note that if the failure is reported by an OS, the fix is an OS fix. If the failure is in PHYP, the fix is likely in server firmware (FW). However, there are also HMC defects that affect R&V regardless of OS.
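The rule above can be expressed as a small lookup. This is an illustrative sketch only: the function name and the returned strings are hypothetical, and the matching assumes the message lines look exactly like the three reporting-partition forms listed earlier in this section.

```python
def likely_fix_area(report_line):
    """Map a 'Reporting partition' line to the area where a fix is likely found."""
    text = report_line.strip()
    if text.startswith("Reporting partition type:"):
        kind = text.split(":", 1)[1].strip()
        if kind == "IBM i":
            return "IBM i OS fix"
        if kind == "PHYP":
            return "server firmware (FW) fix"
        return "unknown"
    if text.startswith("Reporting partition:"):
        # This form (machine type*model*serial) is the AIX and VIOS case.
        return "AIX or VIOS OS fix"
    return "unknown"
```

The result is only a starting point, because an HMC defect can produce the same messages regardless of which OS reported them.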
IBM i fixes:
APAR MA45930 Concurrent Maintenance fails for CXP, cable, or Fanout FRU Replacement
- R710 MF62705 (Delayed)
- R720 MF62675 (Delayed)
- R730 MF62706 (Delayed)
APAR MA46392 Duplicate SLB Entries lead to dynamic processor deconfiguration, and recycle of EMX0 drawer resources if attached to the processor in question
- R710 MF63633 (Delayed)
- R720 MF63632 (Delayed)
- R730 MF63634 (Delayed)
APAR MA45986 LPAR Hang during recovery operations from PCIe Link reset
- R710 MF62823 (Delayed)
- R720 MF62807 (Delayed)
- R730 MF62819 (Delayed)
Concurrent Maintenance Repair & Verify (R&V) fails with 0x0300 Failure on EMX0 Components Tip P6220340: https://www.ibm.com/support/pages/node/6220340
AIX known issue 0931-013 Unable to isolate the resource. https://www.ibm.com/support/pages/node/667089
Minimum HMC Level V8 R8.6.0 Service Pack 1 (PTF MH01656)
This level, in addition to several crucial security fixes, corrects issues in and improves the Repair and Verify (R&V) functions needed to successfully perform a repair on the EMX0 drawers, as well as on other components of the P8 servers.
HMC defect SW393016:
R&V reports 0x0300 "Failure (hard stop or user intervention required)". The partitions listed under "Reporting partition name" are IBM i partitions that do not own any IO in the drawer being repaired. Fixed in HMC code levels 850.3, 860.3, 870.1, and 910.
Concurrent Maintenance (R&V) of Cable Card fails with Return Code 0x0314, Tip P6421521: https://www.ibm.com/support/pages/node/6421521
Symptom
A hard-stop failure of R&V displays errors similar to the following on the HMC GUI:
Reporting partition type: IBM i
Return code: 0x0300
Return code type: Failure (hard stop or user intervention required).
Message: Retry
Corrective action: Retry the procedure. If the problem persists, contact your service provider for help.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Reporting partition name: WebP2
Reporting partition type: IBM i
Return code: 0x0300
Return code type: Failure (hard stop or user intervention required).
Message: Retry
Corrective action: Retry the procedure. If the problem persists, contact your service provider for help
Return code: 0x0300
Return code type: Failure (hard stop or user intervention required).
Message: Retry
Corrective action: Retry the procedure. If the problem persists, contact your service provider for help.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Reporting partition name: WebP2
Reporting partition type: IBM i
Return code: 0x0300
Return code type: Failure (hard stop or user intervention required).
Message: Retry
Corrective action: Retry the procedure. If the problem persists, contact your service provider for help
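When many such messages are pasted from the HMC GUI, it can help to split them into one record per message so that the reporting partitions and return codes can be compared. The sketch below is a hypothetical helper, not part of any HMC tooling; it assumes each line has the "Field: value" layout shown above and that a repeated field or a new "Reporting partition name" line starts a new message.

```python
def parse_rv_messages(text):
    """Split pasted R&V error text into one dict per message block."""
    records = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"-"}:
            continue  # skip blank lines and dashed separator lines
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line
        key = key.strip()
        # A field seen twice, or a new partition name, begins a new block.
        if key in current or (key == "Reporting partition name" and current):
            records.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        records.append(current)
    return records
```

Each record can then be checked for its "Reporting partition name" and "Return code" values, for example to list which partitions reported 0x0300.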
Another Example:
Return code type: Failure (hard stop or user intervention required).
Message: The operation to remove PHB "PHB ###" failed on partition 7*9119-MHE*PPSSSSS.
Then, a message similar to the following is present:
The OS standard error: 0931-013 Unable to isolate the resource.
HMC code defect SW393016:
R&V on an EMX0 IO enclosure reports a failure with 0x0300 errors even though it completed successfully.
The reporting partitions are IBM i partitions that do not own any IO in the drawer being repaired.
Fixed in 850.3, 860.3, and 870.1, and in the base of releases beyond 870.
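The fix levels listed for SW393016 can be encoded as a quick check. This is a sketch under stated assumptions: the function name is hypothetical, the level is written as "release.servicepack" following the notation used above, and later service packs within a listed release are assumed to retain the fix. Releases older than those listed are not covered by this document, so the function returns None for them instead of guessing.

```python
# Minimum service pack carrying the SW393016 fix, per release listed above.
FIXED_SERVICE_PACK = {850: 3, 860: 3, 870: 1}


def has_sw393016_fix(level):
    """Return True, False, or None (release not covered) for a level like '860.3'."""
    release_text, _, sp_text = level.partition(".")
    release = int(release_text)
    service_pack = int(sp_text) if sp_text else 0
    if release > 870:
        return True  # the fix is in the base of releases beyond 870
    if release in FIXED_SERVICE_PACK:
        return service_pack >= FIXED_SERVICE_PACK[release]
    return None  # not stated in this document
```

For instance, `has_sw393016_fix("860.2")` returns False and `has_sw393016_fix("910")` returns True; the actual HMC level should always be confirmed on the HMC itself.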
Cause
Many of the problems are caused by the defects noted in the Problem section; however, there are a couple of cases where a workaround is needed.
0931-013 Unable to isolate the resource: see the support document: https://www.ibm.com/support/pages/node/667089
IBM i 0x0300: if an IBM i partition (LPAR) has a large amount of memory, see HW Tip: https://www.ibm.com/support/pages/node/889159
Tip P6220340: https://www.ibm.com/support/pages/node/6220340
Environment
IBM POWER8 and beyond with EMX0.
Diagnosing The Problem
View/Save PCIe Topology from the HMC
pedbg -c -q4. Instructions for collecting from the Enhanced HMC: http://www.ibm.com/support/docview.wss?uid=nas8N1022548
Non-disruptive resource dump with blank resource selector.
How to initiate from Enhanced HMC GUI: http://www.ibm.com/support/docview.wss?uid=nas8N1021901
IBM i data to collect if the failure was reported by an IBM i LPAR:
The Transport manager traces
The following Advanced Analysis (AA) macros:
iofr with options: -iop -all
iofr again with options: -ctl -all
iopiprtrfr with options: -all -bustr -sctr
The Transport manager traces:
- STRSST . . . . . . . . . . . . . . . . . . . . . . Enter
- Option 1 Start a service tool . . . . . . . . . . Enter
- Option 4 Display/Alter/Dump . . . . . . . . . . . Enter
- Option 2 Dump to printer . . . . . . . . . . . . Enter
- Option 2 Licensed Internal Code (LIC) data . . . Enter
- Option 12 Transport manager traces . . . . . . . . Enter
- Option 5 (Note: This is a hidden option) . . . . Enter
Specify Dump Title
- Output device . . . . . . : Printer
- Type choices, press Enter.
- Dump title . . . . . . . . . TRANSPORT MANAGER TRACE
- Perform seizes . . . . . . . . . 1 1=Yes, 2=No
AA Macro instructions:
1. From the operating system command line, type the following:
   STRSST
   Press the Enter key.
2. Sign in with a service tool profile and password that has authority to Display/Alter/Dump in SST.
3. Select Option 1 - Start a service tool, and press the Enter key.
4. Select Option 4 - Display/Alter/Dump, and press the Enter key.
5. a. To view the data on the screen, select Option 1 - Display/Alter storage, and press the Enter key.
   b. To dump to a spooled file (for sending the data to IBM), select Option 2 - Dump to printer, and press the Enter key.
6. Select Option 2 - Licensed Internal Code (LIC) data, and press the Enter key.
7. Select Option 14 - Advanced analysis, and press the Enter key.
8. On the Select Advanced Analysis Command screen, there is a list of available advanced analysis macros (under the Command column). Either select the macro from the list or type 1 (Select) next to the top blank line under the Command column. In the blank line, type the macro name, and press the Enter key.
9. In the Options field, specify the options provided, and press the Enter key. This provides the output needed (if dumped).
10. If the output is to be spooled (Option 2 in Step 5b above), the spooled file name is QPCSMPRT.
Resolving The Problem
Per Tip P6220340 (https://www.ibm.com/support/pages/node/6220340), go into PCIe Hardware Topology, click Refresh, and then attempt the R&V procedure again.
Document Location
Worldwide
Document Information
Modified date:
07 December 2021
UID
ibm10872300