SC860
For Impact, Severity, and other firmware definitions, please refer to the 'Glossary of firmware terms' at the URL below:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
SC860_246_165 / FW860.B4
2024/10/25
Impact: Security Severity: HIPER
System firmware changes that affect all systems
- A security problem was fixed for CVE-2024-45656.
SC860_245_165 / FW860.B3
2023/05/22
Impact: Security Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive: A problem was identified internally by IBM related to SR-IOV virtual function support in PowerVM. An attacker with privileged user access to a logical partition that has an assigned SR-IOV virtual function (VF) may be able to create a denial of service of the VFs assigned to other logical partitions on the same physical server and/or cause undetected arbitrary data corruption. The Common Vulnerabilities and Exposures number is CVE-2023-30440.
SC860_243_165 / FW860.B1
2022/06/02
Impact: Data Severity: HIPER
System firmware changes that affect certain systems
- HIPER/Pervasive: For systems with an IBM i partition with native SR-IOV at firmware levels FW810.00 through FW860.B0, a problem was fixed for data incorrectly written to PowerVM/LPAR memory during a DLPAR remove of a native SR-IOV Virtual Function (VF) or Concurrent Maintenance (CM) of the SR-IOV adapter. This may cause undetected data corruption in a partition or a PowerVM crash.
SC860_240_165 / FW860.B0
2022/01/21
Impact: Availability Severity: SPE
System firmware changes that affect all systems
- A problem was fixed for an incorrect SRC logged for a #EMX0 PCIe expansion drawer power fault found on the low CXP cable. An SRC B7006A85 (AOCABLE, PCICARD) is logged instead of the correct SRC B7006A86 (PCICARD, AOCABLE). This happens every time there is a power fault on the low CXP cable.
- A problem was fixed for a Live Partition Mobility (LPM) hang during LPM validation on the target system. This is a rare problem triggered during an LPM migration that causes subsequent LPM attempts to fail, as well as other operations such as configuration changes and partition shutdowns.
To recover and restore LPM and the other affected operations (configuration changes and shutting down partitions), the system must be re-IPLed.
- A problem was fixed for the HMC Repair and Verify (R&V) procedure failing with "Unable to isolate the resource" during concurrent maintenance of the #EMX0 Cable Card. This could lead one to take disruptive action in order to do the repair. This should occur infrequently and only with cases where a physical hardware failure has occurred which prevents access to the PCIe reset line (PERST) but allows access to the slot power controls. As a workaround, pulling both cables from the Cable Card to the #EMX0 expansion drawer will result in a completely failed state that can be handled by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC. Then retry the R&V operation to recover the Cable Card.
- A problem was fixed for a partition with an SR-IOV logical port (VF) having a delay in the start of the partition. If the partition boot device is an SR-IOV logical port network device, this issue may result in the partition failing to boot with SRCs BA180010 and BA155102 logged and then becoming stuck on progress code SRC 2E49 for an AIX partition. This problem is infrequent because it requires multiple error conditions at the same time on the SR-IOV adapter. To trigger this problem, multiple SR-IOV logical ports for the same adapter must encounter EEH conditions at roughly the same time such that a new logical port EEH condition is occurring while a previous EEH condition's handling is almost complete but not notified to the hypervisor yet. To recover from this problem, reboot the partition.
- A problem was fixed for a system hypervisor hang and an Incomplete state on the HMC after a logical partition (LPAR) is deleted that has an active virtual session from another LPAR. This problem happens every time an LPAR is deleted with an active virtual session. This is a rare problem because virtual sessions from an HMC (a more typical case) prevent an LPAR deletion until the virtual session is closed, but virtual sessions originating from another LPAR do not have the same check.
- The following problems were fixed for certain SR-IOV adapters:
1) An error was fixed that occurs during a VNIC failover where the VNIC backing device has a physical port down due to an adapter internal error with an SRC B400FF02 logged. This is an improved version of the fix delivered in the earlier service pack FW860.A0 (adapter firmware 11.4.415.37) and significantly reduces the frequency of the error.
2) An adapter in SR-IOV shared mode may cause a network interruption, with SRCs B400FF02 and B400FF04 logged. The problem occurs infrequently during normal network traffic.
These fixes update the adapter firmware to 11.4.415.41 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1.
Update instructions: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
- For a system with an AIX or Linux partition, a problem was fixed for Platform Error Logs (PELs) that are truncated to only eight bytes for error logs created by the firmware and reported to the AIX or Linux OS. These PELs may appear to be blank or missing on the OS. This rare problem is triggered by multiple error log events in the firmware occurring close together in time, each needing to be reported to the OS, causing a truncation in the reporting of the PEL. As a workaround, the full error logs corresponding to the truncated logs can be viewed on the HMC or by using ASMI on the service processor.
- A problem was fixed for Platform Error Logs (PELs) not being logged and shown by the OS if they have an Error Severity code of "critical error". The trigger is the reporting by a system firmware subsystem of an error log that has set an Event/Error Severity in the 'UH' section of the log to a value in the range 0x50 to 0x5F. The following error logs are affected:
B200308C ==> PHYP ==> A problem occurred during the IPL of a partition. The adapter type cannot be determined. Ensure that a valid I/O Load Source is tagged.
B700F104 ==> PHYP ==> Operating System error. Platform Licensed Internal Code terminated a partition.
B7006990 ==> PHYP ==> Service processor failure
B2005149 ==> PHYP ==> A problem occurred during the IPL of a partition.
B700F10B ==> PHYP ==> A resource has been disabled due to hardware problems
A7001150 ==> PHYP ==> System log entry only, no service action required. No action needed unless a serviceable event was logged.
B7005442 ==> PHYP ==> A parity error was detected in the hardware Segment Lookaside Buffer (SLB).
B200541A ==> PHYP ==> A problem occurred during a partition Firmware Assisted Dump
B7001160 ==> PHYP ==> Service processor failure.
B7005121 ==> PHYP ==> Platform LIC failure
BC8A0604 ==> Hostboot ==> A problem occurred during the IPL of the system.
BC8A1E07 ==> Hostboot ==> Secure Boot firmware validation failed.
Note that these error logs are still reported to the service processor and HMC properly. This issue does not affect the Call Home action for the error logs.
- A problem was fixed for the Device Description in a System Plan related to Crypto Coprocessors and NVMe cards that were only showing the PCI vendor and device ID of the cards. This is not enough information to verify which card is installed without looking up the PCI IDs first. With the fix, more specific/useful information is displayed and this additional information does not have any adverse impact on sysplan operations. The problem is seen every time a System Plan is created for an installed Crypto Coprocessor or NVMe card.
- A problem was fixed for correct ASMI passwords being rejected when accessing ASMI using an ASCII terminal with a serial connection to the server. This problem always occurs for systems at firmware level FW860.A0 and later.
System firmware changes that affect certain systems
- On systems with an IBM i partition, a problem was fixed for a Live Partition Mobility (LPM) hang while performing the migration of an IBM i partition. In some situations, there is a timing issue when the hypervisor is managing IBM i software licenses. When a subsequent LPM operation is performed, the LPM operation hangs. To recover from this problem to be able to do LPM, the system must be re-IPLed.
- For a system with an IBM i partition, a problem was fixed for an IBM i partition running in P7 or P8 processor compatibility mode failing to boot with SRCs BA330002 and B200A101 logged. This problem can be triggered as larger configurations for processors and memory are added to the partition. A circumvention for this problem could be to reduce the number of processors and memory in the partition; booting in P9 or later compatibility mode will also allow the partition to boot.
SC860_236_165 / FW860.A2
2021/12/07
Impact: Security Severity: HIPER
System firmware changes that affect all systems
- HIPER/Non-Pervasive: A security problem was fixed to prevent an attacker that gains service access to the FSP service processor from reading and writing PowerVM system memory using a series of carefully crafted service procedures. The Common Vulnerabilities and Exposures number is CVE-2021-38917.
- HIPER/Non-Pervasive: A problem was fixed for the IBM PowerVM Hypervisor where a specific sequence of VM management operations could lead to a violation of the isolation between peer VMs. The Common Vulnerabilities and Exposures number is CVE-2021-38918.
SC860_234_165 / FW860.A1
2021/09/16
Impact: Data Severity: HIPER
System firmware changes that affect all systems
- HIPER: A problem was fixed which may occur on a target system following a Live Partition Mobility (LPM) migration of an AIX partition utilizing Active Memory Expansion (AME) with 64 KB page size enabled using the vmo tunable: "vmo -ro ame_mpsize_support=1". The problem may result in AIX termination, file system corruption, application segmentation faults, or undetected data corruption.
Note: If you are doing an LPM migration of an AIX partition utilizing AME and 64 KB page size enabled involving a POWER8 or POWER9 system, ensure you have a Service Pack including this change for the appropriate firmware level on both the source and target systems.
SC860_231_165 / FW860.A0
2021/07/08
Impact: Availability Severity: SPE
New features and functions
- Support added to Redfish to provide a command to set the ASMI user passwords using a new AccountService schema. Using this service, the ASMI admin, HMC, and general user passwords can be changed.
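The following sketch is illustrative only: the AccountService/Accounts path and PATCH semantics follow the standard DMTF Redfish schema, but the exact account URIs and member names on the service processor are assumptions to be confirmed against the Redfish documentation for this firmware level. A password change from a management host might look like:
   curl -k -u admin:<current-password> -X PATCH -H "Content-Type: application/json" -d '{"Password": "<new-password>"}' https://<service-processor>/redfish/v1/AccountService/Accounts/<account-id>
A 2xx status code indicates the change was accepted.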
System firmware changes that affect all systems
- A problem was fixed for Time of Day (TOD) being lost for the real-time clock (RTC) with an SRC B15A3303 logged when the service processor boots or resets. This is a very rare problem that involves a timing problem in the service processor kernel. If the server is running when the error occurs, there will be an SRC B15A3303 logged, and the time of day on the service processor will be incorrect for up to six hours until the hypervisor synchronizes its (valid) time with the service processor. If the server is not running when the error occurs, there will be an SRC B15A3303 logged, and if the server is subsequently IPLed without setting the date and time in ASMI to fix it, the IPL will abort with an SRC B7881201, which indicates to the system operator that the date and time are invalid.
- A problem was fixed in ASMI to allow setting static routes with two default gateway IP addresses. Without the fix, ASMI always fails with "Invalid entry. Gateway address" for this configuration. As a workaround, the static routes could be created using the ASMI command line and the "route add" command.
- A problem was fixed for intermittent failures for a reset of a Virtual Function (VF) for SR-IOV adapters during Enhanced Error Handling (EEH) error recovery. This is triggered by EEH events at a VF level only, not at the adapter level. The error recovery fails if a data packet is received by the VF while the EEH recovery is in progress. A VF that has failed can be recovered by a partition reboot or a DLPAR remove and add of the VF.
- A problem was fixed for time-out issues in Power Enterprise Pools 1.0 (PEP 1.0) that can affect performance by having non-optimal assignments of processors and memory to the server logical partitions in the pool. For this problem to happen, the server must be in a PEP 1.0 pool and the HMC must take longer than 2 minutes to provide the PowerVM hypervisor with the information about pool resources owned by this server. The problem can be avoided by running the HMC optmem command before activating the partitions.
- A problem was fixed where the Floating Point Unit Computational Test, which should be set to "staggered" by default, has been changed in some circumstances to be disabled. If you wish to re-enable this option, this fix is required. After applying this service pack, do the following steps:
1) Sign into the Advanced System Management Interface (ASMI).
2) Select Floating Point Computational Unit under the System Configuration heading and change it from disabled to what is needed: staggered (run once per core each day) or periodic (a specified time).
3) Click "Save Settings".
- The following problems were fixed for certain SR-IOV adapters:
1) An error was fixed that occurs during a VNIC failover where the VNIC backing device has a physical port down or read port errors with an SRC B400FF02 logged.
2) A problem was fixed for adding a new logical port with a PVID assigned causing traffic on that VLAN to be dropped by other interfaces on the same physical port that use OS VLAN tagging for that same VLAN ID. This problem occurs each time a logical port with a non-zero PVID matching an existing VLAN is dynamically added to a partition or is activated as part of a partition activation: traffic flow stops for other partitions with OS-configured VLAN devices using the same VLAN ID. This problem can be recovered by configuring an IP address on the logical port with the non-zero PVID and initiating traffic flow on this logical port. This problem can be avoided by not configuring logical ports with a PVID if other logical ports on the same physical port are configured with OS VLAN devices.
This fix updates the adapter firmware to 11.4.415.37 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1.
Update instructions: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
- A problem was fixed for some serviceable events specific to the reporting of EEH errors not being displayed on the HMC. The sending of an associated call home event, however, was not affected. This problem is intermittent and infrequent.
- A problem was fixed for newer hardware record names (hardware delivered after the original POWER8 GA) not being displayed correctly in the ASMI deconfiguration records. For example, Capp is displayed as "Unknown".
- A problem was fixed for a system termination with SRC B700F107 following a time facility processor failure with SRC B700F10B. With the fix, the transparent replacement of the failed processor will occur for the B700F10B if there is a free core, with no impact to the system.
- A problem was fixed for possible partition errors following a concurrent firmware update from FW810 or later. A precondition for this problem is that DLPAR operations of either physical or virtual I/O devices must have occurred prior to the firmware update. The error can take the form of a partition crash at some point following the update. The frequency of this problem is low. If the problem occurs, the OS will likely report a DSI (Data Storage Interrupt) error. For example, AIX produces a DSI_PROC log entry. If the partition does not crash, it is also possible that some subsequent I/O DLPAR operations will fail.
- A problem was fixed for spurious out-of-range (greater than 127 C) temperatures being reported for the processor with SRC B1112A10. With the fix, only valid temperature sensor readings are used when reporting processors that have exceeded the Over Temperature (OT) value.
- A problem was fixed in ASMI for setting a static route with a network address for the IP such as "xxx.xxx.xxx.0". Without the fix, ASMI always fails with "Invalid entry. IP address" for this network address format. As a workaround, the static route could be created with the individual IP endpoint entered instead of the network address, or created using the ASMI command line and the "route add" command (see the example below).
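For reference, the "route add" workaround mentioned in the ASMI static route items above can be entered on the ASMI command line using standard route syntax; the addresses below are placeholders and the exact options accepted by this firmware level should be confirmed in the ASMI documentation:
   route add -net 192.168.20.0 netmask 255.255.255.0 gw 192.168.20.1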
System firmware changes that affect certain systems
- On systems with an IBM i partition, a problem was fixed for physical I/O property data not being able to be collected for an inactive partition booted in "IOR" mode with SRC B200A101 logged. This can happen when making a system plan (sysplan) for an IBM i partition using the HMC and the IBM i partition is inactive. The sysplan data collection for the active IBM i partitions is successful.
- On systems with only Integrated Facility for Linux (IFL) processors and AIX or IBM i partitions, a problem was fixed for performance issues for IFL VMs (Linux and VIOS). This problem occurs if AIX or IBM i partitions are active on a system with IFL-only cores. As a workaround, AIX or IBM i partitions should not be activated on an IFL-only system. With the fix, the activation of AIX and IBM i partitions is blocked on an IFL-only system. If this fix is installed concurrently with AIX or IBM i partitions running, these partitions will be allowed to continue to run until they are powered off. Once powered off, the AIX and IBM i partitions will not be allowed to be activated again on the IFL-only system.
SC860_226_165 / FW860.90
2020/12/09
Impact: Data Severity: HIPER
New features and functions
- Enable periodic logging of internal component operational data for the PCIe3 expansion drawer paths. The logging of this data does not impact the normal use of the system.
System firmware changes that affect all systems
- HIPER/Pervasive: A problem was fixed for certain SR-IOV adapters for a condition that may result from frequent resets of adapter Virtual Functions (VFs), or transmission stalls and could lead to potential undetected data corruption.
The following additional fixes are also included:
1) The VNIC backing device goes to a powered off state during a VNIC failover or Live Partition Mobility (LPM) migration. This failure is intermittent and very infrequent.
2) Adapter time-outs with SRC B400FF01 or B400FF02 logged.
3) Adapter time-outs related to adapter commands becoming blocked with SRC B400FF01 or B400FF02 logged
4) VF function resets occasionally not completing quickly enough resulting in SRC B400FF02 logged.
This fix updates the adapter firmware to 11.4.415.33 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1.
The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
- A rare problem was fixed for a checkstop during an IPL that fails to isolate and guard the problem core. An SRC is logged with B1xxE5xx and an extended hex word 8 xxxxDD90. With the fix, the suspected failing hardware is guarded and a node is possibly deconfigured to allow the subsequent IPLs of the system to be successful.
- A problem was fixed for the REST/Redfish interface to change the success return code for object creation from "200" to "201". A "200" status code indicates a generic successful request, while a "201" status code indicates that the request succeeded and a resource was created as a result. The Redfish Ruby Client, "redfish_client", may fail a transaction if a "200" status code is returned when "201" is expected. A tolerant client-side status check is sketched after this list.
- A problem was fixed to allow quicker recovery of PCIe links for the #EMX0 PCIe expansion drawer for a run-time fault with B7006A22 logged. The time for recovery attempts can exceed six minutes on rare occasions, which may cause I/O adapter failures and failed nodes. With the fix, the PCIe links will recover or fail faster (on the order of seconds) so that redundancy in a cluster configuration can be used with failure detection and failover processing by other hosts, if available, in the case where the PCIe links fail to recover.
- A problem was fixed for a concurrent maintenance "Repair and Verify" (R&V) operation for a #EMX0 fanout module that fails with an "Unable to isolate the resource" error message. This should occur only infrequently for cases where a physical hardware failure has occurred which prevents access to slot power controls. This problem can be worked around by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC after the hardware failure but before the concurrent repair is attempted. This will avoid the problem with the PCIe slot isolation. These steps can also be used to recover from the error to allow the R&V repair to be attempted again.
- A problem was fixed for a B7006A96 fanout module FPGA corruption error that can occur in unsupported PCIe3 expansion drawer (#EMX0) configurations that mix an enhanced PCIe3 fanout module (#EMXH) in the same drawer with legacy PCIe3 fanout modules (#EMXF, #EMXG, #ELMF, or #ELMG). This causes the FPGA on the enhanced #EMXH to be updated with the legacy firmware, and it becomes a non-working and unusable fanout module. With the fix, the unsupported #EMX0 configurations are detected and handled gracefully without harm to the FPGA on the enhanced fanout modules.
- A problem was fixed for possible dispatching delays for partitions running in POWER8 processor compatibility mode.
- A problem was fixed for system memory not returned after create and delete of partitions, resulting in slightly less memory available after configuration changes in the systems. With the fix, an IPL of the system will recover any of the memory that was orphaned by the issue.
- A problem was fixed for utilization statistics for commands such as HMC lslparutil and third-party lpar2rrd that do not accurately represent CPU utilization. The values are incorrect every time for a partition that is migrated with Live Partition Mobility (LPM). Power Enterprise Pools 2.0 is not affected by this problem. If this problem has occurred, here are three possible recovery options:
1) Re-IPL the target system of the migration.
2) Or delete and recreate the partition on the target system.
3) Or perform an inactive migration of the partition. The cycle values get zeroed in this case.
- A problem was fixed for a PCIe3 expansion drawer cable that has hidden error logs for a single lane failure. This happens whenever a single lane error occurs. Subsequent lane failures are not hidden and have visible error logs. Without the fix, the hidden or informational logs would need to be examined to gather more information for the failing hardware.
- A problem was fixed for a DLPAR remove of memory from a partition that fails if the partition contains 65535 or more LMBs. With 16MB LMBs, this error threshold is 1 TB of memory. With 256 MB LMBs, it is 16 TB of memory. A reboot of the partition after the DLPAR will remove the memory from the partition.
- A problem was fixed for extraneous B400FF01 and B400FF02 SRCs logged when moving cables on SR-IOV adapters. This is an infrequent error that can occur if the HMC performance monitor is running at the same time the cables are moved. These SRCs can be ignored when accompanied by cable movement.
- A problem was fixed for B400FF02 errors for certain SR-IOV adapters during adapter initialization or error recovery. This is a rare error that can occur because of a race condition in the firmware.
This fix pertains to adapters with the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, and #EN0K/#EN0L with CCIN 2CC1.
- A problem was fixed for not logging SRCs for certain cable pulls from the #EMX0 PCIe expansion drawer. With the fix, the previously undetected cable pulls are now detected and logged with SRC B7006A8B and B7006A88 errors.
- A problem was fixed for a rare system hang that can occur when a page of memory is being migrated. Page migration (memory relocation) can occur for a variety of reasons, including predictive memory failure, DLPAR of memory, and normal operations related to managing the page pool resources.
- A problem was fixed for running PCM on a system with SR-IOV adapters in shared mode that results in an "Incomplete" system state with certain hypervisor tasks deadlocked. This problem is rare and is triggered when using SR-IOV adapters in shared mode and gathering performance statistics with PCM (Performance Collection and Monitoring) and also having a low level error on an adapter. The only way to recover from this condition is to re-IPL the system.
- A problem was fixed for an SRC B7006A99 informational log now posted as a Predictive with a callout of the CXP cable FRU. This fix improves FRU isolation for cases where a CXP cable alert causes a B7006A99 that occurs prior to a B7006A22 or B7006A8B. Without the fix, the SRC B7006A99 is informational and the latter SRCs cause a larger hardware replacement even though the earlier event identified a probable cause for the cable FRU.
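Regarding the REST/Redfish return-code change for object creation noted above: a client that must work across firmware levels can simply treat both codes as success. The sketch below is illustrative only; the account, payload, and resource path are placeholders:
   status=$(curl -k -s -o /dev/null -w '%{http_code}' -u admin:<password> -X POST -H 'Content-Type: application/json' -d '<payload>' https://<service-processor>/redfish/v1/<collection-uri>)
   if [ "$status" = "200" ] || [ "$status" = "201" ]; then echo "create succeeded (HTTP $status)"; fi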
System firmware changes that affect certain systems
- On systems with an IBM i partition, a problem was fixed for only seeing 50% of the total Power Enterprise Pools (PEP) 1.0 memory that is provided. This happens when querying resource information via QAPMCONF which calls MATMATR 0x01F6. With the fix, an error is corrected in the IBM i MATMATR option 0X01F6 that retrieves the memory information for the Collection Services.
SC860_215_165 / FW860.81
2020/03/04
Impact: Security Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive: A problem was fixed for an HMC "Incomplete" state for a system after the HMC user password is changed with ASMI on the service processor. This problem can occur if the HMC password is changed on the service processor but not also on the HMC, and a reset of the service processor happens. With the fix, the HMC will get the needed "failed authentication" error so that the user knows to update the old password on the HMC.
SC860_212_165 / FW860.80
2019/12/17
Impact: Security Severity: SPE
New features and functions
- Support was added for improved security for the service processor password policy. For the service processor, the "admin", "hmc", and "general" passwords must be set on first use for newly manufactured systems and after a factory reset of the system. The REST/Redfish interface will return an error saying the user account is expired in these scenarios. This policy change helps ensure that the service processor is not left in a state with a well-known password. The user can change from an expired default password to a new password using the Advanced System Management Interface (ASMI).
- Support was added for real-time data capture for PCIe3 expansion drawer (#EMX0) cable card connection data via resource dump selector on the HMC or in ASMI on the service processor. Using the resource selector string of "xmfr -dumpccdata" will non-disruptively generate an RSCDUMP type of dump file that has the current cable card data, including data from cables and the retimers.
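As an illustration only (the exact HMC command options are not documented here and are an assumption to be confirmed against the startdump man page for the installed HMC level), a resource dump with this selector could be initiated from the HMC command line along these lines:
   startdump -m <managed-system> -t resource -r "xmfr -dumpccdata"
The same selector string can alternatively be entered through the resource dump panel in ASMI on the service processor.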
System firmware changes that affect all systems
- A problem was fixed for SR-IOV adapters to provide a consistent Informational message level for cable plugging issues. For transceivers not plugged on certain SR-IOV adapters, an unrecoverable error (UE) SRC B400FF03 was changed to an Informational message logged. This affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC3L/EC3M with CCIN 2CEC.
For copper cables unplugged on certain SR-IOV adapters, a missing message was replaced with an Informational message logged. This affects the SR-IOV adapters with the following feature codes and CCINs: #EN17/EN18 with CCIN 2CE4; and #EN0K/EN0L with CCIN 2CC1.
- The following problem related to SR-IOV was fixed: If the SR-IOV logical port's VLAN ID (PVID) is modified while the logical port is configured, the adapter will use an incorrect PVID for the Virtual Function (VF). This problem is rare because most users do not change the PVID once the logical port is configured, so they will not have the problem.
This fix updates adapter firmware to 10.2.252.1940 for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3; #EN17/EN18 with CCIN 2CE4; #EN0H/EN0J with CCIN 2B93; #EN0M/EN0N with CCIN 2CC0; and #EN0K/EN0L with CCIN 2CC1.
The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for Novalink failing to activate partitions that have names with character lengths near the maximum allowed character length. This problem can be circumvented by changing the partition name to have 32 characters or less.
- A problem was fixed where a Linux or AIX partition type was incorrectly reported as unknown. Symptoms include: IBM Cloud Management Console (CMC) not being able to determine the RPA partition type (Linux/AIX) for partitions that are not active; and HMC attempts to dynamically add CPU to Linux partitions may fail with an HSCL1528 error message stating that there are not enough Integrated Facility for Linux (IFL) cores for the operation.
- A problem was fixed for a possible system crash with SRC B7000103 if the HMC session is closed while the performance monitor is active. As a circumvention for this problem, make sure the performance monitor is turned off before closing the HMC sessions.
- A problem was fixed for a Live Partition Mobility (LPM) migration of a large memory partition to a target system that causes the target system to crash and for the HMC to go to the "Incomplete" state. For servers with the default LMB size (256MB), if a partition is >=16TB and if desired memory is different than the maximum memory, LPM may fail on the target system. Servers with LMB sizes less than the default could hit this problem with smaller memory partition sizes. A circumvention to the problem is to set the desired and maximum memory to the same value for the large memory partition that is to be migrated.
- A problem was fixed for system hangs or incomplete states displayed by HMC(s) caused by a loop in the handling of Segment Lookaside Buffer (SLB) cache memory parity errors where SRC B7005442 may be logged. This problem has a low frequency of occurrence as it requires severe errors in the SLB cache that are not cleared by an error flush of the entries. A re-IPL of the system can be used to recover from this error.
- A problem was fixed for a failed clock card causing a node to be guarded during the IPL of a multi-node system. With the fix, the redundant clock card allows all the nodes to IPL in the case of a single clock card failure.
System firmware changes that affect certain systems
- On systems with an IBM i partition, a problem was fixed for a D-mode IPL failure when using a USB DVD drive in an IBM 7226 multimedia storage enclosure. Error logs with SRC BA16010E, B2003110, and/or B200308C can occur. As a circumvention, an external DVD drive can be used for the D-mode IPL.
- On systems with a single node, a problem was fixed for unknowingly running at lower (the default) frequencies when changing into Fixed Max Frequency (FMF) mode. This problem is unlikely to happen because it requires that the system already be in FMF mode and that the user then request a change into FMF mode. This request is not handled correctly: the tunable parameters get reset to their defaults, which allows the processor frequency to be reduced to the minimum value. The recovery for this problem is to change the power mode to "Nominal" and then change it back to FMF.
- On systems with IBM i partitions, a rare problem was fixed for an intermittent failure of a DLPAR remove of an adapter. In most cases, a retry of the operation will be successful.
- On systems with Integrated Facility for Linux (IFL) processors and Linux-only partitions, a problem was fixed for Power Enterprise Pools (PEP) 1.0 not going back into "Compliance" when resources are moved from Server 1 to Server 2. This causes an expected "Approaching Out Of Compliance", but the pool does not automatically go back into compliance when the resources are no longer used on Server 1. As a circumvention, the user can do an extra "push" and "pull" of one resource to make the pool discover it is back in "Compliance".
- On systems with an IBM i partition, a problem was fixed for a possibly incorrect number of Memory COD (Capacity On Demand) resources shown when gathering performance data with IBM i Collection Services. Memory resources activated by Power Enterprise Pools (PEP) 1.0 will be missing from the data. An error was corrected in the IBM i MATMATR option 0X01F6 that retrieves the Memory COD information for the Collection Services.
SC860_205_165 / FW860.70
2019/06/18
Impact: Availability Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive: The following problems related to SR-IOV were fixed:
1) A problem was fixed for new or replacement SR-IOV adapters with feature codes EN15, EN16, EN17, and EN18 being rendered non-functional when moved to SR-IOV mode. This includes cards moved from dedicated device mode, newly installed adapters, and FRU replacements. This problem occurs when the adapter firmware is updated to the 10.2.252.x levels from 11.x adapter firmware levels.
2) A problem was fixed for certain SR-IOV adapters where SRC B400FF01 errors are seen during vNIC failovers and Live Partition Mobility (LPM) migration of vNIC clients. This may also result in errors seen in partitions (for example, some partitions may show LNC2ENT_TX_ERR).
3) A problem was fixed where network multicast traffic is not received by a SR-IOV logical port (VF) network interface for a Linux partition. The failure can occur when the partition transitions the network interface out of promiscuous or multicast promiscuous mode.
These fixes update adapter firmware to 10.2.252.1939 for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, and EN0L.
The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
- DEFERRED: PARTITION_DEFERRED: A problem was fixed for repeated CPU DLPAR remove operations by Linux (Ubuntu, SUSE, or RHEL) OSes possibly resulting in a partition crash. No specific SRCs or error logs are reported. The problem can occur on any DLPAR CPU remove operation if running on Linux. The occurrence is intermittent and rare. The partition crash may result in one or more of the following console messages (in no particular order):
1) Bad kernel stack pointer addr1 at addr2
2) Oops: Bad kernel stack pointer
3) ******* RTAS CALL BUFFER CORRUPTION *******
4) ERROR: Token not supported
This fix does not activate until there is a reboot of the partition.
- A problem was fixed for a loss of service processor redundancy if an attempt is made to boot from a corrupted flash side on the primary service processor. Although the primary service processor recovers, the backup service processor ends up stuck in the IPLing state. The backup service processor must be reset to recover from the IPL hang and restore service processor redundancy.
- A problem was fixed for an incorrect SRC B150F138 being logged against the backup service processor when service processor redundancy has been disabled. This SRC is logged at system run-time when the backup service processor is in the standby or termination state. This is the expected state of the backup service processor when redundancy is disabled, so no SRC should be logged; the SRC can be ignored.
- A problem was fixed for a PCIe Hub checkstop with SRC B138E504 logged that fails to guard the errant processor chip. With the fix, the problem hardware FRU is guarded so there is not a recurrence of the error on the next IPL.
- A problem was fixed for an incorrect SRC of B1810000 being logged when a firmware update fails because of Entitlement Key expiration. The error displayed on the HMC and in the OS is correct and meaningful. With the fix, for this firmware update failure the correct SRC of B181309D is now logged.
- A problem was fixed for informational logs flooding the error log if a "Get Sensor Reading" is not working.
- A problem was fixed for a Redfish (REST) Patch request for PowerSaveMode with an unsupported mode value returning an error code "500" instead of the correct error code of "400".
- A problem was fixed for a rare Live Partition Mobility migration hang with the partition left in VPM (Virtual Page Mode) which causes performance concerns. This error is triggered by a migration failover operation occurring during the migration state of "Suspended" and there has to be insufficient VASI buffers available to clear all partition state data waiting to be sent to the migration target. Migration failovers are rare and the migration state of "Suspended" is a migration state lasting only a few seconds for most partitions, so this problem should not be frequent. On the HMC, there will be an inability to complete either a migration stop or a recovery operation. The HMC will show the partition as migrating and any attempt to change that will fail. The system must be re-IPLed to recover from the problem.
- A problem was fixed for shared processor partitions going unresponsive after changing the processor sharing mode of a dedicated processor partition from "allow when partition is active" to either "allow when partition is inactive" or "never". This problem can be circumvented by avoiding disabling processor sharing when active on a dedicated processor partition. To recover if the issue has been encountered, enable "processor sharing when active" on the dedicated partition.
- A problem was fixed for an error in deleting a partition with the virtualized Trusted Platform Module (vTPM) enabled and SRC B7000602 logged. When this error occurs, the encryption process in the hypervisor may become unusable. The problem can be recovered from with a re-IPL of the system.
- A problem was fixed in Live Partition Mobility (LPM) of a partition to a shared processor pool, which results in the partition being unable to consume uncapped cycles on the target system. To prevent the issue from occurring, partitions can be migrated to the default shared processor pool and then dynamically moved to the desired shared processor pool. To recover from the issue, do one of the following four steps:
1) Either use DLPAR to add or remove a virtual processor to/from the affected partition;
2) or dynamically move the partition between shared processor pools;
3) or reboot the partition;
4) or re-IPL the system.
- A problem was fixed for a boot failure using a N_PORT ID Virtualization (NPIV) LUN for an operating system that is installed on a disk of 2 TB or greater, and having a device driver for the disk that adheres to a non-zero allocation length requirement for the "READ CAPACITY 16". The IBM partition firmware had always used an invalid zero allocation length for the return of data and that had been accepted by previous device drivers. Now some of the newer device drivers are adhering to the specification and needing an allocation length of non-zero to allow the boot to proceed.
- A problem was fixed for a clock card failure with SRC B158CC62 logged calling out the wrong clock card and not calling out the cable and system backplane as needed. This fix does not add processors to the callout but in some cases the processor has also been identified as the cause of the clock card failure.
- A problem was fixed for failing to boot from an AIX mksysb backup on a USB RDX drive with SRCs logged of BA210012, AA06000D, and BA090010. The problem trigger is a boot attempt from the RDX device. The boot error does not occur if a serial console is used to navigate the SMS menus.
- A problem was fixed for a system IPLing with an invalid time set on the service processor that causes partitions to be reset to the Epoch date of 01/01/1970. With the fix, on the IPL, the hypervisor logs a B700120x when the service processor real time clock is found to be invalid and halts the IPL to allow the time and date to be corrected by the user. The Advanced System Management Interface (ASMI) can be used to correct the time and date on the service processor. On the next IPL, if the time and date have not been corrected, the hypervisor will log an SRC B7001224 (indicating the user was warned on the last IPL) and allow the partitions to start, but the time and date will be set to the Epoch value.
- A security problem was fixed in the service processor Network Security Services (NSS) services which, with a man-in-the-middle attack, could provide false completion or errant network transactions or exposure of sensitive data from intercepted SSL connections to ASMI, Redfish, or the service processor message server. The Common Vulnerabilities and Exposures issue number is CVE-2018-12384.
- A problem was fixed for a hypervisor task getting deadlocked if partitions are powered on at the same time that SR-IOV is being configured for an adapter. With this problem, workloads will continue to run but it will not be possible to change the virtualization configuration or power partitions on and off. This error can be recovered by doing a re-IPL of the system.
- A problem was fixed for hypervisor tasks getting deadlocked that cause the hypervisor to be unresponsive to the HMC (this shows as an Incomplete state on the HMC) with SRC B200F011 logged. This is a rare timing error. With this problem, OS workloads will continue to run but it will not be possible for the HMC to interact with the partitions. This error can be recovered by doing a re-IPL of the system with a scheduled outage.
- A problem was fixed for false indication of a real time clock (RTC) battery failure with SRC B15A3305 logged. This error happens infrequently. If the error occurs, and another battery failure SRC is not logged within 24 hours, ignore the error as it was caused by a timing issue in the battery test.
System firmware changes that affect certain systems
- DEFERRED: On systems with a PCIe3 I/O expansion drawer (#EMX0), a problem was fixed for the PCIe3 I/O expansion drawer links to improve stability. Intermittent training failures on the links occurred during the IPL with SRC B7006A8B logged. With the fix, the link settings were changed to lower the peak link signal amplification and bring the signal level into the middle of the operating range, improving the margin and reducing link training failures. The system must be re-IPLed for the fix to activate.
- On a system with an IBM i partition, a problem was fixed for a DLPAR force-remove of a physical IO adapter from an IBM i partition and a simultaneous power off of the partition causing the partition to hang during the power off. To recover the partition from the error, the system must be re-IPLed. This problem is rare because there is only a 2-second timing window for the DLPAR and power off to interfere with each other.
- On a system with an active IBM i partition, a problem was fixed for a SPCN firmware download to the PCIe3 I/O expansion drawer (feature #EMX0) Chassis Management Card (CMC) that could possibly get stuck in a pending state. This failure is very unlikely as it would require a concurrent replacement of the CMC card that is loaded with a SPCN level that is older than 2015 (01MEX151012a). The failure with the SPCN download can be corrected by a re-IPL of the system.
- On a system with an AMS (Active Memory Sharing) partition, a problem was fixed for a Live Partition Mobility (LPM) migration failure when migrating from P9 to a pre-FW860 P8 or P7 system. This failure can occur if the P9 partition is in dedicated memory mode, and the Physical Page Table (PPT) ratio is explicitly set on the HMC (rather than keeping the default value) and the partition is then transitioned to AMS mode prior to the migration to the older system. This problem can be avoided by using dedicated memory in the partition being migrated back to the older system.
- On a system with a vNIC configuration with multiple backing Virtual Functions (VFs), a problem was fixed for a backing VF failure after a sequence of repeated failovers where one of the VF backing devices goes to a powered off state. This problem is infrequent and only occurs after many vNIC failovers. A reboot of the partition with the affected VF will recover it.
- On systems with PCIe3 expansion drawers (feature code #EMX0), a problem was fixed for a UE B700BA01 logged after a FRU was replaced in the PCIe Expansion drawer. The log should have been informational instead of unrecoverable because it is normal to have this log for a replaced part in the expansion drawer that has a different serial number from the old part. If a part in the expansion drawer has been replaced, the UE error log can be ignored.
- On systems with IBM i partitions, a problem was fixed for Live Partition Mobility (LPM) migrations that could have incorrect hardware resource information (related to VPD) in the target partition if a failover had occurred for the source partition during the migration. This failover would have to occur during the Suspended state of the migration, which only lasts about a second, so this should be rare. With the fix, at a minimum the migration error will be detected to abort the migration so it can be restarted. And at a later IBM i OS level, the fix will allow the migration to complete even though the failover has occurred during the Suspended state of the migration.
- On systems with PCIe3 expansion drawers (feature #EMX0), a problem was fixed for PCI link recovery failure during a PCI Host Bridge (PHB) reset with SRCs of B7006A80, B7006A22, B7006A8B, and B7006970 logged. This causes the cable card to fail, losing all slots in the expansion drawer. This is a rare problem. If this error occurs, a concurrent maintenance operation could reboot the expansion drawer or a re-IPL of the system could be done to recover the drawer.
- On systems with an IBM i partition with greater than 9999 GB installed, a problem was fixed for On/Off COD memory-related amounts not being displayed correctly. This only happens when retrieving the On/Off COD numbers via a particular IBM i MATMATR MI command option value.
- On systems with PCIe3 expansion drawers (feature code #EMX0), a problem was fixed for a concurrent exchange of a PCIe expansion drawer cable card that, although successful, leaves the fault LED turned on.
- A problem was fixed for shared processor pools where uncapped shared processor partitions placed in a pool may not be able to consume all available processor cycles. The problem may occur when the sum of the allocated processing units for the pool member partitions equals the maximum processing units of the pool.
SC860_180_165 / FW860.60
2018/10/31
Impact: Availability Severity: SPE
System firmware changes that affect all systems
- A security problem was fixed in the Dynamic Host Configuration Protocol (DHCP) client on the service processor for an out-of-bound memory access flaw that could be used by a malicious DHCP server to crash the DHCP client process. The Common Vulnerabilities and Exposures issue number is CVE-2018-5732.
- A problem was fixed for certain hypervisor error logs being slow to report to the OS. The error logs affected are those created by the hypervisor immediately after the hypervisor is started and when there are more than 128 error logs from the hypervisor to be reported. The error logs at the end of the queue take a long time to be processed, which may make it appear as if error logs are not being reported to the OS.
- A problem was fixed for an IPL system termination with SRC B181345A logged. This is an infrequent problem related to a time-out in the synchronization of data to the backup service processor. The problem can be recovered from by a re-IPL of the system.
- A problem was fixed for the periodic guard reminder function to not re-post error logs of failed FRUs on each IPL. Instead, a reminder SRC is created to call home the list of FRUs that have failed and require service. This puts the system back to the original behavior of posting only one error log for each FRU that has failed.
- A problem was fixed for the Advanced System Management Interface being unable to show the details of a clock card error log without failing with an SRC B1818A12. This is a very infrequent problem that requires the failing error log entry to be truncated at exactly the maximum size of an error log entry.
- For an HMC-managed system, a problem was fixed for a rare, intermittent NetsCMS core dump that could occur whenever the system is doing a deferred shutdown power off. There is no impact to normal operations as the power off completes, but there are extra error logs with SRC B181EF88 and a service processor dump.
- A problem was fixed for the Redfish "Manager" request returning duplicate object URIs for the same HMC. This can occur if the HMC was removed from the managed system and then later added back in. The Redfish objects for the earlier instances of the same HMC were never deleted on the remove.
- Hardware data collection performance was improved for platform-level dumps.
- A problem was fixed for a service processor reset that can occur after 30 or more Administrative Failovers to the backup service processor without an AC power cycle or soft reset of the service processor. After a large number of failovers, a memory leak causes an out-of-memory condition on the service processor. There is no impact to normal operations as the reset causes an error failover to the backup service processor that is successful.
- A problem was fixed for an enclosure fault LED being stuck on after a repair of a fan. This problem only occurs after the second concurrent repair of a fan.
- A problem was fixed for a concurrent EMX0 PCIe3 expansion CXP (120 Gb/s 12x Small Form-factor Pluggable) cable adapter add or repair that fails with a hypervisor 0x030A error after a previous add or repair failure. The affected CXP cable adapter has feature code #EJ07. A system IPL will recover from the problem.
- A problem was fixed for a dedicated processor partition hanging during a shutdown. This is a very rare problem with only a small timing window in the shutdown that can cause the hang.
- A problem was fixed for a Novalink enabled partition not being able to release master from the HMC that results in error HSCLB95B. To resolve the issue, run a rebuild managed server operation on the HMC and then retry the release. This occurs when attempting to release master from HMC after the first boot up of a Novalink enabled partition if Master Mode was enforced prior to the boot.
- A problem was fixed for resource dumps that use the selector "iomfnm" and options "rioinfo" or "dumpbainfo". This combination of options for resource dumps always fails without the fix.
- A problem was fixed for a Virtual Network Interface Controller (vNIC) client adapter to prevent a failover when disabling the adapter from the HMC. A failover to a new backing device could cause the client adapter to erroneously appear to be active again when it is actually disabled. This causes confusion and failures on the OS for the device driver. This problem can only occur when there is more than a single backing device for the vNIC adapter and commands are issued from the HMC to disable and then enable the adapter.
- A problem was fixed for all variants (this was partially fixed in an earlier release) so that SR-IOV adapter firmware updates using the HMC GUI or CLI reboot only one SR-IOV adapter at a time. If multiple adapters are updated at the same time, the HMC error message HSCF0241E may occur: "HSCF0241E Could not read firmware information from SR-IOV device ...". This fix prevents the system network from being disrupted by the SR-IOV adapter updates when redundant configurations are being used for the network. The problem can be circumvented by using the HMC GUI to update the SR-IOV firmware one adapter at a time using the following steps:
https://www.ibm.com/support/knowledgecenter/en/POWER8/p8efd/p8efd_updating_sriov_firmware.htm
- A problem was fixed for the callout of SRC BA188002 so it does not display three trailing extra garbage characters in the location code for the FRU. The string is correct up to the line ending white space, so the three extra characters after that should be ignored. This problem is intermittent and does not occur for all BA188002 error logs.
- A problem was fixed for when booting a large number of LPARs with Virtual Trusted Platform Module (vTPM) capability, some partitions may post a SRC BA54504D time-out for taking too long to start. With the fix, the time allowed to boot a vTPM LPAR is increased. If a time-out occurs, the partition can be booted again to recover. The problem can be avoided by auto-starting fewer vTPM LPARs, or booting them a couple at a time to prevent flooding the vTPM device server with requests that will slow the boot time while the LPARs wait on the vTPM device server responses.
- A problem was fixed for SMS menus to limit reporting on the NPIV and vSCSI configuration to the first 511 LUNs. Without the fix, LUN 512 through the last configured LUN report with invalid data. Configurations in excess of 511 LUNs are very rare, and it is recommended for performance reasons (to be able to search for the boot LUN more quickly) that the number of LUNs on a single target be limited to less than 512.
- The following two errors in the SR-IOV adapter firmware were fixed: 1) The adapter resets and there is a B400FF01 reference code logged. This error happens in rare cases when there are multiple partitions actively running traffic through the adapter. System firmware resets the adapter and recovers the system with no user intervention required; 2) SR-IOV VFs with defined VLANs and an assigned PVID are not able to ping each other.
This fix updates adapter firmware to 10.2.252.1933, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, and EN0L.
The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for an IPL that ends with the HMC in the "Incomplete" state with SRCs B182951C and A7001151 logged. Partitions may start and can continue to run without the HMC services available. To recover the HMC session, a re-IPL of the system is needed; however, partition workloads can continue running uninterrupted until the system is intentionally re-IPLed at a scheduled time. The frequency of this problem is very low.
- A problem was fixed for Live Partition Mobility (LPM) failing along with other hypervisor tasks, but the partitions continue to run. This is an extremely rare failure where a re-IPL is needed to restore HMC or Novalink connections to the partitions, or to do any system configuration changes.
- A problem was fixed for partition SMS menus to display certain network adapters that were unviewable and not usable as boot and install devices after a microcode update. The problem network adapter is still present and usable at the OS. The adapters with this problem have the following feature codes: EN0A, EN0B, EN0H, EN0J, EN0K, EN0L, EN15, EN16, EN17, and EN18.
- A problem was fixed for platform dumps failing for HWPROC checkstops, causing the system to terminate instead of re-IPLing after the processor failure. To recover, the system can be powered off and then IPLed. Any problem hardware will be guarded during the IPL to allow normal system operations.
System firmware changes that affect certain systems
- On a system with an AIX partition, a problem was fixed for a partition time jump that could occur after doing an AIX Live Update. This problem could occur if the AIX Live Update happens after a Live Partition Mobility (LPM) migration to the partition. AIX applications using the timebase facility could observe a large jump forwards or backwards in the time reported by the timebase facility. A circumvention to this problem is to reboot the partition after the LPM operation prior to doing the AIX Live Update. An AIX fix is also required to resolve this problem. The issue will no longer occur when this firmware update is applied on the system that is the target of the LPM operation and the AIX partition performing the AIX Live Update has the appropriate AIX updates installed prior to doing the AIX Live Update.
- For a shared memory partition, a problem was fixed for a Live Partition Mobility (LPM) migration hang after a Mover Service Partition (MSP) failover in the early part of the migration. To recover from the hang, a migration stop command must be given on the HMC. Then the migration can be retried.
- For a shared memory partition, a problem was fixed for a Live Partition Mobility (LPM) migration failing into an indeterminate state. This can occur if the Mover Service Partition (MSP) has a failover while the migrating partition is in the "Suspended" state. To recover from this problem, the partition must be shut down and restarted.
- On a system attached to a Cloud Management Console (CMC) via a Cloud Connector on the HMC, a problem was fixed for Redfish queries to the service processor resulting in memory leaks and out of memory (OOM) resets of the service processor.
|
SC860_165_165 / FW860.51
2018/05/22 |
Impact: Security Severity: SPE
Response for Recent Security Vulnerabilities
- DISRUPTIVE: In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue number CVE-2018-3639. In addition, Operating System updates are required in conjunction with this FW level for CVE-2018-3639.
|
SC860_160_056 / FW860.50
2018/05/03 |
Impact: Availability Severity: SPE
New features and functions
- Support was added to allow V9R910 and later HMC levels to query Live Partition Mobility (LPM) performance data after an LPM operation.
- Support was added to the Advanced System Management Interface (ASMI) to provide customer control over speculative execution in response to CVE-2017-5753 and CVE-2017-5715 (collectively known as Spectre) and CVE-2017-5754 (known as Meltdown). The ASMI "System Configuration/Speculative Execution Control" provides two options that can only be set when the system is powered off:
1) Speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks. This mode is designed for systems that need to mitigate exposures of the hypervisor, operating systems, and user application data to untrusted code. This mode is set as the default.
2) Speculative execution fully enabled: This optional mode is designed for systems where the hypervisor, operating system, and applications can be fully trusted.
Note: Enabling this option could expose the system to CVE-2017-5753, CVE-2017-5715, and CVE-2017-5754. This includes any partitions that are migrated (using Live Partition Mobility) to this system.
- Support was added to allow a periodic data capture from the PCIe3 I/O expansion drawer (with feature code #EMX0) cable card links.
- On systems with an IBM i partition, support was added for multipliers for IBM i MATMATR fields that are limited to four characters. When retrieving Server metrics via IBM MATMATR calls, and the system contains greater than 9999 GB, for example, MATMATR has an architected "multiplier" field such that 10,000 GB can be represented
by 5,000 GB * Multiplier of 2, so '5000' and '2' are returned in the quantity and multiplier fields, respectively, to handle these extended values. The IBM i OS also requires a PTF to support the MATMATR field multipliers.
- On systems with redundant service processors, a health check was added for the state of the secondary service processor to verify that it matches the state of the primary service processor. If the state of the secondary service processor is an unexpected value, such as in termination, an SRC is logged and a call home is done for the service processor FRU that has failed.
System firmware changes that affect all systems
- DEFERRED: A problem was fixed for a PCIe3 I/O expansion drawer (with feature code #EMX0) where control path stability issues may cause certain SRCs to be logged. Systems using copper cables may log SRC B7006A87 or similar SRCs, and the fanout module may fail to become active. Systems using optical cables may log SRC of B7006A22 or similar SRCs. For this problem, the errant I/O drawer may be recovered by a re-IPL of the system.
- A problem was fixed for error logs being collected twice by the HMC, potentially causing an extra call home for an issue that was already resolved. This problem was caused by a failover to the backup service processor whose error log was missing the acknowledgement from the HMC that error logs had been collected. This resulted in the error logs being copied onto the HMC as PELs for a second time.
- A problem was fixed in which deconfigured-resource records can become malformed and cause the loss of the service processor for both redundant and non-redundant service processor systems. These failures can occur during or after firmware updates to the FW860.40, FW860.41, or FW860.42 levels. The complete loss of the service processor results in the loss of HMC (or FSP stand-alone) management of the server and loss of any further error logging. The server itself will continue to run. Without the fix, the loss of the service processor could happen within one month of the deconfiguration records being encountered. It is highly recommended to install the fix. Recovery from the problem, once encountered, requires a full server AC power cycle and clearing of deconfiguration records to avoid reoccurrence. Clearing deconfiguration records exposes the server to repeat hardware failures and possible unplanned outages.
- A problem was fixed for the guard reminder processing of guarded FRUs and error logs that can cause a system power off to hang and time out with a service processor reset.
- A problem was fixed for a system termination that can occur when doing a concurrent code update from the FW860.30 level with a clock card deconfigured in the system. Without the fix, this problem can be avoided by repairing the clock card prior to the code update or by doing a disruptive code update.
- A problem was fixed for a Coherent Accelerator Processor Proxy (CAPP) unit hardware failure that caused a hypervisor hang with SRC B7000602. This failure is very rare and can only occur during the early IPL of the hypervisor, before any partitions are started. A re-IPL will recover from the problem.
- A problem was fixed for a Live Partition Mobility migration hang that could occur if one of its VIOS Mover Service Partitions (MSPs) goes into a failover at the start of the LPM operation. This problem is rare because it requires a MSP error to force a MSP failover at the very start of the LPM migration to get the LPM timing error. The LPM hang can be recovered by using the "migrlpar -o s" and "migrlpar -o r" commands on the HMC.
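For illustration only, assuming a hypothetical source managed system named "src_sys" and a migrating partition named "lpar1", the stop and recover operations referenced above could be issued from the HMC command line as follows:
migrlpar -o s -m src_sys -p lpar1
migrlpar -o r -m src_sys -p lpar1
Consult the HMC command reference for the complete migrlpar syntax before running these commands.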
- A problem was fixed for incorrect low affinity scores for a partition reported from the HMC "lsmemopt" command when a partition has filled an entire drawer. A low score indicates the placement is poor but in this case the placement is actually good. More information on affinity scores for partitions and the Dynamic Platform Optimizer can be found at the IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hat/p8hat_dpoovw.htm.
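As a usage sketch (the managed system name "sys1" is hypothetical), the current per-partition affinity scores referenced above can be queried from the HMC command line, for example:
lsmemopt -m sys1 -o currscore -r lpar
The exact options may vary by HMC level; see the lsmemopt documentation in the IBM Knowledge Center link above for details.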
- A problem was fixed to allow the management console to display the Active Memory Mirroring (AMM) licensed capability. Without the fix, the AMM licensed capability of a server will always show as "off" on the management console, even when it is present.
- A problem was fixed for a rare hypervisor hang for systems with shared processors with a sharing mode of uncapped. If this hang occurs, all partitions of the system will become unresponsive and the HMC will go to an "Incomplete" state.
- A problem was fixed for a Live Partition Mobility migration abort that could occur if one of its VIOS Mover Service Partitions (MSPs) goes into a failover during the LPM operation. This problem is rare because it requires a MSP error to force a MSP failover during the LPM migration to get the LPM timing error. The LPM abort can be recovered by retrying the LPM migration.
- A problem was fixed for the FRU callouts for the BA188001 and BA188002 EEH errors to include the PCI Host Bridge (PHB) FRU which had been excluded. For the P8 systems, these rare errors will more typically isolate to the processor instead of the adapter or slot planar. In the pre-P8 systems, the I/O planar also included the PHB, but for P8 systems, the PHB was moved to the processor complex.
- A problem was fixed for an internal error in the SR-IOV adapter firmware that resets the adapter and logs a B400FF01 reference code. This error happens in rare cases when there are multiple partitions actively running traffic through the adapter and a subset of the partitions are shutdown hard. The error causes a temporary disruption of traffic but recovery from the error is automatic with no user intervention needed.
This fix updates adapter firmware to 10.2.252.1931, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, and EN0L.
The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for the wrong Redfish method (PATCH or POST) passed for a valid Uniform Resource Identifier (URI) causing an incorrect error message of "501 - Not Implemented". With the fix, the message returned is "Invalid Method on URI", which is more helpful to the user.
- A problem was fixed for SRC call home reminders for bad FRUs causing service processor dumps with SRC B181E911 and reset/reloads. This occurred if the FRU callout was missing a CCIN number in the error log. This can happen because some error logs only have "Symbolic FRUs" and these were not being handled correctly.
- A problem was fixed for a PCIe3 I/O expansion drawer (with feature code #EMX0) failing to initialize during the IPL with a SRC B7006A88 logged. The error is infrequent. The errant I/O drawer can be recovered by a re-IPL of the system.
- A problem was fixed for SR-IOV adapter firmware updates using the HMC GUI or CLI so that only one SR-IOV adapter is rebooted at a time. If multiple adapters are updated at the same time, the HMC error message HSCF0241E may occur: "HSCF0241E Could not read firmware information from SR-IOV device ...". This fix prevents the system network from being disrupted by the SR-IOV adapter updates when redundant configurations are being used for the network. The problem can be circumvented by using the HMC GUI to update the SR-IOV firmware one adapter at a time using the steps at the following link:
https://www.ibm.com/support/knowledgecenter/en/8247-22L/p8efd/p8efd_updating_sriov_firmware.htm
System firmware changes that affect certain systems
- On systems with a shared processor pool, a very rare problem was fixed for the hypervisor not responding to partition requests such as power off and Live Partition Mobility (LPM). This error is caused by a hung request to guard a failed processor when there are no available spare processors.
- On systems with mirrored memory running IBM i partitions, a problem was fixed for un-mirrored nodal memory errors in the partition that also caused the system to crash. With the fix, the memory failure is isolated to the impacted partition, leaving the rest of the system unaffected. This fix improves on an earlier fix delivered for IBM i memory errors in FW840.60 by handling the errors in nodal memory.
- On systems with Huge Page (16 GB) memory enabled for an AIX partition, a problem was fixed for the OS failing to boot with an SRC 0607 displayed. This error occurs on systems with FW860.40, FW860.41, or FW860.42 installed. To circumvent the problem, disable Huge Pages for the AIX partition. For information on viewing and setting values for AIX huge-page memory allocation, see the following link in the IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hat/p8hat_aixviewhgpgmem.htm
- On systems with an IBM i partition, a problem was fixed for 64 bytes overwritten in a portion of the IBM i Main Storage Dump (MSD). Approximately 64 bytes are overwritten just beyond the 17 MB (0x11000000) address on P8 systems. This problem is cosmetic as the dump is still readable for problem diagnostics and no customer operations are affected by it.
- On systems with a partition with a Fibre Channel Adapter (FCA) or a Fibre Channel over Ethernet (FCoE) adapter, a problem was fixed for bootable disks attached to the FCA or FCoE adapter not being seen in the System Management Services (SMS) menus for selection as boot devices. This problem is likely to occur if the only I/O device in the partition is a FCA or FCoE adapter. If other I/O devices are present, the problem may still occur if the FCA or FCoE is the first adapter discovered by SMS. A work-around to this problem is to define a virtual Ethernet adapter in the partition profile. The virtual adapter does not need to have any physical backing device, as just having the VLAN defined is sufficient to avoid the problem. The FCA has feature codes #EN0A, #EN0B, #EN0F, #EN0G, #EN0Y, #EN12, #5729, #5774, #5735, and #5723. The FCoE adapter has feature codes #5708, #EN0H, #EN0J, #EN0K, and #EN0L.
- On systems with a partition with a USB 3.0 controller, a problem was fixed for a partition boot failure. The affected USB 3.0 controller adapter cards have feature code #EC45 or #EC46. The boot failure is triggered by a fault in the USB controller, but instead of just the USB controller failing, the entire partition fails. With the fix, the failure is limited to the USB controller.
- On a system in a Power Enterprise Pool (PEP) with Mobile Resources, a problem was fixed for Mobile Resources not being restored after an IPL. The missing resources can be started temporarily with Trial CoD or other methods, or the PEP recovery steps can be used to restore the Mobile Resources. For more information, see the Change CoD Pool command on the HMC: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8edm/chcodpool.html.
|
SC860_138_056 / FW860.42
2018/01/09 |
Impact: Security Severity: SPE
New features and functions
- In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue numbers CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754. Operating System updates are required in conjunction with this FW level for CVE-2017-5753 and CVE-2017-5754.
|
SC860_127_056 / FW860.41
2017/12/08 |
Impact: Availability Severity: SPE
System firmware changes that affect certain systems
- On systems using PowerVM firmware that are co-managed with HMC and PowerVM NovaLink, a problem was fixed for the HMC going into the Incomplete state after deleting a NovaLink partition or after using the HMC "chsyscfg powervm_mgmt_capable=0" command to remove the NovaLink attribute from a partition. Partitions will continue running but cannot be changed by the management console, and Live Partition Mobility (LPM) will not function in this state. A power off of the system will remove it from the Incomplete state, but the NovaLink partition will not have been deleted. To force the delete of the NovaLink partition or partitions without the fix, erase the service processor NVRAM and then restore the HMC partition data.
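For illustration only, the attribute change referenced above is typically applied with the full chsyscfg syntax; assuming a hypothetical managed system named "sys1" and partition named "novalink1":
chsyscfg -r lpar -m sys1 -i "name=novalink1,powervm_mgmt_capable=0"
Verify the command against the HMC documentation for the installed HMC level before use.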
- On systems using PowerVM firmware with PowerVM NovaLink, a problem was fixed for the HMC going into the incomplete state when restoring HMC profile data after deleting a NovaLink partition. This fix will prevent but not repair the problem once it has occurred. Recovery from the problem is to erase the service processor NVRAM and then restore the HMC partition data.
|
SC860_118_056 / FW860.40
2017/11/08 |
Impact: Availability Severity: SPE
New features and functions
- Support was added to the Advanced System Management Interface (ASMI) for providing an "All of the above" cable validation display option so that each individual cable option does not have to be selected to get a full report on the cable status. Select "System Service Aids -> Cable Validation -> Display Cable Status" "All of the above" and click "Continue" to see the status of all the cables.
System firmware changes that affect all systems
- A problem was fixed for recovery from clock card loss of lock failures that resulted in a clock card FRU unnecessarily being called out for repair. This error happened whenever there was a loss of lock (PLL or CRC) for the clock card. With the fix, the firmware will not be calling out the failing clock card, but rather it will be reconfigured as the new backup clock card after doing a clock card failover. Customers will see a benefit from improved system availability by the avoidance of disruptive clock card repairs.
- A problem was fixed for the "Minimum code level supported" not being shown by the Advanced System Management Interface (ASMI) when selecting the "System Configuration/Firmware Update Policy" menu. The message shown is "Minimum code level supported value has not been set". The workaround to find this value is to use the ASMI command line interface with the "registry -l cupd/MinMifLevel" command.
- A problem was fixed for "sh: errl: not found " error messages to the service processor console whenever the Advanced System Management Interface (ASMI) was used to display error logs. These messages did not cause any problems except to clutter the console output as seen in the service processor traces.
- A problem was fixed for the LineInputVoltage and LastPowerOutputWatts being displayed in millivolts and milliwatts, respectively, instead of volts and watts for the output from the Redfish API for power properties for the chassis. The URL affected is the following: "https://<fsp ip>/redfish/v1/Chassis/<id>/Power"
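As a usage sketch, the corrected values can be read with any HTTPS client; for example, using curl with placeholder credentials and the <fsp ip> and <id> placeholders shown above:
curl -k -u <userid>:<password> https://<fsp ip>/redfish/v1/Chassis/<id>/Power
With the fix, properties such as LineInputVoltage and LastPowerOutputWatts in the returned JSON are reported in volts and watts.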
- A problem was fixed for system node fans going to maximum RPM speeds after a service processor failover that needed the On-Chip Controllers (OCC) to be reloaded. Without the fix, the system node fan speeds can be restored to normal speed by changing the Power Mode in the Advanced System Management Interface using steps from the IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/areaa_pmms.htm. After changing the Power Mode, wait about 10 minutes to change the Power Mode back to the original setting.
If the fix is applied without rebooting the system, the system node fan speeds can be corrected by either changing the Power Mode as above or using the HMC to do an Administrative Failover (AFO).
- A problem was fixed for a Power Supply Unit (PSU) failure of SRC 110015xF logged with a power supply fan call out when doing a hot re-plug of a PSU. The power supply may be made operational again by doing a dummy replace of the PSU that was called out (keeping the same PSU for the replace operation). A re-IPL of the system will also recover the PSU.
- A problem was fixed for the service processor low-level boot code always running off the same side of the flash image, regardless of what side has been selected for boot (P-side or T-side). Because this low-level boot code rarely changes, this should not cause a problem unless corruption occurs in the flash image of the boot code. This problem does not affect firmware side-switches as the service processor initialization code (higher-level code than the boot code) is running correctly from the selected side. Without the fix, there is no recovery for boot corruption for systems with a single service processor as the service processor must be replaced.
- A problem was fixed for a missing serviceable event from a periodic call home reminder. This occurred if there was an FRU deconfigured for the serviceable event.
- A problem was fixed for help text in the Advanced System Management Interface (ASMI) not informing the user that system fan speeds would increase if the system Power Mode was changed to "Fixed Maximum Frequency" mode. If ASMI panel function "System Configuration->Power Management->Power Mode Setup" "Enable Fixed Maximum Frequency mode" help is selected, the updated text states "...This setting will result in the fans running at the maximum speed for proper cooling."
- A problem was fixed for a degraded PCI link causing a Predictive SRC for a non-cacheable unit (NCU) store time-out that occurred with SRC B113E540 or B181E450 and PRD signature "(NCUFIR[9]) STORE_TIMEOUT: Store timed out on PB". With the fix, the error is changed to be an Informational as the problem is not with the processor core and the processor should not be replaced. The solution for degraded PCI links is different from the fix for this problem, but a re-IPL of the CEC or a reset of the PCI adapters could help to recover the PCI links from their degraded mode.
- A problem was fixed for a Redfish Patch on the "Chassis" "HugeDynamicDMAWindowSlotCount" property so that incorrect values are validated. Without the fix, the user will not get proper error messages when providing bad values in the patch.
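As an illustration only (the target URI and JSON body shape shown here are assumptions, not a documented payload), such a PATCH could be attempted with curl as follows:
curl -k -u <userid>:<password> -X PATCH -H "Content-Type: application/json" -d '{"HugeDynamicDMAWindowSlotCount": 2}' https://<fsp ip>/redfish/v1/Chassis/<id>
With the fix, an out-of-range value in the request body returns a proper validation error message.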
System firmware changes that affect certain systems
- DEFERRED: On systems using PowerVM firmware, a problem was fixed for DPO (Dynamic Platform Optimizer) operations taking a very long time and impacting the server with performance degradation. The problem is triggered by a DPO operation being done on a system with unlicensed processor cores and a very high I/O load. The fix uses a different lock type for the memory relocation activities (to prevent lock contention between memory relocation threads and partition threads) that is created at IPL time, so an IPL is needed to activate the fix. More information on the DPO function can be found at the IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/8247-42L/p8hat/p8hat_dpoovw.htm
- On systems using PowerVM firmware, a problem was fixed for an intermittent service processor core dump and a callout for netsCommonMSGServer with SRC B181EF88. The HMC connection to the service processor automatically recovers with a new session.
- On systems using PowerVM firmware, a problem was fixed where the Power Enterprise Pool (PEP) grace period expired early, being short by one hour. For example, 71 hours may be provided instead of 72 hours in some cases. See https://www.ibm.com/support/knowledgecenter/en/POWER8/p8ha2/entpool_cod_compliance.htm for more information about the PEP grace period.
- On systems using PowerVM firmware, a problem was fixed for a concurrent firmware update failure with HMC error message "E302F865-PHYPTooBusyToQuiesce". This error can occur when the error log is full on the hypervisor and it cannot accept more error logs from the service processor. But the service processor keeps retrying the send of an error log, resulting in a "denial of service" scenario where the hypervisor is kept busy rejecting the error logging attempts. Without the fix, the problem may be circumvented by starting a logical partition (if none are running) or by purging the error logs on the service processor.
- On systems using PowerVM firmware with mirrored memory running IBM i partitions, a problem was fixed for memory fails in the partition that also caused the system to crash. The system failure will occur any time that IBM i partition memory towards the beginning of the partition's assigned memory fails. With the fix, the memory failure is isolated to the impacted partition, leaving the rest of the system unaffected.
- On systems using PowerVM firmware, a problem was fixed for failures deconfiguring SR-IOV Virtual Functions (VFs). This can occur during Live Partition Mobility (LPM) migrations with HMC error messages of HSCLAF16, HSCLAF15 and HSCLB602 shown. This results in an LPM migration failure and a system reboot is required to recover the VFs for the I/O adapters. This error may occur more frequently in cases where the I/O adapter has pending I/O at the time of the deconfigure request for the VF.
- On systems using PowerVM firmware, a problem was fixed for a vNIC client that has backing devices being assigned an active server that was not the one intended by an HMC user failover for the client adapter. This can only happen if the vNIC client adapter had never been activated. A circumvention is to activate the client OS and initialize the vNIC device (ifconfig "xxx" up), and an active backing device will then be selected.
- On systems using PowerVM firmware, a problem was fixed for partitions with more than 32TB memory failing to IPL with memory space errors. This can occur if the logical memory block (LMB) size is small as there is a memory loss associated with each LMB. The problem can be circumvented by reducing the amount of partition memory or increasing the LMB size to reduce the total number of LMBs needed for the memory allocation.
- On systems using PowerVM firmware, a problem was fixed for the error handling of EEH events for the SR-IOV Virtual Functions (VFs) that can result in IPL failure with B7006971, B400FF05, and BA210000 SRCs logged. In these cases, the partition console stops at an OFDBG prompt. Also, a DLPAR add of a VF may result in a partition crash due to a 300 DSI exception because of a low-level EEH event. A circumvention for the problem would be to debug the EEH events which should be recovered errors and eliminate the cause of the EEH events. With the fix, the EEH events still log Predictive Errors but do not cause a partition failure.
- On systems using PowerVM firmware, a problem was fixed for Power Enterprise Pool (PEP) "not applicable" error messages being displayed when re-entering PEP XML files for PEP updates, in which one of the XML operations calls for Conversion of Perm Resources to PEP Resources. There is no error as the PEP key was accepted on the first use. The following message may be seen on the HMC and can be ignored: "...HSCL0520 A Mobile CoD processor conversion code to convert 0 permanently activated processors to Mobile CoD processors on the managed system has been entered. HSCL050F This CoD code is not valid for your managed system. Contact your CoD administrator."
- On systems using PowerVM firmware, a problem was fixed for Power Enterprise Pool (PEP) busy errors from the system anchor card when creating or updating a PEP pool. The error returned by the HMC is "HSCL9015 The managed system cannot currently process this operation. This condition is temporary. Please try the operation again." To try again, the customer needs to update the pool again. Typically on the second PEP update, the code is accepted. The problem is intermittent and occurs only rarely.
- On systems using PowerVM firmware, a problem was fixed for an invalid date from the service processor causing the customer date and time to go to the Epoch value (01/01/1970) without a warning or chance for a correction. With the fix, the first IPL attempted on an invalid date will be rejected with a message alerting the user to set the time correctly in the service processor. If the warning is ignored and the date/time is not corrected, the next IPL attempt will complete to the OS with the time reverted to the Epoch time and date. This problem is very rare but it has been known to occur on service processor replacements when the repair step to set the date and time on the new service processor was inadvertently skipped by the service representative.
- On systems using PowerVM firmware, a problem was fixed for a Power Enterprise Pool (PEP) system losing its assigned processor and memory resources after an IPL of the system. This is an intermittent problem caused by a small timing window that makes it possible for the server to not get the IPL-time assignment of resources from the HMC. If this problem occurs, it can be corrected by the HMC to recover the pool without needing another IPL of the system.
- On systems using PowerVM firmware with PowerVM NovaLink, a problem was fixed for the loss of a communications channel between the hypervisor and PowerVM NovaLink during a reset of the service processor. Various NovaLink tasks, including deploy, could fail with a "No valid host was found" error. With the fix, PowerVM NovaLink prevents normal operations from being impacted by a reset of the service processor.
- On systems using PowerVM firmware, a problem was fixed for a rare system hang caused by a process dispatcher deadlock timing window. If this problem occurs, the HMC will also go to an "Incomplete" state for the managed system.
- On systems using PowerVM firmware, a problem was fixed for communication failures on adapters in SR-IOV shared mode. This communication failure only occurs when a logical port's port VLAN ID (PVID) is dynamically changed from non-zero to zero. An SR-IOV logical port is an I/O device created for a partition or a partition profile using the management console (HMC) when a user intends for the partition to access an SR-IOV adapter Virtual Function. The error can be recovered from by a reboot of the partition.
This fix updates adapter firmware to 10.2.252.1929, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
- On systems using PowerVM firmware, a problem was fixed for error logs not getting sent to the OS running in a partition. This problem could occur if the error log buffer was full in the hypervisor and then a re-IPL of the system occurred. The error log full condition was persisting across the re-IPL, preventing further logs from being sent to the OS.
- On systems using PowerVM firmware, a problem was fixed in the text for the Firmware License agreement to correct a link that pointed to a URL that was not specific to microcode licensing. The message is displayed for a machine during its initial power on. Once accepted, the message is not displayed again. The fixed link in the licensing agreement is the following: http://www.ibm.com/support/docview.wss?uid=isg3T1025362.
|
SC860_103_056 / FW860.30
2017/06/30 |
Impact: Availability Severity: SPE
New features and functions
- Support was added for the Redfish API to allow the ISO 8601 extended format for the time and date so that the date/time can be represented as an offset from UTC (Coordinated Universal Time).
- Support for the Redfish API for power and thermal properties for the chassis. The new URIs are as follows:
https://<fsp ip>/redfish/v1/Chassis/<id>/Power : Provides power supply data
https://<fsp ip>/redfish/v1/Chassis/<id>/Thermal : Provides fan data
Only the Redfish GET operation is supported for these resources.
System firmware changes that affect all systems
- A problem was fixed for service actions with SRC B150F138 missing an Advanced System Management Interface (ASMI) Deconfiguration Record. The deconfiguration records make it easier to organize the repairs that are needed for the system and they need to be consistent with the periodic maintenance reminders that are logged for the failed FRUs.
- A problem was fixed for a false 1100026B1 (12V power good failure) caused by an I2C bus write error for a LED state. This error can be triggered by the fan LEDs changing state.
- A problem was fixed for a fan LED turning amber on solid when there is no fan fault, or when the fan fault is for a different fan. This error can be triggered anytime a fan LED needs to change its state. The fan LEDs can be recovered to a normal state concurrently using the following link steps for a soft reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for sporadic blinking amber LEDs for the system fans with no SRCs logged. There was no problem with the fans. The LED corruption occurred when two service processor tasks attempted to update the LED state at the same time. The fan LEDs can be recovered to a normal state concurrently using the following link steps for a soft reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for a Redfish Patch on the "Chassis" or "IBMEnterpriseComputerSystem" with empty data that caused a "500 Internal Server Error". Validation for the empty data case has been added to prevent the server error.
- A problem was fixed for hardware dumps only collecting data for the master processor if a run-time service processor failover had occurred prior to the dump. Therefore, there would be only master chip and master core data in the event of a core unit checkstop. To recover to a system state that is able to do a full collection of debug data for all processors and cores after a run-time failover, a re-IPL of the system is needed.
- A problem was fixed for a Redfish Patch on power mode to "MaxPowerSaver" that caused a "500 Internal Server Error" when that power mode was not supported on the system. With the fix, the Redfish server response is a list of the valid power modes that can be used for the system.
- A problem was fixed for the loss of Operations Panel function 30 (displaying ethernet port HMC1 and HMC2 IP addresses) after a concurrent repair of the Operations Panel. Operations Panel function 30 can be restored concurrently using the following link steps for a soft reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for a core dump of the rtiminit (service processor time of day) process that logs an SRC B15A3303 and could invalidate the time on the service processor. If the error occurs while the system is powered on, the hypervisor has the master time and will refresh the service processor time, so no action is needed for recovery. If the error occurs while the system is powered off, the service processor time must be corrected on the systems having only a single service processor. Use the following steps from the IBM Knowledge Center to change the UTC time with the Advanced System Management Interface: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/viewtime.htm.
- A problem was fixed for the service processor boot watch-dog timer expiring too soon during DRAM initialization in the reset/reload, causing the service processor to go unresponsive. On systems with a single service processor, the SRC B1817212 was displayed on the control panel. For systems with redundant service processors, the failing service processor was deconfigured. To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action. This problem is intermittent and very infrequent as most of the reset/reloads of the service processor will work correctly to restore the service processor to a normal operating state.
- A problem was fixed for host-initiated resets of the service processor causing the system to terminate. A prior fix for this problem did not work correctly because some of the host-initiated resets were being translated to unknown reset types that caused the system to terminate. With this new correction for failed host-initiated resets, the service processor will still be unresponsive but the system and partitions will continue to run. On systems with a single service processor, the SRC B1817212 will be displayed on the control panel. For systems with redundant service processors, the failing service processor will be deconfigured. To recover the failed service processor, the system will need to be powered off with AC power removed during a regularly scheduled system service action. This problem is intermittent and very infrequent as most of the host-initiated resets of the service processor will work correctly to restore the service processor to a normal operating state.
- A problem was fixed for a service processor reset triggered by a spurious IIC interrupt request in the kernel. On systems with a single service processor, the SRC B1817201 is displayed on the Operator Panel. For systems with redundant service processors, an error failover to the backup service processor occurs. The problem is extremely infrequent and does not impact processes on the running system.
- A problem was fixed for the System Attention LED failing to light for an error failover for the redundant service processors with an SRC B1812028 logged.
- A problem was fixed for a system failure at run time with SRC B111E450 corefir(55) after which the system could not re-IPL. A system node should have been deconfigured for an ABUS error on a processor chip, but instead the system was terminated. To recover from this problem, manually guard the node containing the failed processor and then the IPL will be successful.
- A problem was fixed for an incorrect Redfish error message when trying to use the $metadata URI: "The resource at the URI https://<systemip>/redfish/v1/%24metadata was not found.". The "%24" is the URL-encoded form of "$" and has been replaced with "$" in the error message. The Redfish $metadata URI is not supported.
- A problem was fixed for a system failure caused by Hostboot problems on one node while the other nodes are good. With the fix, the node that is failing Hostboot is deconfigured and the system is able to IPL on the remaining nodes. To recover from this problem, manually guard the node that is failing and re-IPL.
System firmware changes that affect certain systems
- DEFERRED: On systems using PowerVM firmware, a fix was made to improve PCIe3 I/O expansion drawer (#EMX0) link stability. The settings for the continuous time linear equalizers (CTLE) were updated for all the PCIe adapters for the PCIe links to the expansion drawer. The system must be re-IPLed for the fix to activate.
- On systems using PowerVM firmware with a Linux Little Endian (LE) partition, a problem was fixed for system reset interrupts returning the wrong values in the debug output for the NIP and MSR registers. This problem reduces the ability to debug hung Linux partitions using system reset interrupts. The error occurs every time a system reset interrupt is used on a Linux LE partition.
- On systems using PowerVM firmware, a problem was fixed for "Time Power On" enabled partitions not being capable of suspend and resume operations. This means Live Partition Mobility (LPM) would not be able to migrate this type of partition. As a workaround, the partition could be transitioned to a "Non-time Power On" state and then made capable of suspend and resume operations.
- On systems using PowerVM firmware, a problem was fixed for manual vNIC failovers (from the HMC, manually "Make the Backing Device Active") so that the selected server is chosen for the failover, regardless of its priority. Without the fix, the server chosen for the vNIC failover will be the one with the most favorable priority.
There are two possible workarounds to the problem:
(1) Disable auto-priority-failover; Change priority to the server that is needed as the target of the failover; Force the vNIC failover; Change priority back to original setting.
(2) Or use auto-priority-failover and change the priority so the server that is needed as the target of the failover is favored.
- On systems using PowerVM firmware, a problem was fixed for extra error logs in the VIOS due to failovers taking place while the client vNIC is inactive. The inactive client vNIC failovers are skipped unless the force flag is on. With the problem occurring, Enhanced Error Handling (EEH) Freeze/Temporary Error/Recovery logs posted in the VIOS error log of the client partition boot can be ignored unless an actual problem is experienced.
- On systems using PowerVM firmware, a problem was fixed for a Live Partition Mobility (LPM) migration abort and reboot on the FW860 target CEC caused by a mismatched address space for the source and target partition. The occurrence of this problem is very rare and related to performance improvements made in the memory management on the FW860 system that exposed a timing window in the partition memory validation for the migration. The reboot of the migrated partition recovers from the problem as the migration was otherwise successful.
- On systems using PowerVM firmware, a problem was fixed for reboot retries for IBM i partitions such that the first load source I/O adapter (IOA) is retried instead of bypassed after the first failed attempt. The reboot retries are done for an hour before the reboot process gives up. This error can occur if there is more than one known load source, and the IOA of the first load source is different from the IOA of the last load source. The error can be circumvented by retrying the boot of the partition after the load source device has become available.
- On systems using PowerVM firmware, a problem was fixed for adapters failing to transition to shared SR-IOV mode on the IPL after changing the adapter from dedicated mode. This intermittent problem could occur on systems using SR-IOV with very large memory configurations.
- On systems using PowerVM firmware, a problem was fixed for SR-IOV adapters in shared mode for a transmission stall or time out with SRC B400FF01 logged. The time out happens during Virtual Function (VF) shutdowns and during Function Level Resets (FLRs) with network traffic running.
This fix updates adapter firmware to 10.2.252.1927, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
- On systems with maximum memory configurations (where every DIMM slot is populated - size of DIMM does not matter), a problem has been fixed for systems losing performance and going into Safe mode (a power mode with reduced processor frequencies intended to protect the system from overheating and excessive power consumption) with B1xx2AC3/B1xx2AC4 SRCs logged. This happened because of On-Chip Controller (OCC) timeout errors when collecting Analog Power Subsystem Sweep (APSS) data, used by the OCC to tune the processor frequency. This problem occurs more frequently on systems that are running heavy workloads. Recovery from Safe mode back to normal performance can be done with a re-IPL of the system, or concurrently using the following link steps for a soft reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
To check or validate that Safe mode is not active on the system, a dynamic celogin password from IBM Support is required in order to use the service processor command line:
1) Log into ASMI as celogin with dynamic celogin password generated by IBM Support
2) Select System Service Aids
3) Select Service Processor Command Line
4) Enter "tmgtclient --query_mode_and_function" from the command line
The first line of the output, "currSysPwrMode", should say "NOMINAL", which means the system is in normal mode and that Safe mode is not active.
- A problem has been fixed for systems losing performance and going into Safe mode (a power mode with reduced processor frequencies intended to protect the system from overheating and excessive power consumption) with B1xx2AC3/B1xx2AC4 SRCs logged. This happened because of an On-Chip Controller (OCC) internal queue overflow. The problem has only been observed for systems running heavy workloads with maximum memory configurations (where every DIMM slot is populated - size of DIMM does not matter), but this may not be required to encounter the problem. Recovery from Safe mode back to normal performance can be done with a re-IPL of the system, or concurrently using the following link steps for a soft reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
To check or validate that Safe mode is not active on the system, a dynamic celogin password from IBM Support is required in order to use the service processor command line:
1) Log into ASMI as celogin with dynamic celogin password generated by IBM Support
2) Select System Service Aids
3) Select Service Processor Command Line
4) Enter "tmgtclient --query_mode_and_function" from the command line
The first line of the output, "currSysPwrMode", should say "NOMINAL", which means the system is in normal mode and that Safe mode is not active.
- On systems using PowerVM firmware, a problem was fixed for a partition boot from a USB 3.0 device that has an error log SRC BA210003. The error is triggered by an Open Firmware entry to the trace buffer during the partition boot. The error log can be ignored as the boot is successful to the OS.
- On systems using PowerVM firmware, a problem was fixed for a partition boot failure or hang from a Fibre Channel device having fabric faults. Some of the fabric errors returned by the VIOS are not interpreted correctly by the Open Firmware VFC driver, causing the hang instead of generating helpful error logs.
- On systems with redundant service processors, a problem was fixed for an extra SRC B150F138 logged for a power supply that had already been replaced. The problem was triggered by a service processor failover and an old power supply fault event that was not cleared on the backup service processor. This caused the SRC B150F138 to be logged for a second time. This problem can be circumvented by clearing the error log associated with the bad FRU when the FRU is replaced.
- On systems using PowerVM firmware, a problem was fixed for a Power Enterprise Pool (PEP) resource Grace Period not being reset when the server is in the "Out of Compliance" state and the resource has been returned to put the server back in Compliance. The Grace Period was not being reset after a double-commit of a resource (doing a "remove" of an active resource) was resolved by restarting the server with the double-committed resource. When the Grace Period ends, the "double-committed" resources on the server have to have been freed up from use to prevent the server from going to "Out of Compliance". If the user fails to free up the resource, the PEP is in an "Out of Compliance" state, and the only PEP actions allowed are ones to free up the double-commit. Once that is completed, the PEP is back In Compliance. The loss of the Grace Period for the error makes it difficult to move resources around in the PEP. Without the fix, the user can "Add" another PEP resource to the server, and the action of adding a PEP resource resets the Grace Period timer. One could then "Remove" that one PEP resource just added, and then any further "removes" of PEP resources would behave as expected with the full Grace Period in effect.
- On systems using PowerVM firmware, a problem was fixed for Power Enterprise Pool (PEP) IFL processors assignments causing an "Out of Compliance" for normal processor licenses. The number of IFL processors purchased was first credited as satisfying any "unreturned" PEP processor resources, thus potentially leaving the system "Out Of Compliance" since IFL processors should not be taking the place of the normal (expensive) processor usage. In this situation, without the fix, the user will need to either purchase more "expensive" non-IFL processors to satisfy the non-IFL workloads or adjust the partitions to reduce the usage of non-IFL processors. This is a very infrequent problem for the following reasons:
1) PEP processors are infrequently left "unreturned" for short periods of time for specialized operations such as LPM migrations
2) The user would have to purchase IFL processors from IBM, which is not a common occurrence.
3) The user would have to put in a COD key for IFL processors while a PEP processor is still "unreturned"
- On systems using PowerVM firmware, a problem was fixed for a power off hanging at D200C1FF caused by a vNIC VF failover error with SRC B200F011. The power off hang is infrequent because it requires that a VF failover error has occurred first. The system can be recovered by using the power off immediate option from the Hardware Management Console (HMC).
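For illustration only, assuming a hypothetical managed system named "sys1", the immediate power off referenced above can also be issued from the HMC command line as:
chsysstate -m sys1 -r sys -o off --immed
Check the chsysstate documentation for the installed HMC level before use.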
- On systems using PowerVM firmware, a problem was fixed for the incorrect reporting of the Universally Unique Identifier (UUID) to the OS, which prevented the tracking of a partition as it moved within a data center. The UUID value as seen on HMC or the NovaLink did not match the value as displayed in the OS.
- On systems using PowerVM firmware, a problem was fixed for an error finding the partition load source that has a GPT format. GUID Partition Table (GPT) is a standard for the layout of the partition table on a physical storage device used in the server, such as a hard disk drive or solid-state drive, using globally unique identifiers (GUID). Other drives that are working may be using the older master boot record (MBR) partition table format. This problem occurs whenever load sources utilizing the GPT format occur in other than the first entry of the boot table. Without the fix, a GPT disk drive must be the first entry in the boot table to be able to use it to boot a partition.
- On systems using PowerVM firmware, a problem was fixed for an SRC BA090006 serviceable event log occurring whenever an attempt was made to boot from an ALUA (Asymmetric Logical Unit Access) drive. These drives are always busy by design and cannot be used for a partition boot, but no service action is required if a user inadvertently tries to do that. Therefore, the SRC was changed to be an informational log.
|
SC860_082_056 / FW860.20
2017/03/17 |
Impact: Availability Severity: SPE
New features and functions
- Support for the Redfish API for provisioning of Power Management tunable (EnergyScale) parameters. The Redfish Scalable Platforms Management API ("Redfish") is a DMTF specification that uses RESTful interface semantics to perform out-of-band systems management. (http://www.dmtf.org/standards/redfish).
The Redfish service enables platform management tasks to be controlled by client scripts developed using secure and modern programming paradigms.
For systems with redundant service processors, the Redfish service is accessible only on the primary service processor. Usage information for the Redfish service is available at the following IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hdx/p8_workingwithconsoles.htm.
The IBM Power server supports DMTF Redfish API (DSP0266, version 1.0.3 published 2016-06-17) for systems management.
A copy of the Redfish schema files in JSON format published by the DMTF (http://redfish.dmtf.org/schemas/v1/) is packaged in the firmware image so that the service functions properly in deployments with no WAN connectivity.
IBM extensions to the Redfish schema are published at http://public.dhe.ibm.com/systems/power/redfish/schemas/v1. Copyright notices for the DMTF Redfish API and schemas are at: (a) http://www.dmtf.org/about/policies/copyright, and (b) http://redfish.dmtf.org/schemas/README8010.html. A minimal client query against the Redfish service is sketched at the end of this feature list.
- Support added to reduce memory usage for shared SR-IOV adapters.
- Support for the Advanced System Management Interface (ASMI) was changed to allow the special characters "I", "O", and "Q" to be entered for the serial number of the I/O Enclosure under the Configure I/O Enclosure option. These characters appear only rarely in IBM serial numbers, so entering them is usually a mistake; however, ASMI no longer blocks them, so the rare legitimate case is supported. Without the enhancement, typing one of these characters caused the message "Invalid serial number" to be displayed.
- Support was added to the Advanced System Management Interface (ASMI) "System Service Aids => Cable Validation" panel to add a timestamp showing when the cables were last validated.
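As a hedged illustration of the Redfish support described above, the following sketch issues an out-of-band query against the service root. The hostname, credentials, use of HTTP basic authentication, and disabled certificate verification are assumptions for illustration only; the /redfish/v1 service root path and the RedfishVersion property come from the DMTF Redfish specification (DSP0266).
    # Minimal sketch (illustration only): query the Redfish service root on the
    # service processor and list the resources it advertises.
    import requests

    FSP_HOST = "https://fsp.example.com"   # hypothetical service processor address
    AUTH = ("admin", "password")           # hypothetical credentials

    def get(resource_path: str) -> dict:
        resp = requests.get(FSP_HOST + resource_path, auth=AUTH, verify=False, timeout=30)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        root = get("/redfish/v1")          # Redfish service root
        print("Redfish version:", root.get("RedfishVersion"))
        for name, ref in root.items():     # top-level resources advertised by the root
            if isinstance(ref, dict) and "@odata.id" in ref:
                print(name, "->", ref["@odata.id"])
On systems with redundant service processors, the request must be directed to the primary service processor, as noted above.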
System firmware changes that affect all systems
- A problem was fixed for setting the disable of the periodic notification for call home error log SRC B150F138 for Memory Buffer resources (membuf) from the Advanced System Management Interface (ASMI).
- A problem was fixed for the call home data for the B1xx2A01 SRC to include the min/max/average readings for more values. The values for processor utilization, memory utilization, and node power usage were added.
- A problem was fixed for incorrect callouts of the Power Management Controller (PMC) hardware with SRC B1112AC4 and SRC B1112AB2 logged. These extra callouts occur when the On-Chip Controller (OCC) has placed the system in the safe state for a prior failure that is the real problem that needs to be resolved.
- A problem was fixed for System Vital Product Data (SVPD) FRUs being guarded without a corresponding error log entry. This failure to commit the error log entry occurs only rarely.
- A problem was fixed for the failover to the backup PNOR on a Hostboot Self Boot Engine (SBE) failure. Without the fix, the failed SBE causes loss of processors and memory with B15050AD logged. With the fix, the SBE is able to access the backup PNOR and IPL successfully by deconfiguring the failing PNOR and calling it out as a failed FRU.
- A problem was fixed for the Advanced System Management Interface (ASMI) "System Service Aids => Error/Event Logs" panel not showing the "Clear" and "Show" log options and also having a truncated error log when there are a large number of error logs on the system.
- A problem was fixed for a system going into safe mode with SRC B1502616 logged as informational without a call home notification. Notification is needed because the system is running with reduced performance. If there are unrecoverable error logs, any are marked with reduced performance, and the system has not been rebooted, then the system is probably running in safe mode with reduced performance. With the fix, the SRC B1502616 is an Unrecoverable Error (UE).
- A problem was fixed for valid IPv4 static IP addresses not being allowed to communicate on the network and not being allowed to be configured.
The Advanced System Management Interface (ASMI) static IPv4 address configuration was not allowing "255" in the IP address subfields. The corrected range checking is as follows:
Allowed values: x.255.x.x, x.x.255.x, x.255.255.x
Disallowed values: x.x.x.255
The failure to communicate on the network is seen if the problematic IP addresses are in use prior to a firmware update to FW860.00, FW860.10, FW860.11, or FW860.12. After the firmware update, the service processor is unable to communicate on the network. The problem can be circumvented by changing the service processor to use DHCP addressing, or by moving the IP address to a different static IP range, prior to doing the firmware update. A sketch of the corrected octet check is given at the end of this list.
- A problem was fixed for corrupt service processor error log entries caused by incorrect error log synchronization between the primary and backup service processors during firmware updates. At the time of the corruption, a B1818601 SRC is logged with a fipsdump generated. Then during normal operations, a periodic B1818A12 SRC may be logged as the corrupted error log entries are encountered. No service action is needed for the corrupted error logs, as the old corrupted entries will be deleted when new error logs are added as part of the error log housekeeping.
- A problem was fixed for an unneeded service action request for an informational VRM redundant phase fail error logged with SRC 11002701. If reminders for service action with SRC B150F138 are occurring for this problem, then firmware containing the fix needs to be installed and the ASMI error logs need to be cleared in order to stop the periodic reminder.
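The corrected ASMI range check described above can be summarized by the following minimal sketch; it is not the ASMI implementation, and it deliberately covers only the rule stated above (255 allowed in the second and third octets, disallowed in the fourth), omitting all other address validation.
    # Minimal sketch (illustration only) of the corrected 255-octet rule.
    def octet_255_rule_ok(address: str) -> bool:
        try:
            octets = [int(part) for part in address.split(".")]
        except ValueError:
            return False
        if len(octets) != 4 or any(o < 0 or o > 255 for o in octets):
            return False                   # not a well-formed dotted-quad address
        return octets[3] != 255            # 255 allowed in octets 2 and 3, not in octet 4

    if __name__ == "__main__":
        for addr in ("10.255.1.2", "10.1.255.2", "10.255.255.2", "10.1.1.255"):
            print(addr, "allowed" if octet_255_rule_ok(addr) else "disallowed")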
System firmware changes that affect certain systems
- On systems using PowerVM firmware, a problem was fixed for a blank SRC in the LPA dump for user-initiated non-disruptive adjunct dumps. The A2D03004 SRC is needed for problem determination and dump analysis.
- On a system using PowerVM firmware with an IBM i partition and VIOS, a problem was fixed for a Live Partition Mobility migration of an IBM i partition that fails if there is a VIOS failover during the migration suspended window.
- On a system using PowerVM firmware and VIOS, a problem was fixed for an HMC "Incomplete State" after a Live Partition Mobility migration followed by a VIOS failover. The error is triggered by a delete operation on a migration adapter on the VIOS that did the failover. The HMC "Incomplete State" can be recovered from by doing a re-IPL of the system. This error can also prevent a VIOS from activating.
- On systems using PowerVM firmware, a problem was fixed with SR-IOV adapter error recovery where the adapter is left in a failed state in nested error cases for some adapter errors. The probability of this occurring is very low since the problem trigger is multiple low-level adapter failures. With the fix, the adapter is recovered and returned to an operational state.
- On systems using PowerVM firmware with PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared mode, a problem was fixed for the hypervisor SR-IOV adjunct partition failing during the IPL with SRCs B200F011 and B2009014 logged. The SR-IOV adjunct partition successfully recovers after it reboots and the system is operational.
- On systems using PowerVM firmware with PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared mode in a PCIe slot with Enlarged IO Capacity and 2 TB or more of system memory, a problem was fixed for the hypervisor SR-IOV adjunct partition failing during the IPL with SRCs B200F011 and B2009014 logged. In this configuration, it is possible the SR-IOV adapter will not become functional following a system reboot or when an adapter is first configured into shared mode. Larger system memory configurations of 2 TB or more are more likely to encounter the problem. The problem can be avoided by reducing the number of PCIe slots with Enlarged IO Capacity enabled so they do not include adapters in SR-IOV shared mode. Another circumvention option is to move the adapter to an SR-IOV capable PCIe slot where Enlarged IO Capacity is not enabled.
- On a system using PowerVM firmware and VIOS, a problem was fixed for a Live Partition Mobility (LPM) migration for an Active Memory Sharing (AMS) partition that hangs if there is a VIOS failover during the migration.
- On systems using PowerVM firmware, a problem was fixed for the PCIe3 Optical Cable Adapter for the PCIe3 Expansion Drawer failing with SRC B7006A84 error logged during the IPL. The failed cable adapter can be recovered by using a concurrent repair operation to power it off and on. Or the system can be re-IPLed to recover the cable adapter. The affected optical cable adapters have feature codes #EJ05, #EJ06, and #EJ08 with CCINs 2B1C, 6B52, and 2CE2, respectively.
- On systems using PowerVM firmware, the hypervisor "vsp" macro was enhanced to show the type of the adjunct partition. The "vsp -longname" macro option was also updated to list the location codes for the SR-IOV adjunct partitions. The hypervisor macros are used by IBM support to help debug Power system problems.
- On systems using PowerVM firmware, a problem was fixed for PCIe Host Bridge (PHB) outages and PCIe adapter failures in the PCIe I/O expansion drawer caused by error thresholds being exceeded for the LEM bit [21] errors in the FIR accumulator. These are typically minor and expected errors in the PHB that occur during adapter updates and do not warrant a reset of the PHB and the PCIe adapter failures. Therefore, the threshold LEM[21] error limit has been increased and the LEM fatal error has been changed to a Predictive Error to avoid the outages for this condition.
- On systems using PowerVM firmware, a fix was made to improve the stability of the PCIe3 links to the PCIe3 I/O expansion drawer (#EMX0). The settings for the continuous time linear equalizers (CTLE) were updated for all the PCIe adapters for the PCIe links to the expansion drawer. The CEC must be re-IPLed for the fix to activate.
- On systems using PowerVM firmware with IBM i partitions, a problem was fixed for frequent logging of informational B7005120 errors due to communications path closed conditions during messaging from HMCs to IBM i partitions. In the majority of cases these errors are due to normal operating conditions and do not require service or attention. The logging of informational errors for this specific communications path closed condition, when it results from normal operating conditions, has been removed.
- On a system using PowerVM firmware with an IBM i partition, a problem was fixed for a D-mode boot failure for IBM i from a USB RDX cartridge. There is a hang at the LPAR progress code C2004130 for a period of time and then a failure with SRC B2004158 logged. There is a USB External Dock (FC #EU04) and Removable Disk Cartridge (RDX) 63B8-005 attached. The error is intermittent, so the RDX can be powered off and back on to retry the D-mode boot to recover.
- On systems using PowerVM firmware, the following problems were fixed for SR-IOV adapters:
1) Insufficient resources were reported for an SR-IOV logical port configured with promiscuous mode enabled and a Port VLAN ID (PVID) when creating a new interface on the SR-IOV adapter.
2) Spontaneous dumps and reboots of the adjunct partition for SR-IOV adapters.
3) The adapter enters a firmware loop when a single-bit ECC error is detected. System firmware detects this condition as an adapter command timeout and resets and restarts the adapter to recover the adapter functionality. This condition is reported as a temporary adapter hardware failure.
4) vNIC interfaces not being deleted correctly, causing SRC B400FF01 to be logged and Data Storage Interrupt (DSI) errors with a failure on boot of the LPAR.
This set of fixes updates adapter firmware to 10.2.252.1926, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
- On systems using PowerVM firmware with an IBM i partition, a problem was fixed for incorrect maximum performance reports based on the wrong number of "maximum" processors for the system. Certain performance reports that can be generated on IBM i systems contain not only the existing machine information, but also "what-if" information, such as "how would this system perform if it had all the processors possible installed in this system". This "what-if" report was in error because the maximum number of processors possible was too high for the system.
- On systems using PowerVM firmware, a problem was fixed for degraded PCIe3 links for the PCIe3 expansion drawer with SRC B7006A8F not being visible on the HMC. This occurred because the SRC was informational. The problem occurs when the link attaching a drawer to the system trains to x8 instead of x16. With the fix, the SRC has been changed to a B7006A8B permanent error for the degraded link.
- On systems using PowerVM firmware, a problem was fixed for a concurrent exchange of a CAPI adapter that left the new adapter in a deactivated state. The system can be powered off and IPLed again to recover the new adapter. The CAPI adapters have the following feature codes: #EC3E, #EC3F, #EC3L, #EC3M, #EC3T, #EC3U, #EJ16, #EJ17, #EJ18, #EJ1A, and #EJ1B.
- On a system using PowerVM firmware with SR-IOV adapters, a problem was fixed for a DLPAR remove on a Virtual Function (VF) of a ConnectX-4 (CX4) adapter that failed with AIX error "0931-013 Unable to isolate the resource". The HMC reported error is "HSCL12B5 The operation to remove SR-IOV logical port xx failed because of the following error: HSCL131D The SR-IOV logical port is still in use by the partition". The failing PCIe3 adapters are sourced from Mellanox Corporation based on ConnectX-4 technology and have the following feature codes and CCINs: #EC3E and #EC3F with CCIN 2CEA; #EC3L and #EC3M with CCIN 2CEC; and #EC3T and #EC3U with CCIN 2CEB. The issue occurs each time a DLPAR remove operation is attempted on the VF. Restarting the partition after a failed DLPAR remove recovers from the error.
- On systems using PowerVM firmware, a problem was fixed for NVRAM corruption that can occur when deleting a partition that owns a CAPI adapter, if that CAPI adapter is not assigned to another partition before the system is powered off. On a subsequent IPL, the system will come up in recovery mode if there is NVRAM corruption. To recover, the partitions must be restored from the HMC. The frequency of this error is expected to be rare. The CAPI adapters have the following feature codes: #EC3E, #EC3F, #EC3L, #EC3M, #EC3T, #EC3U, #EJ16, #EJ17, #EJ18, #EJ1A, and #EJ1B.
- On systems using PowerVM firmware, a problem was fixed for NVRAM corruption and an HMC recovery state when using Simplified Remote Restart partitions. The failing systems will have at least one Remote Restart partition, and on the failed IPL there will be a B70005301 SRC with word 7 being 0x00000002.
- On systems using PowerVM firmware, a problem was fixed for a group of shared processor partitions being able to exceed the designated capacity placed on a shared processor pool. This error can be triggered by using the DLPAR move function for the shared processor partitions if the pool has already reached its maximum specified capacity. To prevent this problem from occurring when making DLPAR changes while the pool is at maximum capacity, do not use the DLPAR move operation; instead break it into two steps: a DLPAR remove followed by a DLPAR add. This gives enough time for the DLPAR remove to be fully completed prior to starting the DLPAR add request. The corresponding HMC commands are sketched at the end of this list.
- On systems using PowerVM firmware, a problem was fixed for partition boot failures and run time DLPAR failures when adding I/O that log BA210000, BA210003, and/or BA210005 errors. The fix also applies to run time failures configuring an I/O adapter following an EEH recovery that log BA188001 events. The problem can impact IBM i partitions running in any processor mode or AIX/Linux partitions running in P7 (or older) processor compatibility modes. The problem is most likely to occur when the system is configured in the Manufacturing Default Configuration (MDC) mode. The trigger for the problem is a race condition between the hypervisor and the physical operations panel with a very rare frequency of occurrence.
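The two-step DLPAR workaround described earlier in this list (remove, then add, instead of move) could be driven from the HMC command line as in the following sketch. The HMC address, managed system name, partition names, processing-unit quantity, and the exact chhwres option syntax shown here are assumptions for illustration only; consult the HMC command reference for the level in use.
    # Minimal sketch (illustration only): perform a DLPAR remove followed by a
    # DLPAR add over SSH to the HMC, instead of a single DLPAR move.
    import subprocess

    HMC = "hscroot@hmc.example.com"            # hypothetical HMC user and host
    SYSTEM = "Server-9119-MHE-SN1234567"       # hypothetical managed system name

    def hmc(command: str) -> None:
        subprocess.run(["ssh", HMC, command], check=True)

    # Step 1: remove the shared processing units from the source partition and let
    # the command complete before proceeding.
    hmc(f"chhwres -r proc -m {SYSTEM} -o r -p lparA --procunits 0.5")

    # Step 2: add the same quantity to the destination partition.
    hmc(f"chhwres -r proc -m {SYSTEM} -o a -p lparB --procunits 0.5")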
|
SC860_070_056 / FW860.12
2017/01/13 |
Impact: Availability Severity: SPE
System firmware changes that affect certain systems
- On a system using PowerVM firmware, a problem was fixed for the System Management Services (SMS) SAS utility showing very large (incorrect) disk capacity values depending on the size of the disk or Volume Set/Array. The problem occurs when the number of blocks on a disk is 2 G or more.
- On a system using PowerVM firmware running a Linux OS, a problem was fixed for support for Coherent Accelerator Processor Interface (CAPI) adapters. The CAPI related RTAS h-calls for the CAPI devices could not be made by the Linux OS, impacting the CAPI adapter functionality and usability. This problem involves the following adapters: the PCIe3 LP CAPI Accelerator Adapter with F/C #EJ16 that is used on the S812L (8247-21L) and S822L (8247-22L) models; the PCIe3 CAPI FlashSystem Accelerator Adapter with F/C #EJ17 that is used on the S814 (8286-41A) and S824 (8286-42A) models; and the PCIe3 CAPI FlashSystem Accelerator Adapter with F/C #EJ18 that is used on the S822 (8284-22A), E870 (9119-MME), and E880 (9119-MHE) models. This problem does not pertain to PowerVM AIX partitions using CAPI adapters.
- On a system using PowerVM firmware, a problem was fixed for Live Partition Mobility (LPM) migrations to FW860.10 or FW860.11 from any other level of firmware (i.e. not FW860.10 or FW860.11) that caused errors in the output of the AIX "lsattr -El mem0" command and Dynamic LPAR (DLPAR) operations. The "lsattr" command will report the partition only has one logical memory block (LMB) of memory assigned to it, even though there is more memory assigned to the partition. Also, as a result of this problem, DLPAR operations will fail with an error indicating the request could not be completed. This issue affects AIX 5.3, AIX 6.1, AIX 7.1, and AIX 7.2 TL 0, and may result in the AIX DLPAR error message "0931-032 Firmware failure. Data may be out of sync and the system may require a reboot." This issue also affects all levels of Linux. Not affected by this issue are AIX 7.2 TL 1, VIOS, and IBM i partitions.
In addition, after performing LPM from FW860 to earlier versions of firmware, the DLPAR of Virtual Adapters will fail with HMC error message HSCL294C, which contains text similar to the following: "0931-007 You have specified an invalid drc_name."
Without the fix, a reboot of the migrated partition will correct the problem. A quick check for the symptom using the "lsattr" command is sketched at the end of this list.
- On a system using PowerVM firmware, a problem was fixed for I/O DLPARs that result in partition hangs. To trigger the problem, the DLPAR operation must be performed on a partition which has been migrated via a Live Partition Mobility (LPM) operation from a P6 or P7 system to a P8 system. Additionally, DLPAR of I/O will fail when performed on a partition which has been migrated via an LPM operation from a P8 system to a P6 or P7 system. The failure will produce HMC error message HSCL2928, which contains text similar to the following: "0931-011 Unable to allocate the resource to the partition." DLPAR operations for memory or CPU are not affected. This issue affects all Linux and AIX partitions. IBMi partitions are not affected.
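For the "lsattr -El mem0" symptom described earlier in this list, a quick check could look like the following sketch. It assumes an AIX partition, that the "size" attribute of mem0 reports the partition memory in megabytes, and an expected-memory value supplied by the caller; all of these are assumptions for illustration only.
    # Minimal sketch (illustration only): flag a partition whose mem0 size is far
    # below what is assigned to it, the symptom seen after the affected LPM migrations.
    import subprocess

    EXPECTED_MB = 65536   # hypothetical: memory assigned to the partition, in MB

    def reported_mem_mb() -> int:
        out = subprocess.run(["lsattr", "-El", "mem0"], capture_output=True,
                             text=True, check=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if fields and fields[0] == "size":
                return int(fields[1])
        raise RuntimeError("size attribute not found in lsattr output")

    if __name__ == "__main__":
        reported = reported_mem_mb()
        if reported < EXPECTED_MB:
            print(f"mem0 reports only {reported} MB; the partition may be exposed to this issue")
        else:
            print(f"mem0 reports {reported} MB, as expected")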
|
SC860_063_056 / FW860.11
2016/12/05 |
Impact: N/A Severity: N/A
- This Service Pack contained updates for MANUFACTURING ONLY.
|
SC860_056_056 / FW860.10
2016/11/18 |
Impact: New Severity: New
New features and functions
- Support enabled for Live Partition Mobility (LPM) operations.
- Support enabled for partition Suspend and Resume from the HMC.
- Support enabled for partition Remote Restart.
- Support enabled for PowerVM vNIC. PowerVM vNIC combines many of the best features of SR-IOV and PowerVM SEA to provide a network solution with options for advanced functions such as Live Partition Mobility, along with better performance and I/O efficiency when compared to PowerVM SEA. In addition, PowerVM vNIC provides users with bandwidth control (QoS) capability by leveraging SR-IOV logical ports as the physical interface to the network.
- Support for dynamic setting of the Simplified Remote Restart VM property, which enables this property to be turned on or off dynamically with the partition running.
- Support for PowerVM and HMC to get and set the boot list of a partition.
- Support for PowerVM partition restart in a Disaster Recovery (DR) environment.
- Support on PowerVM for a partition with 32 TB memory. AIX, IBM i, and Linux are supported, but IBM i must be IBM i 7.3 TR1. IBM i 7.2 has a limit of 16 TB per partition and IBM i 7.1 has a limit of 8 TB per partition. The AIX level must be 7.1S or later. Linux distributions supported are RHEL 7.2 P8, SLES 12 SP1, Ubuntu 16.04 LTS, RHEL 7.3 P8, SLES 12 SP2, Ubuntu 16.04.1, and SLES 11 SP4 for SAP HANA.
- Support for PowerVM and PowerNV (non-virtualized or OPAL bare-metal) booting from a PCIe Non-Volatile Memory express (NVMe) flash adapter. The adapters include feature codes #EC54 and #EC55 - 1.6 TB, and #EC56 and #EC57 - 3.2 TB NVMe flash adapters with CCIN 58CB and 58CC respectively.
- Support for PowerVM NovaLink V1.0.0.4 which includes the following features:
- IBM i network boot
- Live Partition Mobility (LPM) support for inactive source VIOS
- Support for SR-IOV configurations, vNIC, and vNIC failover
- Partition support for Red Hat Enterprise Linux
- Support for a decrease in the amount of PowerVM memory needed to support Huge Dynamic DMA Window (HDDW) for a PCI slot by using 64K pages instead of 4K pages. The hypervisor only allocates enough storage for the Enlarged IO Capacity (Huge Dynamic DMA Window) capable slots to map every page in main storage with 64K pages rather than 4K pages as was done previously. This affects only the Linux OS as AIX and IBM i do not use HDDW.
- Support added to reduce the number of error logs and call homes for the non-critical FRUs for the power and thermal faults of the system.
- Support for redundancy in the transfer of partition state for Live Partition Mobility (LPM) migration operations. Redundant VIOS Mover Service Partitions (MSPs) can be defined along with redundant network paths at the VIOS/MSP level. When redundant MSP pairs are used, the migrating memory pages of the logical partition are transferred from the source system to the target system by using two MSP pairs simultaneously. If one of the MSP pairs fails, the migration operation continues by using the other MSP pair. In some scenarios, where a common shared Ethernet adapter is not used, use redundant MSP pairs to improve performance and reliability.
Note: For an LPM migration of a partition using Active Memory Sharing (AMS) in a dual (redundant) MSP configuration, the LPM operation may hang if the MSP connection fails during the LPM migration. To avoid this issue, which applies only to AMS partitions, AMS migrations should only be done from the HMC command line using the migrlpar command and specifying --redundantmsp 0 to disable the redundant MSPs.
Note: To use redundant MSP pairs, all VIOS MSPs must be at version 2.2.5.00 or later, the HMC at version 8.6.0 or later, and the firmware level FW860 or later.
For more information on LPM and VIOS supported levels and restrictions, refer to the following links on the IBM Knowledge Center:
http://www.ibm.com/support/knowledgecenter/PurePower/p8hc3/p8hc3_firmwaresupportmatrix.htm
https://www.ibm.com/support/knowledgecenter/HW4L4/p8eeo/p8eeo_ipeeo_main.htm
- Support for failover capability for vNIC client adapters in the PowerVM hypervisor, rather than requiring the failover configuration to be done in the client OS. To create a redundant connection, the HMC adds another vNIC server with the same remote lpar ID and remote DRC as the first, giving each server its own priority.
- Support for SAP HANA with Solution edition with feature code #EPVR on 3.65 GHz processors with 12-core activations and 512 GB memory activations on SUSE Linux. SAP HANA is an in-memory platform for processing high volumes of data in real time, allowing data analysts to query large volumes of data as it arrives. HANA's in-memory database infrastructure frees analysts from having to load or write back data.
- Support for the Hardware Management Console (HMC) to access the service processor IPMI credentials and to retrieve Performance and Capacity Monitor (PCM) data for viewing in a tabular format or for exporting as CSV values. The enhanced HMC interface can now start and stop VIOS Shared Storage Pool (SSP) monitoring from the HMC and start and stop SSP historical data aggregation.
- Support for the Advanced System Management Interface (ASMI) was changed to not create VPD deconfiguration records and call home alerts for hardware FRUs that have one VPD chip of a redundant pair broken or inaccessible. The backup VPD chip for the FRU allows continued use of the hardware resource. The notification of the need for service for the FRU VPD is not provided until both of the redundant VPD chips have failed for a FRU.
System firmware changes that affect all systems
- A problem was fixed for a failed IPL with SRC UE BC8A090F that does not have a hardware callout or a guard of the failing hardware. The system may be recovered by guarding out the processor associated with the error and re-IPLing the system. With the fix, the bad processor core is guarded and the system is able to IPL.
- A problem was fixed for an infrequent service processor failover hang that results in a reset of the backup service processor that is trying to become the new primary. This error occurs more often on a failover to a backup service processor that has been in that role for a long period of time (many months). This error can cause a concurrent firmware update to fail. To reduce the chance of a firmware update failure because of a bad failover, an Administrative Failover (AFO) can be requested from the HMC prior to the start of the firmware update. When the AFO has completed, the firmware update can be started as normally done.
- A problem was fixed for an Operations Panel Function 04 (Lamp test) during an IPL causing the IPL to fail. With the fix, the lamp test request is rejected during the IPL until the hypervisor is available. The lamp test can be requested without problems anytime after the system is powered on to hypervisor ready or an OS is running in a partition.
- A problem was fixed for On-Chip Controller (OCC) errors that had excessive callouts for processor FRUs. Many of the OCC errors are recoverable and do not require that the processor be called out and guarded. With the fix, the processors will only be called out for OCC errors if there are three or more OCC failures during a time period of a week.
- A problem was fixed for the loss of the setting for the disable of a periodic notification for a call home error log after a failover to the backup service processor on a redundant service processor system. The call home for the presence of a failed resource can get re-enabled (if manually disabled in ASMI on the primary service processor) after a concurrent firmware update or any scenario that causes the service processor to fail over and change roles. With the fix, the periodic notification flag is synchronized between the service processors when the flag value is changed.
- A problem was fixed for the On-Chip Controller (OCC) incorrectly calling out processors with SRC B1112A16 for L4 Cache DIMM failures with SRC B124E504. This false error logging can occur if the DIMM slot that is failing is adjacent to two unoccupied DIMM slots.
- A problem was fixed for CEC drawer deconfiguration during an IPL due to SRCs BC8A0307 and BC8A1701 that did not have the correct hardware callout for the failing SCM. With the fix, the failing SCM is called out and guarded so the CEC drawer will IPL even though there is a failed processor.
- A problem was fixed for device timeouts during an IPL logged with SRC B18138B4. This error is intermittent and no action is needed for the error log. The service processor hardware server now allots more time for the device transactions to allow them to complete without a timeout error.
System firmware changes that affect certain systems
- DISRUPTIVE: On systems using the PowerVM firmware, a problem was fixed for an "Incomplete" state caused by initiating a resource dump with selector macros from NovaLink (vio -dump -lp 1 -fr). The failure causes the stack frame size of the communication task HVHMCCMDRTRTASK to be exceeded with a hypervisor page fault that disrupts the NovaLink and/or HMC communications. The recovery action is to re-IPL the CEC, but that will need to be done without the assistance of the management console. For each partition that has an OS running on the system, shut down the partition from the OS. Then from the Advanced System Management Interface (ASMI), power off the managed system. Alternatively, the system power button may also be used to do the power off. If the management console Incomplete state persists after the power off, the managed system should be rebuilt from the management console. For more information on management console recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm. The fix is disruptive because the size of the PowerVM hypervisor must be increased to accommodate the over-sized stack frame of the failing task.
- DEFERRED: On systems using the PowerVM firmware, a problem was fixed for a CAPI function unavailable condition on a system with the maximum number of CAPI adapters and partitions. Not enough bytes were allocated for CAPI for the maximum configuration case. The problem may be circumvented by reducing the number of active partitions or CAPI adapters. The fix is deferred because the size of the hypervisor must be increased to provide the additional CAPI space.
- DEFERRED: On systems using PowerVM firmware, a problem was fixed for cable card capable PCI slots that fail during the IPL. Hypervisor I/O Bus Interface UE B7006A84 is reported for each cable card capable PCI slot that does not contain a PCIe3 Optical Cable Adapter for the PCIe Expansion Drawer (feature code #EJ05). PCI slots containing a cable card will not report an error but will not be functional. The problem can be resolved by performing an AC cycle of the system. The trigger for the failure is that the I2C devices used to detect the cable cards do not come out of the power-on reset process in the correct state, due to a race condition.
- On systems using PowerVM firmware, a problem was fixed for network issues, causing critical situations for customers, when an SR-IOV logical port or vNIC is configured with a non-zero Port VLAN ID (PVID). This fix updates adapter firmware to 10.2.252.1922, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N, EN0K, EN0L, and EL3C.
The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters. A system reboot will update all SR-IOV shared-mode adapters with the new firmware level. In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced). And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC). To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
- On systems using the PowerVM firmware, a problem was fixed for a Live Partition Mobility migration that resulted in the source managed system going to the management console Incomplete state after the migration to the target system was completed. This problem is very rare and has only been detected once. The problem trigger is that the source partition does not halt execution after the migration to the target system. The management console went to the Incomplete state for the source managed system when it failed to delete the source partition because the partition would not stop running. When this problem occurred, the customer network was running very slowly, and this may have contributed to the failure. The recovery action is to re-IPL the source system, but that will need to be done without the assistance of the management console. For each partition that has an OS running on the source system, shut down the partition from the OS. Then from the Advanced System Management Interface (ASMI), power off the managed system. Alternatively, the system power button may also be used to do the power off. If the management console Incomplete state persists after the power off, the managed system should be rebuilt from the management console. For more information on management console recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
- On systems using PowerVM firmware, a problem was fixed for a shared processor pool partition showing an incorrect zero "Available Pool Processor" (APP) value after a concurrent firmware update. The zero APP value means that no idle cycles are present in the shared processor pool but in this case it stays zero even when idle cycles are available. This value can be displayed using the AIX "lparstat" command. If this problem is encountered, the partitions in the affected shared processor pool can be dynamically moved to a different shared processor pool. Before the dynamic move, the "uncapped" partitions should be changed to "capped" to avoid a system hang. The old affected pool would continue to have the APP error until the system is re-IPLed.
- On systems using PowerVM firmware, a problem was fixed for a latency time of about 2 seconds being added on the target system of a Live Partition Mobility (LPM) migration when there is a latency time check failure. With the fix, in the case of a latency time check failure, a much smaller default latency is used instead of two seconds. This error would not be noticed if the customer system is using an NTP time server to maintain the time.
- On multi-node systems with an incorrect memory configuration of DDR3 and DDR4 DIMMs, a problem was fixed for the IPL hanging for four hours instead of terminating immediately.
- On systems using PowerVM firmware, a rare problem was fixed for a system hang that can occur when dynamically moving "uncapped" partitions to a different shared processor pool. To prevent a system hang, the "uncapped" partitions should be changed to "capped" before doing the move.
- On systems using the PowerVM firmware, support was added for a new utility option for the System Management Services (SMS) menus. This is the SMS SAS I/O Information Utility. It has been introduced to allow a user to get additional information about the attached SAS devices. The utility is accessed by selecting option 3 (I/O Device Information) from the main SMS menu, and then selecting the option for "SAS Device Information".
- On systems using the PowerVM hypervisor firmware and NovaLink, a problem was fixed for a NovaLink installation error where the hypervisor was unable to get the maximum logical memory block (LMB) size from the service processor. The maximum supported LMB size should be 0xFFFFFFFF but in some cases it was initialized to a value that was less than the amount of configured memory, causing the service processor read failure with error code 0x00000134.
- On systems using the PowerVM hypervisor firmware and CAPI adapters, a problem was fixed for CAPI adapter error recovery. When the CAPI adapter goes into the error recovery state, the Memory Mapped I/O (MMIO) traffic to the adapter from the OS continues, disrupting the recovery. With the fix, the MMIO and DMA traffic to the adapter are now frozen until the CAPI adapter is fully recovered. If the adapter becomes unusable because of this error, it can be recovered using concurrent maintenance steps from the HMC, keeping the adapter in place during the repair. The error has a low frequency since it only occurs when the adapter has failed for another reason and needs recovery.
- On systems using the PowerVM hypervisor firmware, when using affinity groups, if the group includes a VIOS, ensure the group is placed in the same drawer where the VIOS physical I/O is located. Prior to this change, if the VIOS was in an affinity group with other partitions, the partition placement could override the VIOS adapter placement rules and the VIOS could end up in a different drawer from the I/O adapters.
- On systems using PowerVM firmware, a problem was fixed to improve error recovery when attempting to boot an iSCSI target backed by a drive formatted with a block size other than 512 bytes. Instead of stopping on this error, the boot attempt fails and then continues with the next potential boot device. Information regarding the reason for the boot failure is available in an error log entry. The 512-byte block size for backing devices of iSCSI targets is a partition firmware requirement. A quick block-size check for a backing device is sketched at the end of this list.
- On systems using PowerVM firmware, a problem was fixed for extra resources being assigned in a Power Enterprise Pool (PEP). This only occurs if all of these things happen:
o Power server is in a PEP pool
o Power server has PEP resources assigned to it
o Power server powered down
o User uses HMC to 'remove' resources from the powered-down server
o Power server is then restarted. It should come up with no PEP resources, but it starts up and shows it still is using PEP resources it should not have.
To recover from this problem, the HMC 'remove' of the PEP resources from the server can be performed again.
- On systems using PowerVM firmware, a problem was fixed for a false thermal alarm in the active optical cables (AOC) for the PCIe3 expansion drawer with SRCs B7006AA6 and B7006AA7 being logged every 24 hours. The AOC cables have feature codes of #ECC6 through #ECC9, depending on the length of the cable. The SRCs should be ignored as they call for the replacement of the cable, cable card, or the expansion drawer module. With the fix, the false AOC thermal alarms are no longer reported.
- On systems using PowerVM firmware that have an attached HMC, a problem was fixed for a Live Partition Mobility migration that resulted in a system hang when an EEH error occurred simultaneously with a request for a page migration operation. On the HMC, it shows an incomplete state for the managed system with reference code A181D000. The recovery action is to re-IPL the source system but that will need to be done without the assistance of the HMC. From the Advanced System Management Interface (ASMI), power off the managed system. Alternatively, the system power button may also be used to do the power off. If the HMC Incomplete state persists after the power off, the managed system should be rebuilt from the HMC. For more information on HMC recovery steps, refer to this IBM Knowledge Center link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
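For the iSCSI boot requirement described earlier in this list, the block size of a backing device could be verified on a Linux host that exports the device with a sketch like the following. The device name is hypothetical, and the check assumes the standard Linux sysfs attribute for the logical block size.
    # Minimal sketch (illustration only): confirm that an iSCSI backing device uses
    # the 512-byte logical block size required by partition firmware for booting.
    from pathlib import Path

    def logical_block_size(device: str) -> int:
        sysfs = Path(f"/sys/block/{device}/queue/logical_block_size")
        return int(sysfs.read_text().strip())

    if __name__ == "__main__":
        dev = "sdb"                        # hypothetical backing device name
        size = logical_block_size(dev)
        if size == 512:
            print(f"/dev/{dev}: 512-byte blocks; meets the partition firmware requirement")
        else:
            print(f"/dev/{dev}: {size}-byte blocks; not usable as a bootable iSCSI backing device")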
|