IBM Support

Power9 System Firmware Fix History - Release levels VL9xx

Fix Readme


Abstract

Firmware History for VL9xx Levels.

Content

VL950
For Impact, Severity, and other firmware definitions, refer to the 'Glossary of firmware terms' page:
https://www.ibm.com/support/pages/node/6555136
VL950_203_045 / FW950.G0
 
2025/12/19
Impact: Security     Severity: HIPER

System firmware changes that affect all systems
  • A security problem was fixed for CVE-2025-38556.
  • A problem was fixed in the Power Hypervisor code where resources assigned to Virtual Functions may be deallocated even when there are configured VF resources for partitions. This could happen during hardware repair or replacement (FRU Repair or Exchange) if the VF resources were not removed from the operating system before starting the procedure.
  • A problem was fixed so that all slots under a PCIe switch continue to be configured when one of the slot initialization operations under that switch fails. The problem occurs only when there is an issue with one or more slots under a PCIe switch during IPL. This fix applies to all PCIe switches in system units and in I/O expansion units.
  • A problem was fixed for a system state of Incomplete on the management console which is not recoverable with the "Recover system" option for the managed system. The problem occurs when a new partition is created on the system but the internal representation of the partition is not completed. The fix allows the system firmware to validate that partition creation is complete and to synchronize inconsistencies in the partition representation.
  • A problem was fixed that may cause a concurrent firmware update to fail and the system to go to Incomplete state on the HMC.
  • A problem was fixed where the hypervisor could allow memory to be over-configured on the system, leading to inconsistent memory values and errors on the management console.
  • A problem was fixed where, if all TPM (trusted platform module) hardware has been marked as failed in the hypervisor, the hypervisor blocks live partition migrations that use the system key and prevents capacity on demand activation codes from being applied.
  • A problem was fixed where the SMS boot position (index) may not populate properly. SMS options: 5 (Select Boot Options) -> 1 (Select/Install boot device) -> 5 (List all Devices). Workaround: SMS options: 5 (Select Boot Options) -> 2 (Configure Boot Device Order) -> [ESC] -> 5 (Select Boot Options) -> 1 (Select/Install boot device) -> 5 (List all Devices).
  • A problem was fixed for a possible system termination with SRC B700F105 after changing the I/O configuration or deleting a partition that is in the failed state with the SRC B2001230.
  • The version of OpenSSL that PowerVM uses has been updated to OpenSSL 3.0.18. This change will mark all SR-IOV Shared Mode adapters as having an update available.
VL950_194_045 / FW950.F1
 
2025/12/19
Impact: Security     Severity: HIPER

System firmware changes that affect all systems
  • A security problem was fixed for CVE-2025-38556.
VL950_192_045 / FW950.F0
 
2025/08/15
Impact: Availability     Severity: ATT

System firmware changes that affect all systems
  • A problem was fixed where B400FF04 SRCs are seen on the system after non-concurrently replacing an SR-IOV adapter while it is in shared mode and has logical ports configured.
  • A problem was fixed where the error message "HSCL146A The I/O adapter: null is not valid." may be displayed by the management console when transitioning an SR-IOV adapter from shared to dedicated mode even though the operation was successful.
  • A problem was fixed for changing the Secure Boot setting for a partition where the Secure Boot setting for the partition will not take effect on the first partition boot after changing the setting. The Secure Boot setting will take effect on subsequent partition boots if the setting remains unchanged. This fix allows for the Secure Boot setting to take effect on the first partition boot after changing the Secure Boot setting.
  • A problem was fixed where link entries may be missing from the PCI configuration displayed from the HMC. As a workaround, the complete PCIe topology may be viewed from ASMI.
  • A change was made to suppress logging TPM failures with SRC B7009009 during TPM health checks.
  • A problem was fixed for a failure or partial completion for a processor config change for a partition. As a workaround, the processor config change can be retried.
  • A security problem was fixed for CVE-2025-36035.
 
VL950_182_045 / FW950.E1
 
2025/09/12
Impact: Security     Severity: HIPER

System firmware changes that affect all systems
  • A security problem was fixed for CVE-2025-36035.
 

VL950_179_045 / FW950.E0
 
2025/04/18
Impact: Security     Severity: HIPER

System firmware changes that affect all systems
  • A security problem was fixed for CVE-2024-13176.
  • A problem was fixed in the firmware for systems that are managed by both an HMC and NovaLink that would prevent one of the management consoles from managing the system. This may cause one of the management consoles to display the system as being in recovery state.
  • A problem was fixed which may prevent the Platform Keystore (PKS) consumer passwords from getting reset during a partition reboot. This issue can be avoided by not disabling PKS while the partition is running. In the event that this problem is encountered, rebooting the partition again while PKS is enabled should solve this problem. If that also does not work, a re-initialization of the partition or a full CEC reboot will reset this state.
  • A problem was fixed for a potential partition hang that could occur after a partition crash during Live Partition Mobility, Dynamic Platform Optimization (DPO), memory guard recovery, or memory mirroring defragmentation operation. As a workaround, the partition with the failure can be rebooted.
  • A problem was fixed where the replacement of a FRU that contains the TPM did not automatically make the new TPM usable by the system on the next IPL. The fix detects the replaced TPM and makes it usable on the next IPL.
  • A problem was fixed where there was no callout identifying the failing TPM in SRC B15050AD. The fix adds a medium-priority callout for the failing TPM.
 
VL950_176_045 / FW950.D1
 
2025/04/18
Impact: Security     Severity: HIPER

System firmware changes that affect all systems
  • A security problem was fixed for CVE-2024-13176.
 
VL950_175_045 / FW950.D0
 
2024/12/19
Impact: Security     Severity: HIPER

System firmware changes that affect all systems
 
  • A security problem was fixed for CVE-2023-45863.
  • A security problem was fixed for CVE-2024-35960.
  • A security problem was fixed for CVE-2024-35888.
  • A security problem was fixed for CVE-2023-52781.
  • A security problem was fixed for CVE-2024-26859.
  • A security problem was fixed for CVE-2023-52686.
  • A security problem was fixed for CVE-2024-26735.
  • A security problem was fixed for CVE-2024-26763.
  • A security problem was fixed for CVE-2024-26688.
  • A security problem was fixed for CVE-2024-26733.
  • A security problem was fixed for CVE-2023-52881.
  • A security problem was fixed for CVE-2024-26659.
  • A security problem was fixed for CVE-2024-26934.
  • A security problem was fixed for CVE-2024-36883.
  • A security problem was fixed for CVE-2023-52486.
  • A security problem was fixed for CVE-2024-26671.
  • A security problem was fixed for CVE-2024-36004.
  • A security problem was fixed for CVE-2024-36008.
  • A security problem was fixed for CVE-2024-26791.
  • A security problem was fixed for CVE-2024-26644.
  • A security problem was fixed for CVE-2023-52615.
  • A security problem was fixed for CVE-2024-26640.
  • A security problem was fixed for CVE-2023-52607.
  • A security problem was fixed for CVE-2023-52457.
  • A security problem was fixed for CVE-2023-52451.
  • A security problem was fixed for CVE-2023-52454.
  • A security problem was fixed for CVE-2023-6356.
  • A security problem was fixed for CVE-2023-6536.
  • A security problem was fixed for CVE-2023-6535.
  • A security problem was fixed for CVE-2024-23850.
  • A security problem was fixed for CVE-2024-0841.
  • A security problem was fixed for CVE-2023-6915.
  • A security problem was fixed for CVE-2023-6932.
  • A problem was fixed related to installing new firmware on platforms with EMX0 PCIe expansion drawers attached. SRC 10009047 and SRC 10009109 may be logged during the installation of new system firmware.
  • A problem was fixed for a Novalink partition where it may not get automatically restarted if it encounters a software issue during initialization.
  • A problem was fixed where an HMC will not properly display a server as being in a PEP2 pool. After a concurrent firmware update is performed to this Service Pack level, the problem can be fixed by rebooting the HMC or resetting the server's HMC connection.
  • The built-in version of OpenSSL that PowerVM uses has been updated to the most current LTS release of OpenSSL3, 3.0.15.
  • A problem was fixed where the certificate generation feature on IBM CertHub requires inclusion of the CN (Common Name) value in the CSR (certificate signing request). The FSP did not prompt for the CN field, so the CSR file generated from the FSP would fail to create a CA-signed SSL certificate. The fix prompts the user to add a CN in ASMI during CSR creation, as required by IBM CertHub.
  • A problem was fixed where the certificate generation feature on IBM CertHub requires inclusion of the OU (organization unit) value in the CSR (certificate signing request). The FSP did not prompt for the OU field, so the CSR file generated from the FSP would fail to create a CA-signed SSL certificate. The fix prompts the user to add an OU in ASMI during CSR creation, as required by IBM CertHub.
  • A problem was fixed where the Redfish API output displayed the PSU input power value as the line input voltage.
  • The handling of a crash reported with SRC B113E504 was updated to isolate the crash to the correct hardware.
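The Redfish PSU fix above concerns two distinct power-supply readings that were being conflated. As an illustrative sketch only (the payload shape follows the DMTF Redfish `Power` schema property names; the values and the parsing code are assumptions, not IBM's implementation), a client should read `PowerInputWatts` for input power and `LineInputVoltage` for voltage:

```python
import json

# Sample payload shaped like the DMTF Redfish Power schema (values are
# invented for illustration, not taken from an actual service processor).
payload = json.loads("""
{
  "PowerSupplies": [
    {"MemberId": "0", "LineInputVoltage": 208, "PowerInputWatts": 742.5},
    {"MemberId": "1", "LineInputVoltage": 207, "PowerInputWatts": 751.0}
  ]
}
""")

def psu_readings(power_resource):
    """Return (input_watts, line_voltage) per PSU, keeping the two
    properties separate -- the bug described above reported one as the other."""
    return [
        (psu.get("PowerInputWatts"), psu.get("LineInputVoltage"))
        for psu in power_resource.get("PowerSupplies", [])
    ]

for watts, volts in psu_readings(payload):
    print(f"input power: {watts} W, line voltage: {volts} V")
```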
VL950_174_045 / FW950.C2

2024/12/19
Impact:  Security      Severity:  HIPER

System firmware changes that affect all systems
  • A security problem was fixed for CVE-2023-45863.
  • A security problem was fixed for CVE-2024-35960.
  • A security problem was fixed for CVE-2024-35888.
  • A security problem was fixed for CVE-2023-52781.
  • A security problem was fixed for CVE-2024-26859.
  • A security problem was fixed for CVE-2023-52686.
  • A security problem was fixed for CVE-2024-26735.
  • A security problem was fixed for CVE-2024-26763.
  • A security problem was fixed for CVE-2024-26688.
  • A security problem was fixed for CVE-2024-26733.
  • A security problem was fixed for CVE-2023-52881.
  • A security problem was fixed for CVE-2024-26659.
  • A security problem was fixed for CVE-2024-26934.
  • A security problem was fixed for CVE-2024-36883.
  • A security problem was fixed for CVE-2023-52486.
  • A security problem was fixed for CVE-2024-26671.
  • A security problem was fixed for CVE-2024-36004.
  • A security problem was fixed for CVE-2024-36008.
  • A security problem was fixed for CVE-2024-26791.
  • A security problem was fixed for CVE-2024-26644.
  • A security problem was fixed for CVE-2023-52615.
  • A security problem was fixed for CVE-2024-26640.
  • A security problem was fixed for CVE-2023-52607.
  • A security problem was fixed for CVE-2023-52457.
  • A security problem was fixed for CVE-2023-52451.
  • A security problem was fixed for CVE-2023-52454.
  • A security problem was fixed for CVE-2023-6356.
  • A security problem was fixed for CVE-2023-6536.
  • A security problem was fixed for CVE-2023-6535.
  • A security problem was fixed for CVE-2024-23850.
  • A security problem was fixed for CVE-2024-0841.
  • A security problem was fixed for CVE-2023-6915.
  • A security problem was fixed for CVE-2023-6932.
VL950_168_045 / FW950.C1

2024/10/25
 
Impact: Security     Severity: HIPER

System firmware changes that affect all systems
  • A security problem was fixed for CVE-2024-45656.
VL950_161_045 / FW950.C0

2024/09/27
Impact: Availability     Severity: ATT

System firmware changes that affect all systems
  • A change was made to a base framework layer of various virtualization features, including SR-IOV. While the change is not expressly needed for SR-IOV, concurrent system firmware updates to this level will cause SR-IOV adapters to indicate that an update is available. Applying the update will cause a re-initialization of the specific SR-IOV adapter environment, causing a brief outage. Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A security problem was fixed for CVE-2024-41781.
  • A problem was fixed for an HMC ExchangeFru operation which may fail with SRC B7006A9E when attempting to repair an EMX0 PCIe3 Expansion Drawer Module. This error only occurs with the right bay in the case where the Low CXP cable has a fault or is improperly plugged. To work around the problem, connect or replace the Low CXP cable and then retry the repair procedure.
  • A problem was fixed in the firmware for the EMX0 PCIe Gen3 I/O expansion drawer calling out cable thermal or power alarms. Most likely System Reference Codes logged can be: SRC B7006A99 SRC B7006AA6 SRC B7006AA7. This fix only pertains to systems with an attached EMX0 PCIe Gen3 I/O expansion drawer having EMXH fanout modules.
  • A problem was fixed for SRC B7006A99 with word4 of 3741412C being logged as a Predictive error calling out cable hardware when no cable replacement is needed. This SRC does not have an impact on PCIe function and will be logged as Informational to prevent unnecessary service actions for the non-functional error.
  • A problem was fixed for expansion drawer serviceable events not including expansion drawer cables in the FRU callout list when the expansion drawer cable may be the source of the problem. The fix changes some uses of SRC B7006A84 to either SRC B7006A85 or SRC B7006A89 to correctly include expansion drawer cables in the FRU callout list.
  • DEFERRED: A problem was fixed in the firmware for the EMX0 PCIe Gen3 I/O expansion drawer calling out cable or other related hardware, possibly leading to link degradation. Most likely System Reference Codes logged can be: SRC B7006A80, SRC B7006A85, SRC B7006A88, SRC B7006A89. This fix only pertains to systems with an attached EMX0 PCIe Gen3 I/O expansion drawer having EMXH fanout modules.
  • A problem was fixed that would cause an LPM to fail due to an insufficient memory for firmware error while deleting a partition on the source system.
  • A problem was fixed for a rare problem creating and offloading platform system dumps. An SRC B7000602 will be created at the time of the failure. The fix allows for platform system dumps to be created and offloaded normally.
  • A problem was fixed where, if TPM hardware communication becomes unstable, it can lead to sporadic LPM (Live Partition Mobility) failures. This fix adds robustness to LPM operations to avoid usage of TPM hardware that is deemed unstable in preference of more stable TPM HW or customer configured PowerVM Trusted System Key.
  • An enhancement was made to provide a daily TPM health check to allow for advance notification of TPM failures so that they can be addressed before performing operations that depend on the TPM, such as LPM and disruptive system dumps. The first two times this daily TPM health check fails, a new informational SRC will be posted: B700900D. After 3 failures, the TPM will be marked as failed and the existing serviceable TPM failure SRC will be posted instead.
  • A problem was fixed where an LPAR posted error log with SRC BA54504D. The problem has been seen on systems where only one core is active.
  • A problem was fixed for possible intermittent shared processor LPAR dispatching delays. The problem only occurs for capped shared processor LPARs or uncapped shared processor LPARS running within their allocated processing units. The problem is more likely to occur when there is a single shared processor in the system. An SRC B700F142 informational log may also be produced.
  • A problem was fixed for a possible system hang during a Dynamic Platform Optimization (DPO), memory guard recovery, or memory mirroring defragmentation operation. The problem only occurs if the operation is performed while an LPAR is running in POWER9 processor compatibility mode.
  • A problem was fixed where ASMI menus were not displayed correctly in all languages.
  • A problem was fixed where the firmware update process failed when the FSP went through a reset/reload due to an FSP boot watchdog timeout error.
  • Support for new TPM cards is added with this change. The codebase continues to support the existing TPMs as well. If a new TPM card (Part Number: 03PK608, CCIN: 6B5A) is installed on a system *without* these code changes, the following error SRCs will be seen:
    BC50070B <-- errors trying to talk to the TPM
    BC50270D <-- error saying that the 75x TPM is not present
    BC502BBE <-- marking the TPM as unusable and calling it out
    BC502BAD <-- Firmware code terminating the system because the TPM is not present, but the system setting "TPM REQUIRED" is enabled
    B15050B8 <-- Callout saying the TPM is not in the system
    B15050AD <-- System boot terminated
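The daily TPM health-check policy described above (two informational SRCs, then a serviceable failure on the third strike) can be sketched as a small state machine. This is an illustrative sketch only: the SRC B700900D and the 3-failure threshold come from the text, while the reset-on-success behavior and all names here are assumptions.

```python
INFO_SRC = "B700900D"   # informational SRC named in the text above
FAIL_THRESHOLD = 3      # third failure marks the TPM as failed, per the text

class TpmHealthTracker:
    """Illustrative sketch of the described 3-strike health-check policy."""

    def __init__(self):
        self.failures = 0
        self.marked_failed = False

    def record_check(self, passed):
        """Record one daily check; return the SRC to post, or None."""
        if passed:
            # Assumption: a passing check resets the failure count.
            self.failures = 0
            return None
        self.failures += 1
        if self.failures < FAIL_THRESHOLD:
            return INFO_SRC                   # first two failures: informational
        self.marked_failed = True
        return "serviceable-TPM-failure"      # placeholder, not a real SRC

tracker = TpmHealthTracker()
print([tracker.record_check(p) for p in (False, False, False)])
```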
     
VL950_149_045 / FW950.B0
 
2024/04/16
 
Impact:  Availability      Severity:  ATT

System firmware changes that affect all systems
  • A problem was fixed where a long running firmware operation involving elastic and trial-based CoD (Capacity on Demand) entitlements may time-out. This results in the server state being set to incomplete on the HMC, which will require a rebuild of the server.
  • A problem was fixed where virtual serial numbers may not all be populated on a system properly when an activation code to generate them is applied. This results in some virtual serial numbers being incorrect or missing.
  • A problem was fixed for an intermittent issue preventing all Power Enterprise Pool mobile resources from being restored after a server power on when both processor and memory mobile resources are in use.  Additionally, a problem was fixed where Power Enterprise Pools mobile resources were being reclaimed and restored automatically during server power on such that resource assignments were impacted.  The problem only impacts systems utilizing Power Enterprise Pools 1.0 resources.
  • A problem was fixed where the target system would terminate with a B700F103 during LPM (Logical Partition Migration). The problem only occurs if there were low amounts of free space on the target system.
  • A problem was fixed for partitions configured to use shared processor mode and set to capped potentially not being able to fully utilize their assigned processing units. To mitigate the issue if it is encountered, the partition processor configuration can be changed to uncapped.
  • A problem was fixed where a bad core is not guarded and repeatedly causes system to crash. The SRC requiring service has the format BxxxE540. The problem can be avoided by replacing or manually guarding the bad hardware.
  • This service pack implements a new Update Access Key (UAK) Policy. See description at https://www.ibm.com/support/pages/node/7131459 .
  • A problem was fixed where CIMP-provided sensor values (for example, the ambient temperature sensor) were not coming back after an FSP reset in the system power off state.
  • A security problem was fixed in service processor firmware by upgrading the curl library to the latest version beyond 8.1.0. The Common Vulnerabilities and Exposures number for this problem is CVE-2023-28322.
  • An enhancement was made to vNIC failover performance. The benefit is gained when a vNIC client unicast MAC address is unchanged during the failover, and is minor compared to overall vNIC failover time.
  • A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware. No specific adapter problems were addressed at this new level. This change updates the adapter firmware to 16.35.2000 for Feature Codes EC67 and EC66 with CCIN 2CF3. If this adapter firmware level is concurrently applied, AIX and VIOS VFs may become failed. Certain levels of AIX and VIOS do not properly handle concurrent SR-IOV updates and can leave the virtual resources in a DEAD state. Please review the following document for further details: https://www.ibm.com/support/pages/node/6997885. A re-IPL of the system instead of concurrently updating the SR-IOV adapter firmware would also work to prevent a VF failure. Update instructions: https://www.ibm.com/docs/en/power9?topic=adapters-updating-sr-iov-adapter-firmware
  • A problem was fixed where service for a processor FRU was requested when no service was actually required. The SRC requiring service has the format BxxxE504 with a PRD Signature description matching (OCC_FIR[45]) PPC405 cache CE. The problem can be ignored unless it is persistently reported on subsequent IPLs. If that occurs, hardware replacement may be required.
 
VL950_145_045 / FW950.A0
2024/01/18
 
Impact:  Data      Severity:  HIPER

System firmware changes that affect all systems
  • HIPER: Power9 servers with an I/O adapter in SR-IOV shared mode and an SR-IOV virtual function assigned to an active Linux partition with 8 GB or less of platform memory may have undetected data loss or data corruption when Dynamic Platform Optimizer (DPO), memory guard recovery, or memory mirroring defragmentation is performed.
  • A security problem was fixed for CVE-2023-46183.
  • A change was made to update the POWER hypervisor version of OpenSSL.
  • A change was made to update the OPAL Linux Kernel to the latest (v5.10.x) long-term stable version.  This only pertains to model ESS 5105-22E.
  • A security problem was fixed for CVE-2023-33851.
  • Updates NVDIMM firmware to address potential persistent data loss in storage-class systems.
  • Improves serviceability for NVDIMM-related errors.  This improvement only pertains to model ESS 5105-22E.
  • Problems were fixed for IBM Storage ESS systems by updating the NVDIMM/BPM firmware image to v4.5/v1.12. These problems include a report of "Not Enough Energy" for Catastrophic Save and a failure to save data after a planned reboot or unplanned power loss. This problem only pertains to model ESS 5105-22E.
  • A problem was fixed for assignment of memory to a logical partition which does not maximize the affinity between processors and memory allocations of the logical partition. This problem can occur when the system is utilizing Active Memory Mirroring (AMM) on a memory-constrained system. This only applies to systems which are capable of using AMM. As a workaround, Dynamic Platform Optimizer (DPO) can be run to improve the affinity.
  • A problem was fixed for Logical Partition Migration (LPM) failures with an HSCLB60C message. The target partition will be rebooted when the failure occurs. This error can occur during the LPM of partitions with a large amount of memory configured (32TB or more) and where an LPM failover has started on one of the connections to a Virtual I/O Server (VIOS) designated as the Mover Service Partitions (MSP).
  • A problem was fixed for an incorrect SRC B7005308 "SRIOV Shared Mode Disabled" error log being reported on an IPL after relocating an SRIOV adapter. This error log calls out the old slot where the SRIOV adapter was before being relocated. This error log occurs only if the old slot is not empty. However, the error log can be ignored as the relocation works correctly.
  • A problem was fixed for transitioning an IO adapter from dedicated to SR-IOV shared mode. When this failure occurs, an SRC B4000202 will be logged. This problem may occur if an IO adapter is transitioned between dedicated and SR-IOV shared mode multiple times on a single platform IPL.
  • A problem was fixed for Logical Partition Migration (LPM) to better handle errors reading/writing data to the VIOS which can lead to a VIOS and/or Hypervisor hang. The error could be encountered if the VIOS crashes during LPM.
  • A problem was fixed that prevents dumps (primarily SYSDUMP files) greater than or equal to 4GB (4294967296 bytes) in size from being offloaded successfully to AIX or Linux operating systems. This problem primarily affects larger dump files such as SYSDUMP files, but could affect any dump that reaches or exceeds 4GB (RSCDUMP, BMCDUMP, etc.). The problem only occurs for systems which are not HMC managed where dumps are offloaded directly to the OS. A side effect of an attempt to offload such a dump will be the continuous writing of the dump file to the OS until the configured OS dump space is exhausted which will potentially affect the ability to offload any subsequent dumps. The resulting dump file will not be valid and can be deleted to free dump space.
  • A problem was fixed for errors reported or partition hangs when using the SMS menu I/O Device Information to list SAN devices. One or more of SRCs BA210000, BA210003, or BA210013 will be logged. As a possible workaround, verify at least one LUN is mapped to each WWPN zoned to the partition. The partition console may display text similar to the following:
Detected bad memory access to address: ffffffffffffffff
Package path = /
Loc-code =
...
Return Stack Trace
------------------
@ - 2842558
ALLOC-FC-DEV-ENTRY - 2a9f4b4
RECORD-FC-DEV - 2aa0a00
GET-ATTACHED-FC-LIST - 2aa0fe4
SELECT-ATTACHED-DEV - 2aa12b0
PROCESS-FC-CARD - 2aa16d4
SELECT-FC-CARD - 2aa18ac
SELECT-FABRIC - 2aae868
IO-INFORMATION - 2ab0ed4
UTILS - 2ab6224
OBE - 2ab89d4
evaluate - 28527e0
invalid pointer - 2a79c4d
invalid pointer - 7
invalid pointer - 7
process-tib - 28531e0
quit - 2853614
quit - 28531f8
syscatch - 28568b0
syscatch - 28568b
 
  • A problem was fixed for fetching the CPU temperature data from HMC energy and thermal metrics.
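The dump-offload fix above names an exact threshold: dumps at or beyond 4 GiB (4294967296 bytes) failed to offload to the OS. A minimal sketch of screening for at-risk dumps, assuming only the threshold and the dump-file prefixes named in the text (the directory-scanning helper and its layout are illustrative assumptions):

```python
import os

DUMP_OFFLOAD_LIMIT = 4 * 1024 ** 3                  # 4 GiB = 4294967296 bytes
DUMP_PREFIXES = ("SYSDUMP", "RSCDUMP", "BMCDUMP")   # dump types named above

def at_risk(size_bytes):
    """True if a dump of this size reaches the pre-fix offload limit."""
    return size_bytes >= DUMP_OFFLOAD_LIMIT

def scan_dump_dir(path):
    """Return names of dump files in `path` that reach the limit.
    The directory layout here is an assumption for illustration only."""
    hits = []
    for name in os.listdir(path):
        if name.startswith(DUMP_PREFIXES):
            full = os.path.join(path, name)
            if at_risk(os.path.getsize(full)):
                hits.append(name)
    return hits

print(at_risk(4294967295))   # one byte under the limit -> False
print(at_risk(4294967296))   # exactly at the limit -> True
```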
VL950_136_045 / FW950.90
2023/09/22
 
Impact:  Availability       Severity:  SPE

System firmware changes that affect all systems
  • A problem was fixed for being unable to make configuration changes for partitions, except to reduce memory to the partitions, when upgrading to a new firmware release.  This can occur on systems with SR-IOV adapters in shared mode that are using most or all the available memory on the system, not leaving enough memory for the PowerVM hypervisor to fit.  As a workaround, configuration changes to the system to reduce memory usage could be made before upgrading to a new firmware release. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for possible performance degradation in a partition when doing Nest Accelerator (NX) GZIP hardware compression.  The degradation could occur if the partition falls back to software-based GZIP compression if a new Virtual Accelerator Switchboard (VAS) window allocation becomes blocked. Only partitions running in Power9 processor compatibility mode are affected. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a Live Partition Mobility (LPM) migration hang that can occur during the suspended phase.  The migration can hang if an error occurs during the suspend process that is ignored by the OS.  This problem rarely happens as it requires an error to occur during the LPM suspend. To recover from the hang condition, IBM service can be called to issue a special abort command, or, if an outage is acceptable, the system or VIOS partitions involved in the migration can be rebooted. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a possible shared processor partition becoming unresponsive or having reduced performance. This problem only affects partitions using shared processors.  As a workaround, partitions can be changed to use dedicated processors.  If a partition is hung with this issue, the partition can be rebooted to recover. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed that causes slot power on processing to occur a second time when the slot is already powered on.  The second slot power-on can occur in certain cases and is not needed.  There is a potential for this behavior to cause a failure in older adapter microcode. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for SRC B7006A99 being logged as a Predictive error calling out cable hardware when no cable replacement is needed. This SRC does not have an impact on PCIe function and will be logged as Informational to prevent unnecessary service actions for the non-functional error. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an extra IFL (Integrated Facility for Linux) proc resource being available during PEP 2.0 throttling. This issue can be triggered by the following scenario for Power Enterprise Pools 2.0 (PEP 2.0), also known as Power Systems Private Cloud with Shared Utility Capacity: PEP 2.0 throttling has been engaged and IFL processors are being used in the environment. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for inconsistencies in the link status LED to help with the service of faulty cables using the link activity lights.  With the fix, LEDs are now “all or none”. If one lane or more is active in the entire link where the link spans both cables, then both link activity LEDs are activated. If zero lanes are active (link train fail), then the link activity LEDs are off. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a boot failing from the SMS menu if a network adapter has been configured with VLAN tags.  This issue can be seen when a VLAN ID is used during a boot from the SMS menu and if the external network environment, such as a switch, triggers incoming ARP requests to the server. This problem can be circumvented by not using VLAN ID from the SMS menu.  After the install and boot, VLAN can be configured from the OS. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a bad format of a PEL reported by SRC BD802002.  In this case, the malformed log will be a Partition Firmware created SRC of BA28xxxx (RTAS hardware error), BA2Bxxxx (RTAS non-hardware error), or BA188001 (EEH Temp error) log. No other log types are affected by this error condition. This problem occurs any time one of the affected SRCs is created by Partition Firmware.  These are hidden informational logs used to provide supplemental FFDC information so there should not be a large impact on system users by this problem. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for DLPAR removes of embedded I/O (such as integrated USB) that fail.  An SRC BA2B000B hidden log will also be produced because of the failure. This error does not impact DLPAR remove of slot-based (hot-pluggable) I/O. Any attempt to DLPAR remove embedded I/O will trigger the issue and result in a DLPAR failure. This problem does not pertain to model ESS 5105-22E.
  • Problems were fixed for IBM Storage ESS systems by updating the NVDIMM/BPM firmware image to v4.4/v1.11.  These problems include a report of "Not Enough Energy" for Catastrophic Save and a failure to save data after a planned reboot or unplanned power loss. This problem only pertains to model ESS 5105-22E.
  • A problem was fixed for the total hardware uptime on the ASMI power on/off system page being incorrect. For a system that has run for a longer time (more than 30 days), the uptime value overflows and resets to 0 before counting up again. With the fix, the internal 32-bit counter has been increased to 64 bits to prevent the overflow condition.
  • A problem was fixed for SRC 110015x1 for a current share fault calling out a power supply for replacement. For this SRC, the power supply does not need to be replaced or serviced, so this fix changes the SRC to be informational instead of a serviceable event.  As a workaround, this SRC can be ignored.
  • A problem was fixed for an incorrect “Current hardware uptime” being displayed on the backup FSP ASMI welcome screen. Since this value cannot be maintained by the backup FSP, the field has been removed from the backup FSP with the fix. The “Current hardware uptime” value is shown correctly on the primary FSP ASMI welcome screen.
  • A problem was fixed for a missing hardware callout for NVMe drives that are having a temperature failure (failure to read temperature or over temperature).
 
 
VL950_131_045 / FW950.80
2023/05/26
 
Impact:  Data       Severity:  HIPER
 
New Features and Functions
  • Support added for a CX-5 InfiniBand/VPI adapter in PCIe form factor with Feature Code #AJP1 for model 5105-22E.  This provides InfiniBand at EDR 100 Gb and Ethernet at 100 GbE. This support pertains to model ESS 5105-22E only.

System firmware changes that affect all systems
 
  • HIPER/Pervasive:  AIX logical partitions that own virtual I/O devices or SR-IOV virtual functions may have data incorrectly written to platform memory or an I/O device, resulting in undetected data loss when Dynamic Platform Optimizer (DPO), predictive memory deconfiguration occurs, or memory mirroring defragmentation is performed. To mitigate the risk of this issue, please install the latest FW950 service pack (FW950.80 or later). This problem does not pertain to model ESS 5105-22E.
  • A security problem was fixed for a scenario where the IBM PowerVM Hypervisor could allow an attacker to obtain sensitive information if they gain service access to the HMC.  The Common Vulnerabilities and Exposures number for this problem is CVE-2023-25683. This problem does not pertain to model ESS 5105-22E.
  • A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware.  This update contains important reliability improvements and security hardening enhancements.  This change updates the adapter firmware to XX.34.1002 for the following Feature Codes and CCIN: #EC66/EC67 with CCIN 2CF3.  If this adapter firmware level is concurrently applied, AIX and VIOS VFs may become failed.  Certain levels of AIX and VIOS do not properly handle concurrent SR-IOV updates and can leave the virtual resources in a DEAD state.  Please review the following document for further details:  https://www.ibm.com/support/pages/node/6997885.
  • A problem was fixed for an SR-IOV virtual function (VF) failing to configure for a Linux partition.  This problem can occur if an SR-IOV adapter that had been in use on prior activation of the partition was removed and then replaced with an SR-IOV adapter VF with a different capacity.  As a workaround, the partition with the failure can be rebooted. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a timeout occurring for an SR-IOV adapter firmware LID load during an IPL, with SRC B400FF04 logged.  This problem can occur if a system has a large number of SR-IOV adapters to initialize.  The system recovers automatically when the boot completes for the SR-IOV adapter.  This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for the ASMI "Real-time Progress Indicator" not refreshing automatically to show the new progress codes.  The ASMI must be refreshed manually to show the new progress codes during the IPL.
  • A problem was fixed for a system failing an IPL with SRC B700F10A but not calling out the processor with the TOD error.  This happens whenever the PowerVM hypervisor does a TI checkstop due to a TOD error.  As a workaround, the bad processor must be guarded or replaced to allow the system to IPL. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a current sharing issue in the power supplies with SRC 110015x1 calling out a power supply for replacement when other "cracked capacitor" power supplies are present in the system that are the cause of the failure.  Since the system can still be run in this scenario, the SRC is being changed to Informational.  The frequency of this problem is low.
    • As a workaround, the problem can either be ignored or the bad power supply can be replaced with assistance from IBM support.
  • A problem was fixed for the Redfish (REST) API not returning data.  The REST API to gather power usage for all nodes in watts and the ambient temperature for the system does not return the data.  The new schema IBMEnterpriseComputerSystem.v1_1_0.json is missing, causing the Redfish GETs to fail.
  • A problem was fixed for unexpected vNIC failovers that can occur if all vNIC backing devices are in LinkDown status.  This problem is very rare and only occurs if both vNIC server backing devices are in LinkDown, causing vNIC failovers that bounce back and forth in a loop until one of the vNIC backing devices comes to Operational status. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for the HMC Repair and Verify (R&V) procedure failing during concurrent maintenance of the #EMX0 Cable Card. This problem can occur if a partition is IPLed after a hardware failure before attempting the R&V operation.   As a workaround, the R&V can be performed with the affected partition powered off or the system powered off. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a possible incomplete state for the HMC-managed system with SRCs B17BE434 and B182953C logged, with the PowerVM hypervisor hung.  This error can occur if a system has a dedicated processor partition configured to not allow processor sharing while active. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for incomplete descriptions for the display of devices attached to the FC adapter in SMS menus.  The FC LUNs are displayed using this path in SMS menus:  "SMS->I/O Device Information -> SAN-> FCP-> <FC adapter>".  This problem occurs if there are LUNs in the SAN that are not OPEN-able, which prevents the detailed descriptions from being shown for that device. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an HMC lpar_netboot error for a partition with a VNIC configuration.  The lpar_netboot logs show a timeout due to a missing value.  As a workaround, doing the boot manually in SMS works.  The lpar_netboot could also work as long as broadcast bootp is not used, but instead use lpar_netboot with a standard set of parameters that include Client, Server, and Gateway IP addresses. This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an IBM i partition dump failing with an SRC B2008105.  This may happen on IBM i partitions running v7r4 or newer and running with more than 64 virtual processors. It requires at least one DLPAR remove of a virtual processor followed by a partition dump sometime afterward.  The problem can be avoided if DLPAR remove of virtual processors is not performed for the IBM i partition.
    • If the problem is encountered, either the fix can be installed and the dump retried, or if the fix is not installed, the partition dump can be retried repeatedly until it succeeds. This problem does not pertain to model ESS 5105-22E.
 
VL950_124_045 / FW950.71
 
2023/05/17

Impact: Security     Severity: HIPER

System Firmware changes that affect all systems

  • HIPER/Pervasive: An internally discovered vulnerability in PowerVM on Power9 and Power10 systems could allow an attacker with privileged user access to a logical partition to perform an undetected violation of the isolation between logical partitions which could lead to data leakage or the execution of arbitrary code in other logical partitions on the same physical server. The Common Vulnerability and Exposure number is CVE-2023-30438. For additional information refer to  https://www.ibm.com/support/pages/node/6987797

  • A problem was identified internally by IBM related to SRIOV virtual function support in PowerVM.  An attacker with privileged user access to a logical partition that has an assigned SRIOV virtual function (VF) may be able to create a Denial of Service of the VF assigned to other logical partitions on the same physical server and/or undetected arbitrary data corruption.  The Common Vulnerability and Exposure number is CVE-2023-30440.

VL950_119_045 / FW950.70

2023/02/15

Impact: Data   Severity:  HIPER

System firmware changes that affect all systems

  • HIPER/Pervasive:  If a partition running in Power9 compatibility mode encounters memory errors and a Live Partition Mobility (LPM) operation is subsequently initiated for that partition, undetected data corruption within GZIP operations (via hardware acceleration) may occur within that specific partition.

    This problem does not pertain to model ESS 5105-22E.
  • HIPER/Pervasive:  If a partition running in Power9 mode encounters an uncorrectable memory error during a Dynamic Platform Optimization (DPO), memory guard, or memory mirroring defragmentation operation, undetected data corruption may occur in any partition(s) within the system or the system may terminate with SRC B700F105.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for performance slowdowns that can occur during the Live Partition Mobility (LPM) migration of a partition in POWER9, POWER10, or default processor compatibility modes. For this to happen to a partition in default processor compatibility mode, it must have booted on a Power10 system.  If this problem occurs, the performance will return to normal after the partition migration completes.  As a workaround, the partition to be migrated can be put into POWER9_base processor compatibility mode or older.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for FSP slowness or system failing to IPL with SRC B1812624 errors logged.  This may occur if IPMI is used to request CPU temperatures when the On-Chip Controller is not available.  This would be the case if the IPMI requests were made while the system was powered down.
  • A problem was fixed for a processor core not being called out and guarded if a recoverable core error recovery fails and triggers a system checkstop.  This happens only if core error recovery fails with a core unit checkstop.
  • For a system with I/O Enlarged Capacity enabled, and greater than 8 TB of memory, and having an adapter in SR-IOV shared mode, a problem was fixed for partition or system termination for a failed memory page relocation.  This can occur if the SR-IOV adapter is assigned to a VIOS and virtualized to a client partition and then does an I/O DMA on a section of memory greater than 2 GB in size.  This problem can be avoided by not enabling "I/O Enlarged Capacity".
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an SR-IOV adapter showing up as "n/a" on the HMC's Hardware Virtualized I/O menu.  This is an infrequent error that can occur if an I/O drawer is moved to a different parent slot.  As a workaround, the PowerVM Hypervisor NVRAM can be cleared or the I/O drawer can be moved back to the original parent slot to clean up the configuration.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for too frequent callouts for repair action for recoverable errors for Predictive Error (PE) SRCs B7006A72, B7006A74, and B7006A75.  These SRCs for PCIe correctable error events called for a repair action but the threshold for the events was too low for a recoverable error that does not impact the system.  The threshold for triggering the PE SRCs has been increased for all PLX and non-PLX switch correctable errors.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for not being able to reduce partition memory when the PowerVM hypervisor has insufficient memory for normal operations.  With the fix, a partition configuration change to reduce memory is allowed when the hypervisor has insufficient memory.  A possible workaround for this error is to free up system memory by deleting a partition.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an incorrect capacity displayed for a Fibre Channel device using SMS option "I/O Device Information".  This happens every time for a device that has a capacity greater than 2 TB.  For this case, the capacity value displayed may be significantly less than 2 TB.   For example, a 2 TB device would be shown as having a capacity of 485 GB.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a partition firmware data storage error with SRC BA210003 logged or for a failure to locate NVMe target namespaces when attempting to access NVMe devices over Fibre Channel (FC-NVME) SANs connected to third-party vendor storage systems.  This error condition, if it occurs, prevents firmware from accessing NVMe namespaces over FC as described in the following scenarios:
     1) Boot attempts from an NVMe namespace over FC using the current SMS bootlist could fail.
     2) From SMS menus via option 3 - I/O Device Information - no devices can be found when attempting to view NVMe over FC devices.
     3) From SMS menus via option 5 - Select Boot Options - no bootable devices can be found when attempting to view and select an NVMe over FC bootable device for the purpose of boot, viewing the current device order, or modifying the boot device order.
    The trigger for the problem is the attempted access of NVMe namespaces over Fibre Channel SANs connected to storage systems via one of the scenarios listed above.  The frequency of this problem can be high for some of the vendor storage systems.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a failed NIM download/install of OS images that are greater than 32M.  This only happens when using the default TFTP block size of 512 bytes.  The latest versions of AIX are greater than 32M in size and can have this problem.  As a workaround, in the SMS menu, change "TFTP blocksize" from 512 to 1024. To do this, go to the SMS "Advanced Setup: BOOTP" menu option when setting up NIM install parameters.  This will allow a NIM download of an image up to 64M.
    This problem does not pertain to model ESS 5105-22E.
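The 32M and 64M limits follow from classic TFTP itself: RFC 1350 numbers data blocks with a 16-bit field, so a non-wrapping transfer tops out at roughly blocksize × 65535 bytes. A quick illustration of the arithmetic (not part of the SMS firmware):

```python
# Why the TFTP block size limits the NIM image size: classic TFTP (RFC 1350)
# numbers data blocks with a 16-bit field, so at most 65535 blocks can be
# transferred without block-number wraparound.

def max_tftp_file_bytes(blocksize: int) -> int:
    """Largest file a non-wrapping TFTP transfer can carry."""
    return blocksize * 65535

print(max_tftp_file_bytes(512) // (1024 * 1024))   # just under 32 MiB
print(max_tftp_file_bytes(1024) // (1024 * 1024))  # just under 64 MiB
```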
  • A problem was fixed for a security scan with NSFOCUS reporting the following low-priority vulnerabilities:
    1. Low. Web server enabled "options"
    2. Low. Response no "Referrer-Policy" header
    3. Low. Response no "X-Permitted-Cross-Domain-Policies" header
    4. Low. Response no "X-Download-Options" header
    5. Low. Response no "Content-Security-Policy" header
    There is no impact to the system from these as the FSP service processor does not provide any features which can be exploited by the five vulnerabilities.
  • A problem was fixed for a security scan with NSFOCUS reporting a medium-level vulnerability for a slow HTTPS request denial of service attack against ASMI.  This occurs whenever NSFOCUS scans are run.
  • Support for using a Redfish (REST) API to gather power usage for all nodes in watts and the ambient temperature for the system.
    Redfish sample response is as shown below:
    ==>> GET redfish/v1/Systems/<>
    ...
        "Oem": {
            "IBMEnterpriseComputerSystem": {
                ...
                ...
                "PowerInputWatts" : <> ( number in watts),  <<<<============
                "AmbientTemp" : <> (number in Celsius) <<<<============
            }
        },
    ...
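As a rough illustration of consuming these fields, the sketch below parses a hypothetical response payload; only the Oem structure mirrors the schema shown above, and the numeric values are made up:

```python
# Extracting the new Oem fields from a Redfish GET /redfish/v1/Systems/<id>
# response. The payload here is a hypothetical sample for illustration.
import json

sample_response = json.loads("""
{
  "Oem": {
    "IBMEnterpriseComputerSystem": {
      "PowerInputWatts": 1450,
      "AmbientTemp": 24
    }
  }
}
""")

oem = sample_response["Oem"]["IBMEnterpriseComputerSystem"]
print(f"Power input: {oem['PowerInputWatts']} W")
print(f"Ambient temperature: {oem['AmbientTemp']} C")
```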


System firmware changes that affect certain systems

  • For a system with an IBM i partition, a problem was fixed for the IBM i 60-day "Trial 5250" function not working.  The "Trial 5250" is only needed for the case of an incomplete system order that results in the IBM i 100% 5250 feature being missing.  Since the "Trial 5250" is temporary anyway and valid for only 60 days, an order for the permanent 5250 feature is needed to fully resolve the problem.
    This problem does not pertain to model ESS 5105-22E.
VL950_111_045 / FW950.60

2022/10/20

Impact: Availability   Severity:  SPE

System firmware changes that affect all systems

  • A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware.  No specific adapter problems were addressed at this new level.  This change updates the adapter firmware to XX.32.1010 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC66/EC67 with CCIN 2CF3. If this adapter firmware level is concurrently applied, AIX and VIOS VFs may become failed. To prevent the VF failure, the VIOS and AIX partitions must have the fix for IJ44288 (or a sibling APAR) applied prior to concurrently updating SR-IOV adapter firmware. AIX/VIOS SPs Spring 2023 will ship this fix.  Until then, interim fixes (ifixes) are available from https://aix.software.ibm.com/aix/efixes/ij44288/ or by calling IBM support if an ifix is required for a different level. A re-IPL of the system instead of concurrently updating the SR-IOV adapter firmware would also work to prevent a VF failure.
    This problem does not pertain to model ESS 5105-22E. Please review the following document for further details:  https://www.ibm.com/support/pages/node/6997885
  • Security problems were fixed for vTPM 1.2 by updating its OpenSSL library to version 0.9.8zh.  Security vulnerabilities CVE-2022-0778, CVE-2018-5407, CVE-2014-0076, and CVE-2009-3245 were addressed.  These problems only impact a partition if vTPM version 1.2 is enabled for the partition.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an intermittent service processor core dump for MboxDeviceMsg with SRCs B1818601 and B6008601 logged while the system is running.  This is a timing failure related to a double file close on an NVRAM file.  The service processor will automatically recover from this error with no impact on the system.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an SR-IOV adapter in shared mode failing on an IPL with SRC B2006002 logged.  This is an infrequent error caused by a different SR-IOV adapter than expected being associated with the slot because of the same memory buffer being used by two SR-IOV adapters.  The failed SR-IOV adapter can be powered on again and it should boot correctly.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an SR-IOV adapter in shared mode failing during run time with SRC B400FF04 or B400F104 logged.  This is an infrequent error and may result in a temporary loss of communication as the affected SR-IOV adapter is reset to recover from the error.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a system crash with a B700F103 logged after a local core checkstop of a core with a running partition.  This infrequent error also requires a configuration change on the system like changing the processor configuration of the affected partition or running Dynamic Platform Optimizer (DPO).
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a rare system hang that can happen any time Dynamic Platform Optimizer (DPO), memory guard recovery, or memory mirroring defragmentation occurs for a dedicated processor partition running in Power9 or Power10 processor compatibility mode. This does not affect partitions in Power9_base or older processor compatibility modes. If the partition has the "Processor Sharing" setting set to "Always Allow" or "Allow when partition is active", it may be more likely to encounter this than if the setting is set to "Never allow" or "Allow when partition is inactive".
    This problem can be avoided by using Power9_base processor compatibility mode for dedicated processor partitions. This can also be avoided by changing all dedicated processor partitions to use shared processors.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a partition with VPMEM failing to activate after a system IPL with SRC B2001230 logged for a "HypervisorDisallowsIPL" condition.  This problem is very rare and is triggered by the partition's hardware page table (HPT) being too big to fit into a contiguous space in memory.  As a workaround, the problem can be averted by reducing the memory needed for the HPT.  For example, if the system memory is mirrored, the HPT size is doubled, so turning off mirroring is one option to save space.  Or the size of the VPMEM LUN could be reduced.  The goal of these options would be to free up enough contiguous blocks of memory to fit the partition's HPT size.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a rare partition hang that can happen any time Dynamic Platform Optimizer (DPO), memory guard recovery, or memory mirroring defragmentation occurs for a shared processor partition running in any compatibility mode if there is also a dedicated processor partition running in Power9 or Power10 processor compatibility mode.  This does not happen if the dedicated partition is in Power9_base or older processor compatibility modes. Also, if the dedicated partition has the "Processor Sharing" setting set to "Always Allow" or "Allow when partition is active", it may be more likely to cause a shared processor partition to hang than if the setting is set to "Never allow" or "Allow when partition is inactive".
    This problem can be avoided by using Power9_base processor compatibility mode for any dedicated processor partitions. This problem can also be avoided by changing all dedicated processor partitions to use shared processors.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for booting an OS using iSCSI from SMS menus that fails with a BA010013 information log.  This failure is intermittent and infrequent.  If the contents of the BA010013 are inspected, the following messages can be seen embedded within the log:
    " iscsi_read: getISCSIpacket returned ERROR"
    " updateSN: Old iSCSI Reply - target_tag, exp_tag"
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for the SMS menu option "I/O Device Information".  When using a partition's SMS menu option "I/O Device Information" to list devices under a physical or virtual Fibre Channel adapter, the list may be missing or entries in the list may be confusing. If the list does not display, the following message is displayed:
    "No SAN adapters present.  Press any key to continue".
    An example of a confusing entry in a list follows:
    "Pathname: /vdevice/vfc-client@30000004
    WorldWidePortName: 0123456789012345
     1.  500173805d0c0110,0                 Unrecognized device type: c"
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a memory leak in the service processor (FSP) that can result in an out of memory (OOM) condition in the FSP kernel with an FSP dump and reset of the FSP.  This can occur after the FSP has been active for more than 80 days of uptime.  If the problem occurs, the system automatically recovers with a reset/reload of the FSP.
  • A problem was fixed for too frequent callouts for repair action for recoverable errors for SRCs B7006A72, B7006A74, and B7006A75.   The current threshold limit for the switch correctable errors is 5 occurring in 10 minutes, which is too low for a predictive event that requests a part replacement.  With the fix, the threshold value for calling out a part replacement is increased to match what is done for the PCIe Host Bridge (PHB) correctable errors.  Every correctable error threshold condition on the switch link triggers the too frequent callouts.
    This problem does not pertain to model ESS 5105-22E.
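The windowed thresholding described here (for example, 5 correctable errors within 10 minutes) can be sketched generically as follows; this is an illustrative model, not IBM's firmware code:

```python
# Generic sliding-window error threshold: trip when `limit` events land
# inside a `window_seconds` span, as in "5 correctable errors in 10 minutes".
from collections import deque

class ErrorThreshold:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # timestamps of recent correctable errors

    def record(self, now: float) -> bool:
        """Record an error at time `now`; return True if the threshold trips."""
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.limit

t = ErrorThreshold(limit=5, window_seconds=600)
hits = [t.record(ts) for ts in (0, 60, 120, 180, 240)]
print(hits[-1])  # the fifth error within 10 minutes trips the threshold: True
```

Raising the threshold (a larger `limit` or shorter `window_seconds`) is what reduces spurious repair callouts for recoverable errors.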
  • A problem was fixed for a service processor FSP kernel panic dump and reset/reload that can occur if there is a network configuration error when using ASMI to change the network.  The SRCs B1817201 and B1817212 are logged prior to the dump.  This problem only occurs when changing the network configuration to an incorrect setting that causes a network timeout.


System firmware changes that affect certain systems

  • On a system with no HMC and a serially attached terminal, a problem was fixed for an intermittent service processor core dump for NetsVTTYServer with B181D30B logged that can occur when using the terminal console for the OS.  This error causes the console to be lost, but it can be recovered by doing a soft reset of the service processor.
    This problem does not pertain to model ESS 5105-22E.
VL950_105_045 / FW950.50

2022/07/29

Impact: Availability    Severity:  HIPER

New Features and Functions

  • Support for the 1.2, 3.2, and 6.4 TB 15mm SSD PCIe4 NVMe U.2 module for systems with AIX/Linux and IBM i partitions.   For AIX, Linux, or VIOS, the SSD is formatted in 4096-byte sectors.  For IBM i, the SSD is formatted in 4160-byte sectors.  The 3.2 and 6.4 TB sizes are considered hot PCIe adapters, so fan speeds are increased when these are installed.  These adapters have the following Feature Codes and CCINs:
    #ES3B/#ES3C with CCIN 5B52: 1.2 TB with #ES3B for AIX, Linux and VIOS and #ES3C for IBM i.
    #ES3D/#ES3E with CCIN 5B51: 3.2 TB with #ES3D for AIX, Linux and VIOS and #ES3E for IBM i.
    #ES3F/#ES3G with CCIN 5B50:  6.4 TB with #ES3F for AIX, Linux and VIOS and #ES3G for IBM i.
    This support only pertains to models S914 (9009-41G), S922 (9009-22G), S924 (9009-42G), H922 (9223-22S), and H924 (9223-42S).

System firmware changes that affect all systems

  • HIPER/Non-Pervasive: The following problems were fixed for certain SR-IOV adapters in shared mode when the physical port is configured for Virtual Ethernet Port Aggregator (VEPA):
    1) A security problem for CVE-2022-34331 was addressed where switches configured to monitor network traffic for malicious activity are not effective because of errant adapter configuration changes. The misconfigured adapter can cause network traffic to flow directly between the VFs and not out the physical port hence bypassing any possible monitoring that could be configured in the switch.
    2) Packets may not be forwarded after a firmware update, or after certain error scenarios which require an adapter reset. Users configuring or using VEPA mode should install this update. 
    These fixes pertain to adapters with the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CEC; and #EC66/#EC67 with CCIN 2CF3.
    Update instructions: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a rare service processor core dump for NetsCommonMsgServer with SRC B1818611 logged that can occur when doing an AC power-on of the system.  This error does not have a system impact beyond the logging of the error as an auto-recovery happens.
  • A problem was fixed for the wrong IBM part number (PN) being displayed in inventory reports and callouts for the 16GB-based Samsung 128 GB DIMM with IBM part number 78P7468 and Samsung part number: M393AAG40M32-CAE. The PN 78P7468 should be shown for the Samsung memory DIMM instead of PN 78P6925 which is specific to the Hynix 128GB memory DIMM.
  • A problem was fixed for an apparent hang in a partition shutdown where the HMC is stuck in a status of "shutting down" for the partition.  This infrequent error is caused by a timing window during the system or partition power down where the HMC checks too soon and does not see the partition in the "Powered Off" state. However, the power off of the partition does complete even though the HMC does not acknowledge it.  This error can be recovered by rebuilding the HMC representation of the managed system by following the below steps:
    1) In the navigation area on the HMC, select Systems Management > Servers.
    2) In the contents pane, select the required managed system.
    3) Select Tasks > Operations > Rebuild.
    4) Select Yes to refresh the internal representation of the managed system.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a hypervisor task failure with SRC B7000602 logged when running debug macro "sbdumptrace -sbmgr -detail 2" to capture diagnostic data.  The secure boot trace buffer is not aligned on a 16-byte boundary in memory which triggers the failure. With the fix, the hypervisor buffer dump utility is changed to recognize 8-byte aligned end of buffer boundaries.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for Predictive Error (PE) SRCs B7006A72 and B7006A74 being logged too frequently. These SRCs for PCIe correctable error events called for a repair action but the threshold for the events was too low for a recoverable error that does not impact the system. The threshold for triggering the PE SRCs has been increased.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a system crash with SRC B7000103 that can occur when adding or removing FRUs from a PCIe3 expansion drawer (Feature Code #EMX0). This error is caused by a very rare race scenario when processing multiple power alerts from the expansion drawer at the same time.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an HMC incomplete state for the managed system after a concurrent firmware update.  This is an infrequent error caused by an HMC query race condition while the concurrent update is rebooting tasks in the hypervisor.  A system re-IPL is needed to recover from the error.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an On-Chip Controller (OCC) and a Core Management Engine (CME) boot failure during the IPL with SRC BC8A090F and a RC_STOP_GPE_INIT_TIMEOUT error logged.  This is an intermittent IPL failure. The system can be recovered by retrying the IPL.  This fix reduces the frequency of the error, but it may still rarely occur.  If it does occur, the retry of the IPL will be successful to recover the system.
  • A problem was fixed for a failed correctable error recovery for a DIMM that causes a flood of SRC BC81E580 error logs and also can prevent dynamic memory deallocation from occurring for a hard memory error. This is a very rare problem caused by an unexpected number of correctable error symbols for the DIMM in the per-symbol counter registers.
VL950_099_045 / FW950.40

2022/05/06

Impact: Security    Severity:  HIPER

New Features and Functions

  • Support has been added for a new 16GB-based Samsung 128 GB DIMM with IBM part number 78P7468 and Samsung part number: M393AAG40M32-CAE.  This is replacing the 128GB Hynix with IBM part number 78P6925 for new memory installs.

System firmware changes that affect all systems

  • HIPER/Non-Pervasive: A problem was fixed for a flaw in OpenSSL TLS which can lead to an attacker being able to compute the pre-master secret in connections that have used a Diffie-Hellman (DH) based ciphersuite. In such a case this would result in the attacker being able to eavesdrop on all encrypted communications sent over that TLS connection.  OpenSSL supports encrypted communications via the Transport Layer Security (TLS) and Secure Sockets Layer (SSL) protocols.  With the fix, the service processor Lighttpd web server is changed to only use a strict cipher list that precludes the use of the vulnerable ciphersuites.  The Common Vulnerability and Exposure number for this problem is CVE-2020-1968.
  • A change was made to disable Service Location Protocol (SLP) by default: SLP is now disabled on newly shipped systems, is disabled by a reset to manufacturing defaults on all systems, and is disabled on all systems when this fix is applied by a firmware update.  The SLP configuration change has been made to reduce memory usage on the service processor by disabling a service that is not needed for normal system operations.  If SLP does need to be enabled, the setting can be changed using ASMI with the options "ASMI -> System Configuration -> Security -> External Services Management".  Without this fix, resetting to manufacturing defaults from ASMI does not change the SLP setting that is currently active.
  • A problem was fixed for ASMI TTY menus allowing an unsupported change in hypervisor mode to OPAL.  This causes an IPL failure with BB821410 logged if OPAL is selected.  The hypervisor mode is not user-selectable in POWER9 and POWER10.  Instead, the hypervisor mode is determined by the MTM of the system. With this fix, the "Firmware Configuration" option in ASMI TTY menus is removed so that it matches the options given by the ASMI GUI menus.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for correct ASMI passwords being rejected when accessing ASMI using an ASCII terminal with a serial connection to the server.  This problem always occurs for systems at firmware level FW950.10 and later.
  • A problem was fixed for a flaw in OpenSSL certificate parsing that could result in an infinite loop in the hypervisor, causing a hang in a Live Partition Mobility (LPM) target partition.  The trigger for this failure is an LPM migration of a partition with a corrupted physical trusted platform module (pTPM) certificate.
    This is expected to be a rare problem.  The Common Vulnerability and Exposure number for this problem is CVE-2022-0778.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a partition with an SR-IOV logical port (VF) having a delay in the start of the partition. If the partition boot device is an SR-IOV logical port network device, this issue may result in the partition failing in boot with SRCs BA180010 and BA155102 logged and then stuck on progress code SRC 2E49 for an AIX partition.  This problem is infrequent because it requires multiple error conditions at the same time on the SR-IOV adapter.  To trigger this problem, multiple SR-IOV logical ports for the same adapter must encounter EEH conditions at roughly the same time such that a new logical port EEH condition is occurring while a previous EEH condition's handling is almost complete but not notified to the hypervisor yet.  To recover from this problem, reboot the partition.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for errors that can occur if doing a Live Partition Mobility (LPM)  migration and a Dynamic Platform Optimizer (DPO) operation at the same time.  The migration may abort or the system or partition may crash.  This problem requires running multiple migrations and DPO at the same time.  As a circumvention, do not use DPO while doing LPM migrations.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a secondary fault after a partition creation error that could result in a Terminate Immediate (TI) of the system with an SRC B700F103 logged.  The failed creation of partitions can be explicit or implicit which might trigger the secondary fault.  One example of an implicit partition create is the ghost partition created for a Live Partition Mobility (LPM) migration.  This type of partition can fail to create when there is insufficient memory available for the hardware page table (HPT) for the new partition.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a partition reboot recovery for an adapter in SR-IOV shared mode that rebooted with an SR-IOV port missing.  Prior to the reboot, this adapter had SR-IOV ports that failed and were removed after multiple adapter faults.  This problem should only occur rarely as it requires a sequence of multiple faults on an SR-IOV adapter in a short time interval to force the SR-IOV Virtual Function (VF) into the errant unrecoverable state.  The missing SR-IOV port can be recovered for the partition by doing a remove and add of the failed adapter with DLPAR, or the system can be re-IPLed.
    This problem does not pertain to model ESS 5105-22E.
  • The following problems were fixed for certain SR-IOV adapters:
    1) A problem was fixed for certain SR-IOV adapters that occurs during a VNIC failover where the VNIC backing device has a physical port down due to an adapter internal error with an SRC B400FF02 logged.  This is an improved version of the fix delivered in earlier service pack FW950.10 for adapter firmware level 11.4.415.37 and it significantly reduces the frequency of the error being fixed.
    2) A problem was fixed for an adapter issue where traffic does not flow on a VF when the VF is configured with a PVID set to zero and OS VLAN tagging is used, on a physical port where a VF with a PVID set to the same VLAN ID already exists. This problem occurs whenever this specific VF configuration is dynamically added to a partition or is activated as part of a partition activation.
    This fix updates the adapter firmware to 11.4.415.43 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for multiple incorrect informational error logs with Thermal Management SRC B1812649 being logged on the service processor.  These error logs are more frequent on multiple-node systems but can occur on all system models.  The error is triggered by a false time-out and does not reflect a real problem on the service processor.

System firmware changes that affect certain systems

  • For a system with an AIX or Linux partition, a problem was fixed for a partition start failure for AIX or Linux with SRC BA54504D logged.  This problem occurs if the partition is an MDC default partition with virtual Trusted Platform Module (vTPM) enabled.  As a circumvention, power off the system and disable vTPM using the HMC GUI to change the default partition property for Virtualized Trusted Platform Module (VTPM) to off.
    This problem does not pertain to model ESS 5105-22E.
  • For a system with vTPM enabled,  a problem was fixed for an intermittent system hang with SRCs 11001510 and B17BE434 logged and the HMC showing the system in the "Incomplete" state. This problem is very rare.  It may be triggered by different scenarios such as a partition power off; a processor DLPAR remove operation; or a Simultaneous Multi-threading (SMT) mode change in a partition.
    This problem does not pertain to model ESS 5105-22E.
  • For an S922 (9009-22G) system with an IBM i partition, a problem was fixed for the Processor Feature system value in IBM i incorrectly showing "P20 EP5G" instead of "P10 EP5W".  This only applies to systems that have a Processor DD2.2 with CCIN=5C26.  Systems with Processor DD2.3 with CCIN=5C3D already show the "P10 EP5W" Processor Feature correctly.
  • For an S922 (9009-22G) system with an IBM i partition, a problem was fixed for an informational SRC A7004731 being logged incorrectly when servers with the EP5Y processor feature are using Physical I/O for IBM i.  Physical I/O on IBM i is allowed for the servers with the EP5Y processor, so the SRC A7004731 can be safely ignored.  No loss of function occurs, and this message about a possible configuration compliance issue is not a real issue.
  • For a system that does not have an HMC attached, a problem was fixed for a system dump 2GB or greater in size failing to off-load to the OS with an SRC BA280000 logged in the OS and an SRC BA28003B logged on the service processor.  This problem does not affect systems with an attached HMC since in that case system dumps are off-loaded to the HMC, not the OS, where there is no 2GB boundary error for the dump size.
    This problem does not pertain to model ESS 5105-22E.
VL950_092_045 / FW950.30

2021/12/09

Impact: Data    Severity:  HIPER

System firmware changes that affect all systems

  • HIPER/Non-Pervasive:  For systems using OPAL firmware, a problem was fixed for a potential problem with I/O adapters that could result in undetected data corruption.
    This problem only pertains to model ESS 5105-22E.
  • HIPER/Non-Pervasive:  A security problem was fixed to prevent an attacker that gains service access to the FSP service processor from reading and writing PowerVM system memory using a series of carefully crafted service procedures.  This problem is Common Vulnerability and Exposure number CVE-2021-38917.
    This problem does not pertain to model ESS 5105-22E.
  • HIPER/Non-Pervasive:  A problem was fixed for the IBM PowerVM Hypervisor where through a specific sequence of VM management operations could lead to a violation of the isolation between peer VMs.  The Common Vulnerability and Exposure number is CVE-2021-38918.
    This problem does not pertain to model ESS 5105-22E.
  • HIPER/Non-Pervasive:  For systems with IBM i partitions, the PowerVM hypervisor is vulnerable to a carefully crafted IBM i hypervisor call that can lead to a system crash.  The Common Vulnerability and Exposure number is CVE-2021-38937.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a possible denial of service on the service processor for ASMI and Redfish users.  This problem is very rare and could be triggered by a large number of invalid login attempts to Redfish over a short period of time.
  • A problem was fixed for a service processor hang after a successful system power down with SRC B181460B and SRC B181BA07 logged.  This is a very rare problem that results in a fipsdump and a reset/reload of the service processor that recovers from the problem.
  • A problem was fixed for system fans not increasing in speed when partitions are booted with PCIe hot adapters that require additional cooling.  This fan speed problem can also occur if there is a change in the power mode that requires a higher minimum speed for the fans of the system than is currently active.  Fans running at a slower speed than required for proper system cooling could lead to over-temperature conditions for the system.
  • A problem was fixed for a hypervisor hang and HMC Incomplete error with SRC B17BE434 logged on a system with virtual Network Interface Controller (vNIC) adapters.  The failure is triggered by actions on two different SR-IOV logical ports for the same adapter in the VIOS backing the vNIC, resulting in a deadlock condition.  This is a rare failure that can occur during a Live Partition Mobility (LPM) migration for a partition with vNIC adapters.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a longer boot time for a shared processor partition on the first boot after the processor chip 0 has been guarded.  The partition boot could stall at SRC C20012FF but eventually complete.  This rare problem is triggered by the loss of all cores in processor chip 0.  On subsequent partition boots after the slow problem boot, the boot speeds return to normal.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a Live Partition Mobility (LPM) hang during LPM validation on the target system.  This is a rare system problem triggered during an LPM migration that causes LPM attempts to fail as well as other functionality such as configuration changes and partition shutdowns.
    To recover from this problem to be able to do LPM and other operations such as configuration changes and shutting down partitions, the system must be re-IPLed.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for incorrect Power Enterprise Pools (PEP) 2.0 throttling when the system goes out of compliance.  When the system is IPLed after going out of compliance, the amount of throttled resources is lower than it should be on the first day after the IPL.  Later on, the IBM Cloud Management Console (CMC) corrects the throttle value.  This problem requires that a PEP 2.0 system go out of compliance, so it should happen only rarely.  To recover from this problem, the user can wait for up to one day after the IPL or have the CMC resend the desired PEP Throttling resource amount to correct it immediately.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for the system powering off after a hardware discovery IPL.  This will happen if a hardware discovery IPL is initiated while the system is set to "Power off when last partition powers off".  The system will power off when the Hardware Discovery Information (IOR) partition that does hardware discovery powers off.  As a workaround, one should not use the "Power off when last partition powers off" setting when doing the hardware discovery IPL. Alternatively, one can just do a normal IPL after the system powers off, and then continue as normal.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for system NVRAM corruption that can occur during PowerVM hypervisor shutdown.  This is a rare error caused by a timing issue during the hypervisor shutdown.  If this error occurs, the partition data cannot be read from the invalid NVRAM when trying to activate partitions, so the NVRAM must be cleared and the partition profile data restored from the HMC.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for the HMC Repair and Verify (R&V) procedure failing with "Unable to isolate the resource" during concurrent maintenance of the #EMX0 Cable Card.  This could lead one to take disruptive action in order to do the repair. This should occur infrequently and only with cases where a physical hardware failure has occurred which prevents access to the PCIe reset line (PERST) but allows access to the slot power controls.
    As a workaround, pulling both cables from the Cable Card to the #EMX0 expansion drawer will result in a completely failed state that can be handled by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC. Then retry the R&V operation to recover the Cable Card.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed to prevent a flood of informational PCIe Host Bridge (PHB) error logs with SRC B7006A74 that cause a wrap of internal flight recorders and loss of data needed for problem debug.  This flood can be triggered by bad cables or other issues that cause frequent informational error logs. With the fix, thresholding has been added for informational PHB correctable errors at 10 in 24 hours before a Predictive Error is logged.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for vague and misleading errors caused by using an invalid logical partition (LP) id for a resource dump request.  With the fix, the invalid LP id is rejected immediately as a user input error instead of being processed by the main storage dump to create what appear to be severe errors.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for certain SR-IOV adapters that encountered a rare adapter condition, had some response delays, and logged an Unrecoverable Error with SRC B400FF02.  With the fix, this rare condition is handled without the response delay, an Informational Error is logged, and the adapter initialization continues without interruption.  This fix pertains to adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    This problem does not pertain to model ESS 5105-22E.
  • A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware.  No specific adapter problems were addressed at this new level.  This change updates the adapter firmware to XX.30.1004 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; and #EC66/EC67 with CCIN 2CF3.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where the SR-IOV adapter goes through EEH error recovery, causing an informational error with SRC B400FF04 and additional information text that indicates a command failed.  This always happens when an adapter goes through EEH recovery and a physical port is in VEPA mode.  With the fix, the informational error is not logged.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for certain SR-IOV adapters where Virtual Functions (VFs) failed to configure after an immediate restart of a logical partition (LPAR) or a shutdown/restart of an LPAR.  This problem only happens intermittently but is more likely to occur for the immediate restart case.  A workaround for the problem is to try another shutdown and restart of the partition or use DLPAR to remove the failing VF and then use DLPAR to add it back in.  This fix pertains to adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
    The fix is in the Partition Firmware and is effective immediately after a firmware update to the fix level.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a system hypervisor hang and an Incomplete state on the HMC after a logical partition (LPAR) is deleted that has an active virtual session from another LPAR.  This problem happens every time an LPAR is deleted with an active virtual session.  This is a rare problem because virtual sessions from an HMC (a more typical case) prevent an LPAR deletion until the virtual session is closed, but virtual sessions originating from another LPAR do not have the same check.
    This problem does not pertain to model ESS 5105-22E.
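    The error-log thresholding described above for informational PHB correctable errors (escalating to a Predictive Error after 10 occurrences in 24 hours) amounts to a sliding-window counter.  The following Python sketch illustrates that general technique only; the class and method names are assumptions for illustration and are not the actual firmware code.

    ```python
    from collections import deque


    class InformationalErrorThreshold:
        """Sliding-window threshold: escalate after `limit` informational
        events within `window_seconds` (here, 10 events in 24 hours).

        Illustrative sketch only -- names and structure are hypothetical,
        not the real PHB error-handling implementation.
        """

        def __init__(self, limit=10, window_seconds=24 * 60 * 60):
            self.limit = limit
            self.window = window_seconds
            self.timestamps = deque()  # event times still inside the window

        def record(self, now):
            """Record an informational error at time `now` (in seconds).

            Returns True when the count of events inside the window reaches
            the limit, i.e. when a Predictive Error would be escalated.
            """
            self.timestamps.append(now)
            # Drop events that have aged out of the 24-hour window.
            while self.timestamps and now - self.timestamps[0] > self.window:
                self.timestamps.popleft()
            return len(self.timestamps) >= self.limit
    ```

    With this scheme, isolated informational logs never escalate (old events age out of the window), while a burst of 10 or more within 24 hours crosses the threshold once, which matches the intent of logging a single Predictive Error instead of flooding the flight recorders.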


System firmware changes that affect certain systems

  • For a system with an IBM i partition, a problem was fixed for memory-mapped I/O and interrupt resources not being cleaned up for an SR-IOV VF when an IBM i partition is shut down.  This is a rare problem that requires adapters in SR-IOV shared mode to be assigned to the partition and certain timings of activity on the adapter prior to a shutdown of the partition.  The lost resources are not available on the next activation of the partition, but in most cases, this should not result in a loss of function.  The lost resources are recovered on the next re-IPL of the system.
    This problem does not pertain to model ESS 5105-22E.
  • For systems with IBM i partitions, a problem was fixed for incorrect Power Enterprise Pools (PEP) 2.0 messages reporting "Out of Compliance" with regard to IBM i licenses.  These messages can be ignored as there is no compliance issue to address in this case.
    This problem does not pertain to model ESS 5105-22E.
  • For a system with a Linux partition using an SR-IOV adapter, a problem was fixed for ping failures and packet loss for an SR-IOV logical port when a Dynamic DMA Window (DDW) changes from a bigger DMA window page size (such as 64K) back to the smaller default window page size (4K).  This can happen during an error recovery that causes a DDW reset back to the default window page size.
    This problem does not pertain to model ESS 5105-22E.
  • For a system with an IBM i partition, a problem was fixed for an IBM i partition running in P7 or P8 processor compatibility mode failing to boot with SRCs BA330002 and B200A101 logged.  This problem can be triggered as larger configurations for processors and memory are added to the partition.  A circumvention is to reduce the number of processors and memory in the partition; booting in P9 or later compatibility mode will also allow the partition to boot.
    This problem does not pertain to model ESS 5105-22E.
  • For a system with an AIX or Linux partition, a problem was fixed for Platform Error Logs (PELs) that are truncated to only eight bytes for error logs created by the firmware and reported to the AIX or Linux OS.  These PELs may appear to be blank or missing on the OS.  This rare problem is triggered by multiple error log events in the firmware occurring close together in time, each needing to be reported to the OS, causing a truncation in the reporting of the PEL.  As a workaround, the full versions of the truncated logs are available on the HMC or can be viewed using ASMI on the service processor.
    This problem does not pertain to model ESS 5105-22E.
  • For systems with NVDIMMs, a problem was fixed for a false BC8A43FC information log generated on each IPL, indicating a successful BPM update was completed for the NVDIMMs in the system. This error log should be ignored as there is not a BPM update occurring on every IPL.
    This problem pertains to model ESS 5105-22E only.
VL950_087_045 / FW950.20

2021/09/16

Impact: Data    Severity:  HIPER

New Features and Functions

  • Support added for a mainstream 800 GB NVMe U.2 7 mm SSD (Solid State Drive) PCIe4 drive in a 15 mm carrier with Feature Code #EC7T and CCIN 59B7 for AIX, Linux, and VIOS.  This PCIe4 drive requires a PCIe4 slot on the system.
    This feature pertains only to the S914 (9009-41G), S922 (9009-22G), S924 (9009-42G), H922 (9223-22S), and H924 (9223-42S) models.
  • Support was changed to disable Service Location Protocol (SLP) by default for newly shipped systems or systems that are reset to manufacturing defaults.  This change has been made to reduce memory usage on the service processor by disabling a service that is not needed for normal system operations.  This change can be made manually for existing customers by changing it in ASMI with the options "ASMI -> System Configuration -> Security -> External Services Management" to disable the service.
  • Support was added to generate a service processor fipsdump whenever there is a Hostboot (HB) TI and HB dump.  Without this support, an HB crash (with an HB dump) did not generate a fipsdump capturing the FSP FFDC at that point in time, making it difficult to correlate what was seen in the HB dump with what was happening on the FSP at the time of the HB failure.


System firmware changes that affect all systems

  • HIPER:  A problem was fixed which may occur on a target system following a Live Partition Mobility (LPM) migration of an AIX partition utilizing Active Memory Expansion (AME) with 64 KB page size enabled using the vmo tunable: "vmo -ro ame_mpsize_support=1".  The problem may result in AIX termination, file system corruption, application segmentation faults, or undetected data corruption.
    Note:  If you are doing an LPM migration of an AIX partition utilizing AME and 64 KB page size enabled involving a POWER8 or POWER9 system, ensure you have a Service Pack including this change for the appropriate firmware level on both the source and target systems.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed to reduce system fan speeds for systems with hot PCIe I/O adapters.  The fan floor speeds have been slightly lowered for certain ambient temperatures.  A reduction of fan speed as well as fan noise is expected with the system fans allowed to run at lower than the maximum speed setting in more cases.
    This fix pertains to the S924 (9009-42A), H924 (9223-42H), H924 (9223-42S), and S924 (9009-42G) models.
  • A problem was fixed for a missing hardware callout and guard for a processor chip failure with SRC BC8AE540 and signature "ex(n0p0c5) (L3FIR[28]) L3 LRU array parity error".
  • A problem was fixed for a missing hardware callout and guard for a processor chip failure with Predictive Error (PE) SRC BC70E540 and signature "ex(n1p2c6) (L2FIR[19]) Rc or NCU Pb data CE error".  The PE error occurs after the number of CE errors reaches a threshold of 32 errors per day.
  • A problem was fixed for an infrequent SRC of B7006956 that may occur during a system power off.  This SRC indicates that encrypted NVRAM locations failed to synchronize with the copy in memory during the shutdown of the hypervisor. This error can be ignored as the encrypted NVRAM information is stored in a redundant location, so the next IPL of the system is successful.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a service processor mailbox (mbox) timeout error with SRC B182953C during the IPL of systems with large memory configurations and "I/O Adapter Enlarged Capacity" enabled from ASMI.  The error indicates that the hypervisor did not respond quickly enough to a message from the service processor, but this may not result in an IPL failure.  The problem is intermittent, so if the IPL does fail, the workaround is to retry the IPL.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a misleading SRC B7006A20 (Unsupported Hardware Configuration) that can occur for some error cases for #EMX0 PCIe3 expansion drawers that are connected with copper cables.  For cable unplug errors, the SRC B7006A88 (Drawer TrainError) should be shown instead of the B7006A20.  If a B7006A20 is logged against copper cables with the signature "Prc UnsupportedCableswithFewerChannels" and the message "NOT A 12CHANNEL CABLE", follow the service actions for a B7006A88 SRC instead.
    This problem does not pertain to model ESS 5105-22E.
  • Problems were fixed for DLPAR operations that change the uncapped weight of a partition and DLPAR operations that switch an active partition from uncapped to capped.  After changing the uncapped weight, the weight can be incorrect.  When switching an active partition from uncapped to capped, the operation can fail.
    These problems do not pertain to model ESS 5105-22E.
  • A problem was fixed where the Floating Point Unit Computational Test, which should be set to "staggered" by default, has been changed in some circumstances to be disabled. If you wish to re-enable this option, this fix is required.  After applying this service pack, do the following steps:
    1) Sign in to the Advanced System Management Interface (ASMI).
    2) Select Floating Point Computational Unit under the System Configuration heading and change it from disabled to what is needed: staggered (run once per core each day) or periodic (a specified time).
    3) Click "Save Settings".
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a hypervisor hang and HMC Incomplete error as a secondary problem after an SR-IOV adapter has gone into error recovery for a failure.  This secondary failure is infrequent because it requires an unrecovered error first for an SR-IOV adapter.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a system termination with SRC B700F107 following a time facility processor failure with SRC B700F10B.  With the fix, the transparent replacement of the failed processor will occur for the B700F10B if there is a free core, with no impact to the system.  
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an incorrect "Power Good fault" SRC logged for an #EMX0 PCIe3 expansion drawer on the lower CXP cable of B7006A85 (AOCABLE, PCICARD).  The correct SRC is B7006A86 (PCICARD, AOCABLE).
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a Live Partition Mobility (LPM) migration that failed with the error "HSCL3659 The partition migration has been stopped because orchestrator detected an error" on the HMC.  This intermittent and rare problem is triggered by the HMC being overrun with unneeded LPM message requests from the hypervisor, which can cause a timeout in HMC queries that aborts the LPM operation.  The workaround is to retry the LPM migration, which will normally succeed.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where unmatched unicast packets were not forwarded to the promiscuous mode VF.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for certain SR-IOV adapters in SR-IOV Shared mode which may cause a network interruption and SRCs B400FF02 and B400FF04 logged.  The problem occurs infrequently during normal network traffic.
    This fix updates the adapter firmware to 11.4.415.38 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for the Device Description in a System Plan related to Crypto Coprocessors and NVMe cards that were only showing the PCI vendor and device ID of the cards.  This is not enough information to verify which card is installed without looking up the PCI IDs first.  With the fix, more specific/useful information is displayed and this additional information does not have any adverse impact on sysplan operations.  The problem is seen every time a System Plan is created for an installed Crypto Coprocessor or NVMe card.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for some serviceable events specific to the reporting of EEH errors not being displayed on the HMC.  The sending of an associated call home event, however, was not affected.  This problem is intermittent and infrequent.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for possible partition errors following a concurrent firmware update from FW910 or later. A precondition for this problem is that DLPAR operations of either physical or virtual I/O devices must have occurred prior to the firmware update.  The error can take the form of a partition crash at some point following the update. The frequency of this problem is low.  If the problem occurs, the OS will likely report a DSI (Data Storage Interrupt) error.  For example, AIX produces a DSI_PROC log entry.  If the partition does not crash, it is also possible that some subsequent I/O DLPAR operations will fail.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for Platform Error Logs (PELs) not being logged and shown by the OS if they have an Error Severity code of "critical error".  The trigger is the reporting by a system firmware subsystem of an error log that has set an Event/Error Severity in the 'UH' section of the log to a value in the range 0x50 to 0x5F.  The following error logs are affected:
    B200308C ==> PHYP ==>  A problem occurred during the IPL of a partition.  The adapter type cannot be determined. Ensure that a valid I/O Load Source is tagged.
    B700F104 ==> PHYP ==> Operating System error.  Platform Licensed Internal Code terminated a partition.
    B7006990 ==> PHYP ==> Service processor failure
    B2005149 ==> PHYP ==>  A problem occurred during the IPL of a partition.
    B700F10B ==> PHYP ==>  A resource has been disabled due to hardware problems
    A7001150 ==> PHYP ==> System log entry only, no service action required. No action needed unless a serviceable event was logged.
    B7005442 ==> PHYP ==> A parity error was detected in the hardware Segment Lookaside Buffer (SLB).
    B200541A ==> PHYP ==> A problem occurred during a partition Firmware Assisted Dump
    B7001160 ==> PHYP ==> Service processor failure.
    B7005121 ==> PHYP ==> Platform LIC failure
    BC8A0604 ==> Hostboot  ==> A problem occurred during the IPL of the system.
    BC8A1E07 ==> Hostboot  ==>  Secure Boot firmware validation failed.
    Note that these error logs are still reported to the service processor and HMC properly.  This issue does not affect the Call Home action for the error logs.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for Live Partition Mobility (LPM) migrations from non-trusted POWER9 systems to POWER10 systems.  The LPM migration failure occurs every time an LPM migration is attempted from a non-trusted system source to FW1010 and later.  For POWER9 systems, non-trusted is the default setting.  The messages shown on the HMC for the failure are the following:
     HSCL365C The partition migration has been stopped because platform firmware detected an error (041800AC).
     HSCL365D The partition migration has been stopped because target MSP detected an error (05000127).
    A workaround for the problem is to enable the trusted system key on the POWER9 FW940/FW950 source system which can be done using an intricate procedure.  Please contact IBM Support for help with this workaround.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a missing error log SRC for an SR-IOV adapter in Shared mode that fails during the IPL because of adapter failure or because the system has insufficient memory for SR-IOV Shared mode for the adapter.  The error log SRC added is B7005308, indicating a serviceable event and providing the adapter and error information.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a Live Partition Mobility (LPM) migration failure from a POWER9 FW950 source to a POWER10 FW1010 target.  This will fail on every attempt with the following message on the HMC:
    "HSCLA2CF The partition migration has been stopped unexpectedly.  Perform a migration recovery for this partition, if necessary."
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for error logs not being sent to the HMC after disconnecting/reconnecting power cords caused a flood of informational SRCs B1818A37 and B18187D7.  After the flood of error logs, the reporting of error logs to the HMC stopped, which also prevented Call Home from working.  To recover from the error, the service processor can have a reset/reload done using ASMI.


System firmware changes that affect certain systems

  • For a system with a partition running AIX 7.3, a problem was fixed for running Live Update or Live Partition Mobility (LPM).  AIX 7.3 supports Virtual Persistent Memory (PMEM), which cannot be used with these operations, but the problem made it appear that PMEM was configured when it was not.  As a result, Live Update and LPM operations always fail when attempted on AIX 7.3.  Here is the failure output from a Live Update Preview:
    "1430-296 FAILED: not all devices are virtual devices.
    nvmem0
    1430-129 FAILED: The following loaded kernel extensions are not known to be safe for Live Update:
    nvmemdd
    ...
    1430-218 The live update preview failed.
    0503-125 geninstall:  The lvupdate call failed.
    Please see /var/adm/ras/liveupdate/logs/lvupdlog for details."
    This problem does not pertain to model ESS 5105-22E.
  • For systems with an AIX partition and Platform Keystore (PKS) enabled for the partition, a problem was fixed for AIX not being able to access the PKS during a Main Store Dump (MSD) IPL.  This may prevent the dump from completing.  This will happen for every MSD IPL when the partition PKS is enabled and in use by the AIX OS.
    This problem does not pertain to model ESS 5105-22E.
  • For a system with an AIX or Linux partition, a problem was fixed for a boot hang in RTAS for a partition that owns I/O which uses MSI-X interrupts.  A BA180007 SRC may be logged prior to the hang.  The frequency of this RTAS hang error is very low.
    This problem does not pertain to model ESS 5105-22E.
  • For an IBM i partition on a model S922 (9009-22G) or H922 (9223-22S) system, a problem was fixed for the built-in configuration rules for IBM i partitions.  Without this fix, it is possible to define and build an IBM i configuration of LPARs that falls outside the limits or constraints of what IBM i supports.  Examples of this situation would be having more IBM i cores per partition than IBM i officially supports on this server or not enabling the restricted I/O partition attribute.
    A workaround to this problem is to follow the IBM i Sales manuals when building 22G or 22S configurations for IBM i.
    With the fix, existing IBM i partitions are checked for compliance and, if needed, flagged with "Out of Compliance" with an SRC A7004731 logged.  This compliance checking continues on a regular basis until compliance is achieved, and any new IBM i partitions created will have the rules enforced.
    This problem only pertains to models S922 (9009-22G) and H922 (9223-22S).
VL950_075_045 / FW950.11

2021/06/08

Impact:  Availability     Severity:  HIPER

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed for a checkstop due to an internal Bus transport parity error or a data timeout on the Bus.  This is a very rare problem that requires a particular SMP transport link traffic pattern and timing.  Both the traffic pattern and timing are very difficult to achieve with customer application workloads.  The fix will have no measurable effect on most customer workloads although highly intensive OLAP-like workloads may see up to 2.5% impact.
VL950_072_045 / FW950.10

2021/04/28

Impact: Data      Severity:  HIPER

New Features and Functions

  • Support added to Redfish to provide a command to set the ASMI user passwords using a new AccountService schema. Using this service, the ASMI admin, HMC, and general user passwords can be changed.
  • PowerVM support for the Platform KeyStore (PKS) for partitions has removed the FW950.00 restriction where the total amount of PKS for the system that could be configured was limited to 1 MB across all the partitions.  This restriction has been removed for FW950.10.
    This feature does not pertain to model ESS 5105-22E.
  • Support was added for a 12-core 2.3G/2.7G 190W/225W processor module with feature codes #EP5W and #EP5X.  This processor has CCIN 5C3D and runs in DD2.3 mode.
    This feature pertains only to IBM Power System models S922(9009-22G) and H922(9223-22S).
  • Support was added for Samsung DIMMs with part number 01GY853.  If these DIMMs are installed in a system with firmware older than FW950.10, the DIMMs will fail and be guarded with SRC BC8A090F logged with HwpReturnCode "RC_CEN_MBVPD_TERM_DATA_UNSUPPORTED_VPD_ENCODE".
  • Support was added for a new service processor command that can be used to 'lock' the power management mode, such that the mode can not be changed except by doing a factory reset.
  • Support for new mainstream 931 GB, 1.86 TB, 3.72 TB, and 7.44 TB capacity SSDs.  A 2.5-inch serial-attached SCSI (SAS) SSD is mounted on an SFF-3 carrier or tray for a POWER9 system unit or mounted on an SFF-2 for placement in an expansion drawer, such as the EXP24SX drawer, when attached to a POWER9 server.  The drive is formatted to use 4224-byte (4k) sectors and does not support 4096-byte sectors (4k JBOD).  Firmware level FW950.10 or later is required for these drives.  The following are the feature codes and CCINs for the new drives:
    #ESKJ/#ESKK with CCIN 5B2B/5B29 - 931 GB Mainstream SAS 4k SFF-3/SFF-2 SSD for AIX/Linux
    #ESKL/#ESKM with CCIN 5B2B/5B29 - 931 GB Mainstream SAS 4k SFF-3/SFF-2 SSD for IBM i
    #ESKN/#ESKP with CCIN 5B20/5B21 - 1.86 TB Mainstream SAS 4k SFF-3/SFF-2 SSD for AIX/Linux
    #ESKQ/#ESKR with CCIN 5B20/5B21 - 1.86 TB Mainstream SAS 4k SFF-3/SFF-2 SSD for IBM i
    #ESKS/#ESKT with CCIN 5B2C/5B2D - 3.72 TB Mainstream SAS 4k SFF-3/SFF-2 SSD for AIX/Linux
    #ESKU/#ESKV with CCIN 5B2C/5B2D - 3.72 TB Mainstream SAS 4k SFF-3/SFF-2 SSD for IBM i
    #ESKW/#ESKX with CCIN 5B2E/5B2F - 7.44 TB Mainstream SAS 4k SFF-3/SFF-2 SSD for AIX/Linux
    #ESKY/#ESKZ with CCIN 5B2E/5B2F - 7.44 TB Mainstream SAS 4k SFF-3/SFF-2 SSD for IBM i
  • Support for new enterprise SSDs refresh the previously available 387 GB, 775 GB, and 1550 GB capacity points for POWER9 servers. These are 400 GB, 800 GB, and 1600 GB SSDs that are always formatted either to 4224 (4k) byte sectors or to 528 (5xx) byte sectors for additional protection, resulting in 387 GB, 775 GB, and 1550 GB capacities. The 4096-byte sector, the 512-byte sector, and JBOD are not supported.  Firmware level FW950.10 or later is required for these drives. The following are the feature codes and CCINs for the new drives:
    #ESK0/#ESK1 with CCIN 5B19/5B16 - 387 GB Enterprise SAS 5xx SFF-3/SFF-2 SSD for AIX/Linux
    #ESK2/#ESK3 with CCIN 5B1A/5B17 - 775 GB Enterprise SAS 5xx SFF-3/SFF-2 SSD for AIX/Linux
    #ESK6/#ESK8 with CCIN 5B13/5B10 - 387 GB Enterprise SAS 4k SFF-3/SFF-2 SSD for AIX/Linux
    #ESK7/#ESK9 with CCIN 5B13/5B10 - 387 GB Enterprise SAS 4k SFF-3/SFF-2 SSD for IBM i
    #ESKA/#ESKC with CCIN 5B14/5B11 - 775 GB Enterprise SAS 4k SFF-3/SFF-2 SSD for AIX/Linux
    #ESKB/#ESKD with CCIN 5B14/5B11 - 775 GB Enterprise SAS 4k SFF-3/SFF-2 SSD for IBM i
    #ESKE/#ESKG with CCIN 5B15/5B12 - 1.55 TB Enterprise SAS 4k SFF-3/SFF-2 SSD for AIX/Linux
    #ESKF/#ESKH with CCIN 5B15/5B12 - 1.55 TB Enterprise SAS 4k SFF-3/SFF-2 SSD for IBM i
  • Support for new PCIe 4.0 x8 dual-port 32 Gb optical Fibre Channel (FC) short form adapter based on the Marvell QLE2772 PCIe host bus adapter (6.6 inches x 2.731 inches). The adapter provides two ports of 32 Gb FC capability using SR optics. Each port can provide up to 6,400 MBps bandwidth. This adapter has feature codes #EN1J/#EN1K with CCIN 579C. Firmware level FW950.10 or later is required for this adapter.
  • Support for new PCIe 3.0 16 Gb quad-port optical Fibre Channel (FC) x8 short form adapter based on the Marvell QLE2694L PCIe host bus adapter (6.6 inches x 2.371 inches). The adapter provides four ports of 16 Gb FC capability using SR optics. Each port can provide up to 3,200 MBps bandwidth. This adapter has feature codes #EN1E/#EN1F with CCIN 579A. Firmware level FW950.10 or later is required for this adapter.
  • Support for Enterprise 800 GB PCIe4 NVMe SFF U.2 15mm SSD for IBM i. The SSD can be used in any U.2 15mm NVMe slot in the system. This drive has feature code #ES1K and CCIN 5947. Firmware level FW950.10 or later is required for this drive.
  • Added support in ASMI for a new panel to do Self-Boot Engine (SBE) SEEPROM validation.  This validation can only be run at the service processor standby state.
    If the validation detects a problem, IBM recommends the system not be used and that IBM service be called.
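As an illustration of the new Redfish AccountService support above, a password change in standard DMTF Redfish takes the form of a PATCH to an account resource under the AccountService schema.  The sketch below is a hypothetical example only: the host address, account member ID, authorization header, and password value are placeholders, and the exact account URIs exposed by the service processor may differ.

```
PATCH /redfish/v1/AccountService/Accounts/2 HTTP/1.1
Host: <service-processor-address>
Content-Type: application/json
Authorization: Basic <base64-credentials>

{"Password": "<NewPassword>"}
```

A GET on /redfish/v1/AccountService/Accounts lists the account collection; this sketch assumes the ASMI admin, HMC, and general users appear as members of that collection.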

System firmware changes that affect all systems

  • HIPER/Pervasive: A problem was fixed for a system IPL failure when DIMMs (RDIMMs or NVDIMMs) have mixed configurations with dual populated memory channels and single populated memory channels.  This problem occurs if dually populated memory channels precede a single DIMM memory channel for a processor.  This causes the IPL to fail with B150BA40 and BC8A090F logged with HwpReturnCode "RC_MSS_CALC_POWER_CURVE_NEGATIVE_OR_ZERO_SLOPE" and HWP Error description "Power curve slope equals 0 or is negative".  A workaround for this problem is to reconfigure the memory so that single DIMM memory channels come before memory channels that have both DIMM slots occupied.
  • A problem was fixed for certain SR-IOV adapters that have a rare, intermittent error with B400FF02 and B400FF04 logged, causing a reboot of the VF.  The error is handled and recovered without any user intervention needed.  The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CEC; and #EC66/#EC67 with CCIN 2CF3.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for initiating a Remote Restart from a PowerVC/NovaLink source system to a remote target. This happens whenever the source system is running FW950.00. The error would look like this from PowerVC (system name, release level would be specific to the environment):
    "Virtual machine RR-5 could not be remote restarted to Ubu_AX_9.114.255.10. Error message: PowerVM API failed to complete for instance=RR-5-71f5c2cf-0000004e.HTTP error 500 for method PUT on path /rest/api/uom/ManagedSystem/598c1be4-cb4c-3957-917d-327b764d8ac1/LogicalPartition: Internal Server Error -- [PVME01040100-0004] Internal error PVME01038003 occurred while trying to perform this command.".
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a B1502616 SRC logged after a system is powered off.  This rare error, "A critical error occurred on the thermal/power management device (TPMD); it is being disabled.", is not a real problem but occurred because the Power Management (PM) complex was being reset during the power off.  No recovery is needed as the next IPL of the system is successful.
  • A problem was fixed for the error handling of a system with an unsupported memory configuration that exceeds available memory power. Without the fix, the IPL of the system is attempted and fails with a segmentation fault with SRCs B1818611 and B181460B logged that do not call out the incorrect DIMMs.
  • A problem was fixed for an error in the HMC GUI (Error launching task) when clicking on "Hardware Virtualized IO". This error is infrequent and is triggered by an optical cable to a PCIe3 #EMX0 expansion drawer that is failed or unplugged.  With the fix, the HMC can show the working I/O adapters.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for performance degradation of a partition due to task dispatching delays. This may happen when a processor chip has all of its shared processors removed and converted to dedicated processors. This could be driven by DLPAR remove of processors or Dynamic Platform Optimization (DPO).
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for an unrecoverable UE SRC B181BE12 being logged if a service processor message acknowledgment is sent to a Hostboot instance that has already shut down.  This is a harmless error log and should have been marked as an informational log.
  • A problem was fixed for Time of Day (TOD) being lost for the real-time clock (RTC) with an SRC B15A3303 logged when the service processor boots or resets.  This is a very rare problem that involves a timing problem in the service processor kernel.  If the server is running when the error occurs, an SRC B15A3303 is logged, and the time of day on the service processor will be incorrect for up to six hours until the hypervisor synchronizes its (valid) time with the service processor.  If the server is not running when the error occurs, an SRC B15A3303 is logged, and if the server is subsequently IPLed without setting the date and time in ASMI, the IPL will abort with an SRC B7881201, which indicates to the system operator that the date and time are invalid.
  • A problem was fixed for the Systems Management Services (SMS) menu "Device IO Information" option being incorrect when displaying the capacity for an NVMe or Fibre Channel (FC) NVMe disk.  This problem occurs every time the data is displayed.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for intermittent failures for a reset of a Virtual Function (VF) for SR-IOV adapters during Enhanced Error Handling (EEH) error recovery. This is triggered by EEH events at a VF level only, not at the adapter level. The error recovery fails if a data packet is received by the VF while the EEH recovery is in progress. A VF that has failed can be recovered by a partition reboot or a DLPAR remove and add of the VF.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a logical partition activation error that can occur when trying to activate a partition when the adapter hardware for an SR-IOV logical port has been physically removed or is unavailable due to a hardware issue. This message is reported on the HMC for the activation failure:  "Error:  HSCL12B5 The operation to remove SR-IOV logical port <number> failed because of the following error: HSCL1552 The firmware operation failed with extended error" where the logical port number will vary.  This is an infrequent problem that is only an issue if the adapter hardware has been removed or another problem makes it unavailable.  The workaround for this problem is to physically add the hardware back in or correct the hardware issue.  If that cannot be done, create an alternate profile for the logical partition without the SR-IOV logical port and use that until the hardware issue is resolved.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for incomplete periodic data gathered by IBM Service for #EMX0 PCIe expansion drawer predictive error analysis.  The service data is missing the PLX (PCIe switch) data that is needed for the debug of certain errors.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a partition hang in shutdown with SRC B200F00F logged.  The trigger for the problem is an asynchronous NX accelerator job (such as gzip or NX842 compression) in the partition that fails to clean up successfully.  This is intermittent and does not cause a problem until a shutdown of the partition is attempted.  The hung partition can be recovered by performing an LPAR dump on the hung partition.  When the dump has been completed, the partition will be properly shut down and can then be restarted without any errors.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a rare failure for an SPCN I2C command sent to a PCIe I/O expansion drawer that can occur when service data is manually collected with the hypervisor macros "xmsvc -dumpCCData" and "xmsvc -logCCErrBuffer".  If the "xmsvc" macro is run to gather service data and a CMC Alert occurs at the same time that requires an SPCN command to clear the alert, the I2C commands may be improperly serialized, resulting in an SPCN I2C command failure.  To prevent this problem, avoid using "xmsvc -dumpCCData" and "xmsvc -logCCErrBuffer" to collect service data until this fix is applied.
    This problem does not pertain to model ESS 5105-22E.
  • The following problems were fixed for certain SR-IOV adapters:
    1) An error was fixed that occurs during a VNIC failover where the VNIC backing device has a physical port down or read port errors with an SRC B400FF02 logged.
    2) A problem was fixed for adding a new logical port that has a PVID assigned that is causing traffic on that VLAN to be dropped by other interfaces on the same physical port which uses OS VLAN tagging for that same VLAN ID.  This problem occurs each time a logical port with a non-zero PVID that is the same as an existing VLAN is dynamically added to a partition or is activated as part of a partition activation, the traffic flow stops for other partitions with OS configured VLAN devices with the same VLAN ID.  This problem can be recovered by configuring an IP address on the logical port with the non-zero PVID and initiating traffic flow on this logical port.  This problem can be avoided by not configuring logical ports with a PVID if other logical ports on the same physical port are configured with OS VLAN devices.
    This fix updates the adapter firmware to 11.4.415.37 for the following Feature Codes and CCINs:  #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for certain PCIe adapters and NVMe U.2 devices to lower fan speeds where additional cooling is not needed. The following feature codes are affected in the  9009-22G, 9009-41G, 9009-42G, and 5105-22E servers:  #EC5J with CCIN 59B4;  #EC5X with CCIN 59B7; and #ES1E/#ES1F with CCIN 59B8.
  • A problem was fixed for a system hang or terminate with SRC B700F105 logged during a Dynamic Platform Optimization (DPO) that is running with a partition in a failed state but that is not shut down.  If DPO attempts to relocate a dedicated processor from the failed partition, the problem may occur.  This problem can be avoided by doing a shutdown of any failed partitions before initiating DPO.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for a system crash with HMC message HSCL025D and SRC B700F103 logged on a Live Partition Mobility (LPM) inactive migration attempt that fails. The trigger for this problem is inactive migration that fails a compatibility check between the source and target systems.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for certain SR-IOV adapters not being able to create the maximum number of VLANs that are supported for a physical port. There were insufficient memory pages allocated for the physical functions for this adapter type. The SR-IOV adapters affected have the following Feature Codes and CCINs:  #EC66/#EC67 with CCIN 2CF3.
    This problem does not pertain to model ESS 5105-22E.
  • A problem was fixed for certain SR-IOV adapters that can have B400FF02 SRCs logged with LPA dumps during a vNIC remove operation.  The adapters can have issues with a deadlock in managing memory pages.  In most cases, the operations should recover and complete.  This fix updates the adapter firmware to XX.29.2003 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
    This problem does not pertain to model ESS 5105-22E.

System firmware changes that affect certain systems

  • HIPER/Pervasive: A problem was fixed for a very small timing window which may prevent the saving of the volatile write cache in the NVDIMM in the event of a surprise/unscheduled power down of the system or a power down with the SBE in an error state.  This may result in data loss for the storage system.
    This problem pertains to model ESS 5105-22E only.
  • HIPER/Pervasive: On systems with OPAL firmware, a problem was fixed for intermittent B150DAA0 and/or B181DA93 errors during power off.  On the subsequent IPL, the system will terminate with an SRC BC23352C pointing to an NVDIMM.  An NVDIMM factory reset is then required to clear that condition. Because of this error, the SBE was unable to save NVDIMM data, resulting in a loss of data.
    This problem pertains to model ESS 5105-22E only.
  • HIPER/Pervasive: On systems with OPAL firmware, a problem was fixed for a processor core checkstop with SRC BC70E540 logged with Signature Description "ex(n0p1c4) (NCUFIR[11]) NCU no response to snooped TLBIE".  This problem is intermittent and random but occurs with relatively high frequency for certain workloads.  The trigger for the failure is one core of a fused core pair going into a stopped state while the other core of the pair continues running.
    This problem pertains to model ESS 5105-22E only.
  • On systems with OPAL firmware, a problem was fixed for sporadic error messages in the OPAL log for xscom read/write calls.  These calls are handled internally and are not in the error log, so these informational messages should not be printed in the OPAL log.  
    This problem pertains to model ESS 5105-22E only.
  • On systems with OPAL firmware, a problem was fixed for a Linux OS hang on reboot that can occur if there is an error during fast reboot.  With the fix, if a fast reboot cannot be supported, a normal reboot is done instead.  
    This problem pertains to model ESS 5105-22E only.
  • A problem was fixed to provide fast automatic recovery for an NVDIMM slot 0 error with SRC BC23550 logged.  This is a recovery fix that changes recovery from many minutes (approximately 10 minutes per NVDIMM) with a re-IPL going through NVDIMM firmware reload, to less than 10 seconds with no re-IPL required.
    This problem pertains to model ESS 5105-22E only.
  • A problem was fixed for a missing SRC and callout when an NVDIMM does not have enough energy to save the NVDIMM image on a loss of power for the system.  Although the NVDIMM data loss is recognized by the storage system, the lack of an SRC makes it difficult to isolate the failing hardware for repair or replacement.
    This problem pertains to model ESS 5105-22E only.
  • On systems with an IBM i partition, a problem was fixed for physical I/O property data not being able to be collected for an inactive partition booted in "IOR" mode with SRC B200A101 logged. This can happen when making a system plan (sysplan) for an IBM i partition using the HMC and the IBM i partition is inactive.  The sysplan data collection for the active IBM i partitions is successful.
    This problem does not pertain to model ESS 5105-22E.
VL950_045_045 / FW950.00

2020/11/23

Impact:  New      Severity:  New

GA Level with key features included listed below

  • All features and fixes from the FW930.30, FW940.20, and FW941.03 service packs (and below) are included in this release.

New Features and Functions

  • Host firmware support for anti-rollback protection.  This feature implements firmware anti-rollback protection as described in NIST SP 800-147B "BIOS Protection Guidelines for Servers".  Firmware is signed with a "secure version".  Support added for a new menu in ASMI called "Host firmware security policy" to update this secure version level at the processor hardware.  Using this menu, the system administrator can enable the "Host firmware secure version lock-in" policy, which will cause the host firmware to update the "minimum secure version" to match the currently running firmware. Use the "Firmware Update Policy" menu in ASMI to show the current "minimum secure version" in the processor hardware along with the "Minimum code level supported" information. The secure boot verification process will block installing any firmware secure version that is less than the "minimum secure version" maintained in the processor hardware.
    Prior to enabling the "lock-in" policy, it is recommended to accept the current firmware level.
    WARNING: Once lock-in is enabled and the system is booted, the "minimum secure version" is updated and there is no way to roll it back to allow installing firmware releases with a lesser secure version.
    Note:  If upgrading from FW930.30 or FW940.20, this feature is already applied.
  • Support added for IBM Power System H922 for SAP HANA (9223-22S) and IBM Power System H924 for SAP HANA (9223-42S).  These models have integrated PCIe Gen4 switches.
  • OPAL is supported with skiboot level v6.6.4 and petitboot level v1.12.  This pertains to model ESS 5105-22E only.
  • This server firmware level includes the SR-IOV adapter firmware level 11.4.415.33 for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3, #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, #EN0K/EN0L with CCIN 2CC1, #EL56/EL38 with CCIN 2B93, and #EL57/EL3C with CCIN 2CC1.
  • This server firmware includes the SR-IOV adapter firmware level 1x.25.6100 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
  • Support added for a 110V/900W power supply on the 9009-22G model.
  • Support added for IBM i 7.1 (Tech Refresh 11 + PTFs) for restricted I/O only on the S922 (9009-22A) model.
  • Support for PCIe4 x8 1.6/3.2/6.4 TB NVMe Adapters that are Peripheral Component Interconnect Express (PCIe) Generation 4 (Gen4) x8 adapters with the following feature codes and CCINs:
    #EC7A/#EC7B with CCIN 594A ; #EC7C/#EC7D with CCIN 594B; and #EC7E/#EC7F with CCIN 594C for AIX/Linux.
    #EC7J/#EC7K with CCIN 594A ; #EC7L/#EC7M with CCIN 594B; and #EC7N/#EC7P with CCIN 594C for IBM i.
  • PowerVM boot support for AIX for NVMe over Fabrics (NVMf) for 32Gb Fibre Channel.  Natively attached adapters are supported with the following feature codes and CCINs: #EN1A/#EN1B with CCIN 578F.
  • Support added for a PCIe2 2-Port USB 3.0 adapter with the following feature codes and CCIN: #EC6J/#EC6K with CCIN 590F.
  • Support added for dedicated processor partitions in IBM Power Enterprise Pools (PEP) 2.0.  Previously, systems added to PEP 2.0 needed to have all partitions as shared processor partitions.
  • Support added for SR-IOV Hybrid Network Virtualization (HNV) for Linux.  This capability allows a Linux partition to take advantage of the efficiency and performance benefits of SR-IOV logical ports and to participate in mobility operations such as active and inactive Live Partition Mobility (LPM) and Simplified Remote Restart (SRR).  HNV is enabled by selecting the new Migratable option when an SR-IOV logical port is configured.  The Migratable option is used to create a backup virtual device, which must be a Virtual Ethernet adapter (a virtual Network Interface Controller (vNIC) adapter is not supported as a backup device).  In addition to this firmware, HNV support in a production environment requires HMC 9.1.941.0 or later, RHEL 8, SLES 15, and VIOS 3.1.1.20 or later.
  • Enhanced Dynamic DMA Window (DDW) support for I/O adapter slots to enable the OS to use 64 KB TCEs.  The supported OS is Linux RHEL 8.3 LE.
  • PowerVM support for the Platform KeyStore (PKS) for partitions.  PowerVM has added new h-call interfaces allowing the partition to interact with the Platform KeyStore that is maintained by PowerVM.  This keystore can be used by the partition to store items requiring confidentiality or integrity like encryption keys or certificates.
    Note:  The total amount of PKS for the system is limited to 1 MB across all the partitions for FW950.00.
  • Support for 64 GB 16Gbit DDR4 system memory running at 2666 MHz with feature code #EM7B and part number 78P6815.
  • Support for 128 GB 16Gbit DDR4 system memory running at 2666 MHz with feature code #EM7C and part number 78P6925.

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed to detect a failed PFET sensing circuit in a core at runtime and prevent a system failure with an Incomplete state when a core fails to wake up.  The failed core is detected on the subsequent IPL.  With the fix, a core with the PFET failure is called out with SRC BC13090F and hardware description "CME detected malfunctioning of PFET headers." to better isolate the error with a correct callout.
  • A problem was fixed in ASMI for the Update Access Key (UAK) displaying as "0000-00-00" on OPAL systems.  With the fix, the UAK displays as "NA" since it is not used during the firmware update for this type of system.  This problem pertains only to the 5105-22E model.
  • A problem was fixed for a VIOS, AIX, or Linux partition hang during activation at SRC CA000040.  This occurs when a partition boot is attempted on a system that has been running for more than 814 days, if the partition is in POWER9_base or POWER9 processor compatibility mode.
    A workaround for this problem is to re-IPL the system or to change the failing partition to POWER8 compatibility mode.
    Note:  If upgrading from FW930.30, this fix is already applied.
  • A problem was fixed for certain PCIe adapters and NVMe U.2 devices. The following feature codes are affected, and the presence of these features in the 9009-41G and 9009-41A servers can result in higher fan speeds and therefore higher acoustic levels after upgrading the firmware (the other models are also affected, but to a lesser extent).  Refer to the acoustic levels published in the IBM Knowledge Center, currently located at https://www.ibm.com/support/knowledgecenter/9009-41A/p9had/p9had_90x.htm:
    #EC5J with CCIN 59B4; #EC5K with CCIN 59B5; #EC5L with CCIN 59B6; #EC5X with CCIN 59B7; #EC7A/#EC7B with CCIN 594A; #EC7C/#EC7D with CCIN 594B; #EC7E/#EC7F with CCIN 594C; #EC7J/#EC7K with CCIN 594A; #EC7L/#EC7M with CCIN 594B; #EC7N/#EC7P with CCIN 594C; #ES1E/#ES1F with CCIN 59B8; #ES1G/#ES1H with CCIN 59B9; #EC5V/#EC5W with CCIN 59BA; #EC5G/#EC5B and #EC6U/#EC6V with CCIN 58FC; #EC5W/#EC5D and #EC6W/#EC6X with CCIN 58FD; and #EC5G/#EC5B and #EC6Y/#EC6Z with CCIN 58FE.
    Note:  A version of this fix in earlier service pack FW941.03 did not add cooling for the 9009-22A and 9009-22G models for the following adapters: #EC5G/#EC5B and #EC6U/#EC6V with CCIN 58FC;  #EC5W/#EC5D and #EC6W/#EC6X with CCIN 58FD; and #EC5G/#EC5B and #EC6Y/#EC6Z with CCIN 58FE.
  • A problem was fixed for a security vulnerability for the Self Boot Engine (SBE). The SBE can be compromised from the service processor to allow injection of malicious code. An attacker that gains root access to the service processor could compromise the integrity of the host firmware and bypass the host firmware signature verification process. This compromised state can not be detected through TPM attestation.  This is Common Vulnerabilities and Exposures issue number CVE-2021-20487.

System firmware changes that affect all systems

  • On systems with an IBM i partition, a problem was fixed for the QPRCFEAT and QMODEL IBM System Values showing the 9009-xxG models as 9009-xxA models.  For example, the 9009-22G reports as "EP11" instead of "EP51".  This mismatch can prevent 3rd party software licenses from working.
    The system has to be re-IPLed for the fix to take effect.  This pertains to 9009-xxG models only that are being upgraded from FW941.00.
    Note: If 3rd party software licenses were installed based on the old incorrect QPRCFEAT QMODEL value, new licenses will be needed to work with the updated value.

VL941

VL941
This package provides firmware for Power Systems S922 (9009-22G), Power Systems S914 (9009-41G), Power Systems S924 (9009-42G), and IBM ESS (5105-22E) servers only.
For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url:
https://www.ibm.com/support/pages/node/6555136
VL941_045_035 / FW941.03

11/05/20

Impact: Data        Severity:  HIPER

System firmware changes that affect all systems

  • HIPER/Pervasive: DEFERRED: A security problem was fixed for the case where "Speculative execution controls to mitigate user-to-kernel side channel attacks" was selected in ASMI, but the system was instead running with full speculative execution.  When this fix is applied concurrently, ASMI will prematurely show for this case (until a re-IPL is performed) that speculative execution controls are fully mitigated, when actually the system is still running with full speculative execution.  The system must be re-IPLed for this change to take effect such that the new security setting will be "Speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks" to provide full speculative execution mitigation.
  • A problem was fixed for certain PCIe adapters and NVMe U.2 devices. The following feature codes are affected, and the presence of these features in the 9009-22G, 9009-41G, 9009-42G, and 5105-22E servers can result in higher fan speeds and therefore higher acoustic levels after upgrading the firmware.  Refer to the acoustic levels published in the IBM Knowledge Center, currently located at https://www.ibm.com/support/knowledgecenter/9009-41A/p9had/p9had_90x.htm:
    #EC5J with CCIN 59B4; #EC5K with CCIN 59B5; #EC5L with CCIN 59B6; #EC5X with CCIN 59B7; #EC7A/#EC7B with CCIN 594A; #EC7C/#EC7D with CCIN 594B; #EC7E/#EC7F with CCIN 594C; #EC7J/#EC7K with CCIN 594A; #EC7L/#EC7M with CCIN 594B; #EC7N/#EC7P with CCIN 594C; #ES1E/#ES1F with CCIN 59B8; #ES1G/#ES1H with CCIN 59B9; #EC5V/#EC5W with CCIN 59BA; #EC5G/#EC5B and #EC6U/#EC6V with CCIN 58FC; #EC5W/#EC5D and #EC6W/#EC6X with CCIN 58FD; and #EC5G/#EC5B and #EC6Y/#EC6Z with CCIN 58FE.

System firmware changes that affect certain systems

  • HIPER/Pervasive:  A problem was fixed for certain SR-IOV adapters for a condition that may result from frequent resets of adapter Virtual Functions (VFs) or from transmission stalls, and that could lead to potential undetected data corruption.
    The following additional fixes are also included:
    1) The VNIC backing device goes to a powered off state during a VNIC failover or Live Partition Mobility (LPM) migration.  This failure is intermittent and very infrequent.
    2) Adapter time-outs with SRC B400FF01 or B400FF02 logged.
    3) Adapter time-outs related to adapter commands becoming blocked with SRC B400FF01 or B400FF02 logged.
    4) VF function resets occasionally not completing quickly enough resulting in SRC B400FF02 logged.
    This fix updates the adapter firmware to 11.4.415.33 for the following Feature Codes and CCINs:  #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
    This pertains to 9009-xxG models only.
VL941_039_035 / FW941.01

10/02/20

Impact: Availability       Severity:  SPE

System firmware changes that affect certain systems

  • DEFERRED:  On systems with an IBM i partition, a problem was fixed for the QPRCFEAT IBM System Value showing the 9009-xxG models as 9009-xxA models.  For example, the 9009-22G reports as "EP11" instead of "EP51".  This mismatch can prevent 3rd party software licenses from working.
    The system has to be re-IPLed for the fix to take effect.  This pertains to 9009-xxG models only.
    For IBM i stand-alone systems (no HMC), please contact IBM i Software Support (http://www.ibm.com/mysupport) for help installing this service pack as there is no officially released IBM i  PTF available for it.
VL941_035_035 / FW941.00

07/27/20

Impact: New        Severity:  New

GA level with key features is listed below, along with fixes for new field defects since FW940.10.
All features and fixes from the FW940.10 service pack are included in this release but are not shown.

New features and functions

  • OPAL is supported with skiboot level v6.6-rc1 and petitboot level v1.12. This pertains to model ESS 5105-22E only.
  • Support for NVDIMMs for Linux OS RHEL 8.1 and later with feature code #EM71. This pertains to model ESS 5105-22E only.
  • Support for NVDIMM Backup Power Module (BPM) firmware level v1.07 (0x0107). This pertains to model ESS 5105-22E only.
  • Support for NVDIMM controller firmware level v3.B (0x3B). This pertains to model ESS 5105-22E only.
  • This server firmware level includes the SR-IOV adapter firmware level 11.4.415.28 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3; #EN0H/#EN0J with CCIN 2B93; and #EN0K/#EN0L with CCIN 2CC1. This pertains to 9009-xxG models only.
  • This server firmware includes the SR-IOV adapter firmware level 1x.26.6000 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA;  #EC2T/EC2U with CCIN 58FB;  #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3. This pertains to 9009-xxG models only.
  • Support for the 6.4TB SSD PCIe4 NVMe U.2 module for AIX/Linux and IBM i with feature codes #EC5V/#EC5W and CCIN 59BA. Feature #EC5V indicates usage by AIX, Linux, or VIOS in which the SSD is formatted in 4096 byte sectors. Feature #EC5W indicates usage by IBM i in which the SSD is formatted in 4160 byte sectors. This pertains to 9009-xxG models only.
  • Support for the 800 GB SSD PCIe3 NVMe U.2 module for AIX/Linux with feature codes #EC5X and CCIN 59B7. Feature #EC5X indicates usage by AIX, Linux, or VIOS in which the SSD is formatted in 4096 byte sectors. This pertains to 9009-xxG models only.
  • Support for the 1.6 TB SSD PCIe4 NVMe U.2 module for AIX/Linux and IBM i with feature codes #ES1E/#ES1F and CCIN 59B8. Feature #ES1E indicates usage by AIX, Linux, or VIOS in which the SSD is formatted in 4096 byte sectors. Feature #ES1F indicates usage by IBM i in which the SSD is formatted in 4160 byte sectors. This pertains to 9009-xxG models only.
  • Support for the 3.2 TB SSD PCIe4 NVMe U.2 module for AIX/Linux and IBM i with feature codes #ES1G/#ES1H and CCIN 59B9. Feature #ES1G indicates usage by AIX, Linux, or VIOS in which the SSD is formatted in 4096 byte sectors. Feature #ES1H indicates usage by IBM i in which the SSD is formatted in 4160 byte sectors. This pertains to 9009-xxG models only.

System firmware changes that affect all systems

  • A problem was fixed for the REST/Redfish interface to change the success return code for object creation from "200" to "201". The "200" status code indicates only that the request succeeded, whereas the "201" status code indicates that the request was successful and, as a result, a resource has been created. The Redfish Ruby Client, "redfish_client", may fail a transaction if a "200" status code is returned when "201" is expected.
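    The distinction matters to strict clients. A minimal sketch (illustrative only, not IBM or redfish_client code; the helper name is hypothetical) of why a client that expects 201 on creation rejects a 200:

    ```python
    # Illustrative sketch (not IBM code): a strict Redfish client treats only
    # 201 Created as a successful resource creation. Per HTTP semantics,
    # 200 means the request succeeded; 201 means a new resource was created.

    def check_creation_response(status_code: int) -> bool:
        """Return True only if the POST actually created a resource."""
        return status_code == 201

    # A firmware level returning 200 for a successful create fails this check:
    print(check_creation_response(200))  # pre-fix behavior: rejected
    print(check_creation_response(201))  # post-fix behavior: accepted
    ```

    This mirrors the failure mode described above: the operation succeeded on the service processor, but the client's status-code check rejected it.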
  • A problem was fixed to allow quicker recovery of PCIe links for the #EMXO PCIe expansion drawer for a run time fault with B7006A22 logged.  The time for recovery attempts can exceed six minutes on rare occasions which may cause I/O adapter failures and failed nodes. With the fix, the PCIe links will recover or fail faster (in the order of seconds) so that redundancy in a cluster configuration can be used with failure detection and failover processing by other hosts, if available, in the case where the PCIe links fail to recover. This pertains to 9009-xxG models only.
  • A problem was fixed for a concurrent maintenance "Repair and Verify" (R&V) operation for a #EMX0 fanout module that fails with an "Unable to isolate the resource" error message. This should occur only infrequently for cases where a physical hardware failure has occurred which prevents access to slot power controls. This problem can be worked around by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC after the hardware failure but before the concurrent repair is attempted. This will avoid the problem with the PCIe slot isolation. These steps can also be used to recover from the error to allow the R&V repair to be attempted again. This pertains to 9009-xxG models only.
  • A problem was fixed for utilization statistics for commands such as HMC lslparutil and third-party lpar2rrd that do not accurately represent CPU utilization. The values are incorrect every time for a partition that is migrated with Live Partition Mobility (LPM).  Power Enterprise Pools 2.0 is not affected by this problem. If this problem has occurred, here are three possible recovery options:
    1) Re-IPL the target system of the migration.
    2) Or delete and recreate the partition on the target system.
    3) Or perform an inactive migration of the partition.  The cycle values get zeroed in this case.
    This pertains to 9009-xxG models only.

VL940

VL940
For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url:
https://www.ibm.com/support/pages/node/6555136
VL940_098_027 / FW940.60

03/21/22

Impact: Availability      Severity:  SPE

System firmware changes that affect all systems

  • A problem was fixed for a possible denial of service on the service processor for ASMI and Redfish users.  This problem is very rare and could be triggered by a large number of invalid login attempts to Redfish over a short period of time.
  • A problem was fixed for system fans not increasing in speed when partitions are booted with PCIe hot adapters that require additional cooling.  This fan speed problem can also occur if there is a change in the power mode that requires a higher minimum speed for the fans of the system than is currently active.  Fans running at a slower speed than required for proper system cooling could lead to over-temperature conditions for the system.
  • A problem was fixed for correct ASMI passwords being rejected when accessing ASMI using an ASCII terminal with a serial connection to the server.  This problem always occurs for systems at firmware level FW940.40 and later.
  • A problem was fixed for a partition with an SR-IOV logical port (VF) having a delay in the start of the partition. If the partition boot device is an SR-IOV logical port network device, this issue may result in the partition failing to boot with SRCs BA180010 and BA155102 logged and then becoming stuck on progress code SRC 2E49 for an AIX partition.  This problem is infrequent because it requires multiple error conditions at the same time on the SR-IOV adapter.  To trigger this problem, multiple SR-IOV logical ports for the same adapter must encounter EEH conditions at roughly the same time, such that a new logical port EEH condition occurs while a previous EEH condition's handling is almost complete but not yet reported to the hypervisor.  To recover from this problem, reboot the partition.
  • A problem was fixed for a system hypervisor hang and an Incomplete state on the HMC after a logical partition (LPAR) is deleted that has an active virtual session from another LPAR.  This problem happens every time an LPAR is deleted with an active virtual session.  This is a rare problem because virtual sessions from an HMC (a more typical case) prevent an LPAR deletion until the virtual session is closed, but virtual sessions originating from another LPAR do not have the same check.
  • A problem was fixed for a secondary fault after a partition creation error that could result in a Terminate Immediate (TI) of the system with an SRC B700F103 logged.  The failed creation of partitions can be explicit or implicit that might trigger the secondary fault.  One example of an implicit partition create is the ghost partition created for a Live Partition Mobility (LPM) migration.  This type of partition can fail to create when there is insufficient memory available for the hardware page table (HPT) for the new partition.
  • A problem was fixed for certain SR-IOV adapters that occurs during a VNIC failover when the VNIC backing device has a physical port down due to an adapter internal error, with SRC B400FF02 logged.  This is an improved version of the fix delivered in earlier service pack FW940.40 (adapter firmware level 11.4.415.37) and it significantly reduces the frequency of the error.
    This fix updates the adapter firmware to 11.4.415.41 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
     

System firmware changes that affect certain systems

  • For a system with an AIX or Linux partition, a problem was fixed for a partition start failure for AIX or Linux with SRC BA54504D logged.  This problem occurs if the partition is an MDC default partition with virtual Trusted Platform Module (vTPM) enabled.  As a circumvention, power off the system and disable vTPM by using the HMC GUI to change the default partition property for Virtualized Trusted Platform Module (VTPM) to off.
VL940_095_027 / FW940.50

11/17/21

Impact: Availability      Severity:  SPE

System firmware changes that affect all systems

  • A problem was fixed for an incorrect "Power Good fault" SRC logged for an #EMX0 PCIe3 expansion drawer on the lower CXP cable of B7006A85 (AOCABLE, PCICARD).  The correct SRC is B7006A86 (PCICARD, AOCABLE).
  • A problem was fixed for a missing error log SRC for an SR-IOV adapter in Shared mode that fails during the IPL because of adapter failure or because the system has insufficient memory for SR-IOV Shared mode for the adapter.  The error log SRC added is B7005308, indicating a serviceable event and providing the adapter and error information.
  • A problem was fixed for a longer boot time for a shared processor partition on the first boot after the processor chip 0 has been guarded.  The partition boot would stall at SRC C20012FF but eventually complete.  This rare problem is triggered by the loss of all cores in processor chip 0.  On subsequent partition boots after the slow problem boot, the boot speeds return to normal.
  • A problem was fixed for a Live Partition Mobility (LPM) hang during LPM validation on the target system.  This is a rare system problem triggered during a LPM migration that causes LPM attempts to fail as well as other functionality such as configuration changes and partition shutdowns.
    To recover from this problem to be able to LPM and other operations such as configuration changes and shutting down partitions, the system must be re-IPLed.
  • A problem was fixed for the system powering off after a hardware discovery IPL.  This will happen if a hardware discovery IPL is initiated while the system is set to "Power off when last partition powers off".  The system will power off when the Hardware Discovery Information (IOR) partition that does hardware discovery powers off.  As a workaround, one should not use the "Power off when last partition powers off" setting when doing the hardware discovery IPL. Alternatively, one can just do a normal IPL after the system powers off, and then continue as normal.
  • A problem was fixed for the HMC Repair and Verify (R&V) procedure failing with "Unable to isolate the resource" during concurrent maintenance of the #EMX0 Cable Card.  This could lead one to take a disruptive action in order to do the repair. This should occur infrequently and only with cases where a physical hardware failure has occurred which prevents access to the PCIe reset line (PERST) but allows access to the slot power controls.
    As a workaround, pulling both cables from the Cable Card to the #EMX0 expansion drawer will result in a completely failed state that can be handled by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC. Then retry the R&V operation to recover the Cable Card.
  • A problem was fixed to prevent a flood of informational PCIe Host Bridge (PHB) error logs with SRC B7006A74 that cause a wrap of internal flight recorders and loss of data needed for problem debug.  This flood can be triggered by bad cables or other issues that cause frequent informational error logs. With the fix, thresholding has been added for informational PHB correctable errors at 10 in 24 hours before a Predictive Error is logged.
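    The thresholding described above can be sketched as a sliding-window counter (an illustrative model only, not the actual hypervisor implementation; class and method names are hypothetical):

    ```python
    # Illustrative sketch (not IBM code) of the thresholding behavior:
    # informational PHB correctable-error events are counted, and only once
    # 10 occur within a 24-hour window is a Predictive Error raised.
    from collections import deque

    THRESHOLD = 10
    WINDOW_SECONDS = 24 * 60 * 60

    class PhbErrorThreshold:
        def __init__(self) -> None:
            self.events: deque[float] = deque()  # timestamps of info errors

        def record(self, now: float) -> bool:
            """Record one informational error; return True when the threshold
            is crossed and a Predictive Error should be logged."""
            self.events.append(now)
            # Drop events that have aged out of the 24-hour window.
            while self.events and now - self.events[0] > WINDOW_SECONDS:
                self.events.popleft()
            return len(self.events) >= THRESHOLD
    ```

    Under this model, occasional informational errors never surface, while a flood of ten within a day produces a single serviceable Predictive Error instead of wrapping the flight recorders.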
  • A change was made for certain SR-IOV adapters to move up to the latest level of adapter firmware.  No specific adapter problems were addressed at this new level.  This change updates the adapter firmware to XX.30.1004 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for an SR-IOV adapter in shared mode configured as a Virtual Ethernet Port Aggregator (VEPA), where the SR-IOV adapter goes through EEH error recovery, causing an informational error with SRC B400FF04 and additional information text indicating that a command failed.  This always happens when an adapter goes through EEH recovery while a physical port is in VEPA mode.  With the fix, the informational error is not logged.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for certain SR-IOV adapters that encountered a rare adapter condition, had some response delays, and logged an Unrecoverable Error with SRC B400FF02.  With the fix, this rare condition is handled without the delay, an Informational Error is logged instead, and the adapter initialization continues without interruption.  This fix pertains to adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for certain SR-IOV adapters in SR-IOV shared mode which may cause a network interruption and SRCs B400FF02 and B400FF04 logged.  The problem occurs infrequently during normal network traffic.
    This fix updates the adapter firmware to 11.4.415.38 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for Platform Error Logs (PELs) not being logged and shown by the OS if they have an Error Severity code of "critical error".   The trigger is the reporting by a system firmware subsystem of an error log that has set an Event/Error Severity in the 'UH' section of the log to a value in the range, 0x50 to 0x5F.  The following error logs are affected:
    B200308C ==> PHYP ==>  A problem occurred during the IPL of a partition.  The adapter type cannot be determined. Ensure that a valid I/O Load Source is tagged.
    B700F104 ==> PHYP ==> Operating System error.  Platform Licensed Internal Code terminated a partition.
    B7006990 ==> PHYP ==> Service processor failure
    B2005149 ==> PHYP ==>  A problem occurred during the IPL of a partition.
    B700F10B ==> PHYP ==>  A resource has been disabled due to hardware problems
    A7001150 ==> PHYP ==> System log entry only, no service action required. No action needed unless a serviceable event was logged.
    B7005442 ==> PHYP ==> A parity error was detected in the hardware Segment Lookaside Buffer (SLB).
    B200541A ==> PHYP ==> A problem occurred during a partition Firmware Assisted Dump
    B7001160 ==> PHYP ==> Service processor failure.
    B7005121 ==> PHYP ==> Platform LIC failure
    BC8A0604 ==> Hostboot  ==> A problem occurred during the IPL of the system.
    BC8A1E07 ==> Hostboot  ==>  Secure Boot firmware validation failed.
    Note that these error logs are still reported to the service processor and HMC properly.  This issue does not affect the Call Home action for the error logs.
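    The trigger condition above is a simple range check on the severity byte; a minimal sketch (illustrative only, not the actual PEL-handling code; the function name is hypothetical):

    ```python
    # Illustrative sketch (not IBM code): classifying the Event/Error Severity
    # byte from a PEL 'UH' section. Values 0x50-0x5F denote "critical error",
    # the range the pre-fix code failed to surface to the OS.

    def is_critical_severity(severity: int) -> bool:
        """True if the UH-section severity byte falls in the critical range."""
        return 0x50 <= severity <= 0x5F

    print(is_critical_severity(0x55))  # in the affected range
    print(is_critical_severity(0x40))  # outside the affected range
    ```

    Any PEL whose severity byte fell inside this range was dropped by the OS reporting path before the fix, which is why all of the SRCs listed above were affected.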
  • A problem was fixed for the Device Description in a System Plan related to Crypto Coprocessors and NVMe cards that were only showing the PCI vendor and device ID of the cards.  This is not enough information to verify which card is installed without looking up the PCI IDs first.  With the fix, more specific/useful information is displayed and this additional information does not have any adverse impact on sysplan operations.  The problem is seen every time a System Plan is created for an installed Crypto Coprocessor or NVMe card.

System firmware changes that affect certain systems

  • For a system with an IBM i partition, a problem was fixed for memory mapped I/O and interrupt resources not being cleaned up for an SR-IOV VF when an IBM i partition is shut down.  This is a rare problem that requires adapters in SR-IOV shared mode being assigned to the partition and certain timings of activity on the adapter prior to a shutdown of the partition.  The lost resources are not available on the next activation of the partition, but in most cases this should not result in a loss of function.  The lost resources are recovered on the next re-IPL of the system.
  • For a system with an IBM i partition, a problem was fixed for an IBM i partition running in P7 or P8 processor compatibility mode failing to boot with SRCs BA330002 and B200A101 logged.  This problem can be triggered as larger configurations of processors and memory are added to the partition.  A circumvention could be to reduce the number of processors and amount of memory in the partition; booting in P9 or later compatibility mode will also allow the partition to boot.
  • For a system with an AIX or Linux partition, a problem was fixed for Platform Error Logs (PELs) that are truncated to only eight bytes for error logs created by the firmware and reported to the AIX or Linux OS.  These PELs may appear to be blank or missing on the OS.  This rare problem is triggered by multiple error log events in the firmware occurring close together in time, each needing to be reported to the OS, causing a truncation in the reporting of the PEL.  As a workaround, the full versions of the truncated error logs are available on the HMC or can be viewed using ASMI on the service processor.
  • For a system with an AIX or Linux partition, a problem was fixed for a boot hang in RTAS for a partition that owns I/O which uses MSI-X interrupts.  A BA180007 SRC may be logged prior to the hang.  The frequency of this RTAS hang error is very low. 
VL940_093_027 / FW940.41

09/16/21

Impact: Data       Severity:  HIPER

System firmware changes that affect all systems

  • HIPER:  A problem was fixed which may occur on a target system following a Live Partition Mobility (LPM) migration of an AIX partition utilizing Active Memory Expansion (AME) with 64 KB page size enabled using the vmo tunable: "vmo -ro ame_mpsize_support=1".  The problem may result in AIX termination, file system corruption, application segmentation faults, or undetected data corruption.
    Note:  If you are doing an LPM migration of an AIX partition utilizing AME and 64 KB page size enabled involving a POWER8 or POWER9 system, ensure you have a Service Pack including this change for the appropriate firmware level on both the source and target systems.
  • HIPER/Pervasive:  A problem was fixed for certain SR-IOV adapters in Shared mode where multicast and broadcast packets were not properly routed out to the physical port.  This may result in network issues such as ping failures or the inability to establish TCP connections.  This problem only affects the SR-IOV adapters with the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
    This problem was introduced by a fix delivered in the FW940.40 service pack.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for Live Partition Mobility (LPM) migrations from non-trusted POWER9 systems to POWER10 systems. The LPM migration failure occurs every time a LPM migration is attempted from a non-trusted system source to FW1010 and later.  For POWER9 systems, non-trusted is the default setting.  The messages shown on the HMC for the failure are the following:
     HSCL365C The partition migration has been stopped because platform firmware detected an error (041800AC).
     HSCL365D The partition migration has been stopped because target MSP detected an error (05000127).
    A workaround for the problem is to enable the trusted system key on the POWER9 FW940/FW950 source system which can be done using an intricate procedure.  Please contact IBM Support for help with this workaround.

System firmware changes that affect certain systems

  • For a system with a partition running AIX 7.3, a problem was fixed for running Live Update or Live Partition Mobility (LPM).  AIX 7.3 supports Virtual Persistent Memory (PMEM), but PMEM cannot be used with these operations; the problem made it appear that PMEM was configured when it was not, so the Live Update and LPM operations always failed when attempted on AIX 7.3.  Here is the failure output from a Live Update Preview:
    "1430-296 FAILED: not all devices are virtual devices.
    nvmem0
    1430-129 FAILED: The following loaded kernel extensions are not known to be safe for Live Update:
    nvmemdd
    ...
    1430-218 The live update preview failed.
    0503-125 geninstall:  The lvupdate call failed.
    Please see /var/adm/ras/liveupdate/logs/lvupdlog for details."
VL940_087_027 / FW940.40

07/08/21

Impact: Availability     Severity:  SPE

New features and functions

  • Support added to Redfish to provide a command to set the ASMI user passwords using a new AccountService schema.   Using this service, the ASMI admin, HMC, and general user passwords can be changed.
  • Support was changed to disable Service Location Protocol (SLP) by default for newly shipped systems or systems that are reset to manufacturing defaults.  This change reduces memory usage on the service processor by disabling a service that is not needed for normal system operations.  Existing systems can be changed manually in ASMI under "System Configuration -> Security -> External Services Management" to disable the service.
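The new AccountService support noted above can be sketched with the standard DMTF Redfish AccountService schema.  This is a minimal, hedged illustration only: the "Password" property name comes from the published schema, but the exact account URI exposed by the service processor is an assumption.

```python
# Hedged sketch of the JSON payload a Redfish client would PATCH to
# an account resource to change its password.  The account URI below
# is an assumed example path, not a documented endpoint.
import json

account_uri = "/redfish/v1/AccountService/Accounts/admin"  # assumed path
payload = json.dumps({"Password": "NewAsmiPassword1"})     # schema-defined property

# A PATCH of `payload` to `account_uri`, with proper authentication,
# would request the password change for that account.
print(account_uri)
print(payload)
```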

System firmware changes that affect all systems

  • A problem was fixed for the system going to a "password update required" state on the HMC when downgrading from FW940 to FW930 service packs.  This problem is rare and can only happen if the passwords on the service processor are set to the factory default values.  The workaround to this problem is to update the FSP user password on the HMC.
  • A problem was fixed for Time of Day (TOD) being lost for the real-time clock (RTC) with an SRC B15A3303 logged when the service processor boots or resets.  This is a very rare problem that involves a timing problem in the service processor kernel.  If the server is running when the error occurs, an SRC B15A3303 is logged and the time of day on the service processor will be incorrect for up to six hours until the hypervisor synchronizes its (valid) time with the service processor.  If the server is not running when the error occurs, an SRC B15A3303 is logged, and if the server is subsequently IPLed without setting the date and time in ASMI, the IPL will abort with an SRC B7881201, which indicates to the system operator that the date and time are invalid.
  • A problem was fixed for intermittent failures for a reset of a Virtual Function (VF) for SR-IOV adapters during Enhanced Error Handling (EEH) error recovery.  This is triggered by EEH events at a VF level only, not at the adapter level.  The error recovery fails if a data packet is received by the VF while the EEH recovery is in progress.  A VF that has failed can be recovered by a partition reboot or a DLPAR remove and add of the VF.
  • A problem was fixed for performance degradation of a partition due to task dispatching delays.  This may happen when a processor chip has all of its shared processors removed and converted to dedicated processors. This could be driven by a DLPAR remove of processors or Dynamic Platform Optimization (DPO).
  • A problem was fixed for a logical partition activation error that can occur when trying to activate a partition when the adapter hardware for an SR-IOV logical port has been physically removed or is unavailable due to a hardware issue. This message is reported on the HMC for the activation failure:  "Error:  HSCL12B5 The operation to remove SR-IOV logical port <number> failed because of the following error: HSCL1552 The firmware operation failed with extended error" where the logical port number will vary.  This is an infrequent problem that is only an issue if the adapter hardware has been removed or another problem makes it unavailable.  The workaround for this problem is to physically add the hardware back in or correct the hardware issue.  If that cannot be done, create an alternate profile for the logical partition without the SR-IOV logical port and use that until the hardware issue is resolved.
  • A problem was fixed for incomplete periodic data gathered by IBM Service for #EMXO PCIe expansion drawer predictive error analysis.  The service data is missing the PLX (PCIe switch) data that is needed for the debug of certain errors.
  • A problem was fixed for a rare failure for an SPCN I2C command sent to a PCIe I/O expansion drawer that can occur when service data is manually collected with the hypervisor macros "xmsvc -dumpCCData" and "xmsvc -logCCErrBuffer".  If the hypervisor macro "xmsvc" is run to gather service data and a CMC Alert occurs at the same time that requires an SPCN command to clear the alert, then the I2C commands may be improperly serialized, resulting in an SPCN I2C command failure.  To prevent this problem, avoid using "xmsvc -dumpCCData" and "xmsvc -logCCErrBuffer" to collect service data until this fix is applied.
  • A problem was fixed for a system hang or terminate with SRC B700F105 logged during a Dynamic Platform Optimization (DPO) that is running with a partition in a failed state but that is not shut down.  If DPO attempts to relocate a dedicated processor from the failed partition, the problem may occur.  This problem can be avoided by doing a shutdown of any failed partitions before initiating DPO.
  • A problem was fixed for a system crash with HMC message HSCL025D and SRC B700F103 logged on a Live Partition Mobility (LPM) inactive migration attempt that fails.  The trigger for this problem is inactive migration that fails a compatibility check between the source and target systems.
  • A problem was fixed for the Systems Management Services (SMS) menu "I/O Device Information" option being incorrect when displaying the capacity for an NVMe or Fibre Channel (FC) NVMe disk.  This problem occurs every time the data is displayed.
  • A problem was fixed for an infrequent SRC of B7006956 that may occur during a system power off.  This SRC indicates that encrypted NVRAM locations failed to synchronize with the copy in memory during the shutdown of the hypervisor. This error can be ignored as the encrypted NVRAM information is stored in a redundant location, so the next IPL of the system is successful.
  • A problem was fixed for a misleading SRC B7006A20 (Unsupported Hardware Configuration) that can occur for some error cases for PCIe #EMX0 expansion drawers that are connected with copper cables.  For cable unplug errors, the SRC B7006A88 (Drawer TrainError) should be shown instead of the B7006A20.  If a B7006A20 is logged against copper cables with the signature "Prc UnsupportedCableswithFewerChannels" and the message "NOT A 12CHANNEL CABLE", this error should instead follow the service actions for a B7006A88 SRC.
  • A problem was fixed for certain SR-IOV adapters not being able to create the maximum number of VLANs that are supported for a physical port.  There were insufficient memory pages allocated for the physical functions for this adapter type.  The SR-IOV adapters affected have the following Feature Codes and CCINs:  #EC66/#EC67 with CCIN 2CF3.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for certain SR-IOV adapters that can have B400FF02 SRCs logged with LPA dumps during a vNIC remove operation.  The adapters can have issues with a deadlock in managing memory pages.  In most cases, the operations should recover and complete.  This fix updates the adapter firmware to XX.29.2003  for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • The following problems were fixed for certain SR-IOV adapters:
    1) An error was fixed that occurs during a VNIC failover where the VNIC backing device has a physical port down or read port errors with an SRC B400FF02 logged.
    2) A problem was fixed for adding a new logical port with an assigned PVID that caused traffic on that VLAN to be dropped by other interfaces on the same physical port that use OS VLAN tagging for the same VLAN ID.  Each time a logical port with a non-zero PVID matching an existing VLAN is dynamically added to a partition, or is activated as part of a partition activation, traffic flow stops for other partitions with OS-configured VLAN devices on the same VLAN ID.  This problem can be recovered by configuring an IP address on the logical port with the non-zero PVID and initiating traffic flow on this logical port.  This problem can be avoided by not configuring logical ports with a PVID if other logical ports on the same physical port are configured with OS VLAN devices.
    This fix updates the adapter firmware to 11.4.415.37 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for some serviceable events specific to the reporting of EEH errors not being displayed on the HMC.  The sending of an associated call home event, however, was not affected.  This problem is intermittent and infrequent.
  • A problem was fixed for possible partition errors following a concurrent firmware update from FW910 or later.  A precondition for this problem is that DLPAR operations of either physical or virtual I/O devices must have occurred prior to the firmware update.  The error can take the form of a partition crash at some point following the update.  The frequency of this problem is low.  If the problem occurs, the OS will likely report a DSI (Data Storage Interrupt) error.  For example, AIX produces a DSI_PROC log entry.  If the partition does not crash, it is also possible that some subsequent I/O DLPAR operations will fail.
  • A problem was fixed for a missing hardware callout and guard for a processor chip failure with SRC BC8AE540 and signature "ex(n0p0c5) (L3FIR[28]) L3 LRU array parity error".
  • A problem was fixed for a missing hardware callout and guard for a processor chip failure with Predictive Error (PE) SRC BC70E540 and signature "ex(n1p2c6) (L2FIR[19]) Rc or NCU Pb data CE error".  The PE error occurs after the number of CE errors reaches a threshold of 32 errors per day.
  • A problem was fixed for a Live Partition Mobility (LPM) migration that failed with the error  "HSCL3659 The partition migration has been stopped because orchestrator detected an error" on the HMC.  This problem is intermittent and rare that is triggered by the HMC being overrun with unneeded LPM message requests from the hypervisor that can cause a timeout in HMC queries that result in the LPM operation being aborted.  The workaround is to retry the LPM migration which will normally succeed.
  • A problem was fixed for a service processor mailbox (mbox) timeout error with SRC B182953C during the IPL of systems with large memory configurations and "I/O Adapter Enlarged Capacity" enabled from ASMI.  The error indicates that the hypervisor did not respond quickly enough to a message from the service processor, but this may not result in an IPL failure.  The problem is intermittent, so if the IPL does fail, the workaround is to retry the IPL.
  • Problems were fixed for DLPAR operations that change the uncapped weight of a partition and DLPAR operations that switch an active partition from uncapped to capped.  After changing the uncapped weight, the weight can be incorrect.  When switching an active partition from uncapped to capped, the operation can fail.
  • A problem was fixed where the Floating Point Unit Computational Test, which should be set to "staggered" by default, has been changed in some circumstances to be disabled. If you wish to re-enable this option, this fix is required.  After applying this service pack,  do the following steps:
    1) Sign in to the Advanced System Management Interface (ASMI).
    2) Select Floating Point Computational Unit under the System Configuration heading and change it from disabled to what is needed: staggered (run once per core each day) or periodic (a specified time).
    3) Click "Save Settings".
  • A problem was fixed for a system termination with SRC B700F107 following a time facility processor failure with SRC B700F10B.  With the fix, the transparent replacement of the failed processor will occur for the B700F10B if there is a free core, with no impact to the system.
  • A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where unmatched unicast packets were not forwarded to the promiscuous mode VF.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.


System firmware changes that affect certain systems

  • On systems with an IBM i partition, a problem was fixed for physical I/O property data not being able to be collected for an inactive partition booted in "IOR" mode with SRC B200A101 logged.   This can happen when making a system plan (sysplan) for an IBM i partition using the HMC and the IBM i partition is inactive.  The sysplan data collection for the active IBM i partitions is successful.
VL940_084_027 / FW940.32

05/25/21

Impact: Availability     Severity:  HIPER

New features and functions

  • Support was added for Samsung DIMMs with part number 01GY853.  If these DIMMs are installed in a system with older FW940 firmware than FW940.32, the DIMMs will fail and be guarded with SRC BC8A090F logged with HwpReturnCode "RC_CEN_MBVPD_TERM_DATA_UNSUPPORTED_VPD_ENCODE".

System firmware changes that affect all systems

  • HIPER/Pervasive: A problem was fixed for a system IPL failure when DIMMs (RDIMMs or NVDIMMs) have mixed configurations with dual populated memory channels and single populated memory channels. This problem occurs if there are dually populated memory channels that precede a single DIMM memory channel for a processor. This causes the IPL to fail with B150BA40 and BC8A090F logged with HwpReturnCode "RC_MSS_CALC_POWER_CURVE_NEGATIVE_OR_ZERO_SLOPE" and HWP Error description "Power curve slope equals 0 or is negative". A workaround for this problem is to reconfigure the memory to have the single DIMM memory channels be in front of memory channels that have both DIMM slots occupied.
  • HIPER/Pervasive:  A problem was fixed for a checkstop due to an internal Bus transport parity error or a data timeout on the Bus. This is a very rare problem that requires a particular SMP transport link traffic pattern and timing. Both the traffic pattern and timing are very difficult to achieve with customer application workloads. The fix will have no measurable effect on most customer workloads although highly intensive OLAP-like workloads may see up to 2.5% impact.
VL940_074_027 / FW940.31

03/24/21

Impact: Availability     Severity:  SPE

System firmware changes that affect all systems

  • A problem was fixed for a partition hang in shutdown with SRC B200F00F logged.  The trigger for the problem is an asynchronous NX accelerator job (such as gzip or NX842 compression) in the partition that fails to clean up successfully.  This is intermittent and does not cause a problem until a shutdown of the partition is attempted.  The hung partition can be recovered by performing an LPAR dump on the hung partition.  When the dump has been completed, the partition will be properly shut down and can then be restarted without any errors.
VL940_071_027 / FW940.30

02/04/21

Impact: Availability     Severity:  HIPER

New features and functions

  • Support added to be able to set the NVRAM variable 'real-base' from the Restricted OF Prompt (ROFP). Prior to the introduction of ROFP, customers had the ability to set 'real-base' from the OF prompt.  This capability was removed in the initial delivery of ROFP in FW940.00. One use for this capability is that, in some cases, OS images (usually Linux) need more memory to load their image for boot. The OS image is loaded in between 0x4000 'load-base' and 0x2800000 'real-base'.
  • Added support in ASMI for a new panel to perform Self-Boot Engine (SBE) SEEPROM validation.  This validation can only be run at the service processor standby state.
    If the validation detects a problem, IBM recommends the system not be used and that IBM service be called.
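The 'real-base' item above can be illustrated with the standard IEEE 1275 Open Firmware environment-variable words printenv and setenv, entered at the Restricted OF Prompt.  This is a sketch only: the larger value shown is a hypothetical example, and numbers at the OF prompt are hexadecimal.

```text
0 > printenv real-base          \ display the current value (default 2800000)
0 > setenv real-base 3000000    \ hypothetical larger value; takes effect on reboot
```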

System firmware changes that affect all systems

  • HIPER/Pervasive: A problem was fixed to be able to detect a failed PFET sensing circuit in a core at runtime, and prevent a system fail with an incomplete state when a core fails to wake up. The failed core is detected on the subsequent IPL. With the fix, a core with the PFET failure is called out with SRC BC13090F and the hardware description "CME detected malfunctioning of PFET headers" to better isolate the error with a correct callout.
  • A problem was fixed for a slowdown in PCIe adapter performance or loss of adapter function caused by a reduction in interrupts available to service the adapter.  This problem can be triggered over time by partition activations or DLPAR adds of PCIe adapters to a partition.  This fix must be applied and the system re-IPLed for existing adapter performance problems to be resolved.  However, the fix will prevent future issues without a re-IPL if applied before the problem is observed.
  • A problem was fixed for certain PCIe adapters and NVMe U.2 devices. The following feature codes are affected, and the presence of these features in the 9009-41A server can result in higher fan speeds and therefore higher acoustic levels after upgrading the firmware (the other models are also affected but to a lesser extent).  Refer to the acoustic levels published in the IBM Knowledge Center, currently located at https://www.ibm.com/support/knowledgecenter/9009-41A/p9had/p9had_90x.htm:
    #EC5J with CCIN 59B4; #EC5K with CCIN 59B5; #EC5L with CCIN 59B6; #EC5X with CCIN 59B7; #EC7A/#EC7B with CCIN 594A; #EC7C/#EC7D with CCIN 594B; #EC7E/#EC7F with CCIN 594C; #EC7J/#EC7K with CCIN 594A; #EC7L/#EC7M with CCIN 594B; #EC7N/#EC7P with CCIN 594C; #ES1E/#ES1F with CCIN 59B8; #ES1G/#ES1H with CCIN 59B9; #EC5V/#EC5W with CCIN 59BA; #EC5G/#EC5B and #EC6U/#EC6V with CCIN 58FC; #EC5W/#EC5D and #EC6W/#EC6X with CCIN 58FD; and #EC5G/#EC5B and #EC6Y/#EC6Z with CCIN 58FE.
  • A problem was fixed for not logging SRCs for certain cable pulls from the #EMXO PCIe expansion drawer. With the fix, the previously undetected cable pulls are now detected and logged with SRC B7006A8B and B7006A88 errors.
  • A problem was fixed for a system hang and HMC "Incomplete" state that may occur when a partition hangs in shutdown with SRC B200F00F logged. The trigger for the problem is an asynchronous NX accelerator job (such as gzip or NX842 compression) in the partition that fails to clean up successfully.  This is intermittent and does not cause a problem until a shutdown of the partition is attempted.
  • A problem was fixed for a VIOS, AIX, or Linux partition hang during an activation at SRC CA000040.  This occurs when partition boot is attempted on a system that has been running for more than 814 days with the partitions in POWER9_base or POWER9 processor compatibility mode.
    A workaround to this problem is to re-IPL the system or to change the failing partition to POWER8 compatibility mode.
  • A problem was fixed for performance tools perfpmr, tprof and pex that may not be able to collect data for the event based options. 
    This can occur any time an OS thread becomes idle.  When the processor cores are assigned to the next active process, the performance registers may be disabled.
  • A problem was fixed for a rare system hang with SRC BC70E540 logged that may occur when adding processors through licensing or the system throttle state changing (becoming throttled or unthrottled) on an Enterprise Pool system.  The trigger for the problem is a very small timing window in the hardware as the processor loads are changing.
  • A problem was fixed for an intermittent anchor card timeout with informational SRC B7009020 logged when reading TPM physical storage from the anchor card. There is no customer impact for this problem as long as NVRAM is accessible.
  • A problem was fixed for the On-Chip Controller (OCC) going into safe mode (causes loss of processor performance) with SRC BC702616 logged. This problem can be triggered by the loss of a power supply (an oversubscription event).  The problem can be circumvented by fixing the issue with the power supply.
  • A problem was fixed for error handling of a rare DIMM VPD read that causes incorrect logging of SRC B1232A09, word 8 (00000000), "Error occurred when attempting to read a memory DIMM temperature". Other SRCs seen with this error may include BC23E504 and B1561314. This error results in multiple FRUs being called out such as the system planar, processor, DIMM controller, DIMMs or the memory riser card.
  • A problem was fixed for the error handling of a system with an unsupported memory configuration that exceeds available memory power. Without the fix, the IPL of the system is attempted and fails with a segmentation fault with SRCs B1818611 and B181460B logged that do not call out the incorrect DIMMs.
  • A problem was fixed for the Self Boot Engine (SBE) going to termination with an SRC B150BA8D logged when booting on a bad core. Once this happens, this error will persist as the bad core is not deconfigured. To recover from this error and be able to IPL, the bad core must be manually deconfigured.   With the fix, the failing core is deconfigured and the SBE is reconfigured to use another core so the system is able to IPL.
  • A problem was fixed for certain SR-IOV adapters that have a rare, intermittent error with B400FF02 and B400FF04 logged, causing a reboot of the VF. The error is handled and recovered without any user intervention needed. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CE; and #EC66/#EC67 with CCIN 2CF3.
  • A problem was fixed for Live Partition Mobility (LPM) being shown as enabled at the OS when it has been disabled by the ASMI command line using the service processor command "cfcuod -LPM OFF".  LPM is actually disabled and the status shows correctly on the HMC.  The status on the OS can be ignored (for example, as shown by the AIX command "lparstat -L") as LPM will not be allowed to run when it is disabled.
  • A problem was fixed so that SRC B7006A99, previously an informational log, is now posted as a Predictive error with a callout of the CXP cable FRU.  This fix improves FRU isolation for cases where a CXP cable alert causes a B7006A99 that occurs prior to a B7006A22 or B7006A8B.  Without the fix, the SRC B7006A99 is informational and the latter SRCs cause a larger hardware replacement even though the earlier event identified the cable FRU as a probable cause.

System firmware changes that affect certain systems

  • On systems with an uncapped shared processor partition in POWER9 processor compatibility mode, a problem was fixed for a system hang following Dynamic Platform Optimization (DPO), memory mirroring defragmentation, or memory guarding that happens as part of memory error recovery during normal operations of the system.
  • On systems with a partition using Virtual Persistent Memory (vPMEM) LUNS configured with a 16 MB MPSS (Multiple Page Segment Size) mapping, a problem was fixed for temporary system hangs. The temporary hang may occur while the memory is involved in memory operations such as Dynamic Platform Optimization (DPO), memory mirroring defragmentation, or memory guarding that happens as part of memory error recovery during normal operations of the system.
  • On systems with partitions having user mode enabled for the External Interrupt Virtualization Engine (XIVE), a problem was fixed for a possible system crash and HMC "Incomplete" state when a force DLPAR remove of a PCIe adapter occurs after a dynamic LPAR (DLPAR) operation fails for that same PCIe adapter.
VL940_061_027 / FW940.20

09/24/20

Impact: Data       Severity:  HIPER

New features and functions

  • DEFERRED: Host firmware support for anti-rollback protection.  This feature implements firmware anti-rollback protection as described in NIST SP 800-147B "BIOS Protection Guidelines for Servers".  Firmware is signed with a "secure version".  Support added for a new menu in ASMI called "Host firmware security policy" to update this secure version level at the processor hardware.  Using this menu, the system administrator can enable the "Host firmware secure version lock-in" policy, which will cause the host firmware to update the "minimum secure version" to match the currently running firmware. Use the "Firmware Update Policy" menu in ASMI to show the current "minimum secure version" in the processor hardware along with the "Minimum code level supported" information. The secure boot verification process will block installing any firmware secure version that is less than the "minimum secure version" maintained in the processor hardware.
    Prior to enabling the "lock-in" policy, it is recommended to accept the current firmware level.
    WARNING: Once lock-in is enabled and the system is booted, the "minimum secure version" is updated and there is no way to roll it back to allow installing firmware releases with a lesser secure version.

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed for certain SR-IOV adapters for a condition that may result from frequent resets of adapter Virtual Functions (VFs), or transmission stalls and could lead to potential undetected data corruption.
    The following additional fixes are also included:
    1) The VNIC backing device goes to a powered off state during a VNIC failover or Live Partition Mobility (LPM) migration.  This failure is intermittent and very infrequent.
    2) Adapter time-outs with SRC B400FF01 or B400FF02 logged.
    3) Adapter time-outs related to adapter commands becoming blocked  with SRC B400FF01 or B400FF02 logged.
    4) VF function resets occasionally not completing quickly enough resulting in SRC B400FF02 logged.
    This fix updates the adapter firmware to 11.4.415.33 for the following Feature Codes and CCINs:  #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for the REST/Redfish interface to change the success return code for object creation from "200" to "201".  The "200" status code means that the request was received and understood and is being processed.  A "201" status code indicates that a request was successful and, as a result, a resource has been created.  The Redfish Ruby Client, "redfish_client" may fail a transaction if a "200" status code is returned when "201" is expected.
  • A problem was fixed to allow quicker recovery of PCIe links for the #EMXO PCIe expansion drawer for a run-time fault with B7006A22 logged.  The time for recovery attempts can exceed six minutes on rare occasions which may cause I/O adapter failures and failed nodes.  With the fix, the PCIe links will recover or fail faster (in the order of seconds) so that redundancy in a cluster configuration can be used with failure detection and failover processing by other hosts, if available, in the case where the PCIe links fail to recover.
  • A problem was fixed for a concurrent maintenance "Repair and Verify" (R&V) operation for a #EMX0 fanout module that fails with an "Unable to isolate the resource" error message.  This should occur only infrequently for cases where a physical hardware failure has occurred which prevents access to slot power controls.  This problem can be worked around by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC after the hardware failure but before the concurrent repair is attempted.  This will avoid the problem with the PCIe slot isolation.  These steps can also be used to recover from the error to allow the R&V repair to be attempted again.
  • A problem was fixed for a rare system hang that can occur when a page of memory is being migrated.  Page migration (memory relocation) can occur for a variety of reasons, including predictive memory failure, DLPAR of memory, and normal operations related to managing the page pool resources.
  • A problem was fixed for utilization statistics for commands such as HMC lslparutil and third-party lpar2rrd that do not accurately represent CPU utilization. The values are incorrect every time for a partition that is migrated with Live Partition Mobility (LPM).  Power Enterprise Pools 2.0 is not affected by this problem.  If this problem has occurred, here are three possible recovery options:
    1) Re-IPL the target system of the migration.
    2) Or delete and recreate the partition on the target system.
    3) Or perform an inactive migration of the partition.  The cycle values get zeroed in this case.
  • A problem was fixed for running PCM on a system with SR-IOV adapters in shared mode that results in an "Incomplete" system state with certain hypervisor tasks deadlocked. This problem is rare and is triggered when using SR-IOV adapters in shared mode and gathering performance statistics with PCM (Performance Collection and Monitoring) and also having a low-level error on an adapter.  The only way to recover from this condition is to re-IPL the system.
  • A problem was fixed for an enhanced PCIe expansion drawer FPGA reset causing EEH events from the fanout module or cable cards that disrupt the PCIe lanes for the PCIe adapters.  This problem affects systems with the PCIe expansion drawer enhanced fanout module (#EMXH) and the enhanced cable card (#EJ1R or #EJ20). The error is associated with the following SRCs being logged:
    B7006A8D with PRC 37414123 (XmPrc::XmCCErrMgrBearPawPrime | XmPrc::LocalFpgaHwReset)
    B7006A8E with PRC 3741412A (XmPrc::XmCCErrMgrBearPawPrime | XmPrc::RemoteFpgaHwReset)
    If the EEH errors occur, the OS device drivers automatically recover but with a reset of affected PCIe adapters that would cause a brief interruption in the I/O communications.
  • A problem was fixed for a PCIe3 expansion drawer cable that has hidden error logs for a single lane failure.  This happens whenever a single lane error occurs. Subsequent lane failures are not hidden and have visible error logs. Without the fix, the hidden or informational logs would need to be examined to gather more information for the failing hardware.
  • A problem was fixed for an infrequent issue after a Live Partition Mobility (LPM) operation from a POWER9 system to a POWER8 or POWER7 system.  The issue may cause unexpected OS behavior, which may include loss of interrupts, device time-outs, or delays in dispatching.  Rebooting the affected target partition will resolve the problem.
  • A problem was fixed for a partition crash or hang following a partition activation or a DLPAR add of a virtual processor. For partition activation, this issue is only possible for a system with a single partition owning all resources. For DLPAR add, the issue is extremely rare.
  • A problem was fixed for a DLPAR remove of memory from a partition that fails if the partition contains 65535 or more LMBs. With 16MB LMBs, this error threshold is 1 TB of memory.  With 256 MB LMBs, it is 16 TB of memory.  A reboot of the partition after the DLPAR will remove the memory from the partition.
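The memory sizes quoted for the 65535-LMB threshold follow directly from the configured LMB size; a minimal sketch of the arithmetic (the threshold value is taken from the fix description above):

```python
# Sketch of the arithmetic behind the DLPAR memory-remove limit described
# above: the failure threshold is 65,535 LMBs, so the memory size at which
# a partition hits it depends on the configured LMB size.
LMB_THRESHOLD = 65535  # LMB count at which the DLPAR remove fails

for lmb_size_mb in (16, 256):
    threshold_tb = LMB_THRESHOLD * lmb_size_mb / (1024 * 1024)
    print(f"{lmb_size_mb} MB LMBs -> threshold at ~{threshold_tb:.2f} TB")
```

This reproduces the 1 TB (16 MB LMBs) and 16 TB (256 MB LMBs) figures in the fix text, to within one LMB.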
  • A problem was fixed for an IPL failure with SRC BA180020 logged for an initialization failure on a PCIe adapter in a PCIe3 expansion drawer.  The PCIe adapters that are intermittently failing on the PCIe probe are the PCIe2 4-port Fibre Channel Adapter with feature code #5729 and the PCIe2 4-port 1 Gb Ethernet Adapter with feature code #5899.  The failure can only occur on an IPL or re-IPL and it is very infrequent.  The system can be recovered with a re-IPL.
  • A problem was fixed for a partition configured with a large number (approximately 64) of Virtual Persistent Memory (PMEM) LUNs hanging during the partition activation with a CA00E134 checkpoint SRC posted.  Partitions configured with approximately 64 PMEM LUNs will likely hang and the greater the number of LUNs, the greater the possibility of the hang.  The circumvention to this problem is to reduce the number of PMEM LUNs to 64 or less in order to boot successfully.  The PMEM LUNs are also known as persistent memory volumes and can be managed using the HMC.  For more information on this topic, refer to https://www.ibm.com/support/knowledgecenter/POWER9/p9efd/p9efd_lpar_pmem_settings.htm.
  • A problem was fixed for non-optimal On-Chip Controller (OCC) processor frequency adjustments when system power limits or user power caps are exceeded.  When a workload causes power limits or caps to be exceeded, there can be large frequency swings for the processors and a processor chip can get stuck at minimum frequency.  With the fix, the OCC now waits for new power readings when changing the processor frequency and uses a master power capping frequency to keep all processors at the same frequency.  As a workaround for this problem, do not set a power cap or run a workload that would exceed the system power limit.
  • A problem was fixed for PCIe resources under a deconfigured PCIe Host Bridge (PHB) being shown on the OS host as available resources when they should be shown as deconfigured. While this fix can be applied concurrently, a re-IPL of the system is needed to correct the state of the PCIe resources if a PHB had already been deconfigured.
  • A problem was fixed for incorrect run-time deconfiguration of a processor core with SRC B700F10B. This problem can be circumvented by a reconfiguration of the processor core but this should only be done with the guidance of IBM Support to ensure the core is good.
  • A problem was fixed for certain SR-IOV adapter errors where a B400F011 is reported instead of a more descriptive B400FF02 or B400FF04.  The LPA dump still happens, which can be used to isolate the issue.  The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CEC; and #EC66/#EC67 with CCIN 2CF3.
  • A problem was fixed for mixing modes on the ports of SR-IOV adapters that causes SRCs B200A161, B200F011, B2009014, and B400F104 to be logged on boot of the failed adapter. This error happens when one port of the adapter is changed to option 1 with a second port set at either option 0 or option 2.  The error can be cleared by taking the adapter out of SR-IOV shared mode. The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CEC; and #EC66/#EC67 with CCIN 2CF3.
  • A problem was fixed for certain SR-IOV adapters with the following issues:
    1) The VNIC backing device goes to a powered off state during a VNIC failover or Live Partition Mobility (LPM) migration.  This failure is intermittent and very infrequent.
    2) Adapter time-outs with SRC B400FF01 or B400FF02 logged.
    3) Adapter time-outs related to adapter commands becoming blocked, with SRC B400FF01 or B400FF02 logged.
    4) VF resets occasionally not completing quickly enough, resulting in SRC B400FF02 logged.
    This fix updates the adapter firmware to 11.4.415.33 for the following Feature Codes and CCINs:  #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for NovaLink-created virtual ethernet and vNIC adapters having incorrect SR-IOV Hybrid Network Virtualization (HNV) values. The AIX and other OS hosts may be unable to use the adapters. This happens for all virtual ethernet and vNIC adapters created by NovaLink in the FW940 releases up to the FW940.10 service pack. The fix corrects the settings for new NovaLink-created virtual adapters, but any pre-existing virtual adapters created by NovaLink in FW940 must be deleted and recreated.
  • A problem was fixed for partitions configured to run as AIX, VIOS, or Linux that also own specific Fibre Channel (FC) I/O adapters (see below); such partitions are subject to a crash during boot if the partition does not already have a boot list. During the initial boot of a new partition (containing 577F, 578E, 578F or 579B adapters), the boot might fail with one of the following reference codes: BA210001, BA218001, BA210003, or BA218003.  This most often occurs on deployments of new partitions that are booting for the first time, for either a network install or booting to the Open Firmware prompt or SMS menus.  The issue requires that the partition owns one or more of the following FC adapters and that these adapters are running microcode firmware levels older than version 11.4.415.5:
    - Feature codes #EN1C/#EN1D and #EL5X/#EL5W with CCIN 578E
    - Feature codes #EN1A/#EN1B and #EL5U/#EL5V with CCIN 578F
    - Feature codes #EN0A/#EN0B and #EL5B/#EL43 with CCIN 577F
    The problem is somewhat rare because it requires all of the following:
    - Partition does not already have a default boot list
    - Partition configured with one of the FC adapters listed above
    - The FC adapters must be running unsigned/unsecured adapter microcode
    The following work around was created for systems having this issue: https://www.ibm.com/support/pages/node/1367103.
    With the fix, the FC adapters are given a temporary substitute for the FCode on the adapter, but not the entire microcode image; the adapter microcode itself is not updated.  This substitution allows the system to boot from the adapter until the adapter can be updated by the customer with the latest available microcode from IBM Fix Central.  In the meantime, the FCode substitution is made from the 12.4.257.15 level of the microcode.
  • A problem was fixed for mixing memory DIMMs with different timings (different vendors) under the same memory controller that fail with an SRC BC20E504 error and DIMMs deconfigured. This is an "MCBIST_BRODCAST_OUT_OF_SYNC" error.  The loss of memory DIMMs can result in an IPL failure.  This problem can happen if the memory DIMMs have a certain level of timing differences.  If the timings are not compatible, the failure occurs during the memory training phase of the IPL. To circumvent this problem, plug only memory DIMMs from the same vendor under each memory controller.
  • A problem was fixed for the SR-IOV logical port of an  I/O adapter logging a B400FF02 error because of a time-out waiting on a response from the firmware. This rare error requires a very heavily loaded system. For this error, word 8 of the error log is 80090027. No user intervention is needed for this error as the logical port recovers and continues with normal operations.
  • A problem was fixed for a security vulnerability for the Self Boot Engine (SBE).  The SBE can be compromised from the service processor to allow injection of malicious code. An attacker that gains root access to the service processor could compromise the integrity of the host firmware and bypass the host firmware signature verification process. This compromised state cannot be detected through TPM attestation.  This is Common Vulnerabilities and Exposures issue number CVE-2021-20487.

System firmware changes that affect certain systems

  • On systems with an IBM i partition, a problem was fixed for a dedicated memory IBM i partition running in P9 processor compatibility mode failing to activate with HSCL1552 "the firmware operation failed with extended error".  This failure only occurs under a very specific scenario - the new amount of desired memory is less than the current desired memory, and the Hardware Page Table (HPT) size needs to grow.
  • On systems with AIX and Linux partitions, a problem was fixed for AIX and Linux partitions that crash or hang when reporting any of the following Partition Firmware RTAS ASSERT rare conditions:
    1) SRC BA33xxxx errors - Memory allocation and management errors.
    2) SRC BA29xxxx errors - Partition Firmware internal stack errors.
    3) SRC BA00E8xx errors - Partition Firmware initialization errors during concurrent firmware update or Live Partition Mobility (LPM) operations.
    This problem should be very rare.  If the problem does occur, a partition reboot is needed to recover from the error.
VL940_050_027 / FW940.10

05/22/20

Impact: Availability         Severity:  SPE

New features and functions

  • DEFERRED:  Maximum Performance mode was enhanced to increase the cap on the maximum processor frequency for some workloads, providing for better performance.  As the workload or active core count decrease, the processor uses less power, which enables the frequency to be increased above nominal.  In the Maximum Performance mode, the allowed socket power is increased to the maximum value, which results in top performance but with increased fan noise and higher power consumption.
  • Enable periodic logging of internal component operational data for the PCIe3 expansion drawer paths. The logging of this data does not impact the normal use of the system.
  • Support added for SR-IOV Hybrid Network Virtualization (HNV) in a production environment (no longer a Technology Preview) for AIX and IBM i.   This capability allows an AIX or IBM i partition to take advantage of the efficiency and performance benefits of SR-IOV logical ports and participate in mobility operations such as active and inactive Live Partition Mobility (LPM) and Simplified Remote Restart (SRR).  HNV is enabled by selecting a new Migratable option when an SR-IOV logical port is configured. The Migratable option is used to create a backup virtual device.  The backup virtual device can be either a Virtual Ethernet adapter or a virtual Network Interface Controller (vNIC) adapter. In addition to this firmware HNV support in a production environment requires HMC 9.1.941.0 or later, AIX Version 7.2 with the 7200-04 Technology Level and Service Pack 7200-04-02-2015 or AIX Version 7.1 with the 7100-05 Technology Level and Service Pack 7100-05-06-2015, IBM i 7.3 TR8 or IBM i 7.4 TR2, and VIOS 3.1.1.20.

System firmware changes that affect all systems

  • DEFERRED:  A problem was fixed for a processor core failure with SRCs B150BA3C and BC8A090F logged that deconfigures the entire processor for the current IPL.  A re-IPL of the system will recover the lost processor with only the bad core guarded.
  • A problem was fixed for Performance Monitor Unit (PMU) events that had the incorrect Alink address (Xlink data given instead) that could be seen in 24x7 performance reports.  The Alink event data is a recent addition for FW940 and would not have been seen at the earlier firmware levels.
  • A problem was fixed for an SR-IOV adapter hang with B400FF02/B400FF04 errors logged during firmware update or error recovery.  The adapter may recover after the error log and dump, but it is possible the adapter VF will remain disabled until the partition using it is rebooted.  This affects the SR-IOV adapters with the following feature codes and CCINs:   #EC2R/EC2S with CCIN 58FA;  #EC2T/EC2U with CCIN 58FB;  #EC3L/EC3M with CCIN 2CEC;  and #EC66/EC67 with CCIN 2CF3.
  • A problem was fixed for the location code of the Removable EXchange (RDX) docking station being incorrectly reported as P1-P3.  The correct location code is Un-P3. This problem pertains only to the S914 (9009-41A), S924 (9009-42A) and the H924 (9223-42H) models.  Please refer to the following IBM Knowledge Center article for more information on the location codes:  https://www.ibm.com/support/knowledgecenter/9009-42A/p9ecs/p9ecs_914_924_loccodes.htm
  • A problem was fixed for extraneous B400FF01 and B400FF02 SRCs logged when moving cables on SR-IOV adapters.  This is an infrequent error that can occur if the HMC performance monitor is running at the same time the cables are moved.  These SRCs can be ignored when accompanied by cable movement.
  • A problem was fixed for certain SR-IOV adapters that can have B400FF02 SRCs logged with LPA dumps during Live Partition Mobility (LPM) migrations or vNIC failovers.  The adapters can have issues with a deadlock on error starts after many resets of the VF and errors in managing memory pages.  In most cases, the operations should recover and complete.  This fix updates the adapter firmware to 1X.25.6100 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed where SR-IOV adapter VFs occasionally failed to provision successfully on the low-speed ports (1 Gbps) with SRC B400FF04 logged, or SR-IOV adapter VFs occasionally failed to provision successfully with SRC B400FF04 logged when the RoCE option is enabled.
    This affects the adapters with low-speed ports (1 Gbps) with the following Feature Codes and CCINs:  #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, #EN0K/EN0L with CCIN 2CC1, #EL56/EL38 with CCIN 2B93, and #EL57/EL3C with CCIN 2CC1. And it affects the adapters with the RoCE option enabled with the following Feature Codes and CCINs:  #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
  • A problem was fixed for an expired trial or elastic Capacity on Demand (CoD) memory not warning of the use of unlicensed memory if the memory is not returned. This lack of warning can occur if the trial memory has been allocated as Virtual Persistent Memory (vPMEM).
  • A problem was fixed for a B7006A96 fanout module FPGA corruption error that can occur in unsupported PCIe3 expansion drawer (#EMX0) configurations that mix an enhanced PCIe3 fanout module (#EMXH) in the same drawer with legacy PCIe3 fanout modules (#EMXF, #EMXG, #ELMF, or #ELMG).  This causes the FPGA on the enhanced #EMXH to be updated with the legacy firmware, and it becomes a non-working and unusable fanout module.  With the fix, the unsupported #EMX0 configurations are detected and handled gracefully without harm to the FPGA on the enhanced fanout modules.
  • A problem was fixed for possible dispatching delays for partitions running in POWER8, POWER9_base or POWER9 processor compatibility mode.
  • A problem was fixed for system memory not returned after create and delete of partitions, resulting in slightly less memory available after configuration changes in the systems.  With the fix, an IPL of the system will recover any of the memory that was orphaned by the issue.
  • A problem was fixed for failover support for the Mover Service Partition (MSP) where a failover to the MSP partner during an LPM could cause the migration to abort.   This vulnerability is only for a very specific window in the migration process. The recovery is to restart the migration operation.
  • A rare problem was fixed for a checkstop during an IPL that fails to isolate and guard the problem core.  An SRC is logged with B1xxE5xx and an extended hex word 8 xxxxDD90.  With the fix, the failing hardware is guarded and a node is possibly deconfigured to allow the subsequent IPLs of the system to be successful.
  • A problem was fixed for a hypervisor error during system shutdown where a B7000602 SRC is logged and the system may also briefly go "Incomplete" on the HMC but the shutdown is successful.  The system will power back on with no problems so the SRC can be ignored if it occurred during a shutdown.
  • A problem was fixed for certain large I/O adapter configurations having the PCI link information truncated on the PCI-E topology display shown with ASMI and the HMC.  Because of the truncation, individual adapters may be missing on the PCI-E topology screens.
  • A problem was fixed for certain NVRAM corruptions causing a system crash with a bad pointer reference instead of expected Terminate Immediate (TI) with B7005960 logged.
  • A problem was fixed for certain SR-IOV adapters that do not support the "Disable Logical Port" option from the HMC but the HMC was allowing the user to select this, causing incorrect operation.  The invalid state of the logical port causes an "Enable Logical Port" to fail in a subsequent operation.  With the fix, the HMC provides the message that the "Disable Logical Port" is not supported for the adapter.  This affects the adapters with the following  Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3,  #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0,  #EN0K/EN0L with CCIN 2CC1, #EL56/EL38 with CCIN 2B93, and #EL57/EL3C with CCIN 2CC1.
  • A problem was fixed for the service processor ASMI "Factory Reset" option to disable the IPMI service as part of the factory reset.  Without the fix, the IPMI operation state will be unchanged by the factory reset.
  • A problem was fixed to remove unneeded resets of a VF for SR-IOV adapters, providing for improved performance of the startup or recovery time of the VF.  This performance difference may be noticed during a Live Partition Mobility migration of a partition or during vNIC (Virtual Network Interface Controller) failovers where many resets of VFs are occurring.
  • A problem was fixed for SR-IOV adapters having an SRC B400FF04 logged when a VF is reset.  This is an infrequent issue and can occur for a Live Partition Mobility migration of a partition or during vNIC (Virtual Network Interface Controller) failovers where many resets of VFs are occurring.  This error is recovered automatically with no impact on the system.
  • A problem was fixed for initial configuration of SR-IOV adapter VFs with certain configuration settings for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
    These VFs may then fail following an adapter restart, with other VFs functioning normally.  The error causes the VF to fail with an SRC B400FF04 logged.  With the fix, VFs are configured correctly when created.  
    Because the error condition may pre-exist in an incorrectly configured logical port, a concurrent update of this fix may trigger a logical port failure when the VF logical port is restarted during the firmware update.  Existing VFs with the failure condition can be recovered by dynamically removing/adding the failed port and are automatically recovered during a system restart.
  • A problem was fixed for TPM hardware failures not causing SRCs to be logged with a callout if the system is configured in ASMI to not require TPM for the IPL.  If this error occurs, the user would not find out about it until they needed to run with TPM on the IPL.  With the fix, the error logs and notifications occur regardless of how the TPM is configured.

System firmware changes that affect certain systems

  • On systems with an IBM i partition, a problem was fixed for a D-mode IPL failure when using a USB DVD drive in an IBM 7226 multimedia storage enclosure.  Error logs with SRC BA16010E, B2003110, and/or B200308C can occur.  As a circumvention, an external DVD drive can be used for the D-mode IPL.
  • On systems with an IBM i partition, a problem was fixed that occurs after a Live Partition Mobility (LPM) of an IBM i partition that may cause issues including dispatching delays and the inability to do further LPM operations of that partition.  The frequency of this problem is rare.  A partition encountering this error can be recovered with a reboot of the partition.
  • On systems with an IBM i partition in POWER9 processor compatibility mode, a problem was fixed for an MSD in IBM i with SRCs B6000105 or B6000305 logged when a PCIe Host Bridge (PHB) or PCIe Expansion Drawer (#EMX0) is added to the partition.  For this to occur, the adapter had to be previously assigned to a partition (any OS) that was in POWER9 processor compatibility mode and then removed through a DLPAR or partition shut down such that the adapter is taken through recovery.
  • For systems with deconfigured cores and using the default performance and power setting of "Dynamic Performance Mode" or "Maximum Performance Mode", a rare problem was fixed for an incorrect voltage/frequency setting for the processors during heavy workloads with high ambient temperature.  This error could impact power usage, expected performance, or system availability if a processor fault occurs.  This problem can be avoided by using ASMI "Power and Performance Mode Setup"  to disable "All modes" when there are cores deconfigured in the system.
VL940_041_027 / FW940.02

02/18/20

Impact: Function         Severity:  HIPER

System firmware changes that affect all systems

  • A problem was fixed for an HMC "Incomplete" state for a system after the HMC user password is changed with ASMI on the service processor.  This problem can occur if the HMC password is changed on the service processor but not also on the HMC, and a reset of the service processor happens.  With the fix, the HMC will get the needed "failed authentication" error so that the user knows to update the old password on the HMC.

System firmware changes that affect certain systems

  • HIPER/Pervasive:   For systems using PowerVM NovaLink to manage partitions, a problem was fixed for the hypervisor rejecting setting the system to be NovaLink managed.  The following error message is given:   "FATAL pvm_apd[]: Hypervisor encountered an error creating the ibmvmc device. Error number 5."  This always happens in FW940.00 and FW940.01 which prevents a system from transitioning to be NovaLink managed at these firmware levels.  If you were successfully running as NovaLink managed already on FW930 and upgraded to FW940, you would not experience this issue. 
    For more information on PowerVM NovaLink, refer to the IBM Knowledge Center article:  https://www.ibm.com/support/knowledgecenter/POWER9/p9eig/p9eig_kickoff.htm.
VL940_034_027 / FW940.01

01/09/20

Impact:  Security      Severity:  SPE

New features and functions

  • Support was added for improved security for the service processor password policy.  For the service processor, the "admin", "hmc" and "general" passwords must be set on first use for newly manufactured systems and after a factory reset of the system.  The IPMI interface has been changed to be disabled by default in these scenarios.  The REST/Redfish interface will return an error saying the user account is expired.  This policy change helps ensure that the service processor is not left in a state with a well-known password.  The user can change from an expired default password to a new password using the Advanced System Management Interface (ASMI).
  • Support was added for real-time data capture for PCIe3 expansion drawer (#EMX0) cable card connection data via resource dump selector on the HMC or in ASMI on the service processor.  Using the resource selector string of "xmfr -dumpccdata" will non-disruptively generate an RSCDUMP type of dump file that has the current cable card data, including data from cables and the retimers.
  • Improvements to link stack algorithms.

System firmware changes that affect all systems

  • A problem was fixed for an intermittent IPMI core dump on the service processor.  This occurs only rarely when multiple IPMI sessions are starting and cleaning up at the same time.  A new IPMI session can fail initialization when one of its session objects is cleaned up.  The circumvention is to retry the IPMI command that failed.
  • A problem was fixed for system hangs or incomplete states displayed by HMC(s) with SRC B182951C logged.  The hang can occur during operations that require a memory relocation for any partition such as Dynamic Platform Optimization (DPO), memory mirroring defragmentation, or memory guarding that happens as part of memory error recovery during normal operations of the system.
  • A problem was fixed for possible unexpected interrupt behavior for partitions running in POWER9 processor compatibility mode.  This issue can occur during the boot of a partition running in POWER9 processor compatibility mode with an OS level that supports the External Interrupt Virtualization Engine (XIVE) exploitation mode.  For more information on compatibility modes, see the following two articles in the IBM Knowledge Center:
    1) Processor compatibility mode overview:   https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcm.htm
    2) Processor compatibility mode definitions:  https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcmdefs.htm
  • A problem was fixed for an intermittent IPL failure with SRC B181E540 logged with fault signature " ex(n2p1c0) (L2FIR[13]) NCU Powerbus data timeout".  No FRU is called out.  The error may be ignored and the re-IPL is successful.  The error occurs very infrequently.  This is the second iteration of the fix that has been released.  Expedient routing of the Powerbus interrupts did not occur in all cases in the prior fix, so the timeout problem was still occurring.

System firmware changes that affect certain systems

  • DEFERRED:  For systems using the Feature Code #EPIM 8-core processor, a problem was fixed for a slightly degraded UltraTurbo maximum frequency (approximately 3% less) compared to what is expected for this processor chip.  The fix requires new Workload Optimized Frequency (WOF) tables for the processor, so the system must be re-IPLed for the installed fix to be active.  The WOF UltraTurbo maximum frequency can only be achieved when turning cores off or operating below the 50% workload capacity of the system.
  • On systems running IBM i partitions configured as Restricted I/O partitions that are also running in either P7 or P8 processor compatibility mode, a problem was fixed for a likely hang during boot with BA210000 and BA218000 checkpoints and error logs after migrating to FW940.00 level system firmware.  The trigger for the problem is booting IBM i partitions configured as Restricted I/O partitions in P7 or P8 compatibility mode on FW940.00 system firmware.  Such partitions are usually configured this way so that they can be used for live partition migration (LPM) to and from P7/P8 systems.  Without the fix, the user can do either of the following as circumventions for the boot failure of the IBM i partition:
    1) Move the partition to P9 compatibility mode
    2) Or remove the 'Restricted I/O Partition' property
VL940_027_027 / FW940.00

11/25/19

Impact:  New      Severity:  New

GA Level with key features included listed below

  • All features and fixes from the FW930.11 service pack (and below) are included in this release.  At the time of the FW940.00 release, FW930.11 is a future FW930 service pack scheduled for the fourth quarter of 2019.

New Features and Functions

  • User Mode NX Accelerator Enablement for PowerVM.  This enables the access of NX accelerators such as the gzip engine through user mode interfaces.  The IBM Virtual HMC (vHMC) 9.1.940 provides a user interface to this feature.  The LPAR must be running in POWER9 compatibility mode to use this feature.  For more information on compatibility modes, see the following two articles in the IBM Knowledge Center:
    1) Processor compatibility mode overview:    https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcm.htm
    2) Processor compatibility mode definitions:  https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcmdefs.htm
  • Support for SR-IOV logical ports in IBM i restricted I/O mode.
  • Support for user mode enablement of the External Interrupt Virtualization Engine (XIVE).  This user mode enables the management of interrupts to move from the hypervisor to the operating system for improved efficiency.  Operating systems may also have to be updated to enable this support.  The LPAR must be running in POWER9 compatibility mode to use this feature.  For more information on compatibility modes, see the following two articles in the IBM Knowledge Center:
    1) Processor compatibility mode overview:    https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcm.htm
    2) Processor compatibility mode definitions:  https://www.ibm.com/support/knowledgecenter/POWER9/p9hc3/p9hc3_pcmdefs.htm
  • Extended support for PowerVM Firmware Secure Boot.  This feature restricts access to the Open Firmware prompt and validates all adapter boot driver code. Boot adapters, or adapters which may be used as boot adapters in the future, must be updated to the latest microcode from IBM Fix Central.  The latest microcode will ensure the adapters support the Firmware Secure Boot feature of Power Systems. This requirement applies when updating system firmware from a level prior to FW940 to levels FW940 and later.  The latest adapter microcode levels include signed boot driver code.  If a boot-capable PCI adapter is not installed with the latest level of adapter microcode, the partition which owns the adapter will boot, but error logs with SRCs BA5400A5 or BA5400A6 will be posted. Once the adapter(s) are updated, the error logs will no longer be posted.
  • Linux OS support was added for PowerVM LPARs for the PCIe4 2x100GbE ConnectX-5 RoCE adapter with feature codes of #EC66/EC67 and CCIN 2CF3.  Linux versions RHEL 7.5 and SLES 12.3 are supported.
  • This server firmware level includes the SR-IOV adapter firmware level 11.4.415.28  for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3,  #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0,  #EN0K/EN0L with CCIN 2CC1, #EL56/EL38 with CCIN 2B93, and #EL57/EL3C with CCIN 2CC1.
  • This server firmware includes the SR-IOV adapter firmware level 1x.25.6000 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA;  #EC2T/EC2U with CCIN 58FB;  #EC3L/EC3M with CCIN 2CEC;  and #EC66/EC67 with CCIN 2CF3.

System firmware changes that affect all systems

  • A problem was fixed for incorrect call outs for PowerVM hypervisor terminations with SRC B7000103 logged.  With the fix, the call outs are changed from SVCDOCS, FSPSP04, and FSPSP06 to FSPSP16.  When this type of termination occurs, IBM support requires the dumps be collected to determine the cause of failure.
  • A problem was fixed for an IPL failure with the following possible SRCs logged:  11007611, 110076x1, 1100D00C, and 110015xx.  The service processor may reset/reload for this intermittent error and end up in the termination state.

VL930

VL930
For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url:
https://www.ibm.com/support/pages/node/6555136
VL930_145_040 / FW930.50

9/17/21

Impact: Data       Severity: HIPER

New features and functions

  • Support was changed to disable Service Location Protocol (SLP) by default for newly shipped systems or systems that are reset to manufacturing defaults.  This change has been made to reduce memory usage on the service processor by disabling a service that is not needed for normal system operations.  Existing customers can make this change manually by using "ASMI -> System Configuration -> Security -> External Services Management" to disable the service.

System firmware changes that affect all systems 

  • HIPER: A problem was fixed which may occur on a target system following a Live Partition Mobility (LPM) migration of an AIX partition utilizing Active Memory Expansion (AME) with 64 KB page size enabled using the vmo tunable: "vmo -ro ame_mpsize_support=1".  The problem may result in AIX termination, file system corruption, application segmentation faults, or undetected data corruption.
    Note:  If you are doing an LPM migration of an AIX partition utilizing AME with 64 KB page size enabled that involves a POWER8 or POWER9 system, ensure you have a Service Pack including this change for the appropriate firmware level on both the source and target systems.
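    As a pre-migration check, the tunable state can be read on the AIX partition with "vmo -o ame_mpsize_support".  A minimal sketch of such a check is shown below; since vmo is only available on AIX, it parses a captured sample of that command's output (output format assumed for illustration):

```shell
# On a live AIX partition this would be captured with:
#   sample_output=$(vmo -o ame_mpsize_support)
# Here a sample line stands in for that output (format assumed).
sample_output="ame_mpsize_support = 1"

# Extract the value after the "=" sign.
value=$(echo "$sample_output" | awk -F' = ' '{print $2}')

if [ "$value" = "1" ]; then
    echo "AME 64 KB page size support is enabled: verify the fix level on both LPM systems"
else
    echo "AME 64 KB page size support is disabled"
fi
```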
  • A problem was fixed for a missing hardware callout and guard for a processor chip failure with SRC BC8AE540 and signature "ex(n0p0c5) (L3FIR[28]) L3 LRU array parity error".
  • A problem was fixed for a missing hardware callout and guard for a processor chip failure with Predictive Error (PE) SRC BC70E540 and signature "ex(n1p2c6) (L2FIR[19]) Rc or NCU Pb data CE error".  The PE error occurs after the number of CE errors reaches a threshold of 32 errors per day.
  • A problem was fixed for a rare failure for an SPCN I2C command sent to a PCIe I/O expansion drawer that can occur when service data is manually collected with hypervisor macros "xmsvc -dumpCCData and xmsvc -logCCErrBuffer".   If the hypervisor macro "xmsvc" is run to gather service data and a CMC Alert occurs at the same time that requires an SPCN command to clear the alert, then the I2C commands may be improperly serialized, resulting in an SPCN I2C command failure.  To prevent this problem, avoid using xmsvc -dumpCCData and xmsvc -logCCErrBuffer to collect service data until this fix is applied.
  • A problem was fixed for a system hang or terminate with SRC B700F105 logged during a Dynamic Platform Optimization (DPO) that is running with a partition in a failed state but that is not shut down.  If DPO attempts to relocate a dedicated processor from the failed partition, the problem may occur.  This problem can be avoided by doing a shutdown of any failed partitions before initiating DPO.
  • A problem was fixed for a system crash with HMC message HSCL025D and SRC B700F103 logged on a Live Partition Mobility (LPM) inactive migration attempt that fails.  The trigger for this problem is inactive migration that fails a compatibility check between the source and target systems.
  • A problem was fixed for a Live Partition Mobility (LPM) migration that failed with the error "HSCL3659 The partition migration has been stopped because orchestrator detected an error" on the HMC.  This intermittent and rare problem is triggered by the HMC being overrun with unneeded LPM message requests from the hypervisor, which can cause a timeout in HMC queries that results in the LPM operation being aborted.  The workaround is to retry the LPM migration, which will normally succeed.
  • A problem was fixed for a system becoming unresponsive when a processor goes into a tight loop condition with an SRC B17BE434, indicating that the service processor has lost communication with the hypervisor.  This problem is triggered by an SR-IOV shared mode adapter going through a recovery VF reset for an error condition, without releasing a critical lock.  If a later reset is then needed for the VF, the problem can occur.  The problem is infrequent because a combination of errors needs to occur in a specific sequence for the adapter.
  • A problem was fixed for a misleading SRC B7006A20 (Unsupported Hardware Configuration) that can occur for some error cases for PCIe #EMX0 expansion drawers that are connected with copper cables.  For cable unplug errors, the SRC B7006A88 (Drawer TrainError) should be shown instead of the B7006A20.   If a B7006A20 is logged against copper cables with the signature "Prc UnsupportedCableswithFewerChannels" and the message "NOT A 12CHANNEL CABLE", this error should instead follow the service actions for a B7006A88 SRC.
  • A problem was fixed where the Floating Point Unit Computational Test, which should be set to "staggered" by default, has been changed in some circumstances to be disabled. If you wish to re-enable this option, this fix is required.  After applying this service pack,  do the following steps:
    1) Sign in to the Advanced System Management Interface (ASMI).
    2) Select Floating Point Computational Unit under the System Configuration heading and change it from disabled to what is needed: staggered (run once per core each day) or periodic (a specified time).
    3) Click "Save Settings".
  • A problem was fixed for a system termination with SRC B700F107 following a time facility processor failure with SRC B700F10B.  With the fix, the transparent replacement of the failed processor will occur for the B700F10B if there is a free core, with no impact to the system.
  • A problem was fixed for an incorrect "Power Good fault" SRC logged for an #EMX0 PCIe3 expansion drawer on the lower CXP cable of B7006A85 (AOCABLE, PCICARD).  The correct SRC is B7006A86 (PCICARD, AOCABLE).
  • A problem was fixed for certain SR-IOV adapters that can have B400FF02 SRCs logged with LPA dumps during a vNIC remove operation. In most cases, the operations should recover and complete.  This fix updates the adapter firmware to XX.29.2003 for the following Feature Codes and CCINs: #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CE; and #EC66/EC67 with CCIN 2CF3.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for certain SR-IOV adapters not being able to create the maximum number of VLANs that are supported for a physical port. The SR-IOV adapters affected have the following Feature Codes and CCINs:  #EC66/#EC67 with CCIN 2CF3.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for an SR-IOV adapter in shared mode configured as Virtual Ethernet Port Aggregator (VEPA) where unmatched unicast packets were not forwarded to the promiscuous mode VF.  This is an updated and corrected version of a similar fix delivered in the FW940.40 service pack that had side effects of network issues such as ping failure or inability to establish TCP connections.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • The following problems were fixed for certain SR-IOV adapters:
    1) An error was fixed that occurs during a VNIC failover where the VNIC backing device has a physical port down due to an adapter internal error with an SRC B400FF02 logged.  This is an improved version of the fix delivered in the earlier FW930.40 service pack for adapter firmware 11.4.415.36, and it significantly reduces the frequency of the error.
    2) A problem was fixed for an adapter in SR-IOV shared mode which may cause a network interruption and SRCs B400FF02 and B400FF04 logged.  The problem occurs infrequently during normal network traffic.
    The fixes update the adapter firmware to 11.4.415.38 for the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    Update instructions:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for the Device Description in a System Plan related to Crypto Coprocessors and NVMe cards that were only showing the PCI vendor and device ID of the cards.  This is not enough information to verify which card is installed without looking up the PCI IDs first.  With the fix, more specific/useful information is displayed and this additional information does not have any adverse impact on sysplan operations.  The problem is seen every time a System Plan is created for an installed Crypto Coprocessor or NVMe card.
  • A problem was fixed for possible partition errors following a concurrent firmware update from FW910 or later. A precondition for this problem is that DLPAR operations of either physical or virtual I/O devices must have occurred prior to the firmware update.  The error can take the form of a partition crash at some point following the update. The frequency of this problem is low.  If the problem occurs, the OS will likely report a DSI (Data Storage Interrupt) error.  For example, AIX produces a DSI_PROC log entry.  If the partition does not crash, it is also possible that some subsequent I/O DLPAR operations will fail.
  • A problem was fixed for some serviceable events specific to the reporting of EEH errors not being displayed on the HMC.  The sending of an associated call home event, however, was not affected.  This problem is intermittent and infrequent.

System firmware changes that affect certain systems

  • On systems with an IBM i partition, a problem was fixed for physical I/O property data not being able to be collected for an inactive partition booted in "IOR" mode with SRC B200A101 logged.   This can happen when making a system plan (sysplan) for an IBM i partition using the HMC and the IBM i partition is inactive.  The sysplan data collection for the active IBM i partitions is successful.
  • For a system with a partition running AIX 7.3, a problem was fixed for running Live Update or Live Partition Mobility (LPM).  AIX 7.3 supports Virtual Persistent Memory (PMEM), which cannot be used with these operations; the problem made it appear that PMEM was configured when it was not.  The Live Update and LPM operations always fail when attempted on AIX 7.3.  Here is the failure output from a Live Update Preview:
    "1430-296 FAILED: not all devices are virtual devices.
    nvmem0
    1430-129 FAILED: The following loaded kernel extensions are not known to be safe for Live Update:
    nvmemdd
    ...
    1430-218 The live update preview failed.
    0503-125 geninstall:  The lvupdate call failed.
    Please see /var/adm/ras/liveupdate/logs/lvupdlog for details."
VL930_139_040 / FW930.41

5/25/21

Impact: Availability      Severity: HIPER

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed to be able to detect a failed PFET sensing circuit in a core at runtime, and prevent a system fail with an incomplete state when a core fails to wake up.  The failed core is detected on the subsequent IPL.  With the fix, a core is called out with the PFET failure with SRC BC13090F and hardware description "CME detected malfunctioning of PFET headers" to isolate the error better with a correct callout.
  • HIPER/Pervasive:  A problem was fixed for a system IPL failure when DIMMs (RDIMMs or NVDIMMs) have mixed configurations with dual populated memory channels and single populated memory channels. This problem occurs if there are dually populated memory channels that precede a single DIMM memory channel for a processor. This causes the IPL to fail with B150BA40 and BC8A090F logged with HwpReturnCode "RC_MSS_CALC_POWER_CURVE_NEGATIVE_OR_ZERO_SLOPE" and HWP Error description "Power curve slope equals 0 or is negative".  A workaround for this problem is to reconfigure the memory to have the single DIMM memory channels be in front of memory channels that have both DIMM slots occupied.
  • HIPER/Pervasive:  A problem was fixed for a checkstop due to an internal Bus transport parity error or a data timeout on the Bus. This is a very rare problem that requires a particular SMP transport link traffic pattern and timing.  Both the traffic pattern and timing are very difficult to achieve with customer application workloads. The fix will have no measurable effect on most customer workloads although highly intensive OLAP-like workloads may see up to 2.5% impact.
VL930_134_040 / FW930.40

3/10/21

Impact: Availability      Severity: SPE

New features and functions 

  • Added support in ASMI for a new panel to perform Self-Boot Engine (SBE) SEEPROM validation.  This validation can only be run at the service processor standby state.
    If the validation detects a problem, IBM recommends the system not be used and that IBM service be called.

System firmware changes that affect all systems

  • A problem was fixed for certain PCIe adapters and NVMe U.2 devices. The following feature codes are affected and the presence of these features in the 9009-41A server can result in higher fan speeds and therefore higher acoustic levels after upgrading the firmware (the other models are also affected but to a lesser extent).  Refer to the acoustic levels published in the IBM Knowledge Center currently located at https://www.ibm.com/support/knowledgecenter/9009-41A/p9had/p9had_90x.htm:
    #EC5J with CCIN 59B4;  #EC5K with CCIN 59B5; #EC5L with CCIN 59B6; #EC5X with CCIN 59B7; #EC7A/#EC7B with CCIN 594A; #EC7C/#EC7D with CCIN 594B; #EC7E/#EC7F with CCIN 594C; #EC7J/#EC7K  with CCIN 594A; #EC7L/#EC7M with CCIN 594B; #EC7N/#EC7P with CCIN 594C; #ES1E/#ES1F with CCIN 59B8; #ES1G/#ES1H with CCIN 59B9;  #EC5V/#EC5W with CCIN 59BA; #EC5G/#EC5B and #EC6U/#EC6V with CCIN 58FC;  #EC5W/#EC5D and #EC6W/#EC6X with CCIN 58FD; and #EC5G/#EC5B and #EC6Y/#EC6Z with CCIN 58FE.
  • A problem was fixed for the On-Chip Controller (OCC) going into safe mode (causes loss of processor performance) with SRC BC702616 logged.  This problem can be triggered by the loss of a power supply (an oversubscription event).  The problem can be circumvented by fixing the issue with the power supply.
  • A problem was fixed for certain SR-IOV adapters that have a rare, intermittent error with B400FF02 and B400FF04 logged, causing a reboot of the VF.  The error is handled and recovered without any user intervention needed.  The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CE; and #EC66/#EC67 with CCIN 2CF3.
  • A problem was fixed for not logging SRCs for certain cable pulls from the #EMX0 PCIe expansion drawer.  With the fix, the previously undetected cable pulls are now detected and logged with SRC B7006A8B and B7006A88 errors.
  • A problem was fixed for a rare system hang with SRC BC70E540 logged that may occur when adding processors through licensing or the system throttle state changing (becoming throttled or unthrottled) on an Enterprise Pool system.  The trigger for the problem is a very small timing window in the hardware as the processor loads are changing.
  • A problem was fixed for the error handling of a system with an unsupported memory configuration that exceeds available memory power. Without the fix, the IPL of the system is attempted and fails with a segmentation fault with SRCs B1818611 and B181460B logged that do not call out the incorrect DIMMs.
  • A problem was fixed for the Systems Management Services (SMS) menu "Device IO Information" option being incorrect when displaying the capacity for an NVMe or Fibre Channel (FC) NVMe disk.  This problem occurs every time the data is displayed.
  • A problem was fixed for an unrecoverable UE SRC B181BE12 being logged if a service processor message acknowledgment is sent to a Hostboot instance that has already shut down.  This is a harmless error log and it should have been marked as an informational log.
  • A problem was fixed for Time of Day (TOD) being lost for the real-time clock (RTC) when the system initializes from AC power off to service processor standby state with an SRC B15A3303 logged.  This is a very rare problem that involves a timing problem in the service processor kernel that can be recovered by setting the system time with ASMI.
  • A problem was fixed for intermittent failures for a reset of a Virtual Function (VF) for SR-IOV adapters during Enhanced Error Handling (EEH) error recovery.  This is triggered by EEH events at a VF level only, not at the adapter level.  The error recovery fails if a data packet is received by the VF while the EEH recovery is in progress.  A VF that has failed can be recovered by a partition reboot or a DLPAR remove and add of the VF.
  • A problem was fixed for performance degradation of a partition due to task dispatching delays.  This may happen when a processor chip has all of its shared processors removed and converted to dedicated processors. This could be driven by DLPAR remove of processors or Dynamic Platform Optimization (DPO).
  • The following problems were fixed for certain SR-IOV adapters:
    1) An error was fixed that occurs during VNIC failover where the VNIC backing device has a physical port down with an SRC B400FF02 logged.
    2) A problem was fixed for adding a new logical port that has a PVID assigned that is causing traffic on that VLAN to be dropped by other interfaces on the same physical port which uses OS VLAN tagging for that same VLAN ID.  Each time a logical port with a non-zero PVID that matches an existing VLAN is dynamically added to a partition, or is activated as part of a partition activation, traffic flow stops for other partitions with OS-configured VLAN devices using the same VLAN ID.  This problem can be recovered by configuring an IP address on the logical port with the non-zero PVID and initiating traffic flow on this logical port.  This problem can be avoided by not configuring logical ports with a PVID if other logical ports on the same physical port are configured with OS VLAN devices.
    This fix updates the adapter firmware to 11.4.415.36 for the following Feature Codes and CCINs:  #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for incomplete periodic data gathered by IBM Service for #EMX0 PCIe expansion drawer predictive error analysis.  The service data is missing the PLX (PCIe switch) data that is needed for the debug of certain errors.
  • A problem was fixed for a partition hang in shutdown with SRC B200F00F logged.  The trigger for the problem is an asynchronous NX accelerator job (such as gzip or NX842 compression) in the partition that fails to clean up successfully.  This is intermittent and does not cause a problem until a shutdown of the partition is attempted.  The hung partition can be recovered by performing an LPAR dump on the hung partition.  When the dump has been completed, the partition will be properly shut down and can then be restarted without any errors.
VL930_116_040 / FW930.30

10/21/20

Impact: Data       Severity: HIPER

New features and functions

  • DEFERRED: Host firmware support for anti-rollback protection.  This feature implements firmware anti-rollback protection as described in NIST SP 800-147B "BIOS Protection Guidelines for Servers".  Firmware is signed with a "secure version".  Support added for a new menu in ASMI called "Host firmware security policy" to update this secure version level at the processor hardware.  Using this menu, the system administrator can enable the "Host firmware secure version lock-in" policy, which will cause the host firmware to update the "minimum secure version" to match the currently running firmware. Use the "Firmware Update Policy" menu in ASMI to show the current "minimum secure version" in the processor hardware along with the "Minimum code level supported" information. The secure boot verification process will block installing any firmware secure version that is less than the "minimum secure version" maintained in the processor hardware.
    Prior to enabling the "lock-in" policy, it is recommended to accept the current firmware level.
    WARNING: Once lock-in is enabled and the system is booted, the "minimum secure version" is updated and there is no way to roll it back to allow installing firmware releases with a lesser secure version.
  • Enable periodic logging of internal component operational data for the PCIe3 expansion drawer paths.  The logging of this data does not impact the normal use of the system.

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed for certain SR-IOV adapters for a condition that may result from frequent resets of adapter Virtual Functions (VFs), or transmission stalls and could lead to potential undetected data corruption.
    The following additional fixes are also included:
    1) The VNIC backing device goes to a powered off state during a VNIC failover or Live Partition Mobility (LPM) migration.  This failure is intermittent and very infrequent.
    2) Adapter time-outs with SRC B400FF01 or B400FF02 logged.
    3) Adapter time-outs related to adapter commands becoming blocked  with SRC B400FF01 or B400FF02 logged.
    4) VF function resets occasionally not completing quickly enough resulting in SRC B400FF02 logged.
    This fix updates the adapter firmware to 11.4.415.33 for the following Feature Codes and CCINs:  #EN15/#EN16 with CCIN 2CE3, #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0, #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • DEFERRED:  A problem was fixed for a slow down in PCIe adapter performance or loss of adapter function caused by a reduction in interrupts available to service the adapter.  This problem can be triggered over time by partition activations or DLPAR adds of PCIe adapters to a partition.  This fix must be applied and the system re-IPLed for existing adapter performance problems to be resolved.
  • A rare problem was fixed for a checkstop during an IPL that fails to isolate and guard the problem core.  An SRC is logged with B1xxE5xx and an extended hex word 8 xxxxDD90.  With the fix, the suspected failing hardware is guarded.
  • A problem was fixed to allow quicker recovery of PCIe links for the #EMX0 PCIe expansion drawer for a run-time fault with B7006A22 logged.  The time for recovery attempts can exceed six minutes on rare occasions which may cause I/O adapter failures and failed nodes.  With the fix, the PCIe links will recover or fail faster (in the order of seconds) so that redundancy in a cluster configuration can be used with failure detection and failover processing by other hosts, if available, in the case where the PCIe links fail to recover.
  • A problem was fixed for system memory not returned after create and delete of partitions, resulting in slightly less memory available after configuration changes in the systems.  With the fix, an IPL of the system will recover any of the memory that was orphaned by the issue.
  • A problem was fixed for certain SR-IOV adapters that do not support the "Disable Logical Port" option from the HMC but the HMC was allowing the user to select this, causing incorrect operation.  The invalid state of the logical port causes an "Enable Logical Port" to fail in a subsequent operation.  With the fix, the HMC provides the message that the "Disable Logical Port" is not supported for the adapter.  This affects the adapters with the following Feature Codes and CCINs: #EN15/#EN16 with CCIN 2CE3,  #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0,  #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
  • A problem was fixed for SR-IOV adapters having an SRC B400FF04 logged when a VF is reset.  This is an infrequent issue and can occur for a Live Partition Mobility migration of a partition or during vNIC (Virtual Network Interface Controller) failovers where many resets of VFs are occurring.  This error is recovered automatically with no impact on the system.
  • A problem was fixed to remove unneeded resets of a Virtual Function (VF) for SR-IOV adapters, providing for improved performance of the startup or recovery time of the VF.  This performance difference may be noticed during a Live Partition Mobility migration of a partition or during vNIC (Virtual Network Interface Controller) failovers where many resets of VFs are occurring.
  • A problem was fixed for TPM hardware failures not causing SRCs to be logged with a call out if the system is configured in ASMI to not require TPM for the IPL.  If this error occurs, the user would not find out about it until they needed to run with TPM on the IPL.  With the fix, the error logs and notifications will occur regardless of how the TPM is configured.
  • A problem was fixed for PCIe resources under a deconfigured PCIe Host Bridge (PHB) being shown on the OS host as available resources when they should be shown as deconfigured.  While this fix can be applied concurrently, a re-IPL of the system is needed to correct the state of the PCIe resources if a PHB had already been deconfigured.
  • A problem was fixed for the REST/Redfish interface to change the success return code for object creation from "200" to "201".  The "200" status code means that the request was received and understood and is being processed.  A "201" status code indicates that a request was successful and, as a result, a resource has been created.  The Redfish Ruby Client, "redfish_client" may fail a transaction if a "200" status code is returned when "201" is expected.
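    The distinction matters to strict clients.  A minimal sketch (hypothetical helper, not IBM code) of how a Redfish client distinguishes the two codes on a resource-creation request:

```python
# Hypothetical sketch of strict Redfish client behavior on a POST that
# creates a resource.  Per HTTP semantics, 201 (Created) indicates a
# resource was created; 200 (OK) only indicates the request succeeded.

def resource_created(status_code: int) -> bool:
    """Return True only when the response indicates a resource was created."""
    return status_code == 201

# A strict client such as redfish_client treats 200 as a failed creation:
assert resource_created(201) is True
assert resource_created(200) is False  # pre-fix firmware returned 200 here
```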
  • A problem was fixed for a concurrent maintenance "Repair and Verify" (R&V) operation for a #EMX0 fanout module that fails with an "Unable to isolate the resource" error message.  This should occur only infrequently for cases where a physical hardware failure has occurred which prevents access to slot power controls.  This problem can be worked around by bringing up the "PCIe Hardware Topology" screen from either ASMI or the HMC after the hardware failure but before the concurrent repair is attempted.  This will avoid the problem with the PCIe slot isolation.  These steps can also be used to recover from the error to allow the R&V repair to be attempted again.
  • A problem was fixed for certain large I/O adapter configurations having the PCI link information truncated on the PCI-E topology display shown with ASMI and the HMC.  Because of the truncation, individual adapters may be missing on the PCI-E topology screens.
  • A problem was fixed for a rare system hang that can occur when a page of memory is being migrated.  Page migration (memory relocation) can occur for a variety of reasons, including predictive memory failure, DLPAR of memory, and normal operations related to managing the page pool resources.
  • A problem was fixed for utilization statistics for commands such as HMC lslparutil and third-party lpar2rrd that do not accurately represent CPU utilization.  The values are incorrect every time for a partition that is migrated with Live Partition Mobility (LPM).  Power Enterprise Pools 2.0 is not affected by this problem.  If this problem has occurred, here are three possible recovery options:
    1) Re-IPL the target system of the migration.
    2) Or delete and recreate the partition on the target system.
    3) Or perform an inactive migration of the partition.  The cycle values get zeroed in this case.
  • A problem was fixed for running PCM (Performance Collection and Monitoring) on a system with SR-IOV adapters in shared mode that results in an "Incomplete" system state with certain hypervisor tasks deadlocked.  This problem is rare and is triggered when using SR-IOV adapters in shared mode and gathering performance statistics with PCM while also having a low-level error on an adapter.  The only way to recover from this condition is to re-IPL the system.
  • A problem was fixed for an enhanced PCIe expansion drawer FPGA reset causing EEH events from the fanout module or cable cards that disrupt the PCIe lanes for the PCIe adapters.  This problem affects systems with the PCIe expansion drawer enhanced fanout module (#EMXH) and the enhanced cable card (#EJ1R or #EJ20). The error is associated with the following SRCs being logged:
    B7006A8D with PRC 37414123 (XmPrc::XmCCErrMgrBearPawPrime | XmPrc::LocalFpgaHwReset)
    B7006A8E with PRC 3741412A (XmPrc::XmCCErrMgrBearPawPrime | XmPrc::RemoteFpgaHwReset)
    If the EEH errors occur, the OS device drivers automatically recover but with a reset of affected PCIe adapters that would cause a brief interruption in the I/O communications.
  • A problem was fixed for a PCIe3 expansion drawer cable that has hidden error logs for a single lane failure.  This happens whenever a single lane error occurs.  Subsequent lane failures are not hidden and have visible error logs.  Without the fix, the hidden or informational logs would need to be examined to gather more information for the failing hardware.
  • A problem was fixed for mixing modes on the ports of SR-IOV adapters that causes SRCs B200A161, B200F011, B2009014 and B400F104 to be logged on boot of the failed adapter.  This error happens when one port of the adapter is changed to option 1 with a second port set at either option 0 or option 2.  The error can be cleared by taking the adapter out of SR-IOV shared mode.  The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CEC; and #EC66/#EC67 with CCIN 2CF3.
  • A problem was fixed for certain SR-IOV adapters with the following issues:
    1) The VNIC backing device goes to a powered off state during a VNIC failover or Live Partition Mobility (LPM) migration.  This failure is intermittent and very infrequent.
    2) Adapter time-outs with SRC B400FF01 or B400FF02 logged.
    3) Adapter time-outs related to adapter commands becoming blocked with SRC B400FF01 or B400FF02 logged
    This fix updates the adapter firmware to 11.4.415.32 for the following Feature Codes and CCINs:  #EN15/#EN16 with CCIN 2CE3,  #EN17/#EN18 with CCIN 2CE4, #EN0H/#EN0J with CCIN 2B93, #EN0M/#EN0N with CCIN 2CC0,  #EN0K/#EN0L with CCIN 2CC1, #EL56/#EL38 with CCIN 2B93, and #EL57/#EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for a partition configured with a large number (approximately 64) of Virtual Persistent Memory (PMEM) LUNs hanging during the partition activation with a CA00E134 checkpoint SRC posted.  Partitions configured with approximately 64 PMEM LUNs will likely hang and the greater the number of LUNs, the greater the possibility of the hang.  The circumvention for this problem is to reduce the number of PMEM LUNs to 64 or less in order to boot successfully.  The PMEM LUNs are also known as persistent memory volumes and can be managed using the HMC.  For more information on this topic, refer to https://www.ibm.com/support/knowledgecenter/POWER9/p9efd/p9efd_lpar_pmem_settings.htm.
  • A problem was fixed for non-optimal On-Chip Controller (OCC) processor frequency adjustments when system power limits or user power caps are exceeded.  When a workload causes power limits or caps to be exceeded, there can be large frequency swings for the processors and a processor chip can get stuck at minimum frequency.  With the fix, the OCC now waits for new power readings when changing the processor frequency and uses a master power capping frequency to keep all processors at the same frequency.  As a workaround for this problem, do not set a power cap or run a workload that would exceed the system power limit.
  • A problem was fixed for mixing memory DIMMs with different timings (different vendors) under the same memory controller that fail with an SRC BC20E504 error and DIMMs deconfigured.  This is an "MCBIST_BRODCAST_OUT_OF_SYNC" error.  The loss of memory DIMMs can result in an IPL failure.  This problem can happen if the memory DIMMs have a certain level of timing differences.  If the timings are not compatible, the failure will occur on the IPL during the memory training.  To circumvent this problem, each memory controller should have only memory DIMMs from the same vendor plugged.
  • A problem was fixed for the Self Boot Engine (SBE) going to termination with an SRC B150BA8D logged when booting on a bad core.  Once this happens, this error will persist as the bad core is not deconfigured.  To recover from this error and be able to IPL, the bad core must be manually deconfigured.   With the fix, the failing core is deconfigured and the SBE is reconfigured to use another core so the system is able to IPL.
  • A problem was fixed for guard clearing where a specific unguard action may cause other unrelated predictive and manual guards to also be cleared.
  • A problem was fixed to correct some of the register values for the LPADUMP for certain SR-IOV adapters.  The impact is there are some minor changes in the register set when compared to the latest version published by Mellanox Firmware Tools (MFT) Version 4.15.0 that can be found at https://www.mellanox.com/products/adapter-software/firmware-tools.  The mismatch in register values can affect the ability to debug problems.
    The SR-IOV adapters affected have the following Feature Codes and CCINs: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; #EC3L/#EC3M with CCIN 2CEC; and #EC66/#EC67 with CCIN 2CF3.
  • A problem was fixed for an infrequent issue after a Live Partition Mobility (LPM) operation from a POWER9 system to a POWER8 or POWER7 system.  The issue may cause unexpected OS behavior, which may include loss of interrupts, device time-outs, or delays in dispatching.  Rebooting the affected target partition will resolve the problem.
  • A problem was fixed for a partition crash or hang following a partition activation or a DLPAR add of a virtual processor.  For partition activation, this issue is only possible for a system with a single partition owning all resources.  For DLPAR add, the issue is extremely rare.
  • A problem was fixed for a DLPAR remove of memory from a partition that fails if the partition contains 65535 or more LMBs.  With 16MB LMBs, this error threshold is 1 TB of memory.  With 256 MB LMBs, it is 16 TB of memory.  A reboot of the partition after the DLPAR will remove the memory from the partition.
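The memory thresholds above follow directly from the 65535-LMB count; a minimal sketch of the arithmetic (function and variable names are illustrative):

```python
# The failure threshold is 65535 LMBs regardless of LMB size; the
# corresponding partition memory size scales with the LMB size.
LMB_THRESHOLD = 65535
MB_PER_TB = 1024 * 1024

def threshold_memory_tb(lmb_size_mb: int) -> float:
    """Approximate partition memory size (TB) at which the 65535-LMB
    threshold is reached for a given LMB size in MB."""
    return LMB_THRESHOLD * lmb_size_mb / MB_PER_TB

print(round(threshold_memory_tb(16)))   # ~1 TB with 16 MB LMBs
print(round(threshold_memory_tb(256)))  # ~16 TB with 256 MB LMBs
```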
  • A problem was fixed for incorrect run-time deconfiguration of a processor core with SRC B700F10B. This problem can be circumvented by a reconfiguration of the processor core but this should only be done with the guidance of IBM Support to ensure the core is good.
  • A problem was fixed for Live Partition Mobility (LPM) being shown as enabled at the OS when it has been disabled by the ASMI command line using the service processor command of "cfcuod -LPM OFF".  LPM is actually disabled and the status shows correctly on the HMC.  The status on the OS can be ignored (for example as shown by the AIX command "lparstat -L") as LPM will not be allowed to run when it is disabled.
  • A problem was fixed for a VIOS, AIX, or Linux partition hang during an activation at SRC CA000040.  This will occur on a system that has been running more than 814 days when the boot of the partition is attempted if the partitions are in POWER9_base or POWER9 processor compatibility mode.
    A workaround to this problem is to re-IPL the system or to change the failing partition to POWER8 compatibility mode.
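The 814-day figure is consistent with a 55-bit count of the 512 MHz POWER timebase overflowing; the arithmetic below is an illustrative sketch under that assumption only, not a statement of the documented root cause:

```python
# Illustrative only: assumes the hang coincides with a 55-bit counter of
# the 512 MHz POWER timebase overflowing. This mechanism is an assumption
# inferred from the 814-day figure, not documented in the fix description.
TIMEBASE_HZ = 512_000_000  # POWER timebase frequency
SECONDS_PER_DAY = 86_400

days_to_overflow = (2 ** 55) / TIMEBASE_HZ / SECONDS_PER_DAY
print(int(days_to_overflow))  # 814 days of continuous uptime
```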
  • A problem was fixed for performance tools perfpmr, tprof and pex that may not be able to collect data for the event-based options. 
    This can occur any time an OS thread becomes idle.  When the processor cores are assigned to the next active process, the performance registers may be disabled.
  • A problem was fixed for a system hang and HMC "Incomplete" state that may occur when a partition hangs in shutdown with SRC B200F00F logged.  The trigger for the problem is an asynchronous NX accelerator job (such as gzip) in the partition that fails to clean up successfully.  This is intermittent and does not cause a problem until a shutdown of the partition is attempted.
  • A problem was fixed so that an SRC B7006A99 informational log is now posted as a Predictive error with a call out of the CXP cable FRU.  This fix improves FRU isolation for cases where a CXP cable alert causes a B7006A99 that occurs prior to a B7006A22 or B7006A8B.  Without the fix, the SRC B7006A99 is informational and the latter SRCs cause a larger hardware replacement even though the earlier event identified a probable cause for the cable FRU.
  • A problem was fixed for a security vulnerability for the Self Boot Engine (SBE).  The SBE can be compromised from the service processor to allow injection of malicious code. An attacker that gains root access to the service processor could compromise the integrity of the host firmware and bypass the host firmware signature verification process. This compromised state cannot be detected through TPM attestation.  This is Common Vulnerabilities and Exposures issue number CVE-2021-20487.

  System firmware changes that affect certain systems

  • On systems with an IBM i partition, a problem was fixed for a dedicated memory IBM i partition running in P9 processor compatibility mode failing to activate with HSCL1552 "the firmware operation failed with extended error".  This failure only occurs under a very specific scenario - the new amount of desired memory is less than the current desired memory, and the Hardware Page Table (HPT) size needs to grow.
  • On systems with AIX and Linux partitions, a problem was fixed for AIX and Linux partitions that crash or hang when reporting any of the following Partition Firmware RTAS ASSERT rare conditions:
    1) SRC BA33xxxx errors - Memory allocation and management errors.
    2) SRC BA29xxxx errors - Partition Firmware internal stack errors.
    3) SRC BA00E8xx errors - Partition Firmware initialization errors during concurrent firmware update or Live Partition Mobility (LPM) operations.
    This problem should be very rare.  If the problem does occur, a partition reboot is needed to recover from the error.
VL930_101_040 / FW930.20

02/27/20

Impact: Availability       Severity: HIPER

New features and functions

  • Support was added for real-time data capture for PCIe3 expansion drawer (#EMX0) cable card connection data via resource dump selector on the HMC or in ASMI on the service processor.  Using the resource selector string of "xmfr -dumpccdata" will non-disruptively generate an RSCDUMP type of dump file that has the current cable card data, including data from cables and the retimers.
  • Improvements to link stack algorithms.

System firmware changes that affect all systems 

  • HIPER/Pervasive:  A problem was fixed for a possible system crash and HMC "Incomplete" state when a logical partition (LPAR) power off after a dynamic LPAR (DLPAR) operation fails for a PCIe adapter.  This scenario is likely to occur during concurrent maintenance of PCIe adapters or for #EMX0 components such as PCIe3 Cable adapters, Active Optical or copper cables, fanout modules, chassis management cards, or midplanes.  The DLPAR fail can leave page table mappings active for the adapter, causing the problems on the power down of the LPAR.  If the system does not crash, the DLPAR will fail if it is retried until a platform IPL is performed.
  • HIPER/Pervasive:  A problem was fixed for an HMC "Incomplete" state for a system after the HMC user password is changed with ASMI on the service processor.  This problem can occur if the HMC password is changed on the service processor but not also on the HMC, and a reset of the service processor happens.  With the fix, the HMC will get the needed "failed authentication" error so that the user knows to update the old password on the HMC.
  • DEFERRED:  A problem was fixed for a processor core failure with SRCs B150BA3C and BC8A090F logged that deconfigures the entire processor for the current IPL.  A re-IPL of the system will recover the lost processor with only the bad core guarded.
  • A problem was fixed for certain SR-IOV adapters that can have an adapter reset after a mailbox command timeout error.
    This fix updates the adapter firmware to 11.2.211.39  for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3, #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0, #EN0K/EN0L with CCIN 2CC1, #EL56/EL38 with CCIN 2B93, and #EL57/EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for an SR-IOV adapter failure with B400FFxx errors logged when moving the adapter to shared mode.  This is an infrequent race condition where the adapter is not yet ready for commands and it can also occur during EEH error recovery for the adapter.  This affects the SR-IOV adapters with the following feature codes and CCINs:   #EC2R/EC2S with CCIN 58FA; #EC2T/EC2U with CCIN 58FB; #EC3L/EC3M with CCIN 2CEC; and #EC66/EC67 with CCIN 2CF3.
  • A problem was fixed for an IPL failure with the following possible SRCs logged:  11007611, 110076x1, 1100D00C, and 110015xx.  The service processor may reset/reload for this intermittent error and end up in the termination state.
  • A problem was fixed for the location code of the Removable EXchange (RDX) docking station being incorrectly reported as P1-P3.  The correct location code is Un-P3. This problem pertains only to the S914 (9009-41A), S924 (9009-42A) and the H924 (9223-42H) models.  Please refer to the following IBM Knowledge Center article for more information on the location codes:  https://www.ibm.com/support/knowledgecenter/9009-42A/p9ecs/p9ecs_914_924_loccodes.htm
  • A problem was fixed for delayed interrupts on a Power9 system following a Live Partition Mobility operation from a Power7 or Power8 system.  The delayed interrupts could cause device time-outs, program dispatching delays, or other device problems on the target Power9 system.
  • A problem was fixed for processor cores not being able to be used by dedicated processor partitions if they were DLPAR removed from a dedicated processor partition.  This error can occur if there was a firmware assisted dump or a Live Partition Mobility (LPM) operation after the DLPAR of the processor.  A re-IPL of the system will recover the processor cores.
  • A problem was fixed for a B7006A96 fanout module FPGA corruption error that can occur in unsupported PCIe3 expansion drawer (#EMX0) configurations that mix an enhanced PCIe3 fanout module (#EMXH) in the same drawer with legacy PCIe3 fanout modules (#EMXF, #EMXG, #ELMF, or #ELMG).  This causes the FPGA on the enhanced #EMXH to be updated with the legacy firmware and it becomes a non-working and unusable fanout module.  With the fix, the unsupported #EMX0 configurations are detected and handled gracefully without harm to the FPGA on the enhanced fanout modules.
  • A problem was fixed for lost interrupts that could cause device time-outs or delays in dispatching a program process.  This can occur during memory operations that require a memory relocation for any partition such as mirrored memory defragmentation done by the HMC optmem command, or memory guarding that happens as part of memory error recovery during normal operations of the system.
  • A problem was fixed for extraneous informational logging of SRC B7006A10 ("Insufficient SR-IOV resources available") with a 1306 PRC.  This SRC is logged whenever an SR-IOV adapter is moved from dedicated mode to shared mode.  This SRC with the 1306 PRC should be ignored as no action is needed and there is no issue with SR-IOV resources.
  • A problem was fixed for a hypervisor error during system shutdown where a B7000602 SRC is logged and the system may also briefly go "Incomplete" on the HMC but the shutdown is successful.  The system will power back on with no problems so the SRC can be ignored if it occurred during a shutdown.
  • A problem was fixed for possible dispatching delays for partitions running in POWER8, POWER9_base or POWER9 processor compatibility mode.
  • A problem was fixed for extraneous B400FF01 and B400FF02 SRCs logged when moving cables on SR-IOV adapters.  This is an infrequent error that can occur if the HMC performance monitor is running at the same time the cables are moved.  These SRCs can be ignored when accompanied by cable movement.

System firmware changes that affect certain systems

  • DEFERRED:  For systems using the Feature Code #EPIM 8-core processor, a problem was fixed for a slightly degraded UltraTurbo maximum frequency (approximately 3% less) compared to what is expected for this processor chip.  The fix requires new Workload Optimized Frequency (WOF) tables for the processor, so the system must be re-IPLed for the installed fix to be active.  The WOF UltraTurbo maximum frequency can only be achieved when turning cores off or operating below the 50% workload capacity of the system.
  • On systems with an IBM i partition, a problem was fixed that occurs after a Live Partition Mobility (LPM) of an IBM i partition that may cause issues including dispatching delays and the inability to do further LPM operations of that partition.  The frequency of this problem is rare.  A partition encountering this error can be recovered with a reboot of the partition.
  • On systems with an IBM i partition, a problem was fixed for a D-mode IPL failure when using a USB DVD drive in an IBM 7226 multimedia storage enclosure.  Error logs with SRC BA16010E, B2003110, and/or B200308C can occur.  As a circumvention, an external DVD drive can be used for the D-mode IPL.
VL930_093_040 / FW930.11

12/11/19

Impact: Availability       Severity: SPE

New features and functions

  • Support was added for a new 11-core POWER9 DD2.21 processor module with Feature Code #EP1H and CCIN 5C60.  This support pertains only to the IBM Power System S924 (9009-42A) model.
  • Support was added to allow processor modules to have two different CCINs for modules in the same system.  This is needed to allow DD2.21 modules to co-exist with DD2.3 modules in DD2.2 mode.

System firmware changes that affect all systems

  • DEFERRED: PARTITION_DEFERRED:  A problem was fixed for vHMC having no useable local graphics console when installed on FW930.00 and later partitions.
  • A problem was fixed for an IPMI core dump and SRC B181720D logged, causing the service processor to reset due to a low memory condition.  The memory loss is triggered by frequently using the ipmitool to read the network configuration.  The service processor recovers from this error but if three of these errors occur within a 15 minute time span, the service processor will go to a failed hung state with SRC B1817212 logged.  Should a service processor hang occur, OS workloads will continue to run but it will not be possible for the HMC to interact with the partitions.  This service processor hung state can be recovered by doing a re-IPL of the system with a scheduled outage.
  • A problem was fixed for the Advanced System Management Interface (ASMI) menu for "PCIe Hardware Topology/Reset link" showing the wrong value.  This value is always wrong without the fix.
  • A problem was fixed for a PLL unlock error with SRC B124E504 causing a secondary error of a PRD Internal Firmware Software Fault with SRC B181E580 and incorrect FRU call outs.
  • A problem was fixed for an initialization failure of certain SR-IOV adapter ports during its boot, causing a B400FF02 SRC to be logged.  This is a rare problem and it recovers automatically by the reboot of the adapter on the error.  This problem affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA;  #EC2T/EC2U with CCIN 58FB;  #EC3L/EC3M with CCIN 2CEC;  and #EC66/EC67 with CCIN 2CF3.
  • A problem was fixed for the SR-IOV Virtual Functions (VFs) when the multi-cast promiscuous flag has been turned on for the VF.  Without the fix,  the VF device driver sometimes may erroneously fault when it senses that the multi-cast promiscuous mode had not been achieved although it had been requested.
  • A problem was fixed for SR-IOV adapters to provide a consistent Informational message level for cable plugging issues.  For transceivers not plugged on certain SR-IOV adapters, an unrecoverable error (UE) SRC B400FF03 was changed to an Informational message logged.  This affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA;  #EC2T/EC2U with CCIN 58FB;  #EC3L/EC3M with CCIN 2CEC;  and #EC66/EC67 with CCIN 2CF3.
    For copper cables unplugged on certain SR-IOV adapters, a missing message was replaced with an Informational message logged.  This affects the SR-IOV adapters with the following feature codes and CCINs: #EN17/EN18 with CCIN 2CE4, #EN0K/EN0L with CCIN 2CC1, and #EL57/EL3C with CCIN 2CC1.
  • A problem was fixed for incorrect DIMM callouts for DIMM over-temperature errors.  The error log for the DIMM over temperature will have incorrect FRU callouts, either calling out the wrong DIMM or the wrong DIMM controller memory buffer.
  • A problem was fixed for an Operations Panel hang after using it to set LAN Console as the console type for several iterations.  After several iterations, the operations panel may hang with "Function 41" displayed on the operations panel.  A hot unplug and plug of the operations panel can be used to recover it from the hang.
  • A problem was fixed for shared processor pools where uncapped shared processor partitions placed in a pool may not be able to consume all available processor cycles.  The problem may occur when the sum of the allocated processing units for the pool member partitions equals the maximum processing units of the pool.
  • A problem was fixed for Novalink failing to activate partitions that have names with character lengths near the maximum allowed character length.  This problem can be circumvented by changing the partition name to have 32 characters or less.
  • A problem was fixed where a Linux or AIX partition type was incorrectly reported as unknown.  Symptoms include: IBM Cloud Management Console (CMC) not being able to determine the RPA partition type (Linux/AIX) for partitions that are not active; and HMC attempts to dynamically add CPU to Linux partitions may fail with a HSCL1528 error message stating that there are not enough Integrated Facility for Linux (IFL) cores for the operation.
  • A problem was fixed for a hypervisor hang that can occur on the target side when doing a Live Partition Mobility (LPM) migration from a system that does not support encryption and compression of LPM data.  If the hang occurs, the HMC will go to an "Incomplete" state for the target system.  The problem is rare because the data from the source partition must be in a very specific pattern to cause the failure.  When the failure occurs, a B182951C will be logged on the target (destination) system and the HMC for the source partition will issue the following message:  "HSCLA318 The migration command issued to the destination management console failed with the following error: HSCLA228 The requested operation cannot be performed because the managed system <system identifier> is not in the Standby or Operating state.".  To recover, the target system must be re-IPLed.
  • A problem was fixed for performance collection tools not collecting data for event-based options.  This fix pertains to perfpmr and tprof on AIX, and Performance Explorer (PEX) on IBM i.
  • A problem was fixed for a Live Partition Mobility (LPM) migration of a large memory partition to a target system that causes the target system to crash and for the HMC to go to the "Incomplete" state.  For servers with the default LMB size (256MB), if the partition is >=16TB and if the desired memory is different than the maximum memory, LPM may fail on the target system.  Servers with LMB sizes less than the default could hit this problem with smaller memory partition sizes.  A circumvention to the problem is to set the desired and maximum memory to the same value for the large memory partition that is to be migrated.
  • A problem was fixed for certain SR-IOV adapters with the following issues:
    1) If the SR-IOV logical port's VLAN ID (PVID) is modified while the logical port is configured, the adapter will use an incorrect PVID for the Virtual Function (VF).  This problem is rare because most users do not change the PVID once the logical port is configured, so they will not have the problem.
    2) Adapters with an SRC of B400FF02 logged.
    This fix updates the adapter firmware to 11.2.211.38  for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3,  #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0,  #EN0K/EN0L with CCIN 2CC1, #EL56/EL38 with CCIN 2B93, and #EL57/EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for certain SR-IOV adapters where after some error conditions the adapter may hang with no messages or error recovery.  This is a rare problem for certain severe adapter errors.  This problem affects the SR-IOV adapters with the following feature codes:   #EC66/EC67 with CCIN 2CF3.  This problem can be recovered by removing the adapter from SR-IOV mode and putting it back in SR-IOV mode, or the system can be re-IPLed.
  • A problem was fixed for an initialization failure of certain SR-IOV adapters when changed into SR-IOV mode.  This is an infrequent problem that most likely can occur following a concurrent firmware update when the adapter also needs to be updated. This problem affects the SR-IOV adapter with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA;  #EC2T/EC2U with CCIN 58FB;  #EC3L/EC3M with CCIN 2CEC;  and #EC66/EC67 with CCIN 2CF3.  This problem can be recovered by removing the adapter from SR-IOV mode and putting it back in SR-IOV mode, or the system can be re-IPLed.
  • A problem was fixed for a false memory error that can be logged during the IPL with SRC BC70E540 with the description "mcb(n0p0c1) (MCBISTFIR[12]) WAT_DEBUG_ATTN" but with no hardware call outs.  This error log can be ignored.
  • A problem was fixed for an IPL failure after installing DIMMs of different sizes, causing memory access errors.  Without the fix, the memory configuration should be restored to only use DIMMs of the same size.
  • A problem was fixed for a memory DIMM plugging rule violation that causes the IPL to terminate with an RC_GET_MEM_VPD_UNSUPPORTED_CONFIG error log that calls out the memory port but has no DIMM call outs, and no DIMM deconfigurations are done.  With the fix, the DIMMs that violate the plugging rules will be deconfigured and the IPL will complete.  Without the fix, the memory configuration should be restored to the prior working configuration to allow the IPL to be successful.
  • A problem was fixed for a B7006A22 Recoverable Error for the enhanced PCIe3 expansion drawer (#EMX0) with PCIe Six Slot Fan Out modules (#EMXH) installed.  This can occur up to two hours after an IPL from power off.  This can be a frequent occurrence on an IPL for systems that have the PCIe Six Slot Fan Out module (#EMXH).  The error is automatically recovered at the hypervisor level.  If an LPAR fails to start after this error, a restart of the LPAR is needed.
  • A problem was fixed for degraded memory bandwidth on systems with memory that had been dynamically repaired with symbols to mark the bad bits.
  • A problem was fixed for an intermittent IPMI core dump on the service processor.  This occurs only rarely when multiple IPMI sessions are starting and cleaning up at the same time.  A new IPMI session can fail initialization when one of its session objects is inadvertently cleaned up.  The circumvention is to retry the IPMI command that failed.
  • A problem was fixed for an intermittent IPL failure with SRC B181E540 logged with fault signature " ex(n2p1c0) (L2FIR[13]) NCU Powerbus data timeout".  No FRU is called out.  The error may be ignored since the automatic re-IPL is successful.  The error occurs very infrequently.  This is the second iteration of the fix that has been released.  Expedient routing of the Powerbus interrupts did not occur in all cases in the prior fix, so the timeout problem was still occurring.
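The circumvention noted above for the intermittent IPMI failure (retry the IPMI command that failed) can be scripted.  A minimal sketch, assuming a POSIX shell environment; the ipmitool invocation in the trailing comment is illustrative only, and the host name, credentials, and retry counts are placeholders, not IBM-documented settings:

```python
import subprocess
import time

def run_with_retry(cmd, attempts=3, delay=2.0):
    """Run a shell command, retrying on a non-zero exit status.

    Intended for transient failures such as the intermittent IPMI
    session-initialization error described above; attempts and delay
    are illustrative values.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if attempt < attempts:
            time.sleep(delay)
    raise RuntimeError("command failed after %d attempts: %r" % (attempts, cmd))

# Example (hypothetical service processor address and credentials):
# run_with_retry(["ipmitool", "-I", "lanplus", "-H", "fsp.example.com",
#                 "-U", "admin", "-P", "secret", "lan", "print", "1"])
```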

System firmware changes that affect certain systems

  • On systems with PCIe3 expansion drawers (feature code #EMX0), a problem was fixed for a concurrent exchange of a PCIe expansion drawer cable card that, although successful, leaves the fault LED turned on.
  • On systems with 16GB huge-pages, a problem was fixed for certain SR-IOV adapters with all or nearly all memory assigned to them preventing a system IPL.  This affects the SR-IOV adapters with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA;  #EC2T/EC2U with CCIN 58FB;  #EC3L/EC3M with CCIN 2CEC;  and #EC66/EC67 with CCIN 2CF3.  The problem can be circumvented by powering off the system and turning off all the huge-page allocations.
  • On systems running IBM i partitions, a problem was fixed for a NVME (Non-Volatile Memory Express) device load source not being found for an IBM i boot partition.  This can occur on a system with multiple load source candidates when the needed load source is not the first entry in the namespace ID list.  This problem can be circumvented by directing the System License Internal Code (SLIC) Bootloader to a specific namespace and bypassing a search.
  • On systems with IBM i partitions, a problem was fixed for Live Partition Mobility (LPM) migrations that could have incorrect hardware resource information (related to VPD) in the target partition if a failover had occurred for the source partition during the migration.  This failover would have to occur during the Suspended state of the migration, which only lasts about a second, so this should be rare.  With the fix, at a minimum, the migration error will be detected to abort the migration so it can be restarted.  And at a later IBM i OS level, the fix will allow the migration to complete even though the failover has occurred during the Suspended state of the migration.
  • On systems running IBM i partitions, a problem was fixed for IBM i collection services that may produce incorrect instruction count results.
  • On systems running IBM i partitions, a performance improvement was made for the cache memory management of workloads that utilize heavy I/O operations.
  • On systems with IBM i partitions, a rare problem was fixed for an intermittent failure of a DLPAR remove of an adapter.  In most cases, a retry of the operation will be successful.
  • On systems with IBM i partitions, a problem was fixed that was allowing V7R1 to boot on or be migrated to POWER9 servers.  As documented in the System Software maps for IBM i (https://www-01.ibm.com/support/docview.wss?uid=ssm1platformibmi), V7R1 IBM i software is not supported on POWER9 servers.
  • On systems with IBM i partitions, a problem was fixed for a LPAR restart error after a DLPAR of an active adapter was performed and the LPAR was shut down.  A reboot of the system will recover the LPAR so it will start.
VL930_068_040 / FW930.03

08/22/19

Impact:  Data                  Severity:  HIPER

New features and functions

  • Support was added for processor module DD2.3 with CCIN 5C41 but all processor modules in the system must have the same CCIN.  These new processor modules cannot be mixed with DD2.21 processor modules with CCIN 5C25 until service pack level FW930.10.

System firmware changes that affect all systems 

  • HIPER/Pervasive:  A change was made to fix an intermittent processor anomaly that may result in issues such as operating system or hypervisor termination, application segmentation fault, hang, or undetected data corruption.  The only issues observed to date have been operating system or hypervisor terminations.
  • A problem was fixed for a very intermittent partition error when using Live Partition Mobility (LPM) or concurrent firmware update.  For a mobility operation, the issue can result in a partition crash if the mobility target system is FW930.00, FW930.01 or FW930.02.  For a code update operation, the partition may hang.  The recovery is to reboot the partition after the crash or hang.

System firmware changes that affect certain systems

  • DEFERRED: HIPER/Pervasive:  A change was made to address a problem causing SAS bus errors and failed/degraded SAS paths with SRCs or SRNs such as these logged: 57D83400, 2D36-3109, 57D84040, 2D36-4040, 57D84060, 2D36-4060, 57D84061, 2D36-4061, 57D88130, 2D36-FFFE, xxxxFFFE, xxxx-FFFE, xxxx4061 and xxxx-4061.   These errors occur intermittently for hard drives and solid state drives installed in the two high-performance DASD backplanes (feature codes #EJ1M and #EJ1D).  This problem pertains to the 9009-41A, 9009-42A, and 9223-42H models only.   Without the change, the circumvention is to replace the affected DASD backplane.
VL930_048_040 / FW930.02

06/28/19

Impact: Availability       Severity: SPE

System firmware changes that affect all systems

  • A problem was fixed for a bad link for the PCIe3 expansion drawer (#EMX0) I/O drawer with the clock enhancement causing a system failure with B700F103.  This error could occur during an IPL or a concurrent add of the link hardware.
  • A problem was fixed for On-Chip Controller (OCC) power capping operation time-outs with SRC B1112AD3 that caused the system to enter safe mode, resulting in reduced performance.  The problem only occurred when the system was running with high power consumption, requiring the need for OCC power capping.
  • A problem was fixed for the "PCIe Topology " option to get cable information in the HMC or ASMI that was returning the wrong cable part numbers if the PCIe3 expansion drawer (#EMX0) I/O drawer clock enhancement was configured.  If cables with the incorrect part numbers are used for an enhanced PCIe3 expansion drawer configuration, the hypervisor will log a B7006A20 with PRC 4152 indicating an invalid configuration - https://www.ibm.com/support/knowledgecenter/9080-M9S/p9eai/B7006A20.htm.
  • A problem was fixed for a drift in the system time (time lags and the clock runs slower than the true value of time) that occurs when the system is powered off to the service processor standby state.  To recover from this problem, the system time must be manually corrected using the Advanced System Management Interface (ASMI) before powering on the system.  The time lag increases in proportion to the duration of time that the system is powered off.
VL930_040_040 / FW930.01

05/31/19

Impact: Availability       Severity: HIPER

System firmware changes that affect certain systems

  • HIPER/Pervasive: DISRUPTIVE:  For systems at VL930_035 (FW930.00) with LPARs using a VIOS for virtual adapters or using VIOS Shared Storage Pool (SSP) clusters, a problem was fixed for a possible network hang in the LPAR after upgrading from FW910.XX to FW930.00.  Systems with IBM i partitions and HSM may see extraneous resources, with the old ones non-reporting.  With the problem, the I/O adapter port location code names had changed from that used for the FW910 release, and this difference in the naming convention prevents the VIOS virtual adapters from working in some cases.  For example, NPIV (N_PORT ID Virtualization) Fibre Channel (FC) adapters fail to work because the I/O mappings are different.  With the fix, the I/O adapter location code naming convention reverts back to that used for FW910.    There could be other impacts not described here to operating system views of IO devices and IBM strongly recommends that customers at FW930.00 update to this new level or higher.  Customers upgrading from FW910.xx levels to FW930.01 or later are not affected by this problem.
    The following is an example of the original location code format and the problem version of the location code name (notice that "001" was replaced by "ND1"):
    Original format: "fcs0 U78D2.001.WZS000W-P1-C6 Available 01-00 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)"
    Problem format: "fcs0 U78D2.ND1.WZS000W-P1-C6 Available 01-00 8Gb PCI Express Dual Port FC Adapter (df1000f114108a03)"
    If you have a FW930.00 (VL930_035) installed, please follow the steps outlined in the "Important Information applicable only for systems at VL930_035 (FW930.00)  with VIOS partitions or IBM i HSM" section of the README. 
VL930_035_035 / FW930.00

05/17/19

Impact:  New      Severity:  New

All features and fixes from the FW910.30 service pack (and below) are included in this release.

New Features and Functions

  • Support was added to allow the FPGA soft error checking on the PCIe I/O expansion drawer (#EMX0) to be disabled with the help of IBM support using the hypervisor "xmsvc" macro.  This new setting will persist until it is changed by the user or IBM support.  The effect of disabling FPGA soft error checking is to eliminate the FPGA soft error recovery which causes a recoverable PCIe adapter outage.  Some of the soft errors will be hidden by this change but others may have unpredictable results, so this should be done only under guidance of IBM support.
  • Support for the PCIe3 expansion drawer (#EMX0) I/O drawer clock enhancement so that a reset of the drawer does not affect the reference clock to the adapters, allowing the PCIe lanes for the PCIe adapters to keep running through an I/O drawer FPGA reset.  To use this support, new cable cards, fanout modules, and optical cables are needed after this support is installed: PCIe Six Slot Fanout module (#EMXH) - only allowed to be connected to the converter adapter cable card;  PCIe X16 to CXP Optical or CU converter adapter for the expansion drawer (#EJ1R); and new AOC cables with feature/part numbers #ECCR/78P6567, #ECCX/78P6568, #ECCY/78P6569, and #ECCZ/78P6570.  These parts cannot be installed concurrently, so a scheduled outage is needed to complete the migration.
  • Support added for RDMA Over Converged Ethernet (RoCE) for SR-IOV adapters.
  • Support added for SMS menu to enhance the I/O information option to have "vscsi" and "network" options.  The information shown for "vscsi" devices is similar to that provided for SAS and Fibre Channel devices.  The "network" option provides connectivity information for the adapter ports and shows which can be used for network boots and installs.
  • Support for an IBM i system with a SAN boot feature code that can be ordered without an internal DASD backplane, SAS cables, or SAS adapters.
  • Support added to allow integrated USB ports to be disabled.  This is available via an Advanced System Management Interface (ASMI) menu option:  "System Configuration -> Security -> USB Policy".  The USB disable policy, if selected, does not apply to pluggable USB adapters plugged into PCIe slots such as the 4-Port USB adapter (#EC45/#EC46), which are always enabled.

System firmware changes that affect all systems

  • A problem was fixed for a system IPLing with an invalid time set on the service processor that causes partitions to be reset to the Epoch date of 01/01/1970.  With the fix, on the IPL, the hypervisor logs a B700120x when the service processor real time clock is found to be invalid and halts the IPL to allow the time and date to be corrected by the user.  The Advanced System Management Interface (ASMI) can be used to correct the time and date on the service processor.  On the next IPL, if the time and date have not been corrected, the hypervisor will log a SRC B7001224 (indicating the user was warned on the last IPL) but allow the partitions to start, with the time and date set to the Epoch value.

VL910

VL910
For Impact, Severity and other Firmware definitions, Please refer to the below 'Glossary of firmware terms' url:
https://www.ibm.com/support/pages/node/6555136
VL910_151_144 / FW910.50

04/08/20

Impact:  Availability      Severity:  HIPER

New features and functions

  • Support was added for real-time data capture for PCIe3 expansion drawer (#EMX0) cable card connection data via resource dump selector on the HMC or in ASMI on the service processor.  Using the resource selector string of "xmfr -dumpccdata" will non-disruptively generate an RSCDUMP type of dump file that has the current cable card data, including data from cables and the retimers.
  • Improvements to link stack algorithms.
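The cable card resource dump described above can also be requested from the HMC command line.  A minimal sketch that assembles the invocation; the startdump dump-type and resource-selector flags are an assumption based on the HMC command-line reference, and the managed system name is a placeholder, so verify the syntax against your HMC level:

```python
def build_resource_dump_cmd(managed_system):
    """Assemble an HMC CLI call that requests the non-disruptive
    cable card resource dump (RSCDUMP) described above.

    The -t resource / -r selector flags are assumed from the HMC
    command reference; managed_system is a placeholder name.
    """
    return ["startdump", "-m", managed_system,
            "-t", "resource", "-r", "xmfr -dumpccdata"]

# Example (hypothetical managed system name):
# " ".join(build_resource_dump_cmd("Server-9009-42A-SN1234567"))
```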

System firmware changes that affect all systems 

  • HIPER/Pervasive:  A problem was fixed for a possible system crash and HMC "Incomplete" state when a logical partition (LPAR) power off after a dynamic LPAR (DLPAR) operation fails for a PCIe adapter.  This scenario is likely to occur during concurrent maintenance of PCIe adapters or for #EMX0 components such as PCIe3 Cable adapters, Active Optical or copper cables, fanout modules, chassis management cards, or midplanes.  The DLPAR fail can leave page table mappings active for the adapter, causing the problems on the power down of the LPAR.  If the system does not crash, the DLPAR will fail if it is retried until a platform IPL is performed.
  • HIPER/Pervasive:  A problem was fixed for an HMC "Incomplete" state for a system after the HMC user password is changed with ASMI on the service processor.  This problem can occur if the HMC password is changed on the service processor but not also on the HMC, and a reset of the service processor happens.  With the fix, the HMC will get the needed "failed authentication" error so that the user knows to update the old password on the HMC.
  • A problem was fixed for an intermittent IPMI core dump on the service processor.  This occurs only rarely when multiple IPMI sessions are starting and cleaning up at the same time.  A new IPMI session can fail initialization when one of its session objects is inadvertently cleaned up.  The circumvention is to retry the IPMI command that failed.
  • A problem was fixed for the location code of the Removable EXchange (RDX) docking station being incorrectly reported as P1-P3.  The correct location code is Un-P3. This problem pertains only to the S914 (9009-41A), S924 (9009-42A) and the H924 (9223-42H)  models.  Please refer to the following IBM Knowledge Center article for more information on the location codes:  https://www.ibm.com/support/knowledgecenter/9009-42A/p9ecs/p9ecs_914_924_loccodes.htm
  • A problem was fixed for an initialization failure of certain SR-IOV adapters when changed into SR-IOV mode.  This is an infrequent problem that most likely can occur following a concurrent firmware update when the adapter also needs to be updated. This problem affects the SR-IOV adapter with the following feature codes and CCINs: #EC2R/EC2S with CCIN 58FA;  #EC2T/EC2U with CCIN 58FB;  #EC3L/EC3M with CCIN 2CEC;  and #EC66/EC67 with CCIN 2CF3.  This problem can be recovered by removing the adapter from SR-IOV mode and putting it back in SR-IOV mode, or the system can be re-IPLed.
  • A problem was fixed for a Live Partition Mobility (LPM) migration of a large memory partition to a target system that causes the target system to crash and for the HMC to go to the "Incomplete" state.  For servers with the default LMB size (256MB), if the partition is >=16TB and if the desired memory is different than the maximum memory, LPM may fail on the target system.  Servers with LMB sizes less than the default could hit this problem with smaller memory partition sizes.  A circumvention to the problem is to set the desired and maximum memory to the same value for the large memory partition that is to be migrated.
  • A problem was fixed for lost interrupts that could cause device time-outs or delays in dispatching a program process.  This can occur during memory operations that require a memory relocation for any partition such as mirrored memory defragmentation done by the HMC optmem command, or memory guarding that happens as part of memory error recovery during normal operations of the system.
  • A problem was fixed for delayed interrupts on a Power9 system following a Live Partition Mobility operation from a Power7 or Power8 system.  The delayed interrupts could cause device time-outs, program dispatching delays, or other device problems on the target Power9 system.
  • A problem was fixed for processor cores not being able to be used by dedicated processor partitions if they were DLPAR removed from a dedicated processor partition.  This error can occur if there was a firmware assisted dump or a Live Partition Mobility (LPM) operation after the DLPAR of the processor.  A re-IPL of the system will recover the processor cores.
  • A problem was fixed for a B7006A96 fanout module FPGA corruption error that can occur in unsupported PCIe3 expansion drawer (#EMX0) configurations that mix an enhanced PCIe3 fanout module (#EMXH) in the same drawer with legacy PCIe3 fanout modules (#EMXF, #EMXG, #ELMF, or #ELMG).  This causes the FPGA on the enhanced #EMXH to be updated with the legacy firmware, and it becomes a non-working and unusable fanout module.  With the fix, the unsupported #EMX0 configurations are detected and handled gracefully without harm to the FPGA on the enhanced fanout modules.
  • A problem was fixed for an SR-IOV adapter failure with B400FFxx errors logged when moving the adapter to shared mode.  This is an infrequent race condition where the adapter is not yet ready for commands and it can also occur during EEH error recovery for the adapter.  This affects the SR-IOV adapters with the following feature codes and CCINs:   #EC2R/EC2S with CCIN 58FA;  #EC2T/EC2U with CCIN 58FB;  #EC3L/EC3M with CCIN 2CEC;  and #EC66/EC67 with CCIN 2CF3.
  • A problem was fixed for certain SR-IOV adapters with the following issues:
    1) If the SR-IOV logical port's VLAN ID (PVID) is modified while the logical port is configured, the adapter will use an incorrect PVID for the Virtual Function (VF).  This problem is rare because most users do not change the PVID once the logical port is configured, so they will not have the problem.
    2) Adapters with an SRC of B400FF02 logged.
    3) An adapter reset after a mailbox command timeout error.
    These fixes update the adapter firmware to 11.2.211.39  for the following Feature Codes and CCINs: #EN15/EN16 with CCIN 2CE3,  #EN17/EN18 with CCIN 2CE4, #EN0H/EN0J with CCIN 2B93, #EN0M/EN0N with CCIN 2CC0,  #EN0K/EN0L with CCIN 2CC1, #EL56/EL38 with CCIN 2B93, and #EL57/EL3C with CCIN 2CC1.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for extraneous B400FF01 and B400FF02 SRCs logged when moving cables on SR-IOV adapters.  This is an infrequent error that can occur if the HMC performance monitor is running at the same time the cables are moved.  These SRCs can be ignored when accompanied by cable movement.
  • A problem was fixed for a rare IPL failure with SRCs BC8A090F and BC702214 logged caused by an overflow of VPD repair data for the processor cores.  A re-IPL of the system should recover from this problem.
  • A problem was fixed for an intermittent IPL failure with SRC B181E540 logged with fault signature " ex(n2p1c0) (L2FIR[13]) NCU Powerbus data timeout".  No FRU is called out.  The error may be ignored and the re-IPL is successful.  The error occurs very infrequently.  This is the second iteration of the fix that has been released.  Expedient routing of the Powerbus interrupts did not occur in all cases in the prior fix, so the timeout problem was still occurring.
  • A problem was fixed for a memory DIMM plugging rule violation that causes the IPL to terminate with an RC_GET_MEM_VPD_UNSUPPORTED_CONFIG error log that calls out the memory port but has no DIMM call outs, and no DIMM deconfigurations are done.  With the fix, the DIMMs that violate the plugging rules will be deconfigured and the IPL will complete.  Without the fix, the memory configuration should be restored to the prior working configuration to allow the IPL to be successful.
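The Live Partition Mobility circumvention described above (set desired and maximum memory to the same value before migrating a large partition) can be pre-checked.  A minimal sketch; the 16 TB threshold and default 256 MB LMB come from the fix description, but the way the threshold scales down for smaller LMB sizes is an illustrative assumption, not a documented formula:

```python
TB = 1024 ** 4
MB = 1024 ** 2

def lpm_memory_precheck(desired_bytes, maximum_bytes, lmb_bytes=256 * MB):
    """Return True when a partition matches the at-risk pattern
    described above: at or above the size threshold (16 TB with the
    default 256 MB LMB) with desired memory different from maximum.

    The linear scaling of the threshold with LMB size is an assumption
    for illustration only.
    """
    threshold = 16 * TB * (lmb_bytes / (256 * MB))
    return maximum_bytes >= threshold and desired_bytes != maximum_bytes
```

When the check returns True, the circumvention is to set desired and maximum memory equal before starting the migration.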

System firmware changes that affect certain systems

  • DEFERRED:  For systems using the Feature Code #EPIM 8-core processor, a problem was fixed for a slightly degraded UltraTurbo maximum frequency (approximately 3% less) compared to what is expected for this processor chip.  The fix requires new Workload Optimized Frequency (WOF) tables for the processor, so the system must be re-IPLed for the installed fix to be active.  The WOF UltraTurbo maximum frequency can only be achieved when turning cores off or operating below the 50% workload capacity of the system.
  • On systems with an IBM i partition, a problem was fixed that occurs after a Live Partition Mobility (LPM) of an IBM i partition that may cause issues including dispatching delays and the inability to do further LPM operations of that partition.  The frequency of this problem is rare.  A partition encountering this error can be recovered with a reboot of the partition.
  • On systems with an IBM i partition, a problem was fixed for a D-mode IPL failure when using a USB DVD drive in an IBM 7226 multimedia storage enclosure.  Error logs with SRC BA16010E, B2003110, and/or B200308C can occur.  As a circumvention, an external DVD drive can be used for the D-mode IPL.
VL910_147_144 / FW910.40

09/27/19

Impact:  Availability      Severity:  SPE

System firmware changes that affect all systems 

  • A problem was fixed for an IPMI core dump and SRC B181720D logged, causing the service processor to reset due to a low memory condition.  The memory loss is triggered by frequently using the ipmitool to read the network configuration.  The service processor recovers from this error but if three of these errors occur within a 15 minute time span, the service processor will go to a failed hung state with SRC B1817212 logged.  Should a service processor hang occur, OS workloads will continue to run but it will not be possible for the HMC to interact with the partitions.  This service processor hung state can be recovered by doing a re-IPL of the system with a scheduled outage.
  • A problem was fixed for the Advanced System Management Interface (ASMI) menu for "PCIe Hardware Topology/Reset link" showing the wrong value.  This value is always wrong without the fix.
  • A problem was fixed for a 2-port 100 GbE RoCE EN ConnectX-5 PCIe4 x16 adapter running hotter than intended and potentially impacting performance of the adapter.  This adapter has feature codes #EC66 and #EC67 with CCIN 2CF3.  With the fix, the fan speeds are adjusted to a higher speed to provide the proper cooling for the adapter.
  • A problem was fixed for false indication of a real time clock (RTC) battery failure with SRC B15A3305 logged.  This error happens infrequently.  If the error occurs, and another battery failure SRC is not logged within 24 hours, ignore the error as it was caused by a timing issue in the battery test.
  • A problem was fixed for shared processor pools where uncapped shared processor partitions placed in a pool may not be able to consume all available processor cycles.  The problem may occur when the sum of the allocated processing units for the pool member partitions equals the maximum processing units of the pool.
  • A problem was fixed for a system IPLing with an invalid time set on the service processor that causes partitions to be reset to the Epoch date of 01/01/1970.  With the fix, on the IPL, the hypervisor logs a B700120x when the service processor real time clock is found to be invalid and halts the IPL to allow the time and date to be corrected by the user.  The Advanced System Management Interface (ASMI) can be used to correct the time and date on the service processor.  On the next IPL, if the time and date have not been corrected, the hypervisor will log a SRC B7001224 (indicating the user was warned on the last IPL) but allow the partitions to start, with the time and date set to the Epoch value.
  • A problem was fixed for Novalink failing to activate partitions that have names with character lengths near the maximum allowed character length.  This problem can be circumvented by changing the partition name to have 32 characters or less.
  • A problem was fixed where a Linux or AIX partition type was incorrectly reported as unknown.  Symptoms include: IBM Cloud Management Console (CMC) not being able to determine the RPA partition type (Linux/AIX) for partitions that are not active; and HMC attempts to dynamically add CPU to Linux partitions may fail with a HSCL1528 error message stating that there are not enough Integrated Facility for Linux (IFL) cores for the operation.
  • A problem was fixed for the Advanced System Management Interface (ASMI) not being able to turn off the Live Partition Mobility (LPM) capability for the system.  As a circumvention, the IBM COD Project Office can be requested to send a 2-key set of VET codes that will turn off the LPM capability.
  • A problem was fixed for a possible system crash with SRC B7000103 if the HMC session is closed while the performance monitor is active.  As a circumvention for this problem, make sure the performance monitor is turned off before closing the HMC sessions.
  • A problem was fixed for an outage of I/O connected to a single PCIe Host Bridge (PHB) with a B7006970 SRC logged.  With the fix, the rare PHB fault will have an EEH event detected and recovered by firmware.
  • A problem was fixed for an initialization failure of certain SR-IOV adapter ports during its boot, causing a B400FF02 SRC to be logged.  This is a rare problem and it recovers automatically by the reboot of the adapter on the error.  This problem affects the SR-IOV adapters with the following feature codes:  EC2S, EC2U, and EC3M.
  • A problem was fixed for SR-IOV adapter Virtual Functions (VFs) that can fail to restore their configuration after a low-level EEH error, causing loss of function for the adapter.  This problem can occur if other than the default NIC VF configuration was selected when the VF was created.  The problem will occur all the time for VFs configured as RDMA over Converged Ethernet (RoCE) but will be much less frequent and intermittent for other non-default VF configurations.  And since RoCE is only supported for FW930 and later releases, the problem is only intermittent and less frequent at the earlier FW910 levels that do not have the fix.
  • A problem was fixed for a concurrent firmware update failure with SRC B7000AFF logged.  This is a rare problem triggered by a power mode change preceding a concurrent firmware update.  To recover from this problem, run the code update again without any power mode changes.
  • A problem was fixed for a concurrent firmware hang with SRC B1813450 logged.  This is a rare problem triggered by an error or power mode change that requires a Power Management (PM) Complex Reset.  To recover from this problem, re-IPL the system and it will be running at the target firmware update level.
  • A problem was fixed for an Operations Panel hang after using it to set LAN Console as the console type for several iterations.  After several iterations, the op panel may hang with "Function 41" displayed on the op panel.  A hot unplug and plug of the op panel can be used to recover it from the hang.
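The Novalink circumvention above (keep partition names at 32 characters or less) can be validated before activation.  A minimal sketch; the 32-character threshold is taken from the circumvention text, and the helper name is hypothetical:

```python
MAX_SAFE_NAME_LEN = 32  # length threshold from the circumvention above

def partition_name_ok(name):
    """Return True when a partition name is short enough to avoid the
    Novalink activation failure described above (hypothetical helper)."""
    return len(name) <= MAX_SAFE_NAME_LEN
```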

System firmware changes that affect certain systems

  • On systems with PCIe3 expansion drawers (feature code #EMX0), a problem was fixed for a concurrent exchange of a PCIe expansion drawer cable card that, although successful, leaves the fault LED turned on.
  • On systems with IBM i partitions, a problem was fixed for Live Partition Mobility (LPM) migrations that could have incorrect hardware resource information (related to VPD) in the target partition if a failover had occurred for the source partition during the migration.  This failover would have to occur during the Suspended state of the migration, which only lasts about a second, so this should be rare.  With the fix, at a minimum, the migration error will be detected to abort the migration so it can be restarted.  And at a later IBM i OS level, the fix will allow the migration to complete even though the failover has occurred during the Suspended state of the migration.
  • On systems running IBM i partitions, a problem was fixed for IBM i collection services that may produce incorrect instruction count results.
  • On systems with IBM i partitions, a problem was fixed that was allowing V7R1 to boot on or be migrated to POWER9 servers.  As documented in the System Software maps for IBM i (https://www-01.ibm.com/support/docview.wss?uid=ssm1platformibmi), V7R1 IBM i software is not supported on POWER9 servers.
  • On systems with 16GB huge-pages, a problem was fixed for certain SR-IOV adapters with all or nearly all memory assigned to them preventing a system IPL.  This affects the SR-IOV adapters with the following feature codes:  EC2S, EC2U, and EC3M.  The problem can be circumvented by powering off the system and turning off all the huge-page allocations.
  • On systems with IBM i partitions, a rare problem was fixed for an intermittent failure of a DLPAR remove of an adapter.  In most cases, a retry of the operation will be successful.
  • On systems with IBM i partitions, a problem was fixed for a LPAR restart error after a DLPAR of an active adapter was performed and the LPAR was shut down.  A reboot of the system will recover the LPAR so it will start.
VL910_144_144 / FW910.32

08/06/19

Impact:  Data                  Severity:  HIPER

System firmware changes that affect all systems 

  • DISRUPTIVE: HIPER/Pervasive:  A change was made to fix an intermittent processor anomaly that may result in issues such as operating system or hypervisor termination, application segmentation fault, hang, or undetected data corruption.  The only issues observed to date have been operating system or hypervisor terminations.
  • A problem was fixed for a drift in the system time (time lags and the clock runs slower than the true value of time) that occurs when the system is powered off to the service processor standby state.  To recover from this problem, the system time must be manually corrected using the Advanced System Management Interface (ASMI) before powering on the system.  The time lag increases in proportion to the duration of time that the system is powered off.

System firmware changes that affect certain systems

  • DEFERRED: HIPER/Pervasive:  A change was made to address a problem causing SAS bus errors and failed/degraded SAS paths with SRCs or SRNs such as these logged: 57D83400, 2D36-3109, 57D84040, 2D36-4040, 57D84060, 2D36-4060, 57D84061, 2D36-4061, 57D88130, 2D36-FFFE, xxxxFFFE, xxxx-FFFE, xxxx4061 and xxxx-4061.   These errors occur intermittently for hard drives and solid state drives installed in the two high-performance DASD backplanes (feature codes #EJ1M and #EJ1D).  This problem pertains to the 9009-41A, 9009-42A, and 9223-42H models only.   Without the change, the circumvention is to replace the affected DASD backplane.
VL910_135_127 / FW910.30

04/25/19

Impact:  Data                  Severity:  HIPER

New features and functions

  • An option was added to the SMS Remote IPL (RIPL) menus to enable or disable the UDP checksum calculation for any device type.  Previously, this checksum option was available only for logical LAN devices, but it has now been extended to all device types.  The default is for the UDP checksum calculation to be done, but if this calculation causes errors for the device, it can be turned off with the new option.
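For reference, the UDP checksum governed by this option is the standard one's-complement checksum computed over an IPv4 pseudo-header plus the UDP header and payload (RFC 768).  A minimal illustrative sketch in Python (not firmware code; names are for illustration only):

```python
import struct

def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    """One's-complement checksum over the IPv4 pseudo-header plus the
    UDP header and payload (RFC 768)."""
    # Pseudo-header: source IP, destination IP, zero, protocol 17 (UDP), UDP length.
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    data = pseudo + udp_segment
    if len(data) % 2:
        data += b"\x00"                           # pad to a 16-bit boundary
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry fold
    return (~total) & 0xFFFF
```

A segment whose checksum field already holds the correct value sums (after fold and complement) to zero, which is how a receiver validates it.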

System firmware changes that affect all systems

  • HIPER/Non-Pervasive:  A problem was fixed to address potential scenarios that could result in undetected data corruption.
  • DEFERRED:  A problem was fixed for the USB port having the wrong location code assigned.  The "P1-T4-L1 USB DVD R/RW or RAM Drive" location code should be "P1-T3-L1".   The USB DVD still works correctly but reported location codes such as in error logs will have the wrong location code shown.  A previous fix for this problem in FW910.20 did not have the hypervisor portion of the fix, so the error still occurred after the fix was applied.
    This problem only pertains to IBM Power System models S914(9009-41A), S924(9009-42A), and H924 for SAP HANA (9223-42H).
  • DEFERRED:PARTITION_DEFERRED:  A problem was fixed for repeated CPU DLPAR remove operations by Linux (Ubuntu, SUSE, or RHEL) OSes possibly resulting in a partition crash.  No specific SRCs or error logs are reported.   The problem can occur on any DLPAR CPU remove operation if running on Linux.  The occurrence is intermittent and rare.  The partition crash may result in one or more of the following console messages (in no particular order):
     1) Bad kernel stack pointer addr1 at addr2
     2) Oops: Bad kernel stack pointer
     3) ******* RTAS CALL BUFFER CORRUPTION *******
     4) ERROR: Token not supported
    This fix does not activate until there is a reboot of the partition.
  • A problem was fixed for an intermittent IPL failure with SRC B181E540 logged with fault signature " ex(n2p1c0) (L2FIR[13]) NCU Powerbus data timeout".  No FRU is called out.  The error may be ignored and the reIPL is successful.  The error occurs very infrequently.
  • A problem was fixed for an IPMI core dump and SRC B1818601 logged intermittently when an IPMI session is closed.  A flood of B1818A03 SRCs may be logged after the error occurs.  The IPMI server is not impacted and a call home is reported for the problem.  There is no service outage for the IPMI users because of this.
  • A problem was fixed for systems which were running at low processor frequencies and voltages because the High Frequency Trading (HFT) Policy had been selected (thereby disabling the On-Chip Controller (OCC)) but without IBM Support assisting to set the core nest frequencies to a maximum level.  Without the extra manual steps to set the core frequencies, the system defaults to Safe mode (lowest frequency and voltage) because it is running without the OCC.  With the fix, the High Frequency Policy menu is hidden in the Advanced System Management Interface (ASMI) so that only the IBM Support representative can set the HFT mode while also setting the core frequencies to the maximum value that can be sustained on that specific system.
  • A problem was fixed for a PCIe Hub checkstop with SRC B138E504 logged that fails to guard the errant processor chip.  With the fix, the problem hardware FRU is guarded so there is not a recurrence of the error on the next IPL.
  • A problem was fixed for an incorrect SRC of B1810000 being logged when a firmware update fails because of Entitlement Key expiration.  The error displayed on the HMC and in the OS is correct and meaningful.  With the fix, for this firmware update failure the correct SRC of B181309D is now logged.
  • A problem was fixed for deconfigured FRUs that showed as Unit Type of "Unknown" in the Advanced System Management Interface (ASMI).  The following FRU type names will be displayed if deconfigured (shown here is a description of the FRU type as well):
    DMI: Processor to Memory Buffer Interface
    MC: Memory Controller
    MFREFCLK: Multi Function Reference Clock
    MFREFCLKENDPT: Multi function reference clock end point
    MI: Processor to Memory Buffer Interface
    NPU:  Nvidia Processing Unit
    OBUS_BRICK: OBUS
    SYSREFCLKENDPT: System reference clock end point
    TPM: Trusted Platform Module
  • A problem was fixed for certain SR-IOV adapters where SRC B400FF01 errors are seen during configuration of the adapter into SR-IOV mode or updating adapter firmware.  This fix updates the adapter firmware to 11.2.211.37  for the following Feature Codes: EN15,  EN17, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for DDR4 2933 MHz and 3200 MHz DIMMs not defaulting to the 2666 MHz speed on a new DIMM plug, thus preventing the system from IPLing.
  • A problem was fixed for IPMI sessions in the service processor causing a flood of B181A803 informational error logs on registry read fails for IPv6 and IPv4 keywords.  These error logs do not represent a real problem and may be ignored.
  • A problem was fixed for the HMC in some instances reporting a VIOS partition as an AIX partition.  The VIOS partition can be used correctly even when it is misidentified.
  • A problem was fixed for shared processor partitions going unresponsive after changing the processor sharing mode of a dedicated processor partition from "allow when partition is active" to either "allow when partition is inactive" or "never".  This problem can be circumvented by avoiding disabling processor sharing when active on a dedicated processor partition.  To recover if the issue has been encountered, enable "processor sharing when active" on the dedicated partition.
  • A problem was fixed for intermittent PCIe correctable errors which would eventually threshold and cause SRC B7006A72 to be logged.   PCIe performance degradation or temporary loss of one or more PCIe IO slots could also occur resulting in SRCs B7006970 or B7006971.
  • A problem was fixed for I/O adapters not recovering from multiple, concurrent low-level EEH errors, resulting in a Permanent EEH error with SRC BA2B000D logged or SRC BA188002 and B7006A22 logged.  The affected adapters can be recovered by a re-IPL of the system.  With the fix, the adapters are able to reset and recover from the simultaneous error conditions.  The problem frequency is low because it requires a second error on a slot that is already frozen with an error and going into a reset.
  • A problem was fixed for hypervisor error logs issued during the IPL missing the firmware version.  This happens on every IPL for logs generated during the early part of the IPL.
  • A problem was fixed for a continuous logging of B7006A28 SRCs after the threshold limit of PCIe Advanced Error Reporting (AER) correctable errors.  The error log flooding can cause error buffer wrapping and other performance issues.
  • A problem was fixed for an error in deleting a partition with the virtualized Trusted Platform Module (vTPM) enabled and SRC B7000602 logged.  When this error occurs, the encryption process in the hypervisor may become unusable.  The problem can be recovered from with a re-IPL of the system.
  • A problem was fixed in Live Partition Mobility (LPM) of a partition to a shared processor pool, which results in the partition being unable to consume uncapped cycles on the target system.  To prevent the issue from occurring, partitions can be migrated to the default shared processor pool and then dynamically moved to the desired shared processor pool.  To recover from the issue, use DLPAR to add or remove a virtual processor to/from the affected partition, dynamically move the partition between shared processor pools, reboot the partition, or re-IPL the system.
  • A problem was fixed for informational (INF) errors for the PCIe Hub (PHB) at a threshold limit causing the I/O slots to go non-operational.   The system I/O can be recovered with a re-IPL.
  • A problem was fixed for errors in the PCIe Host Bridge (PHB) performance counters collected by the 24x7 performance monitor.
  • A problem was fixed for partitions becoming unresponsive or the HMC not being able to communicate with the system after a processor configuration change or a partition power on and off.
  • A new SRC of B7006A74 was added for PHB LEM 62 errors that had surpassed a threshold in the path of the #EMX0 expansion drawer.  This replaces the SRC B7006A72 to have a correct callout list.  Without the fix, when B7006A72 is logged against a PCIe slot in the CEC containing a cable card, the FRUs in the full #EMX0 expansion drawer path should be considered (use the B7006A8B FRU callout list as a reference).
  • A problem was fixed for eight or more simultaneous Live Partition Mobility (LPM) migrations to the same system possibly failing in validation with the HMC error message of "HSCL0273 A command that was targeted to the managed system has timed out".  The problem can be circumvented by doing the LPM migrations to the same system in smaller batches.
  • A problem was fixed for a boot failure using an N_PORT ID Virtualization (NPIV) LUN for an operating system that is installed on a disk of 2 TB or greater, and having a device driver for the disk that adheres to the non-zero allocation length requirement for the "READ CAPACITY 16" command.  The IBM partition firmware had always used an invalid zero allocation length for the return of data, and that had been accepted by previous device drivers.  Some of the newer device drivers now adhere to the specification and require a non-zero allocation length to allow the boot to proceed.
  • A problem was fixed for a possible boot failure from a ISO/IEC 13346 formatted image, also known as Universal Disk Format (UDF).
    UDF is a profile of the specification known as ISO/IEC 13346 and is an open vendor-neutral file system for computer data storage for a broad range of media such as DVDs and newer optical disc formats.  The failure is infrequent and depends on the image.  In rare cases, the boot code erroneously fails to find a file in the current directory.  If the boot fails on a specific image, the boot of that image will always fail without the fix.
  • A problem was fixed for an intermittent IPL failure with B181345A, B150BA22, BC131705,  BC8A1705, or BC81703 logged with a processor core called out.  This is a rare error and does not have a real hardware fault, so the processor core can be unguarded and used again on the next IPL.
  • A problem was fixed for informational logs flooding the error log if a "Get Sensor Reading" is not working.
  • A problem was fixed which caused network traffic failures for Virtual Functions (VFs) operating in non-promiscuous multicast mode.  In non-promiscuous mode, when a VF receives a frame, it will drop it unless the frame is addressed to the VF's MAC address, or is a broadcast or multicast addressed frame.  With the problem, the VF drops the frame even though it is multicast, thereby blocking the network traffic, which can result in ping failures and impact other network operations.  To recover from the issue, turn multicast promiscuous on.  This may cause some unwanted multicast traffic to flow to the partition.
  • A problem was fixed for a hypervisor task getting deadlocked if partitions are powered on at the same time that SR-IOV is being configured for an adapter.  With this problem, workloads will continue to run but it will not be possible to change the virtualization configuration or power partitions on and off.  This error can be recovered by doing a re-IPL of the system with a scheduled outage.
  • A problem was fixed for hypervisor tasks getting deadlocked that cause the hypervisor to be unresponsive to the HMC (this shows as an Incomplete state on the HMC) with SRC B200F011 logged.  This is a rare timing error.  With this problem, OS workloads will continue to run but it will not be possible for the HMC to interact with the partitions.  This error can be recovered by doing a re-IPL of the system with a scheduled outage.
  • A problem was fixed for broadcast bootp installs or boots that fail with a UDP checksum error.
  • A problem was fixed for failing to boot from an AIX mksysb backup on a USB RDX drive with SRCs logged of BA210012, AA06000D, and BA090010.  The boot error does not occur if a serial console is used to navigate the SMS menus.
  • A problem was fixed for error recovery from loss of VPD for FRUs caused by a stuck I2C bus.  When this problem occurs, there is a flood of B1561312 SRCs with fault signature "IVPD_REASON_IIC_FDAL_READ_FAIL errno 72".  This is a rare problem that occurs if the I2C slave gets stuck low for some reason.  To recover from this problem, A/C power cycle the system.  With the fix, the I2C bus is reset so the VPD reads for the FRU can be retried without user intervention until successful.
  • A security bypass vulnerability problem was fixed in the service processor secure socket layer (SSL) which could allow an attacker to make unauthorized reads on a rejected SSL connection. The Common Vulnerabilities and Exposures issue number is CVE-2017-3737.
  • A security problem was fixed in the service processor Network Security Services (NSS) services which, with a man-in-the-middle attack, could provide false completion or errant network transactions or exposure of sensitive data from intercepted SSL connections to ASMI, Redfish, or the service processor message server.  The Common Vulnerabilities and Exposures issue number is CVE-2018-12384.
  • A security problem was fixed in the service processor OpenSSL support that could cause secured sockets to hang, disrupting HMC communications for system management and partition operations.  The Common Vulnerabilities and Exposures issue number is CVE-2018-0732.
  • A security problem was fixed in the service processor TCP stack that would allow a Denial of Service (DOS) attack with TCP packets modified to trigger time and calculation expensive calls.  By sending specially modified packets within ongoing TCP sessions with the Management Consoles,  this could lead to a CPU saturation and possible reset and termination of the service processor.   The Common Vulnerabilities and Exposures issue number is CVE-2018-5390.
  • A security problem was fixed in the service processor TCP stack that would allow a Denial of Service (DOS) attack by allowing very large IP fragments to trigger time and calculation expensive calls in packet reassembly.  This could lead to a CPU saturation and possible reset and termination of the service processor.   The Common Vulnerabilities and Exposures issue number is CVE-2018-5391.  With the fix, changes were made to lower the IP fragment thresholds to invalidate the attack.
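The non-promiscuous filtering rule described in the SR-IOV VF multicast item above can be summarized as: accept a frame if it is addressed to the VF's own MAC, is broadcast, or has the multicast (I/G) bit set in the first octet of the destination address.  An illustrative sketch of the intended accept logic (hypothetical function name, not the adapter firmware):

```python
BROADCAST_MAC = bytes([0xFF] * 6)

def vf_accepts(frame_dst_mac: bytes, vf_mac: bytes) -> bool:
    """Intended accept rule for a VF in non-promiscuous mode: keep frames
    addressed to this VF, broadcast frames, and multicast frames.  The bug
    described above dropped the multicast case."""
    if frame_dst_mac == vf_mac or frame_dst_mac == BROADCAST_MAC:
        return True
    # The I/G (multicast) bit is the least-significant bit of the first octet.
    return bool(frame_dst_mac[0] & 0x01)
```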
VL910_127_127 / FW910.21

03/18/19

Impact: Data            Severity:  HIPER

System firmware changes that affect all systems

  • HIPER/Pervasive: DISRUPTIVE:  A problem was fixed where, under certain conditions, a Power Management Reset (PM Reset) event may result in undetected data corruption.  PM Resets occur under various scenarios such as power management mode changes between Dynamic Performance and Maximum Performance, Concurrent FW updates, power management controller recovery procedures, or system boot.
  • A problem was fixed for a system terminating if there was even one predictive or recoverable SRC.  For this problem, all hardware SRCs logged are treated as terminating SRCs.  For this behavior to occur, the initial service processor boot from the AC power off state failed to complete cleanly, instead triggering an internal reset (a rare error),  leaving some parts of the service processor not initialized.  This problem can be recovered by doing an AC power cycle, or concurrently on an active system with the assistance of IBM support.
VL910_122_089 / FW910.20

12/12/18

Impact:  Data                  Severity:  HIPER

New features and functions

  • Support was enabled for eRepair spare lane deployment for fabric and memory buses.

System firmware changes that affect all systems

  • HIPER/Non-Pervasive:DEFERRED:   A problem was fixed for a potential problem with I/O that could result in undetected data corruption.
  • DEFERRED:  A problem was fixed for DASD VRM reduced stability margins leading to a possible system shutdown due to temperature-related component aging over a long period of time.  The DASD VRM is not updated with the fix until after the system IPLs from a powered-off state.  It is recommended that this fix be activated as soon as possible; fix activation should not be delayed for more than three months.
  • DEFERRED:  A problem was fixed for PCIe and SAS adapters in slots attached to a PLX (PCIe switch) failing to initialize and not being found by the Operating System.  The problem should not occur on the first IPL after an AC power cycle, but subsequent IPLs may experience the problem.
  • DEFERRED:  A problem was fixed for the PCIe3 I/O expansion drawer (#EMX0) links to improve stability.  Intermittent training failures on the links occurred during the IPL with SRC B7006A8B logged.  With the fix, the link settings were changed to lower the peak link signal amplification to bring the signal level into the middle of the operating range, thus improving the margin and reducing link training failures.  The system must be re-IPLed for the fix to activate.  Without the fix, the system can be powered off and then re-IPLed to restore the PCIe links.
  • DEFERRED:  A problem was fixed for concurrent maintenance operations for PCIe expansion drawer cable cards and PCI adapters that could cause loss of system hardware information in the hypervisor with these side effects:  1) partition secure boots could fail with SRC BA540100 logged; 2) Live Partition Mobility (LPM) migrations could be blocked; 3) SR-IOV adapters could be blocked from going into shared mode; 4) Power Management services could be lost; and 5) warm re-IPLs of the system can fail.  The system can be recovered by powering off and then IPLing again.
  • DEFERRED:  A problem was fixed for predictive error logs occurring on the IPL following a DIMM error recovery.  These logs, related to failed memory scrubbing, have the following "Signature Description":  "mba(n0p15c1) () ERROR: command complete analysis failed".  These error logs do not indicate a hardware problem and may be ignored.
  • A problem was fixed for link speed for PCIe Generation 4 adapters showing as "unknown"  in the Advanced System Management Interface (ASMI) PCIe Hardware Topology menu.
  • A problem was fixed for differential memory interface (DMI) lane sparing to prevent shutting down a good lane on the TX side of the bus when a lane has been spared on the RX side of the bus.  If the XBUS or DMI bus runs out of spare lanes, it can checkstop the system, so the fix helps use these resources more efficiently.
  • A problem was fixed for IPL failures with SRC BC50090F when replacing Xbus FRUs.  The problem occurs if VPD has a stale bad lane record and that record does not exist on both ends of the bus.
  • A problem was fixed for a firmware update concurrent remove and activate that fails in the hypervisor during the activate with SRC B7000AFF.  To recover the system, do a re-IPL and it will be at the correct firmware level that is expected for the remove operation.
  • A problem was fixed for a flood of BC130311 SRCs that could occur when changing Energy Scale Power settings, if the Power Management is in a reset loop because of errors.
  • A problem was fixed for SR-IOV adapter workloads being suspended with SRC B400FF01 logged while an internal reset of SR-IOV virtual function in the hypervisor occurs.  This problem is infrequent and caused by heavy workloads for the adapter or vNIC failovers.  The workloads resume after the virtual function reset without user intervention.
  • A problem was fixed for SR-IOV VFs, where a VF configured with a PVID priority may be presented to the OS with an incorrect priority value.
  • A problem was fixed for the creation of a vNIC adapter that may show the MAC address twice and cause confusion.  For the AIX OS, the duplicate MAC address shows on the entstat output.  No recovery is needed for this error except to ignore the extra MAC address in the ethernet adapter status.
  • A problem was fixed to reduce the time to reach a "failed" status on an SR-IOV adapter for certain persistent errors.  Without the fix, the adapter spends an extended period of time in the "not ready" state before eventually reaching the "failed" state.  With the fix, the adapter is able to go to the "failed" state in less than 30 seconds for the persistent fault.
  • A problem was fixed for a SR-IOV Virtual Function (VF) configured with a PVID that fails to function correctly after a VF reset.  It will allow the receiving of untagged frames but not be able to transmit the untagged frames.
  • A problem was fixed for a SMS ping failure for a SR-IOV adapter Virtual Function (VF) with a non-zero Port VLAN ID (PVID).  This failure may occur after the partition with the adapter has been booted to AIX, and then rebooted back to SMS.  Without the fix, residue information from the AIX boot is retained for the VF that should have been cleared.
  • A problem was fixed for SRCs B400FF01 and B200F011 experienced for false SR-IOV adapter errors during Live Partition Mobility (LPM) migrations of a logical partition with vNIC clients.  The SR-IOV adapter does recover from the errors but there is delay in the adapter communications while the adapter recovers.  These errors can be ignored when evaluating the outcome of a LPM migration.
  • A problem was fixed for partition SMS menus to display certain network adapters that were unviewable and not usable as boot and install devices after a microcode update.  The problem network adapter is still present and usable at the OS.  The adapters with this problem have the following feature codes:  EN0A, EN0B, EN0H, EN0J, EN0K, EN0L, EN15,  EL5B, EL38, EL3C, EL56, and EL57.
  • A problem was fixed for a Logical LAN (l-lan) device failing to boot when there is a UDP packet checksum error.  With the fix, there is a new option when configuring a l-lan port in SMS to enable or disable the UDP checksum validation.  If the adapter is already providing the checksum validation, then the l-lan port needs to have its validation disabled.
  • A problem was fixed for Hostboot error log IDs (EID) getting reused from one IPL to the next, resulting in error logs getting suppressed (missing)  for new problems on the subsequent IPLs if they have a re-used EID that was already present in the service processor error logs.
  • A problem was fixed for error log truncation with SRC B1818A12 logged for the error.  This problem occurs only rarely when creating a combined error log entry that exceeds the error log entry maximum size.  With the fix, these types of combinations are not done if too large, and two error logs are written instead.
  • A problem was fixed for coherent accelerator processor proxy (CAPP) unit errors being called out as CEC hardware Subsystem instead of PROCESSOR_UNIT.
  • A problem was fixed for a Self Boot Engine (SBE) recoverable error at runtime causing the system to go into Safe Mode.
  • A problem was fixed for an IPL that ends with the HMC in the "Incomplete" state with SRCs B182951C and A7001151 logged.  Partitions may start and can continue to run without the HMC services available.  In order to recover the HMC session,  a re-IPL of the system is needed (however, partition workloads could continue running uninterrupted until the system is intentionally re-IPLed at a scheduled time).  The frequency of this problem is very low as it rarely occurs.
  • A problem was fixed for a system failure with SRC B700F103 that can occur if a shared-mode SR-IOV adapter is moved from a high-performance slot to a lower performance slot.   This problem can be avoided by disabling shared mode on the SR-IOV adapter; moving the adapter;  and then re-enabling shared mode.
  • A problem was fixed for a rare Live Partition Mobility migration hang with the partition left in VPM (Virtual Page Mode) which causes performance concerns.  This error is triggered by a migration failover operation occurring during the migration state of "Suspended" when there are insufficient VASI buffers available to clear all partition state data waiting to be sent to the migration target.  Migration failovers are rare and the migration state of "Suspended" is a migration state lasting only a few seconds for most partitions, so this problem should not be frequent.  On the HMC, there will be an inability to complete either a migration stop or a recovery operation.  The HMC will show the partition as migrating and any attempt to change that will fail.  The system must be re-IPLed to recover from the problem.
  • A problem was fixed for Linux or AIX partitions crashing during a firmware assisted dump or when using Linux kexec to restart with a new kernel.  This problem was more frequent for the Linux OS with kdump failing with "Kernel panic - not syncing: Attempted to kill init" in some cases.
  • A problem was fixed for a SR-IOV adapter vNIC configuration error that did not provide a proper SRC to help resolve the issue of the boot device not pinging in SMS due to maximum transmission unit (MTU) size mismatch in the configuration.  The use of a vNIC backing device does not allow configuring VFs for jumbo frames when the Partition Firmware configuration for the adapter (as specified on the HMC) does not support jumbo frames.  When this happens, the vNIC adapter will fail to ping in SMS and thus cannot be used as a boot device.  With the fix,  the vNIC driver configuration code is now checking the vNIC login (open) return code so it can issue an SRC when the open fails for a MTU issue (such as jumbo frame mismatch) or for some other reason.  A jumbo frame is an Ethernet frame with a payload greater than the standard MTU of 1,500 bytes and can be as large as 9,000 bytes.
  • A problem was fixed for the USB port having the wrong location code assigned.  The "P1-T4-L1 USB DVD R/RW or RAM Drive" location code should be "P1-T3-L1".   The USB DVD still works correctly but reported location codes such as in error logs will have the wrong location code shown.
    This problem only pertains to IBM Power System models S914(9009-41A), S924(9009-42A), and H924 for SAP HANA (9223-42H).
  • A problem was fixed for SR-IOV adapter dumps hanging with low-level EEH events causing failures on VFs of other non-target SR-IOV adapters.
  • A problem was fixed for preventing loss of function on an SR-IOV adapter with an 8MB adapter firmware image if it is placed into SR-IOV shared mode.  The 8MB image is not supported at the FW910.20 firmware level.  With the fix, the adapter with the 8MB image is rejected with an error without an attempt to load the older 4MB image on the adapter which could damage it.  This problem affects the following SR-IOV adapters:  #EC2R/#EC2S with CCIN 58FA;  #EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.
  • A problem was fixed for adapters in slots attached to a PLX (PCIe switch) failing with SRCs B7006970 and BA188002  when a second and subsequent errors on the PLX failed to initiate PLX recovery.  For this infrequent problem to occur, it requires a second error on the PLX after recovery from the first error.
  • A problem was fixed for an intermittent IPL failure with SRCs B150BA40 and B181BA24 logged.  The system can be recovered by IPLing again.  The failure is caused by a memory buffer misalignment, so it represents a transient fault that should occur only rarely.
  • A problem was fixed for system termination for a re-IPL with power on with SRC B181E540 logged.  The system can be recovered by powering off and then IPLing.  This problem occurs infrequently and can be avoided by powering off the system between IPLs.
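The vNIC MTU mismatch described in the SR-IOV vNIC configuration item above (jumbo frames requested without backing support) amounts to a simple validation at vNIC login time.  A hypothetical sketch, assuming the standard 1,500-byte MTU and 9,000-byte jumbo maximum stated in that item (names are illustrative, not hypervisor code):

```python
STANDARD_MTU = 1500   # standard Ethernet payload MTU, per the item above
JUMBO_MTU_MAX = 9000  # largest jumbo-frame payload, per the item above

def validate_vnic_mtu(requested_mtu: int, backing_vf_supports_jumbo: bool) -> None:
    """Reject a vNIC login whose MTU the backing VF configuration cannot honor.
    Raises ValueError instead of failing silently, mirroring the fix that now
    surfaces an SRC when the vNIC open fails for an MTU mismatch."""
    if requested_mtu > JUMBO_MTU_MAX:
        raise ValueError("MTU %d exceeds the jumbo maximum of %d"
                         % (requested_mtu, JUMBO_MTU_MAX))
    if requested_mtu > STANDARD_MTU and not backing_vf_supports_jumbo:
        raise ValueError("jumbo frames requested but not enabled in the "
                         "backing VF configuration")
```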

System firmware changes that affect certain systems

  • On a system with a Cloud Management Console and a HMC Cloud Connector, a problem was fixed for memory leaks in the Redfish server causing Out of Memory (OOM) resets of the service processor.
  • On a system with an IBM i partition, a problem was fixed for a DLPAR force-remove of a physical IO adapter from an IBM i partition and a simultaneous power off of the partition causing the partition to hang during the power off.  To recover the partition from the error, the system must be re-IPLed.  This problem is rare because there is only a 2-second timing window for the DLPAR and power off to interfere with each other.
  • For systems with a shared memory partition,  a problem was fixed for Live Partition Mobility (LPM) migration hang after a Mover Service Partition (MSP) failover in the early part of the migration.  To recover from the hang, a migration stop command must be given on the HMC.  Then the migration can be retried.
  • For a shared memory partition,  a problem was fixed for Live Partition Mobility (LPM) migration failure to an indeterminate state.  This can occur if the Mover Service Partition (MSP)  has a failover that occurs when the migrating partition is in the state of "Suspended."  To recover from this problem, the partition must be shutdown and restarted.
  • On a system with an AMS partition, a problem was fixed for a Live Partition Mobility (LPM) migration failure when migrating from P9 to a pre-FW860 P8 or P7 system.  This failure can occur if the P9 partition is in dedicated memory mode, and the Physical Page Table (PPT) ratio is explicitly set on the HMC (rather than keeping the default value) and the partition is then transitioned to Active Memory Sharing (AMS) mode prior to the migration to the older system.  This problem can be avoided by using dedicated memory in the partition being migrated back to the older system.
  • On a system with an active IBM i partition, a problem was fixed for a SPCN firmware download to the PCIe3 I/O expansion drawer (feature #EMX0) Chassis Management Card (CMC) that could possibly get stuck in a pending state.  This failure is very unlikely as it would require a concurrent replacement of the CMC card that is loaded with a SPCN level that is older than 2015 (01MEX151012a).  The failure with the SPCN download can be corrected by a re-IPL of the system.
VL910_115_089 / FW910.11

10/17/18

Impact:  Availability      Severity:  SPE

System firmware changes that affect all systems 

  • DEFERRED:   A problem was fixed for an incorrect power on sequence for the PCI PERST signal for I/O adapters.  This signal is used to indicate to the I/O adapters when the reference clock for the device has become valid and, with the problem, that valid indication may arrive before the clock is ready.  In rare cases, this could intermittently result in unexpected behavior from the I/O devices such as adapter PCIe links not training or the adapter not being available for the Operating System after an IPL.  This problem can be recovered from by a re-IPL of the system.
  • A problem was fixed for system dumps failing with a kernel panic on the service processor because of an out of memory condition.  Without the fix, the system dump may be tried again after the reset of the service processor as the reset would have cleaned up the memory usage.
  • A problem was fixed for recovered (correctable) errors during the IPL being logged as Predictive Errors.  There are no customer actions required for the recovered errors.  With the fix, the corrected errors are marked as "RECOVERED" and logged as Informational.
  • A problem was fixed for an Emergency Power Off Warning (EPOW) IPL failure that would occur on a loss of a power supply or a missing power supply.  With the fix, the EPOW error will not occur on the IPL as long as there is one functional power supply available for the system.

System firmware changes that affect certain systems

  • On systems which do not have an HMC attached,  a problem was fixed for a firmware update initiated from the Operating System (OS) from FW910.00,  FW910.01 or FW910.10 to FW910.11 that caused a system crash one hour after the code update completed.  This does not fix the case of the OS initiated firmware update back to earlier FW910.XX levels from FW910.11, which can still result in a crash of the system.  Do not initiate a code update from FW910.11  to a lesser FW910 level via the OS.  Use only HMC or USB methods of code update for this case.  If an HMC or USB code update is not an option,  please contact IBM support.
  • On a system with a partition that has had processors dynamically removed, a problem was fixed for a partition dump IPL that may experience unexpected behavior including system crashes.  This problem may be circumvented by stopping and re-starting the partition with the removed processors prior to requesting a partition dump.
VL910_107_089 / FW910.10

09/05/18

Impact:  Availability      Severity:  SPE

System firmware changes that may require customer actions prior to the firmware update

  • DEFERRED: On a system with a partition with dedicated processors that are set to allow processor sharing with "Allow when partition is active" or "Allow always", a problem was fixed for a concurrent firmware update from FW910.01 that may cause the system to hang.  This fix is deferred and is not active until after the next IPL of the system, so precautions must be taken to protect the system.  Perform the following steps to determine if your system has a partition with dedicated processors that are set to share.  If these partitions exist, change them to not share processors while active; or shut down the affected partitions; or do a disruptive update to put on this service pack.
    1) From the HMC command line, Run:  lssyscfg -r sys -F name
    2) For each system you intend to update firmware, issue the following HMC command:
    lshwres -m <System Name> --level lpar -r proc -F lpar_name,curr_sharing_mode,pend_sharing_mode
    replacing <System Name> with the name as displayed by the first command.
    3) Scan the output for "share_idle_procs_active" or "share_idle_procs_always".  This identifies the affected partitions.
    4) You need to take one of the three options below to install this firmware level:
    a) If affected partitions are found, change the lpar to "never allow" or "allow when partition is inactive" in the lpar settings, and set the value back to its original value after the code update.  These changes are concurrent when performed on the lpar settings and not in the profile.
    b) Or,  shut down partitions identified in step 3.  Proceed with concurrent code update.  Then restart the partitions.
    c) Or,  apply the firmware update disruptively (power off system and install) to prevent a possible system hang.
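    Step 3 above can be scripted once the lshwres output has been captured.  The following is a minimal sketch only, assuming the comma-separated output format shown in step 2; the partition names and sharing-mode values in sample_output are illustrative stand-ins for real HMC output:

```shell
# Hedged sketch: filter lshwres output for partitions whose dedicated
# processors are set to share while active or always.
# sample_output stands in for the output of the real HMC command:
#   lshwres -m <System Name> --level lpar -r proc -F lpar_name,curr_sharing_mode,pend_sharing_mode
sample_output='lpar1,share_idle_procs_active,share_idle_procs_active
lpar2,keep_idle_procs,keep_idle_procs
lpar3,share_idle_procs_always,share_idle_procs_always'

# Keep lines with either affected sharing mode; print the partition name (field 1).
affected=$(printf '%s\n' "$sample_output" \
  | grep -E 'share_idle_procs_(active|always)' \
  | cut -d, -f1)
printf 'Affected partitions:\n%s\n' "$affected"
```

    Partitions listed by the sketch would then be handled with option a, b, or c above before the concurrent update.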

New features and functions

  • A change was made to improve IPL performance for a system with a new DIMM installed or for a system doing its first IPL.  The performance is gained by decreasing the amount of time used in memory diagnostics, reducing IPL time by as much as 15 minutes, depending on the amount of memory installed.
  • Support was added for 24x7 data collection from the On-Chip Controller sensors.
  • Support was added to correctly isolate TOD faults with appropriate callouts and failover to the backup topology, if needed.  And to do a reconfiguration of a backup topology to maintain TOD redundancy.
  • Support was disabled for erepair spare lane deployment for fabric and memory buses.  By not using the FRU spare hardware for an erepair, the affected FRUs may have to be replaced sooner.  Prior to this change, the spare lane deployment caused extra error messages during runtime diagnostics.  When the problems with spare lane deployment are corrected, this erepair feature will be enabled again in a future service pack.

System firmware changes that affect all systems 

  • A security problem was fixed in the DHCP client on the service processor for an out-of-bound memory access flaw that could be used by a malicious DHCP server to crash the DHCP client process.  The Common Vulnerabilities and Exposures issue number is CVE-2018-5732.
  • DEFERRED: A problem was fixed for PCIe link stability errors during the IPL for the PCIe3 I/O Expansion Drawer (Feature code #EMX0) with Active Optical Cables (AOCs).  One or more of the following SRCs may be logged at completion of IPL: B7006A72, B7006A8B, B7006971, and 10007900.  The fix improves PCIe link stability for this feature.
  • DEFERRED: A problem was fixed for an erroneous SRC 11007610 being logged when hot-swapping CEC fans.  This SRC may be logged if there is more than a two-minute delay between removing the old fan and installing the new fan.  The error log may be ignored.
  • DEFERRED: A problem was fixed for a hot plug of a new 1400W power supply that fails to turn on.  The problem is intermittent, occurring more frequently for the cases where the hot plug insertion action was too slow and maybe at a slight angle (insertion not perfectly straight).  Without the fix,  after a hot plug has been attempted, ensure the power supply LEDs are on.  If the LEDs are not on, retry the plug of the power supply using a faster motion while keeping the angle of insertion straight.
  • DEFERRED: A problem was fixed for a host reset of the Self Boot Engine (SBE).  Without the fix, the reset of the SBE will hang during error recovery and that will force the system into Safe Mode.  Also,  a post dump IPL of the system after a system Terminate Immediate will not work with a hung SBE, so a re-IPL of the system will be needed to recover it.
  • A problem was fixed for an enclosure LED not being lit when there is a fault on a FRU internal to the enclosure that does not have an LED of its own.  With the fix, the enclosure LED is lit if any FRUs within the enclosure have a fault.
  • A problem was fixed for DIMMs that have VPP shorted to ground not being called out in the SRC 11002610 logged for the power fault.  The frequency of this problem should be rare.
  • A problem was fixed for the Advanced System Management Interface (ASMI) option for resetting the system to factory configuration for not returning the Speculative Execution setting to the default value.  The reset to factory configuration does not change the current value for Speculative Execution.  To restore the default, ASMI must be used manually to set the value.  This problem only pertains to the IBM Power System H922 for SAP HANA (9223-22H) and the IBM Power System H924 for SAP HANA (9223-42H).
  • A problem was fixed for the system Emergency Power Off Warning (EPOW) to be issued when only three of the four power supplies are operational (instead of waiting for all four power supplies to go down).
  • A problem was fixed for a failing VPP voltage regulator possibly damaging DIMMs with too high of a voltage level.  With the fix, the voltage to the DIMMs is shut down if there is a problem with the voltage regulator, to protect the DIMMs.
  • A problem was fixed for an unplanned power down of the system with SRC UE 11002600 logged when an unsupported device was plugged into the service processor USB 2.0 ports on either of the slots P1-C1-T1 or P1-C1-T2.  This happened when a USB 3.0 DVD drive was plugged into the USB 2.0 slot and caused an overcurrent condition.  The USB 3.0 device was incorrectly not downward compatible with the USB 2.0 slot.  With the fix, such incompatible devices will cause an informational log but will not cause a power off of the system.
  • A problem was fixed for the On-Chip Controller not being able to sense the current draw for the 12V PCIE adapters that are plugged into channel 0 (CH0) of the APSS.  CH0 was not enabled, meaning anything plugged into those connectors would not be included in the total server power calculation, which could impact power capping.  The system could run at higher power than expected without CH0 being monitored.
  • A problem was fixed for the TPM card LED so that it is activated correctly.
  • A problem was fixed for VRMs drawing current over the specification.  This occurred whenever heavy work loads went above 372 amps with WOF enabled.  At 372 amps, a rollover to value "0" for the current erroneously occurred and this allowed the frequency of the processors in the system to exceed the normally expected values.
  • A problem was fixed for Dynamic Memory Deallocation (DMD) failing for memory configurations of 3 or 6 Memory Controller (MC) channels per group.  An error message of "Invalid MCS per group value" is logged with SRC BC23E504 for the problem.  If DMD was working correctly for the installed memory but then began failing at a later time, it may have been triggered by a guard of a DIMM which resulted in a memory configuration that is susceptible to the problem with DMD.
  • A problem was fixed for a system with CPU part number 2CY058 and CCIN 5C25 to achieve a slightly more optimum frequency for one specific EnergyScale Mode, Dynamic Performance Mode.
  • A problem was fixed for a missing memory throttle initialization that in a rare case could lead to an emergency shutdown of the system.  The missing initialization could cause the DIMMs to oversubscribe to the power supplies in the rare failure mode where the On-Chip Controller (OCC) fails to start and the Safe Mode default memory throttle values are too high to stop the memory from overusing the power from the power supplies.  This could cause a power fault and an emergency shutdown of the system.
  • A problem was fixed for a memory translation error that causes a request for a page of memory to be de-allocated to be ignored in Dynamic Memory Deallocation (DMD).  This misses the opportunity to proactively relocate a partition to good memory and running on bad memory may eventually cause a crash of the partition.
  • A problem was fixed for an extraneous error log with SRC BC50050A that has no deconfigured FRU.  There was a recovered error for a single bit in memory that requires no user action.  The BC50050A error log should be ignored.
  • A problem was fixed for Hostboot error logs reusing EID numbers for each IPL.  This may cause a predictive error log to go missing for a bad FRU that is guarded during the IPL.  If this happens, the FRU should be replaced based on the presence of the guard record.
  • A problem was fixed for a rare non-correctable memory error in the service processor Self Boot Engine (SBE) causing a Terminate Immediate (TI) for the system instead of recovering from the error.  With the fix, the SBE error handling was changed so that all SBE errors are recoverable and do not affect the system work loads.  This SBE memory provides support for On-Chip Controller (OCC) tasks to the service processor SBE but it is not related to the system memory used for the hypervisor and host partition tasks.
  • A problem was fixed for extraneous Predictive Error logs of SRC B181DA96 and SRC BC8A1A39 being logged if the Self Boot Engine (SBE) halts and restarts when the system host OS is running.  These error logs can be ignored as the SBE recovers without user intervention.
  • A problem was fixed for error logging for rare Low Pin Count (LPC) link errors between the Host processor and the Self Boot Engine (SBE).  The LPC was changed to time out instead of hanging on an LPC error, providing helpful debug data for the LPC error instead of a system checkstop and Hostboot crash.
  • A problem was fixed for the reset of the Self Boot Engine (SBE)  at run time to resolve SBE errors without impacting the hypervisor or the running partitions.
  • A problem was fixed for the ODL link in Open CAPI in the case where ODL Link 1 (ODL1) is used and ODL Link 0 (ODL0) is not used.  As a circumvention, the errors are resolved if ODL 0 is used instead, or in conjunction with the ODL1.
  • A problem was fixed for the wrong DIMM being called out on an over-temperature error with a SRC B1xx2A30 error log.
  • A problem was fixed for adding a non-cable PCIe card into a slot that was previously occupied by a PCIe3 Optical or Copper Cable Adapter for the PCIe3 Expansion Drawer.  The new PCIe card could fail with an I2C error with SRC BC100706 logged.
  • A problem was fixed for call home data for On-Chip Controller (OCC) error log sensor data being off in alignment by one sensor.  By visually shifting the data, the valid data values can still be determined from the call home logs.
  • A problem was fixed for slow hardware dumps that include failed processor cores that have no clock signal.  The dump process was waiting for core responses and had to wait for a time-out for each chip operation, causing dumps to take several hours.  With the fix, the core is checked for a proper clock, and if one does not exist, the chip operations to that core are skipped to speed up the hardware dump process significantly.
  • A problem was fixed for ipmitool not being able to set the system power limit when the power limit is not activated with the standard option.  With the fix, the ipmitool user can activate the power limit with "dcmi power activate" and then set the power limit with "dcmi power set_limit limit xxxx", where "xxxx" is the new power limit in Watts.
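    The two-step sequence can be sketched as follows.  This is a hedged illustration only: the interface type, host name, credentials, and the 1000 W value are placeholder assumptions, not values from this readme:

```shell
# Hedged sketch: compose the DCMI power-limit command sequence described above.
# The lanplus interface, host, user, and password below are placeholders.
IPMI="ipmitool -I lanplus -H fsp.example.com -U admin -P secret"

# Power limiting must be activated before a limit can be set.
ACTIVATE_CMD="$IPMI dcmi power activate"

# Set the platform power limit in watts; 1000 is an illustrative value.
SET_LIMIT_CMD="$IPMI dcmi power set_limit limit 1000"

echo "$ACTIVATE_CMD"
echo "$SET_LIMIT_CMD"
```

    Running "dcmi power set_limit" without first activating power limiting was the failing case this fix addresses; the activate step should always come first.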
  • A problem was fixed for the OBUS to make it OpenCAPI capable by increasing its frequency from 1563 MHz to 1611 MHz.
  • A problem was fixed for a Workload Optimized Frequency (WOF) reset limit failure not providing an Unrecoverable Error (UE) and a callout for the problem processor.  When the WOF reset limit is reached and failed, WOF is disabled and the system is not running at optimized frequencies.
  • A problem was fixed for the callout of SRC BA188002 so it does not display three extra trailing garbage characters in the location code for the FRU.  Without the fix, the string is correct up to the line-ending white space, so the three extra characters after that should be ignored.  This problem is intermittent and does not occur for all BA188002 error logs.
  • A problem was fixed for the callout of scan ring failures with SRC BC8A285E and SRC BC8A2857 logged but with no callout for the bad FRU.
  • A problem was fixed for the On-Chip Controller (OCC) possibly timing out and going to Safe Mode when a system is changed from the default maximum performance mode (Workload Optimized Frequency (WOF) enabled) to nominal mode (WOF disabled) and then back to maximum performance (WOF enabled again).  Normal performance can be recovered with a re-IPL of the system.
  • A problem was fixed for the periodic guard reminder causing a reset/reload of the service processor when it found a symbolic FRU with no CCIN value in the list of guarded FRUs for the system.   Each time the periodic guard reminder runs (every 30 days by default), this problem can cause recoverable errors on the service processor, but with no interruption to the workloads on the running partitions.
  • A problem was fixed for a wrong SubSystem being logged in the SRC B7009xxxx for Secure Boot Errors.  "I/O Subsystem" is displayed instead of the correct SubSystem value of "System Hypervisor Firmware".
  • A problem was fixed for the lost recovery of a failed Self Boot Engine (SBE).  This may happen if the SBE recovery occurs during a reset of the service processor.  Not only is the recovery lost, but the error log data for the SBE failure may also not be written to the error log.  If the SBE is failed and not recovered, this can cause the post-dump IPL after a system Terminate Immediate (TI) error to not be able to complete.  To recover, power off the system and IPL again.
  • A problem was fixed for a missing SRC at the time runtime diagnostics are lost and the Hostboot runtime services (HBRT) are put into the failed state.
    A B400F104 SRC is logged each time the HBRT hypervisor adjunct crashes.  On the fourth crash in one hour, HBRT is failed with no further retries, but no SRC is logged.  Although a unique SRC is not logged to indicate loss of runtime diagnostic capability, the B400F104 SRC does include the HBRT adjunct partition ID for Service to identify the adjunct.
  • A problem was fixed for a Novalink enabled partition not being able to release master from the HMC that results in error HSCLB95B.  To resolve the issue, run a rebuild managed server operation on the HMC and then retry the release.  This occurs when attempting to release master from HMC after the first boot up of a Novalink enabled partition if Master Mode was enforced prior to the boot.
  • A problem was fixed for an UE memory error causing an entire LMB of memory to deallocate and guard instead of just one page of memory.
  • A problem was fixed for all variants (this was partially fixed in an earlier release) for the SR-IOV firmware adapter updates using the HMC GUI or CLI to only reboot one SR-IOV adapter at a time.  If multiple adapters are updated at the same time, the HMC error message HSCF0241E may occur:  "HSCF0241E Could not read firmware information from SR-IOV device ...".  This fix prevents the system network from being disrupted by the SR-IOV adapter updates when redundant configurations are being used for the network.  The problem can be circumvented by using the HMC GUI to update the SR-IOV firmware one adapter at a time using the following steps:  https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
  • A problem was fixed for a rare hypervisor hang caused by a dispatching deadlock for two threads of a process.  The system hangs with SRC B17BE434 and SRC B182951C logged.   This failure requires high interrupt activity on a program thread that is not currently dispatched.
  • A problem was fixed for a Virtual Network Interface Controller (vNIC) client adapter to prevent a failover when disabling the adapter from the HMC.  A failover to a new backing device could cause the client adapter to erroneously appear to be active again when it is actually disabled.  This causes confusion and failures on the OS for the device driver.  This problem can only occur when there is more than a single backing device for the vNIC adapter and if commands are issued from the HMC to disable and then enable the adapter.
  • A possible performance problem was fixed for workloads that have a large memory footprint.
  • A problem was fixed for error recovery in the timebase facility to prevent an error in the system time.  This is an infrequent secondary error when the timebase facility has failed and needs recovery.
  • A problem was fixed for the HMC GUI and CLI interfaces incorrectly showing SR-IOV updates as being available for certain SR-IOV adapters when no updates are available.  This affects the following PCIe adapters:  #EC2R/#EC2S with CCIN 58FA;  #EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.  The "Update Available" indication in the HMC can be ignored if updates have already been applied.
  • A problem was fixed for the recovery of certain SR-IOV adapters that fail with SRC B400FF05.  This is triggered by infrequent EEH errors in the adapter.  In the recovery process,  the Virtual Function (VF)  for the adapter is rebuilt into the wrong state, preventing the adapter from working.  An HMC initiated disruptive resource dump of the adapter can recover it.  This problem affects the following PCIe adapters:  #EC2R/#EC2S with CCIN 58FA;  #EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.
  • A problem was fixed for SR-IOV Virtual Functions (VFs) halting transmission with a SRC B400FF01 logged when many logical partitions with VFs are shutdown at the same time the adapter is in highly-active usage by a workload.  The recovery process reboots the failed SR-IOV adapter, so no user intervention is needed to restore the VF.
  • A problem was fixed for VLAN-tagged frames being transmitted over SR-IOV adapter VFs when the packets should instead have been discarded for some VF configuration settings on certain SR-IOV adapters.  This affects the following PCIe adapters:  #EC2R/#EC2S with CCIN 58FA;  #EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.   
  • A problem was fixed for SR-IOV adapter hangs with a possible SRC B400FF01 logged.  This may cause a temporary network outage while the SR-IOV adapter VF reboots to recover from the adapter hang.   This problem has been observed on systems with high network traffic and with many VFs defined.
    This fix updates adapter firmware to 1x.22.4021 for the following Feature Codes: EC2R, EC2S,  EC2T, EC2U, EC3L and EC3M.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for a large number (approximately 16,000) of DLPAR adds and removes of SR-IOV VFs to cause a subsequent DLPAR add of the VF to fail with the newly-added VF not usable.  The large number of allocations and deallocations caused a leak of a critical SR-IOV adapter resource.  The adapter and VFs may be recovered by an SR-IOV adapter reset.
  • A problem was fixed for a system boot hanging when recoverable attentions occur on the non-master processor.  With the fix, the attentions on the non-master processor are deferred until Symmetric multiprocessing (SMP) mode has been established (the point at which the system is ready for multiple processors to run).  This allows the boot to complete but still have the non-master processor errors recovered as needed.
  • A problem was fixed for certain hypervisor error logs being slow to report to the OS.  The error logs affected are those created by the hypervisor immediately after the hypervisor is started and if there are more than 128 error logs from the hypervisor to be reported.  The error logs at the end of the queue take a long time to be processed, and may make it appear as if error logs are not being reported to the OS.
  • A problem was fixed for a Self Boot Engine (SBE) reset causing the On-Chip Controller (OCC) to force the system into Safe Mode with a flood of SRC B150DAA0 and SRC B150DA8A written to the error log as Information Events.
  • A problem was fixed for the Redfish "Manager" request returning duplicate object URIs for the same HMC.  This can occur if the HMC was removed from the managed system and then later added back in.  The Redfish objects for the earlier instances of the same HMC were never deleted on the remove.
  • A problem was fixed for a possible failure causing the service processor to go to the stop state when performing a platform dump.  This problem is specific to dumps being collected for HWPROC checkstops, which are not common.
  • A problem was fixed for SMS menus to limit reporting on the NPIV and vSCSI configuration to the first 511 LUNs.  Without the fix, LUN 512 through the last configured LUN report with invalid data.  Configurations in excess of 511 LUNs are very rare, and it is recommended for performance reasons (to be able to search for the boot LUN more quickly) that the number of LUNs on a single target be limited to less than 512.
  • The following two errors in the SR-IOV adapter firmware were fixed:  1)  The adapter resets and there is a B400FF01 reference code logged. This error happens in rare cases when there are multiple partitions actively running traffic through the adapter.  System firmware resets the adapter and recovers the system with no user-intervention required; 2) SR-IOV VFs with defined VLANs and an assigned PVID are not able to ping each other.
    This fix updates adapter firmware to 11.2.211.26 for the following Feature Codes: EN15,  EN17, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
    Note: Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can be updated concurrently either by the OS that owns the adapter or the managing HMC (if OS is AIX or VIOS and RMC is running).
  • A problem was fixed for Field Core Override (FCO) cores being allocated from a deconfigured processor, causing an IPL failure with unusable cores.  This problem only occurs during the Hostboot reconfiguration loop in the presence of other processor failures.
  • A problem was fixed for a failure in DDR4 RCD (Register Clock Driver) memory initialization that causes half of the DIMM memory to be unusable after an IPL.  This is an intermittent problem where the memory can sometimes be recovered by doing another IPL.  The error is not a hardware problem with the DIMM but it is an error in the initialization sequence needed to get the DIMM ready for normal operations.  This supersedes an earlier fix delivered in FW910.01 that intermittently failed to correct the problem.
  • A problem was fixed for IBM Product Engineering and Support personnel not being able to easily determine planar jumper settings in a machine in order to determine the best mitigation strategies for various field problems that may occur.  With the fix, an Information Error log is provided on every IPL to provide the planar jumper settings.
  • A problem was fixed for the periodic guard reminder function to not re-post error logs of failed FRUs on each IPL.  Instead, a reminder SRC is created to call home the list of FRUs that have failed and require service.  This puts the system back to the original behavior of only posting one error log for each FRU that has failed.
  • For a HMC managed system, a problem was fixed for a rare, intermittent NetsCMS core dump that could occur whenever the system is doing a deferred shutdown power off.  There is no impact to normal operations as the power off completes, but there are extra error logs with SRC B181EF88  and a service processor dump.
  • A problem was fixed for a Hostboot hang due to deadlock that can occur if there is a SBE dump in progress that fails.  A failure in the SBE dump can trigger a second SBE dump that deadlocks.
  • A problem was fixed for dump performance by decreasing the amount of time needed to perform dumps by 50%.
  • A problem was fixed for an IPL hang that can occur for certain rare processor errors, where the system is in a loop trying to isolate the fault.
  • A problem was fixed for an enclosure fault LED being stuck on after a repair of a fan.  This problem only occurs after the second concurrent repair of a fan.
  • A problem was fixed for SR-IOV adapters not showing up in the device tree for a partition that autoboots or starts within a few seconds of the hypervisor going ready.  This problem can be circumvented by delaying the boot of the partition for at least a minute after the hypervisor has reached the standby state.  If the problem is encountered, the SR-IOV adapter can be recovered by rebooting the partition, or by using DLPAR to remove and add the SR-IOV adapter to the partition.
  • A problem was fixed for a system crash with SRC B700F103 when there are many consecutive configuration changes in the LPARs to delete old vNICs and create new vNICs, which exposed an infrequent problem with lock ownership on a virtual I/O slot.  There is a one-to-one mapping or connection between the vNIC adapter in the client LPAR and the backing logical port in the VIOS, and the lock management needs to ensure that the LPAR accessing the port has ownership of it.  In this case, the LPAR was trying to make usable a device it did not own.   The system should recover on the post dump IPL.
  • A problem was fixed for a possible DLPAR add failure of a PCIe adapter if the adapter is in the planar slot C7 or slot C6 on any PCIe Expansion drawer fanout module.  The problem is more common if there are other devices or Virtual Functions (VFs) in the same LPAR that use four interrupts, as this is a problem with the processing order of the PCIe LSI interrupts.
  • A problem was fixed for resource dumps that use the selector "iomfnm" and options "rioinfo" or "dumpbainfo".  This combination of options for resource dumps always fails without the fix.
  • A problem was fixed for missing FFDC data for SR-IOV Virtual Function (VF) failures and for not allowing the full architected five minute limit for a recovery attempt for the VF, which should expand the number of cases where the VF can be recovered.
  • A problem was fixed for missing error recovery for memory errors in non-mirrored memory when reading the SR-IOV adapter firmware, which could prevent the SR-IOV VF from booting.
  • A problem was fixed for a possible system crash if an error occurs at runtime that requires a FRU guard action.  With the fix, the guard action is restricted to the IPL where it is supported.
  • A problem was fixed for an extremely rare IPL hang on a false communications error to the power supply.  Recovery is to retry the IPL.
  • A problem was fixed for the dump content type for HBMEM (Hostboot memory) to be recognized instead of displaying "Dump Content Type: not found".
  • A problem was fixed for a system crash when an SR-IOV adapter is changed from dedicated to shared mode with SRC B700FFF and SRC B150DA73 logged.  This failure requires that hypervisor memory relocation be in progress on the system. This affects the following PCIe adapters:  #EC2R/#EC2S with CCIN 58FA;  #EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.
  • A problem was fixed for a Live Partition Mobility (LPM) migration of a partition with shared processors that has an unusable shared processor that can result in failure of the target partition or target system.  This problem can be avoided by making sure all shared processors are functional in the source partition before starting the migration.  The target partition or system can be rebooted to recover it.
  • A problem was fixed for hypervisor memory relocation and Dynamic DMA Window (DDW) memory allocation used by I/O adapter slots for some adapters where the DDW memory tables may not be fully initialized between uses.  Infrequently,  this can cause an internal failure in the hypervisor when moving the DDW memory for the adapters.  Examples of programs using memory relocation are Live Partition Mobility (LPM) and the Dynamic Platform Optimizer (DPO).
  • A problem was fixed for a partition or system termination that may occur when shutting down or deleting a partition on a system with a very large number of partitions (more than 400) or on a system with fewer partitions but with a very large number of virtual adapters configured.
  • A problem was fixed for when booting a large number of LPARs with Virtual Trusted Platform Module (vTPM) capability, where some partitions may post an SRC BA54504D time-out for taking too long to start.  With the fix, the time allowed to boot a vTPM LPAR is increased.  If a time-out occurs, the partition can be booted again to recover.  The problem can be avoided by auto-starting fewer vTPM LPARs, or by booting them a couple at a time to prevent flooding the vTPM device server with requests that will slow the boot time while the LPARs wait on the vTPM device server responses.
  • A problem was fixed for a possible system crash.
  • A problem was fixed for a UE B1812D62 logged when a PCI card is removed between system IPLs.  This error log can be ignored.
  • A problem was fixed for a USB code update failure if the USB stick is plugged in during an AC power cycle.  After the power cycle completes, the code update will fail to start from the USB device.  As a circumvention, the USB device can be plugged in after the service processor is in its ready state.
  • A problem was fixed for a possible slower migration during the Live Partition Mobility (LPM) resume stage.   For a migrating partition that does not have a high demand page rate, there is minimal impact on performance.  There is no need for customer recovery as the migration completes successfully.
  • A problem was fixed for firmware assisted dumps (fadump) and Linux kernel crash dumps (kdump) where dump data is missing.  This can happen if the dumps are set up with chunks greater than 1 Gb in size.  This problem can be avoided by setting up fadump or kdump with multiple 1 Gb chunks.
  • A problem was fixed for the I2C bus error logged with SRC BC500705 and SRC BC8A0401 where the I2C bus was locked up.  This is an infrequent error. In rare cases, the TPM device may hold down the I2C clock line longer than allowed, causing an error recovery that times out and prevents the reset from working on all the I2C engine's ports.  A power off and power on of the system should clear the bus error and allow the system to IPL.
  • A problem was fixed for an intra-node, inter-processor communication lane failure marked in the VPD, causing a secure boot blocklist violation on the IPL and a processor to be deconfigured with an SRC BC8A2813 logged.
  • A problem was fixed to capture details of failed FRUs into the dump data by delaying the deconfiguration of the FRUs for checkstop and TI attentions.
  • A problem was fixed for failed processor cores not being guarded on a memory preserving IPL (re-IPL with CEC powered on).
  • A problem was fixed for debug data missing in dumps for cores which are off-line.
  • A problem was fixed for L3 cache calling out a LRU Parity error too quickly for hardware that is good.  Without the fix, ignore the L3FIR[28] LRU Parity errors unless they are very persistent with 30 or more occurrences per day.
  • A problem was fixed for not having a FRU callout when the TPM card is missing and causes an IPL failure.
  • A problem was fixed for the Advanced System Management Interface (ASMI) displaying the IPv6 network prefix in decimal instead of hex character values.  The service processor command line "ifconfig" can be used to see the IPv6 network prefix value in hex as a circumvention to the problem.
  • A problem was fixed for an On-Chip Controller (OCC) cache fault causing a loss of the OCC for the system without the system dropping into Safe mode.
  • A problem was fixed for a system dump failing to collect the pu.perv SCOMs for chiplets c16 and above, which correspond to the EQ and EC chiplets.  Also fixed was the missing SCOM data for the interrupt-unit-related "c_err_rpt" registers.
  • A problem was fixed for the PCIe topology reports having slots missing in the "I/O Slot Locations" column in the row for the bus representing a PCIe switch.  This only occurs when the C49 or C50 slots are bifurcated (a slot having two channels).  Bifurcation is done if an NVME module is in the slot or if the slot is empty (for certain levels of backplanes).
  • A problem was fixed for Live Partition Mobility (LPM) failing along with other hypervisor tasks, but the partitions continue to run.  This is an extremely rare failure where a re-IPL is needed to restore HMC or Novalink connections to the partitions, or to do any system configuration changes.
  • A problem was fixed for a system termination during a concurrent exchange of an SR-IOV adapter that had VFs assigned to it.  For this problem, the OS failed to release the VFs but the error was not returned to the HMC.  With the fix, the FRU exchange gracefully aborts without impacting the system for the case where the VFs on the SR-IOV adapter remain active.
  • A possible performance problem was fixed for partitions with shared processors that had latency in the handling of the escalation interrupts used to switch the processor between tasks.  The effect of this is that, while the processor is kept busy, some tasks might hold the processor longer than they should because the interrupt is delayed, while others run slower than normal.
  • A problem was fixed for a system termination that may occur with B111E504 logged when starting a partition on a system with a very large number of partitions (more than 400) or on a system with fewer partitions but with a very large number of virtual adapters configured.
  • A problem was fixed for a system termination that may occur with a B150DA73 logged when a memory UE is encountered in a partition when the hypervisor touches the memory.  With the fix, the touch of memory by the hypervisor is a UE tolerant touch and the system is able to continue running.
  • A problem was fixed for fabric errors such as cable pulls causing checkstops.  With the fix, the PBAFIR errors are changed to recoverable attentions, allowing the OCC to be reset to recover from such faults.
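The vTPM circumvention described above (booting vTPM-capable LPARs a couple at a time rather than auto-starting them all at once) can be sketched as a small helper. This is an illustrative sketch only: the `boot` callback is a hypothetical placeholder, and in practice it might invoke the HMC `chsysstate` command for each partition; the batch size and delay values are examples, not recommendations.

```python
import time

def batches(items, size):
    """Yield successive fixed-size batches from a list of LPAR names."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def staggered_boot(lpar_names, batch_size=2, delay_seconds=60, boot=print):
    """Boot vTPM-capable LPARs a couple at a time, pausing between
    batches so the vTPM device server is not flooded with requests.
    `boot` is a hypothetical callback; in practice it might run the
    HMC 'chsysstate -r lpar -o on' command for the named partition."""
    for group in batches(lpar_names, batch_size):
        for name in group:
            boot(name)
        time.sleep(delay_seconds)
```

A caller would pass its own `boot` function and tune the batch size to however many partitions the vTPM device server handles comfortably at once.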
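For the ASMI IPv6 display problem above, the circumvention is to read the prefix from the service processor command line ("ifconfig"), which shows it in hex. As a purely illustrative aid, the conversion between the two renderings is a straight base change on each 16-bit group; the function name and input format here are assumptions for the example, not part of ASMI.

```python
def ipv6_groups_to_hex(decimal_groups):
    """Convert 16-bit IPv6 address groups rendered as decimal numbers
    (as the unfixed ASMI panel displayed them) into the conventional
    colon-separated hexadecimal notation."""
    return ":".join(format(group, "x") for group in decimal_groups)

# For example, the decimal group 65152 is the familiar link-local
# prefix fe80 in hex.
```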

System firmware changes that affect certain systems

  • A problem was fixed to remove a SAS battery LED from ASMI that does not exist.  This problem only pertains to the S914 (9009-41A), S924 (9009-42A), and H924 for SAP HANA (9223-42H) models.
  • On a system with an AIX partition, a problem was fixed for a partition time jump that could occur after doing an AIX Live Update.  This problem could occur if the AIX Live Update happens after a Live Partition Mobility (LPM) migration to the partition.  AIX applications using the timebase facility could observe a large jump forward or backward in the time reported by the timebase facility.  A circumvention to this problem is to reboot the partition after the LPM operation and prior to doing the AIX Live Update.  An AIX fix is also required to resolve this problem: the issue no longer occurs when this firmware update is applied on the system that is the target of the LPM operation and the AIX partition performing the AIX Live Update has the appropriate AIX updates installed.
  • On a Linux or IBM i partition which has just completed a Live Partition Mobility (LPM) migration, a problem was fixed for a VIO adapter hang when it stops processing interrupts.  For this problem to occur, prior to the migration the adapter must have had an interrupt outstanding where the interrupt source was disabled.
  • On systems with an IBM i partition, support was added for multipliers for IBM i MATMATR fields that are limited to four characters.  When server metrics are retrieved via MATMATR calls on a system containing more than 9999 GB, the architected "multiplier" field is used to represent the extended value; for example, 10,000 GB can be represented as 5,000 GB with a multiplier of 2, so '5000' and '2' are returned in the quantity and multiplier fields, respectively.  The IBM i OS also requires a PTF to support the MATMATR field multipliers.
  • On a system with an IBM i partition with more than 64 shared processors assigned to it, a problem was fixed for a system termination or other unexpected behavior that may occur during a partition dump.  Without the fix, the problem can be avoided by limiting the IBM i partition to 64 or fewer shared processors.
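The MATMATR multiplier scheme above (a quantity too large for a four-character field is split into a smaller quantity times a multiplier) can be sketched as follows. This is a hedged illustration of the arithmetic described in the fix text, not the firmware's actual encoding routine; the function name and the search strategy are assumptions for the example.

```python
def encode_with_multiplier(value_gb, field_max=9999):
    """Split a quantity into (quantity, multiplier) such that the
    quantity fits a four-character MATMATR field (at most 9999) and
    quantity * multiplier reproduces the original value exactly.
    Illustrative sketch only -- the real encoding is done by firmware."""
    multiplier = 1
    # Find the smallest multiplier that both divides the value evenly
    # and brings the quotient within the four-character field limit.
    while value_gb // multiplier > field_max or value_gb % multiplier:
        multiplier += 1
    return value_gb // multiplier, multiplier
```

For the example in the fix text, 10,000 GB encodes as a quantity of 5000 with a multiplier of 2.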
VL910_089_089 / FW910.01

05/30/18

Impact:  Security      Severity:  HIPER

Response for Recent Security Vulnerabilities

  • HIPER/Pervasive: DISRUPTIVE:  In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue number CVE-2018-3639. In addition, Operating System updates are required in conjunction with this FW level for CVE-2018-3639.

System firmware changes that affect all systems 

  • HIPER/Pervasive:   A firmware change was made to address a rare case where a memory correctable error on POWER9 servers may result in an undetected corruption of data.
  • A problem was fixed for Live Partition Mobility (LPM) to prevent an error in the hardware page translation table for a migrated page that could result in an invalid operation on the target system.  This is a rare timing problem with the logic used to invalidate an entry in the memory page translation table.
  • A problem was fixed for a hung ethernet port on the service processor.  This hang prevents TCP/IP network traffic from the management console and the Advanced System Management Interface (ASMI) browsers.  It makes it appear as if the service processor is unresponsive and can be confused with a service processor in the stopped state.  An A/C power cycle would recover a hung ethernet adapter.
  • A problem was fixed for partition hangs or aborts during a Live Partition Mobility (LPM) or Dynamic Platform Optimizer (DPO) operation.  This is a rare problem with a small timing window for it to occur in the hypervisor task dispatching.  The partition can be rebooted to recover from the problem.
  • A problem was fixed for service processor static IP configurations failing after several minutes with SRC B1818B3E.  The IP address will not respond to pings in the ethernet adapter failed state.  This problem occurs any time a static IP configuration is enabled on the service processor.  Dynamic Host Configuration Protocol (DHCP) dynamic IPs can be used to provide the service processor network connections.  To recover from the problem, the other ethernet adapter (either eth0 or eth1) should be in the default DHCP configuration, allowing the failing adapter to be reconfigured with a dynamic IP.
  • A problem was fixed for the system going to ultra turbo mode after an On-Chip Controller (OCC) reset.  This could result in a power supply over current condition.  This problem can happen when the system is running a heavy workload and then a power mode change is requested or some error happens that takes the OCC into a reset.
  • A problem was fixed for Workload Optimized Frequency (WOF) where parts may have been manufactured with bad IQ data that requires filtering to prevent WOF from being disabled.
  • A problem was fixed for transactional memory that could result in a wrong answer for processes using it.  This is a rare problem requiring L2 cache failures that can affect the process determining correctly if a transaction has completed.
  • A problem was fixed for a change in the IP address of the service processor causing the system On-Chip Controller (OCC) to go into Safe mode with SRC B18B2616 logged.  In Safe mode, the system is running with reduced performance and with fans running at high speed.  Normal performance can be restored concurrently by a reset/reload of the service processor using the ASMI soft reset option.  Without the fix, whenever the IP address of the service processor is changed, a soft reset of the service processor should be done to prevent OCC from going into Safe mode.
  • A problem was fixed for the recovery for optical link failures in the PCIe expansion drawer with feature code #EMX0.  The recovery failure occurs when there are multiple PCIe link failures simultaneously with the result that the I/O drawers become unusable until the CEC is restarted.  The hypervisor will have xmfr entries with "Sw Cfg Op FAIL" identified.  With the fix, the errors will be isolated to the PCIe link and the I/O drawer will remain operational.
  • A problem was fixed for a system aborting with SRC B700F105 logged when starting a partition that the user had changed from P8 or P9 compatibility mode to P7 compatibility mode.  This problem is intermittent and the partition in question had to have an immediate shutdown done prior to the change in compatibility mode for the problem to happen.  To prevent this problem when it is known that a compatibility mode is going to change to P7 mode, allow the partition to shut down normally before making the change.  If an immediate shutdown of the partition is necessary and the compatibility mode has to be changed to P7, then the CEC should be powered off and re-IPLed before making the change to prevent an unplanned outage of the system.
  • A problem was fixed for a logical partition hang or unpredictable behavior due to lost interrupts with BCxxE504 logged when memory is relocated by the hypervisor because of predictive memory failures.  This problem is not frequent because it requires memory failing and the recovery action of relocating memory away from the failing DIMMs being taken.  To circumvent this failure, if memory failure has occurred, the system may be re-IPLed to allow normal memory allocations from the available memory, so the partitions do not have to run on relocated memory.
  • A problem was fixed for a failure in DDR4 RCD (Register Clock Driver) memory initialization that causes half of the DIMM memory to be unusable after an IPL.  This is an intermittent problem where the memory can sometimes be recovered by doing another IPL.  The error is not a hardware problem with the DIMM but an error in the initialization sequence needed to get the DIMM ready for normal operations.

System firmware changes that affect certain systems

  • DEFERRED:   On a system with only a single processor core configured, a problem was fixed for poor I/O performance.  This problem may be circumvented by configuring a second processor core.  This additional processor core does not have to be used by the partition.
  • On systems that are not managed by an HMC, a problem was fixed to enable FSP call home.  Without the fix, the call home always fails when the service processor attempts to call home an error log.
  • A problem was fixed for Dynamic Power Saver Mode on a system with low-CPU utilization that had reduced performance unexpectedly.  This only happens for workloads that are restricted to a single core or using just a single core because of low-CPU utilization.  This problem can be circumvented by running the system in maximum performance mode by using ASMI to enable Fixed Maximum Frequency mode.
VL910_073_059 / FW910.00

03/20/18

Impact:  New      Severity:  New

New Features and Functions

  • GA Level

Applies to: IBM Power System S914 (9009-41G), S914 (9009-41A), S922 (9009-22A), S922 (9009-22G), S924 (9009-42A), S924 (9009-42G), H922 (9223-22H), H922 (9223-22S), H924 (9223-42H), H924 (9223-42S), and L922 (9008-22L) servers; category: Firmware; platform independent; all versions.

Document Information

Modified date:
08 January 2026

UID

ibm16955591