IBM Support

SE70094 - OSP-OTHER-UNPRED CLUSTER MONITOR DOES NOT PREVENT PARTITION
STATUS


 APAR (Authorized Program Analysis Report)

Abstract

OSP-OTHER-UNPRED CLUSTER MONITOR DOES NOT PREVENT PARTITION
STATUS

Error Description

When using cluster monitors for advanced cluster node failure
detection, it is expected that when the HMC recognizes an
event, it will inform the cluster and the cluster will
treat the event appropriately.
It has been discovered that when a partition status of "not
available" was provided from the HMC to the cluster, cluster
code did not immediately treat that as a failure. By the time
the cluster did get an expected error condition from the HMC,
cluster code had already deemed that a network connection had
been lost and changed the node status to "Partition"
rather than the preferred status of "Failed" in this case.
At that point, manual efforts are required to complete failover
events.

The condition has been recognized and code is being enhanced
to properly handle the condition.  Other conditions are also
being explored.

Problem Summary

When using HMC REST cluster monitors for advanced cluster node
failure detection, it is expected that when a partition fails
the cluster monitor will detect the failure and trigger a
failover, preventing the cluster from going to "Partition"
status.
It has been discovered that when a partition status of "not
available" was provided from the HMC to the cluster, and the
state of the CEC owning the partition was 'error', 'error -
terminated', or 'error - dump in progress', the cluster code did
not trigger a failover. By the time the cluster did get an
expected error condition from the HMC, cluster code had already
deemed that a network connection had been lost and changed the
cluster status to "Partition" rather than triggering a failover.
Additionally, a problem was discovered where the cluster monitor
was not properly ending HMC sessions.  This resulted in multiple
stale sessions on the HMC REST server.  Enough stale sessions
can cause performance problems in the HMC REST server, which may
result in late notifications to the HMC REST cluster monitor.
This can cause the cluster to detect a "Partition" before
receiving a notification from the HMC.
When the cluster status becomes "Partition", manual efforts
are required to cause a failover.
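
The stale-session problem described above is a general resource-cleanup
issue. The following is a minimal sketch of one common way to guarantee
that a session is ended even when a query fails. It is not IBM's code
and does not call the real HMC REST API: HypotheticalHmcServer,
hmc_session, and poll_once are names invented purely for illustration,
and the "server" is an in-memory stand-in.

```python
from contextlib import contextmanager


class HypotheticalHmcServer:
    """In-memory stand-in for a server that tracks open sessions (assumption)."""

    def __init__(self):
        self.open_sessions = set()
        self._next_id = 0

    def logon(self):
        # Open a new session and remember it as active.
        self._next_id += 1
        self.open_sessions.add(self._next_id)
        return self._next_id

    def logoff(self, session_id):
        # End the session so it does not linger as a stale session.
        self.open_sessions.discard(session_id)


@contextmanager
def hmc_session(server):
    """Open a session and always end it, even if the caller raises."""
    session_id = server.logon()
    try:
        yield session_id
    finally:
        server.logoff(session_id)


def poll_once(server, fail=False):
    """Run one simulated status query inside a managed session."""
    with hmc_session(server) as session_id:
        if fail:
            raise RuntimeError("simulated query failure")
        return session_id
```

With this pattern, a failed query still ends its session, so repeated
polling does not accumulate sessions on the server side.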

Problem Conclusion

The condition has been recognized and code has been changed
to properly trigger a failover when a partition reports a 'not
available' status and the CEC associated with the partition is
not active.
The HMC REST cluster monitor also properly ends HMC sessions,  
which prevents associated performance problems in the HMC REST  
server.                                                        
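
The corrected decision described above can be sketched as a small
function. This is an illustrative sketch only, not the actual cluster
code: the names NodeAction, CEC_ERROR_STATES, and decide_action are
assumptions, and "CEC not active" is approximated here by the three
CEC states listed in the Problem Summary.

```python
from enum import Enum


class NodeAction(Enum):
    FAILOVER = "failover"
    NO_ACTION = "no_action"


# CEC states named in this APAR's Problem Summary; treating exactly these
# as "not active" is an assumption made for this sketch.
CEC_ERROR_STATES = {"error", "error - terminated", "error - dump in progress"}


def decide_action(partition_status: str, cec_state: str) -> NodeAction:
    """Return FAILOVER when the partition is 'not available' and the
    owning CEC is in one of the listed error states."""
    partition = partition_status.strip().lower()
    cec = cec_state.strip().lower()
    if partition == "not available" and cec in CEC_ERROR_STATES:
        return NodeAction.FAILOVER
    return NodeAction.NO_ACTION
```

For example, decide_action("not available", "error - terminated")
returns NodeAction.FAILOVER, while a 'not available' partition on a CEC
in any state outside the listed set yields NodeAction.NO_ACTION in this
sketch.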

Temporary Fix

                                                               

Comments

                                                               

Circumvention


PTFs Available

R720 SI68530 PTF Cover Letter   9123
R730 SI68594 PTF Cover Letter   9116

Affected Modules

         
         

Affected Publications

Summary Information

Status............................................ CLOSED PER
HIPER........................................... No
Component.................................. 5770SS100
Failing Module.......................... RCHMGR
Reported Release................... R720
Duplicate Of..............................




IBM i Support

IBM disclaims all warranties, whether express or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. By furnishing this document, IBM grants no licenses to any related patents or copyrights. Copyright © 1996,1997,1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019 IBM Corporation. Any trademarks and product or brand names referenced in this document are the property of their respective owners. Consult the Terms of use link for trademark information


Document Information

Modified date:
17 May 2019