RSCT
Added by parkes, last edited by parkes on Oct 23, 2008

Known Issues

Date Added: October 23, 2008

Incorrect handling of networkID for IB membership groups in cthagsglsm

Issues:
A problem with the handling of the networkID for IB membership groups in cthagsglsm could cause membership information in those groups to be updated incorrectly. This could lead to incorrect ml0 adapter group membership calculations.

The mechanism used to detect hangs in RSCT NAM API clients may cause deadlocks within cthagsglsm, which could result in a hang of that subsystem. This will result in clients (such as pnsd) no longer receiving expected updates.

Solution:
For AIX, order the following APARs, when available:
IZ34059 rsct.basic.rte 2.5.2.1
IZ34058 rsct.basic.rte 2.4.10.1

Ifixes are currently available for rsct.basic.rte 2.5.1.3 (AIX 6.1 and Linux) and for rsct.basic.rte 2.4.9.3 (AIX 5.3).

Date Added: September 2, 2008

Disk Heartbeat networks unreliable at RSCT levels 2.4.9.0 through 2.4.9.3

Systems Affected:
Disk Heartbeat networks at RSCT levels 2.4.9.0 through 2.4.9.3 (AIX 5.3 only).
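As a rough aid for deciding whether a node falls in the affected range, the following sketch compares a dotted fileset level against an inclusive range. The function name level_in_range is hypothetical, and the sketch assumes levels have at most four numeric components, each below 1000. On AIX the installed level could probably be read with `lslpp -Lqc rsct.basic.rte | cut -d: -f3`; treat that field position as an assumption to verify on your system.

```shell
# level_in_range LEVEL LOW HIGH
# Exit status 0 if LOW <= LEVEL <= HIGH (inclusive), nonzero otherwise.
level_in_range() {
    awk -v lvl="$1" -v lo="$2" -v hi="$3" '
        # Convert "2.4.9.3" into one comparable number (2004009003).
        function key(v,    parts, n, i, k) {
            n = split(v, parts, ".")
            k = 0
            for (i = 1; i <= 4; i++)
                k = k * 1000 + (i <= n ? parts[i] : 0)
            return k
        }
        BEGIN { exit !(key(lvl) >= key(lo) && key(lvl) <= key(hi)) }
    '
}

# Example with the affected range from this notice.
if level_in_range "2.4.9.2" "2.4.9.0" "2.4.9.3"; then
    echo "affected: order IZ26175 (rsct.basic.rte 2.4.9.4)"
else
    echo "not in the affected range"
fi
```

The check only covers the level; it does not detect whether Disk Heartbeat networks are actually configured on the node.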

Issue:
Disk Heartbeat networks at RSCT levels 2.4.9.0 through 2.4.9.3 (AIX 5.3 only) will be unable to properly use their assigned disks, due to a build-time bug in how the target sector is determined.

If the DiskHeartbeat network connects nodes at different RSCT levels, communication across the disks will be completely broken, as the two sides will not be talking across the same sector.

If the DiskHeartbeat network connects nodes that are all within the 2.4.9.0-2.4.9.3 range, communication will be unpredictable: the incorrect sector that both sides use is not reserved for RSCT, so other OS facilities could be trying to use it at the same time.

In these cases, watch the errpt output for NIM-related errors such as:

  LABEL: TS_NIM_ERROR_RDWR_E
  Resource Name: topsvcs

  1: read operation 0: write operation
                  1
  Error detailed information
  DHB NIM value1: sector without magic number
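To check a node for this symptom, the error log can be filtered by label; on AIX, `errpt -a -J TS_NIM_ERROR_RDWR_E` should select the detailed entries with that label. The sketch below is a hypothetical helper, count_dhb_errors, that counts such entries in detailed error-report text supplied on standard input. It assumes each entry contains a "LABEL:" line as in the excerpt above; the second label and resource name in the sample text are made up for illustration.

```shell
# count_dhb_errors: read detailed error-report text on stdin and print
# the number of entries labeled TS_NIM_ERROR_RDWR_E.
count_dhb_errors() {
    awk '$1 == "LABEL:" && $2 == "TS_NIM_ERROR_RDWR_E" { n++ }
         END { print n + 0 }'
}

# Usage on a live AIX node (not run here):
#   errpt -a | count_dhb_errors

# Self-contained demonstration on sample text (prints 2).
count_dhb_errors <<'EOF'
LABEL:          TS_NIM_ERROR_RDWR_E
Resource Name:  topsvcs
LABEL:          SOME_OTHER_LABEL
Resource Name:  sysplanar0
LABEL:          TS_NIM_ERROR_RDWR_E
Resource Name:  topsvcs
EOF
```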

Solution:
Order the following APAR, when available:
IZ26175 rsct.basic.rte 2.4.9.4

Date Added: October 29, 2007
Date Last Updated: November 19, 2007

hags daemon core dumps on RSCT 2.3.11.4, 2.3.11.5, 2.4.7.4 or 2.4.7.5

Users Affected:
A cluster with more than 150 nodes, or any HACMP clusters, at rsct.basic.rte 2.3.11.4, rsct.basic.rte 2.3.11.5, rsct.basic.rte 2.4.7.4 or rsct.basic.rte 2.4.7.5.
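The "Users Affected" condition combines a set of exact levels with a cluster-size or HACMP test. A minimal sketch of that logic follows; the function name hags_core_risk and its three arguments are hypothetical, and the caller is assumed to already know the fileset level, the node count, and whether the cluster is an HACMP cluster.

```shell
# hags_core_risk LEVEL NODE_COUNT IS_HACMP
#   LEVEL       rsct.basic.rte level, for example 2.4.7.4
#   NODE_COUNT  number of nodes in the cluster
#   IS_HACMP    "yes" for an HACMP cluster, anything else otherwise
# Exit status 0 means the configuration matches "Users Affected".
hags_core_risk() {
    case "$1" in
        2.3.11.4|2.3.11.5|2.4.7.4|2.4.7.5) ;;   # affected levels
        *) return 1 ;;                           # any other level
    esac
    # Affected only if the cluster is large or is an HACMP cluster.
    [ "$2" -gt 150 ] || [ "$3" = "yes" ]
}

# Example: a 32-node HACMP cluster at 2.4.7.5 (prints "affected").
if hags_core_risk "2.4.7.5" 32 yes; then
    echo "affected"
else
    echo "not affected"
fi
```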

Issue:
The hags daemon may dump core in clusters with more than 150 nodes or in any HACMP cluster. The core dump may occur in check_lost_lines_by_logthread ("TraceStream.C").

Solution:
Reject rsct.basic.rte 2.3.11.4, rsct.basic.rte 2.3.11.5, rsct.basic.rte 2.4.7.4 and rsct.basic.rte 2.4.7.5 from all nodes in the cluster where they have been applied.

For AIX, order the following APARs:
IZ07443 rsct.basic.rte 2.3.12.1
IZ06869 rsct.basic.rte 2.4.8.1

For Linux, download the following, as part of the CSM 1.7.0.1 update (http://www14.software.ibm.com/webapp/set2/sas/f/csm/download/home.html):
rsct.basic.rte 2.4.8.1

Date Added: September 18, 2007

Peer Domain resiliency enhancements added in RSCT 2.4.7.4 (APAR IZ01378) and RSCT 2.3.11.4 (APAR IZ01379)

In heavily loaded systems, contention for resources such as memory, I/O, or CPU may prevent RSCT daemons from making progress in a timely manner. That may result in false node failures, or in RSCT daemons being recycled. To minimize the possibility that the daemons are prevented from accessing system resources, the Topology Services, Group Services, and Configuration Resource Manager daemons now run with a fixed realtime CPU priority, which should allow them to access CPU resources even when several other processes in the system are running.

Note that the use of a realtime fixed CPU priority will not result in the RSCT daemons using additional CPU resources. The priority will only ensure that the daemons will be allowed to access the CPU whenever needed.

The second step in improving the daemons' resilience to resource contention involves locking ("pinning") their pages in real memory. Once the pages are brought to physical memory, they are not allowed to be paged out, thus minimizing the possibility that daemons become blocked or delayed during periods of high paging activity.

Because the daemons' pages are locked in memory, the corresponding physical pages are dedicated to the daemons and cannot be used by other processes in the system. Therefore the amount of physical memory available for other processes is slightly reduced.

By default, the daemons will use a fixed CPU priority and lock the pages in memory. This behavior can be changed, with the following commands:

/usr/sbin/rsct/bin/cthatstune -p 0
will direct the RSCT daemons not to use a fixed CPU priority.
For the Group Services daemon, the setting will only take effect the next time the RSCT peer domain is brought online on the node.

CT_MANAGEMENT_SCOPE=2 chrsrc -c IBM.RSCTParameters TSPinnedRegions=256
will direct the RSCT daemons not to lock their pages in memory.
The setting will only take effect the next time the RSCT peer domain is brought online on the node.
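Because both settings change daemon behavior, it can help to review the commands before running them. The following sketch wraps the two commands from this notice in a dry-run helper. The helper names run and disable_rsct_realtime_tuning and the APPLY variable are hypothetical; the commands themselves are the ones quoted above, with `env` used only to pass CT_MANAGEMENT_SCOPE to chrsrc.

```shell
# run: print a command, and execute it only when APPLY=1 is set.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

disable_rsct_realtime_tuning() {
    # Stop using a fixed CPU priority (command from this notice).
    run /usr/sbin/rsct/bin/cthatstune -p 0
    # Stop locking daemon pages in memory (command from this notice).
    run env CT_MANAGEMENT_SCOPE=2 chrsrc -c IBM.RSCTParameters TSPinnedRegions=256
}

# Dry run by default: only prints the two commands.
disable_rsct_realtime_tuning
```

Setting APPLY=1 before calling the function would execute the commands; as noted above, the changes then take effect only when the peer domain is next brought online on the node.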

Date Added: June 29, 2007

hats packet incompatibility with RSCT 2.3.11.2 or RSCT 2.4.7.2

Users Affected:
A cluster with some, but not all, nodes at rsct.basic.rte 2.3.11.2 or rsct.basic.rte 2.4.7.2

Issue:
Any node in a cluster at rsct.basic.rte 2.3.11.2 or rsct.basic.rte 2.4.7.2 will be unable to communicate via hatsd with nodes at lower levels of rsct.basic.rte.

Symptoms of this problem may include:

  • A partitioned RSCT peer domain
  • HACMP - no heartbeating to lower level nodes over the IP networks
  • A loss of host responds in PSSP

Solution:
Reject rsct.basic.rte 2.3.11.2 and rsct.basic.rte 2.4.7.2 from all nodes in the cluster where they have been applied.

Order the following APARs when available:
IZ00913 rsct.basic.rte 2.3.11.3
IZ00912 rsct.basic.rte 2.4.7.3
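The "some, but not all" condition under Users Affected can be checked mechanically once the level of each node is known. The sketch below assumes a hypothetical input format of one "node level" pair per line, gathered by whatever means is available; the function name mixed_hats_levels and the node names are illustrative. It is a simplification: it does not check whether the remaining nodes are at lower or higher levels.

```shell
# mixed_hats_levels: read "node level" pairs on stdin, one per line.
# Exit status 0 if some, but not all, nodes are at 2.3.11.2 or 2.4.7.2.
mixed_hats_levels() {
    awk '
        NF >= 2 {
            total++
            if ($2 == "2.3.11.2" || $2 == "2.4.7.2") bad++
        }
        END { exit !(bad > 0 && bad < total) }
    '
}

# Example: one node at 2.4.7.2 and one at a lower level.
if printf '%s\n' "node1 2.4.7.2" "node2 2.4.7.1" | mixed_hats_levels; then
    echo "mixed levels: hatsd traffic between these nodes may fail"
fi
```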

Date Added: June 15, 2007

IBM.ConfigRMd core dump after migrating nodes in existing peer domains to RSCT 2.3.11.0 or RSCT 2.4.7.0

Users Affected:
Nodes in a peer domain migrated to rsct.core.rmc 2.3.11.0 or rsct.core.rmc 2.4.7.0.

Issue:
IBM.ConfigRMd will dump core every 24 hours on a node in a peer domain that was migrated to rsct.core.rmc 2.3.11.0 or rsct.core.rmc 2.4.7.0.

IBM.ConfigRM will restart automatically after the core dump.

Solution:
Apply IY99078 (rsct.core.rmc 2.3.11.1) or higher.
Apply IY99077 (rsct.core.rmc 2.4.7.1) or higher.

Date Added: June 15, 2007

Potential deadlock within certain Resource Managers

Users Affected:
Users with rsct.core.rmc 2.3.9.4, rsct.core.rmc 2.3.10.0, rsct.core.rmc 2.4.5.4 or rsct.core.rmc 2.4.6.0

Issue:
A potential deadlock exists within certain Resource Managers. The Resource Managers that are known to possibly be affected are IBM.WLMRM and IBM.HostRM.

Issuing lssrc against these Resource Managers, or lsrsrc against a resource class managed by one of the Resource Managers (e.g. IBM.Program) may hang.

A hang of IBM.HostRM may cause GUI DLPAR functions to hang and may cause lspartition -debug not to display all of the LPARs.
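Since the symptom is a command that never returns, a probe for it needs its own time limit. The following is a minimal sketch of a watchdog wrapper in plain shell; the function name run_with_limit and the return value 124 are choices made here, not part of RSCT. It assumes the command terminates on SIGTERM, and it treats any signal-style exit status (above 128) as a timeout.

```shell
# run_with_limit SECONDS COMMAND [ARG...]
# Run COMMAND in the background and send it SIGTERM if it has not
# finished after SECONDS. Returns 124 if the command ended on a signal
# (assumed to be the watchdog), else the command's own exit status.
run_with_limit() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: after the limit, terminate the command if it still runs.
    ( sleep "$limit"; kill "$cmd_pid" ) >/dev/null 2>&1 &
    dog_pid=$!
    status=0
    wait "$cmd_pid" || status=$?
    # Stop the watchdog if the command finished first.
    kill "$dog_pid" >/dev/null 2>&1 || :
    if [ "$status" -gt 128 ]; then
        return 124
    fi
    return "$status"
}

# Usage on a suspect node (not run here):
#   run_with_limit 30 lssrc -ls IBM.HostRM || echo "IBM.HostRM did not answer"

# Demonstration with a stand-in for a hung command.
rc=0
run_with_limit 1 sleep 3 || rc=$?
echo "status: $rc"
```

A process that ignores SIGTERM would still block the wrapper, so this is only a first-level check.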

Solution:
Apply IY90698 (rsct.core.rmc 2.3.10.1) or higher.
Apply IY90697 (rsct.core.rmc 2.4.6.1) or higher.




 