Limitations and known issues

As of the release of IBM® Spectrum Cluster Foundation Community Edition Version 4.2.2, the following product limitations and known issues apply.

Review the following list of limitations for IBM Spectrum Cluster Foundation Community Edition 4.2.2.

Table 1. IBM Spectrum Cluster Foundation Community Edition Version 4.2.2 limitations
Limitation Affects Description
Installation errors are found in the pcmconfig.log installation log during SLES installation.

Reference #216371

Installation During IBM Spectrum Cluster Foundation Community Edition installation on SLES, one of the following error messages is found in the pcmconfig.log installation log in the /opt/pcm/log directory:
device node not found
Root device (/dev/sda2) not found
GetBootloader(): Cannot determine the loader type
There was an error generating the initrd (1)
These error messages can be ignored.
Image profiles support the highest version of a custom package only.

Reference #210000

Custom packages Image profiles support the highest version of a custom package only. If you add two custom packages with the same name, but a different version, IBM Spectrum Cluster Foundation Community Edition installs only the highest version of the custom package.

You cannot choose between the higher version and lower version in the image profile. If you want to install the lower version, you must make sure that the /install/contrib/OS_release/arch directory, where OS_release is the operating system release and arch is the architecture, contains only the lower version. For example, ensure that only the lower version is found in the /install/contrib/rhels6.4/x86_64/ directory.
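The cleanup can be sketched as follows. The package name mytool and its version numbers are hypothetical, and a temporary directory stands in for the /install/contrib/OS_release/arch directory:

```shell
# Hypothetical example: only the lower version of "mytool" may remain in the
# contrib directory, because the image profile would otherwise install 2.0.
contrib=$(mktemp -d)                       # stands in for /install/contrib/rhels6.4/x86_64
touch "$contrib/mytool-1.0-1.x86_64.rpm"   # the lower version that you want installed
touch "$contrib/mytool-2.0-1.x86_64.rpm"   # the higher version; must be removed

rm "$contrib"/mytool-2.0-*.rpm             # keep only the lower version
ls "$contrib"                              # -> mytool-1.0-1.x86_64.rpm
```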

Report data is truncated when the requested report period does not match the configured reporting data collection time interval.

Reference #213677

Reports The reporting data loader collects data for a specific time interval. Data collection is not rescheduled by the requested report period, even after you restart the reporting service.

For example, if you request a report from 22:00 to 22:43 and the default time interval is 300 seconds, the data is truncated. The data is truncated because the data collection stops at 22:40.
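The truncation boundary is the requested end time rounded down to a multiple of the collection interval. This sketch reproduces the 22:43 example above:

```shell
# The data loader only has samples at multiples of the collection interval,
# so a report ending at 22:43 is truncated to the last sample at 22:40.
interval=300                              # default collection interval (seconds)
end=$((22 * 3600 + 43 * 60))              # requested report end time, 22:43
last=$((end / interval * interval))       # last completed collection boundary
printf '%02d:%02d\n' $((last / 3600)) $((last % 3600 / 60))   # prints 22:40
```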

Blade and compute node location information in chassis is not supported in Rack View.

Reference #220740

Resource Dashboard Some blade and compute node location information is defined in the node information file using the slot ID number. These location definitions are not displayed in the Rack View of the Web Portal.
The network bridge interface is configured with the IP address of its Ethernet interface

Reference #241898

Network bridge When you use the custom script xHRM bridgeprereq <Ethernet>:<bridge>, the network bridge interface is configured with the IP address of its Ethernet interface.

After the compute node is provisioned, the network bridge interface ignores the IP address that is assigned to it and uses the Ethernet interface IP address instead.

The network bridge inherits the MAC address and IP address from the port added to it.

In a RHEL 7.1 on Power® BE environment, compute nodes cannot access resources such as LDAP and NFS servers.

Reference #36221

NAT In a RHEL 7.1 on Power BE environment, there is an issue with LDAP integration and problems within an LSF® cluster using an external NFS. This is a result of a RHEL 7.1 issue (Reference number #116824) that exists on Power BE. Compute nodes cannot access resources, such as LDAP and NFS servers.

To resolve this issue, make sure that IBM Spectrum Cluster Foundation Community Edition resources are reachable by compute nodes.

LDAP cannot be enabled on Ubuntu nodes.

Reference #110936

LDAP LDAP cannot be enabled on Ubuntu nodes. Any attempt to enable LDAP (using enableLDAP.sh) on Ubuntu nodes results in libnss-ldap package errors and system failure.
IBM Spectrum Scale™ cluster deployment supports only compute node host names that contain lowercase alphanumeric characters.

Reference #112317

IBM Spectrum Scale To deploy an IBM Spectrum Scale cluster, make sure that compute node host names contain only lowercase alphanumeric characters. For example: node1.
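A quick pre-deployment check along these lines can catch invalid names before the cluster is deployed; the host names below are hypothetical:

```shell
# Flag compute node host names that contain anything other than lowercase
# alphanumeric characters, which IBM Spectrum Scale deployment requires.
for h in node1 node2 Node_3; do
  case "$h" in
    *[!a-z0-9]*) echo "$h: invalid" ;;
    *)           echo "$h: ok" ;;
  esac
done
# prints:
#   node1: ok
#   node2: ok
#   Node_3: invalid
```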

This section details the known issues in version 4.2.2, along with possible workarounds.

Table 2. IBM Spectrum Cluster Foundation Community Edition Version 4.2.2 known issues
Issue Affects Description Resolution or action
NIC name is incorrect on compute nodes.

Reference #218194

Devices During stateless provisioning, the node status is displayed correctly but the NIC name is incorrect. A compute node's original NIC name might be changed to a new NIC name. Reload the network driver to restore the original NIC name:
  1. Check the network driver for the NIC:
    # ethtool -i eth0-rename
    driver: bnx2
    version: 2.2.1
    firmware-version: bc 1.8.0
    bus-info: 0000:05:00.0
  2. Log in to the compute node's console and reload the network driver:
    # rmmod bnx2
    # modprobe bnx2
Adding an OS update from an official website causes a node operation error.

Reference #212472

OS updates Adding an OS update from an official RHEL, CentOS, or SLES website can cause a node operation error.

Some RPM packages on a distributor's official website might have unresolved package dependency issues.

When a package with a resolved dependency is available, you can readd the package and apply the OS update again.

To recover from this issue:
  1. In the Web Portal, disassociate the OS update from the image profile.
  2. Reprovision the nodes:
    • For a stateful node: reinstall the node.
    • For a stateless node: rebuild the image and then reinstall the node.
Failed to uninstall OS update RPM packages on stateful nodes after OS update is removed from image profile.

Reference #211463

OS updates After an OS update is removed from an image profile and the nodes are synchronized, the OS update RPM packages remain installed on a stateful node. To uninstall OS update RPM packages on a stateful node, you must complete the following steps:
  1. Remove the OS update from the image profile.
  2. Reprovision the compute nodes.
The Web Portal does not display the total memory of a node after the node is provisioned or replaced.

Reference #216589

Web Portal After you provision a compute node, the compute node's free memory does not include the total memory. At this time, the total memory is less than the used memory and a dash (-) is displayed in the Web Portal. The Web Portal refreshes at the next scheduled interval. After a node is provisioned or replaced, you can update the Web Portal display immediately by running the following command:
plcclient.sh -d hostconfigloader
Node discovery fails and displays a "no free leases" error in the /var/log/messages file.

Reference #211896

Node discovery After you add compute nodes to the cluster using node discovery, the nodes are not added to the cluster even after the compute nodes are powered on. The following error is found in the /var/log/messages file on one line:
mgmt dhcpd: DHCPDISCOVER from c8:0a:a9:c8:a4:55 via eth1: network eth1: no free leases
You must set the subnet IP or netmask values for the provision interface in the Web Portal. The subnet IP or netmask values must be the same as the values that are used by the provisioning interface on the management node.

For example, if the provision interface has an IP address of 11.0.0.1/24, then you must create a network in IBM Spectrum Cluster Foundation Community Edition that has a subnet IP of 11.0.0.0 and a netmask of 255.255.255.0.
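The subnet IP in this example is the bitwise AND of the interface address and the netmask. A minimal POSIX shell sketch (the helper name subnet_of is hypothetical):

```shell
# subnet_of IP NETMASK: print the network (subnet) address that must match
# the network defined in IBM Spectrum Cluster Foundation Community Edition.
subnet_of() {
  oldifs=$IFS; IFS=.
  set -- $1 $2          # split both dotted quads into $1..$8
  IFS=$oldifs
  echo "$(($1 & $5)).$(($2 & $6)).$(($3 & $7)).$(($4 & $8))"
}

subnet_of 11.0.0.1 255.255.255.0   # prints 11.0.0.0
```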

After a node is provisioned, some non-IBM machines might fail on bootup.

Reference #213598

Node provisioning During node provisioning, the OS is installed on the node and the node reboots. The node reboot might fail with the following error:
 No more network devices
One possible resolution is to complete the following steps:
  1. Back up the original file:
    # cp /opt/xcat/lib/perl/xCAT_plugin/xnba.pm /tmp/xnba.pm.ORIG
  2. Apply the patch:
    # sed -i -e 's/exit\\n/sanboot --no-describe --drive 0x80\\n/' /opt/xcat/lib/perl/xCAT_plugin/xnba.pm
  3. Restart the xCAT daemon.
    # service xcatd restart
  4. Add compute nodes and provision as normal.
Note: This resolution works on some hardware models.
Node status set to defined after node was provisioned successfully.

Reference #243956

Node provisioning Node status does not reflect the actual node status after a node was successfully provisioned. Node status remains set to defined. Ensure that the resolv.conf configuration file in the /etc directory specifies the correct private network and the IP address of the management node in the provisioning network.
For example, edit the resolv.conf file, then restart the xCAT daemon.
# cat /etc/resolv.conf
search private.dns.zone  
nameserver 192.168.1.100 
# service xcatd stop
# service xcatd start
On some non-IBM machines, the default network profile cannot be used since the network device eth0 does not exist.

Reference #58209 (207606)

Network profiles By default, the default network profile assumes that the compute nodes use eth0 as the provision network interface. If they do not, you must create a new network profile or edit the default network profile to match the network interface that is actually used.

For example, some servers use em1 to connect to the provision network. In that case, the network profile must use em1 instead of the default eth0 naming convention.

Create a network profile that uses the correct naming convention.
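On typical Linux systems, you can list the kernel's network device names on a compute node to find the interface that is actually connected to the provision network:

```shell
# List the network interface names known to the kernel; use the one that is
# connected to the provision network (for example em1) in the network profile.
ls /sys/class/net
```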
Monitoring Agent status is sometimes incorrect.

Reference #220992

Monitoring Agent After a node is provisioned, the monitoring agent is not started correctly and the monitoring agent status is Unavailable. The monitoring agent cannot be started if the time in the BIOS is set to a different time than the real current time. To resolve this issue, run the following command to restart the monitoring agent:
xdsh noderange "source /opt/pcm/ego/profile.platform; egosh ego shutdown -f; egosh ego start -f"
where noderange is a list of nodes or node groups.
NFS server error in the NFS log file.

Reference #220799

NFS server The following NFS server error is found in the NFS log file:
kernel: nfsd: too many open connections, consider increasing the number of threads.
This error is caused by too many connections for the number of threads.

To resolve this error, update the number of threads running on the NFS server. The number of threads that are running must match the scale of the cluster.

Note: If the cluster has 300 nodes that are provisioned, and 200 nodes are to be synchronized at the same time, then the NFS thread number must be at least 200.
For RHEL:
  1. Update the thread number definition in the NFS configuration file /etc/sysconfig/nfs. To update the number of threads, change the value of the RPCNFSDCOUNT parameter. For example:
    RPCNFSDCOUNT=32
  2. Restart the NFS server.
    /etc/init.d/nfs restart
For SLES:
  1. Update the thread number definition in the NFS configuration file /etc/sysconfig/nfs. To update the number of threads, change the value of the USE_KERNEL_NFSD_NUMBER parameter. For example:
    USE_KERNEL_NFSD_NUMBER = 32
  2. Restart the NFS server.
    /etc/init.d/nfsserver restart
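The RHEL edit in the steps above can be wrapped in a small helper; the function name set_nfsd_threads is hypothetical:

```shell
# set_nfsd_threads FILE COUNT: set RPCNFSDCOUNT in an NFS sysconfig file.
# On a RHEL management node, FILE would be /etc/sysconfig/nfs.
set_nfsd_threads() {
  sed -i "s/^#*RPCNFSDCOUNT=.*/RPCNFSDCOUNT=$2/" "$1"
}

# Example: 200 nodes synchronize concurrently -> at least 200 threads,
# then restart the NFS server with /etc/init.d/nfs restart:
#   set_nfsd_threads /etc/sysconfig/nfs 200
```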
Browser is unresponsive.

Reference #225460

Web Portal Adding many nodes (2500 nodes and greater) can cause the Web Portal to be unresponsive in Internet Explorer (IE). Close all IE processes, and open a new browser. If the problem persists, use a different supported browser such as Firefox.
Using the Web Portal in Internet Explorer 9, nodes do not synchronize after updating an image profile.

Reference #226841

Image profiles Using the Web Portal in Internet Explorer 9, nodes do not synchronize even if the automatic synchronization option is selected. After the image profile is updated, synchronize the nodes from the nodes list page using the More > Synchronize option.
Compute nodes cannot reach external networks through network address translation (NAT).

Reference #244354

NAT forwarding On RHEL7 PPC64, compute nodes cannot reach external networks through NAT that is set up on the management node. As a result, compute nodes cannot SSH to an external server, such as an LDAP server, and in the case of an LDAP server, users cannot log in to compute nodes. First, check if RHEL has any updates that include IP forwarding.

If not, then make sure to configure the system using a network topology where compute nodes can access external networks directly, and not through NAT forwarding on a management node.

The LSF master node is reinstalled and the LSF compute nodes do not rejoin the LSF cluster.

Reference #241161

LSF cluster template If compute node sharing is enabled in the LSF cluster, this can cause problems when the LSF master node is reinstalled. Node sharing is enabled by setting the LSF_SHARE_CN variable to Y in the cluster template.

When the LSF master node is reinstalled and the post-provision script is executed by the pcm-run-cluster-script-layers command, the compute nodes cannot join the LSF cluster because they cannot mount LSF from the NFS server.

To have the compute nodes rejoin the LSF cluster, reboot the compute nodes.

To reboot the LSF compute nodes, from the Web Portal, go to the Resources tab and click Infrastructure > Nodes. Select the LSF compute nodes and click Power > Reset.

If you are creating a secure VLAN network and you specify multiple NICs to the same switch in a node information file, node provisioning using the switch discovery method fails.

Reference #245489

VLAN Node provisioning fails if you are setting up a secure VLAN network, the nodes that you are provisioning have multiple NICs all connected to the same switch, and the nodes are provisioned using the switch discovery method with a copy of the default RHEL 6.5 image profile for x86 or Power systems. The provisioning failure is caused by an error in the kickstart configuration template. To resolve this issue, fix the kickstart configuration template:
  1. In the image profile copy, replace the kickstart configuration file with the kickstart configuration file from the original image profile. For example, copy the original kickstart configuration file /opt/xcat/share/xcat/install/rh/compute.rhels6.tmpl to the location of your image profile copy, /install/osimages/rhels6.5-x86_64-stateful-compute_Copy_1/compute.tmpl.
  2. Retry node provisioning with the switch discovery method, using the updated image profile copy.
IBM Spectrum Cluster Foundation Community Edition installation error occurs when installing IBM Spectrum Cluster Foundation Community Edition on an Ubuntu management node.

Reference #27760

Installation An installation error occurs when installing IBM Spectrum Cluster Foundation Community Edition on an Ubuntu management node. The following error messages are displayed:

Failed to remove package xcat-server. Use the "rpm -e --nodeps xcat-server" command on Linux or the "dpkg --purge --force-all xCAT-server" command on Ubuntu to remove this package, and restart the installation.

Failed to remove package xcat-client. Use the "rpm -e --nodeps xcat-client" command on Linux or the "dpkg --purge --force-all xcat-client" command on Ubuntu to remove this package, and restart the installation.

Use the rpm -e --nodeps xcat-server command on Linux or the dpkg --purge --force-all xCAT-server command on Ubuntu to remove this package, and restart the installation.

If you restart the installation and get the same errors, reboot the management node and install IBM Spectrum Cluster Foundation Community Edition again.

Failed to build Ubuntu stateless image profile.

Reference #36725

Image profiles After removing OS packages for a stateless Ubuntu compute node, the following error message is displayed in the Web Portal:

Cannot build image for target image profile.

The Ubuntu compute node hangs while provisioning and fails to reprovision.
To resolve this issue and provision the Ubuntu compute node, remove the aide package from the Ubuntu stateless image profile and reprovision the node.
Ubuntu kernel packages cannot be updated.

Reference #31336

OS updates IBM Spectrum Cluster Foundation Community Edition cannot update Ubuntu kernel-related packages such as linux-image-extra-<version> or linux-headers-<version>. To update Ubuntu kernel-related packages on compute nodes, do the following:
  1. Log in to the management node using the command-line interface.
  2. Create an OS distribution update.
    pcmosdistroupdate -c ubuntu<version>-<arch> -p <pkgdir>
    For example:
    pcmosdistroupdate -c ubuntu14.04.2-x86_64 -p pkgdir
  3. Log in to the IBM Spectrum Cluster Foundation Community Edition Web Portal.
  4. Navigate to the Image Profile page and associate the OS update that you just created with the image profile that the compute nodes are using. Make sure to leave the Automatically synchronize nodes check box checked in the confirmation dialog.
  5. Log in to the management node using SSH and run the following command:
     xdsh <noderange> apt-get -y --force-yes dist-upgrade
  6. After Step 5 is completed and the nodes are updated, reboot the compute nodes. For stateful compute nodes, the reboot operation updates the kernel version. For stateless compute nodes, the reboot operation reprovisions the compute node and updates the kernel version.
A segmentation fault error occurs when adding an OS distribution.

Reference #37266

OS distribution When adding an OS distribution, the following segmentation fault error appears:

PAM adding faulty module: /usr/lib64/security/pam_fprintd.so

segfault at 7a40 ip 0000000000007a40 sp 00007fff6b5469d8 error 14 in libattr.so.1.1.0[7fb95466b000+4000]

By default, the fingerprint service is disabled. However, it can be enabled after installation.
To resolve this issue, disable the fingerprint service by running the following command:
authconfig --disablefingerprint --update
Problems are encountered when adding a host name that starts with a number.

Reference #38757

Host name Host names cannot start with a number. If a host name starts with a number, various error messages can be found in the pcmd log file (/opt/pcm/pcmd/log/pcmd.log), such as:

Execution resource does not belong to the allocation.

Remove the node from IBM Spectrum Cluster Foundation Community Edition, and readd the node using a host name that begins with a letter.
In a high availability environment, a RHEL 7.x stateless image profile failed to build.

Reference #43196

High availability In a RHEL 7 and later high availability environment, stateless image profile creation fails as a result of a defect in the RHEL 7.x operating system (Reference number #124177).

To resolve this issue, regenerate each stateless image profile copy using the genimage command.

The genimage command generates the stateless image profile using the rootimg directory. This directory must be deleted and re-created for every stateless image profile by following these steps:
  1. Create a directory in the /tmp directory.
  2. Change the rootimgdir property of the stateless image profile to the new directory created in step 1.
  3. Run the genimage command.
  4. Run the packimage command.
  5. Change the rootimgdir property back to its original value.
  6. Move the new packimage packages to the original rootimage directory.
For example:
# mkdir -p /tmp/rootimage/rhels7.0-ppc64-stateless-compute

# chdef -t osimage -o rhels7.0-ppc64-stateless-compute rootimgdir=/tmp/rootimage/rhels7.0-ppc64-stateless-compute

# genimage -t 512m --ignorekernelchk -n e1000,e1000e,igb,ibmveth,ehea rhels7.0-ppc64-stateless-compute

# packimage rhels7.0-ppc64-stateless-compute

# chdef -t osimage -o rhels7.0-ppc64-stateless-compute rootimgdir=/install/osimages/rhels7.0-ppc64-stateless-compute/rootimage

# mv /tmp/rootimage/rhels7.0-ppc64-stateless-compute/*  /install/osimages/rhels7.0-ppc64-stateless-compute/rootimage
Each image profile that you copy still uses the old link to the local directory and must be re-created to use the new local directory.
Authentication error found in pcmd log.

Reference #43399

Authentication The following authentication error is found in the pcmd log:

ERROR [Resource Monitor] pcmd - Resource Monitor error while retrieving server information due to: Authentication failed, credential expired

ERROR [Resource Monitor] pcmd - Authentication failed, credential expired

To resolve this issue, restart the PCMD service:
pcmadmin service restart --service=PCMD
Locked out of IBM Spectrum Cluster Foundation Community Edition after 5 failed log in attempts.

Reference #105644

Web Portal When logging in to IBM Spectrum Cluster Foundation Community Edition, whether through the Web Portal or through SSH to a node, access is locked after 5 incorrect login attempts. To resolve this issue, wait 5 minutes and try again. After 5 minutes, the account is unlocked.
IBM Spectrum Cluster Foundation Community Edition uses the same PAM login authentication as the operating system. To change the behavior of the account lock and unlock capabilities, do the following steps on the management node:
  1. Log in to the management node as the root user.
  2. Update the PAM configuration file.
    # cd /etc/pam.d
    # cp -p password-auth-ac password-auth-ac.orig
    # cp -p system-auth-ac system-auth-ac.orig
  3. Edit the password-auth-ac and system-auth-ac files to include the following lines:
    auth required pam_env.so
    auth required pam_faillock.so preauth silent audit deny=5 even_deny_root unlock_time=300
    
    ...
    auth sufficient pam_unix.so nullok try_first_pass
    auth [default=die] pam_faillock.so authfail audit deny=5 even_deny_root unlock_time=300
    ...
    
    account required pam_faillock.so
    account required pam_unix.so broken_shadow
MPI installation fails on CentOS 7.2 x86.

Reference #110866

MPI When deploying an LSF cluster that enables MPI installation, MPI installation fails on CentOS 7.2 x86 nodes.

MPI installation fails with the following error: ERROR Failed to install prerequisite package glibc.i686.

To resolve this issue:
  1. Download the CentOS 7.2 Everything ISO from the official CentOS site.
  2. Extract the following files from the ISO:
    • glibc-2.17-105.el7.i686.rpm
    • libgcc-4.8.5-4.el7.i686.rpm
    • nss-softokn-freebl-3.16.2.3-13.el7_1.i686.rpm
  3. Add these three packages to the CentOS 7.2 image profile:
    1. Copy the 3 packages to the /install/contrib/centos7.2/x86_64 directory.
    2. Load the custom packages.
      plcclient.sh -d pcmimageprofileloader 
    3. Add the 3 packages to the image profile.
      1. Select the Resources tab, go to Node Provisioning > Provisioning Templates > Image Profiles.
      2. Select the image profile that you want to add a package to, and click Modify.
      3. Go to the Packages tab.
      4. Add a package by selecting the custom packages that you want to add.
      5. Click OK to save the changes.
  4. Recreate the LSF cluster with MPI enabled in the cluster template.
Error message encountered when adding RHEL 7.2 Power BE to IBM Spectrum Cluster Foundation Community Edition using the Web Portal.

Reference #110688

OS distribution The following error message is encountered when adding RHEL 7.2 Power BE to IBM Spectrum Cluster Foundation Community Edition using the Web Portal:

Can not create the default image profiles for this OS distribution. To resolve this error, remove the OS distribution and add it again. If the image profiles for this OS distribution exist, remove the image profiles before removing the OS distribution.

This error message can be ignored. The OS distribution is added and the corresponding image profile is created.
In a high availability environment, the EGO services cannot be started after services are started on the standby management node.

Reference #96646

High availability In a high availability environment that uses IBM Spectrum Scale shared storage, when the management node is powered off or rebooted, a failover to the standby management node is triggered.

After the standby management node is started, services resume after a few minutes. To check that all services are running, run the pcmhatool check command.

If the EGO service failed to start automatically, it is listed in a failed state and must be started manually.

To start the EGO service, run the following command on the standby management node:
# service pcm start