
Readme and Release Notes for IBM Spectrum Scale 5.0.4.4: Spectrum_Scale_Data_Management-5.0.4.4-ppc64LE-Linux

Fix Readme


Abstract

This package provides the IBM Spectrum Scale 5.0.4.4 fix update for the Data Management Edition on Linux on Power Little Endian (ppc64LE) systems.

Content

Readme file for: Spectrum Scale
Product/Component Release: 5.0.4.4
Update Name: Spectrum_Scale_Data_Management-5.0.4.4-ppc64LE-Linux
Fix ID: Spectrum_Scale_Data_Management-5.0.4.4-ppc64LE-Linux
Publication Date: 30 April 2020
Last modified date: 30 April 2020

Installation information

Download location

Below is a list of components, platforms, and file names that apply to this Readme file.

Fix Download for Linux

Product/Component Name   Platform               Fix
IBM Spectrum Scale       Linux PPC64LE RHEL     Spectrum_Scale_Data_Management-5.0.4.4-ppc64LE-Linux
                         Linux PPC64LE SLES
                         Linux PPC64LE Ubuntu

Prerequisites and co-requisites

  • - Prerequisites

    You may use this 5.0.4.4 package to perform a first-time install or to upgrade from an existing 4.2.0.0 - 5.0.4.3 level (if upgrading node by node) or 3.5.0 - 5.0.4.3 level (if you shut down and upgrade the entire cluster).
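
    Before choosing an upgrade path, you can verify the currently installed level on each node. The following is a minimal sketch only (rpm applies to RHEL and SLES, dpkg to Ubuntu; the reported levels will vary with your installation):

      mmdiag --version     (shows the running GPFS build level)
      rpm -q gpfs.base     (RHEL and SLES: installed base package level)
      dpkg -s gpfs.base    (Ubuntu: installed base package level)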

Known issues

  • - Problems discovered in IBM Spectrum Scale releases

    None.

Installation information

  • - Downloading Images

    Choose the download option "Download using Download Director" to download the new Spectrum Scale package and place it in any location desired on the install node.
    Note: if you must use the download option "Download using your browser (HTTPS)" (not recommended), then instead of clicking the down arrow to the left of the package name, you must right-click the package name and select the Save Link As... option. If you just click the download arrow, the browser will likely hang.

  • - Installing IBM Spectrum Scale update for Linux on Power Little Endian Systems

    After you have downloaded the IBM Spectrum Scale 5.0.4.4 update, follow the steps below to install the fix package:

    1. Ensure the package is executable by using the ls -l command.

      You should see output with permissions similar to:
           -rwxr--r-- 1 root root 110885866 Apr 27 15:52 /download_dir/package_name

      If it is not executable, you can make the package executable by using the following command:
           chmod +x /download_dir/package_name
    2. Extract the RPM and Debian Linux packages from the downloaded self-extracting package by using the following commands:

      For Standard Edition:

      ./Spectrum_Scale_Standard-5.0.4.4-ppc64LE-Linux-install

      For Advanced Edition:

      ./Spectrum_Scale_Advanced-5.0.4.4-ppc64LE-Linux-install

      For Data Management Edition:

      ./Spectrum_Scale_Data_Management-5.0.4.4-ppc64LE-Linux-install

      For Data Access Edition:

      ./Spectrum_Scale_Data_Access-5.0.4.4-ppc64LE-Linux-install

      For further installation instructions refer to: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.4/com.ibm.spectr…

      Optional Packages for SLES and Red Hat Enterprise Linux:

      • gpfs.docs-5.0.4-4.noarch.rpm
      • gpfs.gss.pmcollector-5.0.4-4.xxx.*.rpm (where xxx is the OS version)
      • gpfs.gss.pmsensors-5.0.4-4.xxx.*.rpm (where xxx is the OS version)
      • gpfs.gui-5.0.4-4.noarch.rpm
      • gpfs.java-5.0.4-4.*.rpm
      • gpfs.kafka-5.0.4-4.*.rpm (x86_64 only)
      • gpfs.librdkafka-5.0.4-4.*.rpm (x86_64 only)
      • gpfs.hdfs-protocol-3.0.0-0.*.rpm (x86_64, ppc64, and ppc64le only)
      • gpfs.hdfs-protocol-2.7.3-0.*.rpm (x86_64, ppc64, and ppc64le only)
      • gpfs.hdfs-protocol-3.1.0-0.*.rpm (x86_64, ppc64, and ppc64le only)
      • gpfs.hdfs-protocol-3.1.1-0.*.rpm (x86_64, ppc64, and ppc64le only)
      • gpfs.tct.client-1.1.7*.rpm (IBM Spectrum Scale Advanced or Data Management Edition only, x86_64 and ppc64le only)
      • gpfs.tct.server-1.1.7*.rpm (IBM Spectrum Scale Advanced or Data Management Edition only, x86_64 and ppc64le only)

      Optional Packages for Ubuntu Linux:

      • gpfs.docs_5.0.4-4_all.deb
      • gpfs.gui_5.0.4-4_all.deb
      • gpfs.java_5.0.4-4_*.deb
      • gpfs.kafka_5.0.4-4_*.deb (x86_64 only)
      • gpfs.librdkafka_5.0.4-4_*.deb (x86_64 only)
      • gpfs.gss.pmcollector_5.0.4-4.xxx_*.deb (where xxx is the OS version)
      • gpfs.gss.pmsensors_5.0.4-4.xxx_*.deb (where xxx is the OS version)
      • gpfs.tct.client-1.1.7*.deb (IBM Spectrum Scale Advanced or Data Management Edition only, x86_64 only)
    3. Follow the installation instructions in the IBM Spectrum Scale Installing and upgrading documentation. A condensed sketch of an update sequence follows this list.
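
    As an illustration only, a condensed RHEL/SLES update sequence might look like the following. The extraction directory and package selection here are assumptions; use the packages that match your edition and installed features, and follow the Knowledge Center instructions above as the authoritative procedure:

      cd /usr/lpp/mmfs/5.0.4.4/gpfs_rpms          (confirm the extraction path printed by the installer)
      rpm -Uvh gpfs.base*.rpm gpfs.gpl*.rpm gpfs.gskit*.rpm gpfs.msg*.rpm gpfs.license*.rpm gpfs.compression*.rpm
      /usr/lpp/mmfs/bin/mmbuildgpl                (rebuild the GPFS portability layer against the running kernel)
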
  • - Upgrading GPFS nodes

    Note that in the instructions below, node-by-node upgrade cannot be used to migrate from GPFS 4.1 or earlier releases. For example, upgrading from 4.1.1.16 to 5.0.4.4 requires a complete cluster shutdown, an upgrade install on all nodes, and then a cluster startup.

    Upgrading GPFS may be accomplished either by upgrading one node in the cluster at a time or by upgrading all nodes in the cluster at once. When upgrading GPFS one node at a time, the steps below are performed on each node in the cluster sequentially. When upgrading the entire cluster at once, GPFS must be shut down on all nodes in the cluster prior to upgrading.

    When upgrading nodes one at a time, you may need to plan the order of the nodes to upgrade. Verify that stopping each particular node does not cause quorum to be lost, and that no node being stopped is the last NSD server for some disks. Upgrade the quorum and manager nodes first. When upgrading the quorum nodes, upgrade the cluster manager last to avoid unnecessary cluster failover and election of new cluster managers.

    1. Prior to upgrading GPFS on a node, all applications that depend on GPFS (e.g. DB2) must be stopped. Any GPFS file systems that are NFS exported must be unexported prior to unmounting GPFS file systems.
    2. Stop GPFS on the node. Verify that the GPFS daemon has terminated and that the kernel extensions have been unloaded (mmfsenv -u). If the mmfsenv -u command reports that it cannot unload the kernel extensions because they are "busy", the install can proceed, but the node must be rebooted after the install. "Busy" means that some process has its current directory in a GPFS file system directory or holds an open file descriptor there. The freeware program lsof can identify the process, and the process can then be killed. Retry mmfsenv -u; if it succeeds, a reboot of the node can be avoided. A condensed command sketch follows these steps.
    3. For upgrade instructions refer to: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.4/com.ibm.spectr…
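
    Put together, a node-by-node upgrade of a single node follows this general shape (a sketch only; /gpfs/fs0 is a placeholder file system path, and the package update itself is described in the installation steps above):

      mmumount all       (unmount all GPFS file systems on this node; unexport NFS exports first)
      mmshutdown         (stop GPFS on this node)
      mmfsenv -u         (verify the kernel extensions can be unloaded)
      lsof /gpfs/fs0     (if mmfsenv -u reports "busy", identify the offending processes, stop them, and retry)
      (install the update packages and rebuild the portability layer)
      mmstartup          (restart GPFS on this node)
      mmmount all        (remount the GPFS file systems)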

Additional information

  • - Notices
  • - Package information

    The images listed below and contained in the Self Extracting Package (SE-Package) are maintenance packages for IBM Spectrum Scale. The images are standard RPM or DEB packages that can be applied directly to your system.

    The packages can be used for new install or update from a prior level of IBM Spectrum Scale.

    After all RPMs or DEBs are installed, you have successfully updated your IBM Spectrum Scale product.

    Before installing IBM Spectrum Scale, it is necessary to verify that you have the correct levels of the prerequisite software installed on each node in the cluster. If the correct level of prerequisite software is not installed, see the appropriate installation manual before proceeding with your IBM Spectrum Scale installation.

    For the most up-to-date list of prerequisite software, see the IBM Spectrum Scale FAQ in the IBM® Knowledge Center.

    Update to Version:

    5.0.4.4

    Update from Version:

    4.2.0.0 - 5.0.4.3 (if upgrading node by node)
    3.5.0 - 5.0.4.3 (if you shut down and upgrade the entire cluster)

    SE Package Content (SLES and RHEL Linux):

    • gpfs.msg.en_US-5.0.4-4.noarch.rpm
    • gpfs.base-5.0.4-4.*.rpm
    • gpfs.gpl-5.0.4-4.noarch.rpm
    • gpfs.docs-5.0.4-4.noarch.rpm
    • gpfs.compression-5.0.4-4.*.rpm
    • gpfs.gskit-8.0.50-86.*.rpm
    • gpfs.gui-5.0.4-4.noarch.rpm
    • gpfs.hdfs-protocol-*.rpm (x86_64, ppc64, and ppc64le only)
    • gpfs.java-5.0.4-4.*.rpm
    • gpfs.license.xxx-5.0.4-4.*.rpm (where xxx is the license type)
    • gpfs.gss.pmcollector-5.0.4-4.xxx.*.rpm (where xxx is the OS version)
    • gpfs.gss.pmsensors-5.0.4-4.xxx.*.rpm (where xxx is the OS version)
    • gpfs.kafka-5.0.4-4.*.rpm (x86_64 only)
    • gpfs.librdkafka-5.0.4-4.*.rpm (x86_64 only)
    • gpfs.adv-5.0.4-4.*.rpm (IBM Spectrum Scale Advanced or Data Management Edition only)
    • gpfs.crypto-5.0.4-4.*.rpm (IBM Spectrum Scale Advanced or Data Management Edition only)
    • gpfs.tct.client-1.1.7.*.rpm (IBM Spectrum Scale Advanced or Data Management Edition only, x86_64 and ppc64le only)
    • gpfs.tct.server-1.1.7.*.rpm (IBM Spectrum Scale Advanced or Data Management Edition only, x86_64 and ppc64le only)

    SE Package Content (Ubuntu Linux):

    • gpfs.msg.en_us-5.0.4-4_all.deb
    • gpfs.base-5.0.4-4_*.deb
    • gpfs.gpl-5.0.4-4_all.deb
    • gpfs.docs-5.0.4-4_all.deb
    • gpfs.compression-5.0.4-4_*.deb
    • gpfs.gskit_8.0.50-86.*.deb
    • gpfs.gui_5.0.4-4_all.deb
    • gpfs.java_5.0.4-4_*.deb
    • gpfs.kafka_5.0.4-4_*.deb (x86_64 only)
    • gpfs.librdkafka_5.0.4-4_*.deb (x86_64 only)
    • gpfs.license.xxx_5.0.4-4_*.deb (where xxx is the license type)
    • gpfs.gss.pmcollector_5.0.4-4.xxx_*.deb (where xxx is the OS version)
    • gpfs.gss.pmsensors_5.0.4-4.xxx_*.deb (where xxx is the OS version)
    • gpfs.adv_5.0.4-4_*.deb (IBM Spectrum Scale Advanced or Data Management Edition only)
    • gpfs.tct.client-1.1.7*.deb (IBM Spectrum Scale Advanced or Data Management Edition only, x86_64 only)

    SE-Package contents:

    To view the full list of packages including protocols:

    ./Spectrum_Scale_xxx-5.0.4.4-yyy-Linux-install --manifest (where xxx is the edition and yyy is the architecture (ppc64LE, ppc64, or x86_64))
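
    For example, for the package covered by this readme:

      ./Spectrum_Scale_Data_Management-5.0.4.4-ppc64LE-Linux-install --manifest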

  • - Summary of changes for IBM Spectrum Scale

    Unless specifically noted otherwise, this history of problems fixed for IBM Spectrum Scale 5.0.x applies for all supported platforms.

    Problems fixed in IBM Spectrum Scale 5.0.4.4 [April 30, 2020]

    • Item: IJ23784
    • Problem description: The GPFS kernel module exports an ioctl interface used by the mmfsd daemon and some of the mm* commands. The provided improvements result in a more robust functionality of the kernel module.
    • Work around: None
    • Problem trigger:
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ23789
    • Problem description: The GPFS mmfsd daemon services multiple types of requests received over multiple interfaces. The hardening of the mmfsd daemon results in a more robust functionality.
    • Work around: None
    • Problem trigger:
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24078
    • Problem description: HDR100 IB adapter is shown as '?x ?DR INFINIBAND' in mmfs.log
    • Work around: None
    • Problem trigger: Configure HDR100 IB adapter in verbsRdmaPorts
    • Symptom: Error output/message
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: RDMA
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ23953
    • Problem description: mmlsnsd -X displays bad output for disks that are not found
    • Work around: None
    • Problem trigger: The mmlsnsd -X option only affects clusters with missing underlying disks.
    • Symptom: Command output
    • Platforms affected: ALL
    • Functional Area affected: Admin command
    • Customer Impact: Suggested: has little or no impact on customer operation
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ23984
    • Problem description: If the directory passed to the --directory option has spaces or other special characters in its name, prefetch cannot handle it correctly, and it fails, printing the usage error and exiting.
    • Work around: Run prefetch with the list file option when directory prefetch cannot handle the name.
    • Problem trigger: Pass directory names with special characters in their name to the directory prefetch option.
    • Symptom: Error output/message
    • Platforms affected: ALL Linux and AIX
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24115
    • Problem description: A disk problem could cause a newly created log file to become inconsistent, and this in turn could cause a file system panic during log recovery. All attempts to mount the file system will fail when this occurs.
    • Work around: None
    • Problem trigger: Disk error and node failure that require log recovery
    • Symptom: Cluster/File System Outage
    • Platforms affected: ALL
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24057
    • Problem description: On occasion, operations on a dependent fileset do not set the recovery flag; because of this, if the fileset later goes into recovery, recovery will not be triggered. In that case, pending data changes will not replicate to the remote (home) site and data mismatches can be seen.
    • Work around: Create one file on an independent fileset, which will set the recovery flag.
    • Problem trigger: When recovery happens on a dependent fileset and the needRecovery bit is set to 0.
    • Symptom: File size and data mismatches.
    • Platforms affected: All
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24068
    • Problem description: It is challenging to find the file paths of associated inodes among millions or billions of files, including recursive paths, because the tsfindinode utility takes a lot of time and resources to search from a given path.
    • Work around: None
    • Problem trigger: None
    • Symptom: None
    • Platforms affected: All
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24122
    • Problem description: The Spectrum Scale GUI crashes frequently; the log files usually contain postgresql-related or "too many open files" errors.
    • Work around: Reduce the frequency of mmpmon usage or replace its use with pmcollector.
    • Problem trigger: mmpmon calls made frequently and/or on many nodes
    • Symptom: Component Level Outage
    • Platforms affected: All
    • Functional Area affected: GUI
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24072
    • Problem description: When adding a descOnly disk to a file system, mmadddisk may incorrectly report that the system pool has a mixture of standard and vdisk-based NSDs.
    • Work around: Use the --force-nsd-mismatch option to mmadddisk
    • Problem trigger: The problem will be seen when a descOnly disk is being added to a filesystem and the cluster contains vdisk-based NSDs.
    • Symptom: Unexpected Results
    • Platforms affected: All
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24158
    • Problem description: While migrating a file, dm_set_dmattr failed with rc 9 (EBADF)
    • Work around: None
    • Problem trigger: Migrating a file
    • Symptom: Migration fails
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24358
    • Problem description: AFM gateway node crashes if the fileset target is a local filesystem from the same cluster. This happens while trying to access the local name for the target filesystem which will be NULL for the local filesystem.
    • Work around: None
    • Problem trigger: AFM fileset access with target as a local filesystem
    • Symptom: Crash
    • Platforms affected: All Linux OS environments
    • Functional Area affected: AFM and AFM DR
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24379
    • Problem description: Orphan files are created during the readdir and will get repaired on the subsequent lookup. While deleting the files from the AFM fileset on multiple nodes, FSErrBadInodeStatusBit error is reported due to a race condition where multiple nodes try to deallocate the same inode which belonged to the orphan file. 
    • Work around: None
    • Problem trigger: Deleting orphan files from the AFM fileset
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: All OS environments
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24174
    • Problem description: The AFM prefetch program crashes while handling path names longer than 4096 characters. Also, the prefetch program does not handle symlinks, which can result in a loop.
    • Work around: Use list file prefetch instead of directory prefetch
    • Problem trigger: AFM migration with long path names or recursive symbolic links to directories 
    • Symptom: Crash
    • Platforms affected: All Linux OS environments
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24452
    • Problem description: The copy-on-write to a snapshot could be missed when doing fast direct I/O writes, causing the inode or file data not to be copied to the snapshot before a file is modified.
    • Work around: Stop using snapshots, or apply this fix.
    • Problem trigger: Doing small direct I/O when there is a snapshot created for the file system.
    • Symptom: FSErrSnapInodeModified structure error.
    • Platforms affected: All OS environments
    • Functional Area affected: Direct IO with snapshot
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24294
    • Problem description: On RHEL 7.7 nodes, with supported GPFS versions 4.2.3.18 or higher and 5.0.4.0 or higher, when the kernel is upgraded to version 3.10.0-1062.18.1 or higher, the node may encounter a kernel crash when accessing a deleted directory.
    • Work around: None
    • Problem trigger: Accessing a directory which has been deleted
    • Symptom: Abend/Crash
    • Platforms affected: All RHEL 7.7 OS environments with kernel version equal or higher than 3.10.0-1062.18.1
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24455
    • Problem description: When a file is being compressed or recompressed, a small write could update the cached data but be dropped after the compression process is done.
    • Work around: Stop doing file compression while small sequential I/O is in progress.
    • Problem trigger: Doing small sequential I/O while file is being compressed.
    • Symptom: Silent data loss.
    • Platforms affected: All OS environments
    • Functional Area affected: File compression or small sequential writes
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22894
    • Problem description: Directory files are considered metadata files, and their disk blocks are allocated from the system pool. Quota counts disk usage in data subblock units. In file systems where the metadata pool and data pool subblock sizes are different, there could be a discrepancy (due to approximation) between what is actually allocated (in metadata subblock units) and what is tracked by quota.
    • Work around: None
    • Problem trigger: Different metadata and data subblock sizes. 
    • Symptom: Inconsistent mmcheckquota results in file systems with different metadata and data subblock sizes.
    • Platforms affected: All OS environments
    • Functional Area affected: Quotas
    • Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24456
    • Problem description: On Linux nodes with kernel v4.7 or later, when getting POSIX ACL data via the getxattr syscall and the buffer size is equal to the POSIX xattr string value length, the POSIX ACL data size is changed to the input buffer size by setxattr. The kernel may crash because the input buffer of getxattr gets overwritten.
    • Work around: None
    • Problem trigger: Kernel v4.7 or later
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: Linux nodes with kernel v4.7 or later
    • Functional Area affected: All Scale Users
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ23986
    • Problem description: Problem with mmdsh to do a remote copy of a file
    • Work around: Use scp to do a remote copy.
    • Problem trigger: Using an undocumented option of mmdsh to do a remote copy.
    • Symptom: Error output/message
    • Platforms affected: All
    • Functional Area affected: N/A
    • Customer Impact: Suggested: has little or no impact on customer operation
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ24488
    • Problem description: Offline fsck requires a certain amount of pagepool memory to run with a single inode scan pass. If the needed amount of pagepool memory is not available, it displays a warning message before starting the fsck scan, indicating the number of inode scan passes it will take with the currently available pagepool memory and the amount of pagepool memory it would need to run a complete single inode scan pass. For example, the following is the message displayed by fsck when there is insufficient pagepool memory available to run with a single inode scan pass:
      Available pagepool memory will require 3 inode scan passes by mmfsck. To scan inodes in a single pass, total pagepool memory of 11767119872 bytes is needed. The currently available total memory for use by mmfsck is 8604614656 bytes. Continue fsck with multiple inode scan passes? n
      The problem is that in some cases fsck displays an incorrect value for the pagepool memory needed. Another side effect of this problem is that in some cases fsck might not show the above message and instead shows the following incorrect message:
      There is not enough free memory available for use by mmfsck in . Continue fsck with multiple inode scan passes? n
    • Work around: There is no specific workaround; you can either continue running fsck with multiple inode scan passes, or increase the pagepool incrementally until fsck can run with a single inode scan pass.
    • Problem trigger: This issue is most likely to trigger when offline fsck is run on a large file system where the nodes do not have enough pagepool memory.
    • Symptom: Incorrect value in message
    • Platforms affected: All
    • Functional Area affected: FSCK
    • Customer Impact: Suggested: has little or no impact on customer operation
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

    This update addresses the following APARs: IJ22894 IJ23784 IJ23789 IJ23953 IJ23984 IJ23986 IJ24057 IJ24068 IJ24072 IJ24078 IJ24115 IJ24122 IJ24158 IJ24174 IJ24294 IJ24358 IJ24379 IJ24452 IJ24455 IJ24456 IJ24488

    Problems fixed in IBM Spectrum Scale 5.0.4.3 [March 5, 2020]

    • Item: IJ22506
    • Problem description: In rare cases, block allocation could fail on client nodes while mmadddisk command is running. This could cause user application to see unexpected E_IO error while mmadddisk command is running.
    • Work around: None
    • Problem trigger: Run mmadddisk command
    • Symptom: IO error
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22580
    • Problem description: On Linux node, GPFS only allows root (CAP_SYS_ADMIN) to set the security xattr from setxattr syscall, which differs from Linux native filesystems, that file owner (CAP_FOWNER) can also set the security xattr when security module loaded on the system.
    • Work around: None
    • Problem trigger: Design issue
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: Linux
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ23001
    • Problem description: tsfindinode could skip scanning additional filesets/snapshots after encountering an error while trying to open a directory for scanning. This could cause tsfindinode to not find all the files.
    • Work around: Run tsfindinode multiple times and avoid changing directory tree while tsfindinode is running.
    • Problem trigger: Running workload that changes directory tree while tsfindinode is running.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: All
    • Functional Area affected: Admin Commands
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21773
    • Problem description: The ignoreReplicationForQuota configuration variable hides file system replication for quota information such that the disk consumption due to replication is transparent to the end users. Up until Spectrum Scale 5.0.3, ignoreReplicationForQuota is not a supported variable and it is only available for the mmsetquota command. In Spectrum Scale 5.0.4.0 and later versions the support is expanded to all other quota commands. As a result, the behavior of quota commands in a mixed level cluster is not consistent, depending on where the quota commands are executed.
    • Work around: Execute quota commands on a file system manager node
    • Problem trigger: Executing quota commands on non file system manager node in a mixed level cluster when file system has replication on and ignoreReplicationForQuota configuration variable is set to yes
    • Symptom:
    • Platforms affected: All
    • Functional Area affected: Quota
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22234
    • Problem description: mmnfs was not seeing path names with and without trailing slashes as the same path.
    • Work around: Remove exports with trailing slashes.
    • Problem trigger: Adding an existing export with extra slashes in the path names may cause multiple exports for the path.
    • Symptom:  Unexpected results
    • Platforms affected: ALL
    • Functional Area affected: CES
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22634
    • Problem description: AFM IW fileset shows Dirty state without any write class operations in the queue. With only reads and lookups, it still reflects Dirty, which is slightly misleading.
    • Work around: None
    • Problem trigger: Having many uncached files at the IW cache site and performing ls/read on these files while at the same time looking at "mmafmctl getstate" command's output.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Linux OS and AIX environments
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22688
    • Problem description: AFM relies on .afm/.afmtrash directory at the remote sites to handle conflict operations that arise at the Cache/production site. When conflicts arise, AFM moves the particular entity to .afm/.afmtrash and continues with the operation in question. For non-Scale Filesystem at remote, this .afmtrash directory is not present and such conflicting operations get stuck forever.
    • Work around: None
    • Problem trigger: Having an SW fileset targeting a non-GPFS home where the .afm/.afmtrash directory is not available, and on such an SW fileset creating a sequence of operations that can cause the cache to see a conflict with home and take evasive action to move the entire entity to the .afmtrash directory.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Linux OS and AIX environments
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22664
    • Problem description: A node/thread/terminal running mmafmctl getstate command can get deadlocked with the FS manager while trying to create/delete filesets or while trying to link/unlink them. With dependent filesets linked under AFM filesets the possibility of deadlock increases. 
    • Work around: None
    • Problem trigger: Running the "mmafmctl getstate" command when the FS manager is creating/linking/unlinking/deleting filesets on the same Filesystem.
    • Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22503
    • Problem description: Error NT_STATUS_NONE_MAPPED when trying to set SMB option 'admin users' for a share.
    • Work around: None
    • Problem trigger: Spectrum Scale SMB CLI is used to set SMB option 'admin users' for a share.
    • Symptom: Error NT_STATUS_NONE_MAPPED is reported by SMB CLI.
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: SMB
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22665
    • Problem description: An internally used file is stored in /root space instead of its correct location.
    • Work around: None
    • Problem trigger: Any CES IP move.
    • Symptom: Polluting /root space
    • Platforms affected: Linux only
    • Functional Area affected: CES
    • Customer Impact: High (normally low, but in certain rare situations it could be high.)
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22666
    • Problem description: Kernel crash while shutting down the daemon due to the memory allocation failure.
    • Work around: None
    • Problem trigger: mmshutdown command
    • Symptom: Crash
    • Platforms affected: All
    • Functional Area affected: AFM and AFMDR
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22668
    • Problem description: Client node asserts when the "mmafmctl device getstate" command is invoked and the filesystem is not mounted at the gateway node.
    • Work around: Mount the filesystem at the gateway node
    • Problem trigger: "mmafmctl device getstate" command
    • Symptom: Crash
    • Platforms affected: All
    • Functional Area affected: AFM and AFMDR
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22694
    • Problem description: Kernel crash with UID remapping enabled due to the NULL pointer dereference in the tracing.
    • Work around: None
    • Problem trigger: UID remapping with remote cluster mounts
    • Symptom: Crash
    • Platforms affected: All
    • Functional Area affected: Remote cluster mount/UID remapping
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22686
    • Problem description: The tspcacheutil program is used for looking at the state of each file/dir/entity at the cache/production site with respect to its home/DR site. This program cannot handle 64 bit inode numbers and is seen to pick up random 32 bit inode numbers and print their stats.
    • Work around: None
    • Problem trigger: Run tspcacheutil on a file with inode number in the 64 bit inode range (> 4B)
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Linux OS and AIX environments
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22687
    • Problem description: When adding vdisk sets which have member vdisks from multiple ESS recovery group pairs to a file system, some vdisks may not be added if any of the node classes involved had some, but not all vdisks already in the file system. This will cause a message pointing out orphaned vdisks in the file system.
    • Work around: Rerun the "mmvdisk fs add" command.
    • Problem trigger: The vdisk set contains vdisks from multiple ESS recovery groups in different node classes. Also, some, but not all, vdisks of a node class are already in the file system.
    • Symptom: Unexpected results
    • Platforms affected: ALL Linux OS
    • Functional Area affected: ESS/GNR
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22502
    • Problem description: When AFM runs recovery/resync today to catch up missed updates from the cache/production site to the home/remote site, it tends to populate recovery/resync ops and flush only them. Any live ops are held from playing to remote until the recovery/resync completes. This should not be the case for files that are evicted and awaiting recovery/resync to complete.
    • Work around: Wait until recovery/resync completes so that the file data can be read back from the remote site.
    • Problem trigger: Read the evicted file when the Recovery/Resync procedures are in progress
    • Symptom: IO error
    • Platforms affected: ALL Linux OS and AIX environments
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22723
    • Problem description: In the case of a GPFS backend, if the remote cluster is unresponsive, operations in the gateway node queue get stuck. In the case of IW mode, applications running on the cache cluster can also hang if home is not responding. AFM triggers SGPanic for the remote cluster stripe group so that applications can continue with operations in the cache cluster. The AFM fileset goes to Unmounted state, and the user needs to manually re-mount the stripe group.
    • Work around: Manually re-mount remote stripe group on a gateway node.
    • Problem trigger: Unresponsive home with operation in the Gateway node queue.
    • Symptom: AFM fileset goes into unmounted state and replication stops.
    • Platforms affected: ALL Linux OS
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22689
    • Problem description: The handling of malformed requests by the Spectrum Scale mmfsd daemon can trigger a segmentation violation, resulting in the daemon abnormally terminating its execution.
    • Work around: None
    • Problem trigger: N/A
    • Symptom: Abend/Crash Unexpected Results/Behavior
    • Platforms affected: ALL
    • Functional Area affected: ALL
    • Customer Impact: Critical: an issue which will cause an application to fail, a silent data corruption or data loss or loss of major capability
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22690
    • Problem description: Issue related to snapshots
    • Work around: Do not change a file in the live file system while that file is being read from snapshots.
    • Problem trigger: Data updates happen on files in the live file system while those files are being read from snapshots.
    • Symptom: Bad data returned
    • Platforms affected: All
    • Functional Area affected: Snapshot
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22700
    • Problem description: The handling of malformed requests by the Spectrum Scale mmfsd daemon can trigger a segmentation violation, resulting in the daemon abnormally terminating its execution.
    • Work around: None
    • Problem trigger: N/A
    • Symptom: Abend/Crash Unexpected Results/Behavior
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Critical: an issue which will cause an application to fail, a silent data corruption or data loss or loss of major capability
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22707
    • Problem description: Issue related to snapshots
    • Work around: Do not change a file in the live file system while that file is being read from snapshots.
    • Problem trigger: Data updates happen on files in the live file system while those files are being read from snapshots.
    • Symptom: Bad data returned
    • Platforms affected: All
    • Functional Area affected: Snapshot
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22718
    • Problem description: Compression problems
    • Work around: Stop compression
    • Problem trigger: start mmap write on file being compressed.
    • Symptom: FSErrBadCompressBlock structure error
    • Platforms affected: All
    • Functional Area affected: File compression, mmap
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22939
    • Problem description: When the use of the gpfsready script is configured (the 'verifyGpfsReady' configuration variable must be set to 'true'), this user script is called during GPFS startup. If user-specific checks in this script fail and the script returns a non-zero exit code, GPFS goes down. It turned out that during the following GPFS shutdown, the cleanup thread performing the shutdown in the GPFS mmfsd daemon waits for a mutex which has been acquired by another thread but not released yet. The other thread was sent to a particular handler routine by the cleanup thread without having the chance to release the mutex the cleanup thread is waiting for. As a result the cleanup thread cannot make progress, and the other thread waits for 5 minutes in the handler routine before it exits.
    • Work around: None
    • Problem trigger: gpfsready user script fails during GPFS startup.
    • Symptom: Hang
    • Platforms affected: Just seen on x86_64-linux, other platforms possible
    • Functional Area affected: GPFS mmfsd startup
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22863
    • Problem description: AFM tends to filter out a few messages getting queued to avoid unnecessary playback of such messages to the remote site. While filtering such operations, if the fileset moves to the Unmounted state because of a change at home, then the queueing back of pending operations causes the daemon on a gateway node to assert.
    • Work around: None
    • Problem trigger: While trying to remove files on which lookups are queued, at the same time due to some error at the home the Cache fileset encounters Unmounted state.
    • Symptom: Abend/Crash
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22868
    • Problem description: The ccrrestore script (used by mmsdrrestore --ccr-repair) enumerates the IDs in the ccr.disks file statically (always starting with 1), but the CCR calculates the ID based on an existing disk list and starts with 1 only if the disk list in the committed state is empty; this results in an inconsistent disk list in CCR's committed state.
    • Work around: Configure not to use (CCR) tiebreaker disks as long as the cluster is in a working state.
    • Problem trigger: 'mmsdrrestore --ccr-repair' in conjunction with (CCR) tiebreaker disks configured.
    • Symptom: Crash
    • Platforms affected: All
    • Functional Area affected: -CCR -Admin Commands (mmsdrrestore --ccr-repair)
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22886
    • Problem description: When quorum nodes get removed from the cluster by using 'mmdelnode' and added back later for use as quorum nodes, the first attempt of 'mmchnode' (to declare those nodes as quorum nodes) might fail. This is caused by a CCR bug that uses old cached outgoing connections that were not closed and detected when those nodes were removed from the cluster (during 'mmdelnode').
    • Work around: Attempt the same 'mmchnode' command again, and it should succeed.
    • Problem trigger: Executing 'mmchnode' to declare non-quorum nodes as quorum nodes, after those (other quorum) nodes have been removed from the cluster shortly before, and the cluster has just one quorum node at the time the new quorum node is being declared.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: All
    • Functional Area affected: -CCR
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22902
    • Problem description: In rare situations, the quota shares relinquished by quota clients in the first phase of the mmcheckquota command may not be flushed to disk, and if a new quota manager is appointed it may fetch stale in-doubt values from disk.
    • Work around: None, but avoiding new quota manager instance (umount and mount or mmchmgr) can decrease the window of opportunity that the stale in-doubt information remains on disk.
    • Problem trigger: mmcheckquota followed by a new quota manager in a busy system.
    • Symptom: After mmcheckquota, the in-doubt information provided by quota commands is different from the in-doubt information presented by a newly appointed quota manager. This is one of the possible causes of in-doubt values not decreasing after a long time of quota inactivity.
    • Platforms affected: All
    • Functional Area affected: Quotas
    • Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22925
    • Problem description: mmlinkfileset is failing for a dependent fileset at an AFM-DR secondary site because the fileset state is read-only (afmSecondaryRW=no) and an attempt is made to create the special AFM directory (.afm/.ptrash) which fails.
    • Work around: Modify afmSecondaryRW=yes and mark secondary fileset to RW state and then link it.
    • Problem trigger: When dependent fileset is being linked at secondary site.
    • Symptom: Unexpected Results.
    • Platforms affected: All OS environments
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ23004
    • Problem description: long waiter detected: SharedHashTabFetchHandlerThread: 'wait for SubToken to become stable'
    • Work around: None
    • Problem trigger: writes that overlap with mmap read ranges
    • Symptom: Hang
    • Platforms affected: All OS environments
    • Functional Area affected: All
    • Customer Impact: High Importance
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22972
    • Problem description: A file system panic during processing of revoke ownership on an allocation region can cause a wrong return code to be passed to the file system manager. This can lead to another node accessing the allocation region before log recovery is performed for the file system panic.
    • Work around: None
    • Problem trigger: File system panic on a client node
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22544
    • Problem description: For a file system with multiple storage pools defined, the 'df' command may temporarily show 0% free space after mounting the file system (it can happen both on file system manager and client nodes). In most cases, the problem will disappear within a sync period. But if a client node does not do block allocation/deallocation, this problem can persist indefinitely.
    • Work around: There are several workarounds: 1) Running the 'mmdf' command can solve this problem, but this command may be time consuming. 2) On the problematic client node, do some block allocation or deallocation and then run 'sync'.
    • Problem trigger: Running the df command during a file system mount
    • Symptom: Error output/message
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

    This update addresses the following APARs: IJ21773 IJ22234 IJ22502 IJ22503 IJ22506 IJ22544 IJ22580 IJ22634 IJ22664 IJ22665 IJ22666 IJ22668 IJ22686 IJ22687 IJ22688 IJ22689 IJ22690 IJ22694 IJ22700 IJ22707 IJ22718 IJ22723 IJ22863 IJ22868 IJ22886 IJ22902 IJ22939 IJ22925 IJ22972 IJ23001 IJ23004

    Problems fixed in IBM Spectrum Scale 5.0.4.2 [January 30, 2020]

    • Item: IJ21257
    • Problem description: GPFS daemon assert: err == E_OK dirop.C. This could happen after GPFS runs out of file cache entries and is forced to move a directory from file cache to stat cache.
    • Work around: Increasing maxFilesToCache will reduce the chance of hitting this assert.
    • Problem trigger: Directory is being moved from file cache to stat cache.
    • Symptom: Abend/Crash
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21258
    • Problem description: Running mmsdrrestore against a quorum node in a CCR-enabled cluster will crash the GPFS daemon.
    • Work around: Shut down GPFS before performing mmsdrrestore
    • Problem trigger: Running mmsdrrestore against a quorum node in a CCR-enabled cluster
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: All
    • Functional Area affected: CCR, Admin Commands
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21260
    • Problem description: GPFS daemon assert: !(ccP->hasJoined() && ccP->isXClust(destNode())). This could happen after moving a node from one remote cluster to another while both clusters have remote mounted a file system from a home cluster.
    • Work around: Disable ialloc function ship via "mmchconfig iallocFuncshipEnabled=false -i"
    • Problem trigger: Moving a node from one remote cluster to another.
    • Symptom: Abend/Crash
    • Platforms affected: All
    • Functional Area affected: Remote cluster mount/UID remapping
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21396
    • Problem description: On RHEL 7 nodes (pre-Linux kernel v3.18), in the GPFS kernel NFS support environment, GPFS may try to acquire some mutex, while holding an inode spin lock, which may be detected as a soft lockup issue by the kernel NMI watchdog.
    • Work around: None
    • Problem trigger: GPFS breaks a spin lock holding policy in NFS support environment
    • Symptom: Performance Impact/CPU stuck
    • Platforms affected: All RHEL 7.x
    • Functional Area affected: Users of KNFS/CNFS only
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21304
    • Problem description: If encryption is not configured properly, starting down disks could result in mismatched replicas.
    • Work around: None
    • Problem trigger: During the "start disk", repairing mismatched replicas failed on certain files because encryption context was not available, and the error E_ENC_CTX_NOT_READY was treated as a SEVERE error which means that the code continues to repair the replicas to the degree possible. In the final phase of repair, the missupdate flag was incorrectly cleared from the inode even though we did not synchronize the replicas, as the repair failed due to unavailable encryption context. As the missupdate flag was cleared from the inode, a subsequent "start disk" brought up all down disks, but the file still had mismatched replicas. A later "mmrestripefs -c" may then pick up the wrong replica and overwrite the good replicas.
    • Symptom: Encrypted replicas mismatch after start disk.
    • Platforms affected: All
    • Functional Area affected: Core
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21261
    • Problem description: On one side of an AFM relationship, an AFM fileset is being deleted and on the other side there's a getstate to show AFM fileset states. The getstate command picks the fileset being deleted to print its stats, and causes the Assert.
    • Work around: Do not run "mmafmctl getstate/mmdiag" commands when AFM filesets are being Deleted.
    • Problem trigger: On one side an AFM fileset is being deleted (which could take time depending on the number of inodes in the fileset and the amount of data). While this is happening, another node in the cluster queries AFM stats on the AFM filesets (mmafmctl getstate (or) an mmdiag running).
    • Symptom: Abend/Crash
    • Platforms affected: ALL Linux and AIX environments.
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21263
    • Problem description: Starting from 5.0, a few special afmIOFlags were introduced to make AFM behave in special ways (for migration and replication). The flags started getting out of control, and needed a human readable format to understand what flags are set.
    • Work around: None
    • Problem trigger: "mmlsfileset -L --afm" does not print human readable IO Flags.
    • Symptom: Error output/message.
    • Platforms affected: ALL Linux and AIX environments.
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21422
    • Problem description: EACCESS error is returned to NFS client from Ganesha Server and it can cause IO failure for metadata access (ls command) for file/directory or can fail rm operation on the directory.
    • Work around: None
    • Problem trigger: It is difficult to recreate but possible reason could be file/directory move/deletion from parent directory which leaves a disconnected dentry in the linux kernel.
    • Symptom: IO failure
    • Platforms affected: Linux Only
    • Functional Area affected: NFS Ganesha
    • Customer Impact:
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21394
    • Problem description: Corrected the description of the resumeRequeued subcommand to indicate that filesetName is a required argument.
    • Work around: None
    • Problem trigger: Running the mmafmctl command as recommended in the man
    • Symptom: mmafmctl shows wrong help - not mandating the filesetName for mmafmctl resumeRequeued sub command.
    • Platforms affected: ALL Linux and AIX environments.
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21432
    • Problem description: A Linux mknod operation for a FIFO object can encounter this assert if the object is opened before the operation completely finishes.
    • Work around: The assert can be disabled with the assistance of service via "mmchconfig disableAssert"
    • Problem trigger: A Linux mknod operation to create a FIFO object while another process attempts to open the same object (without actually waiting for the create to complete).
    • Symptom: Abend/Crash
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21424
    • Problem description: mmfsadm (safer)dump afm (fset ) - which displays the AFM handler of an AFM fileset - reports incorrect negative values for numAsyncLookups column. Also the same is collected as part of internal dumps that are collected for gpfs.snap.
    • Work around: None
    • Problem trigger: The "mmfsadm (safer)dump afm (fset )" command that displays the handler for an AFM fileset is issued. Also the same is collected as part of internal dumps that are collected for gpfs.snap.
    • Symptom: mmfsadm (safer)dump afm (fset ) - which displays the AFM handler of an AFM fileset - reports incorrect negative values for numAsyncLookups column.
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21541
    • Problem description: AFM deletes the orphan file when home is not reachable during the lookup. The orphan file is created during the readdir and is repaired during the lookup. It is possible that multiple threads delete the same orphan file and run into an FSStruct error, as deallocation of the same inode is attempted multiple times.
    • Work around: None
    • Problem trigger: Doing readdir and lookup on the AFM cache fileset when the home is disconnected after the readdir.
    • Symptom: Error output/message
    • Platforms affected: ALL
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21550
    • Problem description: Deadlock could happen if quorum loss occurs on a newly appointed stripe group manager. Threads could be stuck in 'waiting for stripe group takeover' and 'waiting for SG cleanup'.
    • Work around: None
    • Problem trigger: Quorum loss just as a node starts taking over the file system manager role
    • Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
    • Platforms affected: ALL
    • Functional Area affected: All
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21554
    • Problem description: Enable the usage of a list of groups for the --ces-group option of the mmces command
    • Work around: Repeat the command using one ces-group for each command
    • Symptom: Without the fix the user cannot choose a combination of groups when filtering the command output for ces groups.
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: CES
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21557
    • Problem description: Make timeout of commMsgCheckMessages RPC consistent on all nodes and issue a warning message if it took more than one third of the timeout to get the reply of commMsgCheckMessages RPC.
    • Work around: None
    • Problem trigger: Network is not good which leads to sending commMsgCheckMessages RPC
    • Symptom: Error output/message
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21645
    • Problem description: On Linux nodes with kernel version 4.7 or later, when copying a source file with the command cp -p, the ACL data is lost in the destination file if the source file contains many ACL entries, for example, 20+ ACL entries.
    • Work around: None
    • Problem trigger: Defect in porting of GPFS to Linux kernel version 4.7.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: Linux nodes with kernel version 4.7 or later
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21647
    • Problem description: The systemhealth monitor fails to start.
    • Work around: None
    • Problem trigger: The problem depends on the provided python packages in the various linux distributions. It seems that not all distros provided the required packages. During development and internal test RHEL 7.6 was used without issues.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Linux OS environments (CES nodes)
    • Functional Area affected: CES
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21648
    • Problem description: mmces address add fails when both object attributes are assigned to one CES IP address
    • Work around: cat /var/mmfs/gen/cesAddressPoolfile will show the requested information.
    • Problem trigger:
    • Symptom: Customer gets incorrect information using the mmces list command.
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: CES
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21434
    • Problem description: GPFS user space daemon crashed during read/write through NFS or mmapplypolicy.
    • Work around: None
    • Problem trigger: In openNFS, the first lockFile put a hold on the cachObj, and the next lockFile in openNFS skipped the lookup of the file from the hash table, which means the cachObjMutex was not acquired. As a result, the releaseCacheObjMutex call at the end of lockFile wrongly cleared the lockWordCopy in the mutex. Unfortunately, this mutex had been acquired by a daemon thread before lockFile called releaseCacheObjMutex, so when the daemon thread continued its work and called ASSERT_MUTEX_HELD to verify that it still held the mutex, the assert went off, because the lockWordCopy in the mutex had been wrongly cleared by the kernel lockFile.
    • Symptom: Daemon crash
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21654
    • Problem description: AFM dependent filesets do not have the .afm/.ptrash/.pconflicts/.afmtrash directories, which are used for storing conflicting files. The .afmtrash directory is used to move non-empty directories during directory deletion.
    • Work around: None
    • Problem trigger: Replication to dependent filesets
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: AFM and AFM DR
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21659
    • Problem description: Revalidation on the fileset root path might not happen correctly if the gateway is running certain operating systems, such as RHEL 7.7. This causes new data from the target path not to be fetched from the home.
    • Work around: None
    • Problem trigger: Revalidation on the fileset root path in the AFM caching modes.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: Certain Linux OS environments, like RHEL 7.7
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21660
    • Problem description: An upgrade to a major release of the PostgreSQL server will trigger a new health event informing the user that the database will be reinitialized.
    • Work around: Manually drop the database and allow the GUI to create it.
    • Problem trigger: Upgrading PostgreSQL to a new major release
    • Symptom: Component Level Outage
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: REST APIs, GUI
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21964
    • Problem description: Harden mmces command against injection vulnerability
    • Work around: None
    • Problem trigger:
    • Symptom: For some mmces commands it is possible to inject a shell command by adding "| " to the parameter list. This injection is possible from both the command line and the GUI.
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: CES
    • Customer Impact: Critical: security issue
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21974
    • Problem description: AFM gateway daemon asserts when the request arrives before the filesystem is mounted.
    • Work around: Remove the gateway designation from the gateway node, start GPFS, mount the filesystem, and make the node a gateway again using the "mmchnode --gateway -N " command (see the sketch after this entry).
    • Problem trigger: Start the gateway node while IO is running on the AFM fileset.
    • Symptom: Crash
    • Platforms affected: Linux
    • Functional Area affected: AFM
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
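    A hedged sketch of the IJ21974 workaround (the node and file system names are illustrative):

         mmchnode --nogateway -N gwNode1   # remove the gateway designation
         mmstartup -N gwNode1              # start GPFS on the node
         mmmount fs1 -N gwNode1            # mount the file system
         mmchnode --gateway -N gwNode1     # make the node a gateway again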
    • Item: IJ21975
    • Problem description: AFM gets the sparse information from the home before reading the file, and the actual data size is used to set the cached bit. It is possible that the data blocks allocated at the cache are more than the actual data size if the file is sparse in between, so the cached bit is set without fully reading the file.
    • Work around: Disable sparse file detection by setting afmReadSparseThreshold=disable (see the sketch after this entry)
    • Problem trigger: AFM read on the sparse files with afmReadSparseThreshold set (default on)
    • Symptom: Unexpected result
    • Platforms affected: Linux
    • Functional Area affected: AFM
    • Customer Impact: HiPER
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
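    A hedged sketch of the IJ21975 workaround, assuming the tunable is applied with mmchconfig (verify the exact scope for your release; some AFM tunables can also be set per fileset):

         mmchconfig afmReadSparseThreshold=disable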
    • Item: IJ21977
    • Problem description: AFM gateway daemon asserts if the remote mount initialization fails during the first access to the fileset
    • Work around: None
    • Problem trigger: Remote mount failure
    • Symptom: Crash
    • Platforms affected: Linux
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21978
    • Problem description: When deleting a snapshot, the process may fail to move the data blocks of files in the snapshot being deleted that have small inode numbers. The affected inodes are those in the same inode block as the fileset metadata file, but not in the first inode block of the inode 0 file.
    • Work around: None
    • Problem trigger: Deleting a snapshot which contains a file with a small inode number
    • Symptom: Data corruption
    • Platforms affected: All
    • Functional Area affected: Snapshot
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22022
    • Problem description: When handling a page fault, GPFS did not detach the I/O buffer segment, which later caused a kernel crash.
    • Work around: None
    • Problem trigger: Multiple threads doing both normal I/O and mmap I/O on the same file at the same time.
    • Symptom: Kernel crash
    • Platforms affected: AIX
    • Functional Area affected: Mmap I/O
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22004
    • Problem description: AFM gateway daemon crashes during resync operations due to the race between the thread which is monitoring the stuck messages and threads replicating the data.
    • Work around: Increase the afmAsyncOpWaitTimeout value (see the sketch after this entry)
    • Problem trigger: AFM resync
    • Symptom: Crash
    • Platforms affected: All Linux OS environments
    • Functional Area affected: AFM
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22005
    • Problem description: Customer data showed that GPFS asserted when trying to open a disk during mmadddisk/mmrpldisk because the disk was not assigned a valid storage pool. The root of the problem (why the disk was associated with an invalid storage pool during mmadddisk/mmrpldisk) was not discovered due to lack of data. The logic is: by the time GPFS tries to open a disk due to a stripe group descriptor update from mmadddisk/mmrpldisk, the disk should be assigned to a valid storage pool. It was decided to safeguard GPFS so that it does not open a disk that is assigned to an invalid storage pool.
    • Work around: None
    • Problem trigger: This problem has not surfaced internally, and there is not enough data from the customer to find out why it could happen. From examining the code, GPFS should have assigned a valid storage pool during mmadddisk/mmrpldisk even though the disk was created without specifying the storage pool.
    • Symptom: Abend/Crash
    • Platforms affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22007
    • Problem description: Online replica compare function (mmrestripefs -c) could give incorrect replica mismatch error on directories. This could happen if subblock size for metadata is greater than 256K.
    • Work around: None
    • Problem trigger: Run mmrestripefs -c on file system with metadata subblock size greater than 256K.
    • Symptom: Error output/message
    • Functional Area affected: Admin commands
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21097
    • Problem description: The ports of the second (and later) IB adapters on the node that starts the verbs connection might be misrecognized as RDMA-CM-disabled ports and fail to connect. The nodes that start the verbs connection are NSD clients if verbsRdmaSend=no, but they can also be other nodes if verbsRdmaSend=yes. If this happens, you will see the error message "ibv_modify_qp init err 22" in the mmfs.log file.
    • Work around: None. But if RDMA-CM is not really needed in your environment, you can simply disable it.
    • Problem trigger: Users having multiple IB adapters with RDMA-CM enabled
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: RDMA
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22034
    • Problem description: File creation could fail unexpectedly with an EFBIG error. This could happen when multiple nodes access the same directory while one node repeatedly creates and deletes the same file in the directory.
    • Work around: Perform a rename on a file in the directory after encountering the EFBIG error (see the sketch after this entry).
    • Problem trigger: Repeatedly create and delete the same file in a directory.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
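    A hedged sketch of the IJ22034 workaround (the paths are illustrative); renaming any file in the affected directory clears the condition:

         mv /gpfs/fs1/dir/somefile /gpfs/fs1/dir/somefile.tmp
         mv /gpfs/fs1/dir/somefile.tmp /gpfs/fs1/dir/somefile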
    • Item: IJ22009
    • Problem description: The GPFS command mmchattr stores extended attribute name-value pairs in the inode itself, even for the ACL xattr, which should instead be stored in the GPFS internal ACL file. This handling of the ACL xattr may confuse users.
    • Work around: None.
    • Problem trigger: None
    • Symptom: Confusing output
    • Platforms affected: Linux
    • Functional Area affected: All
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22036
    • Problem description: On a file system with unavailable metadata disks, log recovery failure prevents file system from being mounted or disks from being started. Either mmfsck -xk should allow repair of logs in this case or tsdbfs -f should allow user to patch the disks states. Fixed code to bypass disk availability check if fsck is invoked in read-only mode. This allows both mmfsck -xk and tsdbfs -f to run in such situations.
    • Work around: Use a node at version less than 5.0.2 to either run mmfsck -xk or tsdbfs -f to patch disk states. This only works if the file system version is less than 5.0.2.
    • Problem trigger: File system disks are down and log recovery has failed.
    • Symptom: Error output/message
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: FSCK
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22010
    • Problem description: Log recovery error after node failure can cause recovery buffer to be overwritten which will most likely lead to GPFS daemon assert.
    • Work around: None.
    • Problem trigger: Node failure
    • Symptom: Abend/Crash
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: All
    • Customer Impact: HiPER
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22013
    • Problem description: In AFM Stopped and Queue dropped states, when a file/directory is removed at the cache site, the inode is still seen as USERFILE and is not reclaimed.
    • Work around: None.
    • Problem trigger: Running applications/workload when AFM fileset is in Stopped state.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: All Linux and AIX operating systems
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22017
    • Problem description: The TSM client version can contain two or more digits in any position of V.R.M.F, but mmbackup cannot handle that case. As a result, mmbackup fails while parsing the TSM client version.
    • Work around: None.
    • Problem trigger: Executing mmbackup with TSM client 8.1.10.
    • Symptom: Component Level Outage
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: mmbackup
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22024
    • Problem description: mmprotocoltrace timers of manually stopped traces would unexpectedly stop newly initiated mmprotocoltrace traces.
    • Work around: Either (1) wait for the duration of the previous protocol trace (default: 10 minutes) before starting a new trace for the same component, or (2) kill all mmprotocoltrace processes on all CES nodes that participate in the trace (by default, all CES nodes); see the sketch after this entry.
    • Problem trigger: Starting a second protocol trace via mmprotocoltrace for the same component after the first trace was manually stopped and the timeout of the first trace was not yet reached.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: Trace CES
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
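    A hedged sketch of IJ22024 workaround option 2 (the node class and process pattern are illustrative):

         mmdsh -N cesNodes "pkill -f mmprotocoltrace"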
    • Item: IJ22041
    • Problem description: When running "mmdiag --waiters" or "mmfsadm dump waiters", or the periodical health check performs long waiters detection, the code could run into memory overflow for a local buffer, then triggers the signal 6 to mmfsd daemon and causes it restarted abnormally.
    • Work around: None.
    • Problem trigger: mmdiag --waiters or mmfsadm dump waiters, or the periodical health check inside mmfsd daemon.
    • Symptom: Daemon crash
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: Long waiters detection and dump
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ21955
    • Problem description: When using Microsoft Office applications such as Word and Excel on Windows 10 version 1709 or newer, any attempt to modify and save an existing file (.docx, .xlsx, etc.) will fail with a sharing violation error.
    • Work around: None.
    • Problem trigger: This issue is triggered when installing or upgrading to Windows 10 version 1709 or newer. It is also hit on Windows Server version 1809 or newer.
    • Symptom: Sharing violation errors when attempting to modify and save existing *.docx, *.xlsx (and other Office) files using Microsoft Office applications such as Word and Excel. Saving as a different name works.
    • Platforms affected: Windows/x86_64 only. Specifically, Windows 10 (version 1709 or newer) and Windows Server (version 1809 or newer) only.
    • Functional Area affected: Windows.
    • Customer Impact: High Importance.
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22158
    • Problem description: The ccMsgGroupJoinPhaseN message is sent to all nodes that are up during the join protocol; in this case, the message is sent to the down gateway node, causing the deadlock
    • Work around: None.
    • Problem trigger: Remote node joining the cluster with a down gateway node.
    • Symptom: Deadlock
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: AFM and AFM DR
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22207
    • Problem description: Kernel assert going off: bufOffset+len = iobP->ioBufLen in file cxiIOBuffer.c, resulting in a kernel panic.
    • Work around: None.
    • Problem trigger: Calling Spectrum Scale APIs to scan inodes in the file system. Note that some binaries delivered with the Spectrum Scale package also call such Spectrum Scale APIs, such as policy rules that scan files in the file system, i.e. the snapshot restore and SOBAR backup utilities.
    • Symptom: Abend/Crash
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: applications using GPFS APIs, including policy, snapshot restore and sobar backup.
    • Customer Impact: High Importance.
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ22261
    • Problem description: Due to the way mmfsck internally traverses reserved files and snapshots, it is not able to report and fix duplicate addresses present among the inode 0 files of the active file system and its snapshots. As a result, even though mmfsck -y runs successfully and reports the file system as clean, the duplicate address corruptions are not fixed, so the next mmfsck run will report new corruptions such as mismatched replicas in inode 0. FSStruct errors may also be reported in the logs after mmfsck -y.
    • Work around: Delete all the snapshots in the file system and then run the mmfsck repair (see the sketch after this entry)
    • Problem trigger: ??
    • Symptom: Operation failure due to file system corruption. On a file system having snapshots, the fsck output shows the following signs after a successful mmfsck -y run: (1) mismatched replicas in inode 0, for example "Error in inode 0 snap 0: Inode block 289710225 has mismatched replicas"; (2) even though no duplicates are reported, fsck shows "Checking for the first reference to a duplicate fragment."; (3) even though no duplicates are reported, a non-zero duplicates count appears at the end of the fsck output, for example "896 duplicates"
    • Platforms affected: ALL Operating System environments
    • Functional Area affected: FSCK is not able to repair the corruption
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
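    A hedged sketch of the IJ22261 workaround (the file system and snapshot names are illustrative):

         mmlssnapshot fs1          # list all snapshots in the file system
         mmdelsnapshot fs1 snap1   # repeat for every snapshot listed
         mmfsck fs1 -y             # then run the repair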
    • This update addresses the following APARs: IJ21097 IJ21257 IJ21258 IJ21260 IJ21261 IJ21263 IJ21304 IJ21394 IJ21396 IJ21422 IJ21424 IJ21432 IJ21434 IJ21541 IJ21550 IJ21554 IJ21557 IJ21645 IJ21647 IJ21648 IJ21654 IJ21955 IJ21659 IJ21660 IJ21964 IJ21974 IJ21975 IJ21977 IJ21978 IJ22004 IJ22005 IJ22007 IJ22009 IJ22010 IJ22013 IJ22017 IJ22022 IJ22024 IJ22034 IJ22036 IJ22041 IJ22158 IJ22207 IJ22261

    Problems fixed in Spectrum Scale 5.0.4.1 [November 21, 2019]

    • Item: IJ20948
    • Problem description: On an AFM cache cluster using the AFM independent writer mode, data may be incompletely read if a file is modified before it is fully cached. Normally AFM reads a file from the AFM home cluster before allowing write operations to occur. However, if a file is not opened in append mode but a write is made at the end of the file, the data for the file may not be completely cached.
    • Work around: Run prefetch on the partially cached files (see the sketch after this entry).
    • Problem trigger: AFM caching modes and updating at the end of the file before it is fully cached.
    • Symptom: Unexpected results
    • Platforms affected: All
    • Functional Area affected: AFM caching.
    • Customer Impact: HiPER
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
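    A hedged sketch of the IJ20948 workaround (the names and list-file path are illustrative):

         mmafmctl fs1 prefetch -j cacheFileset --list-file /tmp/partially-cached.list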
    • Item: IJ20909
    • Problem description: When mmfsck scans and finds corrupted reserved file blocks it prints the list of blocks corrupted and due to a code bug in that path, the file system manager node asserts with Signal 11.
    • Work around: Do not run mmfsck
    • Problem trigger: This will happen when mmfsck is run on a file system having corrupted reserved file blocks
    • Symptom: File system manager node assert
    • Platforms affected: All
    • Functional Area affected: FSCK
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20710
    • Problem description: FSSTRUCT error: FSErrCheckHeaderFailed could be issued while accessing some directory. This could happen on a file system with metadata replication where there is metadata disk in down state and node failure.
    • Work around: None
    • Problem trigger: Metadata disk in down state and node failure.
    • Symptom: Error output/message
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20678
    • Problem description: On a node with multiple file systems mounted, the DiskLeaseThread could be blocked by a file system unmount, causing a delay in the renewal of the disk lease and potential quorum loss.
    • Work around: None
    • Problem trigger: File system unmount
    • Symptom: Node expel/Lost Membership
    • Platforms affected: All
    • Functional Area affected: Cluster Membership
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20726
    • Problem description: After a system crash, the configuration file /etc/sysconfig/ganesha contained only an entry for NOFILE and no longer contained entries for OPTIONS and EPOCH_EXEC. No Ganesha logs were created.
    • Work around: Since there is no backup file of /etc/sysconfig/ganesha by default, the file must be extracted either from RPM or fetched from another CES node.
    • Problem trigger: The /etc/sysconfig/ganesha file was modified in place whenever NFS was started; the procedure used the sed -i command for this. The goal was to always have the latest NOFILE entry in the file, along with those for OPTIONS (startup options for Ganesha) and EPOCH_EXEC. Investigation indicates that during a system crash not all changes to the file were written to disk, so once this file is damaged or truncated, the only entry left is the added NOFILE data. Previously existing OPTIONS and EPOCH_EXEC entries cannot be recovered, since there is no mechanism to do so. After the code change, the NOFILE data is updated on a copy of the original configuration file; once all changes are complete, this copy replaces the original file.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Linux OS environments (CES nodes)
    • Functional Area affected: CES
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20811
    • Problem description: The application allows a regular user to inject OS commands in the "NFS Exports" Client field. The injected command is executed on the underlying operating system as "root" user.
    • Work around: None
    • Problem trigger: Using the GUI to add NFS exports allows this condition.
    • Symptom: Behavior - Security risk
    • Platforms affected: All
    • Functional Area affected: NFS
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20941
    • Problem description: Removed unutilized message headers from the librdkafka message to reduce the message size sent to the external sink.
    • Work around: None
    • Problem trigger: When running clustered watch with a heavy workload producing many events, if the external Kafka cluster gets overloaded, clustered watch may hit a timeout and auto-disable. With this fix, the librdkafka message size reduction makes it less likely to hit this timeout.
    • Symptom: The 45-second timeout on clustered watch is hit, causing conduit(s) to go down, with an error message like the following in /var/adm/ras/mmwfclient.log: 2019-08-26_00:51:49: [E] WF Producer: t: newtopic a: 3
    • Platforms affected:
    • Functional Area affected:
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20797
    • Problem description: AFM Secondary mode filesets are passive in nature (and read-only), since the Primary is the only one allowed to perform write-class operations on the secondary mode fileset. This bug allowed creates to be performed directly on the Secondary mode fileset even when afmSecondaryRW is set to no, although other write-class operations such as setting times, chmod, etc. were not allowed on the file.
    • Work around: None
    • Problem trigger: User tries to perform IO Operations on an AFM Secondary mode fileset when afmSecondaryRW is set to no.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: All Linux and AIX environments.
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20733
    • Problem description: Node crashes with assert when the AFM fileset with active IO is unlinked.
    • Work around: Stop AFM fileset and then unlink the fileset.
    • Problem trigger: Fileset unlink with active IO.
    • Symptom: Abend/Crash
    • Platforms affected: ALL
    • Functional Area affected: AFM
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20677
    • Problem description: Daemon crashes due to an invalid config setting where enableStatUIDremap is enabled without enabling the enableUIDremap config option.
    • Work around: Enable both the enableUIDremap and enableStatUIDremap options (see the sketch after this entry).
    • Problem trigger: UID remapping with invalid config options.
    • Symptom: Crash
    • Platforms affected: All
    • Functional Area affected: Remote cluster mount/UID remapping
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
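    A hedged sketch of the IJ20677 workaround, assuming both options are set together with mmchconfig (the yes values are assumed):

         mmchconfig enableUIDremap=yes,enableStatUIDremap=yes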
    • Item: IJ20730
    • Problem description: When running "mmces node suspend/resume -N" with a list of nodes it might happen that not all of them are in the expected state afterwards.
    • Work around: Repeat the "mmces node suspend/resume -N" command with a list of nodes which were not set to the expected state previously.
    • Problem trigger: The cesiplist file has a unique serial number assigned when it is stored in CCR. Each node reads the cesiplist file (and its serial number) from CCR as a local copy and modifies the suspend flag in that local copy. After this all nodes which did this kind of local update try now to update their modified copy of the cesiplist file in CCR with an incremented (+1) serial number. That may fail when other nodes did this update already with the same serial number earlier. There is some randomness, since not all nodes try this update at the very same time. There could be a timespan of several seconds between the first and the last one, so that some nodes get updated cesiplist files and serial numbers, and work on those.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Linux OS environments (CES nodes)
    • Functional Area affected: CES
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
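    A hedged sketch of the IJ20730 workaround (the node names are illustrative):

         mmces node suspend -N node3,node5   # re-issue for the nodes left in the wrong state
         mmces node list                     # verify the resulting state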
    • Item: IJ20741
    • Problem description: Fix quota share revoke/reclaim delay when the quota usage is approaching the limits.
    • Work around: None
    • Problem trigger: When quota usage is approaching the limits (hard limit), the attempts to reclaim the remaining quota shares from other quota clients can lead to very slow quota management operations.
    • Symptom: Processes waiting for available quota when the quota usage is approaching the limits, leading to an apparent system hang.
    • Platforms affected: All
    • Functional Area affected: Quotas
    • Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20808
    • Problem description: On AIX, when trying to clear/write the primary GPT area, mmcrnsd does non-4k aligned writes to 4K disks while trying to preserve the OS PVID, causing a failure.
    • Work around: None
    • Problem trigger: Creating an NSD out of 4 KB sector size native disks on AIX
    • Symptom: Error output/message
    • Platforms affected: AIX
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20805
    • Problem description: The GPFS daemon (mmfsd) consumes a high CPU load on a quorum node when Windows 2016 is used as the operating system. This is caused by a CCR thread listening for incoming CCR requests on cached connections from other quorum nodes by using the poll system call. This logic does not consider particular flags returned by the poll system call (specifically: POLLHUP, POLLERR, POLLNVAL). A second GPFS daemon (mmsdrserv) might be affected by this issue; this daemon is running when GPFS has been shut down by the mmshutdown command. This issue does not occur on Linux or AIX.
    • Work around: Assign other nodes as quorum nodes which do not use Windows 2016 as the underlying operating system, if possible, e.g. nodes in the cluster running on Linux or AIX.
    • Problem trigger: GPFS startup (mmsdrserv starts automatically; mmfsd after 'mmstartup -a')
    • Symptom: -Performance Impact/Degradation -Unresponsiveness
    • Platforms affected: Windows 2016 (at least, earlier/later Windows version might be affected too)
    • Functional Area affected: CCR admin commands
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20736
    • Problem description: Readdir fails for the pcache fileset root, because the cached bit is set for the first created pcache fileset (even if it is not linked) on a file system that previously had no pcache filesets.
    • Work around: None
    • Problem trigger: Accessing the file structure for the first time from the first pcache fileset on a file system that previously had no pcache filesets.
    • Symptom: File/dir tree mismatches.
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: AFM
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20742
    • Problem description: In a multicluster environment, a remote cluster client node creates a file in a directory whose inode has its metanode in a different remote client cluster. A livelock can happen in this case if the directory is empty or small, due to a performance optimization.
    • Work around: Use the directory only from one remote cluster.
    • Problem trigger: Creating files in an empty or small directory from two remote clusters
    • Symptom: Hang
    • Platforms affected: ALL
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20728
    • Problem description: Policy scans cannot be executed successfully through the Jobs framework
    • Work around: Manually change the command templates JSON file:
           # mmccr fget _jobCommandTemplates.json /tmp/jct.json
           # vim /tmp/jct.json
           # mmccr fput _jobCommandTemplates.json /tmp/jct.json
           # /usr/lpp/mmfs/gui/bin/runtask GPFS_JOBS
      The two changes that need to be made are: change localWorkDir to localWorkDirectory in the command template, and change fileListPathname to fileListPathName in the argument definition.
    • Problem trigger: The policy-scan template is used in a job
    • Symptom: Command execution failure
    • Platforms affected: All
    • Functional Area affected: Jobs
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20725
    • Problem description: Writing to a memory-mapped file that was compressed fails with SIGBUS when the mmapRangeLock config variable is disabled.
    • Work around: Do not disable the mmapRangeLock config variable
    • Problem trigger: Writing to memory-mapped files that were compressed while the mmapRangeLock config variable is disabled.
    • Symptom: Application fails with SIGBUS
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: All
    • Customer Impact: Critical if the customer has disabled the mmapRangeLock config variable.
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20709
    • Problem description: Updating ESS drive firmware on a live system can be blocked for long periods of time (and may timeout) due to a declustered array that shows up in "rebalance" state.
    • Work around: None
    • Problem trigger: This problem is seen when updating drive firmware.
    • Symptom: Error output/message
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: ESS/GNR
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20695
    • Problem description: A TCT enabled system can see gpfs waiters of type "LweAccessRightThread waiting for XW lock"
    • Work around: None
    • Problem trigger: If a dmapi right is acquired on a file, and the file gets deleted, then releasing the right would cause a waiter to appear
    • Symptom: appearance of gpfs waiters of type "LweAccessRightThread waiting for XW lock"
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: TCT
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20675
    • Problem description: During the Copy On Write process in which a data block is copied to a snapshot, if the metanode fails, there is a chance for the assert to happen, due to the flush flag not being held.
    • Work around: None
    • Problem trigger: With debugDataControl set to heavy on AIX when automatic debug data collection on unexpected long waiter happens.
    • Symptom: Performance Impact/Degradation
    • Platforms affected: All non-Linux platforms.
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20951
    • Problem description: Access problems to disks caused a log recovery failure, eventually causing the file system to be panicked on all nodes. Since the incoming remote mounts prevented the offline fsck from running, users then moved the file system into maintenance mode and tried offline fsck again. However, log recovery was not skipped even when the file system was in maintenance mode, so the offline fsck run produced the same result.
    • Work around: None
    • Problem trigger: The file system logs for some nodes are not clean before moving the file system into maintenance mode.
    • Symptom: Log recovery is attempted and fails.
    • Platforms affected: All
    • Functional Area affected: File System Maintenance Mode
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20789
    • Problem description: When deleting a global snapshot, if the snapshot refers to a deleted fileset then the assert will be triggered.
    • Work around: None
    • Problem trigger: This problem only happens when deleting a global snapshot, while a fileset included in it has been deleted.
    • Symptom: Daemon abend
    • Platforms affected: All
    • Functional Area affected: Global snapshot deletion
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20809
    • Problem description: Daemon crashed with assert ofP->metadata.notAFragment(subblocks). It may occur when appending data to a file after a previous write failed due to an invalid data buffer in the application.
    • Work around: Make sure the user data buffer is valid before writing data into the Scale file system
    • Problem trigger: An invalid user data buffer caused GPFS to fail when writing data to a file, leaving the invalid data in the buffer. A flush of the buffer incorrectly set the file's fragment to a full block, which resulted in a failure to expand the last block of the file, triggering the assert.
    • Symptom: Scale daemon crashed with assert ofP->metadata.notAFragment(subblocks) in bufdesc.C
    • Platforms affected: All
    • Functional Area affected: Core
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20739
    • Problem description: The RPO thread, which takes care of creating RPO snapshots for AFM DR filesets, takes locks on all filesets in the file system before it can see which filesets require RPO snapshots to be taken. This includes any non-AFM independent/dependent filesets as well.
    • Work around: None
    • Problem trigger: Having multiple AFM DR Primary filesets with RPO intervals enabled.
    • Symptom: Performance Impact/Degradation; Hang/Deadlock/Unresponsiveness/Long Waiters (lower probability)
    • Platforms affected: All Linux
    • Functional Area affected: AFM Snapshots
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20791
    • Problem description: After migrating a file from GPFS to external storage any indirect blocks used by the file are not freed.
    • Work around: None
    • Problem trigger: Migration of large files, requiring indirect blocks, to external storage.
    • Symptom: Metadata disk space is not freed after files are migrated to external storage.
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Critical
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20674
    • Problem description: Due to a bug, fsck continues to process a deleted inode and marks it as an orphan which causes this assert.
    • Work around: Patch the problematic inode using tsdbfs so that the inode is no longer corrupt and retry fsck.
    • Problem trigger: A deleted inode is corrupt.
    • Symptom: Abend/Crash
    • Platforms affected: ALL
    • Functional Area affected: FSCK
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20676
    • Problem description: During offline fsck multi-pass directory scan, if patch queue feature is disabled and --skip-inode-check option is used, then fsck tries to access an out of range entry in dotdotArray and hits this assert.
    • Work around: None
    • Problem trigger: Multi-pass offline fsck --skip-inode-check with patch queue feature disabled.
    • Symptom: Abend/Crash
    • Platforms affected: ALL
    • Functional Area affected: FSCK
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20734
    • Problem description: Every 15 seconds the CES monitor daemon runs a helper script to analyze the state of CES. The change hardened the monitor so that it does not die but instead collects information about the malfunction. If the malfunction repeats, the problem is reported to system health by an event, and the customer will find it in the mmhealth event logs.
    • Work around: Before the implementation of the fix the information had to be collected from the log files.
    • Problem trigger: Unexpected behavior of a helper script called by CES monitor daemon. The helper script may die because of low memory, blocked lock, etc.
    • Symptom: Performance Impact/Degradation
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: CES
    • Customer Impact: Suggested: has little or no impact on customer operation
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20735
    • Problem description: GPFS daemon could assert when trying to mount a file system. This could happen after a node failure and file system is being mounted again after daemon restart. File system manager node would also fail with an assert.
    • Work around: None
    • Problem trigger: A client node failure
    • Symptom: Abend/Crash
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20788
    • Problem description: After upgrading a cluster with a pre-4.1 file system which has quotas enabled, the old user-visible quota files are converted to GPFS internal files. This change is kept in the stripe group descriptor for the file system. However, the change is not broadcast to all nodes, which causes a metadata inconsistency leading to the assert
    • Work around: Method 1: run "mmumount -a", then "mmmount -a" after upgrading a pre-4.1 file system which has quota enabled. Method 2: execute commands that update the stripe group descriptor for the file system; for example, use mmchdisk to suspend and then resume one of the disks of the file system. (See the sketch after this entry.)
    • Problem trigger: After upgrading a pre-4.1 file system which has quota enabled, user.quota, group.quota, and fileset.quota are migrated to regular files. In rare cases, accessing them (through the VFS interface or internally by tools like mmrepairfs) may cause a log assert.
    • Symptom: Abend
    • Platforms affected: All
    • Functional Area affected: Quotas
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
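    A hedged sketch of both IJ20788 workaround methods (the file system and disk names are illustrative):

         # Method 1: unmount everywhere, then remount
         mmumount fs1 -a
         mmmount fs1 -a

         # Method 2: force a stripe group descriptor update
         mmchdisk fs1 suspend -d nsd1
         mmchdisk fs1 resume -d nsd1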
    • Item: IJ20737
    • Problem description: Create fileset can be called before inode manager recovery has started, which hits a sig11 when accessing an uninitialized variable.
    • Work around: Wait for inode manager recovery to be completed as part of mount before create fileset.
    • Problem trigger: Create fileset before inode manager recovery has started.
    • Symptom: Abend/Crash
    • Platforms affected: All
    • Functional Area affected: Filesets
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20740
    • Problem description: If unmount interrupts inode manager recovery, it results in file system panic.
    • Work around: Wait for inode manager recovery to be completed as part of mount before unmount.
    • Problem trigger: Unmount while inode manager recovery is in progress.
    • Symptom: Error output/message
    • Platforms affected: All
    • Functional Area affected: All
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20802
    • Problem description: Attacker can inject arbitrary commands through Spectrum Scale GUI or CLI when using mmsmb commands
    • Work around: None
    • Problem trigger: Command injection
    • Symptom: May not be any errors, or you may see Unexpected Results/Behavior
    • Platforms affected: All
    • Functional Area affected: SMB
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20807
    • Problem description: Attacker can inject arbitrary commands through Spectrum Scale GUI or CLI when using mmsmb commands
    • Work around: None
    • Problem trigger: Command injection
    • Symptom: May not be any errors, or you may see Unexpected Results/Behavior
    • Platforms affected: All
    • Functional Area affected: SMB
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20888
    • Problem description: When updating the file size for preallocation, the new file size is calculated incorrectly, which results in an unexpected file size.
    • Work around: Do not try to preallocate the same block more than once.
    • Problem trigger: In an FPO cluster, the problem can be triggered if one tries to pre-allocate the same block more than once and the second request has a larger file size.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: ALL Linux OS environments
    • Functional Area affected: FPO
    • Customer Impact: High
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20889
    • Problem description: The mmvdisk server list command may fail if the servers involved have separate daemon and admin interfaces.
    • Work around: None
    • Problem trigger: Having GNR servers with separate admin and daemon interfaces.
    • Symptom: Error output/message
    • Platforms affected: All
    • Functional Area affected: ESS/GNR
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20890
    • Problem description: When suspending an ECE server, the server may be incorrectly identified as a quorum node, which may prevent the server from being suspended.
    • Work around: Do not issue the suspend command on a quorum node.
    • Problem trigger: Issuing the mmvdisk recoverygroup --suspend command on a quorum node.
    • Symptom: Error output/message
    • Platforms affected: All
    • Functional Area affected: ESS/GNR
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20891
    • Problem description: When the SG manager fails during a snapshot command, the new one cleans up incomplete operations during one-time async recovery. This depends on resetting its snapshot state to match the stripe group descriptor that is stored on disk. However, the snapshot state of non-SG-manager nodes is slightly ahead of the stripe group manager during the final stages of a snapshot deletion. The new SG manager needs to correct this discrepancy during takeover when it rereads the descriptor from disk. Otherwise, in rare cases, this inconsistency can lead to an FSSTRUCT error during subsequent snapshot commands.
    • Work around: There is no preventative measure. After problem occurs, however, restarting the new stripe group manager manually will resolve it.
    • Problem trigger: Stripe group manager crash during snapshot commands.
    • Symptom: Error output/message
    • Platforms affected: All
    • Functional Area affected: Snapshots
    • Customer Impact: Very rare, mysterious errors during snapshot commands
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20892
    • Problem description: mmfsck --estimate-only option shows unreasonable estimates for some file systems.
    • Work around: None
    • Problem trigger: File system with larger log file sizes.
    • Symptom: Unexpected Results/Behavior
    • Platforms affected: All
    • Functional Area affected: FSCK
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • Item: IJ20940
    • Problem description: In certain configurations where the node name does not contain the full domain name suffix, mmvdisk --server returns a partial node name string that is not resolvable, causing mmvdisk to print an error
    • Work around: None
    • Problem trigger: mmvdisk with the --server option
    • Symptom: mmvdisk --server returns a partial node name string that is not resolvable, causing mmvdisk to print an error
    • Platforms affected: N/A
    • Functional Area affected: GNR
    • Customer Impact: Suggested
    • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    • This update addresses the following APARs: IJ20674 IJ20675 IJ20676 IJ20677 IJ20678 IJ20695 IJ20709 IJ20710 IJ20725 IJ20726 IJ20728 IJ20730 IJ20733 IJ20734 IJ20735 IJ20736 IJ20737 IJ20739 IJ20740 IJ20741 IJ20742 IJ20788 IJ20789 IJ20791 IJ20797 IJ20802 IJ20805 IJ20807 IJ20808 IJ20809 IJ20811 IJ20888 IJ20889 IJ20890 IJ20891 IJ20892 IJ20909 IJ20940 IJ20941 IJ20948 IJ20951.

    Problems fixed in Spectrum Scale 5.0.4.4 for Protocols include the following:

    • nfs: ganesha version V2.7.5-ibm056.02
    • smb: Fix error recovery for node that was powered off hard
    • smb: Version gpfs.smb 4.9.18_gpfs_37-1

    Problems fixed in Spectrum Scale 5.0.4.3 for Protocols include the following:

    • nfs: Enable ASAN bits for the Ganesha using configuration files
    • nfs: Remove Ganesha concurrent connection limit
    • nfs: ganesha version V2.7.5-ibm054.05
    • smb: Version gpfs.smb 4.9.18_gpfs_35-1

    Problems fixed in Spectrum Scale 5.0.4.2 for Protocols include the following:

    • nfs: Fix responding with NFS version mismatch
    • nfs: Fix accessing object handle after freeing its last state
    • nfs: call set_current_entry only after checking state_lock
    • nfs: Add LogEventLimited to trace in fsal_common_is_referral
    • nfs: Add Per client and per export stats
    • nfs: Hold latch in mdcache_new_entry() until mdcache_lru_insert() completes
    • nfs: ganesha version V2.7.5-ibm054.03
    • nfs: For RPCSEC_GSS, handle messages for negotiation or with wrong creds
    • install-toolkit: RHEL 8.1 support
    • install-toolkit: Config populate support for ess3k environment
    • install-toolkit: BDA HDFS protocol support through toolkit.
    • smb: Version gpfs.smb 4.9.16_gpfs_34-1

    Problems fixed in Spectrum Scale 5.0.4.1 for Protocols include the following:

    • zimon: Added missing encoding of special characters to prevent breakage of the REST APIs parsing
    • smb: Version gpfs.smb 4.9.13_gpfs_33-1
    • smb: Close ctdbd inflight connecting TCP sockets after fork.
    • smb: Avoid orphaning the TCP incoming queue
    • smb: Process all records not deleted on a remote node
    • nfs: ganesha version V2.7.5-ibm053.02

    Problems fixed in Spectrum Scale Protocols Packages 5.0.4-0 [Oct 18, 2019]

    • Please see the "What's New" page in the IBM Knowledge Center

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY","label":"IBM Spectrum Scale"},"Component":"","ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB26","label":"Storage"}}]

Document Information

Modified date:
30 April 2020

UID

isg400004937