Fix Readme
Abstract
Readme for the IBM Spectrum Scale 5.0.4.2 Data Access Edition update (Spectrum_Scale_Data_Access-5.0.4.2-x86_64-Windows) for Windows x86_64, including installation instructions and the list of problems fixed in 5.0.4.2 and 5.0.4.1.
Content
Readme file for: Spectrum Scale
Product/Component Release: 5.0.4.2
Update Name: Spectrum_Scale_Data_Access-5.0.4.2-x86_64-Windows
Fix ID: Spectrum_Scale_Data_Access-5.0.4.2-x86_64-Windows
Publication Date: 30 January 2020
Last modified date: 30 January 2020
Installation information
Download location
Below is a list of components, platforms, and file names that apply to this Readme file.
| Product/Component Name: | Platform: | Fix: |
|---|---|---|
| IBM Spectrum Scale | Windows 2008, Windows 2012 | Spectrum_Scale_Data_Access-5.0.4.2-x86_64-Windows |
Prerequisites and co-requisites
Installation information
- Downloading Images: Choose the download option "Download using Download Director" to download the new Spectrum Scale package and place it in any desired location on the install node.
Note: If you must (not recommended) use the download option "Download using your browser (HTTPS)", do not click the down arrow to the left of the package name; instead, right-click the package name and select the "Save Link As..." option. If you just click the download arrow, the browser will likely hang.
- Installing the IBM Spectrum Scale update for Microsoft Windows Server 2008+
Note: If your system has not had a full version of IBM Spectrum Scale installed, you must install the full version prior to performing these steps. When IBM Spectrum Scale is removed from a system, licensing and other information remains, allowing the update package to install correctly.
Required packages (Windows):
- gpfs.ext-5.0.4-Windows-license.msi
- gpfs.ext-5.0.4.2-Windows.msi
- gpfs.gskit-8.0.50.86.msi
- Extract the contents of the ZIP archive so that the .msi files it includes are directly accessible to your system.
- Uninstall the system's current version of GPFS using the Programs and Features control panel. If prompted to reboot the system, do this before installing the update package.
- Follow the installation instructions in the IBM Spectrum Scale Installing and upgrading documentation.
If you are upgrading directly from version 3.4 or prior (any level), or installing version 5.0.4.2 for the first time on your system, you must first install the GPFS license package (gpfs.ext-5.0.4-Windows-license.msi) before installing this update.
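The following is a minimal sketch of installing the packages from an elevated command prompt in the directory where the ZIP archive was extracted. It assumes the standard Windows Installer invocation (msiexec /i); the relative order of the GSKit and update packages shown here is illustrative only, so follow the IBM Spectrum Scale Installing and upgrading documentation for the exact sequence.
rem Install the license package only if upgrading from version 3.4 or prior,
rem or if installing 5.0.4.2 for the first time on this system.
msiexec /i gpfs.ext-5.0.4-Windows-license.msi
rem Install the GSKit package and the 5.0.4.2 update package.
msiexec /i gpfs.gskit-8.0.50.86.msi
msiexec /i gpfs.ext-5.0.4.2-Windows.msi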
Additional information
- Package information: The update image listed below and contained in the ZIP archive is a maintenance package for IBM Spectrum Scale. The update image can be applied directly to your system.
The images can be used for a new install or an update from a prior level of IBM Spectrum Scale.
After the Windows Installer packages (.msi) are installed, you have successfully updated your IBM Spectrum Scale product.
Before installing IBM Spectrum Scale on Windows nodes, verify that all the installation prerequisites have been met. For more information, see the IBM Spectrum Scale Concepts, Planning and Installation Guide in IBM® Knowledge Center.
Update to Version:
5.0.4.2
Update from Version:
4.2.0.0 - 5.0.4.1 (if upgrading node by node)
3.5.0 - 5.0.4.1 (if you shut down and upgrade the entire cluster)
Update (zip file) contents:
- gpfs.ext-5.0.4-Windows-license.msi
- gpfs.ext-5.0.4.2-Windows.msi
- gpfs.gskit-8.0.50.86.msi
- Summary of changes for IBM Spectrum Scale
Unless specifically noted otherwise, this history of problems fixed for IBM Spectrum Scale 5.0.x applies to all supported platforms.
Problems fixed in IBM Spectrum Scale 5.0.4.2 [January 30, 2020]
- Item: IJ21257
- Problem description: GPFS daemon assert: err == E_OK dirop.C. This could happen after GPFS runs out of file cache entries and is forced to move a directory from file cache to stat cache.
- Work around: Increasing maxFilesToCache will reduce the chance of hitting this assert (see the example below).
- Problem trigger: Directory is being moved from file cache to stat cache.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical
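A hedged sketch of the workaround above; 1000000 is only an illustrative value, "affectedNodes" is a placeholder node list, and maxFilesToCache generally takes effect only after the GPFS daemon is restarted.
# Raise the file cache limit cluster-wide (example value only)
mmchconfig maxFilesToCache=1000000
# Restart GPFS on the affected nodes for the new value to take effect
mmshutdown -N affectedNodes
mmstartup -N affectedNodes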
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21258
- Problem description: Running mmsdrrestore against a quorum node in a CCR-enabled cluster will crash the GPFS daemon.
- Work around: Shut down GPFS before performing mmsdrrestore (see the example below)
- Problem trigger: Running mmsdrrestore against a quorum node in a CCR-enabled cluster
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: CCR, Admin Commands
- Customer Impact: Critical
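A minimal sketch of the workaround above, run on the quorum node being restored; "nodeWithValidConfig" is a placeholder for a node that holds a valid configuration file.
# Shut down GPFS on this node before restoring its configuration
mmshutdown
# Restore the node's configuration from a node with a valid copy
mmsdrrestore -p nodeWithValidConfig
# Start GPFS again
mmstartup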
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21260
- Problem description: GPFS daemon assert: !(ccP->hasJoined() && ccP->isXClust(destNode())). This could happen after moving a node from one remote cluster to another while both clusters have remote mounted a file system from a home cluster.
- Work around: Disable ialloc function ship via "mmchconfig iallocFuncshipEnabled=false -i"
- Problem trigger: Moving a node from one remote cluster to another.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21396
- Problem description: On RHEL 7 nodes (pre-Linux kernel v3.18), in the GPFS kernel NFS support environment, GPFS may try to acquire some mutex, while holding an inode spin lock, which may be detected as a soft lockup issue by the kernel NMI watchdog.
- Work around: None
- Problem trigger: GPFS breaks a spin lock holding policy in NFS support environment
- Symptom: Performance Impact/CPU stuck
- Platforms affected: All RHEL 7.x
- Functional Area affected: Users of KNFS/CNFS only
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21304
- Problem description: If encryption is not configured properly, starting down disks could result in mismatched replicas.
- Work around: None
- Problem trigger: During the "start disk", repairing mismatched replicas failed on certain files because encryption context was not available, and the error E_ENC_CTX_NOT_READY was treated as a SEVERE error which means that the code continues to repair the replicas to the degree possible. In the final phase of repair, the missupdate flag was incorrectly cleared from the inode even though we did not synchronize the replicas, as the repair failed due to unavailable encryption context. As the missupdate flag was cleared from the inode, a subsequent "start disk" brought up all down disks, but the file still had mismatched replicas. A later "mmrestripefs -c" may then pick up the wrong replica and overwrite the good replicas.
- Symptom: Encrypted replicas mismatch after start disk.
- Platforms affected: All
- Functional Area affected: Core
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21261
- Problem description: On one side of an AFM relationship, an AFM fileset is being deleted and on the other side there's a getstate to show AFM fileset states. The getstate command picks the fileset being deleted to print its stats, and causes the Assert.
- Work around: Do not run "mmafmctl
getstate/mmdiag" commands when AFM filesets are being Deleted. - Problem trigger: n one side an AFM fileset is being deleted (which could take time depending on number of inodes in the fileset and amount of data). While this is happening, another node in the cluster queries AFM stats on the AFM filesets (mmafmctl
getstate (or) an mmdiag running). - Symptom: Abend/Crash
- Platforms affected: ALL Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: High
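For reference, the state query mentioned in the workaround is typically issued as shown below; "fs0" and "fileset1" are placeholder names. Avoid issuing it while an AFM fileset delete is in progress.
# Query AFM state for a single fileset (avoid while a fileset delete is running)
mmafmctl fs0 getstate -j fileset1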
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21263
- Problem description: Starting from 5.0, a few special afmIOFlags were introduced to make AFM behave in special ways (for migration and replication). The flags started getting out of control, and needed a human readable format to understand what flags are set.
- Work around: None
- Problem trigger: "mmlsfileset
-L --afm" does not print human readable IO Flags. - Symptom: Error output/message.
- Platforms affected: ALL Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21422
- Problem description: An EACCESS error is returned to the NFS client from the Ganesha server, and it can cause IO failure for metadata access (ls command) on a file/directory or can fail the rm operation on the directory.
- Work around: None
- Problem trigger: It is difficult to recreate, but a possible reason could be a file/directory move/deletion from the parent directory which leaves a disconnected dentry in the Linux kernel.
- Symptom: IO failure
- Platforms affected: Linux Only
- Functional Area affected: NFS Ganesha
- Customer Impact:
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21394
- Problem description: Correct description of the resumeRequeued command to indicate that the filesetName is a required argument.
- Work around: None
- Problem trigger: Running the mmafmctl command as recommended in the man page.
- Symptom: mmafmctl shows wrong help - not mandating the filesetName for the mmafmctl resumeRequeued subcommand.
- Platforms affected: ALL Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21432
- Problem description: A linux mknod operation for a FIFO object can encounter this assert if the object is opened before the operation completely finishes.
- Work around: The assert can be disabled with the assistance of service via "mmchconfig disableAssert"
- Problem trigger: A linux mknod operation to create a FIFO object while another process attempts to open the same object (not actually waiting for the create to complete).
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21424
- Problem description: "mmfsadm (safer)dump afm fset" - which displays the AFM handler of an AFM fileset - reports incorrect negative values for the numAsyncLookups column. The same output is also collected as part of the internal dumps gathered for gpfs.snap.
- Work around: None
- Problem trigger: The "mmfsadm (safer)dump afm fset" command that displays the handler for an AFM fileset is issued; the same output is also collected as part of the internal dumps gathered for gpfs.snap.
- Symptom: "mmfsadm (safer)dump afm fset" reports incorrect negative values for the numAsyncLookups column.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21541
- Problem description: AFM deletes the orphan file when the home is not reachable during the lookup. The orphan file is created during the readdir and is repaired during the lookup. It is possible that multiple threads delete the same orphan file and run into an FSStruct error because the same inode is attempted for deallocation multiple times.
- Work around: None
- Problem trigger: Doing readdir and lookup on the AFM cache fileset when the home is disconnected after the readdir.
- Symptom: Error output/message
- Platforms affected: ALL
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21550
- Problem description: Deadlock could happen if quorum loss occurs on a newly appointed stripe group manager. Threads could be stuck in 'waiting for stripe group takeover' and 'waiting for SG cleanup'.
- Work around: None
- Problem trigger: Quorum loss just as a node starts taking over the file system manager role
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21554
- Problem description: Enable the usage of a list of groups for the --ces-group option of the mmces command
- Work around: Repeat the command using one CES group per invocation
- Symptom: Without the fix the user cannot choose a combination of groups when filtering the command output for ces groups.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21557
- Problem description: Make timeout of commMsgCheckMessages RPC consistent on all nodes and issue a warning message if it took more than one third of the timeout to get the reply of commMsgCheckMessages RPC.
- Work around: None
- Problem trigger: A degraded network condition which leads to sending the commMsgCheckMessages RPC
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21645
- Problem description: On Linux nodes with kernel version 4.7 or later, when copying a source file with the command "cp -p", the ACL data is lost in the destination file if the source file contains many ACL entries (for example, 20 or more).
- Work around: None
- Problem trigger: Defect in porting of GPFS to Linux kernel version 4.7.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux nodes with kernel version 4.7 or later
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21647
- Problem description: The systemhealth monitor fails to start.
- Work around: None
- Problem trigger: The problem depends on the python packages provided by the various Linux distributions. It seems that not all distros provide the required packages. During development and internal test, RHEL 7.6 was used without issues.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21648
- Problem description: mmces address add is failing when both object attributes are assigned to one CES IP address
- Work around: "cat /var/mmfs/gen/cesAddressPoolfile" will show the requested information.
- Problem trigger:
- Symptom: Customer gets incorrect information using the mmces list command.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21434
- Problem description: GPFS user space daemon crashed during read/write through NFS or mmapplypolicy.
- Work around: None
- Problem trigger: In openNFS, the first lockFile put a hold on the cachObj; the next lockFile in the openNFS skipped the lookup of the file from the hash table, which means the cachObjMutex was not acquired. As a result, the releaseCacheObjMutex at the end of lockFile wrongly cleared the lockWordCopy in the mutex; unfortunately, this mutex had been acquired by a daemon thread before the lockFile called releaseCacheObjMutex. So the daemon thread continued its work and hit the assert when it called ASSERT_MUTEX_HELD to check that it did acquire the mutex. Because the lockWordCopy in the mutex was wrongly cleared by the kernel lockFile, the assert went off in the daemon thread.
- Symptom: Daemon crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21654
- Problem description: AFM dependent filesets do not have the .afm/.ptrash/.pconflicts/.afmtrash directories, which are used for storing conflicting files. The .afmtrash directory is used to move a non-empty directory during directory deletion.
- Work around: None
- Problem trigger: Replication to dependent filesets
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Operating System environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21659
- Problem description: Revalidation on the fileset root path might not happen correctly if the gateway is running some operating systems like RHEL 7.7. This causes the new data from the target path not to be fetched from the home.
- Work around: None
- Problem trigger: Revalidation on the fileset root path in the AFM caching modes.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Certain Linux OS environments, like RHEL 7.7
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21660
- Problem description: An upgrade to a major release of the PostgreSQL server will trigger a new health event informing the user that the database will be reinitialized.
- Work around: Manually drop the database and allow the GUI to create it.
- Problem trigger: Upgrade PostgreSQL to a new major release
- Symptom: Component Level Outage
- Platforms affected: ALL Linux OS environments
- Functional Area affected: REST APIs, GUI
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21964
- Problem description: Harden mmces command against injection vulnerability
- Work around: None
- Problem trigger:
- Symptom: For some mmces commands it is possible to inject a shell command to execute by appending "|" and a command to the parameter list. This injection is possible on the command line and from the GUI.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: Critical: security issue
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21974
- Problem description: AFM gateway daemon asserts when the request arrives before the filesystem is mounted.
- Work around: Remove the gateway designation from the node, start GPFS, mount the file system, and make the node a gateway again using the "mmchnode --gateway -N" command (see the example below).
- Problem trigger: Start the gateway node while IO is running on the AFM fileset.
- Symptom: Crash
- Platforms affected: Linux
- Functional Area affected: AFM
- Customer Impact: Critical
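A hedged sketch of the workaround above; "gwNode" and "fs0" are placeholder names for the gateway node and the file system.
# Temporarily remove the gateway role from the node
mmchnode --nogateway -N gwNode
# Start GPFS and mount the file system on that node
mmstartup -N gwNode
mmmount fs0 -N gwNode
# Restore the gateway role once the file system is mounted
mmchnode --gateway -N gwNode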
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21975
- Problem description: AFM gets the sparse-file information from the home before reading the file, and the actual data size is used to set the cached bit. It is possible that the data blocks allocated at the cache exceed the actual data size if the file is sparse in between, and the cached bit is then set without fully reading the file.
- Work around: Disable sparse file detection by setting afmReadSparseThreshold=disable (see the example below)
- Problem trigger: AFM read on the sparse files with afmReadSparseThreshold set (default on)
- Symptom: Unexpected result
- Platforms affected: Linux
- Functional Area affected: AFM
- Customer Impact: HiPER
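A minimal sketch of the workaround above, assuming the value is applied cluster-wide with mmchconfig.
# Disable AFM sparse file detection (workaround for IJ21975)
mmchconfig afmReadSparseThreshold=disable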
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21977
- Problem description: AFM gateway daemon asserts if the remote mount initialization fails during the first access to the fileset
- Work around: None
- Problem trigger: Remote mount failure
- Symptom: Crash
- Platforms affected: Linux
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21978
- Problem description: When deleting a snapshot, the process may fail to move the data blocks of files in the snapshot being deleted that have small inode numbers. The affected inodes are in the same inode block as the fileset metadata file, but not in the first inode block of the inode 0 file.
- Work around: None
- Problem trigger: Deleting a snapshot which contains a file with small inode number
- Symptom: Data corruption
- Platforms affected: All
- Functional Area affected: Snapshot
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22022
- Problem description: When handling a page fault, GPFS did not detach the I/O buffer segment. This later caused a kernel crash.
- Work around: None
- Problem trigger: Multiple threads doing both normal I/O and mmap I/O on the same file at the same time.
- Symptom: Kernel crash
- Platforms affected: AIX
- Functional Area affected: Mmap I/O
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22004
- Problem description: AFM gateway daemon crashes during resync operations due to the race between the thread which is monitoring the stuck messages and threads replicating the data.
- Work around: Increase the afmAsyncOpWaitTimeout value (see the example below)
- Problem trigger: AFM resync
- Symptom: Crash
- Platforms affected: All Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Critical
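A hedged sketch of the workaround above, assuming the value is set cluster-wide with mmchconfig; 600 is only an illustrative value in seconds.
# Increase the AFM async operation wait timeout (example value only)
mmchconfig afmAsyncOpWaitTimeout=600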
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22005
- Problem description: Customer data showed that GPFS asserted when trying to open a disk while processing mmadddisk/mmrpldisk because the disk was not assigned a valid storage pool. The root of the problem (why the disk was associated with an invalid storage pool during mmadddisk/mmrpldisk) was not discovered due to lack of data. The logic is: by the time GPFS tries to open a disk due to a stripe group descriptor update from mmadddisk/mmrpldisk, the disk should be assigned to a valid storage pool. It was decided to safeguard GPFS so that it does not open a disk when the disk is assigned to an invalid storage pool.
- Work around: None
- Problem trigger: This problem has not surfaced internally and there is not enough data from customer to find out why this could happen. From examining the code, GPFS should have assigned a valid storage pool during mmadddisk/mmrpldisk even though the disk was created without specifying the storage pool.
- Symptom: Abend/Crash
- Platforms affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22007
- Problem description: Online replica compare function (mmrestripefs -c) could give incorrect replica mismatch error on directories. This could happen if subblock size for metadata is greater than 256K.
- Work around: None
- Problem trigger: Run mmrestripefs -c on a file system with a metadata subblock size greater than 256K (see the example below).
- Symptom: Error output/message
- Functional Area affected: Admin commands
- Customer Impact: Suggested
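For reference, the online replica compare described above is invoked as shown below; "fs0" is a placeholder file system name.
# Compare data and metadata replicas online (may report false mismatches without this fix)
mmrestripefs fs0 -c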
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21097
- Problem description: The ports of the 2nd (and later) IB adapters on the node which starts the verbs connection might be mis-recognized as RDMA CM disabled ports, and fail to be connected. The nodes that start the verbs connection are NSD clients if verbsRdmaSend=no, but they also could be other nodes if verbsRdmaSend=yes. You will see the "ibv_modify_qp init err 22" error message in the mmfs.log file if this happens.
- Work around: None. But if RDMA-CM is not really needed in your environment, you can just disable it.
- Problem trigger: Users having
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22034
- Problem description: File creation could fail unexpectedly with an EFBIG error. This could happen when multiple nodes access the same directory while one node repeatedly creates and deletes the same file in the directory.
- Work around: Perform a rename on a file in the directory after encountering the EFBIG error.
- Problem trigger: Repeatedly create and delete the same file in a directory.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22009
- Problem description: The GPFS command mmchattr stores the extended attribute name-value pair in the inode itself, even for the ACL xattr, which should be stored in the GPFS internal ACL file. This behavior of ACL xattr handling may confuse users.
- Work around: None.
- Problem trigger: None
- Symptom: Confusing output
- Platforms affected: Linux
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22036
- Problem description: On a file system with unavailable metadata disks, a log recovery failure prevents the file system from being mounted or disks from being started. Either mmfsck -xk should allow repair of the logs in this case, or tsdbfs -f should allow the user to patch the disk states. Fixed the code to bypass the disk availability check when fsck is invoked in read-only mode. This allows both mmfsck -xk and tsdbfs -f to run in such situations.
- Work around: Use a node at version less than 5.0.2 to either run mmfsck -xk or tsdbfs -f to patch disk states. This only works if the file system version is less than 5.0.2.
- Problem trigger: File system disks are down and log recovery has failed.
- Symptom: Error output/message
- Platforms affected: ALL Operating System environments
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22010
- Problem description: Log recovery error after node failure can cause recovery buffer to be overwritten which will most likely lead to GPFS daemon assert.
- Work around: None.
- Problem trigger: Node failure
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All
- Customer Impact: HiPER
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22013
- Problem description: In the AFM Stopped and Queue Dropped states, when a file/directory is removed at the cache site, the inode is still seen as USERFILE and is not reclaimed.
- Work around: None.
- Problem trigger: Running applications/workload when AFM fileset is in Stopped state.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All Linux and AIX operating systems.
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22017
- Problem description: The TSM client version can contain 2 or more digits in any position of V.R.M.F, but mmbackup cannot handle such a case. As a result, mmbackup fails while parsing the TSM client version.
- Work around: None.
- Problem trigger: Executing mmbackup with TSM client 8.1.10.
- Symptom: Component Level Outage
- Platforms affected: ALL Operating System environments
- Functional Area affected: mmbackup
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22024
- Problem description: mmprotocoltrace timers of manually stopped traces would unexpectedly stop newly initiated mmprotocoltrace traces.
- Work around: 1. Either wait for the duration of the previous protocol trace (default: 10 min) before starting a new trace for the same component 2. or kill all mmprotocoltrace processes on all CES nodes, which participate in the trace (by default: all CES nodes)
- Problem trigger: Starting the second protocol trace via mmprotocoltrace for the same component after the first trace was manually stopped and the timeout of the first trace was not yet reached.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Trace CES
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22041
- Problem description: When running "mmdiag --waiters" or "mmfsadm dump waiters", or the periodical health check performs long waiters detection, the code could run into memory overflow for a local buffer, then triggers the signal 6 to mmfsd daemon and causes it restarted abnormally.
- Work around: None.
- Problem trigger: mmdiag --waiters or mmfsadm dump waiters, or the periodical health check inside mmfsd daemon.
- Symptom: Daemon crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: Long waiters detection and dump
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21955
- Problem description: When using Microsoft Office applications such as Word and Excel on Windows 10 version 1709 or newer, any attempt to modify and save an existing file (.docx, .xlsx, etc.) will fail with a sharing violation error.
- Work around: None.
- Problem trigger: This issue is triggered when installing or upgrading to Windows 10 version 1709 or newer. It is also hit in Windows Server version 1809 or newer.
- Symptom: Sharing violation errors when attempting to modify and save existing *.docx, *.xlsx (and other Office) files using Microsoft Office applications such as Word and Excel. Saving as a different name works.
- Platforms affected: Windows/x86_64 only. Specifically, Windows 10 (version 1709 or newer) and Windows Server (version 1809 or newer) only.
- Functional Area affected: Windows.
- Customer Impact: High Importance.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22158
- Problem description: The ccMsgGroupJoinPhaseN message is sent to all the nodes which are up during the join protocol; in this case the message is sent to the down gateway node, causing the deadlock.
- Work around: None.
- Problem trigger: Remote node joining the cluster with a down gateway node.
- Symptom: Deadlock
- Platforms affected: ALL Operating System environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22207
- Problem description: Kernel assert going off: bufOffset+len = iobP->ioBufLen in file cxiIOBuffer.c, resulting in a kernel panic.
- Work around: None.
- Problem trigger: Calling Spectrum Scale APIs to scan inodes in the file system. Note that some binaries delivered with the Spectrum Scale package also call such Spectrum Scale APIs, for example policy rules that scan files in the file system, the snapshot restore utility, and the SOBAR backup utility.
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: applications using GPFS APIs, including policy, snapshot restore and sobar backup.
- Customer Impact: High Importance.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22261
- Problem description: Due to the way mmfsck internally traverses reserved files and snapshots, it is not able to report and fix duplicate addresses present among inode 0 files of the active file system and its snapshots. As a result, even though mmfsck -y runs successfully and reports the file system as clean, the duplicate address corruptions are not fixed, and the next mmfsck run will report new corruptions such as mismatched replicas present in inode 0. There can also be fsstructs reported in the logs after mmfsck -y because of this.
- Work around: Delete all the snapshots in the file system and then run mmfsck repair
- Problem trigger: ??
- Symptom: Operation failure due to FS corruption. Also, on a file system having snapshots, the fsck output shows the following signs after a successful mmfsck -y run: 1) mismatched replicas in inode 0 ("Error in inode 0 snap 0: Inode block 289710225 has mismatched replicas"); 2) even though no duplicates are reported, fsck shows "Checking for the first reference to a duplicate fragment."; 3) even though no duplicates are reported, a non-zero duplicates count appears at the end of the fsck output ("896 duplicates").
- Platforms affected: ALL Operating System environments
- Functional Area affected: FSCK is not able to repair the corruption
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ21097 IJ21257 IJ21258 IJ21260 IJ21261 IJ21263 IJ21304 IJ21394 IJ21396 IJ21422 IJ21424 IJ21432 IJ21434 IJ21541 IJ21550 IJ21554 IJ21557 IJ21645 IJ21647 IJ21648 IJ21654 IJ21955 IJ21659 IJ21660 IJ21964 IJ21974 IJ21975 IJ21977 IJ21978 IJ22004 IJ22005 IJ22007 IJ22009 IJ22010 IJ22013 IJ22017 IJ22022 IJ22024 IJ22034 IJ22036 IJ22041 IJ22158 IJ22207 IJ22261
Problems fixed in Spectrum Scale 5.0.4.1 [November 21, 2019]
- Item: IJ20948
- Problem description: On an AFM cache cluster using the AFM independent-writer mode, data may be incompletely read if a file is modified before it is fully cached. Normally AFM reads a file from the AFM home cluster before allowing write operations to occur. However, if a file is not opened in append mode but a write is made at the end of the file, the data for the file may not be completely cached.
- Work around: Run prefetch on the partially cached files (see the example below).
- Problem trigger: AFM caching modes and updating at the end of the file before fully caching it.
- Symptom: Unexpected results
- Platforms affected: All
- Functional Area affected: AFM caching.
- Customer Impact: HiPER
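A hedged sketch of the workaround above; "fs0", "fileset1", and the list file path are placeholders, and the list file is assumed to contain the paths of the partially cached files.
# Prefetch the partially cached files named in the list file
mmafmctl fs0 prefetch -j fileset1 --list-file /tmp/partially-cached.list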
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20909
- Problem description: When mmfsck scans and finds corrupted reserved file blocks it prints the list of blocks corrupted and due to a code bug in that path, the file system manager node asserts with Signal 11.
- Work around: Do not run mmfsck
- Problem trigger: This will happen when mmfsck is run on a file system having corrupted reserved file blocks
- Symptom: File system manager node assert
- Platforms affected: All
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20710
- Problem description: FSSTRUCT error FSErrCheckHeaderFailed could be issued while accessing some directory. This could happen on a file system with metadata replication where a metadata disk is in the down state and a node failure occurs.
- Work around: None
- Problem trigger: Metadata disk in down state and node failure.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20678
- Problem description: On a node with multiple file systems mounted, the DiskLeaseThread could be blocked by a file system unmount, causing a delay in the renewal of the disk lease and potential quorum loss.
- Work around: None
- Problem trigger: File system unmount
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional Area affected: Cluster Membership
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20726
- Problem description: After a system crash the configuration file /etc/sysconfig/ganesha contained only an entry for NOFILE, but no longer the entries for OPTIONS and EPOCH_EXEC. No Ganesha logs were created.
- Work around: Since there is no backup file of /etc/sysconfig/ganesha by default, the file must be extracted either from RPM or fetched from another CES node.
- Problem trigger: The /etc/sysconfig/ganesha file was modified in place whenever NFS was started. The procedure used the "sed -i" command for this. The goal was to always have the latest NOFILE entry in the file, along with those for OPTIONS (startup options for Ganesha) and EPOCH_EXEC. Some investigation indicates that during a system crash not all changes in the file were written to disk. So once this file is damaged or truncated, the only entry left is the added NOFILE data. Previously existing OPTIONS and EPOCH_EXEC entries cannot be recovered since there is no mechanism to do so. After the code change the NOFILE data is updated on a copy of the original configuration file. If the changes are all done, then this copy is restored back to the original file.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20811
- Problem description: The application allows a regular user to inject OS commands in the "NFS Exports" Client field. The injected command is executed on the underlying operating system as "root" user.
- Work around: None
- Problem trigger: Using the GUI to add NFS exports allows this condition.
- Symptom: Behavior - Security risk
- Platforms affected: All
- Functional Area affected: NFS
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20941
- Problem description: Removing message headers that are not utilized from librdkafka message to reduce message size sent to external sink.
- Work around: None
- Problem trigger: When running Clustered watch with a heavy workload producing many events, if the external kafka cluster gets overloaded, clustered watch may hit a timeout and auto disable. With this fix, the librdkafka message size reduction makes it less likely to hit this timeout.
- Symptom: The 45-second timeout on clustered watch is hit, causing conduit(s) to go down. The following error message appears in /var/adm/ras/mmwfclient.log: 2019-08-26_00:51:49: [E] WF Producer: t: newtopic a: 3
- Platforms affected:
- Functional Area affected:
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20797
- Problem description: AFM Secondary mode filesets are passive in nature (and RO), since the Primary is the only one allowed to perform write-class operations on the secondary mode fileset. This bug allows creates to be performed directly on the Secondary mode fileset even when afmSecondaryRW is set to no. However, other write-class operations like set times, chmod, etc. are not allowed on the file.
- Work around: None
- Problem trigger: User tries to perform IO Operations on an AFM Secondary mode fileset when afmSecondaryRW is set to no.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20733
- Problem description: Node crashes with assert when the AFM fileset with active IO is unlinked.
- Work around: Stop the AFM fileset and then unlink the fileset (see the example below).
- Problem trigger: Fileset unlink with active IO.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: AFM
- Customer Impact: High
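A minimal sketch of the workaround above; "fs0" and "fileset1" are placeholder names, and it assumes the mmafmctl stop action available in this release.
# Stop AFM replication on the fileset before unlinking it
mmafmctl fs0 stop -j fileset1
# Unlink the fileset once AFM is stopped
mmunlinkfileset fs0 fileset1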
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20677
- Problem description: Daemon crashes due to invalid config setting where enableStatUIDremap is enabled without enabling the enableUIDremap config option.
- Work around: Enable both enableUIDremap and enableStatUIDremap options.
- Problem trigger: UID remapping with invalid config options.
- Symptom: Crash
- Platforms affected: All
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20730
- Problem description: When running "mmces node suspend/resume -N" with a list of nodes it might happen that not all of them are in the expected state afterwards.
- Work around: Repeat the "mmces node suspend/resume -N" command with the list of nodes which were not set to the expected state previously (see the example below).
- Problem trigger: The cesiplist file has a unique serial number assigned when it is stored in CCR. Each node reads the cesiplist file (and its serial number) from CCR as a local copy and modifies the suspend flag in that local copy. After this all nodes which did this kind of local update try now to update their modified copy of the cesiplist file in CCR with an incremented (+1) serial number. That may fail when other nodes did this update already with the same serial number earlier. There is some randomness, since not all nodes try this update at the very same time. There could be a timespan of several seconds between the first and the last one, so that some nodes get updated cesiplist files and serial numbers, and work on those.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: Suggested
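A hedged sketch of the workaround above; "node1" and "node2" are placeholders for the nodes that did not reach the expected state.
# Re-issue the suspend (or resume) for the nodes that were missed
mmces node suspend -N node1,node2
# Verify the resulting CES node state
mmces node list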
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20741
- Problem description: Fix quota share revoke/reclaim delay when the quota usage is approaching the limits.
- Work around: None
- Problem trigger: When quota usage is approaching the limits (hard limit), the attempts to reclaim the remaining quota shares from other quota clients can lead to very slow quota management operations.
- Symptom: Processes waiting for available quota, when the quota usage is approaching the limits, leading to apparent system hung.
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20808
- Problem description: On AIX, when trying to clear/write the primary GPT area, mmcrnsd does non-4k aligned writes to 4K disks while trying to preserve the OS PVID, causing a failure.
- Work around: None
- Problem trigger: Create an nsd out of 4kb sector size native disk(s) on AIX
- Symptom: Error output/message
- Platforms affected: AIX
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20805
- Problem description: The GPFS daemon (mmfsd) consumes high CPU load on a quorum node when Windows 2016 is used as the operating system. This is caused by a CCR thread listening to incoming CCR requests on cached connections from other quorum nodes by using the poll system call. This logic doesn't consider particular flags returned by the poll system call (in detail: POLLHUP, POLLERR, POLLNVAL). A second GPFS daemon (mmsdrserv) might be affected by this issue. This daemon is running when GPFS has been shutdown by the mmshutdown command. This issue doesn't occur on Linux or AIX.
- Work around: Assign other nodes as quorum nodes which don't use Windows 2016 as the underlying operating system, if possible, e.g. nodes in the cluster running on Linux or AIX.
- Problem trigger: GPFS startup (mmsdrserv starts automatic, mmfsd after 'mmstartup -a')
- Symptom: -Performance Impact/Degradation -Unresponsiveness
- Platforms affected: Windows 2016 (at least, earlier/later Windows version might be affected too)
- Functional Area affected: CCR admin commands
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20736
- Problem description: Readdir fails for the pcache fileset root, due to the cached bit being set for the first created pcache fileset (even if it is not linked) on a file system that has no pre-existing pcache fileset.
- Work around: None
- Problem trigger: Accessing the file structure for the first time from the first pcache fileset on a file system that has no pre-existing pcache fileset.
- Symptom: File/dir tree mismatches.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20742
- Problem description: In a multicluster environment, a remote cluster client node is creating a file in a directory inode which has its metanode in a different remote client cluster. Live lock can happen in this case, if the directory is empty or small, due to a performance optimization.
- Work around: Use the directory only from one remote cluster.
- Problem trigger: Creating files in an empty or small directory from two remote clusters
- Symptom: Hang
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20728
- Problem description: Policy scans cannot be executed successfully through the Jobs framework
- Work around: Manually change the command templates JSON file: run "mmccr fget _jobCommandTemplates.json /tmp/jct.json", edit /tmp/jct.json, run "mmccr fput _jobCommandTemplates.json /tmp/jct.json", and then run "/usr/lpp/mmfs/gui/bin/runtask GPFS_JOBS". The two changes that need to be made are: change localWorkDir to localWorkDirectory in the command template, and change fileListPathname to fileListPathName in the argument definition.
- Problem trigger: The policy-scan template is used in a job
- Symptom: Command execution failure
- Platforms affected: All
- Functional Area affected: Jobs
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20725
- Problem description: When writing to a memory-mapped file that was compressed, the write fails with SIGBUS when the mmapRangeLock config variable is disabled.
- Work around: Do not disable the mmapRangeLock config variable
- Problem trigger: Writing to memory-mapped files that were compressed while the mmapRangeLock config variable is disabled.
- Symptom: Application fails with SIGBUS
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: It is critical if customer disabled mmapRangeLock config variable.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20709
- Problem description: Updating ESS drive firmware on a live system can be blocked for long periods of time (and may timeout) due to a declustered array that shows up in "rebalance" state.
- Work around: None
- Problem trigger: This problem is seen when updating drive firmware.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: ESS/GNR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20695
- Problem description: A TCT enabled system can see gpfs waiters of type "LweAccessRightThread waiting for XW lock"
- Work around: None
- Problem trigger: If a dmapi right is acquired on a file, and the file gets deleted, then releasing the right would cause a waiter to appear
- Symptom: appearance of gpfs waiters of type "LweAccessRightThread waiting for XW lock"
- Platforms affected: ALL Linux OS environments
- Functional Area affected: TCT
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20675
- Problem description: During the Copy On Write process in which a data block is copied to a snapshot, if the metanode fails, there is a chance for the assert to happen, due to the flush flag not being held.
- Work around: None
- Problem trigger: With debugDataControl set to heavy on AIX when automatic debug data collection on unexpected long waiter happens.
- Symptom: Performance Impact/Degradation
- Platforms affected: All non-Linux platforms.
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20951
- Problem description: There are some access problems to disks, causing the log recovery failure, eventually causing the file system to be panicked on all nodes. Since the incoming remote mounts prevented the offline fsck from running, users then moved the file system into maintenance mode and wanted to try offline fsck again. However, the log recovery was not skipped even when the file system was in maintenance mode, so resulted in the same result for the offline fsck running.
- Work around: None
- Problem trigger: The file system logs for some nodes are not clean before moving the file system into maintenance mode.
- Symptom: Log recovery is attempted and fails.
- Platforms affected: All
- Functional Area affected: File System Maintenance Mode
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20789
- Problem description: When deleting a global snapshot, if the snapshot refers to a deleted fileset then the assert will be triggered.
- Work around: None
- Problem trigger: This problem only happens when deleting a global snapshot, while a fileset included in it has been deleted.
- Symptom: Daemon abend
- Platforms affected: All
- Functional Area affected: Global snapshot deletion
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20809
- Problem description: Daemon crashed with assert ofP->metadata.notAFragment(subblocks). It may occur when appending data to a file after a previous write failed due to an invalid data buffer in the application.
- Work around: Make sure the user data buffer is valid before writing data into the Scale file system
- Problem trigger: An invalid user data buffer caused GPFS to fail when writing data to a file while leaving the invalid data in the buffer. A flush of the buffer incorrectly set the file's fragment to a full block which resulted in a failure to expand the last block of the file, triggering the assert.
- Symptom: Scale daemon crashed with assert ofP->metadata.notAFragment(subblocks) in bufdesc.C
- Platforms affected: All
- Functional Area affected: Core
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20739
- Problem description: The RPO thread, which takes care of creating RPO snapshots for AFM DR filesets, takes locks on all filesets in the file system before it can see which filesets require RPO snapshots to be taken. This includes any non-AFM independent/dependent filesets as well.
- Work around: None
- Problem trigger: Having multiple AFM DR Primary filesets with RPO intervals enabled.
- Symptom: Performance Impact/Degradation; Hang/Deadlock/Unresponsiveness/Long Waiters (lesser probability)
- Platforms affected: All Linux
- Functional Area affected: AFM Snapshots
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20791
- Problem description: After migrating a file from GPFS to external storage any indirect blocks used by the file are not freed.
- Work around: None
- Problem trigger: Migration of large files, requiring indirect blocks, to external storage.
- Symptom: Metadata disk space is not freed after files are migrated to external storage.
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20674
- Problem description: Due to a bug, fsck continues to process a deleted inode and marks it as an orphan which causes this assert.
- Work around: Patch the problematic inode using tsdbfs so that the inode is no longer corrupt and retry fsck.
- Problem trigger: A deleted inode is corrupt.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20676
- Problem description: During offline fsck multi-pass directory scan, if patch queue feature is disabled and --skip-inode-check option is used, then fsck tries to access an out of range entry in dotdotArray and hits this assert.
- Work around: None
- Problem trigger: Multi-pass offline fsck --skip-inode-check with patch queue feature disabled.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20734
- Problem description: Every 15 seconds the CES monitor daemon runs a helper script to analyze the state of CES. The change hardened the monitor to not die but to collect information about the malfunction. If the malfunction repeats this problem is reported to system health by an event and the customer will find the problem in the event logs of mmhealth.
- Work around: Before the implementation of the fix the information had to be collected from the log files.
- Problem trigger: Unexpected behavior of a helper script called by CES monitor daemon. The helper script may die because of low memory, blocked lock, etc.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: Suggested: has little or no impact on customer operation
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20735
- Problem description: GPFS daemon could assert when trying to mount a file system. This could happen after a node failure and file system is being mounted again after daemon restart. File system manager node would also fail with an assert.
- Work around: None
- Problem trigger: A client node failure
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20788
- Problem description: After upgrading a cluster with a pre-4.1 file system which has quotas enabled, the old user-visible quota files will be converted to GPFS internal files. This change is kept in the stripe group descriptor for the file system. However, this change is not broadcast to all nodes and causes a metadata inconsistency leading to the assert.
- Work around: Method 1) Run the command "mmumount -a", then "mmmount -a", after upgrading a pre-4.1 file system which has quota enabled. Method 2) Execute commands that update the stripe group descriptor for the file system; for example, use mmchdisk to suspend and then resume one of the disks of the file system.
- Problem trigger: After upgrading a pre-4.1 file system which has quota enabled, user.quota, group.quota and fileset.quota will be migrated to regular files. In rare cases, accessing them (through the VFS interface or internally by tools like mmrepairfs) may cause a log assert.
- Symptom: Abend
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20737
- Problem description: Create fileset can be called before inode manager recovery has started which hits sig11 when accessing uninitialized variable.
- Work around: Wait for inode manager recovery to be completed as part of mount before create fileset.
- Problem trigger: Create fileset before inode manager recovery has started.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: Filesets
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20740
- Problem description: If unmount interrupts inode manager recovery, it results in file system panic.
- Work around: Wait for inode manager recovery to be completed as part of mount before unmount.
- Problem trigger: Unmount while inode manager recovery is in progress.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20802
- Problem description: Attacker can inject arbitrary commands through Spectrum Scale GUI or CLI when using mmsmb commands
- Work around: None
- Problem trigger: Command injection
- Symptom: May not be any errors, or you may see Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: SMB
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20807
- Problem description: Attacker can inject arbitrary commands through Spectrum Scale GUI or CLI when using mmsmb commands
- Work around: None
- Problem trigger: Command injection
- Symptom: May not be any errors, or you may see Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: SMB
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20888
- Problem description: When updating the file size for preallocation, the new file size is calculated incorrectly, which results in an unexpected file size.
- Work around: Do not try to preallocate the same block more than once.
- Problem trigger: In an FPO cluster, the problem can be triggered if one tries to pre-allocate the same block more than once and the second request has a larger file size.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: FPO
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20889
- Problem description: The mmvdisk server list command may fail if the servers involved have separate daemon and admin interfaces.
- Work around: None
- Problem trigger: Having GNR servers with separate admin and daemon interfaces.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: ESS/GNR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20890
- Problem description: When suspending an ECE server, the server may be incorrectly identified as a quorum node, which may prevent the server from being suspended.
- Work around: Do not issue the suspend command on a quorum node.
- Problem trigger: Issuing the mmvdisk recoverygroup --suspend command on a quorum node.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: ESS/GNR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20891
- Problem description: When the SG manager fails during a snapshot command, the new one cleans up incomplete operations during one-time async recovery. This depends on resetting its snapshot state to match the stripe group descriptor that is stored on disk. However, the snapshot state of non-SG-manager nodes is slightly ahead of the stripe group manager during the final stages of a snapshot deletion. The new SG manager needs to correct this discrepancy during takeover when it rereads the descriptor from disk. Otherwise, in rare cases, this inconsistency can lead to an FSSTRUCT error during subsequent snapshot commands.
- Work around: There is no preventative measure. After problem occurs, however, restarting the new stripe group manager manually will resolve it.
- Problem trigger: Stripe group manager crash during snapshot commands.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: Snapshots
- Customer Impact: Very rare, mysterious errors during snapshot commands
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20892
- Problem description: mmfsck --estimate-only option shows unreasonable estimates for some file systems.
- Work around: None
- Problem trigger: File system with larger log file sizes.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: FSCK
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20940
- Problem description: In certain configurations, where node name does not contain full domain name suffix, mmvdisk --server will return partial node name string which is not resolvable and cause mmvdisk to print out an error
- Work around: None
- Problem trigger: mmvdisk with --server option
- Symptom: mmvdisk --server will return partial node name string which is not resolvable and cause mmvdisk to print out an error
- Platforms affected: N/A
- Functional Area affected: GNR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ20674 IJ20675 IJ20676 IJ20677 IJ20678 IJ20695 IJ20709 IJ20710 IJ20725 IJ20726 IJ20728 IJ20730 IJ20733 IJ20734 IJ20735 IJ20736 IJ20737 IJ20739 IJ20740 IJ20741 IJ20742 IJ20788 IJ20789 IJ20791 IJ20797 IJ20802 IJ20805 IJ20807 IJ20808 IJ20809 IJ20811 IJ20888 IJ20889 IJ20890 IJ20891 IJ20892 IJ20909 IJ20940 IJ20941 IJ20948 IJ20951.
Problems fixed in Spectrum Scale 5.0.4.2 for Protocols include the following:
- nfs: Fix responding with NFS version mismatch
- nfs: Fix accessing object handle after freeing its last state
- nfs: call set_current_entry only after checking state_lock
- nfs: Add LogEventLimited to trace in fsal_common_is_referral
- nfs: Add Per client and per export stats
- nfs: Hold latch in mdcache_new_entry() until mdcache_lru_insert() completes
- nfs: ganesha version V2.7.5-ibm054.03
- For RPCSEC_GSS handle messages for negotiation or with wrong creds
- install-toolkit: RHEL 8.1 support
- install-toolkit: Config populate support for ess3k environment
- install-toolkit: BDA HDFS protocol support through toolkit.
- smb: Version gpfs.smb 4.9.16_gpfs_34-1
Problems fixed in Spectrum Scale 5.0.4.1 for Protocols include the following:
- zimon: Added missing encoding of special characters to prevent breakage of the REST APIs parsing
- smb: Version gpfs.smb 4.9.13_gpfs_33-1
- smb: Close ctdbd inflight connecting TCP sockets after fork.
- smb: Avoid orphaning the TCP incoming queue
- smb: Process all records not deleted on a remote node
- nfs: ganesha version V2.7.5-ibm053.02
Problems fixed in Spectrum Scale Protocols Packages 5.0.4-0 [Oct 18, 2019]
- Please see the "What's New" page in the IBM Knowledge Center
Document Information
Modified date:
18 February 2020
UID
isg400004819
