Fix Readme
Abstract
Readme for the IBM Spectrum Scale 5.0.4.2 Data Access Edition update (Spectrum_Scale_Data_Access-5.0.4.2-x86_64-Windows) for Windows x86_64, including installation instructions and the list of problems fixed in 5.0.4.2 and 5.0.4.1.
Content
Readme file for: Spectrum Scale
Product/Component Release: 5.0.4.2
Update Name: Spectrum_Scale_Data_Access-5.0.4.2-x86_64-Windows
Fix ID: Spectrum_Scale_Data_Access-5.0.4.2-x86_64-Windows
Publication Date: 30 January 2020
Last modified date: 30 January 2020
Installation information
Download location
Below is a list of components, platforms, and file names that apply to this Readme file.
| Product/Component Name: | Platform: | Fix: |
|---|---|---|
| IBM Spectrum Scale | Windows 2008, Windows 2012 | Spectrum_Scale_Data_Access-5.0.4.2-x86_64-Windows |
Prerequisites and co-requisites
Installation information
- Downloading Images: Choose the download option "Download using Download Director" to download the new Spectrum Scale package and place it in any desired location on the install node.
Note: If you must (not recommended) use the download option "Download using your browser (HTTPS)", do not click the down arrow to the left of the package name; instead, right-click the package name and select the "Save Link As..." option. If you just click the download arrow, the browser will likely hang.
- Installing the IBM Spectrum Scale update for Microsoft Windows Server 2008+
Note: If your system has not had a full version of IBM Spectrum Scale installed, you must install the full version prior to performing these steps. When IBM Spectrum Scale is removed from a system, licensing and other information remains, allowing the update package to install correctly.
Required packages (Windows):
- gpfs.ext-5.0.4-Windows-license.msi
- gpfs.ext-5.0.4.2-Windows.msi
- gpfs.gskit-8.0.50.86.msi
- Extract the contents of the ZIP archive so that the .msi files it includes are directly accessible to your system.
- Uninstall the system's current version of GPFS using the Programs and Features control panel. If prompted to reboot the system, do this before installing the update package.
- Follow the installation instructions in the IBM Spectrum Scale Installing and upgrading documentation.
If you are upgrading directly from version 3.4 or prior (any level), or installing version 5.0.4.2 for the first time on your system, you must first install the GPFS license package (gpfs.ext-5.0.4-Windows-license.msi) before installing this update.
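The following is a minimal sketch of installing the packages from an elevated command prompt in the directory where the ZIP archive was extracted. It assumes the standard Windows Installer invocation (msiexec /i); the relative order of the GSKit and update packages shown here is illustrative only, so follow the IBM Spectrum Scale Installing and upgrading documentation for the exact sequence.
rem Install the license package only if upgrading from version 3.4 or prior,
rem or if installing 5.0.4.2 for the first time on this system.
msiexec /i gpfs.ext-5.0.4-Windows-license.msi
rem Install the GSKit package and the 5.0.4.2 update package.
msiexec /i gpfs.gskit-8.0.50.86.msi
msiexec /i gpfs.ext-5.0.4.2-Windows.msi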
Additional information
- Package information: The update image listed below and contained in the ZIP archive is a maintenance package for IBM Spectrum Scale. The update image can be applied directly to your system.
The images can be used for a new install or an update from a prior level of IBM Spectrum Scale.
After the Windows Installer packages (.msi) are installed, you have successfully updated your IBM Spectrum Scale product.
Before installing IBM Spectrum Scale on Windows nodes, verify that all the installation prerequisites have been met. For more information, see the IBM Spectrum Scale Concepts, Planning and Installation Guide in IBM® Knowledge Center.
Update to Version:
5.0.4.2
Update from Version:
4.2.0.0 - 5.0.4.1 (if upgrading node by node)
3.5.0 - 5.0.4.1 (if you shut down and upgrade the entire cluster)
Update (zip file) contents:
- gpfs.ext-5.0.4-Windows-license.msi
- gpfs.ext-5.0.4.2-Windows.msi
- gpfs.gskit-8.0.50.86.msi
- Summary of changes for IBM Spectrum Scale
Unless specifically noted otherwise, this history of problems fixed for IBM Spectrum Scale 5.0.x applies to all supported platforms.
Problems fixed in IBM Spectrum Scale 5.0.4.2 [January 30, 2020]
- Item: IJ21257
- Problem description: GPFS daemon assert: err == E_OK dirop.C. This could happen after GPFS runs out of file cache entries and is forced to move a directory from file cache to stat cache.
- Work around: Increasing maxFilesToCache will reduce the chance of hitting this assert (see the example below).
- Problem trigger: Directory is being moved from file cache to stat cache.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical
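A hedged sketch of the workaround above; 1000000 is only an illustrative value, "affectedNodes" is a placeholder node list, and maxFilesToCache generally takes effect only after the GPFS daemon is restarted.
# Raise the file cache limit cluster-wide (example value only)
mmchconfig maxFilesToCache=1000000
# Restart GPFS on the affected nodes for the new value to take effect
mmshutdown -N affectedNodes
mmstartup -N affectedNodes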
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21258
- Problem description: Running mmsdrrestore against a quorum node in a CCR-enabled cluster will crash the GPFS daemon.
- Work around: Shut down GPFS before performing mmsdrrestore (see the example below)
- Problem trigger: Running mmsdrrestore against a quorum node in a CCR-enabled cluster
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: CCR, Admin Commands
- Customer Impact: Critical
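A minimal sketch of the workaround above, run on the quorum node being restored; "nodeWithValidConfig" is a placeholder for a node that holds a valid configuration file.
# Shut down GPFS on this node before restoring its configuration
mmshutdown
# Restore the node's configuration from a node with a valid copy
mmsdrrestore -p nodeWithValidConfig
# Start GPFS again
mmstartup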
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21260
- Problem description: GPFS daemon assert: !(ccP->hasJoined() && ccP->isXClust(destNode())). This could happen after moving a node from one remote cluster to another while both clusters have remote mounted a file system from a home cluster.
- Work around: Disable ialloc function ship via "mmchconfig iallocFuncshipEnabled=false -i"
- Problem trigger: Moving a node from one remote cluster to another.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21396
- Problem description: On RHEL 7 nodes (pre-Linux kernel v3.18), in the GPFS kernel NFS support environment, GPFS may try to acquire some mutex, while holding an inode spin lock, which may be detected as a soft lockup issue by the kernel NMI watchdog.
- Work around: None
- Problem trigger: GPFS breaks a spin lock holding policy in NFS support environment
- Symptom: Performance Impact/CPU stuck
- Platforms affected: All RHEL 7.x
- Functional Area affected: Users of KNFS/CNFS only
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21304
- Problem description: If encryption is not configured properly, starting down disks could result in mismatched replicas.
- Work around: None
- Problem trigger: During the "start disk", repairing mismatched replicas failed on certain files because encryption context was not available, and the error E_ENC_CTX_NOT_READY was treated as a SEVERE error which means that the code continues to repair the replicas to the degree possible. In the final phase of repair, the missupdate flag was incorrectly cleared from the inode even though we did not synchronize the replicas, as the repair failed due to unavailable encryption context. As the missupdate flag was cleared from the inode, a subsequent "start disk" brought up all down disks, but the file still had mismatched replicas. A later "mmrestripefs -c" may then pick up the wrong replica and overwrite the good replicas.
- Symptom: Encrypted replicas mismatch after start disk.
- Platforms affected: All
- Functional Area affected: Core
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21261
- Problem description: On one side of an AFM relationship, an AFM fileset is being deleted and on the other side there's a getstate to show AFM fileset states. The getstate command picks the fileset being deleted to print its stats, and causes the Assert.
- Work around: Do not run "mmafmctl
getstate/mmdiag" commands when AFM filesets are being Deleted. - Problem trigger: n one side an AFM fileset is being deleted (which could take time depending on number of inodes in the fileset and amount of data). While this is happening, another node in the cluster queries AFM stats on the AFM filesets (mmafmctl
getstate (or) an mmdiag running). - Symptom: Abend/Crash
- Platforms affected: ALL Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: High
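For reference, the state query mentioned in the workaround is typically issued as shown below; "fs0" and "fileset1" are placeholder names. Avoid issuing it while an AFM fileset delete is in progress.
# Query AFM state for a single fileset (avoid while a fileset delete is running)
mmafmctl fs0 getstate -j fileset1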
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21263
- Problem description: Starting from 5.0, a few special afmIOFlags were introduced to make AFM behave in special ways (for migration and replication). The flags started getting out of control, and needed a human readable format to understand what flags are set.
- Work around: None
- Problem trigger: "mmlsfileset
-L --afm" does not print human readable IO Flags. - Symptom: Error output/message.
- Platforms affected: ALL Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21422
- Problem description: An EACCESS error is returned to the NFS client from the Ganesha server, and it can cause IO failure for metadata access (ls command) on a file/directory or can fail the rm operation on the directory.
- Work around: None
- Problem trigger: It is difficult to recreate, but a possible reason could be a file/directory move/deletion from the parent directory which leaves a disconnected dentry in the Linux kernel.
- Symptom: IO failure
- Platforms affected: Linux Only
- Functional Area affected: NFS Ganesha
- Customer Impact:
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21394
- Problem description: Correct description of the resumeRequeued command to indicate that the filesetName is a required argument.
- Work around: None
- Problem trigger: Running the mmafmctl command as recommended in the man page.
- Symptom: mmafmctl shows wrong help - not mandating the filesetName for the mmafmctl resumeRequeued subcommand.
- Platforms affected: ALL Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21432
- Problem description: A linux mknod operation for a FIFO object can encounter this assert if the object is opened before the operation completely finishes.
- Work around: The assert can be disabled with the assistance of service via "mmchconfig disableAssert"
- Problem trigger: A linux mknod operation to create a FIFO object while another process attempts to open the same object (not actually waiting for the create to complete).
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21424
- Problem description: "mmfsadm (safer)dump afm fset" - which displays the AFM handler of an AFM fileset - reports incorrect negative values for the numAsyncLookups column. The same output is also collected as part of the internal dumps gathered for gpfs.snap.
- Work around: None
- Problem trigger: The "mmfsadm (safer)dump afm fset" command that displays the handler for an AFM fileset is issued; the same output is also collected as part of the internal dumps gathered for gpfs.snap.
- Symptom: "mmfsadm (safer)dump afm fset" reports incorrect negative values for the numAsyncLookups column.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21541
- Problem description: AFM deletes the orphan file when the home is not reachable during the lookup. The orphan file is created during the readdir and is repaired during the lookup. It is possible that multiple threads delete the same orphan file and run into an FSStruct error because the same inode is attempted for deallocation multiple times.
- Work around: None
- Problem trigger: Doing readdir and lookup on the AFM cache fileset when the home is disconnected after the readdir.
- Symptom: Error output/message
- Platforms affected: ALL
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21550
- Problem description: Deadlock could happen if quorum loss occurs on a newly appointed stripe group manager. Threads could be stuck in 'waiting for stripe group takeover' and 'waiting for SG cleanup'.
- Work around: None
- Problem trigger: Quorum loss just as a node starts taking over the file system manager role
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21554
- Problem description: Enable the usage of a list of groups for the --ces-group option of the mmces command
- Work around: Repeat the command using one CES group per invocation
- Symptom: Without the fix the user cannot choose a combination of groups when filtering the command output for ces groups.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21557
- Problem description: Make timeout of commMsgCheckMessages RPC consistent on all nodes and issue a warning message if it took more than one third of the timeout to get the reply of commMsgCheckMessages RPC.
- Work around: None
- Problem trigger: A degraded network condition which leads to sending the commMsgCheckMessages RPC
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21645
- Problem description: On Linux nodes with kernel version 4.7 or later, when copying a source file with the command "cp -p", the ACL data is lost in the destination file if the source file contains many ACL entries (for example, 20 or more).
- Work around: None
- Problem trigger: Defect in porting of GPFS to Linux kernel version 4.7.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux nodes with kernel version 4.7 or later
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21647
- Problem description: The systemhealth monitor fails to start.
- Work around: None
- Problem trigger: The problem depends on the python packages provided by the various Linux distributions. It seems that not all distros provide the required packages. During development and internal test, RHEL 7.6 was used without issues.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21648
- Problem description: mmces address add is failing when both object attributes are assigned to one CES IP address
- Work around: "cat /var/mmfs/gen/cesAddressPoolfile" will show the requested information.
- Problem trigger:
- Symptom: Customer gets incorrect information using the mmces list command.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21434
- Problem description: GPFS user space daemon crashed during read/write through NFS or mmapplypolicy.
- Work around: None
- Problem trigger: In openNFS, the first lockFile put a hold on the cachObj; the next lockFile in the openNFS skipped the lookup of the file from the hash table, which means the cachObjMutex was not acquired. As a result, the releaseCacheObjMutex at the end of lockFile wrongly cleared the lockWordCopy in the mutex; unfortunately, this mutex had been acquired by a daemon thread before the lockFile called releaseCacheObjMutex. So the daemon thread continued its work and hit the assert when it called ASSERT_MUTEX_HELD to check that it did acquire the mutex. Because the lockWordCopy in the mutex was wrongly cleared by the kernel lockFile, the assert went off in the daemon thread.
- Symptom: Daemon crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21654
- Problem description: AFM dependent filesets do not have the .afm/.ptrash/.pconflicts/.afmtrash directories, which are used for storing conflicting files. The .afmtrash directory is used to move a non-empty directory during directory deletion.
- Work around: None
- Problem trigger: Replication to dependent filesets
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Operating System environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21659
- Problem description: Revalidation on the fileset root path might not happen correctly if the gateway is running some operating systems like RHEL 7.7. This causes the new data from the target path not to be fetched from the home.
- Work around: None
- Problem trigger: Revalidation on the fileset root path in the AFM caching modes.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Certain Linux OS environments, like RHEL 7.7
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21660
- Problem description: An upgrade to a major release of the PostgreSQL server will trigger a new health event informing the user that the database will be reinitialized.
- Work around: Manually drop the database and allow the GUI to create it.
- Problem trigger: Upgrade PostgreSQL to a new major release
- Symptom: Component Level Outage
- Platforms affected: ALL Linux OS environments
- Functional Area affected: REST APIs, GUI
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21964
- Problem description: Harden mmces command against injection vulnerability
- Work around: None
- Problem trigger:
- Symptom: For some mmces commands it is possible to inject a shell command to execute by appending "|" and a command to the parameter list. This injection is possible on the command line and from the GUI.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: Critical: security issue
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21974
- Problem description: AFM gateway daemon asserts when the request arrives before the filesystem is mounted.
- Work around: Remove the gateway designation from the node, start GPFS, mount the file system, and make the node a gateway again using the "mmchnode --gateway -N" command (see the example below).
- Problem trigger: Start the gateway node while IO is running on the AFM fileset.
- Symptom: Crash
- Platforms affected: Linux
- Functional Area affected: AFM
- Customer Impact: Critical
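A hedged sketch of the workaround above; "gwNode" and "fs0" are placeholder names for the gateway node and the file system.
# Temporarily remove the gateway role from the node
mmchnode --nogateway -N gwNode
# Start GPFS and mount the file system on that node
mmstartup -N gwNode
mmmount fs0 -N gwNode
# Restore the gateway role once the file system is mounted
mmchnode --gateway -N gwNode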
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21975
- Problem description: AFM gets the sparse-file information from the home before reading the file, and the actual data size is used to set the cached bit. It is possible that the data blocks allocated at the cache exceed the actual data size if the file is sparse in between, and the cached bit is then set without fully reading the file.
- Work around: Disable sparse file detection by setting afmReadSparseThreshold=disable (see the example below)
- Problem trigger: AFM read on the sparse files with afmReadSparseThreshold set (default on)
- Symptom: Unexpected result
- Platforms affected: Linux
- Functional Area affected: AFM
- Customer Impact: HiPER
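A minimal sketch of the workaround above, assuming the value is applied cluster-wide with mmchconfig.
# Disable AFM sparse file detection (workaround for IJ21975)
mmchconfig afmReadSparseThreshold=disable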
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21977
- Problem description: AFM gateway daemon asserts if the remote mount initialization fails during the first access to the fileset
- Work around: None
- Problem trigger: Remote mount failure
- Symptom: Crash
- Platforms affected: Linux
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21978
- Problem description: When deleting a snapshot, the process may fail to move the data blocks of files in the snapshot being deleted that have small inode numbers. The affected inodes are in the same inode block as the fileset metadata file, but not in the first inode block of the inode 0 file.
- Work around: None
- Problem trigger: Deleting a snapshot which contains a file with small inode number
- Symptom: Data corruption
- Platforms affected: All
- Functional Area affected: Snapshot
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22022
- Problem description: When handling a page fault, GPFS did not detach the I/O buffer segment. This later caused a kernel crash.
- Work around: None
- Problem trigger: Multiple threads doing both normal I/O and mmap I/O on the same file at the same time.
- Symptom: Kernel crash
- Platforms affected: AIX
- Functional Area affected: Mmap I/O
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22004
- Problem description: AFM gateway daemon crashes during resync operations due to the race between the thread which is monitoring the stuck messages and threads replicating the data.
- Work around: Increase the afmAsyncOpWaitTimeout value (see the example below)
- Problem trigger: AFM resync
- Symptom: Crash
- Platforms affected: All Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Critical
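A hedged sketch of the workaround above, assuming the value is set cluster-wide with mmchconfig; 600 is only an illustrative value in seconds.
# Increase the AFM async operation wait timeout (example value only)
mmchconfig afmAsyncOpWaitTimeout=600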
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22005
- Problem description: Customer data showed that GPFS asserted when trying to open a disk while processing mmadddisk/mmrpldisk because the disk was not assigned a valid storage pool. The root of the problem (why the disk was associated with an invalid storage pool during mmadddisk/mmrpldisk) was not discovered due to lack of data. The logic is: by the time GPFS tries to open a disk due to a stripe group descriptor update from mmadddisk/mmrpldisk, the disk should be assigned to a valid storage pool. It was decided to safeguard GPFS so that it does not open a disk when the disk is assigned to an invalid storage pool.
- Work around: None
- Problem trigger: This problem has not surfaced internally and there is not enough data from customer to find out why this could happen. From examining the code, GPFS should have assigned a valid storage pool during mmadddisk/mmrpldisk even though the disk was created without specifying the storage pool.
- Symptom: Abend/Crash
- Platforms affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22007
- Problem description: Online replica compare function (mmrestripefs -c) could give incorrect replica mismatch error on directories. This could happen if subblock size for metadata is greater than 256K.
- Work around: None
- Problem trigger: Run mmrestripefs -c on a file system with a metadata subblock size greater than 256K (see the example below).
- Symptom: Error output/message
- Functional Area affected: Admin commands
- Customer Impact: Suggested
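For reference, the online replica compare described above is invoked as shown below; "fs0" is a placeholder file system name.
# Compare data and metadata replicas online (may report false mismatches without this fix)
mmrestripefs fs0 -c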
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21097
- Problem description: The ports of the 2nd (and later) IB adapters on the node which starts the verbs connection might be mis-recognized as RDMA CM disabled ports, and fail to be connected. The nodes that start the verbs connection are NSD clients if verbsRdmaSend=no, but they also could be other nodes if verbsRdmaSend=yes. You will see the "ibv_modify_qp init err 22" error message in the mmfs.log file if this happens.
- Work around: None. But if RDMA-CM is not really needed in your environment, you can just disable it.
- Problem trigger: Users having
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22034
- Problem description: File creation could fail unexpectedly with an EFBIG error. This could happen when multiple nodes access the same directory while one node repeatedly creates and deletes the same file in the directory.
- Work around: Perform a rename on a file in the directory after encountering the EFBIG error.
- Problem trigger: Repeatedly create and delete the same file in a directory.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22009
- Problem description: The GPFS command mmchattr stores the extended attribute name-value pair in the inode itself, even for the ACL xattr, which should be stored in the GPFS internal ACL file. This behavior of ACL xattr handling may confuse users.
- Work around: None.
- Problem trigger: None
- Symptom: Confusing output
- Platforms affected: Linux
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22036
- Problem description: On a file system with unavailable metadata disks, a log recovery failure prevents the file system from being mounted or disks from being started. Either mmfsck -xk should allow repair of the logs in this case, or tsdbfs -f should allow the user to patch the disk states. Fixed the code to bypass the disk availability check when fsck is invoked in read-only mode. This allows both mmfsck -xk and tsdbfs -f to run in such situations.
- Work around: Use a node at version less than 5.0.2 to either run mmfsck -xk or tsdbfs -f to patch disk states. This only works if the file system version is less than 5.0.2.
- Problem trigger: File system disks are down and log recovery has failed.
- Symptom: Error output/message
- Platforms affected: ALL Operating System environments
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22010
- Problem description: Log recovery error after node failure can cause recovery buffer to be overwritten which will most likely lead to GPFS daemon assert.
- Work around: None.
- Problem trigger: Node failure
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All
- Customer Impact: HiPER
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22013
- Problem description: In the AFM Stopped and Queue Dropped states, when a file/directory is removed at the cache site, the inode is still seen as USERFILE and is not reclaimed.
- Work around: None.
- Problem trigger: Running applications/workload when AFM fileset is in Stopped state.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All Linux and AIX operating systems.
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22017
- Problem description: The TSM client version can contain 2 or more digits in any position of V.R.M.F, but mmbackup cannot handle such a case. As a result, mmbackup fails while parsing the TSM client version.
- Work around: None.
- Problem trigger: Executing mmbackup with TSM client 8.1.10.
- Symptom: Component Level Outage
- Platforms affected: ALL Operating System environments
- Functional Area affected: mmbackup
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22024
- Problem description: mmprotocoltrace timers of manually stopped traces would unexpectedly stop newly initiated mmprotocoltrace traces.
- Work around: 1. Either wait for the duration of the previous protocol trace (default: 10 min) before starting a new trace for the same component 2. or kill all mmprotocoltrace processes on all CES nodes, which participate in the trace (by default: all CES nodes)
- Problem trigger: Starting the second protocol trace via mmprotocoltrace for the same component after the first trace was manually stopped and the timeout of the first trace was not yet reached.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Trace CES
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22041
- Problem description: When running "mmdiag --waiters" or "mmfsadm dump waiters", or the periodical health check performs long waiters detection, the code could run into memory overflow for a local buffer, then triggers the signal 6 to mmfsd daemon and causes it restarted abnormally.
- Work around: None.
- Problem trigger: mmdiag --waiters or mmfsadm dump waiters, or the periodical health check inside mmfsd daemon.
- Symptom: Daemon crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: Long waiters detection and dump
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21955
- Problem description: When using Microsoft Office applications such as Word and Excel on Windows 10 version 1709 or newer, any attempt to modify and save an existing file (.docx, .xlsx, etc.) will fail with a sharing violation error.
- Work around: None.
- Problem trigger: This issue is triggered when installing or upgrading to Windows 10 version 1709 or newer. It is also hit in Windows Server version 1809 or newer.
- Symptom: Sharing violation errors when attempting to modify and save existing *.docx, *.xlsx (and other Office) files using Microsoft Office applications such as Word and Excel. Saving as a different name works.
- Platforms affected: Windows/x86_64 only. Specifically, Windows 10 (version 1709 or newer) and Windows Server (version 1809 or newer) only.
- Functional Area affected: Windows.
- Customer Impact: High Importance.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22158
- Problem description: The ccMsgGroupJoinPhaseN message is sent to all the nodes which are up during the join protocol; in this case the message is sent to the down gateway node, causing the deadlock.
- Work around: None.
- Problem trigger: Remote node joining the cluster with a down gateway node.
- Symptom: Deadlock
- Platforms affected: ALL Operating System environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22207
- Problem description: Kernel assert going off: bufOffset+len = iobP->ioBufLen in file cxiIOBuffer.c, resulting in a kernel panic.
- Work around: None.
- Problem trigger: Calling Spectrum Scale APIs to scan inodes in the file system. Note that some binaries delivered with the Spectrum Scale package also call such Spectrum Scale APIs, for example policy rules that scan files in the file system, the snapshot restore utility, and the SOBAR backup utility.
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: applications using GPFS APIs, including policy, snapshot restore and sobar backup.
- Customer Impact: High Importance.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22261
- Problem description: Due to the way mmfsck internally traverses reserved files and snapshots, it is not able to report and fix duplicate addresses present among inode 0 files of the active file system and its snapshots. As a result, even though mmfsck -y runs successfully and reports the file system as clean, the duplicate address corruptions are not fixed, and the next mmfsck run will report new corruptions such as mismatched replicas present in inode 0. There can also be fsstructs reported in the logs after mmfsck -y because of this.
- Work around: Delete all the snapshots in the file system and then run mmfsck repair
- Problem trigger: ??
- Symptom: Operation failure due to FS corruption. Also, on a file system having snapshots, the fsck output shows the following signs after a successful mmfsck -y run: 1) mismatched replicas in inode 0 ("Error in inode 0 snap 0: Inode block 289710225 has mismatched replicas"); 2) even though no duplicates are reported, fsck shows "Checking for the first reference to a duplicate fragment."; 3) even though no duplicates are reported, a non-zero duplicates count appears at the end of the fsck output ("896 duplicates").
- Platforms affected: ALL Operating System environments
- Functional Area affected: FSCK is not able to repair the corruption
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ21097 IJ21257 IJ21258 IJ21260 IJ21261 IJ21263 IJ21304 IJ21394 IJ21396 IJ21422 IJ21424 IJ21432 IJ21434 IJ21541 IJ21550 IJ21554 IJ21557 IJ21645 IJ21647 IJ21648 IJ21654 IJ21955 IJ21659 IJ21660 IJ21964 IJ21974 IJ21975 IJ21977 IJ21978 IJ22004 IJ22005 IJ22007 IJ22009 IJ22010 IJ22013 IJ22017 IJ22022 IJ22024 IJ22034 IJ22036 IJ22041 IJ22158 IJ22207 IJ22261
Problems fixed in Spectrum Scale 5.0.4.1 [November 21, 2019]
- Item: IJ20948
- Problem description: On an AFM cache cluster using the AFM independent-writer mode, data may be incompletely read if a file is modified before it is fully cached. Normally AFM reads a file from the AFM home cluster before allowing write operations to occur. However, if a file is not opened in append mode but a write is made at the end of the file, the data for the file may not be completely cached.
- Work around: Run prefetch on the partially cached files (see the example below).
- Problem trigger: AFM caching modes and updating at the end of the file before fully caching it.
- Symptom: Unexpected results
- Platforms affected: All
- Functional Area affected: AFM caching.
- Customer Impact: HiPER
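A hedged sketch of the workaround above; "fs0", "fileset1", and the list file path are placeholders, and the list file is assumed to contain the paths of the partially cached files.
# Prefetch the partially cached files named in the list file
mmafmctl fs0 prefetch -j fileset1 --list-file /tmp/partially-cached.list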
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20909
- Problem description: When mmfsck scans and finds corrupted reserved file blocks it prints the list of blocks corrupted and due to a code bug in that path, the file system manager node asserts with Signal 11.
- Work around: Do not run mmfsck
- Problem trigger: This will happen when mmfsck is run on a file system having corrupted reserved file blocks
- Symptom: File system manager node assert
- Platforms affected: All
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20710
- Problem description: FSSTRUCT error FSErrCheckHeaderFailed could be issued while accessing some directory. This could happen on a file system with metadata replication where a metadata disk is in the down state and a node failure occurs.
- Work around: None
- Problem trigger: Metadata disk in down state and node failure.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20678
- Problem description: On a node with multiple file systems mounted, the DiskLeaseThread could be blocked by a file system unmount, causing a delay in the renewal of the disk lease and potential quorum loss.
- Work around: None
- Problem trigger: File system unmount
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional Area affected: Cluster Membership
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20726
- Problem description: After a system crash the configuration file /etc/sysconfig/ganesha contained only an entry for NOFILE, but no longer the entries for OPTIONS and EPOCH_EXEC. No Ganesha logs were created.
- Work around: Since there is no backup file of /etc/sysconfig/ganesha by default, the file must be extracted either from RPM or fetched from another CES node.
- Problem trigger: The /etc/sysconfig/ganesha file was modified in place whenever NFS was started. The procedure used the "sed -i" command for this. The goal was to always have the latest NOFILE entry in the file, along with those for OPTIONS (startup options for Ganesha) and EPOCH_EXEC. Some investigation indicates that during a system crash not all changes in the file were written to disk. So once this file is damaged or truncated, the only entry left is the added NOFILE data. Previously existing OPTIONS and EPOCH_EXEC entries cannot be recovered since there is no mechanism to do so. After the code change the NOFILE data is updated on a copy of the original configuration file. If the changes are all done, then this copy is restored back to the original file.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20811
- Problem description: The application allows a regular user to inject OS commands in the "NFS Exports" Client field. The injected command is executed on the underlying operating system as "root" user.
- Work around: None
- Problem trigger: Using the GUI to add NFS exports allows this condition.
- Symptom: Behavior - Security risk
- Platforms affected: All
- Functional Area affected: NFS
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20941
- Problem description: Removing message headers that are not utilized from librdkafka message to reduce message size sent to external sink.
- Work around: None
- Problem trigger: When running Clustered watch with a heavy workload producing many events, if the external kafka cluster gets overloaded, clustered watch may hit a timeout and auto disable. With this fix, the librdkafka message size reduction makes it less likely to hit this timeout.
- Symptom: The 45-second timeout on clustered watch is hit, causing conduit(s) to go down. The following error message appears in /var/adm/ras/mmwfclient.log: 2019-08-26_00:51:49: [E] WF Producer: t: newtopic a: 3
- Platforms affected:
- Functional Area affected:
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20797
- Problem description: AFM Secondary mode filesets are passive in nature (and RO), since the Primary is the only one allowed to perform write-class operations on the secondary mode fileset. This bug allows creates to be performed directly on the Secondary mode fileset even when afmSecondaryRW is set to no. However, other write-class operations like set times, chmod, etc. are not allowed on the file.
- Work around: None
- Problem trigger: User tries to perform IO Operations on an AFM Secondary mode fileset when afmSecondaryRW is set to no.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20733
- Problem description: Node crashes with assert when the AFM fileset with active IO is unlinked.
- Work around: Stop the AFM fileset and then unlink the fileset (see the example below).
- Problem trigger: Fileset unlink with active IO.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: AFM
- Customer Impact: High
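A minimal sketch of the workaround above; "fs0" and "fileset1" are placeholder names, and it assumes the mmafmctl stop action available in this release.
# Stop AFM replication on the fileset before unlinking it
mmafmctl fs0 stop -j fileset1
# Unlink the fileset once AFM is stopped
mmunlinkfileset fs0 fileset1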
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20677
- Problem description: Daemon crashes due to invalid config setting where enableStatUIDremap is enabled without enabling the enableUIDremap config option.
- Work around: Enable both enableUIDremap and enableStatUIDremap options.
- Problem trigger: UID remapping with invalid config options.
- Symptom: Crash
- Platforms affected: All
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20730
- Problem description: When running "mmces node suspend/resume -N" with a list of nodes it might happen that not all of them are in the expected state afterwards.
- Work around: Repeat the "mmces node suspend/resume -N" command with the list of nodes which were not set to the expected state previously (see the example below).
- Problem trigger: The cesiplist file has a unique serial number assigned when it is stored in CCR. Each node reads the cesiplist file (and its serial number) from CCR as a local copy and modifies the suspend flag in that local copy. After this all nodes which did this kind of local update try now to update their modified copy of the cesiplist file in CCR with an incremented (+1) serial number. That may fail when other nodes did this update already with the same serial number earlier. There is some randomness, since not all nodes try this update at the very same time. There could be a timespan of several seconds between the first and the last one, so that some nodes get updated cesiplist files and serial numbers, and work on those.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: Suggested
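A hedged sketch of the workaround above; "node1" and "node2" are placeholders for the nodes that did not reach the expected state.
# Re-issue the suspend (or resume) for the nodes that were missed
mmces node suspend -N node1,node2
# Verify the resulting CES node state
mmces node list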
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20741
- Problem description: Fix quota share revoke/reclaim delay when the quota usage is approaching the limits.
- Work around: None
- Problem trigger: When quota usage is approaching the limits (hard limit), the attempts to reclaim the remaining quota shares from other quota clients can lead to very slow quota management operations.
- Symptom: Processes waiting for available quota, when the quota usage is approaching the limits, leading to apparent system hung.
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20808
- Problem description: On AIX, when trying to clear/write the primary GPT area, mmcrnsd does non-4k aligned writes to 4K disks while trying to preserve the OS PVID, causing a failure.
- Work around: None
- Problem trigger: Create an nsd out of 4kb sector size native disk(s) on AIX
- Symptom: Error output/message
- Platforms affected: AIX
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20805
- Problem description: The GPFS daemon (mmfsd) consumes high CPU load on a quorum node when Windows 2016 is used as the operating system. This is caused by a CCR thread listening to incoming CCR requests on cached connections from other quorum nodes by using the poll system call. This logic doesn't consider particular flags returned by the poll system call (in detail: POLLHUP, POLLERR, POLLNVAL). A second GPFS daemon (mmsdrserv) might be affected by this issue. This daemon is running when GPFS has been shutdown by the mmshutdown command. This issue doesn't occur on Linux or AIX.
- Work around: Assign other nodes as quorum nodes which don't use Windows 2016 as the underlying operating system, if possible, e.g. nodes in the cluster running on Linux or AIX.
- Problem trigger: GPFS startup (mmsdrserv starts automatic, mmfsd after 'mmstartup -a')
- Symptom: -Performance Impact/Degradation -Unresponsiveness
- Platforms affected: Windows 2016 (at least, earlier/later Windows version might be affected too)
- Functional Area affected: CCR admin commands
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20736
- Problem description: Readdir fails for the pcache fileset root, due to the cached bit being set for the first created pcache fileset (even if it is not linked) on a file system that has no pre-existing pcache fileset.
- Work around: None
- Problem trigger: Accessing the file structure for the first time from the first pcache fileset on a file system that has no pre-existing pcache fileset.
- Symptom: File/dir tree mismatches.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20742
- Problem description: In a multicluster environment, a remote cluster client node is creating a file in a directory inode which has its metanode in a different remote client cluster. Live lock can happen in this case, if the directory is empty or small, due to a performance optimization.
- Work around: Use the directory only from one remote cluster.
- Problem trigger: Creating files in an empty or small directory from two remote clusters
- Symptom: Hang
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20728
- Problem description: Policy scans cannot be executed successfully through the Jobs framework
- Work around: Manually change the command templates JSON file: run "mmccr fget _jobCommandTemplates.json /tmp/jct.json", edit /tmp/jct.json, run "mmccr fput _jobCommandTemplates.json /tmp/jct.json", and then run "/usr/lpp/mmfs/gui/bin/runtask GPFS_JOBS". The two changes that need to be made are: change localWorkDir to localWorkDirectory in the command template, and change fileListPathname to fileListPathName in the argument definition.
- Problem trigger: The policy-scan template is used in a job
- Symptom: Command execution failure
- Platforms affected: All
- Functional Area affected: Jobs
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20725
- Problem description: When writing to a memory-mapped file that was compressed, the write fails with SIGBUS when the mmapRangeLock config variable is disabled.
- Work around: Do not disable the mmapRangeLock config variable
- Problem trigger: Writing to memory-mapped files that were compressed while the mmapRangeLock config variable is disabled.
- Symptom: Application fails with SIGBUS
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: It is critical if customer disabled mmapRangeLock config variable.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20709
- Problem description: Updating ESS drive firmware on a live system can be blocked for long periods of time (and may timeout) due to a declustered array that shows up in "rebalance" state.
- Work around: None
- Problem trigger: This problem is seen when updating drive firmware.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: ESS/GNR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20695
- Problem description: A TCT enabled system can see gpfs waiters of type "LweAccessRightThread waiting for XW lock"
- Work around: None
- Problem trigger: If a dmapi right is acquired on a file, and the file gets deleted, then releasing the right would cause a waiter to appear
- Symptom: appearance of gpfs waiters of type "LweAccessRightThread waiting for XW lock"
- Platforms affected: ALL Linux OS environments
- Functional Area affected: TCT
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20675
- Problem description: During the Copy On Write process in which a data block is copied to a snapshot, if the metanode fails, there is a chance for the assert to happen, due to the flush flag not being held.
- Work around: None
- Problem trigger: With debugDataControl set to heavy on AIX when automatic debug data collection on unexpected long waiter happens.
- Symptom: Performance Impact/Degradation
- Platforms affected: All non-Linux platforms.
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20951
- Problem description: There are some access problems to disks, causing the log recovery failure, eventually causing the file system to be panicked on all nodes. Since the incoming remote mounts prevented the offline fsck from running, users then moved the file system into maintenance mode and wanted to try offline fsck again. However, the log recovery was not skipped even when the file system was in maintenance mode, so resulted in the same result for the offline fsck running.
- Work around: None
- Problem trigger: The file system logs for some nodes are not clean before moving the file system into maintenance mode.
- Symptom: Log recovery is attempted and fails.
- Platforms affected: All
- Functional Area affected: File System Maintenance Mode
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20789
- Problem description: When deleting a global snapshot, if the snapshot refers to a deleted fileset then the assert will be triggered.
- Work around: None
- Problem trigger: This problem only happens when deleting a global snapshot, while a fileset included in it has been deleted.
- Symptom: Daemon abend
- Platforms affected: All
- Functional Area affected: Global snapshot deletion
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20809
- Problem description: Daemon crashed with assert ofP->metadata.notAFragment(subblocks). It may occur when appending data to a file after a previous write failed due to an invalid data buffer in the application.
- Work around: Make sure the user data buffer is valid before writing data into the Scale file system
- Problem trigger: An invalid user data buffer caused GPFS to fail when writing data to a file while leaving the invalid data in the buffer. A flush of the buffer incorrectly set the file's fragment to a full block which resulted in a failure to expand the last block of the file, triggering the assert.
- Symptom: Scale daemon crashed with assert ofP->metadata.notAFragment(subblocks) in bufdesc.C
- Platforms affected: All
- Functional Area affected: Core
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20739
- Problem description: The RPO thread, which takes care of creating RPO snapshots for AFM DR filesets, takes locks on all filesets in the file system before it can see which filesets require RPO snapshots to be taken. This includes any non-AFM independent/dependent filesets as well.
- Work around: None
- Problem trigger: Having multiple AFM DR Primary filesets with RPO intervals enabled.
- Symptom: Performance Impact/Degradation; Hang/Deadlock/Unresponsiveness/Long Waiters (lesser probability)
- Platforms affected: All Linux
- Functional Area affected: AFM Snapshots
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20791
- Problem description: After migrating a file from GPFS to external storage any indirect blocks used by the file are not freed.
- Work around: None
- Problem trigger: Migration of large files, requiring indirect blocks, to external storage.
- Symptom: Metadata disk space is not freed after files are migrated to external storage.
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20674
- Problem description: Due to a bug, fsck continues to process a deleted inode and marks it as an orphan which causes this assert.
- Work around: Patch the problematic inode using tsdbfs so that the inode is no longer corrupt and retry fsck.
- Problem trigger: A deleted inode is corrupt.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20676
- Problem description: During offline fsck multi-pass directory scan, if patch queue feature is disabled and --skip-inode-check option is used, then fsck tries to access an out of range entry in dotdotArray and hits this assert.
- Work around: None
- Problem trigger: Multi-pass offline fsck --skip-inode-check with patch queue feature disabled.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20734
- Problem description: Every 15 seconds the CES monitor daemon runs a helper script to analyze the state of CES. The change hardened the monitor to not die but to collect information about the malfunction. If the malfunction repeats this problem is reported to system health by an event and the customer will find the problem in the event logs of mmhealth.
- Work around: Before the implementation of the fix the information had to be collected from the log files.
- Problem trigger: Unexpected behavior of a helper script called by CES monitor daemon. The helper script may die because of low memory, blocked lock, etc.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: Suggested: has little or no impact on customer operation
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20735
- Problem description: GPFS daemon could assert when trying to mount a file system. This could happen after a node failure and file system is being mounted again after daemon restart. File system manager node would also fail with an assert.
- Work around: None
- Problem trigger: A client node failure
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20788
- Problem description: After upgrading a cluster with a pre-4.1 file system which has quotas enabled, the old user-visible quota files will be converted to GPFS internal files. This change is kept in the stripe group descriptor for the file system. However, this change is not broadcast to all nodes and causes a metadata inconsistency leading to the assert.
- Work around: Method 1) Run the command "mmumount -a", then "mmmount -a", after upgrading a pre-4.1 file system which has quota enabled. Method 2) Execute commands that update the stripe group descriptor for the file system; for example, use mmchdisk to suspend and then resume one of the disks of the file system.
- Problem trigger: After upgrading a pre-4.1 file system which has quota enabled, user.quota, group.quota and fileset.quota will be migrated to regular files. In rare cases, accessing them (through the VFS interface or internally by tools like mmrepairfs) may cause a log assert.
- Symptom: Abend
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20737
- Problem description: Create fileset can be called before inode manager recovery has started which hits sig11 when accessing uninitialized variable.
- Work around: Wait for inode manager recovery to be completed as part of mount before create fileset.
- Problem trigger: Create fileset before inode manager recovery has started.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: Filesets
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20740
- Problem description: If unmount interrupts inode manager recovery, it results in file system panic.
- Work around: Wait for inode manager recovery to be completed as part of mount before unmount.
- Problem trigger: Unmount while inode manager recovery is in progress.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20802
- Problem description: Attacker can inject arbitrary commands through Spectrum Scale GUI or CLI when using mmsmb commands
- Work around: None
- Problem trigger: Command injection
- Symptom: May not be any errors, or you may see Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: SMB
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20807
- Problem description: Attacker can inject arbitrary commands through Spectrum Scale GUI or CLI when using mmsmb commands
- Work around: None
- Problem trigger: Command injection
- Symptom: May not be any errors, or you may see Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: SMB
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20888
- Problem description: When updating the file size for preallocation, the new file size is calculated incorrectly, which results in an unexpected file size.
- Work around: Do not try to preallocate the same block more than once.
- Problem trigger: In an FPO cluster, the problem can be triggered if one tries to pre-allocate the same block more than once and the second request has a larger file size.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: FPO
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20889
- Problem description: The mmvdisk server list command may fail if the servers involved have separate daemon and admin interfaces.
- Work around: None
- Problem trigger: Having GNR servers with separate admin and daemon interfaces.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: ESS/GNR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20890
- Problem description: When suspending an ECE server, the server may be incorrectly identified as a quorum node, which may prevent the server from being suspended.
- Work around: Do not issue the suspend command on a quorum node.
- Problem trigger: Issuing the mmvdisk recoverygroup --suspend command on a quorum node.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: ESS/GNR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20891
- Problem description: When the SG manager fails during a snapshot command, the new one cleans up incomplete operations during one-time async recovery. This depends on resetting its snapshot state to match the stripe group descriptor that is stored on disk. However, the snapshot state of non-SG-manager nodes is slightly ahead of the stripe group manager during the final stages of a snapshot deletion. The new SG manager needs to correct this discrepancy during takeover when it rereads the descriptor from disk. Otherwise, in rare cases, this inconsistency can lead to an FSSTRUCT error during subsequent snapshot commands.
- Work around: There is no preventative measure. After problem occurs, however, restarting the new stripe group manager manually will resolve it.
- Problem trigger: Stripe group manager crash during snapshot commands.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: Snapshots
- Customer Impact: Very rare, mysterious errors during snapshot commands
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20892
- Problem description: mmfsck --estimate-only option shows unreasonable estimates for some file systems.
- Work around: None
- Problem trigger: File system with larger log file sizes.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: FSCK
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20940
- Problem description: In certain configurations, where node name does not contain full domain name suffix, mmvdisk --server will return partial node name string which is not resolvable and cause mmvdisk to print out an error
- Work around: None
- Problem trigger: mmvdisk with --server option
- Symptom: mmvdisk --server will return partial node name string which is not resolvable and cause mmvdisk to print out an error
- Platforms affected: N/A
- Functional Area affected: GNR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ20674 IJ20675 IJ20676 IJ20677 IJ20678 IJ20695 IJ20709 IJ20710 IJ20725 IJ20726 IJ20728 IJ20730 IJ20733 IJ20734 IJ20735 IJ20736 IJ20737 IJ20739 IJ20740 IJ20741 IJ20742 IJ20788 IJ20789 IJ20791 IJ20797 IJ20802 IJ20805 IJ20807 IJ20808 IJ20809 IJ20811 IJ20888 IJ20889 IJ20890 IJ20891 IJ20892 IJ20909 IJ20940 IJ20941 IJ20948 IJ20951.
Problems fixed in Spectrum Scale 5.0.4.2 for Protocols include the following:
- nfs: Fix responding with NFS version mismatch
- nfs: Fix accessing object handle after freeing its last state
- nfs: call set_current_entry only after checking state_lock
- nfs: Add LogEventLimited to trace in fsal_common_is_referral
- nfs: Add Per client and per export stats
- nfs: Hold latch in mdcache_new_entry() until mdcache_lru_insert() completes
- nfs: ganesha version V2.7.5-ibm054.03
- For RPCSEC_GSS handle messages for negotiation or with wrong creds
- install-toolkit: RHEL 8.1 support
- install-toolkit: Config populate support for ess3k environment
- install-toolkit: BDA HDFS protocol support through toolkit.
- smb: Version gpfs.smb 4.9.16_gpfs_34-1
Problems fixed in Spectrum Scale 5.0.4.1 for Protocols include the following:
- zimon: Added missing encoding of special characters to prevent breakage of the REST APIs parsing
- smb: Version gpfs.smb 4.9.13_gpfs_33-1
- smb: Close ctdbd inflight connecting TCP sockets after fork.
- smb: Avoid orphaning the TCP incoming queue
- smb: Process all records not deleted on a remote node
- nfs: ganesha version V2.7.5-ibm053.02
Problems fixed in Spectrum Scale Protocols Packages 5.0.4-0 [Oct 18, 2019]
- Please see the "What's New" page in the IBM Knowledge Center
Document Information
Modified date:
18 February 2020
UID
isg400004819
