Topic
  • 38 replies
  • Latest Post - ‏2019-11-22T19:54:11Z by gpfs@us.ibm.com
gpfs@us.ibm.com

Pinned topic IBM Spectrum Scale V5.0 announcements

‏2017-12-18T19:06:21Z |
  • gpfs@us.ibm.com

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2017-12-18T19:07:09Z  

    Flash (Alert):  IBM Spectrum Scale: NFS operations may fail with IO-Error

    Abstract

    IBM has identified an issue with IBM Spectrum Scale 5.0.0.0 Protocol support for NFSv3/v4 in which IO-errors may be returned to the NFS client if the NFS server accumulates file-descriptor resources beyond the defined limit. Accumulation of file descriptor resources will occur when NFSv3 file create operations are sent against files that are already in use.


    See the complete Flash at:  http://www.ibm.com/support/docview.wss?uid=ssg1S1011791

  • gpfs@us.ibm.com

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2017-12-18T19:26:37Z  

    IBM Spectrum Scale 5.0.0.0 is now available from IBM Fix Central:

    http://www-933.ibm.com/support/fixcentral

    This topic summarizes changes to the IBM Spectrum Scale licensed
    program and the IBM Spectrum Scale library.

    Summary of changes
    for IBM Spectrum Scale version 5 release 0.0
    as updated, April 2017

    Changes to this release of the IBM Spectrum Scale licensed
    program and the IBM Spectrum Scale library include the following:

    Added DMPs for TIP events

           A topic is added listing the directed maintenance procedures for TIP events.
           The DMPs help users resolve issues caused by TIP events.

    AFM and AFM DR

           - Compression and snapshot ILM policy supported.
           - A general recommendation added for the Gateway node.
           - Configuration parameters added - afmMaxParallelRecoveries, afmAsyncOpWaitTimeout,
               afmSyncOpWaitTimeout, and afmRevalOpWaitTimeout.
           - Configuration parameters modified - afmRPO and afmHashVersion.

    Authentication: Authentication packages

           Updated the authentication page to include packages specific to Ubuntu.

    Authentication: AD-based authentication

           New information is added on NFS with server-side group
           lookup and Active Directory authentication.

    Authentication: Primary group selection configurable for AD + RFC2307 based authentication

           The ability to choose the primary group as set in the "UNIX attributes" of a user on
           Active Directory is introduced with the AD + RFC2307 based authentication scheme.
           Earlier, the Windows primary group was selected as the primary group by default.

    Big data and analytics
           - The GPFS Ambari integration package is now called the IBM Spectrum
             Scale Ambari management pack (in short, management pack or MPack).
           - IBM Spectrum Scale Ambari management pack version 2.4.2.1 with
             HDFS Transparency version 2.7.3.1 supports BI 4.2/BI 4.2.5 IOP migration
             to HDP 2.6.2.
           - Supports the remote mount configuration in Ambari.
           - Supports the multiple file systems configuration. In management
             pack version 2.4.2.1, the current limit is two file systems.
           - Short-circuit write is supported for better performance.
           - In HDFS Transparency, Ranger performance is enhanced.

    Changes to IBM Spectrum Scale management API

       Added the following new commands:

           GET /perfmon/data
           GET /filesystems/{filesystemName}/afm/state
           DELETE /nodes/{name}
           POST /nodes
           GET /nodeclasses
           POST /nodeclasses
           DELETE /nodeclasses/{nodeclassName}
           GET /nodeclasses/{nodeclassName}
           PUT /nodeclasses/{nodeclassName}
           DELETE /jobs/jobId
           POST /filesystems/{filesystemName}/filesets/{filesetName}/psnaps
           DELETE /filesystems/{filesystemName}/filesets/{filesetName}/psnaps/{snapshotName}
           GET /thresholds
           GET /thresholds/{name}
           POST /thresholds
           DELETE /thresholds/{name}
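
       For illustration only, the new endpoints can be called over HTTPS against the GUI/REST
       server. The host name, credentials, and the /scalemgmt/v2 base path below are assumptions,
       not taken from this announcement; adjust them to your installation.

           # List the threshold rules defined in the cluster (placeholder host and credentials)
           curl -k -u admin:PASSWORD https://gui-node.example.com:443/scalemgmt/v2/thresholds

           # Query the AFM state of a file system named fs1 (placeholder name)
           curl -k -u admin:PASSWORD https://gui-node.example.com:443/scalemgmt/v2/filesystems/fs1/afm/state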

    IBM Spectrum Scale GUI changes

       - Added new Networks page to monitor the performance, configuration, and
          adapters of network configurations in the cluster. You can monitor the
          network performance with respect to the IP and RDMA interfaces used in the configuration.
        - Added new Monitoring > Thresholds page to create and monitor the threshold rules
          that are defined in the system.
        - Added Access > Remote Connections page to enable the GUI node of the
          local cluster to monitor the remote cluster by establishing a connection
          with the GUI node of the remote cluster.
        - Added Settings > Call Home page to configure call home. Configuring
          the call home feature helps IBM® Support to monitor the system and
          reduces the response time of IBM Support to resolve any issues.
          The diagnostic data that is downloaded through the Settings > Diagnostic Data page
          can be uploaded to a problem management record (PMR) by using
          the call home feature in the backend. To upload the diagnostic data, right-click
          the relevant data set in the Previously Collected Diagnostic Data section, and
          select Upload to PMR.
        - Added file system creation capabilities in GUI. Use the Files > File Systems > Create File System
          option to launch the Create File System wizard. In the Create File System wizard, you can specify
          the following details of the file system:
             - File system name
             - Storage pools
             - NSDs for the file systems
             - Failure groups
             - NSD order for data writes
             - Maximum number of Spectrum Scale clients
             - Maximum number of inodes of the root fileset
             - Whether to enable quota and scope for the quota definition
             - Whether to enable DMAPI
             - Mount point and automatic mount mode
         - Added the aggregation levels Access Point and Filesets and removed Account
           for the resource type Transparent Cloud Tiering in the Monitoring > Statistics page.
        - The Files > Transparent Cloud Tiering page now displays the file systems
          and filesets that are mapped with the cloud service. It also shows the
          connection of such a container pair configuration to a cloud account and
          the corresponding CSAP that is configured for the cloud account.
        - Changes to capacity monitoring in the GUI
             - Moved the Capacity page from Monitoring to Files menu in the navigation
               and renamed the GUI page to User Capacity.
             - Only the file data user capacity can be viewed from the Files > User Capacity page.
               Removed the pools, filesets, file system capacity monitoring options from the Files > User Capacity page.
               You can monitor the capacity of these components from the respective GUI pages.
              - Replaced the GPFSPoolCap sensor with the GPFSPool sensor. Separate data and metadata
                level capacity monitoring is introduced in the performance charts available in the
                Files > File Systems and Storage > Pools pages.
             - New GPFSPool-based data and metadata performance monitoring metrics are available
               for selection in the Files > Statistics > Edit > Capacity section.
               You need to select the aggregation level as Pool to view these metrics.
        - AFM monitoring changes in the GUI
             - Provides the number of AFM filesets and the corresponding export server maps.
               Each export map establishes a mapping between the gateway node and the NFS
               host name to allow parallel data transfers from cache to home.
             - By using the Request Access option available in the Files > Active File Management
               or Access > Remote Connection page in the GUI, you can now establish connection with remote clusters.
               After establishing the connection, you can monitor the following AFM and AFM DR
               configuration details across clusters:
                  * On home and secondary, you can see the AFM relationships configuration,
                    health status, and performance values of the Cache and Disaster Recovery grids.
                   * On the Overview tab of the detailed view, the available home and secondary inodes are displayed.
                   * On the Overview tab of the detailed view, details such as NFS throughput,
                     IOPS, and latency are available, if the protocol is NFS.
        - New option to create AFM peer snapshots through GUI. Use the Create Peer Snapshot option
          in the Files > Snapshots page to create peer snapshots. You can view and delete these peer snapshots from
          the Snapshots page and also from the detailed view of the Files > Active File Management page.

    Encryption: GSKit V8 improves cryptographic performance on IBM POWER8

       The IBM Global Security Kit (GSKit) Version 8 and later improves cryptographic performance
       on IBM POWER8 hardware. The version of GSKit that is shipped with IBM Spectrum Scale v5.0.0
       offers better performance on POWER8, compared with the versions shipped with earlier releases.

    File compression: The lz4 library provides fast access to compressed data

       File compression supports the lz4 compression library. Lz4 is intended primarily for active data and
       favors read-access speed over maximized space saving.
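
       As a sketch only: the lz4 value for mmchattr --compression and the mmlsattr check below are
       based on the file compression documentation for this release, and the file path is a placeholder.

           # Compress a single active file with lz4 (assumed value spelling)
           mmchattr --compression lz4 /gpfs/fs1/data/active.dat

           # Confirm the compression attribute afterwards
           mmlsattr -L /gpfs/fs1/data/active.dat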

    File data: Block and subblock sizes improve I/O performance and reduce fragmentation

       The default block size is larger, 4 MiB instead of 256 KiB, and the sizes of subblocks relative to blocks
       are smaller, for example, 8 KiB subblocks in a 4 MiB block. A larger block size improves the
       file system performance and a smaller subblock size reduces the amount of unused space. For many business
       applications, the default value of 4 MiB provides the best balance of improved performance
       and reduced fragmentation.

    File encryption: AES-XTS encryption is faster on x86 in non-FIPS mode

        On x86 architecture in non-FIPS mode, file encryption with the AES algorithm in XTS mode
        is faster than in earlier releases.

    File systems: File system rebalancing is faster

        Rebalancing is implemented by a lenient round-robin method that typically runs faster than the
        previously used method of strict round robin. The strict round robin method is available as an option.

    Installation toolkit changes

        - The installation toolkit has added support for the installation and the deployment
          of IBM Spectrum Scale in a cluster containing Elastic Storage Server (ESS).
        - The installation toolkit has added support for enabling and configuring call home.
        - The installation toolkit has added support for enabling and configuring file audit logging.
        - The installation toolkit has added support for the installation and the
          deployment of IBM Spectrum Scale on Ubuntu 16.04 LTS nodes.
        - The installation toolkit has added verification of passwordless SSH during
          prechecks before installation, deployment, or upgrade.
        - The installation toolkit has added support for cumulative object upgrade.

    mmafmctl command

        The --outband parameter is deprecated.

    mmcallhome command: Enhancements

        - Addition of the -Y option
            * The -Y option displays the command output in a parseable format with a colon (:) as a field delimiter.
        - Addition of --pmr option
            * The --pmr option allows you to upload data to existing PMRs using the
              mmcallhome run SendFile command.
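
        A hedged sketch of how these options might be used; the file path, the PMR number format,
        and the --file parameter are assumptions, and not every mmcallhome subcommand accepts -Y.

            # Parseable, colon-delimited output (example subcommand)
            mmcallhome status list -Y

            # Upload a previously collected file to an existing PMR (placeholder values)
            mmcallhome run SendFile --file /tmp/collected_debug_data.tar --pmr 12345,678,890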

    mmchconfig command: Enhancements

        - Encrypted files can be copied into an LROC device
            * With the lrocEnableStoringClearText attribute, you can control whether file
              data from encrypted files, which is held in memory as cleartext, is
              copied into a local read-only cache (LROC) device.
        - InfiniBand addresses can be specified for RDMA transfers
            * In the verbsPorts attribute, you can specify InfiniBand addresses
              for RDMA transfers between an NSD client and server.
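
        A minimal sketch, assuming the usual mmchconfig attribute=value syntax; the node classes and
        the verbsPorts value are placeholders (see the mmchconfig documentation for the InfiniBand
        address form of verbsPorts).

            # Keep cleartext of encrypted files out of LROC on the nodes in class lrocNodes
            mmchconfig lrocEnableStoringClearText=no -N lrocNodes

            # Set the RDMA ports used between NSD clients and servers (placeholder device/port value)
            mmchconfig verbsPorts="mlx5_0/1" -N nsdNodes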

    mmchnsd command: Change NSDs without unmounting the file system

        When you add or remove NSDs or do other operations with mmchnsd,
        you do not need to unmount the file system.
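
        For example (a sketch with placeholder NSD and server names, assuming the stanza-file form
        of mmchnsd), the file system that uses nsd1 can stay mounted while its server list is changed:

            # nsd_change.stanza contains:
            #   %nsd: nsd=nsd1
            #     servers=serverA,serverB
            mmchnsd -F nsd_change.stanza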

    mmcrfs command: Enhancements

         - The default data block size is 4 MiB with an 8 KiB subblock size
             * If no block size is specified, a file system is created with a 4 MiB block size and an
               8 KiB subblock size. The minimum release level (minReleaseLevel) of the
               cluster must be 5.0.0 or greater when the file system is created.
         - The default log file size depends on block size and metadata size
              * If the block size is 512 KiB or larger and the metadata block size is 256 KiB or larger,
               then the default log file size is 32 MiB. Otherwise, the default log file
               size is 4 MiB or the metadata block size, whichever is larger.
         - The default method for updating atime is relatime
             * If the minimum release level (minReleaseLevel) of the cluster is 5.0.0 or greater
               when the file system is created, the default method for updating atime is relatime.
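
         A hedged example; the device names, stanza files, and mount points are placeholders. With the
         cluster at minReleaseLevel 5.0.0, no -B option is needed to get the new defaults:

             # Created with a 4 MiB block size, 8 KiB subblocks, and relatime by default
             mmcrfs fs1 -F nsd.stanza -T /gpfs/fs1

             # The block size can still be set explicitly when the defaults are not wanted
             mmcrfs fs2 -F nsd2.stanza -T /gpfs/fs2 -B 1M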

    mmdsh command: Several options are no longer supported

         The --ignoreSignal, -I, and -d options are no longer supported.
         Do not use these options unless instructed to by IBM support personnel.

    mmfsck command: Display an interim status report at any time

         While a long-running instance of mmfsck is in progress, you can start another instance
         of mmfsck with the --status-report parameter to display current
         status information from all the nodes that are participating in the mmfsck run.
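
          For example (the file system name is a placeholder), while a long check started with
          "mmfsck fs1 -y" is still running, the interim status can be requested from any node:

              mmfsck fs1 --status-report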

    mmgetstate command: Display the unresponsive state

         The command returns the unresponsive state when the GPFS
         daemon is running but is not responding.

    mmhealth command: Addition to measurement options

          Measurement options for filesystem, SMB node, and NFS node
          have been added to the mmhealth command.

    mmkeyserv command: The simplified method supports certificate chains from a certificate authority.

         In the simplified method, with the --kmip-cert parameter, you can set up encryption with IBM®
         Security Key Lifecycle Manager (SKLM) as the key management server and with a certificate signed
         by a certificate authority (CA) on the KMIP port of the SKLM server.

    mmnetverify command: Enhancements

         - Verify the network operation of nodes in a subnet
             * With the --subnets parameter, you can specify the subnet
               addresses of the nodes that you want to verify.
         - Verify that nodes can handle a new MTU size
             * With the --ping-packet-size parameter, you can specify the size
               of the ICMP echo request packets that are sent between the local node and the
               target node during the ping test.
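
          A sketch with placeholder values; the ping operation name is taken from the description
          above, and the exact option spellings should be checked in the mmnetverify documentation.

              # Check nodes whose addresses are on the given subnet
              mmnetverify ping --subnets 10.1.1.0

              # Verify that a 9000-byte MTU path works (8972-byte ICMP payload plus headers)
              mmnetverify ping --ping-packet-size 8972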

    mmtracectl command: Display the tracing status

         The --status parameter displays the tracing status of the specified nodes.

    New feature for threshold monitoring

          Starting from IBM Spectrum Scale version 5.0.0, if multiple threshold rules
          have overlapping entities for the same metrics, only one of the concurrent rules is
          made actively eligible.

    NFS: Dynamic export changes

          You can dynamically change the export configuration without restarting the NFS service.

    Object

          - Support for Ubuntu
          - Support for sudo wrapper for Object on Ubuntu
          - Support for cumulative upgrades from older versions
          - Object snap enhancement to contain keystore logs

    Protocol support: Enhanced

          Protocol support is extended to Ubuntu 16.04.

    Setting up a system for storing crash files for Ubuntu

          A topic is added to describe how to set up a system for storing crash files for Ubuntu.

    SMB: DFS redirects for SMB shares

          New option to configure DFS redirects for SMB shares.

    SMB: SMB server upgrade changes

          Two events on CTDB version match/mismatch are added to the RAS events.

    Sudo wrappers: Root-level processes can call administration commands directly

          Root-level background processes, such as cron and callback programs, can
          successfully call administration commands directly rather than through sudo
          when sudo wrappers are enabled.

    Supported clients for NFS

          A topic is added listing the clients that are supported by NFS protocol.

    Transparent cloud tiering

          - Support for multiple cloud storage accounts
          - Support for multiple file systems or filesets per node group
          - Enhanced support for large file systems provided by container spillover
          - Support for associating file sets with containers for enhanced granularity
          - Multiple URL and region support at the node level
          - Support for creating a cloud service separately for tiering and sharing operations.
          - Unique encryption key per cloud container
          - Support for remotely mounted clients.
          - Support for Amazon S3 regions requiring Sigv4 security support,
            including the US government cloud region.
          - Ability to enable or disable transparent recall for files for a given file
            system instantly, without having to rewrite a policy.
          - Support for backing up and restoring the Cloud services configuration in case of any disaster.
          - Support for backing up the Cloud services database to the cloud.
          - Support for restoring Transparent cloud tiering service on an identical backup cluster.
          - Support for checking the integrity of the Cloud services
            database after any system crash or outage.
          - Support for auditing events relating to each operation performed in Cloud services.

    New commands

          mmaudit mmmsgqueue

    Changed commands

          mmafmctl mmadddisk mmcallhome mmchattr mmchcluster mmchconfig mmchfs mmchnsd
          mmcloudgateway mmcrcluster mmcrfs mmdeldisk mmdsh mmfsck mmgetstate
          mmkeyserv mmnetverify mmnfs mmrestripefile mmrestripefs mmsmb mmtracectl
          mmuserauth

    Deleted commands

          mmrest

    New messages

          6027-1264, 6027-1757, 6027-2394, 6027-2395, 6027-2396, 6027-2397
          6027-2398, 6027-2399, 6027-2400, 6027-2401, 6027-3259, 6027-3408
          6027-3597, 6027-3598, 6027-3599, 6027-3600, 6027-3601, 6027-3602
          6027-3603, 6027-3604, 6027-3730, 6027-3921, 6027-3922, 6027-3923
          6027-3924, 6027-3925, 6027-3926, 6027-3927, 6027-3928
          6027-3929, 6027-3930, 6027-3931, 6027-4019

    Changed messages

          6027-928

     

  • gpfs@us.ibm.com

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-02-02T18:03:08Z  

    Flash (Alert):  IBM Spectrum Scale (GPFS):  Undetected corruption of archived sparse files (Linux)

    Abstract

    IBM has identified an issue with IBM GPFS and IBM Spectrum Scale for Linux environments, in which a sparse file may be silently corrupted during archival, resulting in the file being restored incorrectly.

     

    See the complete Flash at:  http://www.ibm.com/support/docview.wss?uid=ssg1S1012054

  • gpfs@us.ibm.com

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-02-14T19:37:47Z  

    GPFS 5.0.0.1 is now available from IBM Fix Central:

    http://www-933.ibm.com/support/fixcentral

    Problems fixed in GPFS 5.0.0.1

    February 7, 2018

    * Fix offline fsck deadlocks that can occur when there are orphaned inodes IJ02867.
    * Fix the pending log file migration assert that can occur when doing file system restripe operations or adding/deleting/changing the file system disks IJ02867.
    * Fix a problem in which if inode expansion is interrupted, it may leave nAllocatedInodes inconsistent between sg descriptor and the fileset metadata file IJ03146.
    * Change code to recover corrupted files in CCRs committed directory during GPFS startup to prevent these corrupted files causing other components to fail IJ03144.
    * Address a problem where a user cannot make changes to the afmTarget once the fileset has been created with a wrong mapping name or host name in the afmTarget field IJ02867.
    * Fix node hangs due to consumption of DMAPI event mailboxes IJ03141.
    * Fix a log assert which may happen during mmdelsnapshot if a file in the snapshot has a DITTO xattr overflow block address IJ02867.
    * If HAWC is enabled for a file system for which log recovery has failed, the recovery log is no longer dumped, because the recovery log may contain user data. Also, dump files are now created with more restricted permissions IJ02867.
    * Fix a gpfsReserveDelegation exception which can occur if a kworker returns a nfs4 lease IJ02867.
    * Address the issue to find latest common RPO snapshot between acting and old primary to trigger a restore in failbacktoprimary IJ02867.
    * Fix a rare case long waiters 'waiting for new SG mgr' which may happen if a file system has no external mounts and 'tsstatus -m <fsname>' command is run on the fs manager node in a specific time window IJ02867.
    * Fix a sample script filehist that may fail with divide by zero error IJ03142.
    * Fix the orphan inode issue found after a deletion of the dependent fileset. This only refers to encrypted clone files while doing fileset deletion and only if the key management server is unavailable IJ02867.
    * Address the issue of a valid return code being returned even if the failback command failed to execute. This does not occur with IW filesets IJ02867.
    * Fix a mmfsd assert at: Assert exp(mdiBlockP != __null) ts/vdisk/mdIndex.C 2299. This can happen during a vdisk creation and the repair thread trying to harden the metadata onto the disk IJ02867.
    * Fix erratic inode expansion behavior and spurious 'Expanded inode space' log messages under multi-node create workload IJ02867.
    * Address a problem where a deadlock can happen if there is application IO occurring to the AFM fileset when the home/secondary site fileset has gone stale IJ02867.
    * Fix Assert exp(inodeFlushFlag) openinst-vfs.C 1560 that can occur while updating extended attributes IJ02867.
    * Fix assert exp(!"oldDiskAddrFound.compAddr(*oldDiskAddrP)") which may happen when you preallocate data in an inode file. Note, fallocate() on GPFS file system or write()/fallocate() on a FPO file system can trigger preallocations IJ03163.
    * Fix an issue in the AFM environment where inband trucking tries to copy the data back to secondary even though the data already exists IJ02867.
    * Fix a problem in which incorrect node address managed in RGCM keeps causing RG to be resigned and recovery failure when primary node is down IJ03247.
    * Fix an issue in the AFM environment where some files are moved to .ptrash directory intermittently over GPFS backend IJ03148.

    * Fix a problem where the Receive Worker threads go CPU bound after a kernel crash IJ03147.
    * Fix a ganesha kernel crash IJ02867.
    * Fix a hang during unmount which can occur if QOS has been enabled on any file system IJ02867.
    * Fix a deadlock that can occur during file system repair IJ02867.
    * Fix a problem in which mmfileid fails to list small files IJ03156.
    * Fix a deadlock involving a failed "mmfsctl resume" command and an SG panic while having disk issues IJ02867.
    * Fix a rare race during DA integrity manager service state transition, which may cause assert like "Assert exp(nIMTasks == 0" IJ02867.
    * Fix a rare case that truncate() does not set file size correctly. The file size is set to full block boundary incorrectly and the fragment is lost IJ03149.
    * Fix a mmapplypolicy/tsapolicy core dump: ThreadThing::check mutexthings.C:170 and an improper recovery from helper failure during a directory scan IJ02867.
    * Fix an issue in the AFM environment where already existing uncached files are not prefetched correctly IJ03150.
    * Fix a log assert "Assert exp(totalSteps >= 0) in file workthread.C". It happens when running mmlsfileset -r or mmlsfileset -i command against a file system which has a huge inode number or lots of independent filesets IJ03233.
    * Fix a problem where pmsensor service crashes because there are NULL entries returned from mmpmon for AFM filesets IJ03151.
    * Fix a problem in which the fileset failed to run recovery during a failover IJ02867.
    * Fix an issue in the AFM environment where daemon deadlocks during recovery with recursive dependent rename operations IJ02867.
    * Fix a NUMA discovery problem for nodes with GPUs IJ03161.
    * Fix a problem in which a directory inside an IW fileset cannot fetch new changes to the directory made at its home counterpart. This can happen following a recovery or failover that has been run on the IW fileset at the cache site IJ02867.
    * Fix a problem in which mmfsck will fail with "There is not enough free memory available for use by mmfsck in ..." due to a memory leak IJ02867.
    * Fix problem that could cause count of read() or write() calls to be under-counted in application I/O mmpmon and performance monitor metrics IJ02867.
    * Fix an issue in the AFM environment where files are moved to .ptrash during the rename on independent-writer mode filesets IJ02867.
    * Fix a mount failure by allowing getEFOptions to work even if it can't get a local environment lock as long as it can access the latest mmsdrfs file IJ03236.
    * Fix fcntl performance issue IJ03152.
    * Fix a locking issue during prefetching of a directory block that can lead to a FSSTRUCT error being incorrectly issued. This could happen when there is a race between expanding the first directory block on one node and prefetching of the same block on another node IJ02867.
    * Fix a recall problem on AIX that can occur during reads and writes of a non-resident file IJ02867.
    * Fix a problem that could cause mmpmon histograms not to be updated while doing sequential I/O of less than the file system block size IJ02867.
    * Fix an issue in the AFM environment where gateway nodes crash intermittently. Also fix an issue where lookup returns incorrect results IJ02867.
    * Fix a rare timing assertion when the file system is forced to unmounted at the same time that quota files are being flushed to disk IJ02867.
    * Fix problems with network monitoring caused by names like loop@loop IJ02867.
    * This fix ensures that server reachability is accurately reported for multiple servers with the CES stack configured for LDAP authentication IJ03796.

    * Fix a problem where, on AIX, the mmcrnsd call clears out the PVID that was assigned by the OS IJ03159.
    * Fix an unnecessary file system panic and unmount on client nodes during mmchdisk start command. The file system panic/unmount could occur when a disk which has been started becomes unavailable again in the middle of mmchdisk start command IJ03238.
    * Fix a problem where, when a recovery policy fails with an error 2, we need to rerun the policy with higher debug level for policy IJ02867.
    * Fix an issue where recovery was stuck in the local cluster due to GW node changes in a remotecluster environment IJ02867.
    * This fix adds a protection to prevent a compressed fragment from being expanded without being uncompressed first in some unexpected conditions of having inconsistent compression flags. This fix also replaces an assert with an IO error to minimize the user impact IJ03153.
    * Fix an issue that remote NSD clients drop into a long time retry loop during an ESS outage. This can occur when there are multiple ESS building blocks and GPFS replication is enabled in the cluster. When shutting down both servers of a ESS building block simultaneously, remote NSD clients can experience a long retry loop like 'waiting for stateful NSD server error takeover (1)' IJ03154.
    * Fix the code to call the tiebreakerCheck user exit script in case the CCR is enabled IJ02867.
    * Fix a problem where recovery keeps failing with an error 2 because the AFM recovery script wasn't able to handle directory names in the fileset that had trailing spaces in them IJ03157.
    * This fix adds a new config option numactlOption for setting the NUMA nodes the GPFS daemon can allocate from IJ03158.
    * This fix adds a hidden option to set a file/dir as local using the mmafmctl script so that it doesn't replicate the changes on the file/dir to the secondary/home site IJ02867.
    * This fix synchronizes write failures and read operations to avoid reading stale data from disk IJ02867.
    * RGCK might be used very rarely in the case of RG recovery failure. Fix an assert like "logAssertFailed: OWNED_BY_CALLER(lockWordCopy, lockWordCopy)" when trying to revive a defective pdisk in RGCK IJ02867.
    * This fix addresses a problem where reading a symbolic link pointing to nothing at home can cause an Assert at the cache site IJ03239.
    * Fix an Object Authentication (Keystone) configuration failure that occurs on External Keystone with the v3 API IJ02867.
    * Fix a problem where AFM/DR recovery can cause buffer overflows for path names that are really long (beyond 1024 characters) IJ02867.
    * Fix an issue in the AFM environment where a file listing during readdir fails for dirty files in local-updates mode. This problem happens with a ganesha NFS server having AFM local-updates mode fileset exports IJ03425.
    * Fix a problem in which gpfs.snap stops with an error message when it stores (TARs) log files IJ03847.
    * Fix a problem in which open(O_TRUNC) returns with an error but still truncates the file IJ03850.
    * This update addresses the following APARs: IJ02867 IJ03141 IJ03142 IJ03144 IJ03146 IJ03147 IJ03148 IJ03149 IJ03150 IJ03151 IJ03152 IJ03153 IJ03154 IJ03156 IJ03157 IJ03158 IJ03159 IJ03161 IJ03163 IJ03233 IJ03236 IJ03238 IJ03239 IJ03247 IJ03368 IJ03425 IJ03688 IJ03796 IJ03847 IJ03850

     

  • gpfs@us.ibm.com

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-02-26T15:29:16Z  

    Security Bulletin: A vulnerability has been identified in IBM Spectrum Scale that could allow a local unprivileged user access to information located in dump files. User data could be sent to IBM during service engagements (CVE-2017-1654).

    Summary
    A vulnerability has been identified in IBM Spectrum Scale that could allow a local unprivileged user access to information located in dump files. User data could be sent to IBM during service engagements (CVE-2017-1654).

    See the complete bulletin at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010869

  • gpfs@us.ibm.com

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-03-02T17:23:12Z  

    Security Bulletin:  Vulnerabilities in Samba affect IBM Spectrum Scale SMB protocol access method (CVE-2017-14746, CVE-2017-15275)

    Summary
    Vulnerabilities in Samba affect IBM Spectrum Scale SMB protocol access method that:
    - could allow a remote attacker to execute arbitrary code on the system, caused by a use-after-free memory error (CVE-2017-14746)
    - could allow a remote attacker to obtain sensitive information, caused by a heap memory information leak (CVE-2017-15275)

    See complete bulletin at:  http://www.ibm.com/support/docview.wss?uid=ssg1S1012067

  • gpfs@us.ibm.com

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-04-10T18:37:12Z  

    GPFS 5.0.0.2 is now available from IBM Fix Central:

    http://www-933.ibm.com/support/fixcentral

    Problems fixed in GPFS 5.0.0.2

    March 19, 2018

    * Fix a "logAssertFailed: (SGFilesetId)recordNum <= ((SGFilesetId)999999999)" that can happen when NFS clients access the same files in a snapshot of an independent fileset IJ04124.
    * Fix an assert that can occur while running mmrestripefs and creating a new replacement when a recovery log on one of the disks becomes suspended IJ03961.
    * Fix a possible memory corruption that can occur when group quota information is retrieved by multiple clients concurrently IJ03961.
    * This fix allows mmbackup to continue even if the shadow database file is determined to be a binary file by the grep command IJ04127.
    * Fix a problem in which mmlsmount fs will always show that the fs is in the internal mount state on the SG mgr node IJ03961.
    * Fix a "struct error: Invalid XAttr overflow" which may occur during a snapshot IJ03961.
    * Fix a problem in which a file fails to be objectized. This can occur when you have a large number of OpenStack projects and accounts, for example over 5000 IJ04188.
    * Fix a problem in which a node is expelled while there are multiple network reconnects occurring IJ03961.
    * The documentation for mmchqos was updated to be more clear IJ04315.
    * Fix a core dump that can occur during a memory free IJ03961.
    * This fix improves the health state monitoring for SMB components IJ04186.
    * Fix a replica mismatch which can occur if a restripefs -r attempts to migrate data off a suspended disk IJ04123.
    * Fix a problem that results in a corrupt entry being created in the Ganesha exports configuration file. This can occur when running the mmnfs export add command and entering all white space for the --client option IJ03961.
    * Fix an exception in AclDataFile::findAcl() that can occur during a node expel IJ03961.
    * Fix a problem in which a read prefetch was not triggered and sparse blocks were not fetched. This can happen when reading sparse files IJ03961.
    * Fix a SEGSIZE ("mallocSize < SEGSIZE" assert) that can occur on very large ACL files on AIX IJ03961.
    * Fix assert "exp(isAllocListChanging())" which may fail during SGPanic IJ03961.
    * Fix a crash on a Gateway node that can occur while updating the policy attribute changed at home IJ04202.
    * Fix the inode indirection level assert issue that can happen during the failure process of clone file creation IJ03961.
    * Fix a memory leak issue when there is silent data corruption that can be caught and fixed by GNR buffer trailer compare mismatch. This memory leak may experience an assert like "logAssertFailed: *nReservedP == 0, 5932, vbufmgr2.C" IJ03961.
    * Fix an issue in AFM DR environment to not copy the already synched data that can occur during the role reversal process IJ03961.
    * Address a problem where a psnap creation on a gateway node, also serving as the FS manager can deadlock when the fileset in question is in need of a recovery IJ03961.
    * Fix a problem to allow * and / characters to be used in the option list of mmnfs command IJ03961.
    * Fix CCR code to avoid assertion of type 'cachedFIdMap.size() == committed.fileList.size()' IJ03961.
    * Enable buffer dirty bits debug data to be collected under "debugDataControl heavy" and trace level including "fs 4". The default trace level or "all 4" would work. Shorthands have been used to drastically reduce the amount of buffer dirty bits data and to make the data easier to read IJ04126.

    * Fix a signal 11 which can result from a race condition between daemon startup, file system mount and snapshot quiesce rpc handling IJ03961.
    * Fix a deadlock that can occur during fileset recovery where prepop recovery file should not be accessed until a recovery or queue transfer is completed IJ03961.
    * Fix a corner-case ESS and GSS deadlock that was observed during RG failover IJ03961.
    * Fix a potential assert when a compressed file is updated in the last data block causing a COW to the snapshot that was recently read IJ03961.
    * Fixed a performance issue in the AFM environment; replication of small files over high-latency networks is improved. The feature can be enabled by setting afmIOFlags=4 IJ04131.
    * Fix data loss when archiving snapshots and cloned files by using the "tar" command with the "-S" or "--sparse" options IJ03961.
    * Fix a problem where mmapplypolicy crashes or loops indefinitely after apparently completing all the work it should have done. This should be unusual, as it only applies when there was a failure of a "helper" during the execution phase IJ03961.
    * Fix a problem in which the PaxosChallengeCheck thread reported as a long waiter in the GPFS log and/or dump file IJ04129.
    * Fix a problem in which gpfs_igetattrs with 1M bufferSize fails with ENOMEM IJ04130.
    * Fix callhome not sending scheduled tasks on some Ubuntu systems IJ03961.
    * Fix an assert that can occur during a large data copy on a large compressed file IJ03961.
    * Fix a potential assert when a compressed file is extended due to a truncation operation beyond original file size IJ03961.
    * Fix a problem in which mmshutdown can't cleanup bind mounts and mmmount can't umount bind mounts. This can occur on older kernels IJ03961.
    * Fix mmbackup and mmimgbackup problem that can occur when used with IBM Spectrum Protect that includes incompatible gskit library IJ04132.
    * Fix a mmcrvdisk failure that can occur when a recoverygroup name contains a period IJ03961.
    * Fix a problem in which mmcloudgateway files import does not work IJ03961.
    * Fix the "bgP == __null" assert that can occur on truncation operations on compressed files IJ03961.
    * Fix "mmcallhome group auto" creating unchangeable dummy group settings on SLES 12 IJ03961.
    * Fix a double memory free issue, which may cause assert like "Assert exp(vHoldCount > 0) in vbufmgr.C:280" which can occur during pdisk errors IJ03961.
    * Fix CCR client code to avoid segmentation fault during backup command IJ03961.
    * Fix code to avoid a segmentation fault during PaxosSharedDisk::readDblocks() in the GPFS mmfsd IJ03961.
    * This fix avoids a dereference of a NULL pointer during dumping of buffer dirty bits IJ03961.
    * Fix a temporary file system busy state when mounting the file system right after the file system name was changed IJ03961.
    * Fix a logAssert "exp(errP != NULL)" which may happen while accessing a gpfs snapshot on an nfs client. The log assert is caused by a race of the file access and a snapshot deletion IJ04206.
    * Fix a rare deadlock which can happen between commands that change the cluster manager (like mmchmgr, mmexpelnode, mmchnode --nonquorum) and a quorum lost event IJ04189.
    * Fix a kernel panic problem when we get an E_NOT_METANODE error when doing a mmap read/write on a compressed file IJ03961.
    * Fix assert "(client->state & RecLockMsgFree) != 0" which can occur during heavy fcntl activity during a quorum loss IJ03961.
    * Fix a "logAssertFailed: !isCfgMgr()" error which may happen after a node failure event IJ04187.
    * Fix recently introduced slow command performance. It affects server-based clusters that disable mmsdrservPort IJ04133.
    * Fix a problem that results in file tree inconsistency between cache and home that can result during massive file and directory renames and a dropped queue which results in a recovery IJ03961.
    * Fix a GPFS daemon signal 6 when flushing data in the presence of too many permanent pdisk faults. This is ESS and GSS IJ03961.

    * Fix code to avoid unexpected GPFS cluster manager changes to other quorum nodes. This seems to occur on very large clusters (over 500 nodes) during heavy IO stress, while the CCR on the remaining quorum nodes is under heavy load with many vputs/vgets/fputs/fgets IJ04192.
    * The mmhealth monitoring daemon was running some commands twice although this was not necessary. This was fixed to reduce unnecessary system calls IJ03961.
    * Fix a "TSChFilesetCmdThread: on ThCond 0x123DE0F0 (MsgRecordCondvar), reason 'RPC wait' for ccMsgNodeState" long waiter which can occur during unlinking or disabling pcache for filesets on Ganesha IJ03961.
    * This fix corrects inconsistent behavior of mmnfs export list command when -Y is used IJ03961.
    * Fix a problem in which "mmhealth node show -v" showed the event "callhome_enabled" even if it was not enabled. Even Spectrum Scale GUI showed this event too IJ03961.
    * Fix a problem in which pdisks are being incorrectly declared "slow" by the Spectrum Scale RAID disk hospital. This patch must be applied to all systems using Spectrum Scale RAID with write caching drives IJ04201.
    * The fix will prevent a daemon from crashing if a user uses a large value for the number of subblocks IJ04604.
    * Fix a python exception in the mmnetverify command which can occur when the flood operation is used with a target node whose GPFS node number is greater than 255 (will not fit in an 8-bit byte) IJ04316.
    * Fix a problem in which mmadquery utility was not able to list users. This can occur when the "AD domain name" is different from the "AD domain shortname" IJ04782.
    * Address a problem where cleanup on handlers can happen twice (one called from unmount of the file system and the other from a panic on the FS at the same time), and this could result in a bad memory access causing a Signal 11 IJ03961.
    * Fix an issue in the AFM environment where a daemon asserts at the gateway node when a file is being removed. This happens when a file is deleted immediately after the creation and the filesystem is already quiesced IJ03961.
    * Fix a problem in which AFM orphaned entries cannot be cleaned up online with an AFM disabled fileset IJ03961.
    * Fix a mmfsd core dump which can occur when mmpmonSocket is receiving events IJ04589.
    * Fix the false ENOENT error when operating files in an AFM fileset. This could happen when the inode being operated on is being evicted from the inode cache for a cache shrink IJ03961.
    * Fix an assert "Kernel Assertion: (nfsP->refCount >= 0) kernext/nfs.C:line 434". This can happen during snapshot restore stress IJ04811.
    * This fix prevents mmfsadm dump all from a seg fault that can occur if the numa node has no cpu defined IJ03961.
    * When pool usage exceeds the warning threshold configured by mmhealth, the message in /var/log/messages talks about "metadata" but it should be "data" IJ04800.
    * Fix a problem in which a GPFS cluster manager does not fail over. This can happen when the number of quorum is reduced in conjunction with tiebreaker disks IJ03961.
    * Fix a GPFS daemon abort of type: "signal 6, ccr/paxosserver.C assertion 'false' failed" This can occur during a CCR file commit IJ03961.
    * Fix a long waiter ProbeClusterThread: on ThCond 0x116D2F80 (0x116D2F80) (StripeGroupTableCondvar), reason 'waiting for SG cleanup'. This can occur if there are multiple threads doing AFM initialization IJ03961.
    * Fix a problem in which we fail to report that repeated options for mmnfs export add/list/change are not allowed IJ03961.
    * Fix a !ofP->destroyOnLastClose assert that can occur in an AFM environment while running mmunlinkfileset IJ04969.
    * This update addresses the following APARs: IJ03961 IJ04123 IJ04124 IJ04126 IJ04127 IJ04129 IJ04130 IJ04131 IJ04132 IJ04133 IJ04186 IJ04187 IJ04188 IJ04189 IJ04192 IJ04201 IJ04202 IJ04206 IJ04315 IJ04316 IJ04589 IJ04604 IJ04782 IJ04800 IJ04811 IJ04969.

     

  • gpfs@us.ibm.com

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-05-14T16:40:17Z  

    IBM Spectrum Scale 5.0.1.0 is now available from IBM Fix Central:

    http://www-933.ibm.com/support/fixcentral

    This topic summarizes changes to the IBM Spectrum Scale licensed
    program and the IBM Spectrum Scale library.

    Summary of changes
    for IBM Spectrum Scale version 5 release 0.1
    as updated, April 2018

    Changes to this release of the IBM Spectrum Scale licensed
    program and the IBM Spectrum Scale library include the following:

    AFM and AFM DR-related changes

            Added a new topic Role reversal.
            A new configuration parameter added - afmEnableNFSSec.

    Authentication-related changes

            A password saved in a stanza file can be retrieved by using the mmuserauth command.

    Changes in HDFS Transparency 2.7.3-2

            Snapshot from a remote mounted file system is supported.
            IBM Spectrum Scale fileset-based snapshot is supported.
            HDFS Transparency and IBM Spectrum Scale Protocol SMB can
              coexist without the SMB ACL controlling the ACL for files or directories.
            HDFS Transparency rolling update is supported.
            Zero shuffle for IBM ESS is supported.
            Manual update of file system configurations when root password-less access is not
              available for remote cluster is supported.

    Changes in Mpack version 2.4.2.4

            HDP 2.6.4 is supported.
            IBM Spectrum Scale admin mode central is supported.
            The /etc/redhat-release file workaround for CentOS deployment is removed.

    Changes in Mpack version 2.4.2.3

            HDP 2.6.3 is supported.

    Changes in Mpack version 2.4.2.2


            The Mpack version 2.4.2.2 does not support migration from IOP to HDP 2.6.2.
                For migration, use the Mpack version 2.4.2.1.
            From IBM Spectrum Scale Mpack version 2.4.2.2, new configuration parameters
                have been added to the Ambari management GUI. These configuration parameters are as follows:
                    gpfs.workerThreads defaults to 512.
                    NSD threads per disk defaults to 8.
                    For IBM Spectrum Scale version 4.2.0.3 and later, gpfs.workerThreads field takes effect
                        and gpfs.worker1Threads field is ignored. For versions lower than 4.2.0.3,
                        gpfs.worker1Threads field takes effect and gpfs.workerThreads field is ignored.
                    Verify if the disks are already formatted as NSDs - defaults to yes
            Default values of the following parameters have changed. The new values are as follows:
                    gpfs.supergroup defaults to hdfs,root now instead of hadoop,root.
                    gpfs.syncBuffsPerIteration defaults to 100. Earlier it was 1.
                    Percentage of Pagepool for Prefetch defaults to 60 now. Earlier it was 20.
                    gpfs.maxStatCache defaults to 512 now. Earlier it was 100000.
            The default maximum log file size for IBM Spectrum Scale has been increased to 16 MB from 4 MB

    Changes in Mpack version 2.4.2.1 and HDFS Transparency 2.7.3-1
            The GPFS™ Ambari integration package is now called the IBM Spectrum Scale Ambari management
                pack (in short, management pack or MPack).
            Mpack 2.4.2.1 is the last supported version for BI 4.2.5.
            IBM Spectrum Scale Ambari management pack version 2.4.2.1 with HDFS Transparency version
                2.7.3.1 supports BI 4.2/BI 4.2.5 IOP migration to HDP 2.6.2.
            The remote mount configuration in Ambari is supported. (For HDP only)
            Support for two Spectrum Scale file systems/deployment models under one Hadoop
                cluster/Ambari management. (For HDP only)
                    This allows you to have a combination of Spectrum Scale deployment models
                    under one Hadoop cluster. For example, one file system with shared-nothing storage
                    (FPO) deployment model along with one file system with shared storage (ESS) deployment
                    model under single Hadoop cluster.
            Metadata operation performance improvements for Ranger enabled configuration.
            Introduction of Short circuit write support for improved performance where HDFS client and
                Hadoop data nodes are running on the same node.

    Cloud services changes

            Support for backup and restore using SOBAR
            Support for automated Cloud services maintenance service for the following operations:
                Background removal of deleted files from the object storage
                Backing up the Cloud services full database to the cloud
                Reconciling the Cloud services database
            Support for setting up a customized maintenance window, overriding the default values

    File systems: Integration with systemd is broader
            You can now use systemd to monitor and manage IBM Spectrum Scale systemd services on
                configured systems. IBM Spectrum Scale automatically installs and configures GPFS
                as a suite of systemd services on systems that have systemd version 219 or later
                installed. Support for the IBM Spectrum Scale Cluster Configuration Repository
                (CCR) is included. For more information, see Planning for systemd.
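
            As an illustration, assuming the GPFS daemon unit is named gpfs.service (an assumption,
            not stated above), the service can be inspected with standard systemd tooling:

                # Check the state of the GPFS daemon service on this node
                systemctl status gpfs

                # List the GPFS-related units that were installed
                systemctl list-units 'gpfs*'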

    File systems: Traditional NSD nodes and servers can use checksums
            NSD clients and servers that are configured with IBM Spectrum Scale can use
                checksums to verify data integrity and detect network corruption of file data
                that the client reads from or writes to the NSD server. For more information,
                see the nsdCksumTraditional and nsdDumpBuffersOnCksumError attributes
                in the topic mmchconfig command.
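
            A minimal sketch, assuming the standard mmchconfig syntax; the node class name is a
            placeholder, and checksum verification adds some CPU overhead on the nodes where it is enabled.

                # Enable checksums for traditional NSD client/server traffic
                mmchconfig nsdCksumTraditional=yes -N nsdClientNodes

                # Optionally capture buffer dumps when a checksum error is detected
                mmchconfig nsdDumpBuffersOnCksumError=yes -N nsdClientNodes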

    File systems: Concurrent updates to small shared directories are faster
            Fine-grained directory locking significantly improves the performance of concurrent
                updates to small directories that are accessed by more than one node. "Concurrent"
                updates means updates by multiple nodes within a 10-second interval. A "small"
                directory is one with fewer than 8 KiB entries.

    File systems: NSD disk discovery on Linux now detects NVMe devices
            The default script for NSD disk discovery on Linux, /usr/lpp/mmfs/bin/mmdevdiscover,
                now automatically detects NVM Express (NVMe) devices. It is no longer necessary
                to create an nsddevices user exit script to detect NVMe devices on a node.
                For more information, see NSD disk discovery and nsddevices user exit.

    mmapplypolicy command: New default values are available for the parameters -N (helper nodes) and -g (global work directory)
            If the -N parameter is not specified and the defaultHelperNodes attribute is not set,
                then the list of helper nodes defaults to the managerNodes node class. The target
                file system must be at format version 5.0.1 (format number 19.01) or later.
            If the -g parameter is not specified, then the global work directory defaults to
                the path (absolute or relative) that is stored in the new sharedTmpDir attribute.
                The target file system can be at any supported format version.
            For more information, see mmapplypolicy command and mmchconfig command.
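
            A sketch with placeholder paths and file names: once sharedTmpDir is set, mmapplypolicy
            can be run without -g (it uses sharedTmpDir) and without -N (it uses the managerNodes class).

                # Store the global work directory default in the cluster configuration
                mmchconfig sharedTmpDir=/gpfs/fs1/.policytmp

                # Evaluate a policy without specifying -g or -N
                mmapplypolicy /gpfs/fs1 -P migrate.pol -I test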

    mmbackup command: A new default value is available for the -g (global work directory) parameter
            If the -g parameter is not specified, then the global work directory defaults to the
                path (absolute or relative) that is stored in the new sharedTmpDir attribute.
                The target file system can be at any supported format version. For more information,
                see mmbackup command and mmchconfig command.

    mmcachectl command: You can list the file and directory entries in the local page pool
            You can display the number of bytes of file data that are stored in the local page pool
                for each file in a set of files, along with related information. You can display
                information for a single file, for the files in a fileset, for all the files in a
                file system, or for all the file systems that are mounted by the node.
                For more information, see mmcachectl command.
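
            For illustration only; the subcommand and option spellings below are assumptions based on
            the description above, so check the mmcachectl reference before use (names are placeholders).

                # Show page pool usage for all files in file system fs1 (assumed syntax)
                mmcachectl show --device fs1

                # Narrow the listing to a single fileset (assumed syntax)
                mmcachectl show --device fs1 --fileset fset1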

    IBM Spectrum Scale functionality to support GDPR requirements
            To understand the requirements of EU General Data Protection Regulation (GDPR)
                compliance that are applicable to unstructured data storage and how IBM Spectrum
                Scale helps to address them, see the IBM Spectrum Scale functionality to
                support GDPR requirements technote.

    IBM Spectrum Scale management API changes

    Added the following API commands:

        GET /nodes/{name}/services
        GET /nodes/{name}/services/{serviceName}
        PUT /nodes/{name}/services/{serviceName}
        GET /filesystems/{filesystemName}/policies
        PUT /filesystems/{filesystemName}/policies
        GET /perfmon/sensors
        GET /perfmon/sensors/{sensorName}
        PUT /perfmon/sensors/{sensorName}
        GET /cliauditlog

    IBM Spectrum Scale GUI changes

        Added a new Services page that provides options to monitor, configure, and manage various services that are available in the IBM Spectrum Scale system. You can monitor and manage the following services from the Services page:
            GPFS daemon
            GUI
            CES
            CES network
            Hadoop connector
            Performance monitoring
            File auditing
            Message queue
            File authentication
            Object authentication
        Added a new Access > Audit Logs page that lists the various actions
          that are performed on the system. This page helps the system administrator
          to audit the commands and tasks the users and administrators are performing.
          These logs can also be used to troubleshoot issues that are reported in the system.
        Moved the NFS Service, SMB Service, Object Service, and Object Administrator
          pages from the Settings menu to the newly created Services page.
        Removed the GUI Preference page and moved the options in that page to the GUI section
          of the Services page.
        A new option is added in the GUI section of the Services page to define the session timeout
          for the GUI users.
        Support for creating and installing self-signed or CA-certified SSL certificates
          is added in the GUI section of the Services page.
        Remote cluster monitoring capabilities are added. You can now create customized
          performance charts in the Monitoring > Statistics page and use them in the
          Monitoring > Dashboard page. If a file system is mounted on the remote cluster node,
          the performance of the remote node can be monitored through the detailed view of
          file systems in the Files > File Systems page.
        Modified the Files > Transparent Cloud Tiering page to display details of the container
          pairs and cloud account.
        Added support for creating and modifying encryption rules in the Files > Information
          Lifecycle page. You can now create and manage the following types of encryption rules:
            Encryption
            Encryption specification
            Encryption exclude
        Added ILM policy run settings in the Files > Information Lifecycle page.
        Added the Provide Feedback option in the user menu that is available at the upper
          right corner of the GUI.

    Installation toolkit changes
            The installation toolkit supports the installation and the deployment of
              IBM Spectrum Scale on Ubuntu 16.04.4 (x86_64).
            The installation toolkit config populate option supports call home and file audit logging.
            The installation toolkit performance monitoring configuration for protocol sensors has been improved.

    mmhealth command: Enhancements
            Options of the mmhealth node show and mmhealth cluster show commands have changed,
              and new options have been added. For more information, see the mmhealth command.
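
            A minimal usage sketch of the two commands named above; the -Y machine-readable
              form is referenced elsewhere in this thread, and any output shown by these
              commands depends on your cluster:

                mmhealth node show
                mmhealth cluster show
                mmhealth node show -Y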

    Documented commands, structures, and subroutines

    New commands
        The following commands are new in this release:

            mmcachectl

    New structures
        There are no new structures.
    New subroutines
        There are no new subroutines.
    Changed commands
        The following commands were changed:
           mmapplypolicy
           mmbackup
           mmces
           mmcallhome
           mmchconfig
           mmchfileset
           mmcloudgateway
           mmhealth
           mmsmb
           mmuserauth
           spectrumscale

    Changed structures
        There are no changed structures.
    Changed subroutines
        There are no changed subroutines.
    Deleted commands
        mmrest
    Deleted structures
        There are no deleted structures.
    Deleted subroutines
        There are no deleted subroutines.
    New messages
        6027-307, 6027-2402, 6027-2403, 6027-2404, 6027-2405, 6027-2406, 6027-2407, and 6027-3932.

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-05-14T17:10:02Z  

    Security Bulletin: A vulnerability has been identified in IBM Spectrum Scale with CES stack enabled that could allow sensitive data to be included with service snaps.  This data could be sent to IBM during service engagements (CVE-2018-1512)

    Summary:
    A security vulnerability has been identified in IBM Spectrum Scale with CES stack enabled that could allow sensitive data to be included with service snaps.  This data could be sent to IBM during service engagements (CVE-2018-1512).

    See the complete bulletin at:  http://www.ibm.com/support/docview.wss?uid=ssg1S1012325

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-06-20T17:43:57Z  

    Technote (troubleshooting):  IBM Spectrum Scale: mmbuildgpl fails on Linux (for example: RHEL 7.4 with kernel 3.10.0-693.19.1 or later)

    Problem(Abstract)

    When building the GPFS portability layer on a RHEL Linux node, you may encounter error or warning messages similar to "CONFIG_RETPOLINE=y, but not supported by the compiler..." with the mmbuildgpl command due to an incompatible gcc compiler. To fix the problem, use a retpoline-aware gcc compiler.

     

    See the complete bulletin at:  http://www.ibm.com/support/docview.wss?uid=ssg1S1012412
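
    As a hedged illustration (not part of the technote), one way to confirm that the active
    gcc understands retpolines before rebuilding the portability layer is to probe the
    -mindirect-branch option, which only retpoline-capable compilers accept:

        # Probe gcc for retpoline support, then rebuild the GPFS portability layer
        echo 'int main(void){return 0;}' | gcc -mindirect-branch=thunk -x c -o /dev/null - \
            && echo "gcc is retpoline-aware"
        /usr/lpp/mmfs/bin/mmbuildgpl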

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-06-20T20:27:48Z  

    GPFS 5.0.1.1 is now available from IBM Fix Central:

    http://www-933.ibm.com/support/fixcentral

    Problems fixed in GPFS 5.0.1.1

    June 13, 2018

    * Fix a mmfsck failure due to file system panic. IJ06116
    * Fix a mmfsd assert exp(de.getNameP()[0] != 0) direct.C:654 that can occur when running fsck with patch-file option twice. IJ06116
    * Fix an Assert: !"search long and hard in getSnapP" which can occur when running mmfsck fs1 -c -vn --patch-file. IJ06116
    * Fix an Assert exp(readRepIndex == -1 OR (readRepIndex >= 0 which can occur when running mmfsck -yv:. IJ06116
    * Fix a FSCK:sig8 on FsckDirCache::readBlockDA which can occur when running mmfsck with --skip-inode-check -n -v. IJ06116
    * Fix an Assert exp(!"Assert on Structure Error") in Logger.C which can occur when running fsck. IJ06116
    * Fix a long waiter 'PCacheMsgHandlerThread: on ThCond' that can occur when running 'mmafmctl gpfs1 getstate' in parallel with a link command. IJ06116
    * Fix Corruption: InodeMetadata/Critical which can occur during a heavy workload of create, list, delete of snapshots and filesets. IJ06116
    * Fix assert: cachedcomm.C:182: assertion 'iter->second' in the GPFS daemon (CCR) during mmshutdown on Windows based quorum nodes. IJ06116
    * Fix Assert exp(!"Assert on Structure Error, called from the kernel") in Logger.C which can occur during a heavy workload. IJ06116
    * Fix a case in which mmlsdisk returned incorrect data. This can occur when /var/mmfs/etc/ignoreAnyMount or /var/mmfs/etc/ignoreAnyMount.<filesystem> exist. IJ06479
    * Fix a fsck: SGPanic at line 11801 in /ts/pfsck/cache.C. IJ06116
    * Ensure that the existing afmctl file at the home fileset is not changed after running the convertToSecondary command. IJ06116
    * Fix lease overdue with unsuccessful replies to lease requests. Probing cluster. IJ06116
    * Update logging code to prevent possible long waiters where thread can get stuck waiting on 'force wait on active buffer to become stable'. This can happen if file system panic occurs while a thread is actively appending records to log file. IJ06116
    * Fix a problem in which fsck: does not reclaim unused metadata when fixing fs. IJ06116
    * Fix an endless loop which can happen if the mmlsquota command has a syntax error. IJ06219
    * Fix a zLinux: Assert exp(!synchedStale) in line 2770 of file bufdesc. This can happen if compression is involved. IJ06116
    * Fix Assert exp(secSendCoalBuf != __null && secSendCoalBufLen > 0.  IJ06116
    * Add unavailable disk warning message at end of fsck output. IJ06116
    * Fix Assert exp(cleanupOnFailure == 0 || nNodes > 0) which can occur running fsck with not enough memory. IJ06116
    * Fix a problem in which Pcache: hit Oops while running workload stress. IJ06116
    * Fix Assert exp(dm != inv) which can occur after a failback to secondary and then trying to get back to primary. IJ06116
    * Fix Assert:fileAllocPool[kind].fapFree == 0 qualloc.C 591 which can occur after suspending a disk. IJ06116
    * Fix a deadlock that can occur during a file close when the file system is quiesced. IJ06116
    * Fix a problem in which the file system cannot be unmounted; this can occur during a large number of file deletes while a node is asserting. IJ06116
    * Fix an Oops: Kernel access of bad area, sig: 11 that can occur on P9 running a GPFS file system exported through NFS. IJ06116
    * Fix a deadlock that can occur on afmHashVersion 4 enabled clusters during fileset deletions and creations. IJ06116
    * Address a problem where a race between a create and a remove on the same fileset (with a sandwiched write message) can filter out the create and play just the remove. Meanwhile, the write in between tries to write the file, finds that the file has not been created at home, and causes the queue to be dropped. IJ06116
    * Fix a vmcore: oops bad area, crashed at func ganesha_grant_deferred. IJ06116
    * gpfs.snap: improve performance on large cluster. IJ06213
    * Fix "ganesha.nfsd-29872[work-239] lookup :FSAL :CRIT :DOTDOT error, inode: xxxxx". IJ06216
    * Fix "Assert exp(!synchedStale)" problem that can occur during access of compressed files. IJ06116
    * Fix Assert !addrDirty OR synchedStale OR allDirty bufdesc.C 7416. IJ06116

    * Fix a problem in which trace starts itself. This can occur when CCR is disabled, adminMode=allToAll, and mmsdrservPort=0. IJ06214
    * Fix logAssertFailed: !isMsgOptionFlagsCallback(optionFlagsArg). IJ06116
    * Fix a problem in which a hanging NFS process (Ganesha) was not clearly detected. IJ06116
    * Fix a problem in which you cannot set the value 'desired' for the SMB option "smb encrypt" through the Spectrum Scale CLI. IJ06116
    * Fix a problem which could cause I/O workloads to hang longer than expected if the cluster manager and the file system manager fail at the same time. IJ06116
    * Address a problem where STOP command on the fileset (mmafmctl stop) can cause deadlock when there's a parallel Write in the queue taking the SplitWrite path. IJ06116
    * The following fix changes the mmhealth node show -Y output so that the GUI is able to process specific health events again that were not in the right format in a mixed cluster environment. This affects only clusters with a cluster minRelease level lower than 4.2.0 and nodes at 4.2.0 or higher. Affected events are: all pool_ and pool- events of the FILESYSTEM component. IJ06217
    * Fix "Signal 11 at location ... PaxosServer::handleCheck ... at paxosserver.C". IJ06116
    * Fix Error in `/usr/lpp/mmfs/bin/mmfsd': double free or corruption (!prev): signal 6. This can happen running fsck in too small of a page pool. IJ06116
    * Fix a problem that when QOSio is enabled for a filesystem, occasionally GPFS daemon deadlocks during unmount. IJ06116
    * Fixed issue with un-escaped dollar signs in config values in the Object protocol. When configuring Object protocol with values containing dollar signs, the command would fail with messages such as "[E] Keystone role add command failed". Checking the log file /var/log/keystone/keystone.log may have messages indicating that authentication was incorrect. IJ06116
    * Fix an issue in AFM environment where failover/resync runs slower for write operations due to connecting the file dentry to the parent. IJ06116
    * Fix deadlock:SGExceptionAdjustServeTMThread on(MsgRecordCondvar). IJ06116
    * Fix a hang condition on Linux when mmfsd is executed from a shell. IJ06116
    * This fix makes sure mmcommon deadlockBreakup is canceled if the long waiters disappear. IJ06116
    * Fix a deadlock, mmwaiters show SharedHashTabFetchHandlerThread: on ThCond 0x1800BDC7EF0 (LkObjCondvar), reason 'change_lock_shark waiting to set acquirePending flag'. This can occur with heavy IO with errors. IJ06116
    * Fix a problem in which callhome stopped sending data on Ubuntu after an upgrade. IJ06116
    * Fix GPFS assert "logAssertFailed: !isRead" which can happen doing a data prefetch. IJ06450
    * Fix an AFM data mismatch on read if the replication factor is more than 1. IJ06220
    * Fix an error in the mmsysmonitor log and "ks_url_warn" event when upgrading to 5.0.1. Issue will disappear after a mmsysmoncontrol restart. While error is 'active' the keystone server (object authentication) is not monitored correctly. IJ06116
    * Fix mmbackup handling of -g option. It should use config var sharedTmpDir to determine the relative or absolute paths. IJ06116
    * Fix an mmfsd shutdown. The log shows ReadMap: Cannot open map file /usr/lpp/mmfs/bin/mmfsd, not enough memory. This can occur during lots of fsync syscalls at the same time as stat calls. IJ06242
    * Fix a rare case logAssert "Assert:(indIndex & 0xFF00000000000000ULL)==0 IndDesc.h" which can happen when writing beyond the EOF of a file which has lots of EA entries. IJ06116
    * Fix a problem in which mmcrfs --profile returns an error when both defaultMetadataReplicas and maxMetadataReplicas are specified in the profile. IJ06222
    * Fix a problem in which a gpfs_set_share call is incorrectly failing. IJ06449
    * Fix a mmimgbackup assert problem when there is a symbolic link and the full pathname length is 1023 bytes. IJ06223
    * When you update the NSD type via mmchconfig updateNsdtype command, this fix will also update the NSD type of any tiebreakerdisks in CCR cluster. IJ06116
    * Fix a signal 11 which can occur when a node is being added to a cluster. IJ06843
    * Fix a problem in which statfs/df is reporting free disk space that includes fragmented disk space. It should only be reporting full disk blocks. IJ06116
    * Fix a problem where GPFS can potentially get stuck on dumping a kernel thread stack during file system panic. IJ06255
    * Fix a "Assert on Structure Error": Direct::invalidDirBlock which is caused by a race condition between log wrap and directory expansion. IJ06863

    * Fix a problem in which /usr/lpp/mmfs/bin/mmchattr --compact=73315124 fails with Compact directory failed: Invalid argument. IJ06116
    * Fix a deadlock that can occur when a new node is taking over the cluster manager role. IJ06116
    * Fix a problem in which a NSD server shutdown could cause a disk failure and file system unmounted on a NSD client. IJ06761
    * Fix a problem in which POSIX ACLs set on a GPFS file or directory from Unix may not be correctly translated to access permissions on a GPFS Windows node. IJ06451
    * Fix a problem in which mmafmctl prefetch failure does not list the failed files. IJ06478
    * Address an issue in Prefetch (migration) where filenames containing '\\' and '\\n' characters need to be handled better. IJ06116
    * This fix corrects an error in the mmchfirmware command. This fix applies to Lenovo GSS/DSS customers. IJ06116
    * Fix a problem in which mmadddisk failed on AIX with a return value of 5. IJ06116
    * Improve the cluster monitoring in case of a GPFS cluster manager move. IJ06116
    * This fix corrects an error in the mmlsenclosure command. This fix applies to GSS/ESS customers that have DCS3700 storage enclosures. IJ06116
    * Fix a bug in "mmkeyserv server update" that may cause encryption policy fail. IJ06116
    * Fix a recovery failure and a resign from recovery. IJ06974
    * This fix increases the maximum supported number of extra IP addresses to 64. IJ06762
    * Fix a problem in which mmchpolicy -I test incorrectly declares policy installed. IJ06116
    * Fix a problem where mmchdisk incorrectly requires disks in 'system.log' pool to have disk usage type 'dataOnly'. IJ06116
    * This fix disables writing protocol tracing debug messages to mmfs.log, since they were irrelevant to the user and inconsistently formatted. IJ06116
    * Fix Assert: getQueue()->getRenameDepTable() == __null that can occur after renaming active AFM filesets and directories. IJ06116
    * Fix the issue that the extra IP addresses cannot be propagated to other nodes. IJ06116
    * This change improves the performance of 'mmcloudgateway files ..." commands like migrate or recall. It does so by removing a rpm db check for installed TCT software. IJ06116
    * The manpage for mmchconfig command ('subnets' section) has been updated to describe limitation in the number of subnets a given node may be part of. IJ06116
    * Workaround a GNR VCD (vdisk configuration data) inconsistency issue that two vtrack tracks may map to the same physical location in very rare cases when recovering free ptracks which causes RG recovery to fail with error like "[E] Vdisk xxx recoverFreePTracks failure: Error 214 code 2063". With this fix, the RG can be recovered with minimal data lost vs. losing the whole RG. IJ06856
    * Fixed an issue in the AFM environment where, if the root user has a supplementary GID greater than 42949676, replication might fail and messages are requeued. IJ06972
    * This update addresses the following APARs: IJ06116 IJ06213 IJ06214 IJ06216 IJ06217 IJ06219 IJ06220 IJ06222 IJ06223 IJ06242 IJ06255 IJ06449 IJ06450 IJ06451 IJ06478 IJ06479 IJ06761 IJ06762 IJ06843 IJ06856 IJ06863 IJ06972 IJ06974.

     

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-06-29T14:28:07Z  

    Flashes (Alerts):  IBM Spectrum Scale Active File Management (AFM) and AFM Asynchronous Disaster Recovery (ADR) issues which may result in undetected data corruption

    Abstract
    IBM has identified certain issues affecting Active File Management (AFM) and AFM Asynchronous Disaster Recovery (ADR) in IBM Spectrum Scale which may result in undetected data corruption.
    1. AFM may intermittently read files from the home cluster incorrectly if the replication factor is more than one at the cache cluster, which may result in undetected data corruption.
    2. AFM cache may incorrectly read an HSM migrated file from the home cluster due to the incorrect calculation of the file sparseness information, potentially resulting in undetected data corruption.
    3. AFM mmafmctl Device resync/failover and AFM ADR mmafmctl Device changeSecondary commands may miss copying data to the home or secondary cluster (from the other cluster) when the in-memory queue is dropped with pending in-place writes.
    4. AFM Asynchronous Disaster Recovery (ADR) could cause some files to be missing from the RPO snapshot at the secondary if recovery was run from the recovery+RPO snapshot.
    5. AFM may not replicate the data when the dm_write_invis() API is used to write. In addition, the dm_read_invis() API may read incorrect data if the file is not already cached.

    See complete bulletin at:  http://www-01.ibm.com/support/docview.wss?uid=ibm10713675

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-06-29T14:28:43Z  

    Flashes (Alerts): On Spectrum Scale, when any of the quorum nodes are under high load, the cluster manager may unexpectedly lose its membership from the cluster resulting in unexpected cluster manager elections

    Abstract:
    On Spectrum Scale, when any of the quorum nodes are under high load, the cluster manager may unexpectedly lose its membership from the cluster resulting in unexpected cluster manager elections.

    See complete bulletin at:  http://www-01.ibm.com/support/docview.wss?uid=ibm10713707

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-07-13T20:24:16Z  

    Flashes (Alerts):  IBM Spectrum Scale V4.2 and 5.0.0 levels, Linux only: combined usage of compression and LROC may result in undetected data corruption

    Abstract:

    In the processes of either decompressing or truncating compressed files, some data blocks may be de-allocated. A problem has been identified in which the data may be recalled from LROC devices if the data for these de-allocated blocks was stored into LROC devices before de-allocation. As a result of the data being recalled from the LROC devices, data in memory may become corrupted, with potential for the data on disks to also become corrupted. Note that many types of file modifications (e.g., write, punch hole) of the data of compressed files could trigger an on-the-fly transparent uncompression operation, including GPFS's command line or policy interfaces (e.g., mmrestripefs -z, mmchattr --compression no ).

     

    See complete bulletin at: http://www-01.ibm.com/support/docview.wss?uid=ibm10713659

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-08-15T19:18:53Z  

    GPFS 5.0.1.2 is now available from IBM Fix Central:

    http://www-933.ibm.com/support/fixcentral

    Problems fixed in GPFS 5.0.1.2

    August 15, 2018

    * Fix Assert exp(!"Assert on Structure Error") in line 365 of file /project/sprelttn423/build/rttn423s005a/src/avs/fs/mmfs/ts/logger/Logger.C which can occur during a restore of many file systems. IJ07355
    * Fix an issue in the AFM environment where gateway node crashes if remote is not responding. IJ07355
    * Fix a deadlock FileBlockWriteFetchHandlerThread: on ThCond 0x1800598FEB8 (IndBlockAccessCondvar), reason 'Waiting for access to indirect block'. This can occur when replication is set to 2 for both data and metadata and both servers for one building block are down. IJ07355
    * Fix a problem in which mmrestoreconfig wrongly failed because mmcheckquota failed and returned E_NODEV. IJ07355
    * Fix Assert exp(thisSnap.isSnapOkay() || thisSnap.isSnapEmptying() || thisSnap.getSnapId() == sgP->getEaUpgradeSnapId()) in line 3677 of file /project/spreltac/build/rtac1807e/src/avs/fs/mmfs/ts/fs/metadata.C which can occur when closing a file in a snapshot that is being deleted. IJ07410
    * Fix GNR Assert exp(vtBufValid.getBit(index) == 1) vtrackBuf.C:2625 which can occur during I/O in dio mode. IJ07355
    * Fix a problem in which no failover takes place when Ethernet cable is unplugged. IJ07355
    * With this change mmhealth node show will include unique identifiers in the reason column of NATIVE_RAID events again. Before, information like the enclosure was missing. IJ07355
    * Fix a problem in which mmshutdown might cause a Kernel assert in gpfsCleanup(). IJ07355
    * Fix a problem in which the file system can not mount. This can occur if asserts occur on multiple nodes that have the file system mounted. IJ07355
    * Fix Assert exp(Remote ASSERT from node <c2n1>: SGNotQuiesced snap 9/0 ino 2851912 reason 1 code 0) in line 3447 of file /project/spreltac501/build/rtac5011814e/src/avs/fs/mmfs/ts/cfgmgr/sgmrpc.C which can occur on an AFM file set with huge amounts of small writes during a snapshot. IJ07355
    * Fix a problem in which the output of mmces address move commands are inconsistent. IJ07355
    * Fix a hang in mmcesnetworkmonitor.  IJ07355
    * Fix LOGASSERT(getPhase() == snapCmdDone) which can happen if more than one request to delete the same snapshot is run concurrently and the fs SGPanic during the delsnapshot process. IJ07355
    * Fix firmware monitoring to disregard missing disks; previously this caused erroneous events such as: drive_firmware_notavail(DRV-5-12, DRV-1-3). IJ07355
    * Fix a problem in which sysmonitor breaks after an upgrade of gpfs.  IJ07355
    * Fix a problem in which the mmhealth show command on mixed clusters that have Windows nodes shows the Windows nodes as failed. IJ07355
    * Fix the AIX kernel crash which can happen when there is I/O against inconsistent compressed files. IJ07409
    * Fix a deadlock which may happen when threads of one process use mmap while doing multiple reads on the same file. IJ07782
    * Fix local disk monitoring on AIX, issue causes erroneous "local_fs_unknown" event. IJ07443

    * Fix a problem in which recovery was triggered on a stopped AFM fileset. IJ07355
    * Fix viInUse assert in gpfsOpen for NFS file access.  IJ07411
    * Fix AFM: recovery stuck and dropped fileset with error 2. This can occur when the directory name has special characters. IJ07412
    * Fix a bug that could cause the file size to be incorrectly updated to a smaller than expected value. This could happen if a node failure occurs when a hole is being punched at the end of the file. IJ07413
    * Fix an issue in the AFM environment where leading spaces in file names causes recovery to fail. IJ07444
    * Change mmcrnodeclass to use CLUSTER_PERF_SENSOR_CANDIDATES to manually configure the list of node candidates for the single node sensors. IJ07355
    * Fix EBADHANDLE (521) errno which can occur doing NFS file operations on RHEL 7.5. IJ07355
    * This fix is for customer that use a mixed cluster with a minimum release level lower than 4.2.2-0. It will fix the machine-readable output of the mmhealth node show command and the false or inconsistent information in the GUI. IJ07355
    * Fix logAssertFailed: "useCount >= 0" in file alloc.h which can occur running mmrestripefile -c repeatedly. IJ07414
    * Fix mmapplypolicy -L 3 showing garbage characters.  IJ07784
    * Fix a rare AIX kernel crash LOGASSERT(size != 0) in cxiAttachSharedMemory. This can happen if you run a gpfs utility during startup before the kernel is fully loaded. IJ07783
    * Fix an issue in AFM environment where control file setup used for transferring EAs/ACLs might hang if remote is not responding. This causes node to run out of RPC handler threads handling the incoming messages. IJ07751
    * Fix an assert in openlog.C. This can occur as a result of a mmdeldisk failure. IJ08031
    * Fix mmfsd terminating because of network failure or hardware failure. Assert exp(nConns== nReplyConns) in line 1727 of file verbsClient.C. IJ08018
    * Fix Waiters: VdiskFlusherThread.  IJ07355
    * Fix assert bufP->pageP != NULL gpl-linux/mmap.c, 194.  IJ08204
    * Fix an issue in which RDMA reconnect can fail when an RDMA port is used by multiple verbsPorts definitions using different fabric numbers (fabnum). IJ08144
    * This update addresses the following APARs: IJ07355 IJ07409 IJ07410 IJ07411 IJ07412 IJ07413 IJ07414 IJ07443 IJ07444 IJ07751 IJ07782 IJ07783 IJ07484 IJ07784 IJ08018 IJ08031 IJ08144 IJ08204.
     

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-08-22T19:11:20Z  

    Flashes (Alerts):  IBM Spectrum Scale (GPFS): mismatched replicas with possible undetected data corruption following restripe operations (restripefs)

    Problem Summary:

    In a file system where data or metadata replication is used and "rapid repair" is enabled, and when there are "update in place" activities after disk(s) go down, followed by use of the restripefs command with options -r/-R/-m, mismatched replicas may be created after some disks are started up.  Some replicas with stale data could result in metadata corruption in the file system, or data loss.

     

    See complete bulletin at:  http://www.ibm.com/support/docview.wss?uid=ibm10718849

     

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-10-18T20:25:08Z  

    Technote (Troubleshooting):  IBM Spectrum Scale: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first

    Problem:
    When running a remotely mounted cluster environment (multi-cluster with remotely mounted filesytems) and the owning cluster and accessing cluster are at 5.0.0.x or 5.0.1.x code level with File Audit Logging enabled, and the owning cluster is upgraded first to 5.0.2.x, and "mmchconfig --release=LATEST" is run, then the remotely mounted file systems on the accessing clusters will panic and not be able to mount.


    See complete bulletin at:  https://www-01.ibm.com/support/docview.wss?uid=ibm10734629

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-10-22T15:33:01Z  

    Technote (Troubleshooting):  IBM Spectrum Scale: Vormetric DSM V6.0.2, V6.0.3 and V6.1.x releases are not supported with IBM Spectrum Scale Encryption

    Problem:
    Vormetric DSM V6.0.2, V6.0.3 and V6.1.x user interface do not support creation of KMIP objects such as the Master Encryption Keys (MEKs) used by Spectrum Scale encryption, and as a result, Spectrum Scale encryption cannot use these DSM releases.

     

    See the complete bulletin at:  https://www-01.ibm.com/support/docview.wss?uid=ibm10734479

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-10-29T17:18:26Z  

    GPFS 5.0.2.1 is now available from IBM Fix Central:

    http://www-933.ibm.com/support/fixcentral

    Problems fixed in GPFS 5.0.2.1

    The description for each APAR can be found by plugging the APAR number into the aparNo parameter of the following URL (shown here for IJ06045):

    https://w3-03.ibm.com/systems/techlink/rr/getApar?aparNo=IJ06045&ibm-submit.x=0&ibm-submit.y=0

    October 26, 2018

    * The mmrestripefs -b file system rebalance can hang indefinitely.
    * Work around: Resume the disks that were suspended while rebalancing was in progress (see the sketch after this entry).
    * Problem trigger: Start the file system rebalance with the mmrestripefs -b command, then suspend some disks.
    * Symptom: The file system rebalance operation hangs indefinitely.
    * Platforms affected: ALL Operating System environments.
    * Functional Area affected: 5.0 or later Scale Users with traditional file system configuration.
    * Customer Impact: High.  IJ09758
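    A hedged sketch of the work-around above; the file system and NSD names are placeholders:

        # Resume the disks that were suspended while the rebalance was running
        mmchdisk fs0 resume -d "nsd1;nsd2"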
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * If creating a new RDMA connection fails while connecting, the connection is still in INIT state. In this case, the code should check for the INIT state. The current version did not handle this correctly resulting in the assertion message "[X] logAssertFailed: localConnP->useCnt == 1" and stopping the mmfsd process.
    * Work Around: None.
    * Problem trigger: The problem gets triggered, if the underlying operating system level RDMA code returns an error code while creating a new RDMA connection. In this case the GPFS connection setup code gets this error message in the middle of setting up a connection. Since this can happen at any time, for example due to networking issues, GPFS needs to be able to handle this situation.
    * Symptom: Abend/Crash.
    * Platforms affected: ALL Operating System environments.
    * Functional Area affected: RDMA.
    * Customer Impact: High Importance.  IJ09770
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * The policy engine aborts processing when it encounters a valid directory as a to-be-processed target, even though directories cannot be compressed.
    * Work around: Refine the compression policy rule to exclude directories from the to-be-processed targets.
    * Problem trigger: Files are being compressed through compression policy rules and some directories are selected as valid targets based on the policy rule.
    * Symptom: The compression policy rule process is interrupted.
    * Platforms affected: ALL Operating System environments.
    * Functional area affected: compression/policy.
    * Customer Impact: Suggested.  IJ10412
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * "LOGSHUTDOWN :1 sgmMsgCopyBlock RPCs are still pending" appears in mmfs.log before the GPFS daemon is shut down.
    * Work Around: None.
    * Problem trigger : mmchdisk, mmrestripefs or mmdeldisk running under low free buffer conditions.
    * Symptom: Abend/Crash.
    * Platforms affected: all.
    * Functional Area affected: All Scale Users.
    * Customer Impact: High Importance.  IJ09786
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * File system cleanup cannot finish because of a file system panic. This prevents the file system from being remounted and could also prevent a node from joining the cluster after quorum loss.
    * Work Around: Restart GPFS daemon on the node.
    * Problem trigger: File system panic occurs during certain phase of mmrestripefs or mmchpolicy command.
    * Symptom: Cluster/File System Outage. Node expel/Lost Membership.
    * Platforms affected: All.
    * Functional Area affected: Admin Commands.
    * Customer Impact: High Importance.  IJ09787
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Assert going off: !addrdirty or synchedstale or alldirty.
    * Work Around: None.
    * Problem trigger: Certain customer workload can run into the problem in a specific code path when the part of the allocated disk space beyond the end of the file is not zeroed out. It's rare and timing related.
    * Symptom: Abend/Crash.
    * Platforms affected: all.
    * Functional Area affected: All Scale Users.
    * Customer Impact: High Importance  IJ09549.
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Assert or segmentation fault.
    * Work Around: None.
    * Problem trigger: Manager nodes going down while some of the manager nodes are low in memory in a cluster hosting multiple file systems.
    * Symptom: Abend/Crash.
    * Platforms affected: all.
    * Functional Area affected: All Scale Users.
    * Customer Impact: High Importance.  IJ09792
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * The daemon crashes when a user runs mmrestripefs with the -b option to rebalance the file system.
    * Work around: Restart the file system rebalance operation and do not add new disks to the file system while rebalancing is in progress. Or run the mmrestripefs command with the "-b --strict" options to do an old-style file system rebalance.
    * Problem trigger: Adding new disks into file system while file system rebalance is in progress.
    * Symptom: Daemon crashed.
    * Platforms affected: ALL Operating System environments.
    * Functional Area affected: 5.0 and later Scale Users with traditional file system configuration.
    * Customer Impact: Critical.  IJ09589
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Values like "01" or "02", etc. are accepted as arguments for the "mmces log level" command, but yield to a "No such file or directory" error message finally.
    * Work Around: provide the correct one-digit log level numbers.
    * Problem trigger: any number with leading zeros These values were checked for an integer range between 0 and 3, which was passed. 01, 001 etc. is valid as a numeric '1'. However those values were used in some code branches as strings, where it makes a difference if '1' is used or '01'. So the failure was triggered because of that.
    * Symptom: Error output/message.
    * Platforms affected:  Linux Only.
    * Functional Area affected: CES.
    * Customer Impact: has little or no impact on customer operation.  IJ09423
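    For illustration of the entry above:

        mmces log level 1     # correct: a plain one-digit level
        mmces log level 01    # passes the 0-3 range check, but later fails with
                              # "No such file or directory"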
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * There would be some threads waiting for the exclusive use of the connection for a long time even though no thread is sending on the connection, for example: Waiting 7192.9293 sec since 07:53:38, monitored, thread 2155 Msg handler ccMsgPing: on ThCond 0x7FE0A80012D0 (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg'.
    * Work around: None.
    * Problem trigger: Lots of threads are waiting to send on one connection; if a reconnect happens at that time, it can cause this long waiter.
    * Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters.
    * Platforms affected: All.
    * Functional Area affected: All.
    * Customer Impact: High Importance.  IJ09796
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Fix a potential signal 11 problem that might occur when running mmrestripefs -r.
    * Work Around: The problem was caused by invalid DAs, so changing the DA manually could fix the problem too.
    * Problem trigger: Users whose files contain invalid DAs and will run mmrestripefs -r are potentially affected.
    * Symptom: Unexpected Results/Behavior.
    * Platforms affected: All.
    * Functional Area affected: All.
    * Customer Impact: Suggested.  IJ09548
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Changes to the port configuration of object services may fail if they do not match the expected default values.
    * Work Around: None.
    * Problem trigger: Currently the object services have default ports hardcoded in the CES code: proxy-server: 8080, account-server: 6202, container-server: 6201, object-server: 6200, object-server-sof: 6203. Whenever one of these settings changes in a newer object distribution, we run into issues.
    * Symptom: Error output/message.
    * Platforms affected: ALL Linux OS environments  (CES nodes).
    * Functional Area affected: System Health.
    * Customer Impact: High Importance, if the port settings differ from the expected defaults. IJ09797
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Race condition between relinquish and migrate threads causing long waiters.
    * Work Around: Restart mmfsd.
    * Problem trigger: If migrate is running and user issues a 'mmvdisk rg delete' then this problem can occur.
    * Symptom: Abend/Crash.
    * Platforms affected:  Linux Only.
    * Functional Area affected: GNR / Mestor.
    * Customer Impact: Suggested.  IJ09809
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Daemon is hitting SIGBUS error and crashes when using overwrite mode trace on Power 9 system with kernel version 4.14.0-49.9.1 and later. This is a Linux kernel bug on Power 9 CPU system.
    * Work around: Disable the overwrite mode trace.
    * Problem trigger: Enable the overwrite mode tracing on Power 9 system with kernel versions 4.14.0-49.9.1 and later.
    * Symptom: Daemon crashes and file system outages.
    * Platforms affected: Power 9 Linux system with kernel version 4.14.0-49.9.1 and later.
    * Functional Area affected: Overwrite mode tracing on Power 9 Linux system.
    * Customer Impact: Medium.  IJ09810
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * The commands "mmlsperfdata" and "mmperfmon" showed an error message at the end of the regular command output, like Exception TypeError: "'NoneType' object is not callable" in <function _removeHandlerRef at 0x7f9b6aa78c80> ignored.
    * Work Around: None.
    * Problem trigger: The issue was reported on Ubuntu and was not seen on RHEL. It showed up whenever mmlsperfdata and mmperfmon were executed. The reason was a Python-internal list which was not cleared, so the reported error text showed up only right before the program finally terminated.
    * Symptom: Error output/message.
    * Platforms affected: Ubuntu reported, but probably all Operating System environments.
    * Functional Area affected: System Health.
    * Customer Impact: Suggested: has little or no impact on customer operation. IJ09814
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * File system snapshot create or delete commands do not return for a long time when DMAPI operations are busy, which then causes a file system outage because the file system is quiesced during the snapshot create or delete operation.
    * Work around: Restart Spectrum Scale on the DMAPI session node, or wait for the completion of the in-progress DMAPI operations.
    * Problem trigger: After DMAPI is enabled and busy with access operations, run snapshot create or delete operations.
    * Symptom: File system outage in which no access to the file system is allowed.
    * Platforms affected: ALL Operating System environments.
    * Functional Area affected: All Scale Users.
    * Customer Impact: High.  IJ09558
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * When running with QOS enabled, GPFS daemon may fault with signal 11.
    * Work Around: Disable QOS, until fix can be applied.
    * Problem trigger: UMALLOC returns NULL.
    * Symptom: signal 11 fault in QosIdPoolHistory::setNslots.
    * Platforms affected: All.
    * Functional Area affected: QOS.
    * Customer Impact: Critical for customers using QOS, especially if there are many nodes or pid-stats have been enabled. IJ09571
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Directory prefetch reports that some directories failed to prefetch even though they are cached.
    * Work around: Check whether the listed directories are actually cached.
    * Problem trigger:  Directory prefetch.
    * Symptom:  Unexpected Results/Behavior.
    * Platforms affected: Linux only.
    * Functional Area affected: AFM.
    * Customer Impact: Suggested.  IJ09815
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * When the AFM relationship is rendered Sick because the remote site is not responding for an IW cache fileset and later we try to enable replication again, wherein the relationship is moved out of Sick - there is a possibility of catching an assert designed to catch a certain state of the inode. Removing an out-of-place Assertion.
    * Work around: None.
    * Problem trigger: An unhealthy network between the home and cache, that can elongate operations in the AFM queue sometimes.
    * Symptom: Fileset moves to Unmounted state with message that home is taking long to respond. For IW filesets, such conditions may lead to the Assert when the network stabilizes between the 2 sites.
    * Platforms affected: Linux only.
    * Functional Area affected: AFM.
    * Customer Impact: Suggested.  IJ09827
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Under rare circumstances, mm commands can take unexpectedly long (up to 2 minutes), caused by slow CCR RPCs between the CCR server and client.
    * Work around: None.
    * Problem trigger: CCR server expects a final RPC handshake the client does not provide.
    * Symptom: Performance Impact/Degradation.
    * Platforms affected: Just seen on a Linux OS environment (RHEL).
    * Functional Area affected: Admin Commands and CCR.
    * Customer Impact: High Importance.  IJ09552
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * The disk addresses of blocks for a compressed file are corrupted to look like a non-compressed file's disk addresses when the small file is extended or extended attributes are set on it, so the compressed file cannot be read or decompressed.
    * Work around: Run offline fsck to fix the corrupted disk addresses for compressed files.
    * Problem trigger: The mmfsd daemon or the system crashes when a small compressed file is being extended to a large file or large EAs are being set on it.
    * Symptom: The compressed files cannot be read or decompressed.
    * Platforms affected: ALL Operating System environments.
    * Functional area affected: File compression.
    * Customer Impact: Critical.  IJ10414
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Erroneous display of the event "ib_rdma_port_width_low".
    * Work Around: On each affected node, edit the /var/mmfs/mmsysmon/mmsysmonitor.conf file, add ib_rdma_monitor_portstate = false to the "[network]" section, and restart monitoring with "mmsysmoncontrol restart" (see the sketch after this entry).
    * Problem trigger: Running Spectrum Scale > 5.0.1 and an IB driver which causes ibportstate to report a LinkWidth of "undefined (19)".
    * Symptom: Unexpected Results/Behavior.
    * Platforms affected: ALL Linux OS environments.
    * Functional Area affected: System Health.
    * Customer Impact: Suggested: has little or no impact on customer operation. IJ09587
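    The work-around above, written out as a sketch (apply on each affected node):

        # Add to the [network] section of /var/mmfs/mmsysmon/mmsysmonitor.conf:
        #   [network]
        #   ib_rdma_monitor_portstate = false
        # then restart the monitoring service:
        mmsysmoncontrol restart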
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * A four-node cluster with the Object service configured and grouping declared for four CES IPs and nodes showed a "ces_network_affine_ips_not_defined" event for nodes which hosted IPs that had additional object attributes assigned. An IP was in fact hosted, but this warning came up anyway.
    * Work Around: The "ces_network_affine_ips_not_defined" event could be declared "ignored" in the "events" section of the mmsysmonitor.conf file. This skips any triggered warning for this event.
    * Problem trigger: Whenever a single IP address is assigned to a node based on group membership roles and the IP has an additional object attribute. In the case of multiple assigned IPs the issue does not show up, as long as there is at least one IP with the correct grouping but without a further object attribute.
    * Symptom: Error output/message.
    * Platforms affected: ALL Linux OS environments  (CES nodes).
    * Functional Area affected: System Health.
    * Customer Impact: Medium impact, since the IP addresses are indeed hosted, and only the event is misleading. IJ09560
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * The progress indicator of restripe for user files doesn't match the real processed data.
    * Work around: Set pitWorkerThreadsPerNode to 1 (see the sketch after this entry), but this will slow down the progress of the restripe operation.
    * Problem trigger: There are many big files in the file system and a restripe operation is run against it.
    * Symptom: The restripe progress can jump to 100% completion from a very small value (e.g., 5%).
    * Platforms affected: ALL Operating System environments.
    * Functional Area affected: All Scale Users.
    * Customer Impact: Medium.  IJ09829
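    A hedged sketch of the work-around above, assuming the parameter is set with mmchconfig in the usual way (verify the exact syntax against the documentation for your level):

        # Reduce PIT worker threads to 1 so the restripe progress indicator tracks reality;
        # the entry above warns that this slows the restripe itself
        mmchconfig pitWorkerThreadsPerNode=1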
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * The default value of forceLogWriteOnFdatasync parameter from mm command does not match the value in the daemon. IJ09561
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * A filesystem, which was not mounted at the default mountpoint (as shown by mmlsfs), was reported as "stale mount" which leads to a "false-negative" health state report. The mount procedure was declared in a "mmfsup" script, which is executed at GPFS startup.
    * Work Around: mount the filesystem on the declared default mountpoint or change the declared default mountpoint.
    * Problem trigger: Whenever the declared mountpoint differs from the real mountpoint for a GPFS filesystem.
    * Symptom: Error output/message.
    * Platforms affected: ALL Operating System environments.
    * Functional Area affected: System Health.
    * Customer Impact: High Importance.  IJ09562
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * With the NSD protocol, the wrong remote cluster mount may be panicked when the fileset is not responding. AFM kills the stuck requests on the remote mount by panicking the remote filesystem. If there are multiple remote filesystems, it is possible that the remote filesystem that is panicked is not the correct one for the fileset.
    * Work around: None.
    * Problem trigger: Usage of multiple remote filesystems and the network issues between cache and home.
    * Symptom:  Unexpected Results/Behavior.
    * Platforms affected: Linux only.
    * Functional Area affected: AFM.
    * Customer Impact: High Importance.  IJ09557
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * User may see a lot of threads are blocked at 'wait for GNR buffers from steal thread' on GNR server side. This is possible to happen when running very heavy small writes in parallel to eat up the GNR buffers very quickly.
    * Work Around: None.
    * Problem trigger: When running very heavy small writes workloads.
    * Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters.
    * Platforms affected: ALL Operating System environments.
    * Functional Area affected: ESS/GNR.
    * Customer Impact: High Importance.  IJ09870
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * User may see a lot of threads are blocked at 'wait for GNR buffers from steal thread' on GNR server side with very small GNR buffer setting. This is possible to happen when running very heavy small writes in parallel to eat up the GNR buffers very quickly.
    * Work Around: None.
    * Problem trigger: When running very heavy small writes workload with small GNR buffer setting.
    * Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters.
    * Platforms affected: ALL Operating System environments.
    * Functional Area affected: ESS/GNR.
    * Customer Impact: High Importance.  IJ09752
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * This problem can happen when the Object protocol is installed using an external Keystone database, and administrator role in that database is not specified as the all lowercase string admin (e.g. Admin). Although Keystone supports case-insensitive role values, the Object protocol configuration command only checks for the lowercase value. When this condition exists, the Object protocol installation will fail with a message similar to "Swift user does not have admin role in service project".
    * Work around: If this fix is not applied, a work around would be to change the name of the administrator role to be the all lowercase value "admin" so that the Object protocol configuration scripts will match the value correctly.
    * Problem trigger: Object installed using an external Keystone database, where the administrator role "admin" is in all uppercase or mixedcase.
    * Symptom:  Upgrade/Install failure.
    * Platforms affected: ALL Linux OS environments.
    * Functional Area affected: Object.
    * Customer Impact: Suggested.  IJ09735
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * There is a deadlock where a mutex is not released by the ping thread, which is in a while loop, while at the same time another thread is waiting to acquire this mutex to set the state.
    * Work Around: None.
    * Problem trigger: When a homelist is being unregistered while another handler is trying to register the same homelist.
    * Symptom: Deadlock.
    * Platforms affected:  Linux Only.
    * Functional Area affected: AFM.
    * Customer Impact: Suggested.  IJ09753
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Lookup and readdir performance issues with AFM ADR after converting the regular independent fileset to the AFM ADR fileset as the asynchronous lookups are sent to gateway node in the application path.
    * Work around: None.
    * Problem trigger: AFM ADR inband conversion.
    * Symptom: Performance Impact/Degradation.
    * Platforms affected: Linux Only.
    * Functional Area affected: AFM.
    * Customer Impact: High Importance.  IJ09756
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * The loghome vdisk partition distribution is not even among the disks of the vdisk. This causes performance degradation during I/O.
    * Work around: None.
    * Problem trigger: If any disk goes down or fails, an uneven partition distribution occurs, which causes I/O performance degradation.
    * Symptom: Performance Impact/Degradation.
    * Platforms affected: All supported Operation systems.
    * Functional Area affected: ESS/GNR.
    * Customer Impact: High Importance.  IJ10416
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * If pdisk corruption occurs, for example if a bad SAS HBA card or bad CPU chip causes silent data corruption on writes to pdisks, then after the problem hardware has been repaired, the system can continue to report misleading "I/O error", "err 110" messages, and may continually resign and recover service of the recovery group, causing recovery from the corruption to take an unexpectedly long time.
    * Work around: None.
    * Problem trigger: The problem is triggered by checksum errors detected on pdisks. This can be triggered by faulty hardware that writes incorrect data to disk without reporting any errors back or it may be caused by a malicious program writing over the disk drives.
    * Symptom: Performance Impact/Degradation.
    * Platforms affected: ALL Operating System environments.
    * Functional Area affected: ESS/GNR.
    * Customer Impact: High Importance.  IJ10418
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Not all the error messages are getting printed from mmlsfirmware. In particular, a warning should be issued if not all the targeted nodes could be reached. Also put out an informational message if the targeted node does not have any components that apply to mmlsfirmware, such as when issuing the command on a client node.
    * Work around: None.
    * Platforms affected: ESS/GSS configurations.
    * Functional Area affected: ESS.
    * Customer Impact: Suggested.  IJ10039
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * No audit records are logged for SMB CLI commands.
    * Work Around: None.
    * Problem trigger: Applies to all SMB CLI commands.
    * Symptom: No audit records seen when checking with lscommonevent command or in GUI Command Audit Log.
    * Platforms affected: All.
    * Functional Area affected: SMB GUI.
    * Customer Impact: Suggested - has little or no impact on customer operation. IJ10419
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Under very very low maxFilesToCache(100) and low maxStatCache (4k) settings, certain race windows were exposed which resulted in kernel panic/ daemon crashes on nodes with LROC-devices.
    * Work around: none
    * Problem trigger: On nodes with LROC-devices, under extremely low stat-cache settings, certain race windows are exposed which causes daemon/kernel crashes
    * Symptom:  Abend/Crash
    * Platforms affected: x86_64-linux only, those that support LROC
    * Functional Area affected: LROC
    * Customer Impact: Critical (could cause data corruption) IJ10308
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * When running a remote cluster environment and the owning cluster and accessing cluster are at 5.0.0.x or 5.0.1.x code level with File Audit Logging enabled, and the owning cluster is upgraded first to 5.0.2.x and mmchconfig --release=LATEST is run, then the remotely mounted filesystems on the accessing clusters will panic and not be able to mount.
    * Work around: If this happens, users should either upgrade the accessing cluster to the 5.0.2.x code level or disable File Audit Logging on the owning cluster until user is able to upgrade the accessing cluster to the 5.0.2.x code stream.
    * Problem trigger: This issue affects customers with file audit logging enabled on one or more filesystems on an owning cluster at the 5.0.0.x or 5.0.1.x code level, with the same file system remotely mounted on an accessing cluster at the 5.0.0.x or 5.0.1.x code level, where the owning cluster is upgraded to the 5.0.2.x code level.
    * Symptom: Unexpected Results/Behavior
    * Platforms affected: x86_64-linux and ppc64le-linux
    * Functional Area affected: File audit logging
    * Customer Impact: Critical  IJ10318
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * The reported health state of a component (e.g. FILESYSTEM) which has multiple entities (individual filesystems) is not reported correctly if some of them are HEALTHY and another is in the TIPS state. The expectation is that the overall state for the component is TIPS in this case.
    * Work Around:      None
    * Problem trigger : Have multiple filesystems in HEALTHY state and one or more filesystems in TIPS state. The TIPS state could be reached because the mountpoint of the filesystem is different from its declared mountpoint (check with mmlsfs).
    * Symptom: Error output/message
    * Platforms affected:  ALL Operating System environments
    * Functional Area affected: System Health
    * Customer Impact: has little or no impact on customer operation  IJ10373
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * After a "mmchnode --ces-disable" of a CES node using the SMB protocol, there are still SMB/CTDB specific files on the system. This may yield to unexpected side effects if those nodes are moved to a different cluster.
    * Work Around:    Manuell cleanup of "tdb" files in /var/lib/samba
    * Problem trigger : Run "mmchnode --ces-disable" on a CES node which has the SMB protocol installed. The expectation is that all protocol specific configuration files are removed, but that is not the case. There are remaining "tdb" files which were not deleted.
    * Symptom: Unexpected Results/Behavior
    * Platforms affected:  ALL Linux OS environments  (CES nodes)
    * Functional Area affected: CES
    * Customer Impact: High Importance: an issue which might cause a degradation of the system in some manner IJ10374
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * GUI will still show SMB Shares after deletion
    * Work Around: None
    * Problem trigger: A SMB Share is deleted through GUI.
    * Symptom: GUI will still show SMB Shares after deletion.
    * Platforms affected: All
    * Functional Area affected: SMB GUI
    * Customer Impact: Suggested - has little or no impact on customer operation. IJ10378
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * GUI activity log files are being partially removed by call home scheduled data collection.
    * Work around: mmcallhome schedule delete --task DAILY, mmcallhome schedule delete --task WEEKLY (see the sketch after this entry). Please re-add the schedules after the issue is fixed.
    * Problem trigger: running daily or weekly call home schedules
    * Symptom: Unexpected Results/Behavior
    * Platforms affected: All
    * Functional Area affected: GUI + Callhome
    * Customer Impact: Suggested (as long as there are no issues with GUI, truncating logs is irrelevant; otherwise this could be really bad) IJ10401
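    The work-around above as commands; the "schedule add" form used to restore the schedules after the fix is an assumption to be checked against the mmcallhome documentation:

        mmcallhome schedule delete --task DAILY
        mmcallhome schedule delete --task WEEKLY
        # after upgrading to a level that contains the fix:
        mmcallhome schedule add --task DAILY
        mmcallhome schedule add --task WEEKLY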
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * RPC messages may be received twice after a reconnect and then hit a sanity check, such as the assert below: logAssertFailed: err == E_OK, at dirop.C 6389
    * Work around: None
    * Problem trigger: An unreliable network, which leads to reconnects happening
    * Symptom: Abend/Crash
    * Platforms affected: ALL Operating System environments
    * Functional Area affected: All
    * Customer Impact: High Importance  IJ10471
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * A node is expelled from the cluster because a message is reported lost after a reconnect, as in the message below: Message ID 2449 was lost by node IP_ADDR NODE_NAME wasLost 1
    * Work around: None
    * Problem trigger: Messages are pending for more than 30 seconds waiting for replies, and the network is unreliable, which leads to reconnects happening
    * Symptom: Node expel/Lost Membership
    * Platforms affected: ALL Operating System environments
    * Functional Area affected: All
    * Customer Impact: High Importance IJ10473
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * A few GPFS commands, such as mmaddnode, mmdelnode, or mmchnode (changing quorum semantics), may cause the status of the systemd mmsdrserv.service unit to be reported as failed.
    * Work Around: Reset or ignore the failed mmsdrserv.service status (see the example after this entry).
    * Problem trigger: mmaddnode, mmdelnode, mmchnode --quorum/--noquorum while GPFS is running
    * Symptom: Error output/message
    * Platforms affected: Linux systems with systemd version 219 or later.
    * Functional Area affected: Admin Commands - systemd
    * Customer Impact: Suggested IJ09554
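    * Example (illustrative sketch, not from the APAR text): inspecting and clearing the stale failed unit status; these are standard systemd commands, not GPFS-specific ones.
          systemctl status mmsdrserv.service        # confirm the unit merely shows a stale 'failed' state
          systemctl reset-failed mmsdrserv.service  # clear the failed status; GPFS itself is unaffected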
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Hadoop component missing from mmhealth when fully qualified host names are used by Hadoop but short host names are used by GPFS.
    * Work Around: Change the Hadoop configuration to use the same host names as the GPFS cluster.
    * Problem trigger: Hadoop host names differ from the GPFS host names
    * Symptom: Missing monitoring
    * Platforms affected: Linux Only
    * Functional Area affected: System Health
    * Customer Impact: Suggested  IJ10116
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * AFM with verbs RDMA does not work because of the way AFM changes the thread credentials during replication.
    * Work Around: None
    * Problem trigger:  Always happens when RDMA+AFM is enabled with NSD backend.
    * Symptom: Unexpected Results/Behavior
    * Platforms affected:  Linux Only
    * Functional Area affected: AFM
    * Customer Impact: High Importance  IJ10398
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * AFM revalidation is sometimes slower in the caching modes.
    * Work Around: None
    * Problem trigger: AFM caching modes are used and a readdir is performed on them after the refresh interval expires.
    * Symptom: Performance Impact/Degradation
    * Platforms affected:  Linux Only
    * Functional Area affected: AFM
    * Customer Impact: High Importance  IJ10400
    * This update addresses the following APARs: IJ09423 IJ09548 IJ09549 IJ09552 IJ09554 IJ09557 IJ09558 IJ09560 IJ09561 IJ09562 IJ09571 IJ09587 IJ09589 IJ09735 IJ09752 IJ09753 IJ09756 IJ09758 IJ09770 IJ09786 IJ09787 IJ09792 IJ09795 IJ09796 IJ09797 IJ09809 IJ09810 IJ09814 IJ09815 IJ09827 IJ09829 IJ09870 IJ10039 IJ10116 IJ10308 IJ10318 IJ10373 IJ10374 IJ10378 IJ10398 IJ10400 IJ10401 IJ10412 IJ10414 IJ10416 IJ10418 IJ10419 IJ10473 IJ10565.

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2018-12-12T15:09:30Z  

    Technote (Troubleshooting):  IBM Spectrum Scale: Fix for deadlock may require file system format level to be updated

    Problem:

    Failure to mount a file system in Spectrum Scale 5.0.0 PTFs if the file system is created in (or upgraded to) 4.2.3 PTF9 (or later) or 4.1.1 PTF20 (or later).


    Failure to mount a file system in 4.2.3 PTF8 (or earlier) if the file system is created in (or upgraded to) 4.1.1 PTF20 (or later).

     

    See the complete bulletin at:  http://www.ibm.com/support/docview.wss?uid=ibm10719585

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2019-01-04T15:45:32Z  

    Flash (Alerts):  IBM has identified a problem in IBM Spectrum Scale (GPFS) V4.1.0 through V5.0.2 levels where the use of Local Read Only Cache (LROC) may result in directory corruption or undetected data corruption in regular files

    Summary:
    After cached data is moved from memory to the LROC device, any changes to that data should trigger invalidation of the data stored in LROC. Due to a problem with the invalidation logic, it is possible for invalidation of this LROC data to be skipped. This may lead to stale or incorrect data being recalled from LROC and data in memory becoming corrupted, with potential undetected data corruption on disk.

     

    See the complete bulletin at:  http://www.ibm.com/support/docview.wss?uid=ibm10741439

     

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2019-01-09T19:02:56Z  
    GPFS 5.0.2.2 is now available from IBM Fix Central:
    Problems fixed in IBM Spectrum Scale 5.0.2.2
    The APAR can be found at the following URL; substitute your APAR number at the end:
    https://www-01.ibm.com/support/entdocview.wss?uid=swg1IJ09718
    December 13, 2018
    * Problem description: The mmlsquota command fails on AIX for users that belong to more than 128 groups.
    * Work Around: None
    * Problem trigger: Starting in AIX 7.1, the maximum number of groups that a user can be a member of increased from 128 to 2048. The mmlsquota command code needs to be updated to handle users that are members of more than 128 groups (a quick check is shown after this entry).
    * Symptom: On AIX, when a user who is a member of more than 128 groups runs the mmlsquota command, the command fails with E_INVAL.
    * Platforms affected: All
    * Functional Area affected: Quotas
    * Customer Impact: Suggested  IJ11044
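    * Example (illustrative sketch, not from the APAR text): a quick way to check whether a user is affected, i.e. belongs to more than 128 groups; 'someuser' is a placeholder user name.
          # Count the group memberships of a user; values above 128 hit this problem on AIX
          id -G someuser | wc -w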
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: A single-thread self-deadlock can occur when fine-grained QoS statistics are enabled.
    * Work Around: Disable fine-grained QoS statistics and then restart GPFS on the problem node.
    * Problem trigger: Fine-grained QoS statistics are in use and a QoS-managed program is running.
    * Symptom: Stuck I/O.
    * Platforms affected: All
    * Functional Area affected: QoS
    * Customer Impact: Critical  IJ11043
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: Daemon assert going off: !(sanGetHyperAllocBit() && hasAFragment() && ofP->isRootFS()), resulting in a daemon abend.
    * Work Around: In GPFS 5.0 and above, disable the assert: mmchconfig diableAssert='metadata.C;mnode.C;sanSetFileSizes.C'. In older GPFS releases, disable the dynamic assert: mmchconfig dynassert='sanergy 0'
    * Problem trigger: Users having applications which append to the same GPFS file from multiple nodes are potentially affected.
    * Symptom: Abend/Crash
    * Platforms affected: Windows only
    * Functional Area affected: All
    * Customer Impact: High Importance  IJ11092
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The CLI command audit log is flooded with entries of "mmaudit all list -Y".
    * Work around: none
    * Problem trigger: Both GUI and FAL configured
    * Symptom:  Error output/message
    * Platforms affected: ALL Linux OS environments
    * Functional Area affected: File audit logging
    * Customer Impact: Suggested  IJ11209
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: An EAGAIN or EWOULDBLOCK error from the fflush() system call could cause an unexpected file system unmount.
    * Work around: Retry the operation after remounting the file system. The node may need to be rebooted in order to remount the file system.
    * Problem trigger: Unknown.
    * Symptom: Cluster/File System Outage
    * Platforms affected: ALL
    * Functional Area affected: All
    * Customer Impact: High Importance  IJ11090
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: If the user forgot to enable AFM at home (mmafmconfig), AFM will not migrate EAs and ACLs even if it is enabled later.
    * Work around: None
    * Problem trigger: AFM is used to migrate data from old GPFS systems without enabling AFM at home.
    * Symptom: Unexpected Results/Behavior
    * Platforms affected: Linux Only
    * Functional Area affected: AFM
    * Customer Impact: Suggested  IJ11232
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: Past defects have shown that the ip command in combination with pipe symbols does not work on some operating systems. The root cause is not understood in detail.
    * Work around: None
    * Problem trigger: CES IP moves depend on execution of the ip command
    * Symptom: Unexpected Results/Behavior, IO error
    * Platforms affected: ALL Linux OS environments
    * Functional Area affected: CES
    * Customer Impact: High Importance IJ11246
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The online replica compare function (mmrestripefs/mmrestripefile with the -c option) could report a false replica mismatch on the last data block of a file. This is more likely to happen on files in a snapshot.
    * Work around: Use offline fsck with -c option to perform replica compare.
    * Problem trigger: Run online replica compare function (mmrestripefs/mmrestripefile -c option) on GPFS 4.1.0.0 - 4.2.3.7
    * Symptom: Error output/message
    * Platforms affected: All
    * Functional Area affected: Admin Commands
    * Customer Impact: High Importance  IJ11045
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: If the transparent cloud tiering service was stopped, health monitoring for CLOUDGATEWAY continued and logged tct_csap_removed / tct_csap_found events in the mmhealth node eventlog.
    * Work around: None
    * Problem trigger: Stopping the mmcloudgateway service leads to an mmsysmon internal "unknown" status.
    * Symptom: "tct_csap_removed" and "tct_csap_found" events in mmhealth node eventlog when mmcloudgateway service was stopped.
    * Platforms affected: Linux only (CES nodes)
    * Functional Area affected: CES
    * Customer Impact: Low IJ11257
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The Spectrum Scale mmfsd daemon hits a segmentation fault when running the mmdiag command to dump the thread tracebacks.
    * Work around: Reduce the frequency of the mmdiag command or run it only on demand.
    * Problem trigger: The mmfsd daemon crashes when running the mmdiag command to dump thread tracebacks.
    * Symptom: Daemon crash
    * Platforms affected: AIX Operating System environments
    * Functional area affected: mmdiag command
    * Customer Impact: Critical  IJ11046
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The file system version cannot be upgraded after upgrading all nodes' GPFS installation versions. The failure message incorrectly indicates some nodes are running with older versions.
    * Work around: Manually run an mm command (e.g., mmmount or mmlsfs) from the problem node to update the cached daemon version on the file system manager node (see the example after this entry).
    * Problem trigger: Upgrade the GPFS installation on some nodes and then start the GPFS service on all nodes. Later, upgrade the GPFS installation on the remaining nodes and restart the GPFS service on them. Finally, upgrade the cluster configuration version and then the file system version; this last step can fail because the file system manager node still caches the old daemon versions of the nodes that were upgraded later.
    * Symptom: mmchfs -V command failure.
    * Platforms affected: All
    * Functional area affected: mmchfs -V command
    * Customer Impact: Suggested  IJ11101
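    * Example (illustrative sketch, not from the APAR text): the work-around sequence; the file system name fs1 and node name node7 are placeholders.
          # On the node that still reports the old daemon version (e.g. node7), run any mm command
          # so the file system manager refreshes its cached version for that node:
          mmlsfs fs1
          # Then retry the file system version upgrade:
          mmchfs fs1 -V full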
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The CLI command mmces events list -Y does not work; without the -Y option it works.
    * Work around: Use the human-readable version (without the -Y option)
    * Problem trigger: always
    * Symptom: Error output/message
    * Platforms affected: All
    * Functional area affected: System Health
    * Customer Impact: Suggested IJ11282
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: mmhealth monitoring for keystone service for object protocol authentication can cause a system to run out of memory.
    * Work around: Check the memory consumption of the mmsysmon monitoring service and restart the service if required (see the example after this entry).
    * Problem trigger: The monitoring of an external keystone service caused increasing mmsysmon service memory consumption.
    * Symptom: "Out of memory: Kill process xxx (python)" messages in /var/log/messages
    * Platforms affected: Linux only (CES nodes)
    * Functional Area affected: CES
    * Customer Impact: High Importance IJ11284
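    * Example (illustrative sketch, not from the APAR text): checking the memory use of the mmsysmon monitoring process and restarting it as suggested. mmsysmoncontrol is the usual control script shipped with Spectrum Scale; verify it is available at your level.
          # Check the resident memory of the mmsysmon monitoring process
          ps -eo pid,rss,args | grep '[m]msysmon'
          # Restart the monitor if its memory consumption keeps growing
          mmsysmoncontrol restart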
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: If the mmhealth command is executed with the -N option to do a remote call and one or more remote nodes are not available, the unavailability is not reported.
    * Work around: None
    * Problem trigger: A node is not reachable.
    * Symptom: Error output/message
    * Platforms affected: All
    * Functional Area affected: System Health, CES, SMB
    * Customer Impact: Suggested  IJ11299
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: mmhealth monitoring of an external keystone server for object protocol authentication was not working; mmces state show still reported a HEALTHY state for auth_obj.
    * Work around: None
    * Problem trigger: The external keystone service was down, but this was not recognized by mmhealth.
    * Symptom: HEALTHY auth_obj state in mmhealth output
    * Platforms affected: Linux only (CES nodes)
    * Functional Area affected: CES
    * Customer Impact: Medium  IJ11330
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: A message like "exception on write file /gpfs/fs0/ces/connections .... [Errno 2] No such file or directory" showed up in the mmfs.log file, which indicates a file creation problem in the sharedroot folder. The created connection file might contain invalid data, so that a NFS failover might not inform the affected clients about the IP address move.
    * Work around: None
    * Problem trigger: A CES IP was removed from node A and moved to node B (failover). 'ip addr' showed that this IP was indeed no longer hosted on node A and was now hosted on node B, which works as expected. However, the "ss -nt state established" command (and also netstat) still reported that IP on node A; the reason is not clear. The "rpcbind" service had a process running which used that IP (which is unexpected). Development has not seen such a situation before; it could be OS dependent. Since the IP was indeed hosted on node B, both nodes tried to create a temp file (connection information) for the same IP directly in the sharedroot folder. Node A finished writing that temp file and renamed it to its final name. When node B came to that point, the temp file was no longer there (because of the rename by node A), and the reported error "[Errno 2] No such file or directory" was logged in mmfs.log.
    * Symptom: Error output/message
    * Platforms affected: Linux Only (CES nodes)
    * Functional Area affected: CES
    * Customer Impact: Medium Importance  IJ11334
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The GPFS daemon died after an application used an invalid data buffer to append data to a file in GPFS.
    * Work around: Fix the application so that it does not pass invalid data buffers
    * Problem trigger: When an application appends data to a file using an invalid data buffer (for example, one that is too small or entirely invalid), in some cases the kernel fails to transfer the data from the user-space buffer into the GPFS page pool. As a result, a corrupted buffer descriptor is left behind, leading to the assert during a later data flush.
    * Symptom: GPFS daemon crash.
    * Platforms affected: All
    * Functional Area affected: All
    * Customer Impact: High Importance: the GPFS daemon will crash and the file system will be unmounted. IJ11098
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: For bandwidth checks, mmnetverify does not honor the cluster configuration value for tscCmdPortRange, which may result in an incorrect issue being reported if firewall settings do not allow TCP connections to ephemeral ports.
    * Work around: None
    * Problem trigger: This issue affects clusters where both conditions are true: 1) The tscCmdPortRange configuration value is used to specify a port range outside the system's ephemeral port range. 2) Network firewall settings do not allow connections to the system's ephemeral port range. (A sketch for checking the configured range follows this entry.)
    * Symptom: Unexpected Results/Behavior
    * Platforms affected: All
    * Function Area affected:  System Health
    * Customer Impact:  Suggested IJ11088
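    * Example (illustrative sketch, not from the APAR text): checking the configured command port range that mmnetverify ignores for its bandwidth test. tscCmdPortRange is a documented Spectrum Scale configuration option; the firewalld command is only one way to review open ports.
          mmlsconfig tscCmdPortRange        # show the configured command port range, if any
          # Verify the firewall also permits the ports used by the bandwidth test; with firewalld, e.g.:
          firewall-cmd --list-ports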
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: Writing to AFM filesets on the remote cluster mount might cause a deadlock with afmHashVersion=5 when a gateway node is explicitly assigned to the fileset
    * Work around: None
    * Problem trigger: Writing to the AFM fileset on the remote cluster mount with afmHashVersion=5 and with an explicit gateway node assignment to the fileset.
    * Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
    * Platforms affected: Linux only
    * Functional Area affected: AFM
    * Customer Impact: Critical IJ11348
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: mmlsfirmware missing drive information.
    * Work around: None
    * Platforms affected: ESS/GSS configurations.
    * Functional Area affected: ESS
    * Customer Impact:  Suggested  IJ11349
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: Fileset creation and inode expansion can become extremely slow; this fix greatly reduces the time needed for fileset creation or inode expansion
    * Work around: None
    * Problem trigger: Create a file system with a very large -n value (16K). Then create independent filesets or expand the inodes of existing filesets. After a number of filesets, the time for creating each single fileset increases up to several hours per fileset. This issue affects all releases.
    * Symptom: Performance Impact/Degradation
    * Platforms affected: All
    * Functional Area affected: All
    * Customer Impact: High IJ11542
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The Spectrum Scale lxtrace process will allocate memory from GPU NUMA domains when tracing is enabled on a node that has the numaMemoryInterleave configuration option enabled.
    * Work around: Disable numaMemoryInterleave on the node (see the example after this entry).
    * Problem trigger: Enable tracing on a node that has NUMA domains defined for GPU memory and has numaMemoryInterleave enabled.
    * Symptom: Spectrum scale lxtrace process allocates memory on GPU NUMA domains
    * Platforms affected:  Linux Only
    * Functional Area affected: trace
    * Customer Impact: High Importance IJ11355
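    * Example (illustrative sketch, not from the APAR text): disabling NUMA memory interleaving on the affected node; gpu-node1 is a placeholder node name, and the per-node -N scoping is an assumption to verify on your level.
          # Disable NUMA memory interleaving on the affected node only
          # (omit -N to change it cluster-wide if per-node scoping is not supported)
          mmchconfig numaMemoryInterleave=no -N gpu-node1
          # Restart GPFS on that node for the change to take effect
          mmshutdown -N gpu-node1
          mmstartup -N gpu-node1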
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: A failure to allocate a buffer, caused by a small pagepool, made a file system mount fail. This occurs when mounting a file system while a heavy workload is consuming the pagepool.
    * Work around: Increase the pagepool size (see the example after this entry).
    * Problem trigger: Mount a file system while a heavy workload is running.
    * Symptom: Error output/message.
    * Platforms affected: ALL Operating System environments
    * Functional Area affected: Filesets.
    * Customer Impact: High Importance. IJ11468
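    * Example (illustrative sketch, not from the APAR text): increasing the pagepool on the node that fails to mount. The 8G value and node name are placeholders; size the pagepool for your workload and available memory.
          # -i applies the change immediately and also persists it
          mmchconfig pagepool=8G -N node1 -i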
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: Inconsistent "mmhealth cluster show" and "mmhealth node show" output through the changed TCP port.
    * Work around: Use the default TCP port 1191
    * Problem trigger: Change the TCP port.
    * Symptom: different health state report of gpfs components on node layer and cluster layer by using mmhealth command
    * Platforms affected:  N/A
    * Functional Area affected: SystemHealth
    * Customer Impact: Medium Importance  IJ11346
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: Assigned CES IPs are removed because the link was detected as down, even though the link is not down, if usleep is not installed
    * Work around: Install usleep at the path defined in mmglobfuncs.Linux (see the example after this entry)
    * Problem trigger: The usleep program is not available
    * Symptom: IO error because the node loses all CES IPs
    * Platforms affected: ALL Linux OS environments
    * Functional Area affected: CES
    * Customer Impact: High Importance IJ11475
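    * Example (illustrative sketch, not from the APAR text): verifying that usleep is present on a CES node. The initscripts package name is an assumption for RHEL 7-family systems; adjust for your distribution and install to the path expected by mmglobfuncs.Linux.
          # Check whether the usleep helper expected by the CES scripts is present
          command -v usleep || echo "usleep is missing on this node"
          # On RHEL 7-family systems usleep is typically shipped in the initscripts package (assumption)
          yum install -y initscripts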
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The Postgres DB for the OBJECT protocol does not start
    * Work around: Manually stop and start the object protocol
    * Problem trigger: Link detected as down
    * Symptom: IO error for the object protocol
    * Platforms affected: ALL Linux OS environments
    * Functional Area affected: CES
    * Customer Impact: High Importance IJ11486
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: A node in an RDMA-enabled compute cluster with more than 4K nodes and no local file systems mounted both a file system served by a remote test cluster and a file system served by a remote production cluster; in doing so it created more than 32767 RDMA connections, which resulted in a GPFS assert: raErrorP[i].bufferId.index == index.
    * Work around: The compute cluster should not mount the file system served by the remote test cluster, and the quorum nodes in the compute cluster should not mount the file system served by the remote production cluster
    * Problem trigger: Mounting multiple remote file systems from a large compute cluster.
    * Symptom: GPFS crashes with assert
    * Platforms affected: Linux Only
    * Functional Area affected: RDMA
    * Customer Impact: High Importance  IJ11344
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: mmhealth cannot be used with the -N option on a node running AIX
    * Work around: No work around possible on AIX
    * Problem trigger: The timeout program is not available on AIX
    * Symptom: Unexpected Results/Behavior
    * Platforms affected: AIX/Power only
    * Functional Area affected: CES
    * Customer Impact: Suggested  IJ11492
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: Spectrum Scale has a kernel I/O hang detector which can be configured to panic the operating system if an I/O request doesn't complete within several minutes (see the panicOnIOHang parameter). But there is a setting in Linux, panic_on_oops, that if set false was causing the hang detector's panic to result in a system hang instead. This fix changes the panicOnIOHang feature to call Linux's kernel panic function directly so that it's not affected by the panic_on_oops setting.
    * Work around: Change the Linux panic_on_oops setting to true (see the example after this entry).
    * Problem trigger: Hung I/O due to Linux kernel or SAS adapter firmware defects.
    * Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
    * Platforms affected: ALL Linux OS environments
    * Functional Area affected: All Scale Users
    * Customer Impact: Critical IJ11496
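    * Example (illustrative sketch, not from the APAR text): enabling the Linux panic_on_oops setting so that the panicOnIOHang action actually panics the node. This is standard sysctl usage, not GPFS-specific; the sysctl.d file name is arbitrary.
          sysctl kernel.panic_on_oops                 # show the current setting (0 = off, 1 = on)
          sysctl -w kernel.panic_on_oops=1            # enable it for the running kernel
          echo 'kernel.panic_on_oops = 1' > /etc/sysctl.d/99-panic-on-oops.conf   # persist across reboots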
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The Spectrum Scale mmfsd daemon crashes with "logAssertFailed: sgP->isPanicked()" when running the mmlsfileset -i -d command.
    * Work around: Enable the trigger "MTWPlowThroughInode0Holes" based on guidance from IBM support.
    * Problem trigger: While the mmlsfileset -i -d command is traversing the indirect block tree of the inode 0 file, inode block expansion and copy operations discard the whole indirect block tree of the inode 0 file, which causes the problem described above.
    * Symptom: Daemon crash
    * Platforms affected: ALL Operating System environments
    * Functional Area affected: mmlsfileset command with -i -d options.
    * Customer Impact: High Importance  IJ11356
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: Command "mmlsfirmware --type drive" is failing when used for the first time when deploying DSS-G systems. It completes when further used on the same system. The error messages are as follows: dss23.cluster: mmcomp: Propagating the cluster configuration data to all, dss23.cluster: affected nodes. This is an asynchronous process. mmlsfirmware: Command failed. Examine previous error messages to determine cause.
    * Work around: A work-around would be to run "mmlscompspec" immediately after deploying the system.
    * Platforms affected: ESS/GSS configurations.
    * Functional Area affected: ESS
    * Customer Impact: Suggested  IJ11350
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: In the case of a slow or unstable network connection between a node and the cluster manager, or if the cluster manager is overloaded, the transmission of a health event can time out. Such a timeout causes the error message "sdrServ: Communication error on socket 481 (10.14.32.5) @handleRpcReq/AuthenticateIncoming, [err 146] Internal server error message." to show up in the mmfs log of the cluster manager. The log message is harmless because the node will retry sending the health event to the cluster manager.
    * Work around: None
    * Problem trigger: Slow or unstable network connectivity or overloaded cluster manager.
    * Symptom: Error message "sdrServ: Communication error on socket 481 (10.14.32.5) @handleRpcReq/AuthenticateIncoming, [err 146] Internal server error message."
    * Platforms affected: Linux and AIX
    * Functional Area affected: Health Monitoring
    * Customer Impact: Low Importance IJ11548
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The Postgres DB for the OBJECT protocol does not start
    * Work around: Manually stop and start the object protocol
    * Problem trigger: Link detected as down
    * Symptom: IO error for the object protocol
    * Platforms affected: ALL Linux OS environments
    * Functional Area affected: CES
    * Customer Impact: High Importance IJ11505
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: Kernel crash at: logAssertFailed: bdevP != NULL && bdevP->bd_disk != NULL file /usr/lpp/mmfs/src/gpl-linux/cxiIOBuffer.c line 2617 when a device is removed. This is because of a new kernel behavior in 4.11.0, which returns an uninitialized block_device when a device is removed while references are still held on the old data structure. The previous GPFS kernel code assumed the old kernel behavior and asserted on the condition; with the aforementioned kernel change this causes the assert.
    * Work around: None
    * Problem trigger: This condition can happen when a device is being removed from the Linux kernel.
    * Symptom: Kernel crash at: logAssertFailed: bdevP != NULL && bdevP->bd_disk != NULL file /usr/lpp/mmfs/src/gpl-linux/cxiIOBuffer.c line 2617
    * Platforms affected: N/A
    * Functional Area affected: GNR
    * Customer Impact: Medium  IJ11552
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: When there is a failure during vdisk creation, the administrator may see an assert like 'Assert exp(vdiskP->getMDUpdate() == vchLazy) line 1936 vdisk.C' in the cleanup phase.
    * Work around: None
    * Problem trigger: Running mmvdisk to create a new vdisk fails in the middle.
    * Symptom: Abend/Crash
    * Platforms affected: ALL Operating System environments
    * Functional Area affected: Mestor/GNR
    * Customer Impact: High Importance IJ11573
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: There is a race between RG master resign/relinquish and an asynchronous recovery group descriptor update, which may cause an unnecessary assert that looks like 'exp(rgDescUpdater.isDescLocked() and ...):rgDesc.C:1972'.
    * Work around: None
    * Problem trigger: It's possible to happen during RG master resign/relinquish.
    * Symptom: Abend/Crash
    * Platforms affected: ALL Operating System environments
    * Functional Area affected: Mestor/GNR
    * Customer Impact: High Importance IJ11574
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: When running mmaddpdisk, it's possible that it doesn't release the lock for admin commands properly. This will cause all subsequent admin commands to wait for this lock forever.
    * Work around: None
    * Problem trigger: Run mmaddpdisk with a stanza file where there isn't any DA stanza.
    * Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
    * Platforms affected: ALL Operating System environments
    * Functional Area affected: Mestor/GNR
    * Customer Impact: High Importance IJ11579
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: If the worker node fails during vdisk creation, activation of the vdisk on the worker may fail. The vdisk create procedure then fails and backs out all the resources. This behavior is unnecessary, given that the vdisk has already been committed: when the worker fails over, the vdisk will be recovered on the new worker.
    * Work around: Retry vdisk create.
    * Problem trigger: Run mmcrvdisk to create vdisk and fail the vdisk worker node at the same time.
    * Symptom: Error output/message
    * Platforms affected: ALL Operating System environments
    * Functional Area affected: Mestor/GNR
    * Customer Impact: Suggested IJ11580
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: When a vdisk create command for a 4+2p/3p vdisk is issued during recovery group master recovery, there is some possibility of hitting an assert like 'exp(rgFeatureBits != 0) in line 701 ../vdisk/RG.h'.
    * Work around: None
    * Problem trigger: Run mmcrvdisk to create 4+2/3p vdisk during recovery group master recovery.
    * Symptom: Abend/Crash
    * Platforms affected: ALL Operating System environments
    * Functional Area affected: Mestor/GNR
    * Customer Impact: High Importance  IJ11582
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: The GPFS daemon could die after a user application performs heavy I/O to the file system with a mix of valid and invalid data buffers.
    * Work around: Fix the application so that it does not pass invalid data buffers
    * Problem trigger: When an application appends data to a file using an invalid data buffer (for example, one that is too small or entirely invalid), in some cases the kernel fails to transfer the data from the user-space buffer into the GPFS page pool, leaving behind a corrupted buffer descriptor. If the page pool is almost used up before the flush detects and discards this corrupted buffer, another I/O activity on the file system can steal this buffer, and the steal hits this problem.
    * Symptom: GPFS daemon crash.
    * Platforms affected: All
    * Functional Area affected: All
    * Customer Impact: High Importance: the GPFS daemon will crash and the file system will be unmounted. IJ11606
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * Problem description: mmcheckquota verbose output is not printed when client nodes fail during the mmcheckquota run.
    * Work around: None
    * Problem trigger: Node failures while the online mmcheckquota command is running.
    * Symptom: The mmcheckquota command returns with an error and does not report the calculated quota discrepancy (verbose output).
    * Platforms affected: ALL Operating System environments except windows.
    * Functional Area affected: File system core - quotas
    * Customer Impact: Suggested: has little or no impact on customer operation IJ11337
    * IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
    * This update addresses the following APARs: IJ11043 IJ11044 IJ11045 IJ11046 IJ11088 IJ11090 IJ11092 IJ11098 IJ11101 IJ11209 IJ11232 IJ11246 IJ11257 IJ11282 IJ11284 IJ11299 IJ11330 IJ11334 IJ11337 IJ11344 IJ11346 IJ11348 IJ11349 IJ11350 IJ11355 IJ11356 IJ11468 IJ11475 IJ11486 IJ11492 I
  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2019-01-25T16:26:21Z  

    Flash (Alert):  IBM Spectrum Scale (GPFS) V4.1.1.0 through 5.0.1.1: a read from or write to a DMAPI-migrated file may result in undetected data corruption or a recall failure

    Abstract
    IBM has identified a problem in IBM Spectrum Scale V4.1.1.0 through 5.0.1.1, in which under some conditions reading a DMAPI-migrated file may return zeroes instead of the actual data. Further, a DMAPI-migrate operation or writing to a DMAPI-migrated file may cause the size of the stub file to be updated incorrectly, which may cause a mismatch between the file size recorded in the stub file and in the migrated object. This may result in failure of a manual or transparent recall, when triggered by a subsequent read from or write to the file.

     

    See the complete bulletin at:  http://www.ibm.com/support/docview.wss?uid=ibm10741243

  • gpfs@us.ibm.com
    gpfs@us.ibm.com
    662 Posts

    Re: IBM Spectrum Scale V5.0 announcements

    ‏2019-01-31T18:26:59Z  

    Technote  (Troubleshooting):  IBM Spectrum Scale: The GUI may display an error message instead of actual data in performance charts if a key for a queried metric disappeared (e.g. a node was renamed or removed from the cluster)

    Problem
    The Spectrum Scale GUI may display an error message instead of actual data in performance charts if a key for a queried metric disappeared (e.g., a node was renamed or removed from the cluster).


    See the complete bulletin at: http://www.ibm.com/support/docview.wss?uid=ibm10744521