Preventive Service Planning
This article details known issues with Network File System (NFS) implementations that affect the reliability of data stored on such file systems by the WebSphere Application Server transaction, activity and compensation services and by the Service Integration Bus filestore. When reliability is compromised, file corruption is possible, and in most cases this results in DATA LOSS. The article also makes recommendations for configuring both NFS and WebSphere Application Server, based on known configuration problems and customer experiences.
NFS is a convenient Network Attached Storage (NAS) technology.
NFS v4 is commonly used as reliable storage for the following components of WebSphere Application Server:
- The recovery log service - the recovery log service is used by the transaction, activity and compensation services for data logging purposes. In the case of the transaction service, NFS is used to enable highly available transaction logs for peer recovery.
- The default messaging provider for JMS (Service Integration Bus) - NFS allows messaging engines that are configured to use a filestore to fail over between servers.
As with any file system, NFS must provide a behaviour and reliability that is consistent with the design of both the transaction service and filestore. Two principles that both of these server components rely on are:
- Lease based locking - both components may lock files and rely on the lock to exclude other processes attempting to lock the same file on either the same machine or some other machine. Should the process that owns the lock hang or end for whatever reason, the lock should be released, allowing other processes to acquire the lock. Once another process has acquired the lock, if the original process recovers from a hang, it should no longer be able to write to the file; it has lost the lock. A basic test for lease based locking is available here:
- File writes may be forced, which means that all writes to the file up to that point must be safely stored before the force operation completes. Safely stored means that in the event of a crash in any part of the system, when those files are next read by any process, the data written before the last successful force is accurate and complete. Testing this is difficult.
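The two principles above map onto standard Java NIO calls. The following is a minimal sketch, not the WebSphere Application Server implementation: the class name, method name and temporary file are hypothetical, and running it against a local file only shows the calls involved. It cannot demonstrate lease behaviour, which requires two machines and an NFS v4 mount.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockAndForceSketch {

    /**
     * Takes an exclusive lock on the file, appends one record and forces it
     * to storage. Returns false if another process already holds the lock.
     */
    public static boolean appendForced(Path log, String record) throws IOException {
        try (FileChannel ch = FileChannel.open(log,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = ch.tryLock();   // null if another process holds the lock
            if (lock == null) {
                return false;
            }
            try {
                ch.position(ch.size());     // append at the end of the file
                ByteBuffer buf = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                ch.force(true);             // returns only after data and metadata are flushed
                return true;
            } finally {
                lock.release();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical local file; on a real system this would live on the NFS mount.
        Path log = Files.createTempFile("recovery-log-sketch", ".dat");
        System.out.println(appendForced(log, "record-1\n"));
        System.out.print(new String(Files.readAllBytes(log), StandardCharsets.UTF_8));
    }
}
```

Whether the forced data really survives a crash, and whether a lost lock really stops a recovered process from writing, depends on the NFS client and server, which is why the patches and settings in this article matter.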
Required Linux NFS Patch to AVOID DATA LOSS
SUSE Linux Bug Number 828236.
Red Hat Linux Bug Number 963785.
Vendor support should be contacted to acquire an NFS patch or kernel for these respective bug numbers.
These bugs address an issue in NFS where an NFS client (in the WebSphere Application Server case, either the transaction service or a messaging engine) may lose a file lock but can continue writing to the file without reacquiring the lock. There is no notification that the lock is lost.
For the transaction service this means that when peer recovery is configured, a peer server might attempt to recover the logs of a server perceived to have failed. If both servers write to the logs at the same time due to this bug, the transaction logs will become corrupted and transactions may roll back prematurely.
For a messaging engine this means two instances of the same messaging engine could run in two different servers concurrently. This will result in filestore corruption and consequently data loss.
NFS on Linux implements advisory locking, which means that files can be read and written without actually acquiring a lock.
Real-world scenarios that have resulted in data loss because of this problem involve a network partition and a virtual image hang. Both conditions resulted in the loss of the file lock, yet the process that had owned the lock was still able to write to the file and was not notified in any way.
Other Recommended NFS Patches by Implementation
The following table lists known issues by implementation:
|Implementation|Recommendation|
|---|---|
|Solaris 10 (SPARC)|Patch 147440-13 is recommended.|
|Solaris 10 (x86-64)|Patch 147441-13 is recommended.|
|SLES V10 Update 3|Suspected problem introduced by Update 3, resolved in kernel level 126.96.36.199-0.60.1.|
|AIX 5.3 TL6 to TL9|APAR IZ29559 is recommended.|
APAR IV35811 is recommended.
APAR IV46850 is recommended.
Required NFS Configuration
The following table lists the required mount options. Other options may be added, but they must not negate these.
|Option|Description|
|---|---|
|-t nfs4|Forces NFS v4 to prevent any possibility of falling back to NFS v3.|
|-o hard,intr|Soft mounts can lead to file corruption so hard mounts are required. intr allows a user to interrupt from the keyboard.|
mount -t nfs4 -o hard,intr server:/logs
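One way to confirm that a mount actually ended up with these options is to inspect its entry in /proc/mounts on the Linux client. The sketch below assumes the usual /proc/mounts field layout (device, mount point, file system type, options); the class name, the sample lines and the /mnt/logs mount point are hypothetical. It does not check intr, because the client may not report that option.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NfsMountCheck {

    /**
     * Returns true if a /proc/mounts-style line describes an NFS v4 mount
     * that uses the hard option and does not use soft.
     * Assumed field layout: device mountpoint fstype options dump pass
     */
    public static boolean isAcceptable(String mountLine) {
        String[] fields = mountLine.trim().split("\\s+");
        if (fields.length < 4) {
            return false;                       // not a recognisable mount entry
        }
        String fsType = fields[2];
        Set<String> opts = new HashSet<>(Arrays.asList(fields[3].split(",")));
        return fsType.equals("nfs4") && opts.contains("hard") && !opts.contains("soft");
    }

    public static void main(String[] args) {
        // Hypothetical sample entries; real ones come from /proc/mounts.
        System.out.println(isAcceptable("server:/logs /mnt/logs nfs4 rw,hard,vers=4.0 0 0"));
        System.out.println(isAcceptable("server:/logs /mnt/logs nfs rw,hard,vers=3 0 0"));
    }
}
```

A check like this could run at server start-up, so that a silent fallback to NFS v3 or a soft mount is caught before any log data is written.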
The following table lists the required export options, which apply to Linux only. Other options may be added, but they must not negate these.
|Option|Description|
|---|---|
|sync|Ensures the default behaviour, in which the server replies only AFTER changes have been committed to stable storage. This avoids the possibility that a failure leads to loss of data because it has not been persisted.|
|no_wdelay|Do not delay writes in order to batch them. Write delay is a potential performance optimization; however, the transaction service and filestore wait for each write to complete and are the only writers, so no batching will ever be possible.|
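As an illustration, an /etc/exports entry on a Linux NFS server that carries both options might look like the following. This is a sketch only: the client host name appserver1 is hypothetical, the /logs path is taken from the mount example above, and rw is an assumed additional option.

```
# /etc/exports on the Linux NFS server
# "appserver1" is a hypothetical client host; rw is an assumed extra option
/logs  appserver1(rw,sync,no_wdelay)
```

After editing the file, the exports are typically reloaded with exportfs -ra.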
Required Java Configuration for Recovery Log Service on Windows
On IBM Windows JVM versions 1.5 and 1.6, a JVM custom property must be set on all servers accessing recovery log files, except when memory mapping is disabled (see When to Disable Recovery Log Service Memory Mapping on Windows below). The name of the custom property is:
The default is false. This JVM custom property must be set to 'true'.
When to Disable Recovery Log Service Memory Mapping on z/OS
The transaction, activity and compensation services can use memory mapping to read and write the transaction, activity and compensation logs. This is the default in all current versions of WebSphere Application Server, except on Windows when peer recovery is enabled for the transaction service, and except on z/OS at version 8.5 and above.
The default changed on z/OS at version 8.5 because memory mapped files cannot be expanded on z/OS. This problem is documented in APAR PM58494.
On z/OS, it is recommended that memory mapping be disabled for all versions of WebSphere Application Server (in version 8.5 and above, memory mapping is disabled by default).
When to Disable Recovery Log Service Memory Mapping on Windows
WebSphere Application Server automatically disables memory mapping if the platform is Windows AND peer recovery is enabled. CIFS is most likely to be used when peer recovery is enabled, but can of course be used even if peer recovery is not enabled. If the transaction, activity or compensation logs reside on CIFS, memory mapping must be disabled. An IOException results if memory mapping is not disabled.
Therefore, if the transaction, activity or compensation logs reside on CIFS, memory mapping must be explicitly disabled unless peer recovery is enabled (which automatically disables memory mapping).
To disable memory mapped files set the JVM custom property:
to 'true' on any server that accesses recovery log files.
If peer recovery is enabled but the recovery logs do not reside on CIFS, memory mapping can be explicitly enabled by setting the property to 'false'.
How To Set JVM Custom Properties
The following Infocenter article describes how to set JVM custom properties for application servers:
15 June 2018