IBM Storage Scale RAID callbacks
IBM Storage Scale RAID includes callbacks for events that can occur during recovery group operations. These callbacks can be installed by the system administrator using the mmaddcallback command.
The callbacks are provided primarily as a way for system administrators to be notified when important IBM Storage Scale RAID events occur. For example, an IBM Storage Scale RAID administrator can use the pdReplacePdisk callback to send an email to notify system operators that the replacement threshold for a declustered array was reached and that pdisks must be replaced. Similarly, the preRGTakeover callback can be used to inform system administrators of a possible server failover.
Because the callbacks are intended only as a notification method, callback scripts should not perform any real processing. IBM Storage Scale RAID callbacks should not be installed for synchronous execution; the default of asynchronous callback execution should be used in all cases. Synchronous or complicated processing within a callback might delay GPFS daemon execution pathways and cause unexpected and undesired results, including loss of file system availability.
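As an illustration of an asynchronous, notification-only registration, the following sketch assembles an mmaddcallback command line and prints it for review instead of running it. The callback identifier gnrNotify, the script path /usr/local/bin/gnr_notify.sh, and the helper name build_callback_cmd are hypothetical. The --command, --event, --async, and --parms options are used as understood from the mmaddcallback command; verify them against the Command and Programming Reference for the installed release before use.

```shell
#!/bin/sh
# Print (do not run) an mmaddcallback command for a notification-only callback.
# gnrNotify and /usr/local/bin/gnr_notify.sh are hypothetical names.
build_callback_cmd() (
    id=$1; script=$2; events=$3
    shift 3
    # Asynchronous execution is the default; it is spelled out here so the
    # intent is visible to anyone reviewing the command.
    printf 'mmaddcallback %s --command %s --event %s --async' \
        "$id" "$script" "$events"
    # Each remaining argument becomes one --parms string of % variables.
    for parms in "$@"; do
        printf ' --parms "%s"' "$parms"
    done
    printf '\n'
)

# Only parameters available to every listed event are requested.
build_callback_cmd gnrNotify /usr/local/bin/gnr_notify.sh \
    pdFailed,pdReplacePdisk \
    "%eventName %myNode" \
    "%rgName %daName %pdName %pdState"
```

Printing the command first keeps the sketch safe to run anywhere; the printed line would then be executed by an administrator on the recovery group servers.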
- preRGTakeover
- The preRGTakeover callback
is invoked on a recovery group server prior to attempting to open
and serve recovery groups. The rgName parameter
may be passed into the callback as the keyword value _ALL_,
indicating that the recovery group server is about to open multiple
recovery groups; this is typically at server startup, and the parameter rgCount will
be equal to the number of recovery groups being processed. Additionally,
the callback will be invoked with the rgName of
each individual recovery group and an rgCount of
1 whenever the server checks to determine whether it should open and
serve recovery group rgName.
The following parameters are available to this callback: %myNode, %rgName, %rgErr, %rgCount, and %rgReason.
- postRGTakeover
- The postRGTakeover callback
is invoked on a recovery group server after it has checked, attempted,
or begun to serve a recovery group. If multiple recovery groups have
been taken over, the callback will be invoked with rgName keyword _ALL_ and
an rgCount equal to the total number of
involved recovery groups. The callback will also be triggered for
each individual recovery group.
The following parameters are available to this callback: %myNode, %rgName, %rgErr, %rgCount, and %rgReason.
- preRGRelinquish
- The preRGRelinquish callback
is invoked on a recovery group server prior to relinquishing service
of recovery groups. The rgName parameter
may be passed into the callback as the keyword value _ALL_,
indicating that the recovery group server is about to relinquish service
for all recovery groups it is serving; the rgCount parameter
will be equal to the number of recovery groups being relinquished.
Additionally, the callback will be invoked with the rgName of
each individual recovery group and an rgCount of
1 whenever the server relinquishes serving recovery group rgName.
The following parameters are available to this callback: %myNode, %rgName, %rgErr, %rgCount, and %rgReason.
- postRGRelinquish
- The postRGRelinquish callback
is invoked on a recovery group server after it has relinquished serving
recovery groups. If multiple recovery groups have been relinquished,
the callback will be invoked with rgName keyword _ALL_ and
an rgCount equal to the total number of
involved recovery groups. The callback will also be triggered for
each individual recovery group.
The following parameters are available to this callback: %myNode, %rgName, %rgErr, %rgCount, and %rgReason.
- rgOpenFailed
- The rgOpenFailed callback
will be invoked on a recovery group server when it fails to open a
recovery group that it is attempting to serve. This may be due to
loss of connectivity to some or all of the disks in the recovery group;
the rgReason string will indicate why the
recovery group could not be opened.
The following parameters are available to this callback: %myNode, %rgName, %rgErr, and %rgReason.
- rgPanic
- The rgPanic callback
will be invoked on a recovery group server when it is no longer able
to continue serving a recovery group. This may be due to loss of connectivity
to some or all of the disks in the recovery group; the rgReason string
will indicate why the recovery group can no longer be served.
The following parameters are available to this callback: %myNode, %rgName, %rgErr, and %rgReason.
- pdFailed
- The pdFailed callback
is generated whenever a pdisk in a recovery group is marked as dead, missing, failed,
or readonly.
The following parameters are available to this callback: %myNode, %rgName, %daName, %pdName, %pdLocation, %pdFru, %pdWwn, and %pdState.
- pdRecovered
- The pdRecovered callback
is generated whenever a missing pdisk is rediscovered.
The following parameters are available to this callback: %myNode, %rgName, %daName, %pdName, %pdLocation, %pdFru, and %pdWwn.
- pdReplacePdisk
- The pdReplacePdisk callback
is generated whenever a pdisk is marked for replacement according
to the replace threshold setting of the declustered array in which
it resides.
The following parameters are available to this callback: %myNode, %rgName, %daName, %pdName, %pdLocation, %pdFru, %pdWwn, %pdState, and %pdPriority.
- pdPathDown
- The pdPathDown callback
is generated whenever one of the block device paths to a pdisk disappears
or becomes inoperative. The occurrence of this event can indicate
connectivity problems with the JBOD array in which the pdisk resides.
The following parameters are available to this callback: %myNode, %rgName, %daName, %pdName, %pdPath, %pdLocation, %pdFru, and %pdWwn.
- daRebuildFailed
- The daRebuildFailed callback
is generated when the spare space in a declustered array has been
exhausted, and vdisk tracks involving damaged pdisks can no longer
be rebuilt. The occurrence of this event indicates that fault tolerance
in the declustered array has become degraded and that disk maintenance
should be performed immediately. The daRemainingRedundancy parameter
indicates how much fault tolerance remains in the declustered array.
The following parameters are available to this callback: %myNode, %rgName, %daName, and %daRemainingRedundancy.
- nsdCksumMismatch
- The nsdCksumMismatch callback
is generated whenever transmission of vdisk data by the NSD network
layer fails to verify the data checksum. This can indicate problems
in the network between the GPFS client
node and a recovery group server. The first error between a given
client and server generates the callback; subsequent callbacks are
generated for each ckReportingInterval occurrence.
The following parameters are available to this callback: %myNode, %ckRole, %ckOtherNode, %ckNSD, %ckReason, %ckStartSector, %ckDataLen, %ckErrorCountClient, %ckErrorCountNSD, and %ckReportingInterval.
- %ckDataLen
- The length of data involved in a checksum mismatch.
- %ckErrorCountClient
- The cumulative number of errors for the client side in a checksum mismatch.
- %ckErrorCountServer
- The cumulative number of errors for the server side in a checksum mismatch.
- %ckErrorCountNSD
- The cumulative number of errors for the NSD side in a checksum mismatch.
- %ckNSD
- The NSD involved.
- %ckOtherNode
- The IP address of the other node in an NSD checksum event.
- %ckReason
- The reason string indicating why a checksum mismatch callback was invoked.
- %ckReportingInterval
- The error-reporting interval in effect at the time of a checksum mismatch.
- %ckRole
- The role (client or server) of a GPFS node.
- %ckStartSector
- The starting sector of a checksum mismatch.
- %daName
- The name of the declustered array involved.
- %daRemainingRedundancy
- The remaining fault tolerance in a declustered array.
- %pdFru
- The FRU (field replaceable unit) number of the pdisk.
- %pdLocation
- The physical location code of a pdisk.
- %pdName
- The name of the pdisk involved.
- %pdPath
- The block device path of the pdisk.
- %pdPriority
- The replacement priority of the pdisk.
- %pdState
- The state of the pdisk involved.
- %pdWwn
- The worldwide name of the pdisk.
- %rgCount
- The number of recovery groups involved.
- %rgErr
- A code from a recovery group, where 0 indicates no error.
- %rgName
- The name of the recovery group involved.
- %rgReason
- The reason string indicating why a recovery group callback was invoked.
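The parameters listed above are handed to the callback script as command-line arguments. The following is a minimal sketch of a handler that only formats an event record and returns, in keeping with the guidance that callbacks do no real processing. The invocation convention (event name, node name, then key=value pairs) and the function names are assumptions of this sketch, not the behavior of the shipped sample script. The severity tags follow the sample output in this section; the default tag for events not shown there is also an assumption.

```shell
#!/bin/sh
# Hypothetical notification-only handler: format one event record and return.
# Assumed invocation: handler EVENT NODE key=value [key=value ...]

# Map an event name to a severity tag. pdFailed and rgPanic mirror the sample
# output in this section; treating everything else as informational is an
# assumption of this sketch.
severity_for() {
    case "$1" in
        pdFailed) echo W ;;
        rgPanic)  echo E ;;
        *)        echo I ;;
    esac
}

# Build a single-line record from the event name, node, and key=value pairs.
format_event() (
    event=$1; node=$2
    shift 2
    printf '[%s] event=%s node=%s' "$(severity_for "$event")" "$event" "$node"
    for kv in "$@"; do
        printf ' %s' "$kv"
    done
    printf '\n'
)

# A real handler might prefix a timestamp and append to a log file, for
# example (the log path is hypothetical):
#   echo "$(date): mmfsd: $(format_event "$@")" >> /var/log/gnr_events.log
format_event pdFailed c45f01n01-ib0.gpfs.net rgName=BB1RGL daName=DA1 \
    pdName=e4d5s03 pdState=dead/systemDrain
```

Keeping the handler to string formatting and a single append means it returns almost immediately, which matters even for asynchronous callbacks on a busy recovery group server.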
All IBM Storage Scale RAID callbacks are local, which means that the event triggering the callback occurs only on the involved node (or nodes, in the case of nsdCksumMismatch) rather than on every node in the GPFS cluster. The nodes where IBM Storage Scale RAID callbacks should be installed are, by definition, the recovery group server nodes. The exception is nsdCksumMismatch, where it makes sense to install the callback on GPFS client nodes as well as recovery group servers.
A sample callback script, /usr/lpp/mmfs/samples/vdisk/gnrcallback.sh, is available to demonstrate how callbacks can be used to log events or email an administrator when IBM Storage Scale RAID events occur.
Logged events would look something like this:
Fri Feb 28 10:22:17 EST 2014: mmfsd: [W] event=pdFailed node=c45f01n01-ib0.gpfs.net
rgName=BB1RGL daName=DA1 pdName=e4d5s03 pdLocation=SV13306129-5-3 pdFru=46W6911
pdWwn=naa.5000C50055D4D437 pdState=dead/systemDrain
Fri Feb 28 10:22:39 EST 2014: mmfsd: [I] event=pdRecovered node=c45f01n01-ib0.gpfs.net
rgName=BB1RGL daName=DA1 pdName=e4d5s03 pdLocation=SV13306129-5-3 pdFru=46W6911
pdWwn=naa.5000C50055D4D437 pdState=UNDEFINED
Fri Feb 28 10:23:59 EST 2014: mmfsd: [E] event=rgPanic node=c45f01n01-ib0.gpfs.net
rgName=BB1RGL rgErr=756 rgReason=missing_pdisk_causes_unavailability
Fri Feb 28 10:24:00 EST 2014: mmfsd: [I] event=postRGRelinquish node=c45f01n01-ib0.gpfs.net
rgName=BB1RGL rgErr=0 rgReason=unable_to_continue_serving
Fri Feb 28 10:24:00 EST 2014: mmfsd: [I] event=postRGRelinquish node=c45f01n01-ib0.gpfs.net
rgName=_ALL_ rgErr=0 rgReason=unable_to_continue_serving
Fri Feb 28 10:35:06 EST 2014: mmfsd: [I] event=postRGTakeover node=c45f01n01-ib0.gpfs.net
rgName=BB1RGL rgErr=0 rgReason=retry_takeover
Fri Feb 28 10:35:06 EST 2014: mmfsd: [I] event=postRGTakeover node=c45f01n01-ib0.gpfs.net
rgName=_ALL_ rgErr=0 rgReason=none
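Because each logged record is a run of key=value tokens, individual fields can be pulled out with plain shell string handling. The sketch below assumes the record has been joined onto a single line and that values contain no spaces, as in the sample output above; the function name get_field is hypothetical.

```shell
#!/bin/sh
# Extract the value of one key=value field from a logged event record.
# Assumes a single-line record whose values contain no spaces.
get_field() (
    key=$1; record=$2
    # Disable globbing so tokens such as [E] are not expanded as patterns.
    set -f
    for token in $record; do
        case "$token" in
            "$key"=*)
                # Strip everything up to and including the first '='.
                printf '%s\n' "${token#*=}"
                exit 0
                ;;
        esac
    done
    # Key not present in the record.
    exit 1
)

record='Fri Feb 28 10:23:59 EST 2014: mmfsd: [E] event=rgPanic node=c45f01n01-ib0.gpfs.net rgName=BB1RGL rgErr=756 rgReason=missing_pdisk_causes_unavailability'
get_field event "$record"   # prints rgPanic
get_field rgErr "$record"   # prints 756
```

The function returns a nonzero status when the key is absent, so a caller can distinguish a missing field (for example, pdState in an rgPanic record) from an empty one.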
An email notification would look something like this:
> mail
Heirloom Mail version 12.4 7/29/08. Type ? for help.
"/var/spool/mail/root": 7 messages 7 new
>N 1 root Fri Feb 28 10:22 18/817 "[W] pdFailed"
N 2 root Fri Feb 28 10:22 18/816 "[I] pdRecovered"
N 3 root Fri Feb 28 10:23 18/752 "[E] rgPanic"
N 4 root Fri Feb 28 10:24 18/759 "[I] postRGRelinquish"
N 5 root Fri Feb 28 10:24 18/758 "[I] postRGRelinquish"
N 6 root Fri Feb 28 10:35 18/743 "[I] postRGTakeover"
N 7 root Fri Feb 28 10:35 18/732 "[I] postRGTakeover"
From root@c45f01n01.localdomain Wed Mar 5 12:27:04 2014
Return-Path: <root@c45f01n01.localdomain>
X-Original-To: root
Delivered-To: root@c45f01n01.localdomain
Date: Wed, 05 Mar 2014 12:27:04 -0500
To: root@c45f01n01.localdomain
Subject: [W] pdFailed
User-Agent: Heirloom mailx 12.4 7/29/08
Content-Type: text/plain; charset=us-ascii
From: root@c45f01n01.localdomain (root)
Status: R
Wed Mar 5 12:27:04 EST 2014: mmfsd: [W] event=pdFailed node=c45f01n01-ib0.gpfs.net
rgName=BB1RGL daName=DA1 pdName=e4d5s03 pdLocation=SV13306129-5-3 pdFru=46W6911
pdWwn=naa.5000C50055D4D437 pdState=dead/systemDrain
For more information about the mmaddcallback command, see the IBM Storage Scale: Command and Programming Reference.