OA60929: TS7700 R5.2 CUIR STAGE 1B - UNHEALTHY CLUSTER VARY

A fix is available

APAR status

Closed as new function.

Error description

Local fix

Problem summary

****************************************************************
* USERS AFFECTED:                                              *
* This APAR is part of the full function support for Release   *
* 5.2.1 of the TS7700 Virtualization Engine D/T3957 and        *
* provides enhanced CUIR support for an unhealthy (fenced)     *
* cluster.                                                     *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* Enhanced CUIR Support for Release 5.2.1 of the TS7700        *
* Virtualization Engine.                                       *
****************************************************************
* RECOMMENDATION:                                              *
* For full support (z/OS V2R3 and above) OA60929 (Device       *
* Services) will bring in the support from MVS Allocation      *
* OA61050.                                                     *
****************************************************************

Problem conclusion

Temporary fix

Comments

TS7700 Release 5.2.1
--------------------
This APAR was developed as part of the support for Release 5.2.1
of the TS7700 Virtualization Engine and enhances the CUIR support
that was initially delivered (see APAR OA52376). The initial
support enabled devices to be automatically varied offline and
online when a TS7700 cluster was placed in service. This new
support enables devices in an unhealthy (fenced) cluster to be
automatically varied offline and online.
.
None of the other functions being delivered in Release 5.2.1
of the TS7700 require host support.
.
Resources
---------
For a discussion of the other TS7700 Release 5.2.1 enhancements,
refer to: https://www.ibm.com/docs/en/ts7700-virtual-tape.
For a detailed discussion of the CUIR support, refer to:
https://www.ibm.com/support/pages/node/6355675.
.
Control Unit Initiated Reconfiguration (CUIR) - Unhealthy Cluster
-----------------------------------------------------------------
With this host support, if all clusters in the grid are at the
5.2.1 release level, the TS7700 can now notify the host that a
distributed library (cluster) is unhealthy (fenced). Each of
the supporting host systems can then automatically vary the
devices offline and back online (for CUIR reasons). By default,
both of the vary notifications (offline and online) are
disabled. The LIBRARY REQUEST command can be used to enable or
disable each of the automatic notifications:
.
- LIBRARY REQUEST,complib,CUIR,SETTING,FENCE,{ENABLE|DISABLE}
- LIBRARY REQUEST,complib,CUIR,AONLINE,FENCE,{ENABLE|DISABLE}
.
In addition to the FENCE keyword above, SERVICE or ALL can
also be specified. SERVICE enables the initial CUIR vary
notification support and ALL enables both SERVICE and UNHEALTHY
cluster varies.
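.
As an illustration only (this is not part of the APAR), the
command strings above could be generated by a small script. The
sketch below is in Python; the function name build_cuir_command
and the library name GRIDLIB1 are hypothetical, and only the
keyword values are taken from the text above.

```python
# Hypothetical helper: only the keyword values come from this APAR.
NOTIFICATIONS = {"SETTING", "AONLINE"}
SCOPES = {"SERVICE", "FENCE", "ALL"}
ACTIONS = {"ENABLE", "DISABLE"}


def build_cuir_command(complib, notification, scope, action):
    """Return a LIBRARY REQUEST command string for a CUIR setting."""
    if notification not in NOTIFICATIONS:
        raise ValueError(f"unknown notification keyword: {notification}")
    if scope not in SCOPES:
        raise ValueError(f"unknown scope keyword: {scope}")
    if action not in ACTIONS:
        raise ValueError(f"unknown action keyword: {action}")
    return f"LIBRARY REQUEST,{complib},CUIR,{notification},{scope},{action}"


# GRIDLIB1 is a made-up composite library name.
print(build_cuir_command("GRIDLIB1", "SETTING", "FENCE", "ENABLE"))
print(build_cuir_command("GRIDLIB1", "AONLINE", "FENCE", "DISABLE"))
```

Validating the keywords before building the string is a design
choice of the sketch; it catches a mistyped keyword before the
command is issued.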
.
If device verification is needed before bringing the devices
online, the online notification (AONLINE) can be left disabled
and through the TS7700 management interface (MI), online
notification can be manually triggered.
.
With this support (as with the initial CUIR support), when a
tape device is varied offline for service or for an unhealthy
(fenced) cluster, it is varied offline for CUIR reasons. If a
device is offline for CUIR reasons, the reverse notification is
needed to clear the CUIR state. If the device is subsequently
taken offline for other reasons (path, operator, or library), it
may remain offline for those reasons even after the CUIR state
is cleared. The existing LIBRARY DISPDRV command (CBR1220I) can
be used to determine the reason that a device is offline,
including the CUIR reason. If the CUIR state does not clear, the
existing VARY xxxx,ONLINE,RESET command can be used.
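.
The interaction between offline reasons described above can be
pictured with a small conceptual model. The Python sketch below
is an assumption made for illustration; it is not how z/OS
tracks device state, and the names OfflineReason, clear_reason
and is_offline are hypothetical.

```python
from enum import Flag, auto


class OfflineReason(Flag):
    """Reasons named in the text above; a device may have several."""
    CUIR = auto()
    PATH = auto()
    OPERATOR = auto()
    LIBRARY = auto()


def clear_reason(current, reason):
    """Remove one offline reason and leave any others in place."""
    return current & ~reason


def is_offline(current):
    """A device stays offline while any reason remains."""
    return bool(current)


# Device varied offline for CUIR, then also by the operator.
state = OfflineReason.CUIR | OfflineReason.OPERATOR
# The reverse (online) notification clears only the CUIR reason.
state = clear_reason(state, OfflineReason.CUIR)
print(is_offline(state), state)
```

In this model the device is still offline after the CUIR reason
is cleared, because the operator reason remains.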
.
When devices are varied offline for CUIR reasons, they go
pending offline. Depending on the state of the device and the
state of the cluster, the DDR SWAP command can then be used to
move long-running jobs to another device. For devices that are
boxed, when the host receives notification to bring the devices
back online, the host will attempt to bring the boxed devices
back online.
.
The following LIBRARY REQUEST commands were added with
the initial CUIR support:
- LIBRARY REQUEST,libname,LDRIVE  (composite or distributed)
- LIBRARY REQUEST,distlib,LDRIVE,GROUP,index
The LIBRARY REQUEST LDRIVE commands can be used to determine
the state of a CUIR notification request. Refer to the following
TS7700 white paper for the LDRIVE command options and the
output reported to the host:
https://www.ibm.com/support/pages/node/6355675
.
The following operator command was added with the initial CUIR
support:
- DEVSERV QTAPE,xxxx,QHA
The query host access (QHA) support for tape displays the
systems that are online (grouped) to the specified tape device.
If there are systems whose devices are not going offline, this
will show the systems that are still online (grouped) to the
specified device.
Since, for a period of time, only a subset of the systems may
support the new unhealthy vary notification, the commands noted
above will help determine if manual varies are needed from some
of the systems.
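.
Once the systems still grouped to each device have been read
from the QHA displays, working out where manual varies are
needed is a simple regrouping. The Python sketch below assumes
that information has already been collected by hand into a
dictionary; the function name, the device numbers and the system
names are hypothetical, and no attempt is made to parse command
output.

```python
from collections import defaultdict


def manual_varies_needed(still_grouped):
    """Map each system to the sorted devices it still has online.

    still_grouped maps a device number to the set of systems that
    are still online (grouped) to that device.
    """
    by_system = defaultdict(list)
    for device, systems in still_grouped.items():
        for system in systems:
            by_system[system].append(device)
    return {name: sorted(devs) for name, devs in sorted(by_system.items())}


# Made-up example: SYSB and SYSC did not vary their devices offline.
observed = {
    "0A40": {"SYSB"},
    "0A41": {"SYSB", "SYSC"},
    "0A42": set(),
}
for system, devices in manual_varies_needed(observed).items():
    print(system, devices)
```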
.
For IOS message updates related to the initial CUIR support
refer to IOS APAR (OA52379). Updates specific to this support
include:
.
IOS279I
(Message Text) - REQUEST REASON displayed may be
SERVICE or UNHEALTHY CLUSTER
(Explanation) - updated to explain the unhealthy cluster reason
The library has detected issues with a cluster (distributed
library) in the grid (composite library) and has fenced the
cluster. The library has initiated a reconfiguration request
from a device to quiesce the specified set of devices in the
unhealthy cluster. The Control Unit Initiated Reconfiguration
(C.U.I.R.) service has received control to perform the request.
Quiescing devices means to make devices unavailable for use so
that they cannot be varied online and used while the cluster is
in the fenced state.
.
IOS280I
(Message Text) - REQUEST REASON displayed may be
SERVICE or UNHEALTHY CLUSTER
(Explanation) - updated to explain the unhealthy cluster reason
The library has initiated a reconfiguration request from a
device to resume the specified set of devices when the library
is no longer in the fenced state. The Control Unit Initiated
Reconfiguration (C.U.I.R.) service has received control to
perform the request. Resuming devices means to make devices
available for use when the cluster (or distributed library) is
no longer considered to be in the fenced (unhealthy) state. The
devices may have been varied online by the system or may have
been made available to be varied online.
.
IOS281I
No changes to the successful message.
.
IOS282I
(Message Text) - REQUEST REASON displayed may be
SERVICE or UNHEALTHY CLUSTER
(Explanation) - updated to cover both the service and the
unhealthy cluster reason
The Control Unit Initiated Reconfiguration (C.U.I.R.) support
attempted to quiesce the specified devices in order to satisfy
the request specified in system message IOS279I (IOS1279I) but
the devices could not be quiesced. The request may have failed
because the current state of the device precludes it from being
quiesced. This may be the case if the device is a JES3 managed
device (C.U.I.R. is not supported). In some cases, it may be
necessary to manually vary the device offline.
(Operator Response) - updated to cover both the service and the
unhealthy cluster reason
You may need to manually vary the specified devices offline,
which may entail an operator initiated DDR SWAP or the
cancellation of a job that has an allocated device. If the
request was for a JES3 managed device, this failure is expected
since JES3 managed devices are not supported by this function.
Otherwise, contact your IBM service representative if the
failures persist.
.
Note: For the unhealthy cluster vary, since a healthy cluster
can also report on the state of its peer, the IOS messages above
(IOS279I - IOS282I) may be issued multiple times. Also note that
CUIR continues to be supported only when running natively on
MVS; it is not supported for an MVS guest running under VM. In
addition, a CUIR request for a JES3 managed device is not
supported and will result in IOS282I (IOS1282I) being issued.
The JES3 managed devices will be listed as devices that could
not be brought offline for CUIR reasons. Lastly, if an IPL
occurs while in a CUIR state, the CUIR state is not maintained
across the IPL.
.
Additional Search Keywords:
MSGIOS279I MSGIOS280I MSGIOS281I MSGIOS282I

APAR Information

APAR number
OA60929
Reported component name
DEV SUPPORT TAP
Reported component ID
5695DF110
Reported release
230
Status
CLOSED UR1
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-02-18
Closed date
2021-09-16
Last modified date
2022-03-18

APAR is sysrouted FROM one or more of the following:

OA60928
APAR is sysrouted TO one or more of the following:

OA61050 UJ06693 UJ06694 UJ06695

Modules/Macros

```
IECTDSRV IECTDSR2
```

*Publications Referenced*
SA38067630

Fix information

Fixed component name
DEV SUPPORT TAP
Fixed component ID
5695DF110

Applicable component levels

R230 PSY UJ06693
UP21/10/05 P F110
R240 PSY UJ06694
UP21/10/05 P F110
R250 PSY UJ06695
UP21/10/05 P F110

Fix is available

Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.

Document Information

Modified date:
19 March 2022
