IBM Support

OA60929: TS7700 R5.2 CUIR STAGE 1B - UNHEALTHY CLUSTER VARY

A fix is available

APAR status

  • Closed as new function.

Error description

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * This APAR is part of the full function support for Release   *
    * 5.2.1 of the TS7700 Virtualization Engine D/T3957 and        *
    * provides enhanced CUIR support for an unhealthy (fenced)     *
    * cluster.                                                     *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * Enhanced CUIR Support for Release 5.2.1                      *
    * of the TS7700 Virtualization Engine.                         *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * For full support (z/OS V2R3 and above) OA60929 (Device       *
    * Services) will bring in the support from MVS Allocation      *
    * OA61050.                                                     *
    ****************************************************************
    

Problem conclusion

Temporary fix

Comments

  • TS7700 Release 5.2.1
    --------------------
    This APAR was developed as part of the support for Release 5.2.1
    of the TS7700 Virtualization Engine and enhances the CUIR support
    that was initially delivered (see APAR OA52376). The initial
    support enabled devices to be automatically varied offline and
    online when a TS7700 cluster was placed in service. This new
    support enables devices in an unhealthy (fenced) cluster to be
    automatically varied offline and online.
    .
    None of the other functions being delivered in Release 5.2.1
    of the TS7700 require host support.
    .
    Resources
    ---------
    For a discussion of the other TS7700 Release 5.2.1 enhancements,
    refer to: https://www.ibm.com/docs/en/ts7700-virtual-tape.
    For a detailed discussion of the CUIR support, refer to:
    https://www.ibm.com/support/pages/node/6355675.
    .
    Control Unit Initiated Reconfiguration (CUIR) - Unhealthy Cluster
    -----------------------------------------------------------------
    With this host support, if all clusters in the grid are at the
    5.2.1 release level, an automatic vary capability now
    exists for the TS7700 to notify the host that a distributed
    library (cluster) is having issues. This will enable each of
    the supporting host systems to automatically vary the devices
    offline and back online (for CUIR reasons). By default both of
    the vary notifications (offline and online) are disabled.
    The LIBRARY REQUEST command can be used to enable each of the
    automatic notifications:
    .
    - LIBRARY REQUEST,complib,CUIR,SETTING,FENCE,{ENABLE|DISABLE}
    - LIBRARY REQUEST,complib,CUIR,AONLINE,FENCE,{ENABLE|DISABLE}
    .
    In addition to the FENCE keyword above, SERVICE or ALL can
    also be specified. SERVICE enables the initial CUIR vary
    notification support and ALL enables both SERVICE and UNHEALTHY
    cluster varies.
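    .
    As an illustration only, the two command templates above can be
    assembled programmatically. The following Python sketch is not
    part of the host support; the function name cuir_request and the
    library name COMPLIB1 are hypothetical, and only the keywords
    (SETTING, AONLINE, SERVICE, FENCE, ALL, ENABLE, DISABLE) are
    taken from this APAR text.

```python
# Hypothetical helper: only the command template and keywords come from
# the APAR text. The function name and library name are illustrative.
FUNCTIONS = ("SETTING", "AONLINE")        # offline / online notification
VARY_TYPES = ("SERVICE", "FENCE", "ALL")  # which CUIR varies are affected


def cuir_request(complib: str, function: str, vary_type: str, enable: bool) -> str:
    """Build a LIBRARY REQUEST,complib,CUIR,... command string."""
    function = function.upper()
    vary_type = vary_type.upper()
    if function not in FUNCTIONS:
        raise ValueError(f"function must be one of {FUNCTIONS}")
    if vary_type not in VARY_TYPES:
        raise ValueError(f"vary_type must be one of {VARY_TYPES}")
    state = "ENABLE" if enable else "DISABLE"
    return f"LIBRARY REQUEST,{complib},CUIR,{function},{vary_type},{state}"


if __name__ == "__main__":
    # Enable the automatic offline vary for an unhealthy (fenced) cluster,
    # but leave the automatic online vary disabled so that devices can be
    # verified before they are brought back online.
    print(cuir_request("COMPLIB1", "SETTING", "FENCE", True))
    print(cuir_request("COMPLIB1", "AONLINE", "FENCE", False))
```

    The two calls in the example print the command that enables the
    offline notification and the command that leaves the online
    notification disabled. The sketch only formats text; issuing the
    commands remains an operator action.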
    .
    If device verification is needed before bringing the devices
    online, the online notification (AONLINE) can be left disabled
    and through the TS7700 management interface (MI), online
    notification can be manually triggered.
    .
    With this support (as with the initial CUIR support) when a
    tape device is varied offline for service or for an unhealthy
    (fenced) cluster, it is varied offline for CUIR reasons. If
    a device is offline for CUIR reasons, the reverse notification
    is needed to clear the CUIR state. If the device is subsequently
    offline for other reasons (path, operator, or library) it may
    remain in the offline state due to the other reasons. The
    existing LIBRARY DISPDRV command (CBR1220I) can be used to
    determine the reason that a device is offline, including the
    CUIR reason. If the CUIR state does not clear, the existing
    VARY xxxx,ONLINE,RESET command can be used.
    .
    When devices are varied offline for CUIR reasons, they go
    pending offline. Depending on the state of the device and the
    state of the cluster, the DDR SWAP command can then be used
    to move long running jobs to another device. For devices that
    are boxed, when the host receives notification to bring the
    devices back online, the host will attempt to bring the boxed
    devices back online.
    .
    The following LIBRARY REQUEST commands were added with
    the initial CUIR support:
    - LIBRARY REQUEST,libname,LDRIVE  (composite or distributed)
    - LIBRARY REQUEST,distlib,LDRIVE,GROUP,index
    The LIBRARY REQUEST LDRIVE commands can be used to determine
    the state of a CUIR notification request. Refer to the
    following TS7700 white paper for the LDRIVE command options
    and the output reported to the host:
    https://www.ibm.com/support/pages/node/6355675
    .
    The following operator command was added with the initial CUIR
    support:
    - DEVSERV QTAPE,xxxx,QHA
    The query host access (QHA) support for tape displays the
    systems that are online (grouped) to the specified tape device.
    If there are systems whose devices are not going offline, this
    will show the systems that are still online (grouped) to the
    specified device. Since for a period of time, only a subset of
    the systems may support the new unhealthy vary notification,
    the commands noted above will help determine if manual varies
    are needed from some of the systems.
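    .
    When several devices in a fenced cluster need to be checked, the
    QHA query can be generated once per device. The Python sketch
    below is an illustration only: the function name qha_commands
    and the sample device numbers are hypothetical, and it assumes
    that xxxx is a 4-digit hexadecimal device number.

```python
# Hypothetical helper: only the "DEVSERV QTAPE,xxxx,QHA" template comes
# from the APAR text. It is assumed here that xxxx is a 4-digit hex
# device number; the sample device numbers are illustrative.
from typing import Iterable, List


def qha_commands(devices: Iterable[int]) -> List[str]:
    """Return one DEVSERV QTAPE,xxxx,QHA command per device number."""
    commands = []
    for dev in devices:
        if not 0 <= dev <= 0xFFFF:
            raise ValueError(f"device number out of range: {dev!r}")
        # Format as four uppercase hex digits, e.g. 0x1A00 -> "1A00".
        commands.append(f"DEVSERV QTAPE,{dev:04X},QHA")
    return commands


if __name__ == "__main__":
    for command in qha_commands([0x1A00, 0x1A01]):
        print(command)
```

    The output of each query still has to be read by the operator to
    see which systems remain online (grouped) to the device.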
    .
    For IOS message updates related to the initial CUIR support
    refer to IOS APAR (OA52379). Updates specific to this support
    include:
    .
    IOS279I
    (Message Text) - REQUEST REASON displayed may be
    SERVICE or UNHEALTHY CLUSTER
    (Explanation) - updated to explain the unhealthy cluster reason
    The library has detected issues with a cluster (distributed
    library) in the grid (composite library) and has fenced the
    cluster. The library has initiated a reconfiguration request
    from a device to quiesce the specified set of devices in the
    unhealthy cluster. The Control Unit Initiated Reconfiguration
    (C.U.I.R.) service has received control to perform the request.
    Quiescing devices means to make devices unavailable for use so
    that they cannot be varied online and used while the cluster is
    in the fenced state.
    .
    IOS280I
    (Message Text) - REQUEST REASON displayed may be
    SERVICE or UNHEALTHY CLUSTER
    (Explanation) - updated to explain the unhealthy cluster reason
    The library has initiated a reconfiguration request from a
    device to resume the specified set of devices when the library
    is no longer in the fenced state. The Control Unit Initiated
    Reconfiguration (C.U.I.R.) service has received control to
    perform the request. Resuming devices means to make devices
    available for use when the cluster (or distributed library) is
    no longer considered to be in the fenced (unhealthy) state. The
    devices may have been varied online by the system or may have
    been made available to be varied online.
    .
    IOS281I
    No changes to the successful message.
    .
    IOS282I
    (Message Text) - REQUEST REASON displayed may be
    SERVICE or UNHEALTHY CLUSTER
    (Explanation) - updated to cover both the service and the
    unhealthy cluster reason
    The Control Unit Initiated Reconfiguration (C.U.I.R.) support
    attempted to quiesce the specified devices in order to satisfy
    the request specified in system message IOS279I (IOS1279I) but
    the devices could not be quiesced. The request may have failed
    because the current state of the device precludes it from being
    quiesced. This may be the case if the device is a JES3 managed
    device (C.U.I.R. is not supported). In some cases, it may be
    necessary to manually vary the device offline.
    (Operator Response) - updated to cover both the service and the
    unhealthy cluster reason
    You may need to manually vary the specified devices offline,
    which may entail an operator initiated DDR SWAP or the cancel
    of a job that has an allocated device. If the request was for a
    JES3 managed device, this failure is expected since JES3
    managed devices are not supported by this function. Otherwise
    contact your IBM service representative if the failures
    persist.
    .
    Note: For the unhealthy cluster vary, since a healthy cluster
    can also report on the state of its peer, the IOS messages
    above (IOS279I - IOS282I) may be issued multiple times. Also
    note that the CUIR support continues to be supported only when
    running natively on MVS; it is not supported for an MVS guest
    running under VM. In addition, a CUIR request for a JES3
    managed device is not supported, and will result in IOS282I
    (IOS1282I) being issued. The JES3 managed devices will be
    listed as devices that could not be brought offline for CUIR
    reasons. Lastly, if an IPL occurs while in a CUIR state, the
    CUIR state is not maintained across the IPL.
    .
    Additional Search Keywords:
    MSGIOS279I MSGIOS280I MSGIOS281I MSGIOS282I
    

APAR Information

  • APAR number

    OA60929

  • Reported component name

    DEV SUPPORT TAP

  • Reported component ID

    5695DF110

  • Reported release

    230

  • Status

    CLOSED UR1

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2021-02-18

  • Closed date

    2021-09-16

  • Last modified date

    2022-03-18

  • APAR is sysrouted FROM one or more of the following:

    OA60928

  • APAR is sysrouted TO one or more of the following:

    OA61050 UJ06693 UJ06694 UJ06695

Modules/Macros

  • IECTDSRV IECTDSR2
    

Publications Referenced
SA38067630    

Fix information

  • Fixed component name

    DEV SUPPORT TAP

  • Fixed component ID

    5695DF110

Applicable component levels

  • R230 PSY UJ06693

       UP21/10/05 P F110

  • R240 PSY UJ06694

       UP21/10/05 P F110

  • R250 PSY UJ06695

       UP21/10/05 P F110

Fix is available

  • Select the PTF appropriate for your component level. You will be required to sign in. Distribution on physical media is not available in all countries.

Document Information

Modified date:
19 March 2022