IBM Support

IT45126: STGRULE REPLICATION HANG AFTER COMMUNICATION ERROR.

Subscribe to this APAR

By subscribing, you receive periodic emails alerting you to the status of the APAR, along with a link to the fix after it becomes available. You can track this item individually or track all items by product.

Notify me when this APAR changes.

Notify me when an APAR for this component changes.

 

APAR status

  • Closed as program error.

Error description

  • The storage rule replication processing reports as almost
    complete but does not finish and appears to be hung.
    
    
    QUERY PROCCESS on the SOURCE server shows the figures for work
    completed by the replication STGRULE process don't change over
    time.
    
    Evidence this issue is encountered:
    SHOW THREADS output on the SOURCE server shows
    NrReplicationThread thread hang at waiting for
    ctlP->phaseCompleteCond:
    Thread 3333, Parent 2222: NrReplicationThread, Storage 27763056,
    AllocCnt 13857 HighWaterAmt 27791488
    tid=139721881839360, ptid=139721920440064, det=1, zomb=0,
    join=0, result=0, sess=4444, procToken=50, sessToken=5555
    lwp=3864663
    Stack trace:
        0x7f174051c45c
    /lib64/libpthread.so.0(pthread_cond_wait+0x1fc)
             0x152d201
    /opt/tivoli/tsm/server/bin/dsmserv(pkWaitConditionTracked+0x91)
              0xb67b1e
    /opt/tivoli/tsm/server/bin/dsmserv(NrReplicationThread+0x4cfe)
             0x152a813
    /opt/tivoli/tsm/server/bin/dsmserv(StartThread+0x93)
        0x7f17405161ca /lib64/libpthread.so.0(UNKNOWN)
        0x7f173c867e73 /lib64/libc.so.6(clone+0x43)
    
    
     Awaiting cond ctlP->phaseCompleteCond (0x0x4544b20), using
    mutex ctlP->phaseCompleteMutex (0x0x7f14643da230), at
    nrmain.c(14075)
    
    
    A trace of SOURCE server (with trace classes REPL SDREPL
    SESSION) will have communication errors, and generates the
    ANR0483W message for current session terminated, and will make
    target session pair dead:
    03:06:41.100 [133164][smrepl.c][7871][smReplRtrv]:Object 9890010
    processed with rc 3014 (bfSize 288453632)
    03:06:41.100 [133164][smrepl.c][7885][smReplRtrv]:Exit: rc 3014,
    objId 9890010
    03:06:41.100
    [133164][nrfs.c][3216][NrReplicateBatch]:sourceRc=3014,
    bytesTransferred=0
    03:06:41.100 [133164][nrfs.c][7651][CheckRcForBatch]:Object
    9890010 had confirmation denied or sink error
    03:06:41.100 [133164][nrfs.c][2555][NrReplicateBatch]:Ending
    current txn
    03:06:41.100 [133164][smrepl.c][7477][smReplEndBatchTxn]:Enter:
    handle 0x7f98083acc40, inRc 1031
    03:06:41.100 [133164][smrepl.c][7504][smReplEndBatchTxn]:Sending
    ABORT for txn
    03:06:41.100 [133164][smrepl.c][7512][smReplEndBatchTxn]:Sending
    EndTxnEnhanced verb
    03:06:41.100 [133164][smrepl.c][7517][smReplEndBatchTxn]:Error
    -1 sending EndTxnEnhancedVerb
    03:06:41.100 [133164][smrepl.c][7620][smReplEndBatchTxn]:Exit:
    rc 2339, batchCount 0
    03:06:41.100 [133164][nrfs.c][2559][NrReplicateBatch]:Error 2339
    from smReplEndTxn
    ...
    03:06:41.129 [133164][smrepl.c][4163][smReplFreeSession]:Marking
    dead target session 758 to server LEVTSM03 source sess 11517.
    03:06:41.129 [133164][smrepl.c][4194][smReplFreeSession]:Closing
    out session 11517 to server LEVTSM03
    03:06:41.129 [133164][smexec.c][6026][smCloseSessionFasp]:Enter
    close FASP type FASP BIDI Session smSessP 0x7f98083aa3e0 for END
    state .
    03:06:41.131 [133164][smserv.c][3508][TraceMsg]:ANR0483W Session
    11517 for node LEVTSM03 (AIX) terminated - forced by
    administrator.~
    
    NrReplicationThread is stopped at waiting for uncompleted
    phases:
    03:18:49.535
    [133117][nrmain.c][14055][WaitForPhaseCompletion]:Phase
    1740037082017 is not completed
    03:18:49.535
    [133117][nrmain.c][14055][WaitForPhaseCompletion]:Phase
    1740037082054 is not completed
    03:18:49.535
    [133117][nrmain.c][14055][WaitForPhaseCompletion]:Phase
    1740037082071 is not completed
    03:18:49.536
    [133117][nrmain.c][14055][WaitForPhaseCompletion]:Phase
    1740037082311 is not completed
    03:18:49.536
    [133117][nrmain.c][14068][WaitForPhaseCompletion]:Waiting for 4
    incomplete phases to finish
    
    
    On the TARGET server side, in the trace (with trace classes
    REPL SDREPL SESSION) the ANR0479W is generated for the session
    terminated by server:
    02:10:48.473
    [21019][smrepl.c][2439][SmReplServerSession]:Received verb
    BackInsNormEnhanced (C5) on session 758
    02:10:48.473 [21019][smrepl.c][19985][TestArchRetention]:Exit,
    rc 0
    02:10:48.473 [21019][imrepl.c][7843][imProcReplBkObjInfo]:Entry
    02:10:48.473 [21019][imrepl.c][7997][imProcReplBkObjInfo]:Exit,
    rc=0, *objIdP=0, groupId=0
    02:10:48.521 [21019][smtrans.c][5701][SmRecvNextData]:Sending
    confirmation resp=YES
    02:10:48.522 [21019][smrepl.c][20974][TranslateToGRC]:Exit: rc
    0, *objRcP -1, confirmState NO
    02:10:48.522 [21019][smrepl.c][3144][SmReplServerSession]:Error
    -1 from SM Insert Verb
    02:10:48.523
    [21019][smrepl.c][3154][SmReplServerSession]:Absorbing object
    for failed transaction
    02:10:50.398 [21019][sdutil.c][3575][TraceMsg]:ANR0951I Session
    758 for node NODENAME processed 452 files by using inline data
    deduplication or compression, or both. The number of original
    bytes was 5,474,068,432. Inline data deduplication reduced the
    data by 0 bytes and inline compression reduced the data by 0
    bytes. ~
    02:10:50.399
    [21019][nrmain.c][7463][NrIsReplRecoveryProcessCancelled]:Exit:
    rc 0, procId 7
    02:10:50.399
    [21019][smrepl.c][3352][SmReplServerSession]:Replication session
    758 for SPSERVER on Linux/x86_64 failed, rc 9997~
    02:10:50.399
    [21031][smrepl.c][2439][SmReplServerSession]:Received verb
    BackInsNormEnhanced (C5) on session 770
    02:10:50.417 [21019][smexec.c][4920][TraceMsg]:ANR0479W Session
    758 for server SPSERVER (Linux/x86_64) terminated - connection
    with server severed.~
    02:10:50.418
    [21019][imrepl.c][8146][imFreeReplBatchInfo]:freeing
    igRow=12d11f5f0
    
    
    From the server trace GetUnsatisfiedDeps has phases waiting for
    an uncompleted phase as the dependency:
    02:22:18.151 [20982][sdrepl.c][4400][GetUnsatisfiedDeps]:PhaseI
    d=1740037082071has unsatisfied dependency 1740037082017
    02:22:18.151 [20982][sdrepl.c][4400][GetUnsatisfiedDeps]:PhaseI
    d=1740037082054has unsatisfied dependency 1740037082017
    02:22:18.151 [20982][sdrepl.c][4400][GetUnsatisfiedDeps]:PhaseI
    d=1740037082311has unsatisfied dependency 1740037082017
    02:22:22.453 [20982][sdrepl.c][4400][GetUnsatisfiedDeps]:PhaseI
    d=1740037082071has unsatisfied dependency 1740037082017
    02:22:22.453 [20982][sdrepl.c][4400][GetUnsatisfiedDeps]:PhaseI
    d=1740037082054has unsatisfied dependency 1740037082017
    02:22:22.453 [20982][sdrepl.c][4400][GetUnsatisfiedDeps]:PhaseI
    d=1740037082311has unsatisfied dependency 1740037082017
    
    In the SHOW THREADS output of TARGET server, the
    SdReplTcrPhaseCheckThread is hung waiting on condition
    ctlP->phaseCheckCond:
     Thread 20982, Parent 20959: SdReplTcrPhaseCheckThread, Storage
    18640, AllocCnt 17272 HighWaterAmt 125744
      tid=d4f6, ptid=2fdf, det=0, zomb=0, join=0, result=0, sess=0,
    procToken=7, sessToken=729
       Stack trace:
         0x090000000052d9a0 _cond_wait_global
         0x090000000052e69c _cond_wait
         0x090000000052f00c pthread_cond_wait
         0x000000010000b794 pkWaitConditionTracked
         0x0000000100cf1cb4 SdReplTcrPhaseCheckThread
         0x0000000100011a70 StartThread
       Awaiting cond ctlP->phaseCheckCond (0x116fa9cd0), using mutex
    ctlP->mutex (0x113b5b290), at sdrepl.c(2114)
    
    
    
    Additional Keywords:
    TS014556040 Spectrum TSM ANR0483W ANR0479W
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * All IBM Storage Protect server users.                        *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See error description.                                       *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available.                           *
    * This problem is currently projected to be fixed in level     *
    * 8.1.22.                                                      *
    * Note that this is subject to change at the discretion of     *
    * IBM.                                                         *
    ****************************************************************
    

Problem conclusion

  • This problem was fixed.
    Affected platforms for reported release: AIX, Linux, and
    Windows.
    Platforms fixed: AIX, Linux, and Windows.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT45126

  • Reported component name

    TSM SERVER

  • Reported component ID

    5698ISMSV

  • Reported release

    81A

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2023-12-12

  • Closed date

    2024-03-05

  • Last modified date

    2024-03-05

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    TSM SERVER

  • Fixed component ID

    5698ISMSV

Applicable component levels

[{"Business Unit":{"code":"BU029","label":"Software"},"Product":{"code":"SSGSG7","label":"Tivoli Storage Manager"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"81A"}]

Document Information

Modified date:
02 April 2024