IBM Support

IC81596: NODE REPLICATION CAN HANG WHEN MULTIPLE THREADS PROCESS OBJECTS OF THE SAME PEER GROUP.

APAR status

  • Closed as program error.

Error description

  • This is a timing problem that can occur when multiple node
    replication threads process objects belonging to the same peer
    group on the source server.  The hang occurs on the target
    server.  It is most likely during node replication of Windows
    2008 clients that have backups of systemstate objects.  Other
    kinds of peer groups exist, but the systemstate backup objects
    of Windows 2008 clients currently form the largest such groups.
    
    Customer/L2 Diagnostics (if applicable)
    The thread that is hung waiting on DB2 has the
    ImReplFindOrAddGroup function in its call stack:
    
     Thread 84693, Parent 54: psSessionThread, Storage 11416396, AllocCnt 601411 HighWaterAmt 11617472
      tid=102d5, ptid=2136, det=1, zomb=0, join=0, result=0, sess=27441
       Holding mutex txnP->mutex (0x117794118), acquired at tbcli.c(1248)
       Stack trace:
         0x0900000000262510 semop
         0x0900000000e92640 sqloSSemP
         0x0900000000e92084 .sqlccipcrecv.fdpr.clone.756__FP17SQLCC_COMHANDLE_TP12SQLCC_COND_T
         0x0900000000e92fe4 .sqlccrecv.fdpr.clone.125
         0x0900000000e92ce8 sqljcReceive__FP10sqljCmnMgr
         0x0900000000e9d8e0 sqljrDrdaArExecute__FP14db2UCinterfaceP9UCstpInfo
         0x090000000127c008 CLI_sqlExecute__FP17CLI_STATEMENTINFOP19CLI_ERRORHEADERINFO
         0x0900000001285450 SQLExecute2__FP17CLI_STATEMENTINFOP19CLI_ERRORHEADERINFO
         0x09000000010be1b0 SQLExecute
         0x0000000100133564 RdbPrepareAndExecuteStmt
         0x000000010012f250 RdbCliUpdate
         0x000000010012ee64 tbCliSRUpd
         0x00000001005d65c8 ImReplFindOrAddGroup    <--- look for this
         0x00000001005d5a50 imProcReplBkObjInfo
         0x000000010059a95c SmDoBackInsNormEnhanced
         0x000000010060b40c SmReplServerSession
         0x0000000100192f74 DoReplServer
         0x000000010018a8b8 smExecuteSession
         0x0000000100057008 psSessionThread
         0x0000000100020760 StartThread
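
The check described above (look for ImReplFindOrAddGroup in the stack of a thread that is blocked inside a DB2 call) could be automated along the following lines. This is only a minimal sketch: the function name find_suspect_threads, the choice of SQLExecute as the "blocked in DB2" marker, and the abbreviated SAMPLE_DUMP are illustrative assumptions, not part of any TSM tooling.

```python
import re

# Matches thread header lines such as:
#   "Thread 84693, Parent 54: psSessionThread, Storage ..."
THREAD_HEADER = re.compile(r"^\s*Thread (\d+), Parent \d+:")

def find_suspect_threads(dump_text,
                         marker="ImReplFindOrAddGroup",
                         blocker="SQLExecute"):
    """Return ids of threads whose stack has both marker and blocker.

    Assumption: a thread is 'suspect' when it is inside the marker
    function and also inside a DB2 CLI call (the blocker frame).
    """
    stacks = {}      # thread id -> list of lines following its header
    current = None
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            current = int(match.group(1))
            stacks[current] = []
        elif current is not None:
            stacks[current].append(line.strip())
    return [tid for tid, frames in stacks.items()
            if any(marker in f for f in frames)
            and any(blocker in f for f in frames)]

# Abbreviated excerpt of the two stacks shown in this APAR.
SAMPLE_DUMP = """
 Thread 84693, Parent 54: psSessionThread, Storage 11416396,
   Stack trace:
     0x0900000000262510 semop
     0x09000000010be1b0 SQLExecute
     0x000000010012ee64 tbCliSRUpd
     0x00000001005d65c8 ImReplFindOrAddGroup
 Thread 84681, Parent 54: psSessionThread, Storage 7774573,
   Stack trace:
     0x09000000004bd2e0 pthread_cond_wait
     0x00000001000bca9c tmLockTracked
     0x00000001005d5f1c ImReplFindOrAddGroup
"""

print(find_suspect_threads(SAMPLE_DUMP))
```

On this excerpt only thread 84693 is reported, because thread 84681 is waiting in tmLockTracked rather than in a DB2 call.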
    
    
    The SHOW TXNT outputs for this thread:
     Tsn=0:31570620, Resurrected=False, InFlight=True,
    Distributed=False, Persistent=True, Addr
    11dbebdf8
       Start ThreadId=84693, Timestamp=02/18/12 09:35:32,
    Creator=smrepl.c(7248)
       Last known in use by ThreadId=84693
       Participants=3, summaryVote=ReadOnly
       EndInFlight False, endThreadId 0, tmidx 0 0,
    processBatchCount 0, mustAbort False.
         Participant DB: voteReceived=False, ackReceived=False
           DB: in-flight Txn 117797af8, skipped its detail.
         Participant IM: voteReceived=False, ackReceived=False
         Participant BF: voteReceived=False, ackReceived=False
    
    A second transaction for the same thread holds the lock of
    type 19006 (im repl group):
    
     Tsn=0:31585228, Resurrected=False, InFlight=True,
    Distributed=False, Persistent=True, Addr
    116718118
       Start ThreadId=84693, Timestamp=02/18/12 09:35:51,
    Creator=imrepl.c(6028)
       Last known in use by ThreadId=84693
       Participants=1, summaryVote=ReadOnly
       EndInFlight False, endThreadId 0, tmidx 0 0,
    processBatchCount 0, mustAbort False.
         Participant DB: voteReceived=False, ackReceived=False
           DB: Txn 117792758, ReadOnly(YES), connP=1177a1458,
    applHandle=7537, openTbls=5:
           DB: --> OpenP=1196d54b8 for table=InFlight.ReplGroups.
           DB: --> OpenP=129aa5818 for table=Extended.Attributes.
           DB: --> OpenP=1221bbd98 for table=Nodes.
           DB: --> OpenP=11ce99658 for table=Policy.Domain.Members.
           DB: --> OpenP=116d04858 for table=Server.Connect.
       Locks held by Tsn=0:31585228 :
         Type=19006(im repl group), NameSpace=0, SummMode=xLock,
    Mode=xLock, Key='162862599.2'
    
    
    The application information on the DB2 side shows where the
    hang occurs.  The statement runs at uncommitted read (UR)
    isolation, and the application is in Lock-wait status because
    the row was updated by another transaction that has not yet
    committed:
    
    Application :
      Address :                0x0780000005A40080
      AppHandl [nod-index] :   7544     [000-07544]
      TranHdl :                216
      Application PID :        4784304
      Application Node Name :  tsm02fm
      IP Address:              n/a
      Connection Start Time :  (1329571538)Sat Feb 18 08:25:38 2012
      Client User ID :         tsminst1
      System Auth ID :         TSMINST1
      Coordinator EDU ID :     68210
      Coordinator Partition :  0
      Number of Agents :       1
      Locks timeout value :    NotSet
      Locks Escalation :       No
      Workload ID :            1
      Workload Occurrence ID : 126868
      Trusted Context :        n/a
      Connection Trust Type :  non trusted
      Role Inherited :         n/a
      Application Status :     Lock-wait
      Application Name :       dsmserv
      Application ID :         *LOCAL.tsminst1.120218133227
      ClientUserID :           n/a
      ClientWrkstnName :       n/a
      ClientApplName :         n/a
      ClientAccntng :          n/a
      CollectActData:          N
      CollectActPartition:     C
      SectionActuals:          N
    
      List of active statements :
       *UOW-ID :          21
        Activity ID :     17
        Package Schema :  NULLID
        Package Name :    SYSSN100
        Package Version :
        Section Number :  17
        SQL Type :        Dynamic
        Isolation :       UR
        Statement Type :  DML, Insert/Update/Delete
        Statement :       UPDATE "TSMDB1"."INFLIGHT_REPLGROUPS" SET FLAGS=? WHERE (SRC_GROUPID=? AND GROUPTYPE=?) --84693
    
    
    The DB2 lock information for this lock:
    
    LRB             State Status Mode Dur CMode CDur Flags TranHandle HoldCount lsoFeedback  CursorBitmap AppHandle rrIID
    --------------- ----- ------ ---- --- ----- ---- ----- ---------- --------- ------------ ------------ --------- -----
    a000602617ed300 L     G      ..X  1   NON   0    0000  81         0         784489098843 40000000     0-7524    0000
    a0006004b2e7100 L     W      ..X  1   NON   0    0000  216        0         0            40000000     0-7544    0000
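
To pair the waiting application with the one that holds the lock, the rows above can be split into their 14 columns and grouped by the Status column. The sketch below assumes that Status G means the lock is granted and W means the request is waiting; the helper names parse_lock_rows and holders_and_waiters are hypothetical. Because it tokenizes on whitespace, it works whether or not the rows are wrapped across lines.

```python
# Column names taken from the header of the DB2 lock output above.
COLUMNS = ["LRB", "State", "Status", "Mode", "Dur", "CMode", "CDur",
           "Flags", "TranHandle", "HoldCount", "lsoFeedback",
           "CursorBitmap", "AppHandle", "rrIID"]

def parse_lock_rows(row_text):
    """Split whitespace-separated lock rows into dicts keyed by column."""
    tokens = row_text.split()
    width = len(COLUMNS)
    if len(tokens) % width != 0:
        raise ValueError("token count is not a multiple of the column count")
    return [dict(zip(COLUMNS, tokens[i:i + width]))
            for i in range(0, len(tokens), width)]

def holders_and_waiters(rows):
    """Return (holders, waiters) as lists of AppHandle values.

    Assumption: Status 'G' = granted (holder), 'W' = waiting.
    """
    holders = [r["AppHandle"] for r in rows if r["Status"] == "G"]
    waiters = [r["AppHandle"] for r in rows if r["Status"] == "W"]
    return holders, waiters

# The two data rows from this APAR, in their wrapped form.
SAMPLE_ROWS = """
    a000602617ed300     L      G  ..X   1   NON    0   0000
81         0 784489098843     40000000    0-7524 0000
    a0006004b2e7100     L      W  ..X   1   NON    0   0000
216         0 0                40000000    0-7544 0000
"""

holders, waiters = holders_and_waiters(parse_lock_rows(SAMPLE_ROWS))
print("held by:", holders, "waited on by:", waiters)
```

Here application handle 7544 (the UPDATE issued on behalf of thread 84693) waits on a lock granted to application handle 7524.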
    
    The SHOW TXNT entry for the application handle (7524) that
    holds this lock; it belongs to thread 84681:
    
     Tsn=0:31579200, Resurrected=False, InFlight=True,
    Distributed=False, Persistent=True, Addr
    11c8cff78
       Start ThreadId=84681, Timestamp=02/18/12 09:35:44,
    Creator=smrepl.c(7248)
       Last known in use by ThreadId=84681
       Participants=4, summaryVote=ReadOnly
       EndInFlight False, endThreadId 0, tmidx 0 0,
    processBatchCount 0, mustAbort False.
         Participant DB: voteReceived=False, ackReceived=False
           DB: Txn 116860fb8, ReadOnly(NO), connP=117758c98,
    applHandle=7524, openTbls=13:
           DB: --> OpenP=11b099298 for table=Restore.Sessions.
           DB: --> OpenP=11f85c698 for table=Group.Leaders.
           DB: --> OpenP=11c621ed8 for table=InFlight.ReplGroups.
           DB: --> OpenP=116d30878 for table=Replicated.Objects.Reverse.
           DB: --> OpenP=1205fccb8 for table=Filespaces.
           DB: --> OpenP=1205fbb18 for table=Backup.Objects.
           DB: --> OpenP=123768df8 for table=Archive.Objects.
           DB: --> OpenP=123768a58 for table=Replicated.Objects.
           DB: --> OpenP=12a16ee58 for table=Extended.Attributes.
           DB: --> OpenP=11ffbbaf8 for table=Nodes.
           DB: --> OpenP=117d5a498 for table=Replicating.Servers.
           DB: --> OpenP=117c59938 for table=Server.Connect.
           DB: --> OpenP=116deef18 for table=Server.Connect.Info.
         Participant IM: voteReceived=False, ackReceived=False
         Participant BF: voteReceived=False, ackReceived=False
         Participant SS: voteReceived=False, ackReceived=False
       Locks held by Tsn=0:31579200 :
         Type=46001(bf aggregate (superbitfile) id), NameSpace=0,
    SummMode=xLock, Mode=xLock,
    Key='209578738'
    
    This is the call stack for thread 84681:
     Thread 84681, Parent 54: psSessionThread, Storage 7774573, AllocCnt 772586 HighWaterAmt 11703935
      tid=f7c9, ptid=2136, det=1, zomb=0, join=0, result=0, sess=27429
       Awaiting cond waitP->waiting (0x110bca820), using mutex TMV->mutex (0x110c6cd38), at tmlock.c(749)
       Stack trace:
         0x09000000004bba60 _cond_wait_global
         0x09000000004bc5f8 _cond_wait
         0x09000000004bd2e0 pthread_cond_wait
         0x0000000100007644 pkWaitConditionTracked
         0x00000001000bca9c tmLockTracked
         0x00000001005d5f1c ImReplFindOrAddGroup
         0x00000001005d5a50 imProcReplBkObjInfo
         0x000000010059a95c SmDoBackInsNormEnhanced
         0x000000010060b40c SmReplServerSession
         0x0000000100192f74 DoReplServer
         0x000000010018a8b8 smExecuteSession
         0x0000000100057008 psSessionThread
         0x0000000100020760 StartThread
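
Taken together, the diagnostics suggest a cross-resource wait cycle: thread 84693 holds the server lock of type 19006 and waits on a DB2 row lock, while thread 84681 (application handle 7524) holds that DB2 lock and waits in tmLockTracked. The sketch below models this as a wait-for graph and looks for a cycle. It assumes that thread 84681 is waiting for the same 19006 lock, which the output above implies but does not state outright; the resource labels and the function find_cycle are illustrative only.

```python
def find_cycle(holds, waits_for):
    """Find a cycle of threads that block one another.

    holds:     resource -> thread that currently owns it
    waits_for: thread   -> resource it is waiting for (at most one)
    Returns the list of threads in the cycle, or None.
    """
    # Reduce to thread -> thread edges ("is blocked by").
    blocked_by = {thread: holds[res]
                  for thread, res in waits_for.items() if res in holds}
    for start in blocked_by:
        path = [start]
        seen = {start}
        node = blocked_by.get(start)
        while node is not None:
            if node == start:
                return path          # walked back to the start: a cycle
            if node in seen:
                break                # a cycle that does not include start
            seen.add(node)
            path.append(node)
            node = blocked_by.get(node)
    return None

# Ownership as read from the SHOW TXNT and DB2 lock output above.
HOLDS = {
    "tsm-lock-19006": 84693,   # Type=19006 (im repl group), xLock
    "db2-row-lock": 84681,     # granted to application handle 7524
}
# Assumption: 84681 is waiting for the 19006 lock inside tmLockTracked.
WAITS_FOR = {
    84693: "db2-row-lock",
    84681: "tsm-lock-19006",
}

print(find_cycle(HOLDS, WAITS_FOR))
```

Neither lock manager sees the whole cycle, because one edge is a DB2 row lock and the other is a server-level lock, which is consistent with the hang not being broken by a deadlock detector.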
    
    Platforms affected:
    TSM 6.3 Unix Linux Windows
    
    Initial Impact: Medium
    
    Additional Keywords:  hung ZZ63 replicate nodegroup
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED: All Tivoli Storage Manager server users.     *
    ****************************************************************
    * PROBLEM DESCRIPTION: See error description.                  *
    ****************************************************************
    * RECOMMENDATION: Apply fixing level when available. This      *
    *                 problem is currently projected to be fixed   *
    *                 in level 6.3.2. Note that this is            *
    *                 subject to change at the discretion of IBM.  *
    ****************************************************************
    

Problem conclusion

  • This problem was fixed.
    Affected platforms:  AIX, HP-UX, Solaris, Linux, and Windows.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IC81596

  • Reported component name

    TSM SERVER

  • Reported component ID

    5698ISMSV

  • Reported release

    63A

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2012-02-22

  • Closed date

    2012-04-11

  • Last modified date

    2012-04-11

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    TSM SERVER

  • Fixed component ID

    5698ISMSV

Applicable component levels

  • R63A PSY

       UP

  • R63H PSY

       UP

  • R63L PSY

       UP

  • R63S PSY

       UP

  • R63W PSY

       UP

Document Information

Modified date:
11 April 2012