IBM Support

IT42566: SDREPLICATEUNRESOLVEDCHUNKSTHREAD EXIT CAN CAUSE REPLICATION STORAGE RULE HANG

APAR status

  • Closed as program error.

Error description

  • A replication storage rule can hang on the target replication
    server when the SdReplicateUnresolvedChunksThread exits due to a
    critical error.
    
    A replication storage rule hangs, and cannot be cancelled from
    the target server, after at least one failed replication storage
    rule has occurred. The hang results from replication storage rule
    transactions that run slower and slower because an enormous
    number of extents need to be updated in the database table
    SD_REFCOUNT_UPDATES. This table is used to manage the reference
    counts for the deduplication catalogue.
    
    The underlying issue is the backlog in the SD_REFCOUNT_UPDATES
    table. Upon startup, the thread that is responsible for managing
    that table enters a single-threaded mode while it tries to
    resolve in-flight entries from the failed replication. The
    queries against this table, which are used by the replication
    storage rule process, become slower and slower. Eventually the
    replication storage rule hangs on the IBM Spectrum Protect Server
    because of the large number of entries that the failed
    replication storage rule left in the SD_REFCOUNT_UPDATES table.
    This occurs only on the target server side of IBM Spectrum
    Protect Server.
    
    This APAR can be identified by the following steps:
    
    1) On the target server SERVER2, the output of QUERY PROCESS
    shows the following process:
    Protect: SERVER2>query process
    
    Process   Process Description   Job Id      Process Status                        Parent
     Number                                                                           Process
    --------  --------------------  ----------  ------------------------------------  --------
           2  Inbound replication           XX  Inbound Replication Storage Rule
              storage rule                      REPLICATION from source server
              REPLICATION from                  SERVER1, source process 2, source
              SERVER1                           job XY.
    
    An attempt to cancel the replication storage rule process
    returns the following messages:
    
    Protect: SERVER2>cancel process 2
    ANR0943E CANCEL PROCESS: Process 2 could not be cancelled.
    ANS8001I Return code 14.
    
    2) In the gstack output from the source server, check for:
    SdReplTcrPhaseCheckThread
    SdReplicateUnresolvedChunks
    
    If either of these threads is not seen in the gstack output,
    then this APAR applies.
    
    The gstack output can be gathered with the commands below, run
    from a root session on a Linux system:
    "ps -ef |grep dsmserv"
    Note the pid of the dsmserv process to be used in the following
    command:
    "gstack <dsmserv-pid>"
    
    
    IBM Spectrum Protect Versions Affected:
    IBM Spectrum Protect Server 8.1.13 and above on all supported
    platforms
    
    Additional Keywords:
    Spectrum Protect; TSM; stgrule; replication; hang;
    SdReplTcrPhaseCheckThread; SdReplicateUnresolvedChunks;
    TS008112620
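
The thread check in step 2 of the error description can be scripted. The following is a minimal sketch, not an IBM-supplied tool: the function name check_threads is hypothetical, and it assumes only that the gstack output is plain text in which the two thread names from this APAR appear verbatim when the threads are running.

```shell
#!/bin/sh
# Hypothetical helper: reads gstack output on standard input and reports
# which of the two thread names listed in this APAR are absent.
check_threads() {
    stack=$(cat)
    missing=""
    for name in SdReplTcrPhaseCheckThread SdReplicateUnresolvedChunks; do
        # A plain substring match is enough: the name only has to
        # appear somewhere in the captured stack text.
        if ! printf '%s\n' "$stack" | grep -q "$name"; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        # At least one thread is absent, so the APAR applies.
        echo "missing:$missing"
        return 1
    fi
    echo "both threads present"
    return 0
}

# Usage (root session on Linux, pid noted from "ps -ef |grep dsmserv"):
#   gstack <dsmserv-pid> | check_threads
```

A non-zero return status means at least one of the two threads was not found, which is the condition the error description gives for this APAR to apply.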
    

Local fix

  • The target server must be recycled (stopped and restarted)
    before it can receive another replication process.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * All Spectrum Protect users of replication storage rules      *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See error description.                                       *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is currently *
    * projected to be fixed in level 8.1.18. Note that this is     *
    * subject to change at the discretion of IBM.                  *
    ****************************************************************
    

Problem conclusion

  • This problem was fixed. Affected platforms:  AIX, Linux, and
    Windows.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT42566

  • Reported component name

    TSM SERVER

  • Reported component ID

    5698ISMSV

  • Reported release

    81L

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-11-24

  • Closed date

    2023-02-02

  • Last modified date

    2023-02-02

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    TSM SERVER

  • Fixed component ID

    5698ISMSV

Applicable component levels

  • Business Unit

    IBM Infrastructure w/TPS (BU058)

  • Product

    Tivoli Storage Manager (SSGSG7)

  • Platform

    Platform Independent (PF025)

  • Version

    81L

  • Line of Business

    Storage (LOB26)

Document Information

Modified date:
17 March 2023