Direct links to fixes
8.1.1.100-IBM-SPOC-WindowsX64
8.1.1.100-IBM-SPOC-Linuxx86_64
8.1.1.100-IBM-SPOC-Linuxs390x
8.1.1.100-IBM-SPOC-AIX
8.1.1.000-IBM-SPSRV-WindowsX64
8.1.1.000-IBM-SPSRV-Linuxs390x
8.1.1.000-IBM-SPSRV-AIX
8.1.1.000-IBM-SPSRV-Linuxx86_64
7.1.7.100-TIV-TSMSRV-WIN
7.1.7.100-TIV-TSMSRV-SolarisSPARC
7.1.7.100-TIV-TSMSRV-Linuxx86_64
7.1.7.100-TIV-TSMSRV-Linuxs390x
7.1.7.100-TIV-TSMSRV-Linuxppc64
7.1.7.100-TIV-TSMSRV-HP-UX
7.1.7.100-TIV-TSMSRV-AIX
IBM Spectrum Protect Server V8.1 Fix Pack 1 (V8.1.1) Downloads
IBM Spectrum Protect Server V7.1.7.X interim fix downloads
IBM Spectrum Protect Server V7.1 Fix Pack 8 (7.1.8.000) Downloads
APAR status
Closed as program error.
Error description
After canceling Node Replication on the source server, the target server may have many orphaned Node Replication sessions. Canceling these orphaned sessions resulted in the target instance becoming hung.

IBM Spectrum Protect Versions Affected: 7.1.3, 7.1.4, 7.1.5, 7.1.6 and 7.1.7

Collect the servermon.pl script data before and during the cancellation of orphaned Node Replication sessions. If that was not collected, then get the AIX procstack or UNIX pstack on the dsmserv process id. Wait 10 minutes and gather the procstack output again. Then get a core file produced with a kill -11 on the hung dsmserv process.

Key stack threads:

  in pth_cond._cond_wait_global at 0x9000000004ec260 ($t5533)
  0x9000000004ec260 (_cond_wait_global+0x4e0) e8410028 ld r2,0x28(r1)
  pth_cond._cond_wait_global(??, ??, ??) at 0x9000000004ec260
  pth_cond._cond_wait(??, ??, ??) at 0x9000000004ecdf4
  pth_cond.pthread_cond_wait(??, ??) at 0x9000000004edadc
  pkmon.pkWaitConditionTracked(??, ??, ??, ??, ??) at 0x100008f10
  sdprodcon.SdFreeWriteControl(??) at 0x1009c8e18
  sdutil.sdEndSession(??) at 0x100977450
  smrepl.SmReplServerSession(??) at 0x100358618
  smexec.DoReplServer(??, ??) at 0x100579ad8
  smexec.smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) at 0x10056ef64
  tcpcomm.psSessionThread(??) at 0x100545930
  pkthread.StartThread(0x0) at 0x10000da90

and

  _cond_wait_global(??, ??, ??) at 0x9000000004ec260
  _cond_wait(??, ??, ??) at 0x9000000004ecdf4
  pthread_cond_wait(??, ??) at 0x9000000004edadc
  pkWaitConditionTracked(??, ??, ??, ??, ??) at 0x100008f10
  SdFreeWriteControl(??) at 0x1009c8e18
  sdEndSession(??) at 0x100977450
  SmReplServerSession(??) at 0x100358618
  DoReplServer(??, ??) at 0x100579ad8
  smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) at 0x10056ef64
  psSessionThread(??) at 0x100545930
  StartThread(0x0) at 0x10000da90

There may be many of these program stacks:

  0x9000000004ec260 (_cond_wait_global+0x4e0) e8410028 ld r2,0x28(r1)
  pth_cond._cond_wait_global(??, ??, ??) at 0x9000000004ec260
  pth_cond._cond_wait(??, ??, ??) at 0x9000000004ecdf4
  pth_cond.pthread_cond_wait(??, ??) at 0x9000000004edadc
  pkmon.pkWaitConditionTracked(??, ??, ??, ??, ??) at 0x100008f10
  queue.DequeueVarQueue(??, ??, ??, ??, ??) at 0x100357820
  prodcons.ProdConsGetWork(??, ??) at 0x10093dd80
  prodcons.PcConsumerThread(??) at 0x10093d278

and many threads are "running", in pkDelayThread() waiting for the timer to end. Example:

  in pth_spinlock._global_lock_common at 0x9000000004c983c ($t6404)
  pth_spinlock._global_lock_common(??, ??, ??) at 0x9000000004c983c
  in pth_spinlock._global_lock_common at 0x9000000004c983c ($t6406)
  pth_spinlock._global_lock_common(??, ??, ??) at 0x9000000004c983c
  in pth_spinlock._global_lock_common at 0x9000000004c983c ($t6409)
  pth_spinlock._global_lock_common(??, ??, ??) at 0x9000000004c983c

Some threads are in BeginSession() which needs the SMV->mutex to proceed. Example:

  _global_lock_common(??, ??, ??) at 0x9000000004c983c
  _mutex_lock(??, ??, ??) at 0x9000000004d7104
  pkAcquireMutexTracked(??, ??, ??) at 0x1000078d4
  BeginSession() at 0x100571014
  smExecuteSession(??, ??, ??, ??, ??, ??, ??, ??) at 0x10056d988
  psSessionThread(??) at 0x100545930
  StartThread(0x0) at 0x10000da90

The deadlocked threads are smLockSessMutexTracked and smLockSessMutex attempting to get the same mutex. Example:

  in pth_spinlock._global_lock_common at 0x9000000004c983c ($t6404)
  0x9000000004c983c (_global_lock_common+0x4bc) e8410028 ld r2,0x28(r1)
  pth_spinlock._global_lock_common(??, ??, ??) at 0x9000000004c983c
  pth_mutex._mutex_lock(??, ??, ??) at 0x9000000004d7104
  pkmon.pkAcquireMutexTracked(??, ??, ??) at 0x1000078d4
  smutil.smLockSessMutexTracked(??, ??, ??) at 0x1002aac10
  smcancel.CancelSessionNum(??, ??) at 0x10028a04c
  smcancel.smCancelSession(??) at 0x1002893f8
  admcmd.AdmCommandLocal(??, ??, ??, ??, ??) at 0x1007af284
  admcmd.admCommand(??, ??, ??, ??, ??) at 0x1007acd40
  smadmin.SmAdminCommandThread(??) at 0x1008e0c10
  pkthread.StartThread(0x0) at 0x10000da90

  _global_lock_common(??, ??, ??) at 0x9000000004c983c
  _mutex_lock(??, ??, ??) at 0x9000000004d7104
  pkAcquireMutexTracked(??, ??, ??) at 0x1000078d4
  smLockSessMutexTracked(??, ??, ??) at 0x1002aac10
  CancelSessionNum(??, ??) at 0x10028a04c
  smCancelSession(??) at 0x1002893f8
  AdmCommandLocal(??, ??, ??, ??, ??) at 0x1007af284
  admCommand(??, ??, ??, ??, ??) at 0x1007acd40
  SmAdminCommandThread(??) at 0x1008e0c10
  StartThread(0x0) at 0x10000da90

Initial Impact: High

Additional Keywords: hung deadlock
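When comparing a saved procstack or pstack capture against the stacks quoted in this error description, a small script can count how often the signature frames appear. The following is a minimal sketch, not an IBM-supplied tool: the function names count_signature_frames and looks_like_it17970 are hypothetical, the frame list is copied from the stacks in this APAR, and the matching heuristic is an assumption that does not replace analysis by IBM Support.

```python
import re
from collections import Counter

# Frame names copied from the stacks quoted in this APAR.
SIGNATURE_FRAMES = (
    "SdFreeWriteControl",      # replication session threads waiting in sdEndSession
    "smLockSessMutexTracked",  # CANCEL SESSION thread waiting for the session mutex
    "BeginSession",            # new sessions waiting for SMV->mutex
    "PcConsumerThread",        # idle producer/consumer worker threads
)

# Match the frame name as a whole word followed by "(", with or without
# a "module." prefix such as "smutil." in front of it.
FRAME_RE = re.compile(r"\b(" + "|".join(SIGNATURE_FRAMES) + r")\(")


def count_signature_frames(dump_text):
    """Count occurrences of each signature frame in a stack dump."""
    counts = Counter({name: 0 for name in SIGNATURE_FRAMES})
    for line in dump_text.splitlines():
        for match in FRAME_RE.finditer(line):
            counts[match.group(1)] += 1
    return counts


def looks_like_it17970(counts):
    """Heuristic (assumption): a cancel thread blocked on the session mutex
    together with replication threads blocked in SdFreeWriteControl."""
    return counts["smLockSessMutexTracked"] > 0 and counts["SdFreeWriteControl"] > 0
```

To use the sketch, read the captured stack output into a string, pass it to count_signature_frames, and compare the counts from the first capture with those from the capture taken 10 minutes later. Counts that do not change between the two captures are consistent with a hang rather than slow progress.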
Local fix
Do not cancel target replication sessions.
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* All Tivoli Storage Manager server users.                     *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See error description.                                       *
****************************************************************
* RECOMMENDATION:                                              *
* Apply fixing level when available. This problem is currently *
* projected to be fixed in levels 7.1.7.100, 7.1.8, and 8.1.1. *
* Note that this is subject to change at the discretion of     *
* IBM.                                                         *
****************************************************************
Problem conclusion
This problem was fixed. Affected platforms: AIX, Solaris, Linux, and Windows.
Temporary fix
Comments
APAR Information
APAR number
IT17970
Reported component name
TSM SERVER
Reported component ID
5698ISMSV
Reported release
71A
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2016-11-15
Closed date
2016-12-12
Last modified date
2016-12-12
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
TSM SERVER
Fixed component ID
5698ISMSV
Applicable component levels
R71A PSY
UP
R71H PSY
UP
R71L PSY
UP
R71S PSY
UP
R71W PSY
UP
R81A PSY
UP
R81L PSY
UP
R81W PSY
UP
Document Information
Modified date:
25 September 2021