IBM Support

Server failover in MSCS might fail due to deadlock timeout

Troubleshooting


Problem

After an IBM Spectrum Protect server has been configured with Microsoft Cluster Server (MSCS) on Windows Server, a failover attempt might fail while the cluster waits for the DB2 resource to come online.

Cause

The cluster service waits for the DB2 resource to come online and polls it repeatedly. When the timeout is reached, the cluster performs a cleanup and switches the IBM Spectrum Protect server instance back to the primary node.

Environment

IBM Spectrum Protect Server on Microsoft Windows Cluster Server

Diagnosing The Problem

A review of the Cluster.log shows the following:

000019a0.00002538::2013/03/08-08:08:03.349 INFO [RES] DB2 Server <TSM1>: DB2WOLF1: Polled at IsAlive 8:08:08.008.
...
000019a0.00001830::2013/03/08-08:12:58.411 INFO [RES] DB2 Server <TSM1>: DB2WOLF1: Polled at IsAlive 8:08:08.008.
00001894.000018a4::2013/03/08-08:13:01.001 ERR [RHS] RhsCall::DeadlockMonitor: Call ONLINERESOURCE timed out for resource 'TSM-Server-TSM1'.
00001894.000018a4::2013/03/08-08:13:01.001 ERR [RHS] Resource TSM-Server-TSM1 handling deadlock. Cleaning current operation.
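
Entries like these can be located without reading the whole log. The following is a minimal sketch, assuming the FailoverClusters module is available on the node; the destination folder C:\Temp\ClusterLogs is a hypothetical path chosen for illustration and must already exist.

```powershell
Import-Module FailoverClusters

# Hypothetical working folder for the generated logs (assumption, adjust as needed).
$logDir = 'C:\Temp\ClusterLogs'

# Write the cluster log of each node into the folder.
Get-ClusterLog -Destination $logDir

# Show the deadlock-related entries found in the generated log files.
Select-String -Path (Join-Path $logDir '*.log') -Pattern 'DeadlockMonitor', 'handling deadlock'
```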

The cluster polls the DB2 Server resource with IsAlive calls and waits up to 5 minutes for the resource to respond. Five minutes (300000 milliseconds) is the default value of the DeadlockTimeout property. If the resource does not respond within that period, the cluster gives up and initiates a recovery action.
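
To confirm that a failure matches this pattern, the time between the first poll shown in the log and the timeout error can be compared with the DeadlockTimeout value. The following sketch uses the two timestamps from the excerpt above; the variable names are illustrative only.

```powershell
# Timestamps copied from the Cluster.log excerpt above.
$format    = 'yyyy/MM/dd-HH:mm:ss.fff'
$culture   = [System.Globalization.CultureInfo]::InvariantCulture
$firstPoll = [datetime]::ParseExact('2013/03/08-08:08:03.349', $format, $culture)
$timedOut  = [datetime]::ParseExact('2013/03/08-08:13:01.001', $format, $culture)

# Elapsed time between the first IsAlive poll shown and the deadlock timeout.
$elapsed = $timedOut - $firstPoll
$elapsed.TotalSeconds
```

For this excerpt the result is just under 300 seconds, which is consistent with the default DeadlockTimeout of 5 minutes.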

Resolving The Problem

The DeadlockTimeout property specifies the period (in milliseconds) of the deadlock detection heartbeat. Cluster resources are expected to respond to IsAlive or LooksAlive calls within a few hundred milliseconds, so waiting 5 minutes for a resource to respond is considered a long time. If a resource that normally responds within milliseconds takes longer than 5 minutes, an application or performance problem can be presumed.
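
Before changing anything, it can help to see the value currently in effect for each resource. The following is a minimal sketch, assuming it is run on a cluster node with the FailoverClusters module available; the TimeoutMinutes column is a calculated name added here for readability and is not a cluster property.

```powershell
Import-Module FailoverClusters

# List every cluster resource with its DeadlockTimeout in milliseconds,
# plus the same value converted to minutes (calculated column, illustrative).
Get-ClusterResource |
    Select-Object Name, State, DeadlockTimeout,
        @{ Name = 'TimeoutMinutes'; Expression = { $_.DeadlockTimeout / 60000 } }
```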

It is generally recommended to leave DeadlockTimeout at its default value. In some cases, however, it can be increased to give the resource more time to respond. For example, if the secondary node is not identical to the primary node (for example, it has less memory or fewer CPU cores), the application can take longer to complete initialization.

Example:
To increase the DeadlockTimeout, open a PowerShell session and first import the FailoverClusters module:
PS C:\> Import-Module FailoverClusters

After the FailoverClusters module has been imported, use the Get-ClusterResource cmdlet to set the DeadlockTimeout property to 10 minutes (600000 milliseconds):
PS C:\> (Get-ClusterResource "TSM-Server-TSM1").DeadlockTimeout = 600000

Verify the new value with the same command, omitting the assignment:
PS C:\> (Get-ClusterResource "TSM-Server-TSM1").DeadlockTimeout
600000

Note: The timeout should be set to a value greater than the time that database activation takes. Activation time depends on the database size and can be long for various reasons (for example, slow disk storage or crash recovery). If the time needed is unusually high, this might indicate an issue, and support should be consulted.


Document Information

Modified date:
17 June 2018

UID

swg21634738