Troubleshooting
Problem
Long-running LAN-free operations fail with the following error: ANR0479W (ANE4985I Session: 1, Origin: STA) Session 2 for server TSM (AIX) terminated - connection with server severed.
Symptom
A new function was added at the Tivoli Storage Manager server 6.3.4.200 and 7.1.0 levels. It introduces three configurable keep-alive options that prevent sessions from being terminated by network firewall timeouts.
This function was introduced with APAR IC92088 to prevent network sessions from being terminated by a network firewall timeout during server replication. It is also available for LAN-free backups.
The new Tivoli Storage Manager server (and storage agent) options are:
KEEPALIVE yes/no
KEEPALIVETIME seconds
KEEPALIVEINTERVAL seconds
For example:
KEEPALIVE yes
KEEPALIVETIME 600
KEEPALIVEINTERVAL 30
With the above options, the Tivoli Storage Manager server will enable the KeepAlive socket function and keep-alive transmissions will be sent every 10 minutes (600 seconds). If no response is received, it will repeat keep-alive transmissions every 30 seconds.
So, instead of increasing the network firewall timeout or decreasing the IDLETIMEOUT/RESOURCETIMEOUT values, the new options can be set in both the Tivoli Storage Manager server and storage agent options files to prevent sessions from being terminated by a timeout. Additional information about these parameters can be found at the following links:
KEEPALIVE
KEEPALIVEINTERVAL
KEEPALIVETIME
As noted in that documentation, starting with Tivoli Storage Manager version 7.1.3, "keepalive yes" is the default value.
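The three options described above resemble the standard TCP keep-alive controls that most operating systems expose on a socket. The following Python sketch is an illustration only: it is not the server's implementation, and the mapping of KEEPALIVE, KEEPALIVETIME, and KEEPALIVEINTERVAL onto these socket options is an assumption. The helper name enable_keepalive is hypothetical.

```python
import socket


def enable_keepalive(sock, idle_seconds=600, interval_seconds=30):
    """Turn on TCP keep-alive probes for an existing socket (illustrative only)."""
    # Portable on/off switch; assumed analogue of "KEEPALIVE yes".
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Idle time before the first probe; assumed analogue of KEEPALIVETIME.
    # The constant is platform-specific (present on Linux), so check for it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)

    # Gap between repeated probes when no reply arrives; assumed analogue
    # of KEEPALIVEINTERVAL.
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s, idle_seconds=600, interval_seconds=30)
    print("keep-alive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

On platforms that lack TCP_KEEPIDLE or TCP_KEEPINTVL, the sketch enables keep-alive only and leaves the timing at the operating system defaults.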
Cause
During LAN-free backup or restore operations, the storage agent starts a session on the Tivoli Storage Manager server that acts as a proxy session for the client node. This proxy session remains idle on the Tivoli Storage Manager server while the storage agent is actively reading (restore) or writing (backup) data to tape. For large backup or restore operations, the connection between the storage agent and the Tivoli Storage Manager server can be closed by a firewall timeout. Most firewalls have a timeout that dictates how long an inactive connection can remain open before it is closed. If the connection between the storage agent and the Tivoli Storage Manager server is closed, the LAN-free operation fails regardless of how much data has been successfully backed up or restored.
Diagnosing The Problem
Because communications between the storage agent and Tivoli Storage Manager server are being terminated externally, trace data captured on the storage agent and/or server will not show any definitive evidence of a firewall timeout. The trace data will likely only show a lengthy delay in communications between the storage agent and server (expected because the session is idle), and then an error similar to the following will be seen in the storage agent trace:
- 06:37:00.290 [15][smlfsta.c][6039][DoSendVerbToServer]:Exiting on termReason for srvSess 1.
This error simply indicates that an attempt by the storage agent to send data to the Tivoli Storage Manager server failed because the existing connection for the proxy session had been severed.
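When a storage agent trace is large, it can help to search it for this message programmatically. The following Python sketch scans trace lines for the message shown above and returns the affected server session numbers. The regular expression is modeled on the single sample line in this document, so the exact trace layout is an assumption, and find_severed_sessions is a hypothetical helper name.

```python
import re

# Pattern modeled on the one sample trace line in this document;
# real traces may format these fields differently.
TERM_PATTERN = re.compile(
    r"\[(?P<source>[^\]]+)\]\[(?P<line>\d+)\]\[DoSendVerbToServer\]:"
    r"Exiting on termReason for srvSess (?P<session>\d+)"
)


def find_severed_sessions(trace_lines):
    """Return the srvSess numbers from lines that match the termReason message."""
    sessions = []
    for text in trace_lines:
        match = TERM_PATTERN.search(text)
        if match:
            sessions.append(int(match.group("session")))
    return sessions


if __name__ == "__main__":
    sample = [
        "- 06:37:00.290 [15][smlfsta.c][6039][DoSendVerbToServer]:"
        "Exiting on termReason for srvSess 1.",
    ]
    print(find_severed_sessions(sample))  # prints [1]
```

A match confirms only that the send failed on a severed connection. As noted above, the trace itself cannot prove that a firewall caused it.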
Resolving The Problem
The storage agent attempts to prevent firewall timeouts from prematurely terminating idle proxy sessions by periodically sending a ping, or keep-alive, verb to the Tivoli Storage Manager server. The frequency of these keep-alive verbs will vary depending on the version of Tivoli Storage Manager server code.
For server versions older than 6.3.4.200, the interval at which the storage agent sends these keep-alive verbs to the server is determined by the defined RESOURCETIMEOUT and IDLETIMEOUT values: the interval is the lesser of the two values divided by 4. For example, if RESOURCETIMEOUT is defined as 60 minutes and IDLETIMEOUT as 30 minutes, the storage agent attempts to send keep-alive verbs to the server roughly every 7 minutes (30 minutes / 4 = 7.5). When the RESOURCETIMEOUT and IDLETIMEOUT values are set high (e.g. 1200 minutes), the likelihood of idle proxy sessions being terminated by firewall timeouts increases significantly, because the storage agent may not send keep-alive verbs to the Tivoli Storage Manager server frequently enough to prevent the timeout.
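The interval rule for servers older than 6.3.4.200 can be written as a short calculation. This Python sketch only restates the arithmetic described above. The function name is hypothetical, and whether the storage agent rounds or truncates the result is not stated here, so the sketch returns the unrounded value.

```python
def keepalive_interval_minutes(resourcetimeout_minutes, idletimeout_minutes):
    """Lesser of the two timeout values, divided by 4 (unrounded)."""
    return min(resourcetimeout_minutes, idletimeout_minutes) / 4


if __name__ == "__main__":
    # Example from the text: RESOURCETIMEOUT 60, IDLETIMEOUT 30.
    print(keepalive_interval_minutes(60, 30))      # 7.5
    # Both values set high, as in the 1200-minute example.
    print(keepalive_interval_minutes(1200, 1200))  # 300.0
```

With both values at 1200 minutes, the verbs are five hours apart, which is far longer than a typical firewall idle timeout.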
To prevent firewall timeouts from terminating long-running LAN-free operations, the RESOURCETIMEOUT and/or IDLETIMEOUT value can be defined so that the interval between keep-alive verbs is shorter than the firewall timeout. For example, if the firewall timeout is 60 minutes, then either the RESOURCETIMEOUT or IDLETIMEOUT value should be less than 240 minutes. If the RESOURCETIMEOUT and IDLETIMEOUT values cannot be reduced because doing so would introduce other problems, the firewall timeout should be increased accordingly.
Beginning with the 6.3.4.200 version of server code, three new parameters were introduced that allow users to configure the keep-alive settings appropriately for their environment:
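The sizing rule above can be checked with a small calculation. This Python sketch assumes the pre-6.3.4.200 behavior described earlier (interval = lesser timeout / 4) and treats an interval equal to the firewall timeout as unsafe. Both function names are hypothetical.

```python
def timeout_upper_bound_minutes(firewall_timeout_minutes):
    """Exclusive upper bound for the lesser of RESOURCETIMEOUT and IDLETIMEOUT."""
    return firewall_timeout_minutes * 4


def is_exposed_to_firewall_timeout(resourcetimeout_minutes,
                                   idletimeout_minutes,
                                   firewall_timeout_minutes):
    """True when keep-alive verbs would not arrive before the firewall timeout."""
    interval = min(resourcetimeout_minutes, idletimeout_minutes) / 4
    return interval >= firewall_timeout_minutes


if __name__ == "__main__":
    print(timeout_upper_bound_minutes(60))                 # 240
    print(is_exposed_to_firewall_timeout(1200, 1200, 60))  # True
    print(is_exposed_to_firewall_timeout(60, 30, 60))      # False
```

Because only the lesser of the two values drives the interval, lowering either RESOURCETIMEOUT or IDLETIMEOUT below the bound is enough.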
KEEPALIVE <yes/no>
KEEPALIVETIME <seconds>
KEEPALIVEINTERVAL <seconds>
The KEEPALIVE parameter indicates whether keep-alive verbs are sent for outbound TCP sockets. The KEEPALIVETIME value specifies how frequently the keep-alive verbs are sent. The KEEPALIVEINTERVAL value determines how often the keep-alive verbs are repeated if no response is received. These parameters can be specified in the options files used by the Tivoli Storage Manager server (dsmserv.opt) and/or the storage agents (dsmsta.opt). Additional information about these parameters can be found in the product documentation for the KEEPALIVE, KEEPALIVEINTERVAL, and KEEPALIVETIME options.
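As a rough sanity check, the keep-alive settings in an options file can be compared against the firewall timeout. The Python sketch below makes several assumptions: that each option appears on its own line as a name followed by a value, that lines beginning with an asterisk are comments, and that product defaults can be ignored (they are not modeled). The function names are hypothetical, and nothing here is part of the product.

```python
KEEPALIVE_OPTIONS = {"KEEPALIVE", "KEEPALIVETIME", "KEEPALIVEINTERVAL"}


def read_keepalive_options(options_text):
    """Collect keep-alive options from options-file text (assumed 'NAME value' lines)."""
    found = {}
    for raw_line in options_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("*"):  # assumed comment marker
            continue
        parts = line.split(None, 1)
        name = parts[0].upper()
        if name in KEEPALIVE_OPTIONS and len(parts) == 2:
            found[name] = parts[1].strip()
    return found


def keepalive_precedes_firewall(options, firewall_timeout_seconds):
    """True only if KEEPALIVE is yes and an explicit KEEPALIVETIME is below the timeout."""
    if options.get("KEEPALIVE", "").lower() != "yes":
        return False  # defaults are not modeled in this sketch
    if "KEEPALIVETIME" not in options:
        return False
    return int(options["KEEPALIVETIME"]) < firewall_timeout_seconds


if __name__ == "__main__":
    sample = "KEEPALIVE yes\nKEEPALIVETIME 600\nKEEPALIVEINTERVAL 30\n"
    opts = read_keepalive_options(sample)
    print(opts)
    print(keepalive_precedes_firewall(opts, firewall_timeout_seconds=3600))
```

The check is deliberately conservative: it reports False whenever a setting is not stated explicitly, rather than guessing at a default.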
Note: The storage agent continues to send keep-alive verbs to the Tivoli Storage Manager server only while the corresponding client session is active. If the client session is idle on the storage agent (e.g. no data transfer or tape reads/writes being performed), the storage agent assumes that the operation has completed and sends no further keep-alive verbs to the server, which leaves the proxy session exposed to firewall timeouts.
This behavior may be seen during multi-session backup/restore operations via the API client. For example, some database applications back up a database using multiple sessions, with each session responsible for specific database tables. If one or more of the tables is significantly larger than the others, the session backing up the large table may run considerably longer than the other sessions. If the application requires every session to run to completion before any session can end, one or more client sessions may sit in an idle wait state on the storage agent for an extended period while the backup of the large table completes.
While the sessions are in an idle wait on the storage agent, no keep-alive verbs are sent from the storage agent to the Tivoli Storage Manager server for the corresponding proxy sessions. Firewall timeouts of these proxy sessions can be avoided only by increasing the firewall timeout value, or by balancing the workload evenly across the client sessions so that no session completes significantly earlier than the others.
Product Synonym
TSM
Document Information
Modified date:
17 June 2018
UID
swg21639855