Investigating tough problems with OMEGAMON XE for Mainframe Networks
deanab
Recently, I worked with an airline company to identify why a few transactions out of hundreds per second were failing. Many z/OS applications act as TCP/IP servers and accept connections from users or applications on other hosts and mobile devices. The application that was experiencing problems at this customer does accept connections from end users, but it also acts as a TCP/IP client to retrieve information from servers running on Linux. Some of the transactions between the z/OS application and the Linux servers were failing. In brief, the architecture looks like this: end users connect to the z/OS application, which in turn opens client connections to the Linux servers.
The customer was puzzled that hundreds of transactions per second completed successfully while a few failed. The application provided statistics on successful versus failed transactions, but gave no clues as to why the transactions failed. The Linux servers gave no indication that any transactions had failed.
OMEGAMON XE for Mainframe Networks provides the key performance indicators for network applications and the network from z/OS. The key steps for this problem were:
1) Create a historical collection for the TCP Connections attribute group. You can limit the amount of data stored in short-term history by adding Filters on Application Name and Remote IP Address. Use multiple rows to specify multiple Remote IP Addresses (one for each Linux server).
2) In the Tivoli Enterprise Portal, navigate to the Applications workspace, find the application in the table, right-click the link icon (an image of a chain), and click Application TCP Connections.
3) You are viewing the connections associated with the application. This includes both connections accepted by the application, and connections opened by the application to the Linux servers. Right-click in the table view to filter on the remote IP addresses of the Linux servers. Now, you see only the connections between the z/OS application and the Linux servers.
The problem became apparent once I rearranged the columns in the table to group these attributes together: connection state, connection duration, time since last activity, data ready to be sent, congestion window size, bytes sent, and bytes received.
Some of the connections showed a connection duration and a time since last activity of more than 1 second; data was ready to be sent; the congestion window was small (a little over 4000); and some bytes had been sent but none had been received. The connections for the successful transactions were all in Time-Wait state (a normal connection close) and had a duration of less than 100 milliseconds. The transactions that failed never completed a connection and timed out.
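If you export the connection rows for offline analysis, the same pattern can be checked in a few lines of Python. This is only a sketch: the `TcpConnection` fields, the state strings, and the sample values are hypothetical stand-ins rather than the product's actual attribute names, and the 1-second defaults simply mirror what was observed at this customer.

```python
from dataclasses import dataclass


@dataclass
class TcpConnection:
    """One exported row; the field names here are illustrative assumptions."""
    remote_ip: str
    state: str
    duration_ms: int
    idle_ms: int          # time since last activity
    bytes_sent: int
    bytes_received: int


def looks_stalled(conn, max_duration_ms=1000, max_idle_ms=1000):
    """True when a connection matches the failing pattern described above."""
    return (
        conn.state != "TIME_WAIT"            # not a normally closed connection
        and conn.duration_ms > max_duration_ms
        and conn.idle_ms > max_idle_ms
        and conn.bytes_sent > 0              # z/OS sent something...
        and conn.bytes_received == 0         # ...but nothing came back
    )


# Made-up sample rows: one healthy connection and one stalled one.
connections = [
    TcpConnection("10.0.0.5", "TIME_WAIT", 80, 40, 512, 2048),
    TcpConnection("10.0.0.6", "SYN_SENT", 1500, 1400, 60, 0),
]
stalled = [c for c in connections if looks_stalled(c)]
```

Adjust the two thresholds to whatever is normal for your own transactions.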
A packet trace on the network confirmed that z/OS was performing correctly: packets were sent to initiate all connections, but the Linux server never responded on the few connections where the transactions failed. With the evidence provided by OMEGAMON XE for Mainframe Networks, the manager was able to show that the Linux server team needed to isolate and resolve the issue.
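The check made against the packet trace amounts to looking for connection requests (SYN) that never received a reply (SYN-ACK). A minimal sketch of that idea follows, assuming the trace has already been parsed into simple records; the `Packet` type and its fields are hypothetical and do not correspond to any particular trace format.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """A parsed trace record; this layout is an assumption for illustration."""
    src: str            # "ip:port" of the sender
    dst: str            # "ip:port" of the receiver
    flags: frozenset    # TCP flags set on the packet, e.g. {"SYN", "ACK"}


def unanswered_syns(packets):
    """Return (client, server) pairs whose SYN never received a SYN-ACK."""
    pending = {}  # dict keeps first-seen order and ignores SYN retransmissions
    for p in packets:
        if p.flags == {"SYN"}:
            pending.setdefault((p.src, p.dst), True)
        elif p.flags == {"SYN", "ACK"}:
            # The reply travels in the opposite direction to the request.
            pending.pop((p.dst, p.src), None)
    return list(pending)
```

Any pair returned by `unanswered_syns` is a connection the client tried to open but the server never acknowledged, which is the evidence that pointed to the Linux side here.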
How can you best detect issues with applications that connect from z/OS to a foreign host? Create a situation that detects failed connections between key z/OS and remote applications. The threshold values you choose for Connection Duration and Time Since Last Activity will depend on the normal transaction duration for your applications.
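One way to pick those threshold values is to start from durations you have observed for healthy transactions and add a generous margin. The helper below is a rough sketch of that reasoning, not a product feature: the function name and the `safety_factor` default are assumptions, and you would still enter the resulting value into the situation yourself.

```python
def suggest_threshold_ms(normal_durations_ms, safety_factor=10):
    """Suggest an alert threshold well above the slowest normal transaction.

    safety_factor is an arbitrary illustrative margin, not a recommended value.
    """
    if not normal_durations_ms:
        raise ValueError("need at least one observed duration")
    return safety_factor * max(normal_durations_ms)


# Made-up sample: healthy transactions that all finish in under 100 ms.
threshold_ms = suggest_threshold_ms([40, 65, 90])
```

A larger margin reduces false alerts at the cost of detecting a failed connection a little later.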
The capabilities described here are available with the OMEGAMON Performance Management Suite!