The z/OS Parallel Sysplex® technology provides a highly reliable, redundant, and robust environment by clustering multiple z/OS systems with one or more coupling facilities to achieve near-continuous availability. However, simply running your workload applications in a parallel sysplex does not make your workloads continuously available. Parallel Sysplex is only an enabling technology. For workloads to take full advantage of the high-availability properties of a parallel sysplex, their applications must be able to run instances in parallel on any z/OS system in the cluster and access data in a data sharing environment.
Applications that have affinities may not be eligible to leverage the availability features of a parallel sysplex. A typical example of an affinity is a workload application that has a dependency on non-shared data that is only available locally on one z/OS system. Another case of an application affinity to a single z/OS system is application access to non-shared resources such as an external feed from another company.
Solutions to these types of affinities may require changes to the infrastructure or to the workload applications themselves. Unless these changes are made, the workload applications are not continuously available. That means the workloads that a company depends on could suffer an outage if the application, the database management system (DBMS), or the z/OS system itself fails. Once the outage is detected, the cause of the failure must be determined, and someone must decide whether to restart the workload in place or on an alternate z/OS system. The time from when the outage occurs until the workload is available again could span multiple hours.
From multiple hours to one minute
Although an outage is sometimes unavoidable, the time to recover from it can be reduced. If you have a parallel sysplex and have not enabled, or cannot enable, data sharing for your workload applications, one solution is to use IBM Multi-site Workload Lifeline® along with an appropriate software replication product, such as IBM InfoSphere Data Replication (IIDR) for Db2®, and possibly an external load balancer.
InfoSphere Data Replication
IBM provides software replication products for three data sources that run on z/OS: Db2, IMS, and VSAM. The purpose of these software replication products is to provide a transactionally consistent copy of the data source in an alternate location. Typically, this copy is used as a backup in the event of a failure of the original data source, or as a read-only copy for performing data analytics.
Multi-site Workload Lifeline
IBM Multi-site Workload Lifeline (Lifeline) is a product that provides workload monitoring and routing. Lifeline can monitor workloads with data sharing applications that run on two parallel sysplexes in different data centers, as well as non-data sharing applications that run on two z/OS systems, each in its own monoplex or both within the same parallel sysplex. In the event of a workload failure for data sharing applications, Lifeline facilitates the routing of new workload connections or MQ messages to the data sharing applications in the alternate parallel sysplex. For a workload failure with non-data sharing applications, Lifeline orchestrates the routing of new workload connections or MQ messages to the non-data sharing application on the alternate z/OS system. A workload failure can occur if the workload applications are no longer healthy or active, if the z/OS systems where the workload applications run have failed, or if the parallel sysplex where the workload is active suffers an outage.
Lifeline supports a variety of workload types that run on z/OS systems.
- TCP applications, such as transaction management systems like CICS or IMS, are monitored. Lifeline provides routing recommendations to external load balancers on how to distribute workload connections to these applications.
- Lifeline preserves investment in legacy SNA workloads. These SNA applications are monitored for health and availability. Lifeline directs external load balancers to connect to a subset of gateways, such as TN3270, in order to create sessions to specific SNA applications.
- For workloads that use messaging services provided by an MQ cluster, the MQ queue managers and cluster queues are monitored. Lifeline controls how MQ messages are delivered to only the eligible MQ queue managers in the MQ cluster.
For these workload types, Lifeline provides system administrators with a centralized view for determining workload application status and a method for controlling how the workload connections or MQ messages are routed.
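The idea of routing only to eligible targets can be illustrated with a small sketch. This is a toy model, not the Lifeline interface: the `Target` class, its fields, and the `eligible_targets` function are hypothetical names introduced for illustration, and the health flags stand in for whatever monitoring data a real product collects.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical record for one monitored target (application or queue manager)."""
    name: str               # illustrative label, not a real resource name
    system: str             # z/OS system hosting the target
    healthy: bool           # result of the most recent simulated health check
    system_up: bool = True  # whether the hosting system is considered available


def eligible_targets(targets):
    """Return the names of targets that may receive new connections or messages."""
    return [t.name for t in targets if t.healthy and t.system_up]


targets = [
    Target("QMGR1", "SYSA", healthy=True),
    Target("QMGR2", "SYSB", healthy=False),                  # application unhealthy
    Target("QMGR3", "SYSC", healthy=True, system_up=False),  # hosting system down
]

print(eligible_targets(targets))  # prints ['QMGR1']
```

In this sketch a target is excluded if either the application or its hosting system is unavailable, which mirrors the failure conditions described above.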
A customer has a z/OS workload deployed in a parallel sysplex that consists of an application running in CICS that updates and queries Db2. Access to the CICS application is through a web browser. Due to design restrictions, Db2 data sharing cannot be used, and all workload connections are processed by a single instance of the CICS application and Db2.
To provide near-continuous availability for this workload within the parallel sysplex, without requiring application changes, the following steps can be implemented.
- Using IIDR for Db2, create a second copy of the database on a second z/OS system in the parallel sysplex that is continuously replicated from the original Db2 data source. A second Db2 DBMS is used to manage this copy of the database. Neither Db2 database is enabled for data sharing.
- Ensure that a second CICS application instance is running on the z/OS system where the second Db2 is running.
- Define the workload to Lifeline, so that Lifeline monitors both CICS application instances and both z/OS systems.
- Ensure an external load balancer is configured to communicate with Lifeline. F5 Networks BIG-IP Local Traffic Manager® is the recommended load balancer.
Because just one CICS application can process workload connections from web browsers, the workload can only be processed from one z/OS system. Web browsers connect to the external load balancer, instead of directly to the CICS application. Using Lifeline, one CICS application instance is selected as the active instance, and Lifeline will direct the external load balancer to route all workload connections to this one CICS application. In the event of a failure of the active CICS application or z/OS system where the active CICS application is running, Lifeline will detect that the failure occurred. Lifeline can either:
- Automatically switch the workload by directing the external load balancer to route new workload connections to the alternate CICS application and its copy of Db2, or
- Prompt the system administrator to perform a controlled workload switch.
The web browsers continue to connect to the external load balancer and are unaware that a different CICS application is used. The elapsed time of the workload outage can now be reduced to around one minute.
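The active/alternate switching behavior described above can be sketched as a small state model. This is a simplified illustration under stated assumptions, not Lifeline's implementation or API: `Site`, `WorkloadRouter`, `check`, and `switch` are hypothetical names, health is represented by two boolean flags, and database replication is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical stand-in for one CICS instance plus its z/OS system."""
    name: str
    app_healthy: bool = True
    system_up: bool = True

    def available(self):
        return self.app_healthy and self.system_up


@dataclass
class WorkloadRouter:
    """Toy model: exactly one site receives new connections at a time."""
    active: Site
    alternate: Site
    auto_switch: bool = True
    log: list = field(default_factory=list)

    def check(self):
        """Run one monitoring pass; return the site for new connections, or None."""
        if self.active.available():
            return self.active.name
        if not self.alternate.available():
            self.log.append("no available site")
            return None
        if self.auto_switch:
            self.switch()
            return self.active.name
        # Manual mode: report the failure and wait for an administrator.
        self.log.append("failure detected: administrator action required")
        return None

    def switch(self):
        """Swap roles so the alternate site becomes the active one."""
        self.active, self.alternate = self.alternate, self.active
        self.log.append(f"switched to {self.active.name}")


router = WorkloadRouter(Site("CICS-A"), Site("CICS-B"))
print(router.check())              # prints CICS-A
router.active.app_healthy = False  # simulate a failure of the active instance
print(router.check())              # prints CICS-B
```

With `auto_switch=False`, `check()` returns `None` after a failure until `switch()` is called explicitly, which corresponds to the administrator-controlled option.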
Running a workload in a z/OS Parallel Sysplex environment does not guarantee continuous availability for the workload. Depending on how the workload’s applications are designed, there may be affinities present in the application that prohibit it from participating in a data sharing environment. For this type of workload, you can provide near-continuous availability by using IBM Multi-site Workload Lifeline with the appropriate software replication product for your workload. Lifeline monitors the health and availability of your workload application and z/OS systems and coordinates the switching of your workload to an alternate workload application and data source. This type of configuration lays the groundwork for the adoption of the GDPS Continuous Availability solution, which provides 99.999% availability and even more robust workload monitoring and workload and systems management.
GDPS Continuous Availability is a powerful offering that facilitates near-instantaneous switching of workloads between two sites that can be separated by virtually unlimited distances. Based on asynchronous software replication, planned switches can be accomplished with no data loss (RPO 0). When sufficient replication bandwidth is provided, the RPO can be as low as a few seconds for an unplanned workload switch.
About the author:
Michael Fitzpatrick is a Senior Technical Staff Member of the IBM Enterprise Networking Software Group, based in Research Triangle Park, North Carolina, in the US. He is the architect for the Multi-site Workload Lifeline product. Mike has worked in the networking area for 22 years, with a focus on resiliency, network design, and performance.