IBM Multi-site Workload Lifeline

IBM Multi-site Workload Lifeline (Lifeline) enables intelligent load balancing of critical workload transactions by influencing routing of connections for TCP/IP workloads and messages for IBM MQ cluster workloads. Routing is done across two sites to provide near continuous availability.

When an outage occurs, IBM Multi-site Workload Lifeline helps reduce critical workload recovery time from hours to minutes compared with traditional disaster recovery. The recovery time for unplanned outages is reduced by detecting workload failures and rerouting to another site. The impact of planned outages is mitigated by switching workloads to another site with minimal disruption.

Benefits

Improve performance

New workload connections are routed to the applications, servers, and systems most capable of processing them so that transaction response time is reduced. System resources are used more efficiently.

Achieve higher availability

Route new workload connections to other available applications in the event of application, system or site outages. Outages for maintenance updates or other planned events can be minimized.

Increase scalability

Add application instances on-demand. Automatically monitor and include added instances in workload routing decisions.

Reduce recovery time

Reduce response time by aligning new workload connections with the most capable applications and systems. Recovery time after a workload failure can be reduced from hours to minutes.

Improve workload migration, utilization

Route workloads from one site to another with minimal disruption. Connections for query workloads can be distributed to both sites simultaneously.

Simplify disaster recovery procedures

Add simpler, non-disruptive testing of disaster recovery procedures by validating that workloads remain accessible on the recovery site – without requiring an outage of the production site.

IBM Multi-site Workload Lifeline can help us understand whether a site is normal and whether the data is synchronized. Only when IBM Multi-site Workload Lifeline is deployed can IBM GDPS Continuous Availability (GDPS AA) complete the workload switching to achieve continuous availability.

Senior manager of data center

A large Asian bank

Features

Load balancing workloads

Lifeline uses two tiers of load balancing for workloads targeting TCP/IP applications. Lifeline directs first-tier load balancers to route workload connections to second-tier load balancers in the selected site, which then route the connections to applications in the site. For workloads that use messaging, Lifeline relies on IBM MQ clusters. Lifeline directs the cluster to route workload messages to IBM MQ queue managers in the selected site, which then make the messages available to applications.
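The two-tier flow for TCP/IP workloads can be pictured with a small Python model. This is only an illustrative sketch: the class names, site names, application names, and the round-robin choice within a site are all assumptions made for the example, not Lifeline or load balancer interfaces.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class SecondTierBalancer:
    """Hypothetical second-tier load balancer fronting the applications in one site."""
    site: str
    applications: list[str]

    def __post_init__(self) -> None:
        # Round-robin among the site's applications (an assumption for this sketch).
        self._next_app = cycle(self.applications)

    def route(self) -> str:
        return next(self._next_app)


@dataclass
class FirstTierBalancer:
    """Hypothetical first-tier load balancer that forwards to the selected site."""
    second_tier: dict[str, SecondTierBalancer]
    selected_site: str

    def route(self) -> tuple[str, str]:
        # Tier 1 picks the site; tier 2 picks an application within that site.
        balancer = self.second_tier[self.selected_site]
        return balancer.site, balancer.route()


tier1 = FirstTierBalancer(
    second_tier={
        "site1": SecondTierBalancer("site1", ["app1a", "app1b"]),
        "site2": SecondTierBalancer("site2", ["app2a", "app2b"]),
    },
    selected_site="site1",
)
first = tier1.route()
second = tier1.route()
tier1.selected_site = "site2"  # for example, after a workload switch
third = tier1.route()
```

In this model, changing `selected_site` is the only step needed to send new connections to the other site; connections already routed are not modeled.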

Site routing recommendations

For workloads that use two tiers of load balancers, Lifeline provides first-tier load balancers with site connection routing recommendations based on the availability and health of the workload applications, the z/OS systems and (if applicable) Linux® on IBM Z® systems across both sites. For workloads that use IBM MQ clusters, Lifeline provides the cluster with site message routing recommendations based on the availability and health of the IBM MQ queue managers and the z/OS systems across both sites.
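As a rough illustration of a health-based site recommendation, the following Python sketch turns per-application status reports into one routing weight per site. The 0 to 100 health score, the field names, and the rule of summing scores are assumptions chosen for the example; the actual weighting Lifeline uses is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppStatus:
    site: str
    system_up: bool  # is the hosting system available?
    app_up: bool     # is the application available?
    health: int      # hypothetical health score, 0 (worst) to 100 (best)


def site_recommendations(statuses: list[AppStatus]) -> dict[str, int]:
    """Return a hypothetical routing weight per site.

    A site's weight is the summed health of its applications that are
    available on an available system. A site with none gets weight 0.
    """
    weights: dict[str, int] = {}
    for status in statuses:
        usable = status.system_up and status.app_up
        weights[status.site] = weights.get(status.site, 0) + (status.health if usable else 0)
    return weights


reports = [
    AppStatus("site1", system_up=True, app_up=True, health=80),
    AppStatus("site1", system_up=True, app_up=False, health=90),
    AppStatus("site2", system_up=False, app_up=True, health=100),
]
recommendations = site_recommendations(reports)
```

Here `site2` receives a weight of 0 because its system is unavailable, so a first-tier load balancer following these weights would send new connections to `site1` only.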

Lifeline Agents

A Lifeline Agent is started on each z/OS system and Linux on Z Management Guest where the workloads are present across both sites. The Agent is responsible for monitoring the workload applications that reside on its system and reporting this information back to a Lifeline Advisor. The Agent on z/OS is also responsible for communicating with an MQ queue manager to monitor and influence MQ message routing within an MQ cluster.

Lifeline Advisors

A Lifeline Advisor is started on a z/OS system and can be started as the primary or secondary Advisor. A primary Advisor communicates with all Lifeline Agents to determine workload availability. The Advisor provides MQ message distribution rules to the Agents for the MQ clusters and routing recommendations to load balancers for TCP connections for these workloads. A secondary Advisor monitors the availability of the primary Advisor and will take over primary Advisor responsibility in the event of a primary Advisor failure.
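The primary and secondary Advisor relationship can be sketched as a simple takeover rule. The heartbeat mechanism, the timeout value, and all names below are assumptions for illustration; how the secondary Advisor actually monitors the primary is not specified here.

```python
from dataclasses import dataclass


@dataclass
class SecondaryAdvisor:
    """Hypothetical model of a secondary Advisor watching the primary."""
    takeover_after: float        # assumed seconds of silence before takeover
    last_heartbeat: float = 0.0  # time the primary was last seen
    role: str = "secondary"

    def heartbeat(self, now: float) -> None:
        # Record that the primary Advisor was seen at time `now`.
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        # Take over primary responsibility once the primary has been silent too long.
        if self.role == "secondary" and now - self.last_heartbeat > self.takeover_after:
            self.role = "primary"
        return self.role


advisor = SecondaryAdvisor(takeover_after=10.0)
advisor.heartbeat(now=100.0)
role_while_primary_alive = advisor.check(now=105.0)
role_after_silence = advisor.check(now=111.0)
```

Once the model has taken over, it stays primary even if a later heartbeat arrives; returning the role to the original primary would be a separate, explicit step.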

Workload configurations

Each workload that is configured to Multi-site Workload Lifeline is classified as an "active/standby" or "active/query" workload.

An active/standby workload is active in one site. Lifeline directs load balancers to route incoming connections to the active site. When database updates are made, database software replication transmits those changes asynchronously from the active instance of the workload to the standby instance of the workload. At the standby site, the standby instance of the workload is active and ready to receive work. The updated data from the active site is applied to the database subsystem running in the standby site in near real time.
An active/query workload can be active in one or both sites. Lifeline provides routing recommendations to the load balancers to intelligently balance connections across both sites. When database updates are made by the associated active/standby workload, Lifeline monitors database replication latency to ensure that connections are not routed to a site whose replicated database is too far out of date relative to the database on the active site.
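The replication latency check for an active/query workload amounts to a threshold test, sketched below in Python. The function name, the latency values in seconds, and the threshold are hypothetical; the sketch also assumes the active site is always eligible because it holds the current data.

```python
def eligible_query_sites(
    active_site: str,
    replication_latency: dict[str, float],
    max_latency: float,
) -> list[str]:
    """Return the sites that may receive query connections.

    `replication_latency` maps each replica site to how far (in seconds)
    its database lags behind the active site. A replica site is eligible
    only while its lag does not exceed `max_latency`.
    """
    eligible = [active_site]
    for site, latency in sorted(replication_latency.items()):
        if site != active_site and latency <= max_latency:
            eligible.append(site)
    return eligible


both_sites = eligible_query_sites("site1", {"site2": 3.0}, max_latency=5.0)
active_only = eligible_query_sites("site1", {"site2": 12.0}, max_latency=5.0)
```

With a lag of 3 seconds both sites can serve queries; with a lag of 12 seconds, query connections fall back to the active site alone until replication catches up.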

Workload types

Lifeline can support many types of workloads that reside on z/OS or Linux on Z:

TCP-based applications such as CICS Sockets or HTTP Servers
SNA applications that can be accessed from a TCP-based server such as TN3270
MQ applications that receive messages from an MQ cluster defined on z/OS
Db2 subsystems accessed through DRDA messaging
TCP-based applications running on Linux on Z guests such as WebSphere Application Server

System requirements

Software requirements

Virtualization on z/VM requires z/VM V5R3 or newer.
Virtualization on z/OS requires z/OS V2R1 or newer running Communications Server for z/OS.

Hardware requirements

Any System z that can run z/OS V2R1 or higher and uses the TCP/IP stack from the Communications Server for z/OS.

Resources

IBM Multi-site Workload Lifeline

See how IBM Multi-site Workload Lifeline plays a key role in solving major problems in the enterprise.

Load Balancing with IBM Multi-site Workload Lifeline

See how you can have intelligent load balancing of TCP/IP workloads and still have nearly continuous availability.

IBM Multi-site Workload Lifeline use cases

Read use cases that describe the integration of IBM Multi-site Workload Lifeline with F5 BIG-IP.

IBM Multi-site Workload Lifeline: Converting to an IBM MQ Cluster

Learn to convert an existing IBM MQ environment with shared channels to a cluster and how to configure Lifeline to support a workload that uses an MQ cluster.

Improve your IT resilience with IBM Multi-site Workload Lifeline

See how Lifeline helps your business to save costs and be competitive 24x7.

Choosing your disaster recovery or continuous availability solution from IBM GDPS offerings

See how Lifeline plays an integral role in the GDPS Continuous Availability solution.

FAQ

How does IBM Multi-site Workload Lifeline enable continuous availability?

Lifeline monitors the workload applications and the systems where these applications reside, across the two sysplexes, or sites, where these systems are running. Lifeline controls the routing of connections and MQ messages that are targeting these workload applications, ensuring the connections and MQ messages are sent to the optimal workload applications in the active site(s).
If a workload failure in the active site is detected by Lifeline, Lifeline can automatically perform a workload switch, in seconds, to the workload applications in the alternate site. Or Lifeline can generate alert messages that automation products can capture to perform their own workload switching.

Does my business need continuous availability for workloads?

If any of the following situations applies to your business, you need continuous availability for your workloads.

Your business must run 24x7 due to industry regulations.
Other businesses depend on your business's always-on availability, for example, if your business is in the financial or insurance industry.
Your business has no recovery procedures in place, for example, with non-sysplex environments and no disk replication capabilities.

How is continuous availability different from disaster recovery?

Existing disaster recovery solutions utilize disk-based replication to make mirror copies to a remote site of all disks used by the systems in the local site. These disk copies cannot be used while the disk replication is occurring. In the event of a failure in the local site, the systems and workload applications need to be restarted in the remote site before access to the workloads is reestablished. Typically, this can take an hour or longer to accomplish.
With Lifeline-enabled continuous availability solutions, software data replication, such as InfoSphere Data Replication for Db2, is used to keep data in sync between the local and remote sites. The key difference is that systems in both sites are active, and Lifeline is used to monitor the workloads across both sites. In the event of a failure in the local site, Lifeline will detect the workload failure and route all new workload connections to the alternate site. Access to the workloads is therefore reestablished in seconds, versus an hour or more with disaster recovery solutions.

How does Lifeline act as an integral part of the GDPS® Continuous Availability solution?

Lifeline, through its monitoring and workload routing, plays an integral role in the GDPS Continuous Availability solution and provides the following benefits:

Improved performance: New connections of workloads are routed to the applications, servers, and systems most capable of processing them so that transaction response time is reduced. System resources are used more efficiently.
Improved availability: New workload connections can be routed to available applications and systems when some of them are down. Outages for maintenance updates or other planned events can be minimized.
Reduced recovery time: The Recovery Time Objective is reduced from hours to minutes. With disk replication, traditional DR solutions recover on the standby site by restarting systems or applications. That normally takes hours, and IT services are unavailable for this period. With Lifeline working within the GDPS Continuous Availability solution, workloads can be switched to the standby site in minutes.

Is Lifeline only available as part of the GDPS Continuous Availability solution?

No. Although typically used as an integral part of the GDPS Continuous Availability solution, Lifeline can also be deployed outside the solution.
If your business has its own automation capabilities, Lifeline can be used along with a software data replication product that keeps data in both sites in sync.
In other cases, if your business has workload applications that are not sysplex-enabled, you cannot use the GDPS Continuous Availability solution. Using Lifeline, along with a software data replication product to keep data in both sites in sync, will provide “sysplex-like” recovery for these workload types.

How does Lifeline reduce the maintenance window for planned outages?

Lifeline provides the ability to perform a graceful switch of the applications and their data sources, called workloads by Lifeline, during planned outages. By using simple Lifeline commands, workload migration from one site to another can be easily performed, minimizing the down time for planned events such as scheduled maintenance activities.

How does Lifeline provide near continuous availability for critical workloads during unplanned outages?

Lifeline increases availability because new connections and messages can be routed away from failing workload applications and systems. Lifeline reduces response times by routing connections and messages to workload applications and systems with capacity for additional work, and it reduces recovery time from hours to minutes.

Do all workloads running in a site need to be configured to Lifeline initially?

No. One of the many benefits of Lifeline is that it is not an all-or-nothing solution, as disaster recovery solutions tend to be. Only the most critical workloads would be configured to Lifeline to provide continuous availability, while all other workloads, including batch, would be recovered using existing disaster recovery procedures. Additional workloads can be added to Lifeline at any time.

What are the characteristics of a workload, when defining it to Lifeline?

A workload’s characteristics depend on the workload type. For TCP-based workloads, it’s the IP addresses and port numbers of the TCP applications. For SNA-based workloads, it’s the SNA appl names of the SNA applications. For MQ-based workloads, it’s the MQ cluster queues and MQ queue managers where MQ messages for the workloads are sent. For Db2 DRDA-based workloads, it’s the IP addresses and port numbers of the Db2 aliases and Db2 subsystems. For Linux on Z workloads, it’s the Linux on Z guests running on z/VM.
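To make the per-type characteristics concrete, here is a small Python data model. It is not Lifeline configuration syntax: the class names, field names, workload name, addresses, and ports are all hypothetical and serve only to show which identifiers belong to which workload type.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    ip_address: str
    port: int


@dataclass
class WorkloadDefinition:
    """Hypothetical container for the characteristics listed above."""
    name: str
    tcp_endpoints: list[Endpoint] = field(default_factory=list)   # TCP applications
    sna_appl_names: list[str] = field(default_factory=list)       # SNA applications
    mq_cluster_queues: list[str] = field(default_factory=list)    # MQ cluster queues
    mq_queue_managers: list[str] = field(default_factory=list)    # MQ queue managers
    db2_endpoints: list[Endpoint] = field(default_factory=list)   # Db2 aliases and subsystems
    linux_guests: list[str] = field(default_factory=list)         # Linux on Z guests on z/VM

    def kinds(self) -> list[str]:
        """List which workload types this definition actually uses."""
        present = {
            "tcp": self.tcp_endpoints,
            "sna": self.sna_appl_names,
            "mq": self.mq_cluster_queues or self.mq_queue_managers,
            "db2": self.db2_endpoints,
            "linux": self.linux_guests,
        }
        return [kind for kind, values in present.items() if values]


# Example values are made up; 192.0.2.0/24 is a documentation-only address range.
workload = WorkloadDefinition(
    name="PAYMENTS",
    tcp_endpoints=[Endpoint("192.0.2.10", 8080)],
    mq_cluster_queues=["PAY.REQUEST"],
    mq_queue_managers=["QM1A", "QM2A"],
)
workload_kinds = workload.kinds()
```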

How does Lifeline control routing of connections to workload applications?

Lifeline relies on a load balancer that supports the Server/Application State Protocol (SASP), which is documented in RFC 4678. The protocol allows Lifeline to periodically send routing recommendations to a SASP-enabled load balancer, directing the load balancer on how to route workload connections across a set of workload applications that can span both sites. The F5 BIG-IP Switch Local Traffic Manager is the recommended load balancer for use with Lifeline.

How does Lifeline control routing of MQ messages for workloads?

Lifeline communicates with the MQ queue managers that manage the queues used by the workloads and tells the MQ cluster which MQ queue managers are eligible to receive MQ messages. Following a workload failure in a site, Lifeline also ensures any stranded MQ messages are transferred to MQ queue managers in the alternate site during a workload switch.

Next steps

Multi-site Workload Lifeline helps reduce critical workload recovery time when an outage occurs.
