IBM Z Resilience

Everything you need to get started quickly.

Welcome to the IBM Z® Resilience content solution, your homepage for technical resources and use cases for resilient server solutions for IBM Z.

IT resilience with IBM Z is the ability to rapidly adapt and respond to any internal or external disruption, demand, or threat, and continue business operations without significant impact. The goal is that you can run your workload 24x7, no matter what.

Learn how solutions for IT resilience can optimize availability, keep your systems running, detect problems in advance, and recover your critical data. Examine the key benefits and capabilities of IBM Z for resilience on the z15™ platform.

See what IBM Z Resilience can do for your business.
Related solution System Recovery Boost

Minimize the impact of stopping and restarting z/OS.

Big picture
  1. Start with a resilient and reliable base.
  2. Reduce the duration of outages with failover capability.
  3. Reduce the impact of outages with fault-tolerant architecture and GDPS.
  4. Add GDPS for continuous availability and maximum resilience.
  5. Proactively address potential anomalies.
How to get started
Resilient and reliable base

Start with a resilient and reliable base. Because an installation can be exposed to planned outages, unplanned outages, and ad hoc automation, it can improve the manageability of its environment with performance monitoring, service management, and storage management software. The starting point is a single IBM Z system, based on the IBM RAS (reliability, availability, and serviceability) design principles. These principles guide the system architecture to ensure resilience and reliability.
IBM Z Hardware Resilience

Resilience is the ability to provide the required capability in the face of adversity, without significant impact. Keep hardware, the operating system, middleware, and applications up and running throughout planned and unplanned outages. Recover a site from an unplanned event without data loss.

Read the IBM Redbook to see how it works
System Recovery Boost

System Recovery Boost increases your processor capacity during startup and shutdown, decreasing the time it takes to process backlog. Explore the content solution to see how it works.

System Recovery Boost content solution
Failover capability

Reduce the duration of outages with failover capability. Add failover capabilities to the IBM Z environment. Your system can replicate data quickly with synchronous Metro Mirror over shorter distances and asynchronous Global Mirror over longer distances.

Mirroring can be enhanced with the help of IBM Copy Services Manager (CSM).
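The practical difference between synchronous and asynchronous mirroring is how much data can be at risk when the primary site fails. The following Python sketch is a toy model only, not Metro Mirror, Global Mirror, or CSM code; the `MirroredVolume` class and its methods are hypothetical names chosen for illustration.

```python
from collections import deque


class MirroredVolume:
    """Toy model of a primary volume with one remote copy (hypothetical)."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []         # writes hardened at the primary site
        self.secondary = []       # writes hardened at the secondary site
        self.in_flight = deque()  # writes not yet shipped (async mode only)

    def write(self, record: str) -> None:
        """Acknowledge a write according to the replication mode."""
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the write completes only after the remote copy has it.
            self.secondary.append(record)
        else:
            # Asynchronous: acknowledge now, ship to the remote site later.
            self.in_flight.append(record)

    def drain(self, count: int) -> None:
        """Ship up to `count` queued writes to the secondary (async mode)."""
        for _ in range(min(count, len(self.in_flight))):
            self.secondary.append(self.in_flight.popleft())

    def writes_at_risk(self) -> int:
        """Writes that would be lost if the primary site failed right now."""
        return len(self.primary) - len(self.secondary)


sync_vol = MirroredVolume(synchronous=True)
async_vol = MirroredVolume(synchronous=False)
for i in range(5):
    sync_vol.write(f"update-{i}")
    async_vol.write(f"update-{i}")
async_vol.drain(3)

print(sync_vol.writes_at_risk())   # 0
print(async_vol.writes_at_risk())  # 2
```

In this model the synchronous copy never has writes at risk, while the asynchronous copy trails the primary by whatever has not yet been shipped. The sketch does not model the added write latency that synchronous replication incurs over long distances.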

IBM Z Business Resilience Stress Test (zBuRST)

zBuRST extends the IBM Z Application Development and Test Solution (DevTest) to increase the available capacity of your quality assurance (QA) test environment or disaster recovery assets (or a separate machine) to support load and stress testing activities at production scale. zBuRST exploits On/Off Capacity-on-Demand so that you can use spare (“dark”) IBM Z engines and resources to load and stress test changes at up to 150% of production scale. You can validate the quality of any production change and ensure the resilience of your critical business services running on Z.

Frequently asked questions
About the DevTest solution
Parallel Sysplex

Parallel Sysplex® is a clustering technology that allows you to operate multiple copies of z/OS® as a single system image. Images can be added to or removed from the cluster while applications continue to run. Operating systems participating in a Parallel Sysplex can span multiple servers, and applications and data can be shared across systems. Benefits include no single point of failure, non-disruptive capacity and scaling, dynamic workload balancing, application compatibility, and disaster recovery.

What is a Parallel Sysplex?
Fault-tolerant architecture

Reduce the impact of outages with fault-tolerant architecture and fully automated recovery from failures, all of which can be handled by Geographically Dispersed Parallel Sysplex (GDPS®). Increase the fault tolerance of your system by adding z/OS Parallel Sysplex data sharing and disk replication of your storage resources. The benefits include:
  • Multiple systems and data in two or more data centers
  • Workload continuing even when a system fails
  • Dynamic workload balancing
  • Shared data with read and write access directly from any member of the sysplex.
Parallel Sysplex

The Parallel Sysplex cluster infrastructure provides a set of common shared services where applications can run, in parallel, concurrently across multiple systems, with a common view of time, a common view of underlying network, application and infrastructure redundancy, and the built-in recovery mechanisms to exploit that redundancy for high availability. The sysplex provides ways to dynamically route transactions and workload from the system on which they are received to a system in the sysplex that might be better able to process the workload. The sysplex also enables applications to route around failures and workload spikes and other issues that might arise.

A Parallel Sysplex also provides a common and consistent view of sysplex-wide shared data, called data sharing. Shared data can be directly accessed from any application region on any participating system in the sysplex.

A well-configured Parallel Sysplex and its well-constructed sysplex-enabled workload can be set up to have no single point of failure. A Parallel Sysplex does not depend on the functioning of any single resource, CEC, or operating system. If any one copy fails or needs to be removed for maintenance, the workload as a whole can continue to run on the other systems in the cluster.
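The routing and failure behavior described above can be illustrated with a minimal Python sketch. It is a simplified model under stated assumptions, not a z/OS interface: the `Member` class, the `route` function, the system names, and the "least queued work" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """One system in the cluster (hypothetical model, not a z/OS API)."""
    name: str
    active: bool = True
    queued: int = 0


def route(members: list[Member]) -> Member:
    """Send the next transaction to the least-loaded active member."""
    candidates = [m for m in members if m.active]
    if not candidates:
        raise RuntimeError("no active systems left in the cluster")
    target = min(candidates, key=lambda m: m.queued)
    target.queued += 1
    return target


cluster = [Member("SYSA"), Member("SYSB"), Member("SYSC")]
for _ in range(6):
    route(cluster)

# Remove one system for maintenance; new work flows to the remaining members.
cluster[0].active = False
placements = [route(cluster).name for _ in range(4)]
print(placements)  # ['SYSB', 'SYSC', 'SYSB', 'SYSC']
```

Because every member can process any transaction against the shared data, taking one system out of the cluster changes only where new work lands, not whether it runs.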

What is a Parallel Sysplex?
GDPS continuous availability

Enable your workloads to fail over to another Parallel Sysplex within seconds, for planned maintenance or unplanned workload recovery. The IBM GDPS Continuous Availability concept consists of two or more sites. The sites can be separated by virtually unlimited distances and run the same applications with the same data sources. In this way, you maximize cross-site workload balancing, continuous availability, and disaster recovery. For an introduction to business resilience and the role of GDPS, read the IBM Redbook.

IBM GDPS Family: An Introduction to Concepts and Capabilities
Parallel Sysplex

GDPS integrates Parallel Sysplex technology and remote copy technology to enhance application availability and improve disaster recovery. The GDPS topology is a Parallel Sysplex cluster distributed across two sites, with all critical data mirrored between the sites. GDPS manages the remote copy configuration and storage systems, automates Parallel Sysplex operational tasks, and automates failure recovery from a single point of control, thereby improving application availability. GDPS supports all transaction managers, for example, Customer Information Control System (CICS®) and Information Management System (IMS), and database managers, for example, Db2®, IMS, and Virtual Storage Access Method (VSAM).

GDPS Metro provides end-to-end automated recovery, even if an entire data center becomes inoperative. A fully configured Parallel Sysplex data sharing environment, together with GDPS Metro support for HyperSwap for disk events, is designed to provide 99.99999% availability. GDPS Metro automation covers disk reconfiguration and the management of servers, sysplex resources, CBU, activation profiles, and so on. GDPS Metro is designed to be a near-continuous availability and disaster recovery solution.

GDPS Metro extends the Parallel Sysplex redundancy to disk subsystems. The HyperSwap function can help significantly reduce the time needed to switch to the secondary set of disks while keeping the z/OS systems active, together with their applications.

What is a Parallel Sysplex?
z/OS anomaly mitigation

Proactively address potential anomalies before an availability-impacting event can develop. z/OS anomaly mitigation is the system's ability to detect anomalous behavior in real time and to improve the decision-making and follow-on triage processes of operations staff.
Anomaly identification by Predictive Failure Analysis (PFA)

PFA is intended to detect abnormal behavior early enough to allow you to correct the problem before it affects your business. PFA uses remote checks from IBM® Health Checker for z/OS to collect data about your installation, and then uses machine learning to analyze this historical data to identify abnormal behavior.

Predictive Failure Analysis
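The general idea of comparing current behavior against a model built from historical data can be sketched in a few lines of Python. This is not the algorithm that PFA uses; it is a deliberately simple stand-in (a mean and standard deviation baseline), and the function name, the tolerance value, and the sample data are assumptions made for illustration.

```python
from statistics import mean, stdev


def is_abnormal(history: list[float], current: float, tolerance: float = 3.0) -> bool:
    """Flag `current` if it lies more than `tolerance` standard deviations
    above the average of the historical samples."""
    if len(history) < 2:
        return False  # not enough history to model normal behavior
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + tolerance * spread


# Hypothetical hourly message-arrival counts collected on earlier days.
history = [100, 104, 98, 101, 97, 103, 99, 102]
print(is_abnormal(history, 105))  # False
print(is_abnormal(history, 150))  # True
```

A fixed threshold would need retuning whenever the workload changes; deriving the threshold from the installation's own history is what lets this kind of check adapt to each system.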
Anomaly diagnostics by z/OS Runtime Diagnostics

Analyze a system with a potential problem or anomaly. Anomalies in particular are often difficult or impossible to detect and can slowly degrade the solution that is using z/OS. Runtime Diagnostics performs many of the same tasks you might typically perform when looking for a failure, but it can do the tasks more quickly and without the need for a storage dump.

Runtime Diagnostics
Sysplex Failure Management (SFM)

SFM allows you to define a sysplex-wide policy that specifies the actions that MVS™ is to take when certain failures occur in the sysplex, so that the sysplex recovers automatically when a system failure is detected.

Controlling system availability and recovery through the SFM policy
Automatic restart manager (ARM)

ARM enables automatic restart of all subsystems, in the prescribed order, when recovering from an outage. This is important because subsystems that restart immediately can release their retained database locks sooner.

Automatic restart manager
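Restarting "in the prescribed order" amounts to sorting the failed elements by a predefined priority. The Python sketch below illustrates that idea only; it does not use ARM policy syntax or any ARM interface, and the function name, element names, and level numbers are hypothetical.

```python
def restart_sequence(levels: dict[str, int], failed: set[str]) -> list[str]:
    """Return the failed elements in the order they should be restarted.

    `levels` maps each element name to a restart level; lower levels
    restart first. Elements at the same level are ordered by name so the
    result is deterministic.
    """
    unknown = failed.difference(levels)
    if unknown:
        raise KeyError(f"no restart level defined for: {sorted(unknown)}")
    return sorted(failed, key=lambda name: (levels[name], name))


# Hypothetical elements and levels: lock manager first, then the database,
# then the transaction regions that depend on it.
levels = {"LOCKMGR": 1, "DATABASE": 2, "TXNREGION1": 3, "TXNREGION2": 3}
order = restart_sequence(levels, {"TXNREGION2", "DATABASE", "LOCKMGR"})
print(order)  # ['LOCKMGR', 'DATABASE', 'TXNREGION2']
```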
System Status Detection partitioning protocol The System Status Detection (SSD) partitioning protocol exploits BCPii interfaces for availability and recovery. You can avoid SFM having to wait the failure detection interval (FDI). BCPii allows XCF to query the state of other systems via authorized interfaces through the support element and HMC network. Benefits: XCF can detect and reset failed systems more quickly, and sympathy sickness time is reduced. Using the system status detection partitioning protocol and BCPii for availability and recovery
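The benefit can be pictured as the difference between waiting out a timer and asking the hardware directly. The following Python sketch is a rough timing model under stated assumptions, not XCF or BCPii code; the function name and every number in it are hypothetical.

```python
def seconds_until_partition(fdi_seconds: float,
                            seconds_since_last_update: float,
                            status_query_available: bool,
                            query_seconds: float = 1.0) -> float:
    """Estimate how long peers wait before treating a silent system as failed."""
    if status_query_available:
        # The peer asks the hardware directly whether the system is down.
        return query_seconds
    # Otherwise the peer must let the rest of the detection interval expire.
    return max(fdi_seconds - seconds_since_last_update, 0.0)


# Hypothetical values: a 60-second interval, 5 seconds after the last update.
print(seconds_until_partition(60.0, 5.0, status_query_available=False))  # 55.0
print(seconds_until_partition(60.0, 5.0, status_query_available=True))   # 1.0
```

The shorter the wait, the less time work on the healthy systems spends blocked on resources held by the failed one.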
SMF record flooding avoidance

Avoid a potential system outage or data loss caused by an unexpected surge in SMF record arrivals. Set up rules that match a flood condition of SMF data records and either issue a warning message or begin dropping records for specific SMF types.

Specifying SMF record flood options
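A flood rule of this kind boils down to comparing the arrival count for a record type against thresholds and choosing an action. The Python sketch below shows that decision logic only. It does not reproduce the actual SMF parameter syntax, and the `FloodRule` fields, the two-threshold design, and the numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FloodRule:
    """Hypothetical rule for one SMF record type (names are illustrative)."""
    record_type: int
    warn_per_interval: int   # arrivals per interval that trigger a warning
    drop_per_interval: int   # arrivals per interval that trigger dropping


def action_for(rule: FloodRule, arrivals_in_interval: int) -> str:
    """Return 'accept', 'warn', or 'drop' for the current interval."""
    if arrivals_in_interval >= rule.drop_per_interval:
        return "drop"
    if arrivals_in_interval >= rule.warn_per_interval:
        return "warn"
    return "accept"


rule = FloodRule(record_type=30, warn_per_interval=1000, drop_per_interval=5000)
for arrivals in (200, 1500, 9000):
    print(arrivals, action_for(rule, arrivals))
```

Dropping records trades completeness of the SMF data for system stability, so a rule like this would typically be scoped to the specific record types that are known to flood.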
Message flooding avoidance

Message Flood Automation addresses the problem of runaway write-to-operator conditions that can cause severe disruptions to z/OS operation. It can take installation-specified actions in response.

Message Flood Automation
Detection of anomalies by z/OS components

z/OS attempts to detect anomalies as close to the source as possible, using the least amount of resources and requiring the smallest amount of the stack to be operational. Detecting an anomaly requires the ability to identify when something is wrong. Often, the signal is that component processing has exceeded a threshold set by the installation. The threshold might act as a “throttle” that allows the component to manage its own work queues. Typically, the installation can make a good guess at the appropriate value. Recommendations are generally available in IBM Documentation and z/OS Health Checks.
Learn more IBM Z Resilience features

IBM Z Cyber Vault uses process and air-gap technology to protect business data. For more information, see IBM Z Cyber Vault.

IBM z/OS Workload Interaction Correlator (zWIC) provides a synchronized, standardized approach to data generation that lets you dynamically define and correlate disparate customer-specific performance anomalies without a predefined policy. This correlation capability helps you implicate or exonerate workload components: correlated anomalous activities are identified and non-anomalous activities are exonerated. zWIC data can be streamed through IBM Z Common Data Provider (CDP). For more information, see IBM z/OS Workload Interaction Correlator in IBM Documentation.

IBM z/OS Workload Interaction Navigator is a visually intuitive analytics engine designed to provide visibility into interdependencies and interactions across workloads and to dynamically recognize anomalous behavior across one or two interval periods over multiple subsystems. For more information, see IBM z/OS Workload Interaction Navigator in IBM Documentation.

IBM Z DR for Cloud addresses the requirement for more frequent disaster recovery (DR) testing and emerging government regulations that mandate extended operations out of DR sites. It provides a cost-efficient way to move processing capacity between production and DR sites for extended periods of time. IBM Z DR for Cloud works with IBM software under Country Multiplex Pricing (CMP) or the Tailored Fit Pricing Software Consumption Solution. For information about the Software Consumption Solution, see the Tailored Fit Pricing content solution.

IBM Health Checker for z/OS identifies potential problems before they impact your availability or cause outages. It checks the current active z/OS and sysplex settings and definitions for a system and compares the values to those suggested by IBM or defined by you.

Read about Health Checker for z/OS in IBM Documentation.
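Comparing active settings against suggested values can be illustrated with a short Python sketch. It models the comparison only and is not the Health Checker interface; the function name, the setting names, and the values are hypothetical.

```python
def check_settings(current: dict[str, object],
                   suggested: dict[str, object]) -> list[str]:
    """List every setting whose current value differs from the suggested one."""
    findings = []
    for name, expected in sorted(suggested.items()):
        actual = current.get(name, "<not set>")
        if actual != expected:
            findings.append(f"{name}: current={actual!r}, suggested={expected!r}")
    return findings


# Hypothetical setting names and values, for illustration only.
suggested = {"dump_space_mb": 2048, "log_stream_duplexed": True}
current = {"dump_space_mb": 512, "log_stream_duplexed": True}
for finding in check_settings(current, suggested):
    print(finding)  # dump_space_mb: current=512, suggested=2048
```

Keeping the suggested values in a table that the installation can override mirrors the idea that checks compare against values "suggested by IBM or defined by you".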

Technical resources
IBM Z System Recovery Boost content solution

Explore the IBM Z System Recovery Boost content solution

Explore the content solution
The cloud you want for cyber resilience

In this engaging lightboard demonstration, IBM's Nada Santiago will show how you can build a resilient infrastructure and get back up and running faster with IBM Z and IBM LinuxONE.

Watch the video
Getting Started with IBM Z Resilience

This IBM Redbooks publication gives a broad understanding of resilience on the IBM Z platform and explains how it works and why it is important.

Read the IBM Redbook

What's new

Links to z/OS documentation were updated to use the z/OS 2.5 library.