Data resilience
You can use a number of technologies to address the data resilience requirements described in the “Benefits of High Availability” section. The key multisystem data resilience technologies are described below. Keep in mind that multiple technologies can be used in combination to further strengthen your data resilience.
Logical replication
Logical replication is a widely deployed multisystem data resilience topology for high availability (HA) in the IBM® i space. It is typically deployed through a product provided either by IBM or a high availability independent software vendor (ISV). Replication of objects is performed through software: changes to the objects (for example, a file, member, data area, or program) are replicated to a backup copy. The replication is near real time, or in real time with synchronous remote journaling, for all journaled objects. If an object such as a file is journaled, replication is typically handled at the record level. For objects that are not journaled, such as user spaces, replication is typically handled at the object level. In this case, the entire object is replicated after each set of changes to the object is complete.
Most logical replication solutions allow for additional features beyond object replication. For example, you can achieve additional auditing capabilities, observe the replication status in real time, automatically add newly created objects to those being replicated, and replicate only a subset of objects in a given library or directory.
To build an efficient and reliable multisystem HA solution using logical replication, synchronous remote journaling is the preferable transport mechanism. With remote journaling, IBM i continuously moves the newly arriving data in the journal receiver to the journal receiver on the backup server. A software solution then “replays” these journal updates, applying them to the objects on the backup server. After this environment is established, there are two separate yet identical objects: one on the primary server and one on the backup server.
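The replay side of such a solution can be pictured with a small Python sketch. The journal-entry format and the backup store below are hypothetical illustrations of the flow just described, not an IBM i or ISV product interface; journaled objects are applied record by record, while non-journaled objects are replaced whole:

from dataclasses import dataclass

@dataclass
class JournalEntry:
    object_name: str   # for example, a library/file name or an IFS path
    entry_type: str    # "RECORD" for journaled objects, "OBJECT" otherwise
    key: str           # record identifier (unused for whole-object entries)
    payload: bytes     # record image or serialized object contents

backup_store = {}      # in-memory stand-in for objects on the backup server

def replay(entries):
    """Apply journal entries arriving from the remote journal receiver."""
    for e in entries:
        if e.entry_type == "RECORD":
            # Journaled object: apply the change at the record level.
            backup_store.setdefault(e.object_name, {})[e.key] = e.payload
        else:
            # Non-journaled object: replace the entire object after each
            # complete set of changes.
            backup_store[e.object_name] = {"*WHOLE": e.payload}

replay([JournalEntry("APPLIB/ORDERS", "RECORD", "42", b"updated row"),
        JournalEntry("APPLIB/USRSPC1", "OBJECT", "", b"entire user space")])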
With this solution in place, you can rapidly activate your production environment on the backup server by doing a role-swap operation. The figure below illustrates the basic mechanics in a logical replication environment.
A key advantage of this solution category is that the backup database file is live. That is, it can be accessed in real time for backup operations or for other read-only application types, such as building reports. In addition, a live copy normally means that minimal recovery is needed when switching over to the backup copy.
The challenge with this solution category is the complexity that can be involved in setting up and maintaining the environment. One of the fundamental challenges lies in strictly policing modification of the live copies of objects that reside on the backup server. Failure to enforce such discipline can lead to instances in which users and programmers make changes against the live copy, so that it no longer matches the production copy. If this happens, the primary and backup versions of your files are no longer identical.
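One common countermeasure is auditing that verifies the two copies still match. The following Python sketch shows the idea with whole-object fingerprints; the object names are invented, and real products audit at a much finer grain:

import hashlib

def fingerprint(contents: bytes) -> str:
    """Hash an object's contents so the two copies can be compared cheaply."""
    return hashlib.sha256(contents).hexdigest()

production = {"APPLIB/FILE1": b"record data", "APPLIB/DTAARA1": b"value"}
backup     = {"APPLIB/FILE1": b"record data", "APPLIB/DTAARA1": b"edited"}

drifted = [name for name, contents in production.items()
           if fingerprint(contents) != fingerprint(backup.get(name, b""))]
print("No longer identical:", drifted)   # -> ['APPLIB/DTAARA1']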
Another challenge associated with this approach is that objects that are not journaled must reach a checkpoint, be saved, and then be sent separately to the backup server. Therefore, the granularity of the real-time nature of the process may be limited to the granularity of the largest object being replicated for a given operation.
For example, suppose a program updates a record in a journaled file and, as part of the same operation, also updates an object that is not journaled, such as a user space. The backup copy becomes completely consistent only when the user space is entirely replicated to the backup system. Practically speaking, if the primary system fails and the user space object is not yet fully replicated, a manual recovery process is required to reconcile the state of the non-journaled user space to match the last valid operation whose data was completely replicated.
Another possible challenge associated with this approach lies in the latency of the replication process: the amount of lag time between the time at which changes are made on the source system and the time at which those changes become available on the backup system. Synchronous remote journaling can mitigate this to a large extent. Regardless of the transmission mechanism used, you must adequately project your transmission volume and size your communication lines and speeds properly to help ensure that your environment can manage replication volumes when they reach their peak. In a high-volume environment, replay backlog and latency may be an issue on the target side even if your transmission facilities are properly sized.
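A rough sizing check makes the peak-volume point concrete. Every figure in this Python sketch is an assumption chosen for illustration, not a measurement from any particular workload:

# Illustrative link-sizing check; all figures are assumptions.
peak_journal_mb_s = 12.0      # assumed peak journal data generation, MB/s
steady_journal_mb_s = 4.0     # assumed steady-state rate, MB/s
link_capacity_mb_s = 10.0     # assumed effective line capacity, MB/s
peak_duration_s = 300         # assumed length of the peak, seconds

# During the peak, the line falls behind and a backlog accumulates.
backlog_mb = max(0.0, (peak_journal_mb_s - link_capacity_mb_s) * peak_duration_s)

# After the peak, the spare line capacity drains the backlog.
catch_up_s = backlog_mb / (link_capacity_mb_s - steady_journal_mb_s)

print(f"Backlog after peak: {backlog_mb:.0f} MB")       # 600 MB
print(f"Time to catch up:   {catch_up_s:.0f} seconds")  # 100 seconds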
Switchable device
A switchable device is a collection of hardware resources such as disk units, communication adapters, and tape devices that can be switched from one system to another. For data resilience, the disk units can be configured into a special class of auxiliary storage pool (ASP) that is independent of a particular host system. The practical outcome of this architecture is that switching an independent disk pool from one system to another involves less processing time than a full initial program load (IPL). The IBM i implementation of independent disk pools supports both directory objects (such as the integrated file system (IFS)) and library objects (such as database files). This is commonly referred to as switched disks.
The benefit of using independent disk pools for data resilience lies in their operational simplicity. The single copy of the data is always current, meaning there is no other copy with which to synchronize. No in-flight data, such as data that is transmitted asynchronously, can be lost, and there is minimal performance overhead. Role swapping or switching is relatively straightforward, although you might need to account for the time required to vary on the independent disk pool.
Another key benefit of using independent disk pools is the absence of the transmission latency that can affect any replication-based technology. The major effort associated with this solution involves setting up the direct-access storage device (DASD) configuration and the data and application structure. Making an independent disk pool switchable is relatively simple.
Limitations are also associated with the independent disk pool solution. First, there is only one logical copy of the data in the independent disk pool. This can be a single point of failure, although the data should be protected with RAID 5, RAID 6, or mirroring. The data cannot be concurrently accessed from both hosts; operations such as read access or backup to tape cannot be done from the backup system. Certain object types, such as configuration objects, cannot be stored in an independent disk pool. You need another mechanism, such as periodic save and restore operations, a cluster administrative domain, or logical replication, to ensure that these objects are appropriately maintained.
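The following Python sketch is a conceptual illustration of that last point: changes to resources that cannot live in the disk pool are pushed to every node so the nodes stay aligned. It is only in the spirit of a cluster administrative domain, not the actual IBM i implementation, and the resource names are invented:

# Conceptual sketch: propagate each monitored-resource change to every node
# so that objects outside the independent disk pool stay aligned.
nodes = {"NODEA": {}, "NODEB": {}}   # per-node copies of monitored resources

def apply_change(resource, value):
    """Push one monitored-resource change to all nodes in the domain."""
    for resources in nodes.values():
        resources[resource] = value

apply_change("time zone system value", "UTC-5")
apply_change("user profile APPUSER", "*ENABLED")
print(nodes["NODEB"])   # the backup node now matches the production node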
Another limitation involves hardware-associated restrictions. Examples include distance limits in the High Speed Link (HSL) loop technology and outages associated with certain hardware upgrades. Also, the independent disk pool cannot be brought online to a system running an earlier release. With this in mind, up-front system environment design and analysis are essential.
Switched logical unit (LUN) characteristics
Switched logical units allow data that is stored in an independent disk pool, built from logical units created in an IBM System Storage DS8000 or DS6000, to be switched between systems, providing high availability.
A switched logical unit is an independent disk pool that is controlled by a device cluster resource group and can be switched between nodes within a cluster. When switched logical units are combined with IBM i cluster technology, you can create a simple and cost-effective high availability solution for planned and some unplanned outages.
The device cluster resource group (CRG) controls the independent disk pool, which can be switched automatically in the case of an unplanned outage or switched manually with a switchover.
A group of systems in a cluster can take advantage of the switchover capability to move access to the switched logical unit pool from system to system. A switched logical unit must be located in an IBM System Storage DS8000 or DS6000 connected through a storage area network. Switched logical units operate similarly to switched disks, but no hardware is switched between logical partitions. When the independent disk pool is switched, the logical units within the IBM System Storage unit are reassigned from one logical partition to another.
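The switchover flow can be modeled in a few lines of Python. This is a conceptual model of the behavior described above, not the IBM i cluster resource services interface; the node and pool names are invented:

class DeviceCRG:
    """Toy model of a device CRG with an ordered recovery domain."""

    def __init__(self, recovery_domain, iasp):
        self.recovery_domain = list(recovery_domain)  # [primary, backups...]
        self.iasp = iasp

    def switchover(self):
        """Move the independent disk pool to the first backup node."""
        old_primary = self.recovery_domain.pop(0)
        self.recovery_domain.append(old_primary)   # old primary becomes a backup
        new_primary = self.recovery_domain[0]
        print(f"Vary off {self.iasp} on {old_primary}")
        print(f"Reassign the storage unit's logical units to {new_primary}")
        print(f"Vary on {self.iasp} on {new_primary}")

crg = DeviceCRG(["NODEA", "NODEB"], "IASP01")
crg.switchover()   # access to the pool moves from NODEA to NODEB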
Cross-site mirroring (XSM)
Geographic mirroring
Geographic mirroring is a function of the IBM i operating system. All the data placed in the production copy of the independent disk pool is mirrored to a second independent disk pool on a second, perhaps remote, system.
The benefits of this solution are essentially the same as those of the basic switchable device solution, with the added advantage of providing disaster recovery from a second copy at an increased distance. The biggest benefit continues to be operational simplicity. The switching operations are essentially the same as those of the switchable device solution, except that you switch to the mirror copy of the independent disk pool, making this a straightforward HA solution to deploy and operate. As in the switchable device solution, objects that are not in the independent disk pool must be handled through some other mechanism, and the independent disk pool cannot be brought online to a system running an earlier release. Geographic mirroring also provides real-time replication support for hosted integrated environments such as Microsoft Windows and Linux, which is not generally possible through journal-based logical replication.
Because geographic mirroring is implemented as a function of the IBM i operating system, a potential limitation of a geographic mirroring solution is its performance impact in certain workload environments.
When running input/output (I/O)-intensive batch jobs, some performance degradation on the primary system is possible. Also be aware of the increased central processing unit (CPU) overhead required to support geographic mirroring, and that the backup copy of the independent disk pool cannot be accessed while data synchronization is in process. For example, if you want to back up to tape from the geographically mirrored copy, you must quiesce operations on the source system and detach the mirrored copy. You must then vary on the detached copy of the independent disk pool on the backup system, perform the backup procedure, and then reattach the independent disk pool to the original production host. The data that changed while the independent disk pool was detached is then synchronized. While the backups and the subsequent synchronization are taking place, your HA solution is running exposed, meaning there is no up-to-date second data set. Using source-side and target-side tracking minimizes this exposure.
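The tape-backup cycle just described can be summarized as an ordered sequence of steps. In this Python sketch the helper simply prints each step; these are hypothetical placeholders, not IBM i commands:

def step(n, action):
    print(f"{n}. {action}")

def backup_from_detached_copy(iasp="IASP01"):
    step(1, f"Quiesce operations on the source system for {iasp}")
    step(2, f"Detach the mirrored copy of {iasp}")
    step(3, "Resume production; the HA solution is now running exposed")
    step(4, f"Vary on the detached copy of {iasp} on the backup system")
    step(5, "Perform the backup procedure to tape")
    step(6, f"Reattach {iasp} to the original production host")
    step(7, "Synchronize changes made while detached (still exposed)")

backup_from_detached_copy()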
Metro Mirror
Metro Mirror is a function of the IBM System Storage® server. The data that is stored in the independent disk pool is on disk units located in the System Storage server. This solution involves replication at the hardware level to a second storage server by using IBM System Storage Copy Services. An independent disk pool is the basic unit of storage for the System Storage Peer-to-Peer Remote Copy (PPRC) function. PPRC provides replication of the independent disk pool to another System Storage server. IBM i provides a set of functions to combine PPRC, independent disk pools, and IBM i cluster resource services for coordinated switchover and failover processing through a device cluster resource group (CRG).
You can also combine this solution with other System Storage-based Copy Services functions, including FlashCopy®, for save window reduction.
Metro Mirror data transfer is done synchronously. As with any solution that uses synchronous communications, you must be aware of the distance limitations and bandwidth requirements associated with transmission times, because each write must be acknowledged by the remote storage server before it completes.
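The distance limitation follows from simple arithmetic: a synchronous write waits for the remote acknowledgment, so round-trip propagation delay is added to every production write. The figures in this Python sketch are illustrative assumptions, not DS8000 specifications:

# Light travels roughly 200 km per millisecond in optical fiber.
FIBER_KM_PER_MS = 200

def added_latency_ms(distance_km, overhead_ms=0.1):
    """Round-trip propagation delay plus an assumed protocol overhead."""
    return 2 * distance_km / FIBER_KM_PER_MS + overhead_ms

for km in (10, 50, 100, 300):
    print(f"{km:>4} km adds ~{added_latency_ms(km):.2f} ms to every write")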
Global Mirror
Global Mirror uses the same base technology as Metro Mirror, except that the transmission of data is done asynchronously, and a FlashCopy to a third set of disks is required to maintain data consistency. Because the data transmission is asynchronous, there is no limit to how geographically dispersed the System Storage servers can be from each other.
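A highly simplified model shows why the third set of disks matters with asynchronous transmission: the remote copy can be caught mid-update when a failure occurs, so recovery must come from a preserved consistent image. This Python sketch illustrates the principle only; it is not the DS8000 algorithm:

primary = {}       # production volume
secondary = {}     # asynchronously updated remote copy; may be inconsistent
consistent = {}    # FlashCopy-style point-in-time image of the secondary
pending = []       # writes not yet transmitted to the remote site

def write(key, value):
    primary[key] = value
    pending.append((key, value))   # sent later; the write completes at once

def drain():
    """Transmit pending writes to the remote copy."""
    while pending:
        key, value = pending.pop(0)
        secondary[key] = value

def consistency_point():
    """Preserve a known-good image once the secondary is caught up."""
    drain()
    consistent.clear()
    consistent.update(secondary)

write("a", 1); write("b", 2)
consistency_point()   # recovery uses `consistent`, never a half-applied
write("c", 3)         # `secondary` that was interrupted mid-drain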