
High Availability and Disaster Recovery with IBM Db2 Data Gate for z/OS

White Papers


Abstract

This paper describes how to implement high availability and disaster recovery (HA/DR) for IBM Db2 Data Gate for z/OS. Db2 Data Gate supports several storage options. The different storage options require a slightly different high availability setup.

Content

Because IBM Db2 Data Gate runs on IBM Cloud Pak for Data, which in turn is based on Red Hat OpenShift, this document first explains the relevant OpenShift basics before going into the details of how both high availability and disaster recovery scenarios can be implemented.

Lastly, this document shows how to configure the Db2 JDBC driver to support a Db2 Data Gate high availability/disaster recovery setup.

Storage options

IBM Db2 Data Gate for z/OS supports multiple IBM Cloud Pak for Data storage options. At the time of this writing, the following storage types are supported:

  • Network File System (NFS)
  • Portworx
  • Rook/Ceph
  • hostPath storage (local)

OpenShift basics

IBM Cloud Pak for Data is based on Red Hat OpenShift, a platform as a service built around application containers. Container management and orchestration in OpenShift is handled by Kubernetes. In addition, OpenShift contains software for monitoring and for secure communication between containers. An OpenShift cluster consists of several master (control plane) nodes and worker nodes.

For a discussion of HA and DR in this context, the following terms must be introduced: pods, projects, and routes.

  • Pod: A pod is the basic execution unit of a Kubernetes application. A pod consists of one or more containers with shared storage and a shared network. The podspec specifies how to run the containers and which containers have access to which shared storage volumes. All containers of a pod are always co-located on a node and run in a shared environment. That is, all containers in a pod can reference each other via localhost.
  • Project: A project allows you to partition the cluster virtually into smaller sub-clusters. It also allows you to divide the cluster resources between different projects.
  • Route: A route exposes a service hosted on the OpenShift cluster to the outside world, so that users can interact with the service. Db2 Data Gate is a separate service, and the Db2 instance used by Db2 Data Gate is a separate service as well (a short command sketch for listing routes follows this list).
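
A quick way to see which services are exposed is to list the routes with the oc client. This is only a minimal sketch; the project name cp4d is a hypothetical placeholder and must be replaced with the project that hosts your Cloud Pak for Data installation:

  [root@hjdg1-inf ~]# oc get routes -n cp4d

The output includes the route URLs of the Db2 Data Gate service and of the Db2 instance it uses.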

High availability setup

OpenShift and Kubernetes monitor pods and restart them if one or more containers inside a pod indicate an error condition. This includes restarting a pod on a different node in the case of a node failure. In this sense, high availability is already built into OpenShift.

The choice of the storage - cluster-wide vs. node-local - has implications for the high availability setup of OpenShift. The impact can be outlined as follows:

  • Container or pod failure: If a container or an entire pod fails, OpenShift will restart it on the same node. This requires a short outage. In this particular case, there is no difference between cluster-wide storage and local storage.
  • Disk failure: Disk failures can be tolerated up to the RAID level’s fault tolerance. There won’t be any outage in this case. Outages as a result of disk failures exceeding the RAID level’s fault tolerance are not in the scope of Cloud Pak for Data and must therefore be considered during the planning phase of the storage configuration.
  • Node failure: With cluster-wide storage, Kubernetes will bring up the pod on a different node in case of a node failure. With local storage, this is not possible because only one node and its associated storage is available. A possible solution is to provision a second Db2 Data Gate instance with local storage on a different node (see the command sketch after this list). This second instance needs to connect to the same Db2 for z/OS data source. It should contain the same set of tables, or at least a subset of the most business-critical tables.
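
When you plan a second instance on a different node, it is helpful to check on which worker node each pod is currently running. A minimal sketch, assuming the Db2 Data Gate pods can be identified by the name filter datagate (adjust the filter to your actual pod names); the -o wide option adds the node column to the output:

  [root@hjdg1-inf ~]# oc get pods -o wide | grep datagate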

Depending on the use case, performance might be an important factor during the planning phase for your Db2 Data Gate installation. Local storage offers the highest throughput and performance, but it comes at the expense of having to set up and maintain a second Db2 Data Gate instance if high availability or disaster recovery is required. IBM can offer support in exploring the different storage options and their advantages or disadvantages based on the specific workload that Db2 Data Gate is planned to process.

The following sections describe how to set up a second Data Gate instance (including the necessary Db2 instance) to realize a fully functional high availability setup with local storage – as depicted in Figure 1.


Figure 1 HA setup before and after a failover event

Setting up a second Db2 Data Gate instance

After setting up the first Db2 Data Gate instance with its required Db2 instance, you need to create a second Db2 Data Gate instance. You can use the same process: first deploy a new Db2 instance, then, after that task has finished, deploy a new Db2 Data Gate instance and select the newly created Db2 instance as the backend for the second Db2 Data Gate instance.

Policy agent configuration

On the z/OS side, configure the policy agent to initiate encrypted connections to both Db2 Data Gate instances. To do this, define a second policy in addition to the one that already exists for the first Db2 Data Gate instance. This second policy ensures that the traffic to the second Db2 Data Gate system is encrypted as well, possibly with a different certificate. The procedure for creating the policy is described in the Cloud Pak for Data product hub: https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/svc-dg/dg-network-configure-policy-agent.html.

In the end, there must be two policies:

  1. Db2 Data Gate 1: The policy defines that traffic to the route URL and the port of Db2 Data Gate 1 is encrypted.
  2. Db2 Data Gate 2: The policy defines that traffic to the route URL and the port of Db2 Data Gate 2 is encrypted (an illustrative sketch follows this list).
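
On z/OS, such policies are typically expressed as AT-TLS rules in the Policy Agent configuration. The following is only an illustrative sketch of what the additional rule for Db2 Data Gate 2 might look like; the rule name, IP address, port, and the referenced group and environment actions are hypothetical placeholders that must be aligned with the existing policy for Db2 Data Gate 1 and with the route and port of the second instance:

  # Hypothetical AT-TLS rule for the second Db2 Data Gate instance.
  # RemoteAddr is the address behind the route URL of Db2 Data Gate 2,
  # RemotePortRange is its port.
  TTLSRule                      DataGate2Rule
  {
    RemoteAddr                  203.0.113.42
    RemotePortRange             30942
    Direction                   Outbound
    Priority                    255
    TTLSGroupActionRef          DataGateGroupAction
    TTLSEnvironmentActionRef    DataGate2Environment
  }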

The route URL and the port are defined during the provisioning of the Db2 Data Gate instance and can be viewed by selecting the Details menu in “Collect > My data”.


Maintaining the second Db2 Data Gate instance

There is no automatic mechanism to keep both Db2 Data Gate instances in sync. You must keep the tables in the second instance manually up-to-date, either by using the Cloud Pak for Data user interface or automation scripts.

Although advisable, it is not necessary to duplicate all the tables of the first Db2 Data Gate instance in the second Db2 Data Gate instance. It is sufficient to add and load only those tables that are business-critical and hence must be available in an HA or DR scenario.

User creation

By default, the users with access to Db2 Data Gate Db2 instances 1 and 2 have randomly generated passwords. This causes a problem when trying to fail over from one instance to the other, as it would result in authentication failures. To overcome this issue, it is necessary to keep the passwords of all users in the HA setup in sync. This can be achieved by manually changing one or both passwords, as described in this article: https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/cpd/svc/dbs/aese-crtnoncp4duser.html.

Note: The first command in the linked article is not suitable for the purpose at hand because it was written for a Cloud Pak for Data cluster with a single Db2 instance.

  [root@hjdg1-inf ~]# oc rsh $(oc get po | grep ldap | cut -d " " -f 1) /bin/bash

Instead, retrieve all Db2 LDAP pods using the following command:

  [root@hjdg1-inf ~]# oc get po | grep ldap | cut -d " " -f 1

The steps in the linked article must be followed from top to bottom for both entries. Consider the following sample output:

  [root@johker-dg1-inf ~]# oc get po | grep ldap | cut -d " " -f 1
  db2oltp-1592559437925-db2u-ldap-655c9d7dd5-jz5p7
  db2oltp-1592569168177-db2u-ldap-589d559785-hmt6p

The command in the document must be run once for the first list entry:

  [root@hjdg1-inf ~]# oc rsh db2oltp-1592559437925-db2u-ldap-655c9d7dd5-jz5p7 /bin/bash

and a second time for the second entry:

  [root@hjdg1-inf ~]# oc rsh db2oltp-1592569168177-db2u-ldap-589d559785-hmt6p /bin/bash

JDBC driver settings

The final step in creating an HA or DR setup is to configure the JDBC driver correctly. The setup depends on a feature of the Db2 JDBC driver called client affinities (https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.db2.luw.apdv.java.doc/src/tpc/imjcc_c0056256.html). This feature allows you to switch transparently from a primary server to a secondary server and back to the primary server as soon as it is available again (if this behavior is wanted). A sample configuration is shown here:

  DB2SimpleDataSource ds = new DB2SimpleDataSource();

  ds.setUser("user1001");
  ds.setPassword("tops3cr3t");
  ds.setDatabaseName("BLUDB");
  ds.setPortNumber(31504);
  ds.setServerName("db2-1.route.example.com");
  ds.setDriverType(4);
  ds.setRetrieveMessagesFromServerOnGetMessage(true);

  ds.setEnableClientAffinitiesList(1);
  ds.setClientRerouteAlternateServerName("db2-1.route.example.com,db2-2.route.example.com");
  ds.setClientRerouteAlternatePortNumber("31504,30942");
  ds.setEnableSeamlessFailover(1);
  ds.setAffinityFailbackInterval(30);

The first block of statements establishes a connection to db2-1.route.example.com on port 31504. The second block sets up the client affinities. In this case, db2-2.route.example.com on port 30942 is defined as the secondary server. Setting seamless failover to 1 (true) prevents the generation of SQL error code -4498 when a failover occurs, which would otherwise result in an SQLException in the Java program. The affinity failback interval defines whether the JDBC driver attempts to fail back to the primary server: a value of 0 or less disables the feature, while a value greater than 0 causes the JDBC driver to attempt a reconnect to the primary server every X seconds, where X is the specified value. The following link points to an article that discusses each property in more detail:

https://www.ibm.com/support/knowledgecenter/SSEPGG_11.5.0/com.ibm.db2.luw.apdv.java.doc/src/tpc/imjcc_r0052038.html.

The client affinities feature is available for the Db2 CLI driver, the .NET driver, and the JDBC type 4 driver.

Testing the setup

You can test the HA or DR setup as follows:

  1. Add and load a single table into both Db2 instances.
  2. Insert two more rows into the first Db2 instance than into the second instance.
  3. Customize the sample program below according to your needs. When properly customized, the program opens a connection to Db2 and executes a “COUNT(*)” statement on the previously added table. Because the number of rows in the table differs between the two Db2 instances, you can easily tell which server the program is currently connected to by checking the row count.

To force a failover from the primary to the secondary Db2 instance, take the primary Db2 instance offline by deleting its Db2 pod:

  oc delete pod db2oltp-1592559437925-db2u-0

After the pod restarts, the primary Db2 instance comes online and is reachable again. Depending on the JDBC driver setting “setAffinityFailbackInterval”, the JDBC driver either reconnects to the primary Db2 instance or omits the reconnection.
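
To follow the restart, you can watch the pod status until the pod is in the Running state again. A minimal sketch, assuming the pod name from the command above:

  [root@johker-dg1-inf ~]# oc get pods -w | grep db2u-0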

Sample program

  package com.ibm.test;

  import java.sql.Connection;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;

  import com.ibm.db2.jcc.DB2SimpleDataSource;

  public class Main {

      public static void main(String[] args) {

          DB2SimpleDataSource ds = new DB2SimpleDataSource();

          // Connection settings for the primary server
          ds.setUser("user1001");
          ds.setPassword("tops3cr3t");
          ds.setDatabaseName("BLUDB");
          ds.setPortNumber(31504);
          ds.setServerName("db2-1.route.example.com");
          ds.setDriverType(4);
          ds.setRetrieveMessagesFromServerOnGetMessage(true);

          // Client affinities: primary and secondary server, seamless failover,
          // and failback to the primary server every 30 seconds
          ds.setEnableClientAffinitiesList(1);
          ds.setClientRerouteAlternateServerName("db2-1.route.example.com,db2-2.route.example.com");
          ds.setClientRerouteAlternatePortNumber("31504,30942");
          ds.setEnableSeamlessFailover(1);
          ds.setAffinityFailbackInterval(30);

          try (Connection conn = ds.getConnection()) {

              // Query the row count once per second; the count shows which server is active
              for (int i = 0; i < 100; i++) {

                  try (Statement stmt = conn.createStatement();
                       ResultSet rs = stmt.executeQuery(
                               "select count(*) as total from \"DGBVTSCHEMA\".\"DGBVTTABLE\"")) {

                      while (rs.next()) {
                          int total = rs.getInt("total");
                          System.out.println(total);
                      }

                  } catch (SQLException e) {
                      e.printStackTrace();
                  }

                  Thread.sleep(1000);
              }

          } catch (SQLException e1) {
              e1.printStackTrace();
          } catch (InterruptedException e) {
              e.printStackTrace();
          }
      }
  }

Sample program output

  [root@johker-dg1-inf ha_sample]# java -classpath /root/ha_sample/db2jcc4.jar:/root/ha_sample/db2jcc_license_cisuz.jar:$PWD com.ibm.test.Main
  4
  4
  4
  4
  4
  2
  2
  ...
  2
  2
  4
  4

As can be observed in the sample output, the program first connects to the primary server (row count 4), then switches to the secondary server (row count 2), and finally switches back to the primary server (row count 4) as soon as it is available again after the outage.

Disaster recovery setup

The high availability setup can also be used for disaster recovery. Instead of creating the second Db2 Data Gate instance on the same cluster, create this instance on a secondary cluster located in a separate data center. This way, the system will not fail over from one instance to the other on the same cluster, but to an instance on a different cluster or data center.

To implement both high availability and disaster recovery, this concept can be expanded so that two Db2 Data Gate instances and two Db2 instances run on each cluster. The addresses of all Db2 instances involved must be configured in the client affinities section of the JDBC driver, as sketched below. This way, the driver switches to another Db2 system if one becomes unavailable.
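
A minimal sketch of such a four-server configuration, based on the data source settings shown earlier. The host names db2-3.route.example.com and db2-4.route.example.com and the repeated ports are hypothetical placeholders for the routes of the secondary cluster:

  // Hypothetical example: two Db2 instances per cluster, four alternate servers in total.
  // The first entry in the list is the preferred (primary) server.
  ds.setEnableClientAffinitiesList(1);
  ds.setClientRerouteAlternateServerName(
          "db2-1.route.example.com,db2-2.route.example.com,"
        + "db2-3.route.example.com,db2-4.route.example.com");
  ds.setClientRerouteAlternatePortNumber("31504,30942,31504,30942");
  ds.setEnableSeamlessFailover(1);
  ds.setAffinityFailbackInterval(30);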

Limitations

Keep the following limitations in mind when you work with a HA or DR setup as described above:

  • Write access to tables might cause incorrect results if a failover happens during a write operation. This can be prevented by setting seamless failover to false and handling the generated SQLException correctly (see the sketch after this list).
  • Read access to tables might yield different results, because the synchronization of tables in different Db2 Data Gate instances is not coordinated. For example, a table could have a synchronization latency of 1 second in instance 1, but a synchronization latency of 60 seconds in instance 2. If a seamless failover occurs and the same query is run once before and once after the failover, it might yield different results. But even without a seamless failover, consistent query results cannot be guaranteed: the decision to switch to a secondary database is made by the program logic, which has no knowledge of the synchronization states of the tables in the two instances and therefore cannot make an informed decision.
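
The following is a minimal sketch of how the first limitation could be handled when seamless failover is disabled (ds.setEnableSeamlessFailover(0)). The table column and values are hypothetical; the point is the explicit handling of error code -4498, which the driver reports when the connection has been rerouted to another server:

  // Hypothetical write with explicit handling of a client reroute (error code -4498)
  try (Statement stmt = conn.createStatement()) {
      stmt.executeUpdate("INSERT INTO \"DGBVTSCHEMA\".\"DGBVTTABLE\" (ID) VALUES (42)");
  } catch (SQLException e) {
      if (e.getErrorCode() == -4498) {
          // The connection was rerouted and the transaction was rolled back:
          // verify the application state and re-run the write on the new server if appropriate.
          System.err.println("Failover detected, the write must be repeated on the new server");
      } else {
          throw e; // unrelated error; the enclosing method declares throws SQLException
      }
  }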

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SS2NGZ","label":"IBM Db2 for z\/OS Data Gate"},"ARM Category":[{"code":"a8m0z0000000741AAA","label":"Administration"}],"ARM Case Number":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Product Synonym

DG

Document Information

Modified date:
16 July 2020

UID

ibm16244194