Contents


Cross-site replication and recovery with IBM ODM Decision Server Insights

What you need for high availability and disaster recovery

Comments

With the introduction of Decision Server Insights, the IBM ODM product family is extended to address business scenarios that are becoming more sophisticated. Tracking events relevant to particular business entities, and associating new events in the context of previous events, allows decision making based on correlations and enables much more powerful automation capabilities.

The main differentiators of Decision Server Insights are the detection of business situations that analyze a combination of current events, previous historical events, predictive analytics scores, and geo-spatial context to decide when to act in a targeted manner. This analysis helps businesses improve their business results, detect potential risk and fraud, and improve business operations and alerting.

Of course, this new capability requires persisting the business state that supports the decisions into the Decision Server Insights infrastructure. To deal with the realities of enterprise class systems software (which must provide stable and reliable service even in an environment where hardware and software failures might occur) administrative teams responsible for these software systems must ensure that there are infrastructure and operational procedures in place to support the needs of the business.

This tutorial shares valuable information from the Decision Server Insights development and test teams about the architectural principles that make enterprise-class quality of service possible – and the implementation details necessary to make it a reality.

Background: high availability and disaster recovery

Failure events in software systems are typically divided into two categories. High availability events consist of events that compromise the function of only one element of the larger system, whether it is a physical machine or an operating system process. By contrast, disaster recovery events are events where many elements in the system (or the entire data center) are compromised simultaneously. Install and configure a Decision Server Insights reference topology, published earlier this year explained the reference topology for Decision Server Insights, which uses a network of IBM WebSphere eXtreme Scale servers that work together in a grid to provide the redundancy that is needed for high availability. (See the Resources section.) If one member of the grid fails, the remaining members can seamlessly continue processing.

However, when Decision Server Insights architects prepare for a disaster, they must prepare for a situation in which the entire WebSphere eXtreme Scale grid is lost and seamless recovery is impossible. Instead, the data that defines the state of business entities must be replicated to a site far enough away to remain unaffected by the disaster, and processing must be recovered from this replica. Decision Server Insights also uses features in IBM WebSphere Liberty for high availability and disaster recovery.

To support of this activity, Decision Server Insights allows configuration of a special backing database where you can persist the contents of the WebSphere eXtreme Scale grid. Then, you can use database management tools (such as the DB2 HADR feature and Oracle DataGuard) to replicate the database contents to a remote site. In this way, the Decision Server Insights topology consists of a dual database setup. One local database acts as the primary database and another database serves as a remote standby database. This tutorial looks at database failover and switchover capabilities that use the IBM DB2 High Availability Disaster Recovery (HADR) feature.

You can configure the DB2 HADR feature to recover from failure of the primary database either automatically or manually. For disaster recovery, manual failover is recommended, because many components beyond just the database (such as the remainder of the Decision Server Insights servers and connectivity components) must be recovered before processing is restored to normal. See the Resources section for more information about configuration and management of the DB2 HADR capabilities.

Background: Decision Server Insights components

There are five major servers in a Decision Server Insights topology:

  • Runtime server: This server is the main component in Decision Server Insights, where events are processed, and where entities are stored. All the runtime servers are part of a WebSphere eXtreme Scale cluster. The event queues and entity storage is partitioned across this cluster.
  • Input connectivity server: Input connectivity Liberty servers receive events from some enterprise system that generates events. Events are generated over HTTP or JMS and are sent to the runtime server to be processed.
  • Output connectivity server: Output connectivity Liberty servers receive emitted events from the runtime server, and send them through HTTP or JMS.
  • Catalog server: Catalog servers are a WebSphere eXtreme Scale component that track the placement of partitions among the WebSphere eXtreme Scale servers in the runtime cluster. Good high availability design dictates that catalog servers should not be installed on the same node as a runtime server.
  • Backing database server: The backing database is used to store events offline, when they are not needed in memory. It is used to recover when the whole cluster is restarted after it is shut down. The backing database is a prerequisite for Decision Server Insights, not strictly part of Decision Server Insights. Decision Server Insights supports DB2 and Oracle.

In Figure 1, you can see an example topology. A broker (possibly IBM Integration Bus) routes inbound events to the primary site. If there is a failure of the primary site, then the standby site is started (it might be started already), and the events are loaded into memory from the standby database. New requests are then routed by the broker to the standby site's inbound connectivity.

Figure 1. Example topology of Decision Server Insights highly available configuration with disaster recovery databases
Illustration                     of example topology for Decision Server                 Insights highly available configuration with disaster recovery databases
Illustration of example topology for Decision Server Insights highly available configuration with disaster recovery databases

To provide for high availability, you need at least two catalog servers in each of the primary and standby sites. This approach ensures that the catalog server is not a single point of failure at either site. The infrastructure that is depicted in Figure 1 also protects against network partition (split-brain scenarios) by ensuring that both catalog servers are located on the same subnet. Of course, the primary and secondary sites themselves are placed on independent network infrastructure. Network partition is avoided by ensuring that cross-site failover is managed manually rather than automatically, with procedures in place to ensure that the servers in only one of the sites are active at any given time.

Background: Key Decision Server Insights configuration parameters

There are several Decision Server Insights parameters that must be configured for optimal replication and recovery. This section examines the purpose and explains the recommended values for several commonly used parameters. You can alter each of the parameters by changing the setting value in the objectgriddeployment.xml file that is located in the grids directory of your created server.

Decision Server Insights partitions entity and event storage. The partitions are distributed across the cluster. The Decision Server Insights gateway routes events to the partition, where the associated entity is stored. In this way work load is balanced across the cluster.

Configuring the number of partitions in a cluster is important for efficient operation. If there are too few partitions per node, there won't be enough work to keep the CPUs busy. Since changing the number of partitions requires shutting down the entire cluster, you will want to configure extra partitions, if you think that you will be scaling out the cluster later. But don't set the number of partitions higher than you think you'll need, because there is some increased processor usage to manage them. See the numberOfPartitions setting for guidelines.

By default Decision Server Insights keeps one synchronous and one asynchronous replica of each partition in memory. The replicas are always kept on different nodes in case one node is lost. This approach is why running multiple servers per node does not achieve high availability.

The synchronous replicas provide high availability. When a server fails, several things happen:

  • The catalog servers will instantly change all synchronous replicas to become the primaries for all the primaries that were lost.
  • The asynchronous replicas are collected to become the synchronous replicas. (If there is no asynchronous replica, all of the data in the partition must be copied to create the synchronous replica.) Work on each partition is suspended until the synchronous replicas are in place.
  • The asynchronous replicas are re-created as time allows.

Table 1 shows the recommended settings for the main objectgriddeployment.xml file, which is in the grids directory of your created server.

Table 1. Configuration options for the objectgriddeployment.xml
SettingDescription
developmentMode="false"The developmentMode is set to false in production or realistic testing. If set to true, it tells WebSphere eXtreme Scale to ignore some of the other settings that are described in this section.
maxAsyncReplicas="1"Asynchronous partition replicas allow the system to recover quickly in an HA/CA situation. This value should normally be set to 1. The number of asynchronous replicas is automatically reduced for small clusters.
maxSyncReplicas="1"Synchronous partition replicas allow the system to handle high availability and continuous availablity without losing data or disrupting business. The maxSyncReplicas value determines the number of nodes that can be lost simultaneously without losing any data. We recommend that you set the value of this parameter to 1. Anything less compromises high availability. Anything more can negatively affect performance.
minSyncReplicas="0"The minSyncReplicas value is set to 0, so that the system can run, even if there is no replica. If this value is set to 1 (not recommended), the system stops if there is only one node in the runtime cluster.
numInitialContainers="3"This option is the number of servers that are required for the system to start. A good value to use is (maxAsyncReplicas + maxSyncReplicas + 1), but it must be set lower for clusters with 1 or 2 servers.
numberOfPartitions="127"Decision Server Insights runs most efficiently with about 20 partitions per node. A good value for numberOfPartitons across the cluster is to add up all the cores (CPUs) across the cluster, double it, and round up to the next prime number. Because this value cannot be changed while the cluster is running, you might want to increase it further for future scale out. For example, if you start with four 16-core nodes, but you think that eventually you might need eight nodes, set numberOfPartitions to 257.
writeBehind="T20;C200"Asynchronous write-behind is recommended for performance. Because of the in-memory redundancy, data integrity is not an issue with writeBehind, except in the case of a whole site loss. In a low-volume non-high availability installation, or if you are concerned about whole site loss, you can remove this setting to keep the backing database synchronously up to date.

Implementation details

The following example shows a Decision Server Insights setup in the reference topology. Primary and standby databases are on separate physical servers.

Test environment

Configure Decision Server Insights in a reference topology with a primary and standby database that you can configure with the DB2 high availability disaster recovery (HADR) feature. There are two catalogs, three containers, and two databases, as shown in Table 2.

Table 2. Reference topology configuration settings
Hostname/Function/IPSiteNotes
cistest01/Catalog/9.20.146.112 Primary Insights Catalog 1
cistest02/Catalog/9.20.146.113 Primary Insights Catalog 2
cistest03/Container/9.20.146.114 Primary Insights Container 1
cistest04/Container/9.20.146.115 Primary Insights Container 2
cistest05/Container/9.20.146.116 Primary Insights Container 3
cistest06/Container/9.20.146.117 Primary Insights Container 4
cistest07/Inbound/9.20.146.118 Primary Insights Inbound connectivity
cistest08/Outbound/9.20.146.119 Primary Insights Outbound connectivity
cistest09/Catalog/9.20.147.146 Standby Insights Catalog 1
cistest10/Catalog/9.20.147.147 Standby Insights Catalog 2
cistest11/Container/9.20.147.148 Standby Insights Container 1
cistest12/Container/9.20.147.149 Standby Insights Container 2
cistest13/Container/9.20.147.150 Standby Insights Container 3
cistest14/Container/9.20.147.151 Standby Insights Container 4
cistest15/Inbound/9.20.147.152 Standby Insights Inbound connectivity
cistest16/Outbound/9.20.147.153 Standby Insights Outbound connectivity
cistest17/db2/9.20.146.128 Primary Primary database
cistest18/db2/9.20.147.155 Standby Standby database
cistest19/Broker/9.20.146.130 Primary Broker

Database configuration

This section shows how to create the primary and standby databases.

  1. First, transfer the DB2Distrib.sql database creation script from the Decision Server Insights installation location (which should be under the runtime directory: for example, ~/runtime/ia/persistence/sql/DB2) to the system where the primary database is located. In this example, the DB2Distrib.sql file is put in /opt on the primary database system.
  2. On the primary database (which is named cistest17 with the IP address 9.20.146.128in this example) create a database named insights and create the tables and other components that Decision Server Insights requires:

    db2 create database insights

    db2 connect to insights

    db2 -f /opt/DB2Distrib.sql

  3. Configure the following settings:

    db2 update db cfg for insights using LOGARCHMETH1 LOGRETAIN

    db2 update db cfg for insights using HADR_LOCAL_HOST 9.20.146.128

    db2 update db cfg for insights using HADR_LOCAL_SVC 55001

    db2 update db cfg for insights using HADR_REMOTE_HOST 9.20.147.155

    db2 update db cfg for insights using HADR_REMOTE_SVC 55002

    db2 update db cfg for insights using HADR_REMOTE_INST db2inst1

    db2 update db cfg for insights using LOGINDEXBUILD ON

  4. Then, back up the database:
    db2 "backup database insights"
  5. On the standby database (which in this example is named cistest18 with an IP address of 9.20.147.155) you need to copy the backup file that you created from the primary database to the standby system. Name it something like INSIGHTS.0.db2inst1.DBPART000.20150706111358.001.
  6. Next, use the following command to restore the data into the standby database: db2 "restore database insights"
  7. Now run the following commands to configure the standby database to use the DB2 HADR feature:

    db2 update db cfg for insights using HADR_LOCAL_HOST 9.20.147.155

    db2 update db cfg for insights using HADR_LOCAL_SVC 55002

    db2 update db cfg for insights using HADR_REMOTE_HOST 9.20.146.128

    db2 update db cfg for insights using HADR_REMOTE_SVC 55001

    db2 update db cfg for insights using HADR_REMOTE_INST db2inst1

  8. Now you should be able to start the standby database with the following command:
    db2 start hadr on database insights as standby
  9. Then, you can start the HADR feature on the primary database with the following command:
    db2 start hadr on database insights as primary
  10. Use the db2pd -db insights -hadr command on each system to verify that the primary database started correctly. Look for the HADR_CONNECT_STATUS of CONNECTED.

Even though the DB2 HADR feature can be used for high availability of the database in some systems, it is not necessary for Decision Server Insights because the data grid provides redundancy. (The backing database plays no role in Decision Server Insights during normal run time. It is only used as a tool for disaster recover of the entire system.) For more information about the DB2 HA DR feature, see the Resources section.

Configuring the Decision Server Insights data source definition for disaster recovery

To configure Decision Server Insights database persistence, see the example configurations in the server.xml file:

<dataSource jndiName="jdbc/ia/persistence">
                <connectionManager/>
                <jdbcDriver>
                        <library>
                                <fileset dir="/opt/IBM/" includes="db2jcc4.jar, db2jcc_license_cu.jar"/>
                        </library>
                </jdbcDriver>
                <properties.db2.jcc currentSchema="DB2INST1" 
                  databaseName="insights" password="cistestdb" portNumber="50000" 
                  serverName="cistest17" user="db2inst1" alternateGroupDatabaseName="insights" 
	      clientRerouteAlternateServerName="cistest18" clientRerouteAlternatePortNumber="50000" 
                  maxRetriesForClientReroute="2" retryIntervalForClientReroute="15" enableSeamlessFailover="1"/>
</dataSource>

        <ia_persistence datasourceName="jdbc/ia/persistence" maxBatchSize="10000" maxCacheAge="1000"/>

Note the following key configuration values in the example:

  • alternateGroupDatabaseName , which points to the standby database.
  • clientRerouteAlternateServerName, which points to the standby database.
  • clientRerouteAlternateServerPortNumber, which specifies the standby database port number.
  • maxRetriesForClientReroute, which specifies the number of times specifies the number of times that Decision Server Insights attempts to connect to the primary database before it moves on to try to connect to the standby database.
  • retryIntervalForClientReroute, which specifies the time that passes between connection attempts.
  • enableSeamlessFailover, which causes a SQLException to be passed to the Decision Server Insights system when a database failover event occurs (necessary for Decision Server Insights recovery implementation).

For the objectGridDeployment.xml consider the following example:

...
<mapSet name="iaMaps" numberOfPartitions="127" numInitialContainers="3" minSyncReplicas="0" maxSyncReplicas="1" maxAsyncReplicas="1" developmentMode="false">
...

Database failover

For Decision Server Insights servers that run on Liberty versions earlier than 8.5.5.7, don't use the alternate server and alternate port parameters in the server.xml data source definition, because the Liberty versions ignore those parameters. Instead, manually change the server.xml file on the secondary Decision Server Insights system from the primary database. Change the definition of the primary data source so that when Decision Server Insights in the secondary data center starts, it properly finds the database in its own data center.

If you run Decision Server Insights on a Liberty version earlier than 8.5.5.7 and do not alter the database connection information, you might see the -4498 error code in the server logs, like the following example:

[AUDIT   ] J2CA0056I: The Connection Manager received a fatal connection error from the Resource Adapter for resource jdbc/ia/persistence. The exception is: com.ibm.db2.jcc.am.ClientRerouteException: [jcc][t4][2027][11212][4.13.127] A connection failed but has been re-established. The host name or IP address is "cistest18.hursley.ibm.com" and the service name or port number is 50,000.
Special registers may or may not be re-attempted (Reason code = 1). ERRORCODE=-4498, SQLSTATE=08506
[ERROR   ] CWMBE3613E: SQLException encountered when building "UPSERT" Statement for persistence handler "com.ibm.ia.persistence.handler.db2.impl.DB2JobHistoryHandler". Exception Message: [jcc][t4][2027][11212][4.13.127] A connection failed but has been re-established. The host name or IP address is "cistest18.hursley.ibm.com" and the service name or port number is 50,000.
[ERROR   ] CWMBE3611E: The configured database has issued an exception with an error code: "-4,498" and a SQLState: "08506"

For Decision Server Insights servers that run on Liberty versions 8.5.5.7 or later, you can use the alternate server and alternate port parameters in the server.xml data source definition to allow Decision Server Insights to automatically connect to the database in the secondary data center without modifying the server.xml file. Of course, you can also manually alter the database connection information in the data source definition, regardless of Liberty version.

Reloading events into memory

Because the events are in the database, you need to reload them into memory. See the following example:

[root@cistest04 bin]# ./dataLoadManager load --propertiesFile=/opt/IBM/test04.prop
CWMBD9712W: Hostname verification is disabled by the "disableSSLHostnameVerification" connection property. The client will not check the hostname specified in the server certificate.
CWMBD9659I: Persisted data is being restored to the system. The total number of batches of data is 35.

Conclusion

You learned how to use the Decision Server Insights backing database with the DB2 HADR feature to implement a successful cross-site replication and recovery strategy, protecting your system from loss of the entire data center that hosts the Decision Server Insights infrastructure.

High availability is achieved in Decision Server Insights by having multiple instances of the server components.

If you are using Oracle databases, you can implement the same replication and recovery strategy by using the Oracle Data Guard feature. See Introduction to Oracle Data Guard in the Oracle documentation.

Using what you learned in this tutorial, you can design topologies to successfully provide disaster recovery in your environments.

Acknowledgements

The authors would like to thank Richard Jacks for his reviews, contributions, and comments.


Downloadable resources


Related topics


Comments

Sign in or register to add and subscribe to comments.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Middleware
ArticleID=1022615
ArticleTitle=Cross-site replication and recovery with IBM ODM Decision Server Insights
publish-date=12092015