Cross-site replication and recovery with IBM ODM Decision Server Insights
What you need for high availability and disaster recovery
With the introduction of Decision Server Insights, the IBM ODM product family is extended to address business scenarios that are becoming more sophisticated. Tracking events relevant to particular business entities, and associating new events in the context of previous events, allows decision making based on correlations and enables much more powerful automation capabilities.
The main differentiator of Decision Server Insights is its ability to detect business situations by analyzing a combination of current events, historical events, predictive analytics scores, and geospatial context, and then to decide when to act in a targeted manner. This analysis helps businesses improve their results, detect potential risk and fraud, and improve operations and alerting.
Of course, this new capability requires persisting the business state that supports the decisions into the Decision Server Insights infrastructure. To deal with the realities of enterprise class systems software (which must provide stable and reliable service even in an environment where hardware and software failures might occur) administrative teams responsible for these software systems must ensure that there are infrastructure and operational procedures in place to support the needs of the business.
This tutorial shares valuable information from the Decision Server Insights development and test teams about the architectural principles that make enterprise-class quality of service possible – and the implementation details necessary to make it a reality.
Background: high availability and disaster recovery
Failure events in software systems are typically divided into two categories. High availability events compromise the function of only one element of the larger system, whether it is a physical machine or an operating system process. By contrast, disaster recovery events compromise many elements in the system (or the entire data center) simultaneously. The tutorial "Install and configure a Decision Server Insights reference topology," published earlier this year, explained the reference topology for Decision Server Insights, which uses a network of IBM WebSphere eXtreme Scale servers that work together in a grid to provide the redundancy that is needed for high availability. (See the Resources section.) If one member of the grid fails, the remaining members can seamlessly continue processing.
However, when Decision Server Insights architects prepare for a disaster, they must prepare for a situation in which the entire WebSphere eXtreme Scale grid is lost and seamless recovery is impossible. Instead, the data that defines the state of business entities must be replicated to a site far enough away to remain unaffected by the disaster, and processing must be recovered from this replica. Decision Server Insights also uses features in IBM WebSphere Liberty for high availability and disaster recovery.
To support this activity, Decision Server Insights allows configuration of a special backing database where you can persist the contents of the WebSphere eXtreme Scale grid. Then, you can use database management tools (such as the DB2 HADR feature and Oracle Data Guard) to replicate the database contents to a remote site. In this way, the Decision Server Insights topology consists of a dual database setup: one local database acts as the primary database, and another database serves as a remote standby database. This tutorial looks at database failover and switchover capabilities that use the IBM DB2 High Availability Disaster Recovery (HADR) feature.
You can configure the DB2 HADR feature to recover from failure of the primary database either automatically or manually. For disaster recovery, manual failover is recommended, because many components beyond just the database (such as the remainder of the Decision Server Insights servers and connectivity components) must be recovered before processing is restored to normal. See the Resources section for more information about configuration and management of the DB2 HADR capabilities.
Background: Decision Server Insights components
There are five major servers in a Decision Server Insights topology:
- Runtime server: This server is the main component in Decision Server Insights, where events are processed and where entities are stored. All the runtime servers are part of a WebSphere eXtreme Scale cluster. The event queues and entity storage are partitioned across this cluster.
- Input connectivity server: Input connectivity Liberty servers receive events from the enterprise systems that generate them. Events arrive over HTTP or JMS and are forwarded to the runtime server to be processed.
- Output connectivity server: Output connectivity Liberty servers receive emitted events from the runtime server, and send them through HTTP or JMS.
- Catalog server: Catalog servers are a WebSphere eXtreme Scale component that tracks the placement of partitions among the WebSphere eXtreme Scale servers in the runtime cluster. Good high availability design dictates that catalog servers should not be installed on the same node as a runtime server.
- Backing database server: The backing database is used to store events offline, when they are not needed in memory. It is used to recover when the whole cluster is restarted after it is shut down. The backing database is a prerequisite for Decision Server Insights, not strictly part of Decision Server Insights. Decision Server Insights supports DB2 and Oracle.
In Figure 1, you can see an example topology. A broker (possibly IBM Integration Bus) routes inbound events to the primary site. If there is a failure of the primary site, then the standby site is started (it might be started already), and the events are loaded into memory from the standby database. New requests are then routed by the broker to the standby site's inbound connectivity.
Figure 1. Example topology of Decision Server Insights highly available configuration with disaster recovery databases
To provide for high availability, you need at least two catalog servers in each of the primary and standby sites. This approach ensures that the catalog server is not a single point of failure at either site. The infrastructure that is depicted in Figure 1 also protects against network partition (split-brain scenarios) by ensuring that both catalog servers are located on the same subnet. Of course, the primary and secondary sites themselves are placed on independent network infrastructure. Network partition is avoided by ensuring that cross-site failover is managed manually rather than automatically, with procedures in place to ensure that the servers in only one of the sites are active at any given time.
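The placement rules above (at least two catalog servers per site, and no catalog server on the same node as a runtime server) can be checked mechanically against a server inventory. The following is a minimal sketch only: the inventory format, the host names, and the `check_topology` function are assumptions made for illustration and are not part of Decision Server Insights.

```python
from collections import defaultdict

# Hypothetical inventory of (site, host, role); the host names are made up.
SERVERS = [
    ("primary", "p-cat-1", "catalog"),
    ("primary", "p-cat-2", "catalog"),
    ("primary", "p-run-1", "runtime"),
    ("primary", "p-run-2", "runtime"),
    ("standby", "s-cat-1", "catalog"),
    ("standby", "s-cat-2", "catalog"),
    ("standby", "s-run-1", "runtime"),
    ("standby", "s-run-2", "runtime"),
]


def check_topology(servers, min_catalogs=2):
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    catalogs_per_site = defaultdict(int)
    roles_per_host = defaultdict(set)
    for site, host, role in servers:
        roles_per_host[(site, host)].add(role)
        if role == "catalog":
            catalogs_per_site[site] += 1

    # Rule 1: the catalog service must not be a single point of failure.
    for site in sorted({site for site, _, _ in servers}):
        if catalogs_per_site[site] < min_catalogs:
            problems.append(f"{site}: fewer than {min_catalogs} catalog servers")

    # Rule 2: a catalog server should not share a node with a runtime server.
    for (site, host), roles in sorted(roles_per_host.items()):
        if {"catalog", "runtime"} <= roles:
            problems.append(f"{site}/{host}: catalog and runtime server share a node")
    return problems


print(check_topology(SERVERS))
```

A check like this covers only static placement. The rule that only one site is active at any given time is an operational procedure and is not modeled here.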
Background: Key Decision Server Insights configuration parameters
There are several Decision Server Insights parameters that must be configured for optimal replication and recovery. This section examines the purpose and explains the recommended values for several commonly used parameters. You can alter each of the parameters by changing the setting value in the objectgriddeployment.xml file that is located in the grids directory of your created server.
Decision Server Insights partitions entity and event storage. The partitions are distributed across the cluster. The Decision Server Insights gateway routes each event to the partition where the associated entity is stored. In this way, the workload is balanced across the cluster.
Configuring the number of partitions in a cluster is important for efficient operation. If there are too few partitions per node, there won't be enough work to keep the CPUs busy. Since changing the number of partitions requires shutting down the entire cluster, you will want to configure extra partitions, if you think that you will be scaling out the cluster later. But don't set the number of partitions higher than you think you'll need, because there is some increased processor usage to manage them. See the numberOfPartitions setting for
By default, Decision Server Insights keeps one synchronous and one asynchronous replica of each partition in memory. The replicas are always kept on different nodes in case one node is lost. This placement is why running multiple servers per node does not achieve high availability.
The synchronous replicas provide high availability. When a server fails, several things happen:
- The catalog servers immediately promote the synchronous replicas of the lost primaries to become the new primaries.
- The asynchronous replicas are promoted to become the synchronous replicas. (If there is no asynchronous replica, all of the data in the partition must be copied to create the synchronous replica.) Work on each partition is suspended until the synchronous replicas are in place.
- The asynchronous replicas are re-created as time allows.
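The three recovery steps can be traced with a toy model. This is a sketch for illustration only: the `Partition` class and `fail_node` function are invented here, the placement rule (first live node that holds no other shard of the partition) is an assumption, and none of it reflects how WebSphere eXtreme Scale actually places shards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Partition:
    primary: Optional[str]
    sync: Optional[str]
    asyn: Optional[str]  # "async" is a reserved word in Python


def fail_node(partitions, dead, nodes):
    """Apply the recovery steps to every partition after node `dead` is lost."""
    live = [n for n in nodes if n != dead]
    for p in partitions:
        # Forget any replica that lived on the failed node.
        if p.asyn == dead:
            p.asyn = None
        if p.sync == dead:
            p.sync = None
        if p.primary == dead:
            # Step 1: the synchronous replica takes over as primary.
            p.primary, p.sync = p.sync, None
        if p.sync is None and p.asyn is not None:
            # Step 2: the asynchronous replica becomes the synchronous replica.
            p.sync, p.asyn = p.asyn, None
        # Step 3: re-create missing replicas on a live node that holds
        # no other shard of this partition (None if no such node exists).
        for role in ("sync", "asyn"):
            if getattr(p, role) is None:
                used = {p.primary, p.sync, p.asyn}
                setattr(p, role, next((n for n in live if n not in used), None))
    return partitions


nodes = ["n1", "n2", "n3", "n4"]
partitions = [Partition("n1", "n2", "n3"), Partition("n2", "n3", "n1")]
for p in fail_node(partitions, "n1", nodes):
    print(p)
```

In this model, a three-node cluster that loses one node ends up with no asynchronous replica, because only two nodes remain to hold the primary and the synchronous replica.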
Table 1 shows the recommended settings for the main objectgriddeployment.xml file, which is in the grids directory of your created server.
Table 1. Configuration options for the objectgriddeployment.xml
|Asynchronous partition replicas allow the system to recover quickly in an HA/CA situation. This value should normally be set to 1. The number of asynchronous replicas is automatically reduced for small clusters.|
|Synchronous partition replicas allow the system to handle high availability and continuous availability without losing data or disrupting business. The|
|This option is the number of servers that are required for the system to start. A good value to use is|
|Decision Server Insights runs most efficiently with about 20 partitions per node. A good value for|
|Asynchronous write-behind is recommended for performance. Because of the in-memory redundancy, data integrity is not an issue with|
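The guideline of about 20 partitions per node, together with the advice to plan for later scale-out because the partition count cannot be changed without a full cluster shutdown, reduces to simple arithmetic. The sketch below is only an illustration: the function names and the node counts (four nodes today, six planned) are hypothetical.

```python
def suggest_partitions(planned_nodes: int, per_node: int = 20) -> int:
    """Partition count sized for the largest cluster you expect to run."""
    if planned_nodes < 1:
        raise ValueError("planned_nodes must be at least 1")
    return planned_nodes * per_node


def partitions_per_node(number_of_partitions: int, nodes: int) -> float:
    """Average number of primary partitions that each node hosts."""
    return number_of_partitions / nodes


current_nodes, future_nodes = 4, 6  # hypothetical cluster sizes
total = suggest_partitions(future_nodes)
print("numberOfPartitions:", total)
print("per node today:", partitions_per_node(total, current_nodes))
print("per node after scale-out:", partitions_per_node(total, future_nodes))
```

Sizing for the planned cluster means the current cluster runs somewhat above the guideline until the extra nodes are added, which is the trade-off the text describes.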
The following example shows a Decision Server Insights setup in the reference topology. Primary and standby databases are on separate physical servers.
Configure Decision Server Insights in a reference topology with a primary and standby database that you can configure with the DB2 high availability disaster recovery (HADR) feature. There are two catalogs, three containers, and two databases, as shown in Table 2.
Table 2. Reference topology configuration settings
|Primary||Insights Catalog 1|
|Primary||Insights Catalog 2|
|Primary||Insights Container 1|
|Primary||Insights Container 2|
|Primary||Insights Container 3|
|Primary||Insights Container 4|
|Primary||Insights Inbound connectivity|
|Primary||Insights Outbound connectivity|
|Standby||Insights Catalog 1|
|Standby||Insights Catalog 2|
|Standby||Insights Container 1|
|Standby||Insights Container 2|
|Standby||Insights Container 3|
|Standby||Insights Container 4|
|Standby||Insights Inbound connectivity|
|Standby||Insights Outbound connectivity|
This section shows how to create the primary and standby databases.
- First, transfer the DB2Distrib.sql database creation script from the Decision Server Insights installation location (which should be under the runtime directory: for example, ~/runtime/ia/persistence/sql/DB2) to the system where the primary database is located. In this example, the DB2Distrib.sql file is put in /opt on the primary database system.
- On the primary database (which is named cistest17 with the IP address 18.104.22.168 in this example) create a database named insights and create the tables and other components that Decision Server Insights requires:
db2 create database insights
db2 connect to insights
db2 -f /opt/DB2Distrib.sql
- Configure the following database settings:
db2 update db cfg for insights using LOGARCHMETH1 LOGRETAIN
db2 update db cfg for insights using HADR_LOCAL_HOST 22.214.171.124
db2 update db cfg for insights using HADR_LOCAL_SVC 55001
db2 update db cfg for insights using HADR_REMOTE_HOST 126.96.36.199
db2 update db cfg for insights using HADR_REMOTE_SVC 55002
db2 update db cfg for insights using HADR_REMOTE_INST db2inst1
db2 update db cfg for insights using LOGINDEXBUILD ON
- Then, back up the insights database:
db2 "backup database insights"
- On the standby database (which in this example is named cistest18 with an IP address of 188.8.131.52) you need to copy the backup file that you created from the primary database to the standby system. Name it something like
- Next, use the following command to restore the data into the standby database:
db2 "restore database insights"
- Now run the following commands to configure the standby database to use the DB2 HADR feature:
db2 update db cfg for insights using HADR_LOCAL_HOST 184.108.40.206
db2 update db cfg for insights using HADR_LOCAL_SVC 55002
db2 update db cfg for insights using HADR_REMOTE_HOST 220.127.116.11
db2 update db cfg for insights using HADR_REMOTE_SVC 55001
db2 update db cfg for insights using HADR_REMOTE_INST db2inst1
- Now you should be able to start the standby database with the following command:
db2 start hadr on database insights as standby
- Then, you can start the HADR feature on the primary database with the following command:
db2 start hadr on database insights as primary
- Use the db2pd -db insights -hadr command on each system to verify that the primary database started correctly. Look for the
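The db2pd check in the last step can be scripted once its output is captured to text. The sketch below is an assumption-laden illustration: the sample text is hand-written rather than captured from a real system, the field names (HADR_ROLE, HADR_STATE, HADR_CONNECT_STATUS) follow the "NAME = value" layout that recent DB2 releases print and might differ on older releases, and treating PEER as the healthy state assumes a synchronization mode in which the pair reaches peer state.

```python
# Hand-written, abbreviated sample in the "NAME = value" layout (not real output).
SAMPLE = """\
                            HADR_ROLE = PRIMARY
                        HADR_SYNCMODE = NEARSYNC
                           HADR_STATE = PEER
                  HADR_CONNECT_STATUS = CONNECTED
"""


def parse_hadr(text):
    """Collect 'NAME = value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def hadr_healthy(fields, expected_role):
    """True when the role matches and the pair is connected and in peer state."""
    return (
        fields.get("HADR_ROLE") == expected_role
        and fields.get("HADR_STATE") == "PEER"
        and fields.get("HADR_CONNECT_STATUS") == "CONNECTED"
    )


print(hadr_healthy(parse_hadr(SAMPLE), "PRIMARY"))
```

Run the same check with "STANDBY" as the expected role against the output captured on the standby system.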
Even though the DB2 HADR feature can be used for high availability of the database in some systems, it is not necessary for Decision Server Insights because the data grid provides redundancy. (The backing database plays no role in Decision Server Insights during normal run time. It is only used as a tool for disaster recovery of the entire system.) For more information about the DB2 HADR feature, see the Resources section.
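The HADR settings for the two databases in the steps above mirror each other: the local host and service port of one side are the remote host and service port of the other. Generating both command lists from one helper makes that symmetry hard to get wrong. This is a sketch under assumptions: `hadr_commands` is a hypothetical helper, and it uses the host names cistest17 and cistest18 from this example in place of IP addresses.

```python
def hadr_commands(db, local_host, local_port, remote_host, remote_port, remote_inst):
    """Build the 'db2 update db cfg' commands for one side of an HADR pair."""
    settings = [
        ("HADR_LOCAL_HOST", local_host),
        ("HADR_LOCAL_SVC", local_port),
        ("HADR_REMOTE_HOST", remote_host),
        ("HADR_REMOTE_SVC", remote_port),
        ("HADR_REMOTE_INST", remote_inst),
    ]
    return [f"db2 update db cfg for {db} using {name} {value}" for name, value in settings]


# The standby list is the primary list with local and remote swapped.
primary = hadr_commands("insights", "cistest17", 55001, "cistest18", 55002, "db2inst1")
standby = hadr_commands("insights", "cistest18", 55002, "cistest17", 55001, "db2inst1")

print("# run on the primary database system")
print("\n".join(primary))
print("# run on the standby database system")
print("\n".join(standby))
```

The helper only prints the commands. Review them and run them yourself on each system, as in the manual steps.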
Configuring the Decision Server Insights data source definition for disaster recovery
To configure Decision Server Insights database persistence, see the example configurations in the
<dataSource jndiName="jdbc/ia/persistence">
  <connectionManager/>
  <jdbcDriver>
    <library>
      <fileset dir="/opt/IBM/" includes="db2jcc4.jar, db2jcc_license_cu.jar"/>
    </library>
  </jdbcDriver>
  <properties.db2.jcc currentSchema="DB2INST1" databaseName="insights"
      password="cistestdb" portNumber="50000" serverName="cistest17" user="db2inst1"
      alternateGroupDatabaseName="insights"
      clientRerouteAlternateServerName="cistest18"
      clientRerouteAlternatePortNumber="50000"
      maxRetriesForClientReroute="2"
      retryIntervalForClientReroute="15"
      enableSeamlessFailover="1"/>
</dataSource>
<ia_persistence datasourceName="jdbc/ia/persistence" maxBatchSize="10000" maxCacheAge="1000"/>
Note the following key configuration values in the example:
- alternateGroupDatabaseName, which points to the standby database.
- clientRerouteAlternateServerName, which points to the standby database.
- clientRerouteAlternatePortNumber, which specifies the standby database port number.
- maxRetriesForClientReroute, which specifies the number of times that Decision Server Insights attempts to connect to the primary database before it moves on to try to connect to the standby database.
- retryIntervalForClientReroute, which specifies the time that passes between connection attempts.
- enableSeamlessFailover, which causes a SQLException to be passed to the Decision Server Insights system when a database failover event occurs (necessary for Decision Server Insights recovery implementation).
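Because a misspelled or missing reroute attribute only shows up during a failover, it can be worth checking the data source definition ahead of time. The following sketch parses a trimmed copy of the example with Python's standard XML parser. The `<server>` wrapper, the `reroute_problems` function, and the list of attributes treated as required are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example data source, wrapped in a <server> root element.
SNIPPET = """
<server>
  <dataSource jndiName="jdbc/ia/persistence">
    <properties.db2.jcc databaseName="insights"
        serverName="cistest17" portNumber="50000"
        clientRerouteAlternateServerName="cistest18"
        clientRerouteAlternatePortNumber="50000"
        maxRetriesForClientReroute="2"
        retryIntervalForClientReroute="15"/>
  </dataSource>
</server>
"""

REQUIRED = (
    "clientRerouteAlternateServerName",
    "clientRerouteAlternatePortNumber",
    "maxRetriesForClientReroute",
    "retryIntervalForClientReroute",
)


def reroute_problems(xml_text, jndi_name="jdbc/ia/persistence"):
    """Return a list of problems with the client reroute attributes."""
    root = ET.fromstring(xml_text)
    for ds in root.iter("dataSource"):
        if ds.get("jndiName") != jndi_name:
            continue
        props = next((c for c in ds if c.tag == "properties.db2.jcc"), None)
        if props is None:
            return ["no properties.db2.jcc element"]
        problems = [f"missing {name}" for name in REQUIRED if props.get(name) is None]
        alternate = props.get("clientRerouteAlternateServerName")
        if alternate is not None and alternate == props.get("serverName"):
            problems.append("alternate server is the same as the primary server")
        return problems
    return [f"data source {jndi_name} not found"]


print(reroute_problems(SNIPPET))
```

To check a real configuration, pass the contents of the server.xml file instead of the inline snippet.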
In the objectGridDeployment.xml file, consider the following settings:
... <mapSet name="iaMaps" numberOfPartitions="127" numInitialContainers="3" minSyncReplicas="0" maxSyncReplicas="1" maxAsyncReplicas="1" developmentMode="false"> ...
For Decision Server Insights servers that run on Liberty versions earlier than 18.104.22.168, don't use the alternate server and alternate port parameters in the server.xml data source definition, because those Liberty versions ignore the parameters. Instead, manually change the server.xml file on the secondary Decision Server Insights system so that it no longer points to the primary database: change the definition of the primary data source so that when Decision Server Insights in the secondary data center starts, it properly finds the database in its own data center.
If you run Decision Server Insights on a Liberty version earlier than 22.214.171.124 and do not alter the database connection information, you might see the -4498 error code in the server logs, like the following example:
[AUDIT ] J2CA0056I: The Connection Manager received a fatal connection error from the Resource Adapter for resource jdbc/ia/persistence. The exception is: com.ibm.db2.jcc.am.ClientRerouteException: [jcc][t4][4.13.127] A connection failed but has been re-established. The host name or IP address is "cistest18.hursley.ibm.com" and the service name or port number is 50,000. Special registers may or may not be re-attempted (Reason code = 1). ERRORCODE=-4498, SQLSTATE=08506
[ERROR ] CWMBE3613E: SQLException encountered when building "UPSERT" Statement for persistence handler "com.ibm.ia.persistence.handler.db2.impl.DB2JobHistoryHandler". Exception Message: [jcc][t4][4.13.127] A connection failed but has been re-established. The host name or IP address is "cistest18.hursley.ibm.com" and the service name or port number is 50,000.
[ERROR ] CWMBE3611E: The configured database has issued an exception with an error code: "-4,498" and a SQLState: "08506"
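To find out whether a server hit this condition, you can scan its log for the error code. The sketch below is a minimal example that matches only the ERRORCODE=..., SQLSTATE=... form shown in the first message above. The `reroute_events` function is hypothetical, and the sample lines are abbreviated from the log excerpt.

```python
import re

# Matches the "ERRORCODE=-4498, SQLSTATE=08506" form used by the JDBC driver.
ERROR_PATTERN = re.compile(r"ERRORCODE=(-?\d+), SQLSTATE=(\w+)")


def reroute_events(log_lines, error_code="-4498"):
    """Return (error code, SQLSTATE) pairs for lines that report a client reroute."""
    hits = []
    for line in log_lines:
        match = ERROR_PATTERN.search(line)
        if match and match.group(1) == error_code:
            hits.append((match.group(1), match.group(2)))
    return hits


# Abbreviated lines taken from the log excerpt above.
sample = [
    "[AUDIT ] J2CA0056I: ... (Reason code = 1). ERRORCODE=-4498, SQLSTATE=08506",
    '[ERROR ] CWMBE3613E: SQLException encountered when building "UPSERT" Statement ...',
]
print(reroute_events(sample))
```

The CWMBE3611E message formats the code as "-4,498", so this pattern does not match it. Extend the pattern if you also want to count those lines.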
For Decision Server Insights servers that run on Liberty versions 126.96.36.199 or later, you can use the alternate server and alternate port parameters in the server.xml data source definition to allow Decision Server Insights to automatically connect to the database in the secondary data center without modifying the server.xml file. Of course, you can also manually alter the database connection information in the data source definition, regardless of Liberty version.
Reloading events into memory
Because the events are in the database, you need to reload them into memory. See the following example:
[root@cistest04 bin]# ./dataLoadManager load --propertiesFile=/opt/IBM/test04.prop
CWMBD9712W: Hostname verification is disabled by the "disableSSLHostnameVerification" connection property. The client will not check the hostname specified in the server certificate.
CWMBD9659I: Persisted data is being restored to the system. The total number of batches of data is 35.
You learned how to use the Decision Server Insights backing database with the DB2 HADR feature to implement a successful cross-site replication and recovery strategy, protecting your system from loss of the entire data center that hosts the Decision Server Insights infrastructure.
High availability is achieved in Decision Server Insights by having multiple instances of the server components.
If you are using Oracle databases, you can implement the same replication and recovery strategy by using the Oracle Data Guard feature. See Introduction to Oracle Data Guard in the Oracle documentation.
Using what you learned in this tutorial, you can design topologies to successfully provide disaster recovery in your environments.
The authors would like to thank Richard Jacks for his reviews, contributions, and comments.
Resources
- DB2 system topology and configuration for automated multi-site HA and DR PDF document
- HADR Configuration and Tuning wiki page on the developerWorks DB2HADR community
- Install and configure a Decision Server Insights reference topology
- IBM Operational Decision Manager product page
- IBM Operational Decision Manager documentation on IBM Knowledge Center
- IBM Operational Decision Manager Developer Center