Data Center Automation
The Pulse 2012 workshops, including VMware images (IBMers only), are available:
F05 DB2 High Availability and Disaster Recovery
In this lab exercise, you learn how the IBM DB2 High Availability and Disaster Recovery feature works with automatic failover, using IBM Tivoli System Automation for Multiplatforms to make DB2 highly available.
Three virtualized Linux hosts are preinstalled with the appropriate software packages (a DB2 production installation without an instance, created via db2_install).
However, no other configuration has been applied to give you a realistic setup experience from start to finish.
F06 SAP High Availability
In this lab exercise, you learn how to make SAP highly available. You use the updated SAP wizard of Tivoli System Automation for Multiplatforms 3.2.2 to define automation in your own cluster. You learn to start and stop an entire SAP cluster with a single click. You perform a critical outage scenario of the SAP central instance and watch Tivoli System Automation for Multiplatforms recover your SAP application.
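Under the covers, those single-click start and stop operations map onto ordinary TSA MP resource group commands. A minimal sketch, assuming a hypothetical resource group name sap-rg (your wizard-generated names will differ):

```shell
# Bring the entire SAP resource group online; TSA MP starts all
# member resources in the correct dependency order:
chrg -o Online sap-rg

# Take the whole group offline again with one command:
chrg -o Offline sap-rg

# Watch TSA MP drive the resources toward their nominal state:
lssam
```
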
Note: IBMers can download the VMs for this workshop at SPLL
Martin Reitz
Tivoli System Automation Application Manager 3.2.2 has released its first pre-canned end-to-end policy. This policy is called "End-to-end Disaster Recovery with DB2 HADR" and is available in the ISM Connect library:
It is a disaster recovery (DR) solution with System Automation Application Manager as the single point of control for operational and disaster recovery actions. Application data is replicated in software with IBM DB2 high availability disaster recovery (HADR) across data centers. Because this generic solution is policy-based, it can be customized to your application landscape.
You should check this out if you are looking for a DR solution for an application that stores its data in a DB2 data store. You can even combine this solution with high availability clustering within one site, or extend a System Automation for Multiplatforms high availability setup with these DR capabilities.
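At the DB2 level, the HADR replication that such a policy automates is driven by a handful of commands. A simplified sketch, assuming a hypothetical database named SAMPLE; see the DB2 documentation for the full prerequisites (backup/restore of the database to the standby, HADR configuration parameters):

```shell
# On the primary site: enable HADR for the database
db2 "START HADR ON DATABASE SAMPLE AS PRIMARY"

# On the standby site, after restoring a backup of the database:
db2 "START HADR ON DATABASE SAMPLE AS STANDBY"

# Planned role switch, e.g. for site maintenance:
db2 "TAKEOVER HADR ON DATABASE SAMPLE"

# Unplanned disaster recovery takeover when the primary site is down:
db2 "TAKEOVER HADR ON DATABASE SAMPLE BY FORCE"
```
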
"Consolidation drives value." "Manage less and do more."
These initiatives, and many others like them, make sense from a financial standpoint, but can often lead to sleepless nights for operations managers once they realize the majority of their business, sometimes as much as 85%, is now run out of a single datacenter or on a single platform. What happens when the system goes down? What happens when my datacenter loses connectivity? How quickly can I recover from an outage? Do I even need to recover, or can I just roll over to an active or passive backup?
Many mission-critical core business applications, from larger vendors such as SAP and their ERP solutions, are run in datacenters much like this. However, through the high availability and disaster recovery (HADR) capabilities provided by IBM, these datacenters can establish a failover policy to prevent even the slightest interruption to your company’s centralized business processes and financial systems.
IBM’s core HADR capabilities are provided by the Tivoli System Automation family of products, of which we’ll specifically touch on System Automation for Multiplatforms here. Tivoli System Automation for Multiplatforms (SA MP) is a high availability clustering solution with advanced automation capabilities. It includes out-of-the-box resiliency policies for many IBM products, delivering mission-critical capabilities to our customers. SA MP is the default, built-in HADR solution for IBM DB2 for Linux, UNIX, and Windows, available at no extra charge to DB2 LUW customers. The expansive capability of System Automation to provide a single point of view into your HADR capabilities and manage them, whether your failover datacenters are across the street or across the continent, provides immense value to our customers in managing their heterogeneous environments.
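For DB2 LUW, that built-in integration is exposed through the db2haicu utility (DB2 High Availability Instance Configuration Utility), which creates the SA MP domain and resource model for an HADR pair on your behalf. A sketch of how it is typically invoked; the XML input file name below is a hypothetical placeholder:

```shell
# Interactive mode: db2haicu walks you through cluster, network,
# and HADR database configuration, then builds the SA MP resources:
db2haicu

# Non-interactive mode, driven by a prepared XML input file
# (file name is an assumption for illustration):
db2haicu -f db2ha_sample.xml

# Afterwards, inspect the generated resource model:
lssam
```
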
One such customer leveraging these out-of-the-box policies from SA MP for DB2, and by extension SAP, is China Ocean Shipping (Group) Company. For a detailed overview of the entire solution implemented by COSCO, including IBM POWER hardware capabilities, please follow the link below. In this post, I’ll briefly touch on how COSCO was able to leverage HADR capabilities to prevent major damage to the business during multiple datacenter outages. COSCO consolidated much of their business operations onto a single SAP ERP solution, but required the highest levels of service availability from this single system. By putting their SAP solution on top of DB2 and leveraging the HADR capabilities provided by SA MP, this customer was able to deploy a single SAP ERP solution across multiple datacenters worldwide, offering near real-time replication. These HADR capabilities and benefits were fully realized multiple times when the customer’s main datacenters experienced prolonged outages, including during the historic 2011 Tōhoku earthquake. They were able to seamlessly switch over from their datacenter in Tokyo to the off-site datacenter in Beijing, with virtually no service interruption.
Now the actual definition of “disaster recovery” may not always come to mind when you are planning your HADR strategies, but Tivoli System Automation, along with the many other IBM products it is embedded in, is available to minimize your key HADR metrics: recovery point objective (RPO) and recovery time objective (RTO). Many additional policies are available for other IBM and third-party products to monitor and ensure the availability of your business services. For more information on how you can be completely confident in your business’s HADR solution, check out the links below for a deeper dive into Tivoli System Automation for Multiplatforms.
For more information:
"The Majority Rules !"
In a nutshell, you have Quorum if you have the majority.
The main goals of quorum operations are to ensure that only one sub-cluster continues to perform automation actions after a cluster split, and to protect critical resources from running on more than one system at a time (the dreaded "split brain" situation).
If critical resources are online on systems that lose Quorum, then those systems will be rebooted, so that the resources can be brought online on the surviving sub-cluster without the risk of running in two places at once.
What happens if you have an even number of nodes in the cluster and an even split? For example, take the most popular cluster configuration: a two node cluster. If the two nodes lose connectivity with each other, you essentially have two single-node sub-clusters, neither of which would have a majority (more than half), so quorum is not possible for either. If you don't have quorum, no automation is possible. This is when a TieBreaker would be needed!
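The majority rule itself is simple integer arithmetic: a sub-cluster has quorum only if it holds more than half the nodes, i.e. at least floor(n/2) + 1 of them. A quick illustration:

```shell
# Print the majority threshold for a few cluster sizes.
# Note the even-sized clusters: after a 50/50 split, NEITHER side
# reaches the threshold, which is why a TieBreaker is needed.
for n in 2 3 4 5 6; do
  echo "cluster of $n nodes: quorum requires $(( n / 2 + 1 ))"
done
```
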
A TieBreaker is a mechanism (e.g. disk, network, or operator) that is used to decide which sub-cluster gets Quorum (gets control) ... check out my next blog in this series, called "TSA Blog Series: High Availability Concepts - Do I need a TieBreaker?"
Gareth Holl
A Network TieBreaker is a popular configuration option for a Tivoli System Automation for Multiplatforms (TSA MP) managed environment.
... in a nutshell, a group of nodes is considered to have quorum if it represents more than half the nodes in the cluster.
A TieBreaker is needed to decide who takes control in a situation where it's not possible to decide based on the number of operational nodes in a sub-cluster; in other words, when you have an even number of nodes in a cluster and a cluster split that results in half the nodes in each sub-cluster. The most obvious example is a 2 node cluster ... if the two nodes cannot talk to each other, a TieBreaker is needed to decide who should take control ... who should proceed with the necessary automation actions to keep resources highly available.
For the sake of this explanation, we're keeping things simple by only talking about a "Network" TieBreaker (there are other types like "disk" and "nfs"). We would specify a pingable system in the network that is independent of the clustered nodes, for example the gateway router used by the clustered nodes. Actually, it is considered a best practice to use the default gateway router as the Network TieBreaker device, also known as a "Quorum Device".
Consider a two node cluster as follows:
"node1" and "node2" are our clustered nodes, each configured with 10.20.30.1 as its default gateway for basic TCP/IP communications.
Now consider a node failure scenario. In my example, "node1" suffers a power failure.
"node2" can no longer ping "node1" (no response to heartbeats).
"node2" is only 1 node out of a 2 node domain which is not considered a majority (more than half), so it uses the defined Network TieBreaker we setup when we first deployed the cluster, the gateway router.
"node2" successfully pings 10.20.30.1 and therefore regains quorum. If the resources were not already running on "node2", the TSAMP product would then perform the necessary automation actions to bring the resources online on "node2" in order to keep them highly available.
Now consider a network adapter failure scenario. First, let's assume the power to "node1" was restored and both nodes are communicating (heartbeating) again. At some point there is a break in the network connectivity that isolates "node1" from the rest of the network.
"node2" can no longer heartbeat/ping "node1".
"node1" can no longer heartbeat/ping "node2".
In this case, both nodes lose quorum and attempt to ping the Network TieBreaker device, again the gateway router in this example.
"node1" cannot reach the default gateway because of whatever problem caused it to be isolated from the network in the first place.
"node2" is able to ping the gateway, our Network TieBreaker, so it regains quorum and hosts the resources in TSAMP's effort to keep resources highly available.
If "node1" had been hosting the online resources, it would have been forced to reboot at this point, to ensure the resources can be brought online on a surviving node without fear that they would be running concurrently on more than one server.
That's how a "Network TieBreaker" works. Here's the assumption: if "node1" can communicate (ping) with the default gateway and "node2" can communicate (ping) with the default gateway, then "node1" must be able to communicate (heartbeat) with "node2". If for some strange reason you have a network that would allow each node to ping a common gateway/device, but not each other, then a "Network" style TieBreaker is not for you.
Ever tried to bring a resource offline only for it to result in a state of "Stuck online"?
Your first sign of a "Stuck online" situation will likely be from the output of the 'lssam' command.
A "Stuck online" situation is rarely the fault of the automation software (TSA MP). Think of a situation where you apply the brakes in your car while driving along an icy road. Although you are hard on the brakes, you just keep sliding. Do you blame the brakes or do you blame the icy road. The reality is, there is nothing wrong with the braking system, it is the road on which you are traveling. Its the same for the TSAMP product ... it has issued the stop order ... it has executed the stop script ... the brakes have been applied !
So what are the likely causes of a resource becoming "Stuck online"? Consider the following:
- The stop script ran, but did not actually terminate all of the application's processes.
- The application is hanging during shutdown and never reaches a stopped state.
- The monitor script incorrectly continues to report the resource as online.
Some of you may have spotted the flaw in the car braking analogy ... the car will eventually stop, unfortunately as a result of hitting some object like a pole or another car. But hopefully you get my point: the brakes were not the problem, just like TSA MP is not the problem in a "Stuck online" situation.
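Once the underlying cause is fixed (the "icy road" is cleared), you can tell TSA MP to forget the stuck operational state using the standard RSCT reset command. A sketch; the resource name is a hypothetical placeholder:

```shell
# After manually confirming the application really is stopped,
# reset the stuck resource so automation can resume:
resetrsrc -s 'Name == "myapp-rs"' IBM.Application

# Then confirm the resource states have settled:
lssam
```
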