This White Paper describes the SAP high availability solution with Tivoli System Automation (TSA) for Multiplatforms. Setup, configuration and tests are discussed.
SAP high availability with Tivoli System Automation for Multiplatforms - New WhitePaper released 2016
KonstantinKonson 0600028RFN Tags:  tivoli sap automation ha tsa solution availability high system 3,779 Views
A new paper has been released on the System Automation Application Manager WIKI:
It contains information and examples (in Java® and PERL) on how to control End-To-End resources via REST calls. For scripting this eliminates the necessity to log on to the node hosting the SA Application Manager and does not need to start a JVM (like eezcs does).
As always we are very interested in your feedback and any nice solutions (like handy scripts) you are developing.
- Sebastian Wegmann
Ever tried to bring a resource offline only for it to result in a state of "Stuck online" ?
Your first sign of a "Stuck online" situation will likely be from the output of the 'lssam' command. Here is some sample lssam output :
A "Stuck online" situation is rarely the fault of the automation software (TSA MP). Think of a situation where you apply the brakes in your car while driving along an icy road. Although you are hard on the brakes, you just keep sliding. Do you blame the brakes or do you blame the icy road? The reality is, there is nothing wrong with the braking system, it is the road on which you are traveling. It's the same for the TSAMP product ... it has issued the stop order ... it has executed the stop script ... the brakes have been applied !
So what are the likely causes of a resource becoming "Stuck online" ? Consider the following :
Some of you may have spotted the flaw in the car braking analogy ... the car will eventually stop, unfortunately as a result of hitting some object like a pole or another car. But hopefully you get my point that the brakes were not the problem, just like TSAMP is not the problem for a "Stuck online" situation.
Gareth Holl 100000C8M7 Tags:  db2 tiebreaker tsa samp tsamp ha quorum 2 Comments 11,115 Views
A Network TieBreaker is a popular configuration option for a Tivoli System Automation for Multiplatforms (TSA MP) managed environment.
... in a nutshell, a group of nodes is considered to have quorum if it represents more than half the nodes in the cluster.
A TieBreaker is needed to decide who takes control in a situation where it's not possible to decide based on the number of operational nodes in a sub-cluster, in other words when you have an even number of nodes in a cluster and a cluster split that results in half the nodes in each sub-cluster. The most obvious example is a 2 node cluster ... if the two nodes cannot talk to each other, a TieBreaker is needed to decide who should take control ... who should proceed with the necessary automation actions to keep resources highly available.
For the sake of this explanation, we're keeping things simple by only talking about a "Network" TieBreaker (there are other types like "disk" and "nfs"). We would specify a pingable system in the network that is independent of the clustered nodes, for example the gateway router used by the clustered nodes. Actually, it is considered a best practice to use the default gateway router as the Network TieBreaker device, also known as a "Quorum Device".
Consider a two node cluster as follows :
"node1" and "node2" are our clustered nodes, each configured with 10.20.30.1 as its default gateway for basic TCP/IP communications.
Now consider a node failure scenario. In my example, "node1" suffers a power failure.
"node2" can no longer ping "node1" (no response to heartbeats).
"node2" is only 1 node out of a 2 node domain, which is not considered a majority (more than half), so it uses the Network TieBreaker we set up when we first deployed the cluster: the gateway router.
"node2" successfully pings 10.20.30.1 and therefore regains quorum. If the resources were not already running on "node2", the TSAMP product would then perform the necessary automation actions to bring the resources online on "node2" in order to keep them highly available.
Now consider a network adapter failure scenario. First let's assume the power to "node1" was restored and both nodes are communicating (heartbeating) again. At some point there is a break in the network connectivity that isolates "node1" from the rest of the network.
"node2" can no longer heartbeat/ping "node1".
"node1" can no longer heartbeat/ping "node2".
In this case, both nodes lose quorum and attempt to ping the Network TieBreaker device, again the gateway router in this example.
"node1" cannot reach the default gateway because of whatever problem caused it to be isolated from the network in the first place.
"node2" is able to ping the gateway, our Network TieBreaker, so it regains quorum and hosts the resources in TSAMP's effort to keep resources highly available.
If "node1" had been hosting the online resources, it would have been forced to reboot at this point, to ensure the resources can be brought online on a surviving node without fear that they would be running concurrently on more than one server.
That's how a "Network TieBreaker" works. Here's the assumption: If "node1" can communicate (ping) with the default gateway and "node2" can communicate (ping) with the default gateway, then "node1" must be able to communicate (heartbeat) with "node2". If for some strange reason you have a network that would allow each node to ping a common gateway/device, but not each other, then a "Network" style TieBreaker is not for you.
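The two scenarios above can be replayed with a small simulation. This is only an illustrative sketch in Python, not how TSA MP is implemented: the function `resolve_quorum` and its parameters are hypothetical names, and the model simply treats "can ping the TieBreaker" as winning the tie, which relies on the assumption stated above that two nodes able to reach the gateway can also reach each other.

```python
from typing import Dict, Set


def resolve_quorum(
    cluster: Set[str],
    peers_seen: Dict[str, Set[str]],
    can_ping_tiebreaker: Dict[str, bool],
) -> Dict[str, bool]:
    """Return, for every node that is still running, whether it holds quorum.

    peers_seen[node] is the set of OTHER nodes it can still heartbeat with.
    Nodes missing from peers_seen are treated as down (e.g. power failure).
    """
    result = {}
    for node, peers in peers_seen.items():
        subcluster_size = len(peers) + 1  # the node itself plus reachable peers
        if 2 * subcluster_size > len(cluster):
            result[node] = True  # strict majority: quorum without a tie-break
        elif 2 * subcluster_size == len(cluster):
            # Exact half: the Network TieBreaker (the gateway) decides.
            result[node] = can_ping_tiebreaker.get(node, False)
        else:
            result[node] = False  # minority sub-cluster never gets quorum
    return result


if __name__ == "__main__":
    cluster = {"node1", "node2"}

    # Scenario 1: node1 loses power, node2 can still ping 10.20.30.1.
    print(resolve_quorum(cluster, {"node2": set()}, {"node2": True}))

    # Scenario 2: node1 is isolated from the network, both nodes are up.
    print(resolve_quorum(
        cluster,
        {"node1": set(), "node2": set()},
        {"node1": False, "node2": True},
    ))
```

In both runs "node2" ends up with quorum, and in the second run "node1" does not, which matches the walk-through.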
"The Majority Rules !"
In a nutshell, you have Quorum if you have the majority.
The main goals of quorum operations:
If critical resources are online on systems that lose Quorum, then the systems will :
What happens if you have an even number of nodes in the cluster and an even split ? For example, take the most popular cluster configuration, that is, a two node cluster. If the two nodes lose connectivity with each other, you essentially have two single-node sub-clusters, neither of which would have a majority (more than half), so quorum is not possible for either. If you don't have quorum, no automation is possible. This is when a TieBreaker would be needed !
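The majority rule and the even-split case can be written down in a few lines. This is a minimal Python sketch of the arithmetic only; the function names `has_quorum` and `needs_tiebreaker` are made up for illustration and are not part of the product.

```python
def has_quorum(subcluster_size: int, cluster_size: int) -> bool:
    """A sub-cluster has quorum only with MORE than half of all nodes."""
    return 2 * subcluster_size > cluster_size


def needs_tiebreaker(subcluster_size: int, cluster_size: int) -> bool:
    """An exact half split leaves neither side with a majority."""
    return 2 * subcluster_size == cluster_size


if __name__ == "__main__":
    # (nodes in this sub-cluster, nodes defined in the cluster)
    for size, total in [(2, 3), (1, 3), (1, 2), (2, 4)]:
        print(size, total, has_quorum(size, total), needs_tiebreaker(size, total))
```

For a two node cluster split into two single nodes, `has_quorum(1, 2)` is false on both sides and `needs_tiebreaker(1, 2)` is true, which is exactly the case where a TieBreaker has to decide.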
A TieBreaker is a mechanism (e.g. disk, network, operator) that is used to decide which sub-cluster gets Quorum (gets control) ... check out my next blog in this series, called "TSA Blog Series: High Availability Concepts - Do I need a TieBreaker?"
In a previous blog, I talked about the importance of collecting and providing diagnostic data. For the Tivoli System Automation for Multiplatforms (TSA MP) product, this means running its automated data collection utility called "getsadata".
If you're planning on asking IBM Support for help, more than likely there will be a minimum amount of detail they will need up front. The most obvious being details about your environment, such as platform/OS and product versions.
We've created a "landing page" for a collection of support resources pertaining to the Tivoli System Automation for Multiplatforms (TSA MP) product. Think of it as the home page for your initial Support needs. Here's the direct URL :
"Consolidation drives value." "Manage less and do more."
These initiatives, and many others like them, make sense from a financial standpoint, but can often lead to sleepless nights for operations managers once they realize the majority of their business, sometimes as much as 85%, is now run out of a single datacenter or a single platform. What happens when the system goes down? What happens when my datacenter loses connectivity? How quickly can I recover from an outage? Do I even need to recover, can I just roll-over to an active or passive backup?
Many mission critical core business applications, from larger vendors such as SAP and their ERP solutions, are run in datacenters much like this. However through high-availability and disaster recovery (HADR) capabilities provided by IBM, these datacenters can establish a failover policy to prevent even the slightest interruption to your company’s centralized business processes and financial systems.
IBM’s core HADR capabilities are provided by the Tivoli System Automation family of products; this post focuses on System Automation for Multiplatforms. Tivoli System Automation for Multiplatforms (SA MP) is a high availability clustering solution with advanced automation capabilities. It includes out-of-the-box resiliency policies for many IBM products delivering mission-critical capabilities to our customers. SA MP is the default, built-in HADR solution for IBM DB2 for Linux, Unix, and Windows, available at no extra charge to DB2 LUW customers. The expansive capability of System Automation to provide a single point-of-view into your HADR capabilities and manage them, whether your fail-over datacenters are across the street or across the continent, provides immense value to our customers in managing their heterogeneous environments.
One such customer leveraging these out-of-the-box policies from SA MP for DB2, and by extension SAP, is China Ocean Shipping (Group) Company. For a detailed overview of the entire solution implemented by COSCO, including IBM POWER hardware capabilities, please follow the link below. In this post, I’ll briefly touch on how COSCO was able to leverage HADR capabilities to prevent major damage to the business during multiple datacenter outages. COSCO consolidated much of their business operations onto a single SAP ERP solution, but required the highest levels of service availability from this single system. By putting their SAP solution on top of DB2 and leveraging the HADR capabilities provided by SA MP, this customer was able to deploy a single SAP ERP solution across multiple datacenters worldwide, offering near real-time replication. These HADR capabilities and benefits were fully realized multiple times when the customer’s main datacenters experienced prolonged outages, including the historic 2011 Tōhoku earthquake. They were able to seamlessly switch over from their datacenter in Tokyo to the off-site datacenter in Beijing, with virtually no service interruption.
Now the actual definition of “disaster recovery” may not always come to mind when you are planning your HADR strategies, but Tivoli System Automation, along with the many other IBM products it is embedded in, is available to minimize your key HADR metrics: recovery point objective (RPO) and recovery time objective (RTO). Many additional policies are available for other IBM and third-party products to monitor and ensure the availability of your business services. For more information on how you can be completely confident in your business’s HADR solution, check out the links below for a deeper dive into Tivoli System Automation for Multiplatforms.
For more information:
Less than two weeks are left until Pulse 2013 (March 3 - 6), the top event on Cloud, IT and Service Management in Las Vegas, where you can share best practices with 8,000 of your peers and hear from IBM business partners and top industry analysts on the latest trends and hottest IT topics.
Part of the Cloud and IT Optimization stream (http://www.ibm.com/software/tivoli/pulse/agenda/cloud.html#content) will be the Automated Operations Technical Council (AOTC).
Join this track to gain the knowledge you need to protect your business and IT services through end-to-end high availability, automation, and disaster recovery solutions. The AOTC continues its long term focus on the IBM Tivoli® System Automation family across distributed and z environments. This highly technical track is ideal for automation technical experts, IT project managers, and IT operations specialists who want to maximize the value of their system automation portfolio.
There will be developers giving updates on the latest development, as well as practitioners, customers and users talking about their experience. For example, there will be customer sessions on 'Advanced SAP Automation using Tivoli System Automation' and 'High Availability and Cross Platform Automation of a Complex Banking Applications'.
Isabell Sippli(IBM) 060000X124 Tags:  saam hadr system-automation tsa automation 1 Comment 5,473 Views
We're very close to our first Open Program Call on Tivoli System Automation Application Manager(https://ibm.biz/BdxKhQ)
Main focus of this call is the new User Interface we intend to ship based on that new platform in a v3.next release.
We'll post a recording on our Open Program developerWorks community (https://ibm.biz/BdxSCL)- stay tuned!
This IBM support webpage (http://www.ibm.com/support/docview.wss?uid=swg27024950&myns=swgtiv&mynp=OCSSRM2X&mync=R) features the documents most frequently requested by our customers, as well as other information identified by Support as valuable in helping answer questions related to IBM Tivoli System Automation for Multiplatforms (TSAMP). A must read if you run System Automation on AIX, Linux, Solaris or Windows.
Downloads and Updates
such as 'Checking and Adjusting Heartbeat sensitivity'
Presentations from Support
Whitepapers from Support
such as 'Rolling Upgrade Procedure for a TSAMP Automated HADR Environment - Oct 2011.pdf'
Learning more about TSAMP
IBM Software Support Resources
JoergErdmenger 100000AHDG Tags:  agile tsa-am tsamp open-program tsa system-automation appman 4,757 Views
The Open Program for IBM Tivoli System Automation has started.
We are developing the next version of our products:
In the Open Program, we are going to show you selected features and improvements that we intend to include in future releases of SA AppMan and SA MP. We are looking forward to discussing these features and improvements with you and receiving your feedback.
You have the chance to participate in and influence this new development during the Open Program for IBM Tivoli System Automation.
This self-paced audio-visual course provides an overview of the System Automation for z/OS 3.4 functional differences as they relate to implementation and administration. This is the first of three courses in a set of courses that cover implementation and administration differences. The other two courses provide demonstrations and additional details on specific topics. The functional differences associated with operational commands are covered in a separate set of courses.
With the 220.127.116.11 release of System Automation Application Manager, the capability to manage virtual guests on zEnterprise hardware has been introduced. This allows the SA Application Manager to start and stop virtual servers hosted inside the zEnterprise Ensemble.
As this functionality is not only available via the graphical interfaces but also for the command line, our team developed some sample scripts, written in PERL, to show several possibilities of eezcs scripting. You can also browse our manuals for detailed description of the new commands connect, lsnode, and nodereq.
Description: This script will accept one virtual server name (as it is shown in the zEnterprise HMC interface) and search for the mapping to a hostname in the domains known to the e2e manager. Once found the matching server is printed to STDOUT. If you want to adapt this script for later use, you should probably use it as a starting point for a general use method and instead return the name of the hostname to the caller.
Idea: You might want to stop a VirtualServer. With the hostname available to you, you are able to exclude that node before stopping it.
Description: This script will accept one hypervisor name (as it is shown in the zEnterprise HMC interface) and search for all mappings to hostnames in the domains known to the e2e manager. At the end all matches are printed to STDOUT. If you want to adapt this script for later use, you should probably use it as a starting point for a general use method and instead return the names of the hostnames to the caller.
Idea: Use this script to determine all hostnames for preparation tasks prior to a hypervisor shutdown, or determine whether your applications are sufficiently spread over multiple hypervisors.
Description: This script will accept one hostname, a timeout (in seconds), and a resource name (fully qualified). It will then start to search for the hostname in the connected domains. On the first match, the node will be started in that domain. The script will wait until the node is online in the automation domain it connected to. Afterwards a resource (identified by the third parameter) will be started.
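The start, wait and start-resource sequence described above follows a common "poll until online or time out" pattern. The sketch below shows that pattern in Python rather than PERL and is not the sample script itself: `start_node`, `node_is_online` and `start_resource` are hypothetical callables that a real script would implement by calling the command line interface, and the search for the hostname across the connected domains is assumed to happen inside them.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout_s: float,
               poll_s: float = 1.0) -> bool:
    """Poll condition() until it returns True or timeout_s seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(poll_s, remaining))


def start_node_then_resource(
    hostname: str,
    resource: str,
    timeout_s: float,
    start_node: Callable[[str], None],       # hypothetical: issues the node start request
    node_is_online: Callable[[str], bool],   # hypothetical: queries the node state
    start_resource: Callable[[str], None],   # hypothetical: starts the named resource
    poll_s: float = 1.0,
) -> bool:
    """Start the node, wait for it to come online, then start the resource.

    Returns False (and does not start the resource) if the node is not
    online within timeout_s seconds.
    """
    start_node(hostname)
    if not wait_until(lambda: node_is_online(hostname), timeout_s, poll_s):
        return False
    start_resource(resource)
    return True
```

Passing the three operations in as callables keeps the timing logic testable without a running automation domain; the resource is only started once the node has actually been seen online.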
Idea: This script can be used inside a large system bringup, managing the relationship from software to hardware.
Description: This script accepts one e2e resource reference or group name. It will start all servers necessary for this e2e resource, wait until they are operational and then start the e2e resource. This script uses a fixed timeout for each startWait operation of 10 minutes.
Idea: This script shows an example of a complete dependency, from hardware, to first-level-automation domain applications, to an end-to-end-reference.
Download the complete package from our Wiki.
Feedback on any of the scripts, new ideas for scripts and questions about other tasks are very welcome in our forum.