Data Center Automation
dplantz 1000003W46 3,297 views
It is officially live! The IT Service Management (Tivoli)
support and development staff are contributing Q&A daily in IBM's dWAnswers forum. Check it out at
Simply type in a "tag", which is usually your product's acronym, to see what topics exist. For example, IBM Tivoli Monitoring version 6 is simply ITMv6.
Gareth Holl 100000C8M7 5,252 views
I wrote a new tool I call "db2tsacheck" to help avoid unwanted surprises due to configuration problems. Its sole purpose is to look for as many configuration problems as possible for TSAMP managed DB2 HADR or DB2 HA Shared Disk environments. It's a tool you can run on a periodic basis to validate all is well with your configuration.
To download the new version, please visit the following URL:
Here's what it looks like after hitting <enter> to have db2tsacheck attempt to fix all the problems found :
The output is much more compact and makes it more immediately obvious what is and is not a problem.
Problems found can be selectively fixed by entering the individual numbers associated with each problem listed, or all problems can be fixed by simply hitting <enter> after all the problems are listed.
Here's what the output of db2tsacheck looks like when it shows a clean bill of health (no problems) :
Avoid unwanted surprises ... check your TSA/DB2 HA environment now with our new "configuration checker" tool
Why are you using the Tivoli System Automation for Multiplatforms (TSAMP) product with your DB2 HADR or DB2 HA Shared Disk environment ?
The answer should be to keep the DB2 service highly available (and provide some operational convenience via some basic automation TSAMP can provide).
But how do you know your HA environment is prepared for that unexpected problem? In my mind, it's similar to a "backup" you take to help you recover from an unexpected event ... how do you know the backup is good and will restore when you need it?
[ Got your head stuck in a newspaper ? Make sure you look up occasionally to ensure you're not headed for disaster ]
I wrote a new tool I call "db2tsacheck" to help avoid unwanted surprises due to configuration problems. Its sole purpose is to look for as many configuration problems as possible for TSAMP managed DB2 HADR or DB2 HA Shared Disk environments. It's a tool you could run on a periodic basis to validate all is well with your configuration. For example, run it before each weekend ... run it before you are about to do a controlled failover/takeover ... run it before you start preparing the TSAMP environment for your planned maintenance activities.
There are a couple other tasks "db2tsacheck" can do:
A new paper has been released on the System Automation Application Manager WIKI:
It contains information and examples (in Java® and Perl) on how to control End-To-End resources via REST calls. For scripting, this eliminates the need to log on to the node hosting the SA Application Manager, and no JVM has to be started (as eezcs requires).
As always we are very interested in your feedback and any nice solutions (like handy scripts) you are developing.
- Sebastian Wegmann
I'm also including the URL to the official guide
that shows how to set up and operate an SA MP cluster to keep TSM highly available.
Markus Müller 1000004PXP 3,401 views
Since it is not easy to find, I'm including the URL to the official TWS integration guide here.
Chapter 6 of this guide explains how to set up and operate a cluster with SA MP to keep TWS highly available.
BerndJostmeyer 110000B97C 3,079 views
There is a cool new video on YouTube showing the (old) integration of SA Application Manager with ITM to add policy-based automation on resources of a datacenter being monitored by ITM. The video also shows, impressively, the integration of SA and ITM widgets in one combined dashboard. Have a look: http://youtu.be/_5OHhZ0czdU
Gareth Holl 100000C8M7 6,312 views
Given a 2 node domain, what option do you have to allow a single node to obtain Operational Quorum after bringing one node offline and knowing your TieBreaker device will not be accessible from the surviving node ?
Firstly, if you want to know more about "Quorum", check out one of my earlier blogs:
There are a number of scenarios where clients have ended up with a surviving node unable to obtain quorum during various maintenance activities. There is a way to avoid the dependency on the quorum device (for example a network TieBreaker) during a period where you know there will be network outages or node outages, or at the very least problems reserving the TieBreaker.
You can list the TieBreakers defined in your domain using the following command :
But here's the thing ... you can only change the active TieBreaker when you have Configuration Quorum ... this means in a 2 node domain, both nodes need to be online before you do this. This should make it very clear that such a change would need to be done in advance of a quorum problem. What is being offered here is not a means of restoring quorum after you find your domain in a pending or no quorum state [if you are in this predicament, call Support and we'll see what we can offer for your individual situation ].
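As a rough illustration of the timing constraint described above, the following Python sketch builds the RSCT commands commonly used to inspect and switch the active TieBreaker, and refuses to build the change command unless Configuration Quorum would hold. The helper names, the "more than half of the nodes" rule used for the check, and the predefined "Operator" TieBreaker in the usage line are assumptions for illustration only; they are not necessarily the command or the TieBreaker shown in the original post.

```python
# Sketch only: helper names and the quorum check are illustrative assumptions.

def list_tiebreakers_cmd():
    """Command that lists the TieBreaker resources defined in the domain."""
    return ["lsrsrc", "IBM.TieBreaker"]


def show_active_tiebreaker_cmd():
    """Command that shows which TieBreaker is currently active."""
    return ["lsrsrc", "-c", "IBM.PeerNode", "OpQuorumTieBreaker"]


def has_configuration_quorum(online_nodes, defined_nodes):
    """Assumed rule: more than half of the defined nodes must be online."""
    return online_nodes > defined_nodes / 2


def set_active_tiebreaker_cmd(name, online_nodes, defined_nodes):
    """Build the change command, but only while Configuration Quorum holds."""
    if not has_configuration_quorum(online_nodes, defined_nodes):
        raise RuntimeError("no Configuration Quorum: change the TieBreaker in advance")
    return ["chrsrc", "-c", "IBM.PeerNode", f"OpQuorumTieBreaker={name}"]


if __name__ == "__main__":
    # 2 node domain with both nodes online: the change is allowed.
    print(" ".join(set_active_tiebreaker_cmd("Operator", 2, 2)))
```

On a cluster node, each returned list could be handed to `subprocess.run`. With only one of two nodes online the helper raises instead, which mirrors the point that the switch has to be made before the outage, not after.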
Gareth Holl 100000C8M7 5,549 views
To understand how things are organized in a TSAMP/RSCT environment, we start with the idea that almost everything is considered a resource. Of course there are different kinds of resources and that is where we introduce the concept of a resource "class". Then there are different Resource Managers, each responsible for managing or controlling resources that belong to a particular set of resource classes. The following diagram shows the mapping of three key Resource Managers to some Resource Classes they manage and then to some example Resources :
Most will recognize the more common examples of resources, the ones that represent the entities that TSAMP manages, such as your applications using the resource class IBM.Application and virtual IP addresses using the class IBM.ServiceIP, as shown in the previous diagram.
Tivoli System Automation for Multiplatforms (TSAMP), the "Automation" software, is made up of 3 Resource Managers (two shown in blue in the previous diagrams), as follows :
Reliable Scalable Cluster Technology (RSCT), the "Cluster" software, is made up of several core daemons and some Resource Managers (two shown in orange in the previous diagrams), the most important being the following two :
The concept of resource "class" is important because it dictates the attributes the resource contains. For example, an IBM.Application resource will have the attributes StartCommand, StopCommand, and MonitorCommand. Of course it has a bunch of others as well, many of which are mandatory. In contrast, a resource of class IBM.ServiceIP has a completely different set of attributes, like IPAddress and NetMask. The following shows a listing for each to help you see the difference between these two classes :
There are many other classes, though not necessarily used within a TSAMP based solution.
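To make the idea that the class dictates the attributes more concrete, here is a small Python sketch that models the two classes as data structures. It only includes the attributes named in the text plus a Name attribute; the Python class names, the helper function and the sample values are made up for illustration, and real resources of these classes carry many more attributes.

```python
from dataclasses import dataclass, fields


@dataclass
class Application:
    """Stand-in for an IBM.Application resource (subset of attributes)."""
    Name: str
    StartCommand: str
    StopCommand: str
    MonitorCommand: str


@dataclass
class ServiceIP:
    """Stand-in for an IBM.ServiceIP resource (subset of attributes)."""
    Name: str
    IPAddress: str
    NetMask: str


def attribute_names(resource_class):
    """Return the set of attribute names a class defines."""
    return {f.name for f in fields(resource_class)}


if __name__ == "__main__":
    # Hypothetical values, only to show that each class has its own shape.
    app = Application("example-app", "/opt/example/start.sh",
                      "/opt/example/stop.sh", "/opt/example/monitor.sh")
    vip = ServiceIP("example-ip", "192.0.2.10", "255.255.255.0")
    print(sorted(attribute_names(Application) - attribute_names(ServiceIP)))
    print(sorted(attribute_names(ServiceIP) - attribute_names(Application)))
```

In this sketch the only attribute the two classes share is Name; everything else is specific to the class.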
Finally, my attempt to pull it all together: the grouping of resources (of classes IBM.Application, IBM.ServiceIP, and IBM.AgFileSystem) and the defined relationships (IBM.ManagedRelationship) between them make up what is called the Automation Policy, also referred to as the Resource Model.
The diagram to the right is an example of an automation policy for a 2 server clustered DB2 HADR environment.
The Automation Policy provides TSAMP with the knowledge of how to start, stop, and monitor the resources, as well as where it is allowed to start them, and what the dependencies are for each. Refer back to the listing examples of an IBM.Application and IBM.ServiceIP resource to see all the attributes that TSAMP can use to understand what it is being asked to manage and how.
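One way to picture how resources plus relationships turn into start and stop behavior is a small dependency graph. The Python sketch below is a simplified model, not how TSAMP is implemented: the resource names and the dependencies between them are hypothetical, and real relationship semantics are richer than a plain ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical resources, keyed by name, with their resource class.
resources = {
    "db2-fs": "IBM.AgFileSystem",
    "db2-ip": "IBM.ServiceIP",
    "db2-app": "IBM.Application",
}

# Hypothetical dependency relationships: resource -> resources it needs first.
depends_on = {
    "db2-app": {"db2-fs", "db2-ip"},
}


def start_order(resources, depends_on):
    """Return resource names so that every dependency comes before its dependent."""
    graph = {name: depends_on.get(name, set()) for name in resources}
    return list(TopologicalSorter(graph).static_order())


def stop_order(resources, depends_on):
    """Stopping runs in the opposite direction: dependents first."""
    return list(reversed(start_order(resources, depends_on)))


if __name__ == "__main__":
    print("start:", start_order(resources, depends_on))
    print("stop: ", stop_order(resources, depends_on))
```

In this model the application starts last and stops first, because it depends on both the file system and the service IP.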
I'll leave you with the thought that your employer probably refers to you as a resource also. What class of resource are you ?
Ever tried to bring a resource offline only for it to result in a state of "Stuck online" ?
Your first sign of a "Stuck online" situation will likely be from the output of the 'lssam' command. Here is some sample lssam output :
A "Stuck online" situation is rarely the fault of the automation software (TSA MP). Think of a situation where you apply the brakes in your car while driving along an icy road. Although you are hard on the brakes, you just keep sliding. Do you blame the brakes or do you blame the icy road? The reality is, there is nothing wrong with the braking system, it is the road on which you are traveling. It's the same for the TSAMP product ... it has issued the stop order ... it has executed the stop script ... the brakes have been applied !
So what are the likely causes of a resource becoming "Stuck online" ? Consider the following :
Some of you may have spotted the flaw in the car braking analogy ... the car will eventually stop, unfortunately as a result of hitting some object like a pole or another car. But hopefully you get my point that the brakes were not the problem, just like TSAMP is not the problem for a "Stuck online" situation.
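Since the first sign of the problem is usually the 'lssam' output, a periodic check could scan that output for the state string. The Python sketch below does this against a made-up sample; the sample text, resource names and function name are hypothetical, and the layout of real lssam output may differ in detail.

```python
import re

# Hypothetical sample shaped like lssam output; not captured from a real system.
SAMPLE = """\
Pending offline IBM.ResourceGroup:example-rg Nominal=Offline
        '- Stuck online IBM.Application:example-app
                |- Stuck online IBM.Application:example-app:node01
                '- Offline IBM.Application:example-app:node02
"""

# Matches the state string followed by "<class>:<resource name>".
STUCK = re.compile(r"Stuck online\s+(IBM\.\w+):(\S+)")


def find_stuck_online(lssam_text):
    """Return (resource class, resource name) pairs reported as Stuck online."""
    return STUCK.findall(lssam_text)


if __name__ == "__main__":
    for resource_class, name in find_stuck_online(SAMPLE):
        print(f"{resource_class} {name} is stuck online")
```

A hit only tells you where to look; as argued above, the cause is usually found in the stop script or the application itself, not in the automation software.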