Ever tried to bring a resource offline, only to have it end up in a "Stuck online" state ?
A "Stuck online" situation could also prevent a move request (failover) since the first step of a move is to stop/offline all resources that are in the scope of the move.
Your first sign of a "Stuck online" situation will likely be from the output of the 'lssam' command. Here is some sample lssam output :
Stuck online IBM.ResourceGroup:App-rg Request=Move Control=MemberInProblemState Nominal=Online
        |- Offline IBM.Application:App1 Binding=Sacrificial
                |- Offline IBM.Application:App1:node01
                '- Offline IBM.Application:App1:node02
        |- Stuck online IBM.Application:App2 Control=MemberInProblemState
                |- Stuck online IBM.Application:App2:node01
                '- Offline IBM.Application:App2:node02
In the above example, there was an attempt to move the resources from node01 to node02, but the resource called "App2" could not be brought offline on node01.
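If you save 'lssam' output to a file, a few lines of script can pick out the "Stuck online" entries for you. The following is only a minimal sketch in Python: the sample text is copied from the output above, and the regular expression is my own assumption about the line layout (an optional tree prefix, a state, then a resource reference starting with "IBM."), not an official description of the lssam format.

```python
import re

# Assumed line layout: optional tree prefix (spaces, |, ', -), then the
# operational state, then a resource reference that starts with "IBM.".
LINE_RE = re.compile(r"^[\s|'\-]*(?P<state>.+?)\s+(?P<resource>IBM\.\S+)")


def find_stuck_online(lssam_text):
    """Return the resource references whose state is 'Stuck online'."""
    stuck = []
    for line in lssam_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("state").lower() == "stuck online":
            stuck.append(match.group("resource"))
    return stuck


# Sample taken from the lssam output shown above.
SAMPLE = """\
Stuck online IBM.ResourceGroup:App-rg Request=Move Control=MemberInProblemState Nominal=Online
|- Offline IBM.Application:App1 Binding=Sacrificial
|- Offline IBM.Application:App1:node01
'- Offline IBM.Application:App1:node02
|- Stuck online IBM.Application:App2 Control=MemberInProblemState
|- Stuck online IBM.Application:App2:node01
'- Offline IBM.Application:App2:node02
"""

if __name__ == "__main__":
    for resource in find_stuck_online(SAMPLE):
        print(resource)
```

The entry that ends in a node name (here App2 on node01) tells you where to start looking.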
A "Stuck online" situation is rarely the fault of the automation software (TSA MP). Think of a situation where you apply the brakes in your car while driving along an icy road. Although you are hard on the brakes, you just keep sliding. Do you blame the brakes or do you blame the icy road ? The reality is that there is nothing wrong with the braking system; the problem is the road on which you are traveling. It's the same for the TSAMP product ... it has issued the stop order ... it has executed the stop script ... the brakes have been applied !
So what are the likely causes of a resource becoming "Stuck online" ? Consider the following :
1) The stop script exits with a non-zero return code. This is telling TSAMP that the stop script could not stop the underlying application/resource. There's nothing TSAMP can do about this.
- Focus should be on what the stop script is doing so as to figure out why it could not stop the resource. Check out the syslog on the server where the resource would not stop.
- But more often than not, the focus should be on the underlying application that would not stop. Check out the native logs of the application that could not be stopped.
2) The stop script exited with a return code of 0, suggesting a successful stop operation, however the monitor script continued to report the underlying application as online.
- Focus on the monitor script to ensure it is accurately reporting the status of your application. Again use the syslog as this is where all start/stop/monitor scripts should be logging to.
- Focus on the application to see if there is any evidence that the stop script tried to stop it ... maybe your application has its own auto-start mechanism that needs to be turned off ... maybe your application is hung.
- Focus on the stop script ... why did it exit with return code of 0 if it did not actually stop the underlying application ?
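The two cases above boil down to a small decision table. Here is a rough sketch of it in Python; the function name and the wording of the hints are mine, for illustration only, and nothing here is part of the TSAMP product.

```python
def where_to_look(stop_rc, monitor_says_online):
    """Map the observed symptoms to the focus areas listed above.

    stop_rc             -- return code of the stop script
    monitor_says_online -- True if the monitor script still reports online
    """
    if stop_rc != 0:
        # Case 1: the stop script itself reported a failure.
        return [
            "stop script (syslog on the node where the stop failed)",
            "application (native logs of the application)",
        ]
    if monitor_says_online:
        # Case 2: the stop script claimed success, the monitor disagrees.
        return [
            "monitor script (is it reporting the state accurately?)",
            "application (auto-start mechanism? hung?)",
            "stop script (why return code 0 if nothing was stopped?)",
        ]
    # Stop succeeded and the monitor agrees: not a "Stuck online" case.
    return []


if __name__ == "__main__":
    for hint in where_to_look(stop_rc=0, monitor_says_online=True):
        print(hint)
```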
Some of you may have spotted the flaw in the car braking analogy ... the car will eventually stop, unfortunately as a result of hitting some object like a pole or another car. But hopefully you get my point that the brakes were not the problem, just like TSAMP is not the problem for a "Stuck online" situation.
As far as recovery is concerned, you will probably need a tow truck followed by a car body shop. Oops, wrong focus. To recover from a "Stuck online" situation, the general advice is to manually stop the underlying application that could not be stopped by TSAMP executing the application's stop script. There might be times where you would like to clear the "Stuck online" state without stopping the underlying application/resource ... you can do one of two things :
Find the PID for the IBM.GblResRMd process on the node where the resource shows "Stuck online", and kill that PID (do not use the -9 option with the kill command). IBM.GblResRMd will automatically re-spawn.
For a resource of class "IBM.AgFileSystem" that is "Stuck online", use the following technote :
In summary, slow down when driving on icy roads, else you might find yourself "Stuck in a ditch"
Modified on by Gareth Holl
"The Majority Rules !"
In a nutshell, you have Quorum if you have the majority.
Quorum is the number of "operational" nodes in a cluster that are required to control the resources, modify the cluster definition, or perform certain cluster operations.
The main goals of quorum operations:
- identify who has the majority when a cluster is broken up into sub-clusters
- keep data consistent, especially when shared file systems are being used
- protect critical resources … maintain HA control
If critical resources are online on systems that lose Quorum, then the systems will :
- Commit suicide and re-boot (Default)
- Commit suicide and halt
- Do nothing (for testing only)
What happens if you have an even number of nodes in the cluster and an even split ? For example, take the most popular cluster configuration, that is, a two-node cluster. If the two nodes lose connectivity with each other, you essentially have two single-node sub-clusters, neither of which would have a majority (more than half), so quorum is not possible for either. If you don't have quorum, no automation is possible. This is when a TieBreaker would be needed !
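The "more than half" rule is easy to check with a little arithmetic. This is just an illustrative sketch in Python (the function name is mine), not how the product itself is implemented.

```python
def has_majority(nodes_in_subcluster, nodes_in_cluster):
    """True if the sub-cluster holds more than half of the cluster's nodes."""
    return 2 * nodes_in_subcluster > nodes_in_cluster


if __name__ == "__main__":
    # Two-node cluster split into two single-node sub-clusters.
    print(has_majority(1, 2))
    # Three-node cluster split 2 / 1.
    print(has_majority(2, 3), has_majority(1, 3))
    # Four-node cluster split evenly 2 / 2.
    print(has_majority(2, 4))
```

With an odd number of nodes, one side of a split always has the majority. With an even split, neither side does, which is exactly the gap a TieBreaker fills.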
A TieBreaker is a mechanism (e.g. disk, network, operator) that is used to decide which sub-cluster gets Quorum (gets control) ... check out my next blog in this series, called "TSA Blog Series: High Availability Concepts - Do I need a TieBreaker?"
A Network TieBreaker is a popular configuration option for a Tivoli System Automation for Multiplatforms (TSA MP) managed environment.
But what is a TieBreaker and why is it needed ? To understand the what and why, you first need to understand the concept of "quorum" ... please see my blog titled "TSA Blog Series: High Availability Concepts - What is Quorum ?" https://www.ibm.com/developerworks/community/blogs/d6a38b59-943a-434b-a473-b408ed64847d/entry/what_is_quorum?lang=en
... in a nutshell, a group of nodes is considered to have quorum if it represents more than half the nodes in the cluster.
A TieBreaker is needed to decide who takes control in a situation where it's not possible to decide based on the number of operational nodes in a sub-cluster, in other words when you have an even number of nodes in a cluster and a cluster split that results in half the nodes in each sub-cluster. The most obvious example is a two-node cluster ... if the two nodes cannot talk to each other, a TieBreaker is needed to decide who should take control ... who should proceed with the necessary automation actions to keep resources highly available.
For the sake of this explanation, we're keeping things simple by only talking about a "Network" TieBreaker (there are other types like "disk" and "nfs"). We would specify a pingable system in the network that is independent of the clustered nodes, for example the gateway router used by the clustered nodes. Actually, it is considered a best practice to use the default gateway router as the Network TieBreaker device, also known as a "Quorum Device".
Consider a two node cluster as follows :
"node1" and "node2" are our clustered nodes, each configured with 10.20.30.1 as its default gateway for basic TCP/IP communications.
Now consider a node failure scenario. In my example, "node1" suffers a power failure.
"node2" can no longer ping "node1" (no response to heartbeats).
"node2" is only 1 node out of a 2 node domain, which is not considered a majority (more than half), so it uses the Network TieBreaker we defined when we first deployed the cluster, the gateway router.
"node2" successfully pings 10.20.30.1 and therefore regains quorum. If the resources were not already running on "node2", the TSAMP product would then perform the necessary automation actions to bring the resources online on "node2" in order to keep them highly available.
Now consider a network adapter failure scenario. First let's assume the power to "node1" was restored and both nodes are communicating (heartbeating) again. At some point there is a break in the network connectivity that isolates "node1" from the rest of the network.
"node2" can no longer heartbeat/ping "node1".
"node1" can no longer heartbeat/ping "node2".
In this case, both nodes lose quorum and attempt to ping the Network TieBreaker device, again the gateway router in this example.
"node1" cannot reach the default gateway because of whatever problem caused it to be isolated from the network in the first place.
"node2" is able to ping the gateway, our Network TieBreaker, so it regains quorum and hosts the resources in TSAMP's effort to keep resources highly available.
If "node1" had been hosting the online resources, it would have been forced to reboot at this point, to ensure the resources can be brought online on a surviving node without fear that they would be running concurrently on more than one server.
That's how a "Network TieBreaker" works. Here's the assumption: If "node1" can communicate (ping) with the default gateway and "node2" can communicate (ping) with the default gateway, then "node1" must be able to communicate (heartbeat) with "node2". If for some strange reason you have a network that would allow each node to ping a common gateway/device, but not each other, then a "Network" style TieBreaker is not for you.
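The two scenarios above can be replayed with a tiny model. This is a simplified sketch in Python of the decision described in this post, not the actual product logic; the function and parameter names are made up for illustration.

```python
def gets_quorum(nodes_in_subcluster, nodes_in_cluster, can_ping_tiebreaker):
    """Decide whether a sub-cluster ends up with quorum.

    A strict majority wins outright, an exact half falls back to the
    Network TieBreaker ping, and anything less than half loses.
    """
    if 2 * nodes_in_subcluster > nodes_in_cluster:
        return True
    if 2 * nodes_in_subcluster == nodes_in_cluster:
        return can_ping_tiebreaker
    return False


if __name__ == "__main__":
    # Node failure: node1 is powered off, node2 can ping the gateway.
    print("node2:", gets_quorum(1, 2, can_ping_tiebreaker=True))

    # Network adapter failure: node1 is isolated, node2 still reaches the gateway.
    print("node1:", gets_quorum(1, 2, can_ping_tiebreaker=False))
    print("node2:", gets_quorum(1, 2, can_ping_tiebreaker=True))

    # The broken assumption: both nodes reach the gateway but not each other.
    # Both sides would be granted quorum, which is why a Network TieBreaker
    # does not suit such a network.
    print("both:", gets_quorum(1, 2, True), gets_quorum(1, 2, True))
```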
In a previous blog, I talked about the importance of collecting and providing diagnostic data. For the Tivoli System Automation for Multiplatforms (TSA MP) product, this means running its automated data collection utility called "getsadata".
However, there are details about a problem situation that cannot be obtained by a tool, script, or bunch of commands. The most obvious is the problem description itself. So what does a good problem description entail ?
Well, a timeline for one. Let's say Support staff have to dig into the log and trace data; a timeline will allow us to get to the most relevant messages much more quickly. Consider that at least one of the core trace files we use can contain thousands of lines of trace messages for only a few seconds of time. This means a quicker turn-around for you. On the flip side, if there has been any incorrect interpretation of your original problem description, we might start looking at the wrong time period in the log and trace data if you haven't given us a clear timeline of events ... ultimately this could result in an analysis that really doesn't make sense to you, because it's not relevant to the problem that you are focused on ... bottom line, time wasted for all of us. So, timeline, timeline, timeline, and don't forget timeline.
A common theme with problem descriptions we see is the incorrect use of terminology. This is not a criticism. This is the reality of many of our customers being thrown into the deep end, supporting a solution with a product they don't have a lot of experience with and don't have time to attend any formal education for. Where this becomes a problem is in how Support staff interpret what is really being described or asked. So to alleviate this problem, the single most important piece of supporting information you can provide with your problem description is the output of 'lssam -nocolor'. You would be surprised how many times we have been able to explain a situation and answer a client's question without looking at any log or trace data, just by checking what 'lssam' has captured. But for this to be useful, you need to remember to save off the output of 'lssam' at the time you're observing the problem you need help with. The output of 'lssam' is just a snapshot for a very brief instant in time.
What can 'lssam' tell Support? Firstly, it shows us what resources TSAMP is managing and how they are grouped. It tells us which nodes (servers) these resources can run on or are running on. Of course, the primary reason 'lssam' exists is to show you the operational state (OpState) for each resource (online, offline, pending online|offline, failed offline, and so on), and this is certainly valuable information for Support staff, particularly if you're looking for guidance on what to do next as part of recovery efforts. But again, the OpState information is only valuable if it reflects the states that you observed during the problem period. Make it a habit to run 'date >> lssam.out; lssam -nocolor >> lssam.out' whenever you see something unusual or something you *think* you may want to follow-up with Support about.
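If you would rather script that habit, something along these lines would do the same job as the one-liner above. Treat it as a sketch in Python: the "=== timestamp ===" separator and the function names are my own choices, and the only TSAMP command it calls is 'lssam -nocolor'.

```python
import subprocess
from datetime import datetime


def append_snapshot(path, lssam_output, now=None):
    """Append one timestamped lssam snapshot to the file at 'path'."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as out:
        out.write(f"=== {stamp} ===\n")
        out.write(lssam_output.rstrip("\n") + "\n")


def capture_lssam():
    """Run 'lssam -nocolor' on this node and return its output as text."""
    result = subprocess.run(
        ["lssam", "-nocolor"], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    # Must be run on a node where the lssam command is available.
    append_snapshot("lssam.out", capture_lssam())
```

Each run adds one more snapshot to lssam.out, so the file doubles as a rough timeline of what you observed.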
Other useful hints:
1) If you're referring to a server as the primary or the standby or the failover server, please attach hostnames (nodenames) to them in your problem description since the concepts of primary, standby, etc are meaningless to TSAMP and therefore meaningless to TSAMP Support staff.
2) If you're referring to an application that failed to start or didn't failover, and so on, then please tell us what the resource name is for that application. The TSAMP product can be used to make practically any application highly available, so there is a possibility we won't know what application you are referring to unless you tell us the "resource" name within the TSAMP automation policy that is associated with your application ... this also goes back to what we would see in the output of 'lssam'.
3) Say an application (or resource) failed to start and you don't have 'lssam' output that shows this, perhaps because recovery efforts were already performed. If you want the root cause to be determined, then you need to provide more details about the failed start attempt, for example, on what node(s) did the resource fail to start and "when" was this failed start. What did you do to try and make it start (start the domain, change a resource group's Nominal state to online, etc) ? How did you recover, assuming it's not currently down/offline ?
4) Often telling us what you expected in addition to what you observed can help us understand what you're reporting as a problem.
5) Then there are the classic questions like, has this ever worked ? And what has changed recently (yes I know, nothing was changed)
Finally, as I said in a previous blog around diagnostic data collection, providing a detailed problem description at the time you open a PMR will result in a quicker answer and/or resolution steps from the TSAMP Support staff. Note that the electronic Service Request (SR) webpage (https://www-947.ibm.com/support/servicerequest/Home.action) is the ideal method for opening new PMRs (and even updating existing ones) as this will allow you to control the problem description and give you an immediate opportunity to upload supporting data ... you definitely cannot rely on the call center phone operators to accurately enter a problem description you dictate over the phone ... most of those problem descriptions I find to be useless, to be blunt.
This self-paced audio-visual course provides an overview of the System Automation for z/OS 3.4 functional differences as they relate to implementation and administration. This is the first of three courses in a set of courses that cover implementation and administration differences. The other two courses provide demonstrations and additional details on specific topics. The functional differences associated with operational commands are covered in a separate set of courses.
Main focus of this call is the new User Interface we intend to ship based on that new platform in a v3.next release.
Modified on by Rick Osowski
"Consolidation drives value." "Manage less and do more."
These initiatives, and many others like them, make sense from a financial standpoint, but can often lead to sleepless nights for operations managers once they realize the majority of their business, sometimes as much as 85%, is now run out of a single datacenter or a single platform. What happens when the system goes down? What happens when my datacenter loses connectivity? How quickly can I recover from an outage? Do I even need to recover, can I just roll-over to an active or passive backup?
Many mission critical core business applications, from larger vendors such as SAP and their ERP solutions, are run in datacenters much like this. However through high-availability and disaster recovery (HADR) capabilities provided by IBM, these datacenters can establish a failover policy to prevent even the slightest interruption to your company’s centralized business processes and financial systems.
IBM’s core HADR capabilities are provided by the Tivoli System Automation family of products; here we’ll focus specifically on System Automation for Multiplatforms. Tivoli System Automation for Multiplatforms (SA MP) is a high availability clustering solution with advanced automation capabilities. It includes out-of-the-box resiliency policies for many IBM products delivering mission-critical capabilities to our customers. SA MP is the default, built-in HADR solution for IBM DB2 for Linux, Unix, and Windows available at no extra charge to DB2 LUW customers. The expansive capability of System Automation to provide a single point-of-view into your HADR capabilities and manage them, whether your fail-over datacenters are across the street or across the continent, provides immense value to our customers in managing their heterogeneous environments.
One such customer leveraging these out-of-the-box policies from SA MP for DB2, and by extension SAP, is China Ocean Shipping (Group) Company. For a detailed overview of the entire solution implemented by COSCO, including IBM POWER hardware capabilities, please follow the link below. In this post, I’ll briefly touch on how COSCO was able to leverage HADR capabilities to prevent major damage to the business during multiple datacenter outages. COSCO consolidated much of their business operations onto a single SAP ERP solution, but required the highest levels of service availability from this single system. By putting their SAP solution on top of DB2 and leveraging the HADR capabilities provided by SA MP, this customer was able to deploy a single SAP ERP solution across multiple datacenters worldwide, offering near real-time replication. These HADR capabilities and benefits were fully realized multiple times when the customer’s main datacenters experienced prolonged outages, including the historic 2011 Tōhoku earthquake. They were able to seamlessly switch over from their datacenter in Tokyo to the off-site datacenter in Beijing, with virtually no service interruption.
Now the actual definition of “disaster recovery” may not always come to mind when you are planning your HADR strategies, but Tivoli System Automation, along with the many other IBM products it is embedded in, is available to minimize your key HADR metrics, both recovery point objective (RPO) and recovery time objective (RTO). Many additional policies are available for other IBM and third-party products to monitor and ensure the availability of your business services. For more information on how you can be completely confident in your business’s HADR solution, check out the links below for a deeper dive into Tivoli System Automation for Multiplatforms.
For more information:
Tivoli System Automation for Multiplatforms
Tivoli System Automation for Multiplatforms SAP Success Stories
China Ocean Shipping (Group) Company surges into new markets with IBM and SAP
If you're planning on asking IBM Support for help, more than likely there will be a minimum amount of detail they will need up front. The most obvious being details about your environment, such as platform/OS and product versions.
Now if you need the root cause for some event that has since passed, then keep in mind that someone providing "remote" support will likely need historical log or trace data before they will be able to offer you anything significant.
For the Tivoli System Automation for Multiplatforms (TSAMP) product, we have a diagnostic data collection utility called "getsadata". This tool is a one stop, collect all, very exhaustive data collector that will provide TSAMP Support staff with everything they need to help you in 95% of cases. Of course we would still like an accurate problem description and a timeline, but often even the information collected by getsadata can be used by Support to work out what you need help with.
So first let me point you to a link that explains how to use TSAMP's data collector (getsadata) :
The "getsadata" utility is bundled and installed with the TSAMP product within the "/usr/sbin/rsct/install/bin" directory. However, if you're many fixpacks behind the latest, or if you are not using the latest release level of TSAMP, then I would encourage you to download the latest version of getsadata, which can always be obtained via the above URL.
Here are some key things to remember:
1. Execute 'getsadata' with root authority. Running as any other user will likely result in data Support cannot use to help you.
2. Run getsadata on all nodes in the domain where possible, but only after first running it on the node hosting the "master" automation engine (IBM.RecoveryRM), identified by using either of the following commands (executed from any node):
- lssamctrl -V
- lssrc -ls IBM.RecoveryRM | grep -i master
3. It is important to collect data as soon as possible after a problem is observed in order to collect all log and trace data before data is lost (First In, First Out, fixed size trace files). This doesn't necessarily apply if your environment has trace spooling enabled.
4. It is equally important to run the utility before any manual (user) recovery efforts are attempted. This will ensure an accurate snapshot of the current states which can then be correlated with the logs and traces collected.
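Point 2 above is really just an ordering rule: the node hosting the master automation engine first, then everyone else. Here is a small sketch of that rule in Python; the function name and the node names are hypothetical, and you would still identify the master yourself using one of the two commands listed above.

```python
def collection_order(nodes, master):
    """Return the nodes in the order getsadata should be run on them.

    The node hosting the master IBM.RecoveryRM comes first, followed by
    the remaining nodes in their original order.
    """
    if master not in nodes:
        raise ValueError(f"master node {master!r} is not in the node list")
    return [master] + [node for node in nodes if node != master]


if __name__ == "__main__":
    # Hypothetical three-node domain where node02 hosts the master.
    print(collection_order(["node01", "node02", "node03"], master="node02"))
```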
Let me leave you with one fact ... if you provide the tarballs created by running getsadata at the time you open the PMR, you will enable the TSAMP Support team to provide you with a root cause analysis much more quickly than if you wait for us to perform an initial contact to request the data.
The Open Program for IBM Tivoli System Automation has started.
We are developing the next version of our products:
Tivoli System Automation Application Manager (SA AppMan)
Tivoli System Automation for Multiplatforms (SA MP)
In the Open Program, we are going to show you selected features and improvements that we intend to include in future releases of SA AppMan and SA MP. We are looking forward to discussing these features and improvements with you and receiving your feedback.
You have the chance to participate in this new development and influence it during the Open Program for IBM Tivoli System Automation.
Have a look at the Open Program for IBM Tivoli System Automation Distributed and learn more.
Automation with the Tivoli System Automation product family in virtual environments was presented at the GSE Power-Systems conference in Munich on 21.11.2011.
Virtualization technologies play an important role in datacenters – they also provide the base for the currently much-discussed “cloud” infrastructures.
There is a lot of focus on virtualization technologies for distributed server platforms like zVM, VMware, System p’s Hypervisor, SUN Solaris Zones, and others.
Of course, virtualization provides several benefits; this presentation concentrated on the aspects of availability, high availability and disaster recovery.
More than 80% of enterprises have adopted server virtualization, but only 20% of all server workload is on virtual machines
- Lack of confidence when it comes to high availability of virtual infrastructure
- Better management tools predict increase in adoption rate to 48% by 2011
Virtualized landscapes have the same high availability needs: stay in business 24x7x365. Keep in mind that failures causing service outages happen in hardware as well as in the software stack. Planned outages, which occur whenever maintenance is required, should be avoided; if possible, avoid any service interruption. In a real disaster you have to recover your business at another site, so you need to be prepared for the worst. Virtual machines, applications and data have to be available at the failover DR site.
Key points which have been addressed in the presentation:
- High Availability needs of business applications.
- A high level overview of virtualization technologies and their value for reducing downtime in planned scenarios
- Promises and limitations of virtualization technologies for true high availability
- High Availability clustering with SA has application knowledge, knows about relationships between applications and can react more intelligently in planned and unplanned outage scenarios
- System Automation for Multiplatforms and SA Application Manager as management utilities for composite business applications including virtualization toleration and exploitation.
- Cross-site Disaster Recovery Solutions with virtualized environments - control (virtual) systems, application stacks and replicated data with System Automation
- Explained in scenarios
If you are interested in learning more about these topics, please contact us.
We've created a "landing page" for a collection of support resources pertaining to the Tivoli System Automation for Multiplatforms (TSA MP) product. Think of it as the home page for your initial Support needs. Here's the direct URL :
This page is permanently linked off TSAMP's IBM Support Portal site, referred to as "Featured documents".
The landing page is actually divided into a collection of categories, each with their own home page. For example :
- Installation & Migration
- Configuration & Customization
- Operation & Maintenance
Each of these pages has a collection of technotes, whitepapers, presentations, and other useful links.
Lastly, the landing page has a set of quick links to the most common resources, such as the latest fixpacks, product guides, diagnostic data collection tool, and documents to help you with data collection strategies.
With the 220.127.116.11 release of System Automation Application Manager, the capability to manage virtual guests on zEnterprise hardware has been introduced. This allows the SA Application Manager to start and stop virtual servers hosted inside the zEnterprise Ensemble.
As this functionality is not only available via the graphical interfaces but also for the command line, our team developed some sample scripts, written in PERL, to show several possibilities of eezcs scripting. You can also browse our manuals for detailed description of the new commands connect, lsnode, and nodereq.
Description: This script will accept one virtual server name (as it is shown in the zEnterprise HMC interface) and search for the mapping to a hostname in the domains known to the e2e manager. Once found the matching server is printed to STDOUT. If you want to adapt this script for later use, you should probably use it as a starting point for a general use method and instead return the name of the hostname to the caller.
Idea: If you want to stop a virtual server, having its hostname available lets you exclude that node before stopping it.
Description: This script will accept one hypervisor name (as it is shown in the zEnterprise HMC interface) and search for all mappings to hostnames in the domains known to the e2e manager. At the end all matches are printed to STDOUT. If you want to adapt this script for later use, you should probably use it as a starting point for a general use method and instead return the names of the hostnames to the caller.
Idea: Use this script to determine all hostnames for preparation tasks prior to a hypervisor shutdown, or determine whether your applications are sufficiently spread over multiple hypervisors.
Description: This script will accept one hostname, a timeout (in seconds), and a resource name (fully qualified). It will then start to search for the hostname in the connected domains. On the first match, the node will be started in that domain. The script will wait until the node is online in the automation domain it connected to. Afterwards a resource (identified by the third parameter) will be started.
Idea: This script can be used inside a large system bringup, managing the relationship from soft- to hardware.
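The flow of that script (start the node, wait with a timeout until it is online, then start the resource) can be outlined as follows. Note that this is not the PERL sample from the package and it does not call eezcs; it is a generic sketch in Python where the three callables are hypothetical placeholders for whatever actually starts the node, checks its state, and starts the resource.

```python
import time


def start_node_then_resource(start_node, node_is_online, start_resource,
                             timeout_seconds, poll_seconds=5.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Start a node, wait until it is online, then start a resource.

    Returns True if the resource start was issued, or False if the node
    did not come online before the timeout expired.
    """
    start_node()
    deadline = clock() + timeout_seconds
    while not node_is_online():
        if clock() >= deadline:
            return False          # give up, the resource is never started
        sleep(poll_seconds)
    start_resource()
    return True


if __name__ == "__main__":
    # Stand-ins that pretend the node is online on the third check.
    checks = iter([False, False, True])
    started = start_node_then_resource(
        start_node=lambda: print("starting node"),
        node_is_online=lambda: next(checks),
        start_resource=lambda: print("starting resource"),
        timeout_seconds=60,
        poll_seconds=0.1,
    )
    print("resource start issued:", started)
```

The resource start is only issued after the node is confirmed online, which is the soft- to hardware dependency the script is meant to respect.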
Description: This script accepts one e2e resource reference or group name. It will start all servers necessary for this e2e resource, wait until they are operational and then start the e2e resource. This script uses a fixed timeout for each startWait operation of 10 minutes.
Idea: This script shows an example of a complete dependency, from hardware, to first-level-automation domain applications, to an end-to-end-reference.
Download the complete package from our Wiki
Feedback on any of the scripts, new ideas for scripts and questions about other tasks are very welcome in our forum
Modified on by s.we
A new paper has been released on the System Automation Application Manager WIKI:
Paper: Integrating OSLC
It contains information and examples (in Java® and PERL) on how to control End-To-End resources via REST calls. For scripting this eliminates the necessity to log on to the node hosting the SA Application Manager and does not need to start a JVM (like eezcs does).
As always we are very interested in your feedback and any nice solutions (like handy scripts) you are developing.
- Sebastian Wegmann
Less than two weeks left until Pulse 2013 (March 3 - 6), the top event on Cloud, IT and Service Management, held in Las Vegas, where you can share best practices with 8,000 of your peers and hear from IBM business partners and top industry analysts on the latest trends and hottest IT topics.
Part of the Cloud and IT Optimization stream (http://www.ibm.com/software/tivoli/pulse/agenda/cloud.html#content) will be the Automated Operations Technical Council (AOTC).
Join this track to gain the knowledge you need to protect your business and IT services through end-to-end high availability, automation, and disaster recovery solutions. The AOTC continues its long term focus on the IBM Tivoli® System Automation family across distributed and z environments. This highly technical track is ideal for automation technical experts, IT project managers, and IT operations specialists who want to maximize the value of their system automation portfolio.
Developers will give updates on the latest developments, and practitioners, customers and users will talk about their experience. For example, there will be customer sessions on 'Advanced SAP Automation using Tivoli System Automation' and 'High Availability and Cross Platform Automation of a Complex Banking Applications'.
This IBM support webpage (http://www.ibm.com/support/docview.wss?uid=swg27024950&myns=swgtiv&mynp=OCSSRM2X&mync=R) features the documents most frequently requested by our customers, as well as other information identified by Support as valuable in helping answer questions related to IBM Tivoli System Automation for Multiplatforms (TSAMP). A must read if you run System Automation on AIX, Linux, Solaris or Windows.
Downloads and Updates
such as 'Checking and Adjusting Heartbeat sensitivity'
Presentations from Support
Whitepapers from Support
such as 'Rolling Upgrade Procedure for a TSAMP Automated HADR Environment - Oct 2011.pdf'
Learning more about TSAMP
IBM Software Support Resources