There are two primary technological approaches to configuring resiliency, or high availability, for the Tivoli monitoring platform components. One approach uses common, commercial high availability cluster manager software, such as High-Availability Cluster Multi-Processing (HACMP) and System Automation for Multiplatforms (SA-MP) from IBM, or Microsoft Cluster Server (MSCS). An alternative approach, called Hot Standby in the Tivoli publications, may be applicable for some sites and makes the hub monitoring server resilient to specific failure scenarios. The two approaches provide different resiliency and failover characteristics; choose between them based on your requirements.
The first approach requires a high availability cluster manager such as IBM's SA-MP or HACMP, or Microsoft's MSCS. Some operating systems and platforms include an entitlement to cluster management software. With this approach, you can configure all of the components of the monitoring platform for resiliency in the case of component failure. IBM has produced white papers describing specific resiliency configurations that use common commercial HA solutions. In addition, because users deploy a wide range of customized HA solutions, a general description of the Tivoli monitoring platform's resiliency requirements is also provided, which should suffice for setting up any typical HA configuration. For more information about using IBM Tivoli Monitoring in a clustered environment, search for "Clustering IBM Tivoli Monitoring" at the IBM Tivoli Open Process Automation Library (OPAL) Web site.
For users primarily concerned with the availability of the hub monitoring server, the monitoring platform provides the built-in Hot Standby option. This solution replicates selected state information between the hub monitoring server and a secondary hub monitoring server that runs in a listening standby mode, heartbeating the active hub and keeping current with much of the hub's environment information. In an appropriately configured environment, the secondary hub monitoring server takes over as the acting hub if the primary hub fails. This solution operates without shared or replicated persistent storage between the two monitoring server computers and does not require cluster manager software. However, it addresses only the hub monitoring server component of the monitoring platform, and is therefore suited to customers without stringent resiliency requirements for the other components of the monitoring platform. See Table 8 for the failover options available for each monitoring component and Table 9 for the resiliency characteristics of each option.
Component | Potential single point of failure? | Cluster failover available? | Hot Standby failover available? |
---|---|---|---|
Hub monitoring server | Yes | Yes | Yes |
Portal server | Yes | Yes | No |
Tivoli Data Warehouse database | Yes | Yes | No |
Warehouse proxy | Yes, if there is a single Warehouse Proxy in the environment. | Yes | No |
Summarization and Pruning agent | Yes | Yes | No |
Remote monitoring server | No. Another monitoring server can assume the role of a remote monitoring server for connected agents. This is known as "agent failover". | N/A | N/A |
Agent | Not a single point of failure for the whole monitoring solution, but a specific point of failure for the specific resource being monitored. | Yes | No |
Component/Feature | Characteristics on hub cluster failover | Characteristics on hub Hot Standby failover |
---|---|---|
Hub monitoring server | The hub monitoring server is restarted as soon as the cluster manager detects the failure. | A communication failure between the hubs causes the standby hub to begin processing to establish itself as master. |
Portal server | The portal server reconnects to the hub as soon as the hub is restarted. | The portal server must be reconfigured to point to the new hub. |
Tivoli Data Warehouse database | No relationship to the hub. | No relationship to the hub. |
Warehouse proxy | As an agent, the Warehouse Proxy reconnects to its hub and continues to export data from agents to the Tivoli Data Warehouse. | As an agent configured with a secondary hub connection, the Warehouse Proxy connects to its secondary hub and continues to export data from agents to the Tivoli Data Warehouse. |
Summarization and Pruning agent | As an agent, the Summarization and Pruning agent reconnects to its hub and continues to summarize and prune data from the Tivoli Data Warehouse. | As an agent configured with a secondary hub connection, the Summarization and Pruning agent connects to its secondary hub and continues to summarize and prune data from the Tivoli Data Warehouse. |
Remote monitoring server | The remote monitoring server detects the hub restart and tries to reconnect, synchronizing with the hub. | When configured with a secondary hub connection, the remote monitoring server retries the connection to the primary hub and, if unsuccessful, tries to connect to the secondary hub. When the new hub has been promoted to master, the remote monitoring server detects the hub restart and tries to reconnect, synchronizing with the hub. |
Agent | All agents directly connected to the hub reconnect to it after the restart and begin synchronization. | When configured with a secondary hub connection, agents directly connected to the hub perceive the loss of connection and retry. With the first hub down, the agent connects to the second hub and begins synchronization, which includes restarting all situations. |
Event data | Agents resample all polled situation conditions and reassert all that are still true. Situation history is preserved. | Agents resample all polled situation conditions and reassert all that are still true. Previous situation history is not replicated to the mirror and is therefore lost. To persist historical event data, use OMNIbus or the Tivoli Enterprise Console. |
Hub failback | Available through cluster manager administration and configuration. | The secondary hub must be stopped so that the primary hub can become master again. |
Time for failover | The detection of a failed hub and its restart is quick and can be configured through the cluster manager. The synchronization process, until all situations are restarted and the whole environment is operational, depends on the size of the environment, including the number of agents and distributed situations. | The detection of a failed hub is quick. There is no restart of the hub, but the connection of remote monitoring servers and agents to the standby hub requires at least one more heartbeat interval, because they try the primary hub before trying the secondary. The synchronization process, until all situations are restarted and the whole environment is operational, depends on the size of the environment, including the number of agents and distributed situations. |
z/OS environments | The z/OS hub clustered solution has not yet been tested and therefore is not a supported configuration. Remote monitoring servers on z/OS are supported. | Hot Standby is not supported for a hub on z/OS. Hot Standby has not been tested with remote monitoring servers on z/OS and therefore is not a supported configuration. |
Data available on failover hub | All data is shared through disk or replication. | All EIB data, with some exceptions, is replicated through the mirror synchronization process. |
Manageability of failover | Failover can be automatic or directed through cluster administration. You control which hub is currently the "master" and the current state of the cluster. | Failover can be directed by stopping the hub. Note that starting order controls which hub is the "master". |
Figure 7 depicts an environment with the Hot Standby feature configured.
The following sections provide an overview of how each component in the IBM Tivoli Monitoring environment is configured to enable the Hot Standby feature.
In a Hot Standby environment, there are two hub monitoring servers. The configuration of each hub designates the other hub as the Hot Standby hub. At any given time, one of the two hub monitoring servers is operating as the hub. This server is referred to as the Acting Hub. The other hub monitoring server is in standby mode and is referred to as the Standby Hub. (See Figure 7.)
When the two hub monitoring servers are running, they continuously synchronize their Enterprise Information Base (EIB) data, enabling the Standby Hub to take over the role of the Acting Hub in case the Acting Hub becomes unavailable. The EIB contains definition objects such as situations and policies, information about managed systems, and information about the distribution or assignment of situations and policies to managed systems.
The two hub monitoring servers are symmetrical, but for reasons that will become clear later, one is designated as the Primary Hub and the other as the Secondary Hub. Although it is not required, you can designate as the Primary Hub the server that you expect to be the Acting Hub most of the time.
Note that the terms Acting and Standby refer to an operational state, which can change over a period of time. The terms Primary and Secondary refer to configuration, which is relatively permanent.
All remote monitoring servers must be configured to operate in the Hot Standby environment. When you configure each remote monitoring server, you specify the Primary and Secondary hub monitoring servers to which the remote monitoring server reports.
It is important that you specify the same Primary and Secondary hub monitoring servers for each remote monitoring server. In Figure 7, the connections from the remote monitoring servers to the Primary Hub are depicted with solid arrows. The connections to the Standby Hub are depicted with dashed arrows.
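As a hedged illustration, on Linux or UNIX the hub connection for a remote monitoring server is supplied when the server is configured; paths, server names, and prompt wording below are placeholders and vary by release:

```shell
# Hedged sketch: configuring a remote monitoring server named REMOTE_1 on
# Linux/UNIX. The configuration dialog prompts for the hub monitoring
# server; give the primary hub there, and the secondary hub where a
# standby/mirror entry is offered. The install path is a placeholder.
cd /opt/IBM/ITM/bin
./itmcmd config -S -t REMOTE_1
```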
Monitoring agents that report directly to the hub monitoring server, as well as the Warehouse Proxy agent and the Summarization and Pruning agent, must be configured to operate in the Hot Standby environment. When you configure each of these agents, you specify the Primary and Secondary hub monitoring servers to which the agents report.
In Figure 7, the connection between these monitoring agents and the Primary Hub is depicted with a solid arrow. The connection to the Standby Hub is depicted with a dashed arrow.
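The primary and secondary hub assignment for these directly connected agents ultimately appears in the agent's environment settings. A minimal, illustrative fragment follows; the host names are placeholders, and the exact variable usage varies by platform and release:

```shell
# Illustrative agent environment fragment (not a complete file).
# CT_CMSLIST names the monitoring servers to try in order: the primary
# hub first, then the Hot Standby secondary. Host names are placeholders.
CT_CMSLIST=IP.PIPE:#hub-primary.example.com;IP.PIPE:#hub-secondary.example.com
```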
The Tivoli Enterprise Portal Server cannot be configured to fail over to a Standby Hub. When the Standby Hub takes over as the Acting Hub, the portal server needs to be reconfigured to point to the Acting Hub. Portal clients do not need to be reconfigured. They automatically reconnect to the portal server when it restarts after it is reconfigured.
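On Linux or UNIX, repointing the portal server at the new Acting Hub is typically a stop, reconfigure, and restart sequence. The following is a hedged sketch only; the install path is a placeholder, and prompts vary by release (cq is the portal server product code):

```shell
# Hedged sketch: reconfigure the portal server after a Hot Standby takeover.
# The config step prompts for the monitoring server host name; supply the
# new Acting Hub.
cd /opt/IBM/ITM/bin
./itmcmd stop cq        # stop the portal server
./itmcmd config -A cq   # reconfigure the portal server connection
./itmcmd start cq       # restart; portal clients reconnect automatically
```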
The Hot Standby operation is best illustrated by describing a scenario. In this example, all monitoring components are started in order, a scenario that might take place only when the product is initially installed. After installation, components can be started and stopped independently.
The following list describes the order of startup and what happens as each component is started.
04/17/07 09:38:54 KQM0001 FTO started at 04/17/07 09:38:54.
04/17/07 09:47:04 KQM0001 FTO started at 04/17/07 09:47:04.
04/17/07 09:47:18 KQM0003 FTO connected to IP.PIPE:#9.52.104.155 at 04/17/07 09:47:18.
04/17/07 09:47:33 KQM0009 FTO promoted HUB_PRIMARY as the acting HUB.
04/17/07 09:45:50 KQM0003 FTO connected to IP.PIPE:#9.52.104.155 at 04/17/07 09:45:50.
04/17/07 09:45:58 KQM0009 FTO promoted HUB_PRIMARY as the acting HUB.
When these components start, they attempt to connect to the Primary Hub in their configuration. In this scenario, the Primary Hub is also the current Acting Hub. Therefore, the connection attempt is successful, and these components start reporting to the Primary Hub.
The portal server is configured to connect to the Primary Hub. One or more portal clients are connected to the portal server for monitoring purposes.
The Acting Hub might become unavailable for a number of reasons. It might need to be shut down for scheduled maintenance, the computer on which it is running might need to be shut down or might have crashed, or it could be experiencing networking problems.
When the Standby Hub discovers that the Acting Hub is unavailable, it takes over the role of the Acting Hub and issues the following messages:
04/17/07 10:46:40 KQM0004 FTO detected lost parent connection at 04/17/07 10:46:40.
04/17/07 10:46:40 KQM0009 FTO promoted HUB_SECONDARY as the acting HUB.
The Primary Hub is now the Standby Hub and the Secondary Hub is the Acting Hub, as depicted in Figure 8:
As the remote monitoring servers and agents connected to the previous Acting Hub discover that this hub is no longer available, they switch and reconnect to the new Acting Hub. Because these components are in various states of processing and communication with the hub monitoring server, their discovery of the failure and reconnection to the new hub are not synchronized.
The Tivoli Enterprise Portal Server must be reconfigured to point to the new Acting Hub and then restarted. All portal clients reconnect to the portal server after it is restarted.
The processing that takes place after re-connection is similar to the processing that takes place after re-connection in an environment without a Hot Standby server. The following processing takes place with regard to situations and policies:
The new Acting Hub retains its role even after the other hub monitoring server becomes operational again. The other hub monitoring server now becomes the Standby Hub. When the new Standby Hub starts, it checks the EIB of the new Acting Hub for updates, and replicates the updates to its own EIB if necessary. The two hubs also start monitoring connections with each other to ensure that the other hub is running.
All remote monitoring servers and agents now report to the new Acting Hub. There is no mechanism available to switch them back to the Standby Hub while the Acting Hub is still running. The only way to switch them to the Standby Hub is to shut down the Acting Hub.
If a remote monitoring server or agent experiences a transient communication problem with the Acting Hub, and switches over to the Standby Hub, the Standby Hub instructs it to retry the connection with the Acting Hub, because the Standby Hub knows that the Acting Hub is still available.
The environment continues to operate with the configuration shown in Figure 8 until there is a need to shut down the Acting Hub, or until the computer on which the Acting Hub is running becomes unavailable. Each time the Acting Hub becomes unavailable, the failover scenario described in this section is repeated.
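The primary-then-secondary connection order that remote monitoring servers and agents follow can be sketched abstractly. This is illustrative logic only, not ITM code; the hub names and the UP_HUBS variable are stand-ins for real reachability:

```shell
#!/bin/sh
# Illustrative logic only (not ITM code): components try the primary hub
# first and fall back to the secondary hub only when the primary cannot
# be reached. UP_HUBS simulates which hubs are currently reachable.
try_hub() {
  case " $UP_HUBS " in
    *" $1 "*) echo "connected to $1"; return 0 ;;
    *)        return 1 ;;
  esac
}

connect() {
  try_hub "HUB_PRIMARY" || try_hub "HUB_SECONDARY" || echo "no hub available"
}

UP_HUBS="HUB_PRIMARY HUB_SECONDARY"
connect    # both hubs up: the primary is chosen

UP_HUBS="HUB_SECONDARY"
connect    # primary down: fall back to the secondary
```

Note that the fallback attempt is what costs the extra heartbeat interval described earlier: the primary is always tried first.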
This section provides an overview of Tivoli Monitoring high availability when using clustering technologies. It explains clustering concepts, describes the supported Tivoli Monitoring cluster configurations, provides an overview of the setup steps, and describes the expected behavior of the components running in a clustered environment.
For more information about using IBM Tivoli Monitoring in a clustered environment, search for "Clustering IBM Tivoli Monitoring" at the IBM Tivoli Open Process Automation Library (OPAL) Web site: http://catalog.lotus.com/wps/portal/topal/. Detailed instructions on how to set up Tivoli Monitoring components on different cluster managers are provided in separate papers, initially including: Microsoft® Cluster Server, Tivoli System Automation Multiplatform and HACMP (High-Availability Cluster Multi-Processing).
Review the following concepts to enhance your understanding of clustering technology:
Although clusters can be used in configurations other than basic failover (for example, load sharing and balancing), the current Tivoli Monitoring design does not support multiple, concurrent instances of the monitoring components.
Consider the following three clustering configurations:
IBM DB2 was the database used in all of the configuration tests. Other Tivoli Monitoring-supported databases can also be clustered by following the specific database cluster setup procedures. All the clustering configurations include at least one agent directly reporting to the hub. For simplicity, this is not shown in the configuration diagrams that follow.
Configuration A
This configuration has a hub cluster, portal server cluster, and data warehouse cluster (including the Summarization and Pruning agent and the Warehouse Proxy), with multiple remote monitoring servers (RT1, RT2, RT3) and agents, and Tivoli Enterprise Console integration.
Configuration B
In this configuration, the Warehouse Proxy and Summarization and Pruning agents are running outside of the data warehouse cluster.
Configuration C
In this configuration, all main components are clustered in a single cluster. The configuration also includes one agent running in the clustered environment and reporting directly to the hub; for simplicity, this agent is not shown in the diagram.
This configuration is a degenerate case of Configuration A. It is important to ensure that the computers used for such an environment have enough capacity to handle all the components. The behavior of the clustered components is the same in this case, but the setup of Tivoli Monitoring in such an environment has some differences.
This section describes the overall procedure for setting up Tivoli Monitoring components on clustered environments.
A basic cluster setup includes:
The Tivoli Monitoring requirements for the cluster server are:
Table 10 shows the high-level steps required to set up the main Tivoli Monitoring components running in a cluster.
Step | Hub monitoring server | Portal server | Data warehouse |
---|---|---|---|
1 | | Install the database software. Configure the database software for the cluster. | Install the database software. Configure the database software for the cluster. |
2 | Create the cluster. Define the virtual IP and shared persistent storage for the resource group. | Create the cluster. Define the virtual IP, shared persistent storage, and database for the resource group. | Create the cluster. Define the virtual IP, shared persistent storage, and database for the resource group. |
3 | Install and set up the monitoring server on the first node of the cluster. Set up the monitoring server on the second node of the cluster. | Install and set up the portal server on the first node of the cluster. Set up the portal server on the second node of the cluster. | Optional: Install and set up the Summarization and Pruning agent and Warehouse Proxy on the first node of the cluster. Set up the Summarization and Pruning agent and Warehouse Proxy on the second node of the cluster. |
4 | Add the monitoring server as a resource to the resource group. | Add the portal server as a resource to the resource group. | Optional: Add the Summarization and Pruning agent and Warehouse Proxy as resources to the resource group. |
Some important characteristics of the Tivoli Monitoring setup on a cluster include:
The monitoring server cluster resources include:
The generic setup of the hub monitoring server on a cluster includes the following task:
Note that the specific setup of the monitoring server varies on different platforms and cluster managers.
By default, the location broker registers all available network interfaces for a server. In a cluster, an internal IP address is typically configured on each node for inter-node communication, but these internal (private) IP addresses should not be published for use by other servers or agents. For example, if the Tivoli Enterprise Monitoring Server is installed on an AIX server with both a public and a private interface, the Tivoli Enterprise Portal Server cannot connect to the Tivoli Enterprise Monitoring Server. Two environment variables control which interfaces are published: KDEB_INTERFACELIST for IPv4 and KDEB_INTERFACELIST_IPV6 for IPv6. In either address family, you can use these variables to set, restrict, or add to the interfaces in use. To avoid communication errors, remove the internal IP address from registration, as explained in the following table.
Interface control | Environment variable |
---|---|
To set specific interfaces for consideration | KDEB_INTERFACELIST=ip4addr-1 ... ip4addr-n and KDEB_INTERFACELIST_IPV6=ip6addr-1 ... ip6addr-n |
To remove interfaces from consideration | KDEB_INTERFACELIST=-ip4addr-1 ... -ip4addr-n and KDEB_INTERFACELIST_IPV6=-ip6addr-1 ... -ip6addr-n |
To add interfaces for consideration | KDEB_INTERFACELIST=+ ip4addr-1 ... ip4addr-n and KDEB_INTERFACELIST_IPV6=+ ip6addr-1 ... ip6addr-n |
Note: When adding interfaces, the plus sign must stand alone.
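For example, a hub's environment file might publish only the cluster's virtual (public) service address, or exclude the private inter-node address. The addresses below are placeholders for illustration only:

```shell
# Illustrative environment-file entries; addresses are placeholders.
# Publish only the cluster's virtual (public) IPv4 address:
KDEB_INTERFACELIST=10.10.10.10
# Alternatively, keep the defaults but remove the private inter-node address:
# KDEB_INTERFACELIST=-192.168.0.1
```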
The portal server cluster resources include the following:
The generic setup of the portal server on a cluster includes the following tasks, performed in order:
Note that the setup of the portal server varies on different platforms and cluster managers.
The data warehouse cluster resources include the following:
The generic setup of the data warehouse on a cluster involves the following tasks, performed in order:
The following steps are optional and only necessary if the Summarization and Pruning agent and the Warehouse Proxy are included in the cluster:
The specific setup of the data warehouse varies by platform and cluster manager and is described in the papers for the specific cluster managers in this series.
In general, the failover or failback of a clustered component is treated by the other Tivoli Monitoring components as a restart of the clustered element.
When the hub is clustered, its failover/failback is perceived by all the other Tivoli Monitoring components as a hub restart. This means that the other components automatically reconnect to the hub and some synchronization takes place.
As part of the remote monitoring server to hub synchronization after re-connection, all situations that are the responsibility of the remote monitoring server (distributed to the monitoring server itself or to one of its connected agents) are restarted. This restarting of situations represents the current behavior for all re-connection cases between remote monitoring servers and the hub, independent of clustered environments. See Situations for more information.
For agents directly connected to the hub, there might be periods in which situation thresholding activity on the connected agents does not occur, because the situations are stopped when a connection failure to the reporting hub is detected. As soon as the connection is reestablished, the synchronization process takes place and situations are restarted. (Note that historical metric collection is not stopped.)
The portal server, the Summarization and Pruning agent, and the Warehouse Proxy reconnect to the hub and perform any synchronization steps necessary.
The portal user's perception of the apparent hub restart depends on the size of the environment (including the number of agents and situations). Initially you are notified that the portal server has lost contact with the monitoring server, and views might be unavailable. When the portal server reconnects to the hub, the Enterprise default workspace displays, allowing access to the Navigator Physical view. However, some agents may have delays in returning online (due to the reconnection timers) and triggering polled situations again (as the situations are restarted by the agents).
While the hub failover or failback (including the startup of the new hub) may be quick (on the order of 1-3 minutes), the resynchronization of all elements to their normal state may be delayed in large-scale environments with thousands of agents. This behavior is not specific to a cluster environment, but valid any time the hub is restarted or when connections from the other Tivoli Monitoring components to the hub are lost and later re-established.
When the portal server is clustered, its failover or failback is perceived by its connected Tivoli Monitoring components as a portal server restart. The components that connect to the portal server include the portal consoles and the Summarization and Pruning agent (at the start of its summarization and pruning interval).
When a portal client loses connection to the portal server the user is notified and some views become unavailable. When the portal client re-establishes connection with the portal server, the home workspace displays and the Navigator refreshes.
If the Summarization and Pruning agent is connected to the portal server at the time of portal server failover, it loses the connection and attempts to reestablish contact on the next Summarization and Pruning agent interval.
When the data warehouse is clustered, its failover/failback is perceived by its connected Tivoli Monitoring components as a database restart. The components that connect to the data warehouse database include the Summarization and Pruning agent, the Warehouse Proxy and portal server (when retrieving long-term data collection for the portal workspace views).
When the Summarization and Pruning agent loses contact with the data warehouse database, it attempts to reestablish contact on the next Summarization and Pruning agent interval and then restart its work.
When the Warehouse Proxy loses contact with the data warehouse database, it attempts to reestablish the connection and restart its work.
When the portal server loses contact with the data warehouse, it reconnects on the next query request from a portal client.
When the data warehouse cluster resource group includes the clustered Summarization and Pruning agent, the agent fails together with the data warehouse database and must be restarted after the data warehouse. As the Summarization and Pruning agent uses transactions for its operations to the data warehouse database, it resumes its summarization and pruning work where it left off prior to failure.
When the Warehouse Proxy is part of the data warehouse cluster resource group, the proxy fails together with the data warehouse database and must be restarted after the data warehouse. The proxy then resumes the work of uploading the short-term data collection to the data warehouse.
Situations are potentially affected by failures on the hub, the remote monitoring server, and at the agent. During a hub cluster failover, situations are affected in the same way as a restart of the hub or when the hub gets disconnected from the other components.
When a remote monitoring server loses connection with the hub and then reestablishes contact, the server synchronizes after re-connection. This process involves restarting all the situations under that remote monitoring server's responsibility. Polled situations are triggered on the next polling interval, but pure events that were opened before the failure are lost. Use an event management product, such as OMNIbus or Tivoli Enterprise Console, as the focal point for storing and manipulating historical events from all event sources.
If you have agents directly connected to the hub, the situations distributed to them are stopped when connection is lost and the situations are restarted when the agent reconnects to the hub. This behavior also applies for agents that are connected to a remote monitoring server and lose connection to it.
Workflow policies can be set to run at the hub. If the hub fails over while a workflow policy is running, the processing stops and then restarts at the beginning of the workflow policy (upon restart of the hub and triggering).
When a hub failover or failback occurs, any remote monitoring servers and agents reconnect to it, causing all situations to be restarted. Short-term data collection is performed at the agents through special internal situations called UADVISOR. These situations are also restarted and collect data only after the next full interval has passed, resulting in the loss of one collection interval of data.
The Warehouse Proxy and Summarization and Pruning agents connect to the hub. After a hub failover or failback occurs, the proxy and agent reconnect to it without impacting their work on the data warehouse.
When Tivoli Monitoring is configured to forward events to the Tivoli Enterprise Console event server or a Netcool/OMNIbus Objectserver, its behavior is to send a “MASTER_RESET" event to the event server every time the hub is restarted. This behavior also occurs after failover/failback for a clustered hub. The purpose of this event is to signal to the event server that the monitoring server restarted all its situations and that the event server will receive a new set of currently triggered situations. The default event synchronization rule for Tivoli Monitoring integration addresses that event by closing all the events that came from this hub monitoring server. Although this behavior guarantees that both Tivoli Monitoring and Tivoli Enterprise Console or Netcool/OMNIbus Objectserver have consistent triggered situations and events, the behavior may not be desirable in some environments. For example, this behavior closes pure events that are not reopened until the next pure event occurs. In such cases, you can filter out the MASTER_RESET either at the hub (EIF configuration file) or at the event server by using correlation rules. In this case, the rules must handle duplicate events caused by the restarted situations.
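As an illustration of the first option, a filter entry in the hub's EIF configuration file can discard the MASTER_RESET event before it is sent. The class and attribute names below should be verified against your ITM release before use:

```shell
# Hedged sketch of an EIF configuration-file filter entry (for example,
# in om_tec.config) that drops the MASTER_RESET event at the hub.
# Verify the exact class name and attribute for your release.
Filter:Class=ITM_Generic;master_reset_flag='R'
```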
The maintenance requirements of Tivoli Monitoring (patches, fix packs, or release upgrades) when running in a cluster resemble those of an unclustered Tivoli Monitoring environment. The cluster controls the starting and stopping of Tivoli Monitoring services. To return this control to the installation procedure, you must stop the cluster while the maintenance is being performed. Also, due to current restrictions of the Tivoli Monitoring installation procedures, some settings that are completed during Tivoli Monitoring cluster setup might need to be repeated after the maintenance is completed and before the cluster is restarted.