An Overview of Clustering

What Is Clustering?

Clustering is an advanced feature that substantially extends the reliability, availability, and scalability of Integration Server. Clustering accomplishes this by providing the infrastructure and tools to use multiple Integration Servers as if they were a single virtual server and to deliver applications that leverage that architecture.

  • Scalability. Without clustering, only vertical scalability is possible. That is, increased capacity requirements can only be met by deploying on larger, more powerful machines, typically housing multiple CPUs. Clustering provides horizontal scalability, which allows virtually limitless expansion of capacity by simply adding more machines of the same or similar capacity.
  • Availability. Without clustering, even with expensive fault-tolerant systems a failure of the system (hardware, Java runtime, or software) may result in unacceptable downtime. Clustering provides virtually uninterrupted availability by deploying applications on multiple Integration Servers; in the worst case, a server failure produces degraded but not disrupted service.
  • Reliability. Unlike a server farm (that is, an independent set of servers), clustering provides the reliability required for mission-critical applications. Distributed applications must address network, hardware, and software errors that might produce duplicate (or failed) transactions. Clustering makes it possible to deliver "exactly once" execution as well as checkpoint restart functionality for critical operations.

Clustering can be stateful or stateless, as described later in this chapter. The diagram below shows stateful clustering in its simplest form.

Important: Do not perform development work in a clustered environment. Basic namespace locking and local service development are not supported across clustered Integration Servers.

Other webMethods products support the use of clustering to synchronize requests, transactions, notifications, and so on across Integration Server instances in the cluster. For more information about whether a particular product supports clustering and how to configure that product for a clustered environment, see the product documentation.

What Data Is Shared in a Cluster?

Data storage depends on whether the cluster is stateful or stateless.

In both types of clusters, Integration Server stores data from cluster nodes (for example, service results, scheduled tasks, audit data, and documents) in shared database components. For information about database components, see Installing Software AG Products.

In a stateful cluster, the session state for clients connected to cluster nodes is stored in a distributed cache in a Terracotta Server Array. The cache makes state available to all servers in the cluster, which enables multi-step transactions in a conversation to begin on one node in the cluster and continue on other nodes. For example, in the Cross-site Request Forgery (CSRF) Guard feature, a token is created locally on a server and saved in the distributed Terracotta Server Array cache, which makes the token available to all other cluster nodes.

In a stateless cluster, the session state of a client is not stored in a distributed Terracotta Server Array cache; state is only available locally. Stateless clusters do not support all Integration Server-related features and components (see the table below). Stateless clusters require less hardware and software, and less work is needed for setup and maintenance. However, stateless clusters cannot support multi-step transactions on multiple cluster nodes because state is not available across nodes.

Integration Server Component or Feature | Supported by Stateless Cluster?
Cross-site request forgery (CSRF) Guard | No
Exactly once for document history | Yes
Failover support | No
File polling | Yes
Guaranteed delivery | Yes
OAuth authorization server | Yes
OpenID Connect | No
Scheduler | Yes, target node of Any or a specific server; All option not supported
Service results caching | Yes
Business Rules | Yes
CloudStreams | Yes
Dynamic Business Orchestrator | No
Process Engine and Monitor operation | Yes
Interaction between Process Engine and Trading Networks | Yes
Trading Networks | Yes
Adapters, including notifications produced by Integration Server scheduler services | Yes
1SYNC, AS4, ebXML, EDI, EDIINT, SWIFT eStandards Modules | Yes
Chem, FIX, HIPAA, Papinet, Rosettanet eStandards Modules | No

Load Balancing in a Stateful or Stateless Cluster

Load balancing is an optimizing feature you use with clustered Integration Servers. Load balancing controls how requests are distributed to the servers in the cluster. A load balancer can be useful if you need load balancing for multiple types of servers, for example, web servers and application servers, in addition to Integration Servers. Load balancers also offer virtual IP support, but they are not provided out of the box. Most load balancers perform load balancing in a round-robin manner or based on network-level metrics such as TCP connections and network response time.

You can always use load balancing with stateful clusters. You can use load balancing with stateless clusters if you are not using any of the unsupported features or components listed in the table above.

Failover Support for Stateful Clusters

Failover support enables recovery from system failures that occur during processing, making your applications more robust. For example, by having more than one Integration Server, you protect your application from failure in case one of the servers becomes unavailable. If the Integration Server to which the client is connected fails, the client automatically reconnects to another Integration Server in the cluster.

Note: Integration Server clustering provides failover capabilities to clients that implement the webMethods Context and TContext classes. Integration Server does not provide failover capabilities when a generic HTTP client, such as a web browser, is used.

You can use failover support with stateful clusters.

Reliability

The guaranteed delivery and checkpoint restart features improve reliability.

Guaranteed Delivery for Stateful or Stateless Clusters

Guaranteed delivery ensures that a service executes once and only once. It is particularly useful when used with clustering to prevent a restarted service from running on more than one server. This feature is only for use with server-to-server communications. You can use guaranteed delivery with both stateful and stateless clusters.

Guaranteed delivery ensures one-time execution of services by guaranteeing the following:

  • Requests from the client to execute services are delivered to the server.
  • Services are executed once, and only once.
  • Responses from the execution of services are delivered to the client.

When not clustering, guaranteed delivery makes sure a client resubmits a service request to an Integration Server until it succeeds and a response is returned, and makes sure the service executes only once. For example, if the network connection between the client and the Integration Server fails after the service executes but before the response is successfully returned to the client, the service might otherwise be executed twice when the client resubmits the request. Guaranteed delivery prevents this from happening.

With clustering, guaranteed delivery makes sure that if the server on which the service is running becomes unavailable, the client retries the service on another server in the cluster until it succeeds and a response is returned. As in an unclustered environment, guaranteed delivery prevents a service from executing more than once.

For more information about guaranteed delivery, see Configuring the Server for Guaranteed Delivery and the Guaranteed Delivery Developer’s Guide.

Exactly Once for Document History

Exactly-once processing is a facility that ensures one-time processing of a guaranteed document by a webMethods messaging trigger; the trigger does not process duplicates of the document. You can use exactly-once document processing with both stateful and stateless clusters.

Checkpoint Restart for Stateful and Stateless Clusters

You can use the pub.storage or pub.cache services in your flow services to store state information and other pertinent information in the short-term data store. If a flow service fails because a server becomes unavailable, the flow service can be restarted from the last checkpoint rather than from the beginning. For more information about the pub.storage and pub.cache services, see the webMethods Integration Server Built-In Services Reference. You can use both checkpoint restart services with stateful clusters. For stateless clusters, use the pub.storage services for checkpoint restart.
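
As an illustration, a Java service could record and read a checkpoint through the pub.storage built-in services along the following lines. This is a minimal sketch only: the store name, key, and pipeline field names are assumptions, so confirm the exact inputs and outputs of pub.storage:put and pub.storage:get in the webMethods Integration Server Built-In Services Reference.

// Minimal checkpoint sketch using the pub.storage built-in services from Java.
// The store name "checkpoints" and the field names below are illustrative
// assumptions; verify them against the Built-In Services Reference.
import com.wm.app.b2b.server.Service;
import com.wm.data.IData;
import com.wm.data.IDataCursor;
import com.wm.data.IDataFactory;
import com.wm.data.IDataUtil;

public final class CheckpointExample {

    // Save the last completed step so a restarted flow service can resume from it.
    public static void saveCheckpoint(String orderId, String lastStep) throws Exception {
        IData input = IDataFactory.create();
        IDataCursor c = input.getCursor();
        IDataUtil.put(c, "storeName", "checkpoints"); // assumed store name
        IDataUtil.put(c, "key", orderId);
        IDataUtil.put(c, "value", lastStep);
        c.destroy();
        Service.doInvoke("pub.storage", "put", input);
    }

    // Read the checkpoint back; returns null if no checkpoint has been written yet.
    public static Object readCheckpoint(String orderId) throws Exception {
        IData input = IDataFactory.create();
        IDataCursor c = input.getCursor();
        IDataUtil.put(c, "storeName", "checkpoints");
        IDataUtil.put(c, "key", orderId);
        c.destroy();
        IData output = Service.doInvoke("pub.storage", "get", input);
        IDataCursor oc = output.getCursor();
        Object value = IDataUtil.get(oc, "value");
        oc.destroy();
        return value;
    }
}

On restart, the service would first call readCheckpoint and skip any steps that completed before the failure.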

Putting It All Together

This table summarizes how clustering, guaranteed delivery, and checkpoint restart work together to provide availability, failover, and reliability in different situations.

Checkpoint restart specified | Clustering specified | Guaranteed delivery specified | If the server on which the service is running becomes unavailable | Point at which the service restarts
No | No | No | After the server becomes available, you must manually restart the service. | At the beginning
No | No | Yes | After the server becomes available, the client automatically restarts the service. The client keeps trying to run the service until it runs successfully. | At the beginning
No | Yes | No | The client automatically retries the service on another server in the cluster. If that server fails, the client retries the service on the next server in the cluster, and so on. If all attempts fail, you must manually restart the service. | At the beginning
No | Yes | Yes | The client retries the service on the next server in the cluster. If that server fails, the client retries the service on the next server in the cluster, and so on. The client continues to try running the service until it runs successfully. | At the beginning
Yes | No | No | When the server becomes available again, you must manually restart the service. | Flow services: at the specified checkpoint. Other services: at the beginning
Yes | No | Yes | When the server becomes available again, the client automatically retries the service. The client continues trying to run the service until it completes successfully. | Flow services: at the specified checkpoint. Other services: at the beginning
Yes | Yes | No | The client automatically retries the service on the next server in the cluster. If that server fails, the client retries the service on the next server in the cluster, and so on. If attempts on all servers fail, you must restart the service manually. | Flow services: at the specified checkpoint. Other services: at the beginning
Yes | Yes | Yes | The client automatically retries the service on the next server in the cluster. If that server fails, the client retries the service on the next server in the cluster, and so on. The client keeps trying until the service runs successfully. | Flow services: at the specified checkpoint. Other services: at the beginning

Integration Server Session Objects for Stateful Clusters

Session objects are maintained for stateful clusters.

Clustered Integration Servers create and maintain the session objects that are stored in the distributed Terracotta Server Array cache.

In a non-clustered environment, Integration Server maintains session information in its own local memory. In a cluster, however, Integration Server creates a session in the distributed (or shared) cache, so that when a load balancer redirects a client for failover, the new server can access the session information.

The Integration Server that initially receives a request from a client creates the session object in the cache. Other Integration Servers in the cluster can access the session object to access and update session information as necessary.

When you configure an Integration Server to use clustering, you specify a setting that indicates how long inactive session objects are maintained in the cache. Periodically, each Integration Server in the cluster checks the session objects in the cache to determine if any have expired, and if so, removes them.

Usage Notes for Session Storage

  • Only objects that are serializable can be successfully stored in the session when Integration Server is running in a cluster. That is, those objects must implement the java.io.Serializable interface or one of the com.wm.util.coder interfaces such as com.wm.util.coder.IDataCodable. In a cluster, a session is serialized to the shared session store. When the session is restored from the shared store to an Integration Server, the complete state of the session, including any application data that had been saved to it, is available. If the session contains objects that are not serializable, those objects are converted into strings that hold the objects' class names. The actual state of those objects is lost.

    The requirement that these objects be serializable applies to the entire object graph, including the object placed into the session and every object it contains, no matter how deeply nested.

    Although your production application can run in a cluster, you will be developing it on a stand-alone Integration Server (clustered development is not supported). It is important to be aware of the serializable requirement so that you do not encounter problems with your session data once you start to test in a cluster (see the sketch after this list).

  • The processing speed of a cluster is determined in large part by network I/O. Adding application data to the session state will increase the amount of I/O the cluster must perform, and make it operate more slowly. The addition of a single large or complex object to each session can have a noticeable effect on the overall throughput of a cluster.

    If you are concerned that saving application data to the session might impact the performance of your Integration Server cluster, consider other ways of saving this data. If it does not have to persist across server restarts and does not have to be shared throughout the cluster, a simple Java collection such as a Vector or HashMap might be appropriate. If the data needs to survive server restarts but does not have to be shared throughout the cluster, writing it to the local file system is an option. If the data needs to be shared throughout the cluster, consider saving it in a database.

  • Even though sessions are created and maintained in a distributed cache, each Integration Server keeps a portion of the cache locally. The local cache stores information relating to the sessions that are active on Integration Server as well as a list of nodes in the cluster. If you anticipate that your Integration Server sessions will use a large portion of the cache, you should increase the Maximum Elements In Memory or Maximum Off-Heap settings for each Integration Server in the cluster. For information about changing these settings, see Working with Caches.
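
To make the serializable requirement concrete, the following purely illustrative Java class (hypothetical names, not part of the product API) could be stored safely in a clustered session because everything it contains implements java.io.Serializable; the commented-out field shows the kind of member that would break the requirement.

// Illustrative only: hypothetical class and field names.
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Safe to store in a clustered session: the class and every object it contains
// are serializable, so the full state survives being restored on another node.
public class OrderContext implements Serializable {
    private static final long serialVersionUID = 1L;

    private String orderId;
    private List<String> completedSteps = new ArrayList<>(); // ArrayList and String are serializable

    // NOT safe: java.sql.Connection is not serializable, so adding a field like
    // this would violate the requirement that the entire object graph placed in
    // the session be serializable.
    // private java.sql.Connection connection;
}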

Scheduled Jobs for Stateful or Stateless Clusters

In a stateful cluster, you can schedule jobs to run on one, any, or all Integration Servers in the cluster. For jobs to run in the cluster, the server must be enabled for clustering and existing jobs must be flagged to run in the cluster.

In a stateless cluster, you can schedule jobs to run on one or any Integration Server in the cluster, but not on all servers.

Integration Server stores information about scheduled jobs in the ISInternal database component.

For instructions, see the chapter about managing services in About Services.

Client Applications

Server clustering is almost transparent to the client. A client can issue requests to a server that is in a clustered environment in the same way it issues a request to a server that is not in a clustered environment. Integration Server clustering provides failover capabilities only to HTTP-based webMethods clients, such as those clients built using the webMethods Context and TContext classes.

When a client connects to a server in a cluster by calling the Context or TContext class, the server returns information about the other servers in the cluster to the client. To improve failover capability, before your client calls Context or TContext, have your client issue the setRetryServer method in that class to specify another server to try in case the first server the client tries to connect to is unavailable.

You can use setRetryServer in a stateful cluster, and in a stateless cluster if your application does not require features that are not supported by stateless clusters. If a request is not processed, the client can use the returned server information to connect to another server in the cluster to have the request fulfilled. In a stateful cluster, the server returns a list of cluster members to the client, which can then automatically switch to other servers if the one it is connected to becomes unresponsive.

Note: You can use the setAllowRedir method in the Context class on each client to specify whether the client should connect to other servers in the cluster after a connectivity failure.

No special processing is required in your clients.
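
To tie this together, here is a minimal sketch of a Java client that names a retry server before connecting and then invokes a service. The host names, credentials, and service name are placeholders, and the method signatures should be confirmed against the Integration Server Java API documentation.

// Sketch of a webMethods Java client with failover, using placeholder hosts,
// credentials, and service names.
import com.wm.app.b2b.client.Context;
import com.wm.data.IData;
import com.wm.data.IDataFactory;

public class ClusterClientExample {
    public static void main(String[] args) throws Exception {
        Context context = new Context();

        // Name a fallback server before connecting, so the client can fail over
        // if the first server is unavailable (placeholder host and port).
        context.setRetryServer("is-node2.example.com:5555");

        // Connect to the primary cluster node (placeholder host and credentials).
        context.connect("is-node1.example.com:5555", "Administrator", "manage");
        try {
            // Invoke a service; the folder and service names are placeholders.
            IData input = IDataFactory.create();
            IData output = context.invoke("my.folder", "myService", input);
            System.out.println(output);
        } finally {
            context.disconnect();
        }
    }
}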

Using Remote Integration Servers in a Stateful or Stateless Cluster

An Integration Server can be configured to connect to a remote server for a number of reasons, including:

  • Allow clients to run services on other Integration Servers using the pub.remote:invoke service and the pub.remote.gd:* services.
  • Connect publisher and subscriber Integration Servers to each other for the purpose of package replication.
  • Facilitate the process of presenting different certificates to different Integration Servers.

If you want to specify a backup server in case the remote server is not available, specify that backup server as the retry server on the remote server's alias definition.

In a stateful cluster, use the $clusterRetry parameter of the pub.remote:invoke service to control what happens when Integration Server tries to connect to the remote server. If you want the Integration Server to first try other servers in the cluster when the remote server is not available, and then go to the retry server if no cluster servers are available, set $clusterRetry to true. If you do not want the Integration Server to first try other servers in the cluster but rather to go directly to the retry server, set the parameter to false. For more information about remote servers, see Setting Up Aliases for Remote Integration Servers. For more information about the pub.remote:invoke service, see pub.remote:invoke.
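
For illustration, a Java service might call pub.remote:invoke with $clusterRetry set explicitly, along these lines. The alias name and target service are placeholders, and the complete list of pub.remote:invoke inputs is in the built-in services documentation.

// Illustrative sketch: run a service on a remote Integration Server and control
// cluster retry behavior. "remoteIS" and the target service name are placeholders.
import com.wm.app.b2b.server.Service;
import com.wm.data.IData;
import com.wm.data.IDataCursor;
import com.wm.data.IDataFactory;
import com.wm.data.IDataUtil;

public final class RemoteInvokeExample {
    public static IData invokeOnRemoteCluster() throws Exception {
        IData input = IDataFactory.create();
        IDataCursor c = input.getCursor();
        IDataUtil.put(c, "$alias", "remoteIS");              // remote server alias (placeholder)
        IDataUtil.put(c, "$service", "my.folder:myService"); // service to run remotely (placeholder)
        // "true": try other servers in the remote stateful cluster first, then the
        // retry server; "false": go directly to the retry server.
        IDataUtil.put(c, "$clusterRetry", "true");
        c.destroy();
        return Service.doInvoke("pub.remote", "invoke", input);
    }
}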

Stateless clusters do not support the $clusterRetry parameter; in this scenario, the Integration Server will automatically connect to the retry server.