The WebSphere Contrarian: Run time management high availability options, redux

IBM® WebSphere® Application Server Network Deployment provides for failover and recovery of application workload, but how do you provide for failover of the management workload in a Network Deployment cell? The WebSphere Contrarian explains the steps you need to take to achieve this, whether you're using WebSphere Application Server V6.x or V7.0. This content is part of the IBM WebSphere Developer Technical Journal.


Tom Alcott, Senior Technical Staff Member, IBM

Tom Alcott is a Senior Technical Staff Member (STSM) for IBM in the United States. He has been a member of the Worldwide WebSphere Technical Sales Support team since its inception in 1998. In this role, he spends most of his time trying to stay one page ahead of customers in the manual. Before he started working with WebSphere, he was a systems engineer for IBM's Transarc Lab supporting TXSeries. His background includes over 20 years of application design and development on both mainframe-based and distributed systems. He has written and presented extensively on a number of WebSphere run time issues.



27 January 2010


In each column, The WebSphere® Contrarian answers questions, provides guidance, and otherwise discusses fundamental topics related to the use of WebSphere products, often dispensing field-proven advice that contradicts prevailing wisdom.

Bringing back an old topic -- and Latin

It's been nearly seven years since I authored an article on implementing a highly available infrastructure for IBM WebSphere Application Server Network Deployment without clustering, and while the procedure I described then is still generally applicable to WebSphere Application Server Network Deployment (hereafter referred to as Network Deployment) V6.x and V7.0, there have been updates to Network Deployment that change some of the specifics, as well as provide additional options for Network Deployment run time management high availability (HA). Therefore, I thought an update on the topic would be appropriate. As a bonus, I'm always looking for a way to use the high school Latin that I was required to take many (many, many) years ago, and revisiting an earlier article presents an opportunity for me to use redux in the title! This is especially appropriate since the subject is restoring, or bringing back, the Network Deployment management run time function (my Latin instructor would be so proud!). Last, but not least, I get asked about this topic on a fairly regular basis, so from a purely selfish perspective, this discussion will provide a generally accessible resource to which I can direct future such queries. Everybody wins!

Before delving into the options and associated techniques for Network Deployment run time management HA, let's first quickly review the Network Deployment application server architecture. Network Deployment application servers are designed to be generally self sufficient from the management runtime. Each application server has its own:

  • Web container.
  • EJB container.
  • Name service.
  • Security service.
  • Transaction manager service.
  • JMS messaging engine (optional, depending on configuration).
  • JCA connection manager (which provides for JDBC and EIS connections).
  • Java™ Management Extensions (JMX) management server.
  • High availability manager service.

The Web container is actually a "converged container" that hosts components accessed via HTTP(S), such as servlets, JSPs, and portlets, plus components employing Session Initiation Protocol (SIP).

As a result of the services listed above, the failure of a node agent or deployment manager doesn't impact already running Network Deployment application servers (subject to some issues we'll discuss in a moment), which are able to continue to service application requests even in the event of a failure.

If you are experienced in Network Deployment, then you already know that the evolution from V5.x to V6.x removed most of the run time dependencies that existed in the node agent and deployment manager. Those with a sharp eye likely noticed that I stated generally self sufficient from the management runtime, which is not to be confused with completely self sufficient... since there are some functions that still reside solely in the node agent and deployment manager -- but these functions don't impact application servers already running (or the applications running on them). Of course, you're not going to let me off without a discussion of the functions that reside solely in the node agent and deployment manager, and I don't blame you. So, let's get to it.


Node agent function

While there is an option to disable WebSphere Application Server workload management and to provide your own static workload management routing definition (see the Information Center) -- which, as a side effect, permits Network Deployment application servers to start independent of the node agent -- do not use this option. The reason is that the only way to make changes, such as adding a cluster member, is to:

  1. Undo all of the static routing changes.
  2. Get the entire environment back to dynamic routing.
  3. Add the member.
  4. Take a new snapshot of the environment.
  5. Export the workload management route table (which is in a flat file).
  6. Re-enable all the properties to get back into static routing mode.

This adds considerable complexity to cell management. Be aware also that, while application servers bootstrap from the node agent's Location Service Daemon (LSD), standalone application clients should bootstrap from the name service inside each application server using Corbaloc (CORBA object URL).
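
As an illustration (not from the original article), here's a minimal sketch of how a standalone Java client might specify a corbaloc provider URL listing the name services of two application servers; the host names, bootstrap ports, classpath, and client class are hypothetical, and the WebSphere client runtime jars are assumed to be available on the classpath:

# hypothetical standalone client bootstrapping against two application servers;
# the 9810/9811 bootstrap ports are examples -- check each server's BOOTSTRAP_ADDRESS
java -Djava.naming.factory.initial=com.ibm.websphere.naming.WsnInitialContextFactory \
     -Djava.naming.provider.url=corbaloc::appnode1.example.com:9810,:appnode2.example.com:9811 \
     -cp myclient.jar com.example.SampleClient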

Given the role the node agent plays in bootstrapping application servers, you wouldn't want a node agent to be unavailable for an extended period of time. This is especially true during a period when lots of server starts are occurring, such as after applying OS maintenance or WebSphere Application Server maintenance, or after an application (re)deployment.

In an environment with multiple machines running a Network Deployment cluster, the ability that the OS provides to monitor and restart the node agent in case of a failure should be sufficient for node agent HA, since the loss of a single node shouldn't be catastrophic in a multi-machine environment. Depending on the operating system, there are various "OS process nanny" capabilities (a Windows Service on Windows, or an inittab entry on Linux or UNIX) that can be employed for this purpose; they function as long as the physical server and OS continue to run, which is what enables the OS process nanny to monitor and restart the node agent. Of course, if the OS isn't running then you wouldn't be able to start any processes, let alone Network Deployment processes!
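
As a hypothetical sketch (the profile path and runlevels are examples), an inittab-based nanny for a node agent on Linux or UNIX might look like the following; this uses the same -script technique shown for the deployment manager in Listings 1 and 2 below, just run from the node agent profile with startNode.sh:

# generate an OS launch script from the node agent profile's bin directory
/wasconfig/OjaiCell01/profiles/AppSrv01/bin/startNode.sh -script OSstartNode.sh

# /etc/inittab entry so that init respawns the node agent if it stops
wasna:235:respawn:/wasconfig/OjaiCell01/profiles/AppSrv01/bin/OSstartNode.sh >/dev/console 2>&1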


Deployment manager function

Beginning with Network Deployment V6.x, the deployment manager is no longer a single point of failure for the IIOP workload management routing table, as was the case in WebSphere Application Server V5.x. In Network Deployment V6.0 and later, the HA manager ensures that the singleton that maintains this information is always available in one of the Network Deployment processes (deployment manager, node agent, application server, and so on). If the process that is hosting this singleton fails, then another process is elected to host the singleton. As a result, the deployment manager in Network Deployment V6.x and V7.0 is only used for making configuration changes and managing JMX routing.

While you could configure the deployment manager in a high availability cluster using the clustering software appropriate for your operating system (for example, HACMP for AIX®, SunCluster for Solaris™, MC/Serviceguard for HP-UX, and Microsoft® Cluster Server for Windows® Server), such a configuration adds cost and complexity. I'm not suggesting that it's prudent to make no provision for recovery or failover of the deployment manager, only that there are alternatives that can be used to minimize the loss of the ability to make configuration changes or to route JMX traffic, including performance monitoring information, which the deployment manager provides. We'll start our discussion of availability techniques by ensuring that administrative processes remain running.


OS process nanny

This technique doesn't provide for process failover from one server to another, but it does provide a means -- again, as long as the OS and server are functioning -- to ensure that the node agent or deployment manager process is running. The Network Deployment Information Center provides a discussion on creating a Windows service for Windows platforms and on modifying an example file: was.rc for use in creating a Linux® or UNIX® OS process nanny. But for Linux and UNIX, another variation is available.

On Linux and UNIX, you're probably familiar with using the startManager.sh and startNode.sh scripts to start the deployment manager and node agent. But there is also a -script option that you can use to generate scripts that the OS can invoke to start these two processes (or restart them in case of a failure).

  1. First you create the script. For the deployment manager:
    Listing 1
    >/wasconfig/OjaiCell01/profiles/Dmgr01/bin # ./startManager.sh -script OSstartManager.sh

    ADMU0116I: Tool information is being logged in file
               /wasconfig/OjaiCell01/profiles/Dmgr01/logs/dmgr/startServer.log
    ADMU0128I: Starting tool with the Dmgr01 profile
    ADMU3100I: Reading configuration for server: dmgr
    ADMU3300I: Launch script for server created: OSstartManager.sh

    For the node agent, you create the script from the node agent profile bin directory using startNode.sh instead of startManager.sh.
  2. Next, add an entry to the /etc/inittab file that references the script you just created:
    Listing 2
    # WAS inittab entry
    was:235:respawn:/wasconfig/OjaiCell01/profiles/Dmgr01/bin/OSstartManager.sh >/dev/console 2>&1
  3. Save the inittab file and run init -q to reload the file. You should see the deployment manager starting (unless it's already running, in which case you would want to either stop the deployment manager first or not run init -q).

    Depending on the version of WebSphere Application Server you are running, you might need to be aware of this APAR if you're going to employ this technique.

A word of caution: Once you've configured the inittab to start processes this way, you will not be able to stop the node agent or deployment manager manually -- or rather, if you do stop either one, it will immediately start again. In many environments this causes additional issues, so use this option with caution.
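
If you do need to take the deployment manager down for maintenance in this configuration, one approach (a sketch; adjust the entry name and path to match your own inittab) is to disable the respawn action first:

# edit /etc/inittab and change the action for the entry from "respawn" to "off", for example:
#   was:235:off:/wasconfig/OjaiCell01/profiles/Dmgr01/bin/OSstartManager.sh >/dev/console 2>&1
# then have init re-read inittab and stop the deployment manager
init -q
/wasconfig/OjaiCell01/profiles/Dmgr01/bin/stopManager.sh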

Now that you've provided for process availability, let's explore deployment manager failover options that don't require hardware clustering.


Cell configuration backup and restore using multiple servers

Although the approach described here does not provide for fully automatic failover and recovery, it is relatively inexpensive and can be easily implemented, and given its simplicity and the minimal impact of a deployment manager outage, it's likely more than sufficient.

In general, the steps are to:

  1. Make regular backups of the cell configuration using the backupConfig script. In addition, manually copy critical files from the <was deployment manager profile>/etc and <was deployment manager profile>/properties directories.
  2. Install Network Deployment and create a deployment manager profile on a backup or alternate server.
  3. Restore the configuration on a backup server using the restoreConfig script.
  4. Change the IP address (or host name) on the backup server to resolve to the IP address (or the host name) of the original server.
  5. Start the deployment manager on the backup server.

First, you must make backups of your cell on a regular basis (in case you lose access to the disks containing this information). WebSphere Application Server provides a command line tool for this purpose, backupConfig.sh/bat, located in the bin directories of all Network Deployment profiles. For the purpose of backing up your cell configuration, however, you should run it from the deployment manager profile. The execution of the script is shown below:

Listing 3
>/opt/IBM/WebSphere/AppServer/profiles/Dmgr01/bin # ./backupConfig.sh 
	20090801_backup.zip

ADMU0116I: Tool information is being logged in file
           /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/logs/backupConfig.log
ADMU0128I: Starting tool with the Dmgr01 profile
ADMU5001I: Backing up config directory
           /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/config to file
           /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/bin/20090801_backup.zip
.................................................................................
ADMU5002I: 689 files successfully backed up

The default execution stops the deployment manager. While it is a good idea to stop the deployment manager to prevent changes from being made while the backup is running, this action is not necessary. If you run the backupConfig script using the -nostop option, the deployment manager will not be stopped, in which case you'd want to be sure that no configuration changes were being made while the backup was running. Also, you can choose to specify a file name, as shown above, or omit the file name, which will result in a file name with the format: WebSphereConfig_YYYY-MM-DD.zip.
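
For example, a backup taken without stopping the deployment manager might look like this (run from the deployment manager profile's bin directory, as in Listing 3; omitting the file name produces the default WebSphereConfig_YYYY-MM-DD.zip name):

# back up the cell configuration without stopping the deployment manager;
# make sure no configuration changes are in flight while this runs
./backupConfig.sh -nostop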

At the same time, create a backup, as described, of the <was deployment manager profile>/etc and <was deployment manager profile>/properties directories:

  • The signer certs from <was deployment manager profile>/etc; for example, /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/etc.
  • The client security property files from the properties directory; for example, /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/properties.

The properties directory needs to be copied because you might have changed its contents to customize how the WebSphere Application Server clients behave. The etc directory needs to be copied because it contains (among other things) keys and certificates used for establishing SSL connections from clients to the servers. Depending on your configuration, failure to copy those files could result in minor issues (such as being inappropriately prompted for a password or to import a certificate) or serious failures.

This copy should be performed periodically, or in conjunction with backupConfig, using normal OS file system copy tools (such as tar), and it needs to be performed in advance of a machine failure to ensure that you can access the files when you need them.
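
A minimal sketch using tar (the profile path and backup destination are examples):

# archive the deployment manager profile's etc and properties directories
cd /opt/IBM/WebSphere/AppServer/profiles/Dmgr01
tar -cvf /backups/Dmgr01_etc_properties_20090801.tar etc properties
# copy the archive (along with the backupConfig zip) to a highly available file system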

When you have the backup, place a copy of it in a highly available file system; otherwise, a disk outage on your deployment manager server could make the backups unavailable to you.

Next, you must install Network Deployment and create a deployment manager profile on your backup server. The important part of this step is that you specify the Profile name, Node name, Host name (or IP address), and Cell name from the original server (Figure 1).

Figure 1. Node, host, and cell name specification during profile creation

You need to specify the values from the original server because the node, host, and cell names are embedded throughout the cell configuration.
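
If you prefer the command line to the PMT, here's a hedged sketch of an equivalent manageprofiles invocation on the backup server. The template path and names shown are examples only (V7.0 uses the management template with -serverType DEPLOYMENT_MANAGER, while V6.x uses the dmgr template without that parameter), and the profile, node, cell, and host names must match those of the original deployment manager:

/opt/IBM/WebSphere/AppServer/bin/manageprofiles.sh -create \
    -templatePath /opt/IBM/WebSphere/AppServer/profileTemplates/management \
    -serverType DEPLOYMENT_MANAGER \
    -profileName Dmgr01 \
    -nodeName OjaiCellManager01 \
    -cellName OjaiCell01 \
    -hostName dmgr.example.com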

Once the Network Deployment installation is complete, the deployment manager profile has been created on the backup server, and the files from the <was deployment manager profile>/etc and <was deployment manager profile>/properties directories have been restored, you are ready to restore the cell configuration from the original server. You will use the restoreConfig.sh/bat script to do this:

Listing 4
>/opt/IBM/WebSphere/AppServer/profiles/Dmgr01/bin #./restoreConfig.sh 
	20090801_backup.zip

ADMU0116I: Tool information is being logged in file
           /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/logs/restoreConfig.log
ADMU0128I: Starting tool with the Dmgr01 profile
ADMU0505I: Servers found in configuration:
ADMU0506I: Server name: dmgr
ADMU2010I: Stopping all server processes for node N1CellManager01
ADMU0512I: Server dmgr cannot be reached. It appears to be stopped.
ADMU5502I: The directory /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/config 
			already exists; renaming to 
				/opt/IBM/WebSphere/AppServer/profiles/Dmgr01/config.old
ADMU5504I: Restore location successfully renamed
ADMU5505I: Restoring file 20090801_backup.zip to location
           /opt/IBM/WebSphere/AppServer/profiles/Dmgr01/config
....................................................................
ADMU5506I: 689 files successfully restored
ADMU6001I: Begin App Preparation -
ADMU6009I: Processing complete.

At this point you have two options. You can either:

  • Add or modify the DNS entry (or the /etc/hosts file on all machines) so that the server name for the failed server on which you were running the deployment manager now resolves to the server to which you just moved the deployment manager.
  • Change the IP address on the backup server to match that of the original server, or add a network interface with the IP address of the original server. The steps to do this differ depending on your operating system; use the appropriate command or tool for your operating system (a Linux sketch follows this list).
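
As a rough illustration for Linux (the addresses, interface name, and host name are hypothetical; AIX, Solaris, HP-UX, and Windows each have their own equivalents):

# second option: add the original deployment manager's address as an alias on the backup server
ip addr add 192.0.2.10/24 dev eth0

# first option: point the deployment manager host name at the backup server,
# either in DNS or in /etc/hosts on every node, for example:
#   192.0.2.20   dmgrhost.example.com dmgrhost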

The first option requires you to stop and restart all the node agents in order to clear the "stale" cache in the node agent JVM that points to the old IP address for the server. The stale cache occurs because Java "remembers" the IP resolution of a host name. As a result, once a connection has been made, all running node agent JVMs have a cache of the IP address for the original server that the deployment manager was running on. Alternatively, you can configure a command-line argument for each node agent that will force a DNS cache refresh on a periodic basis. The property is:

-Dsun.net.inetaddr.ttl=<time in seconds>

An example is shown in Figure 2.

Figure 2. Node agent JVM arguments

A reasonable value for this property is 60 seconds. Stopping and starting the node agents ensures that this action occurs in a timely manner, and is preferred in small cells (<10 nodes). The second option does not require you to stop and restart the node agents, because the original IP address has been added to (or moved to) the backup server. Know also that the first option requires you to specify the host name for the server during creation of the deployment manager profile, while the second option requires that you specify the IP address.
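
For reference, restarting a node agent is simply a stop and start from the node's profile bin directory (the profile path shown is an example):

# run on each node in the cell
/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/bin/stopNode.sh
/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/bin/startNode.sh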

Finally, start the deployment manager by running the startManager script. When you receive this message:

ADMU3000I: Server dmgr open for e-business; process id is xxxx

you're ready to administer your cell using the administrative console or wsadmin. You can continue to do this until your original server is repaired.
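
If you want to confirm the state of the deployment manager from the command line, the serverStatus script can be used (the profile path is an example):

/opt/IBM/WebSphere/AppServer/profiles/Dmgr01/bin/serverStatus.sh dmgr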

A reminder: when you go back to the original server, you might need to stop and restart all the node agents again, depending upon how you chose to change the IP address or host name, as described earlier.


Multiple servers with a shared file system

This is a variation on the previous technique. As before, this approach does not provide for fully automatic failover and recovery, but again, it is relatively inexpensive and can be easily implemented, and given its simplicity and the minimal impact of a deployment manager outage, it's likely more than sufficient. One word of caution though: make sure that your shared file system provides adequate performance; otherwise you'll suffer degraded performance both during normal operations, when the Network Deployment cell configuration is maintained on the shared file system, and in the event of an outage that requires running from a backup server.

In general, the steps are to:

  1. Mount a shared (and highly available) file system to be used for the cell configuration.
  2. Install Network Deployment and create a deployment manager profile on a primary machine, and install Network Deployment on a backup machine. (Creating a profile on the backup is not required.)
  3. Change the IP address (or host name) on the backup server to resolve to the IP address (or the host name) of the original server.
  4. Start deployment manager on the backup server.

First, mount a shared file system (such as NFS) or a SAN file system to both the primary and backup server machines. In this example, a shared directory named "wascell" has been mounted.

Next, install Network Deployment on both machines. When you create the deployment manager profile, you'll need to specify the Profile directory in the graphical Profile Management Tool (PMT), shown in Figure 3, or use the -dmgrProfilePath directive, if you're using the manageprofiles(bat/sh) command. If you use the PMT, use the advanced profile creation option so you can specify the path for the profile.

Figure 3. Profile name and directory in the PMT

You'll notice that I have included the cell name, OjaiCell01, as part of the directory path in order to be able to store the configuration for multiple cells on the shared file system.
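
A hedged sketch of the shared file system setup (the NFS export, mount point, and profile path are examples; the profile creation parameters are otherwise the same as in the earlier manageprofiles sketch):

# mount the shared file system on both the primary and backup machines
mount -t nfs nfsserver.example.com:/export/wascell /wascell

# when creating the deployment manager profile on the primary machine, point the
# deployment manager profile directory at the shared mount, for example:
#   /wascell/OjaiCell01/profiles/Dmgr01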

When a server failure occurs at this point, follow the steps described earlier for changing the DNS entry, or for bringing up a network interface on the backup server with the address that was originally associated with the primary server.

Last, simply open an OS shell on the backup server, navigate to the bin directory of the deployment manager profile, and start the deployment manager using startManager(sh/bat):

>/wascell/profiles/Dmgr01/bin #./startManager.sh

As before, once you receive this message:

ADMU3000I: Server dmgr open for e-business; process id is xxxx

you're ready to administer your cell using the administrative console or wsadmin.


Cell configuration backup using a node agent

A variation on the Cell configuration backup and restore using multiple servers process described above is detailed in a separate article. Rather than taking backups using backupConfig, this process:

  1. Federates a node to the Network Deployment cell.
  2. Configures the node to obtain a copy of the entire Network Deployment cell configuration, as opposed to the node specific cell configuration that is obtained by default during configuration synchronization.
  3. Creates a modified deployment manager startup script for use with the configuration maintained with the backup node.

While this procedure does replace the need to take regular backups of the Network Deployment cell, it has the downside of limiting the backup deployment manager to operational control, so you can't use this approach to make configuration changes. As a result, this procedure is only suitable for a very brief period of time until the primary deployment manager is restored to service, and it doesn't provide a means of restoring the cell configuration function of the deployment manager in the event of a failure.


Other considerations

Related to the topic of deployment manager HA is the placement of the deployment manager. My recommendation is to place your production deployment manager on a server separate from the ones running Network Deployment application servers. My rationale is that running the deployment manager on its own server (or perhaps a server dedicated to your deployment managers) makes the application of maintenance easier for both the OS and WebSphere Application Server: performing maintenance on the deployment manager server doesn't impact any running application servers, which wouldn't be the case if they shared a machine, so you never take an outage of both your management runtime and your application runtime simultaneously.

Although this does require a separate license for the deployment manager server (or nodes if you prefer), the cost of a license is likely outweighed by the cost of an unplanned outage if high availability is important to your organization. Additionally, while placing the deployment manager on a separate virtual image (for example, LPAR, Zone, and so on) does eliminate the issue of OS or WebSphere Application Server maintenance being a source of an outage, it doesn't eliminate the machine as a single point of failure. Thus, if you do use LPAR or related technologies, you’ll need provisions for starting the deployment manager virtual image on another physical machine should a machine fail.


Summary

While the options described in this article don't ensure completely automatic failover and recovery in all cases, they do provide reasonable alternatives to a number of more costly automated solutions -- which your management might welcome in the current economic climate.

Now if I could only recall how to conjugate reducere. Let's see, reduco, reducere, reduxi....


Acknowledgements

Thanks to Keys Botzum for his suggestions and comments.

