Planning your backup plan
Having a "B" cell is much like having a "plan B."
A two-cell (or more) strategy provides the capability to divert traffic to an alternate cell (cell B) while maintenance is being applied to a primary cell (cell A). Similarly, if any problems are discovered after cell A is reactivated with the changes applied, it can simply be shut down and all traffic will flow to cell B. This is also useful for new application deployments, fixpacks, testing new configuration parameter settings, and so on.
You can have a multiple cell strategy and share the same physical servers. Instead of expanding the existing cluster or cell, creating a cell B on the same nodes provides redundancy without the added cost of additional hardware. Deploying cells on dedicated hardware, while more expensive, reduces the chance that a hardware failure will affect more than one cell.
Multiple cells provide the ability to selectively enable specific infrastructural components to participate in an active production configuration. Through careful control and configuration, various parts of the infrastructure can be removed from the production environment on a planned or unplanned basis.
The objectives of multiple cell configurations are to enable you to:
- Easily move users from one running environment to another.
- Minimize (or eliminate) downtime when taking down a part of the environment for planned or unplanned maintenance.
- Easily revert to a previously known configuration, should a catastrophic failure occur in the primary production environment.
- Prevent accidental changes to an active configuration running in production.
The load balancer in IBM WebSphere Application Server is configured to send traffic to cell A, cell B, or both. If a cell needs to be removed for maintenance or an upgrade, the load balancer can be directed to send traffic only to the cell that remains running. If the environment is expanded to more than two cells, the same basic strategy applies.
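The routing behavior above can be sketched as a small model. This is an illustration only, not a WebSphere or load balancer API; the `LoadBalancer` class and cell names are assumptions:

```python
# Minimal sketch of the routing decision: track which cells are enabled
# and report where traffic should flow. Cell names are illustrative.

class LoadBalancer:
    def __init__(self):
        # Both cells receive traffic by default.
        self.enabled = {"cellA", "cellB"}

    def drain(self, cell):
        """Stop sending new traffic to a cell (for example, before maintenance)."""
        self.enabled.discard(cell)

    def restore(self, cell):
        """Resume sending traffic to a cell."""
        self.enabled.add(cell)

    def targets(self):
        """Cells that currently receive traffic, sorted for stable output."""
        return sorted(self.enabled)

lb = LoadBalancer()
lb.drain("cellA")          # take cell A down for maintenance
print(lb.targets())        # ['cellB']
lb.restore("cellA")        # maintenance complete
print(lb.targets())        # ['cellA', 'cellB']
```

The same model extends to more than two cells: `drain` and `restore` simply operate on whichever cell is being serviced.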
Figure 1. WebSphere Application Server load balancer
Safety and security through scripting
All changes to your WebSphere Application Server configuration should be executed through wsadmin scripts. Your scripts should be able to:
- Enable (make active) a configuration at the load balancer.
- Disable (make inactive) a configuration at the load balancer.
- Make changes to an inactive configuration.
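The three tasks above can be wrapped in a single entry-point script. The sketch below models that dispatch in plain Python so it is self-contained; the function bodies are placeholders, and in a real wsadmin (Jython) script they would call the AdminConfig/AdminControl objects and your load balancer's administration interface:

```python
# Sketch of one script entry point covering the three scripted tasks:
# enable a configuration, disable it, or change an inactive one.
# Bodies are placeholders for the real wsadmin / load balancer calls.

import sys

def enable(cell):
    return "enabled %s at load balancer" % cell

def disable(cell):
    return "disabled %s at load balancer" % cell

def modify(cell, setting, value):
    return "set %s=%s on inactive cell %s" % (setting, value, cell)

ACTIONS = {"enable": enable, "disable": disable, "modify": modify}

def main(argv):
    if not argv or argv[0] not in ACTIONS:
        raise SystemExit("usage: enable|disable|modify <cell> [setting value]")
    action, args = argv[0], argv[1:]
    print(ACTIONS[action](*args))

if __name__ == "__main__":
    main(["disable", "cellA"])  # example invocation
```

Keeping all three operations behind one entry point makes it easier to add the active-configuration checks described in the next sections.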
Identify active cells and configurations
An important part of the process is knowing which cell or configuration is active (cell A, cell B, or both). There are a couple of ways you can do this:
- Easiest: Static objects on an HTTP server can identify which cell is active. This can be a text object that identifies the active configuration. The scripts can access the static object (for example, by using curl) and use that to determine the active configuration. The static object can be a simple text file built by the script as it activates various parts of the runtime environment, or as it analyzes each of the configuration files.
- More thorough: Write a custom servlet to dynamically determine and display the current active configuration. This will be able to pull information directly from the various configuration files to determine which cell is active, which grid the applications are pointed to, what version of the application is running, what the version of the coherence data is, and so on. This also enables easier auditing of the environment to ensure that everything is properly configured in production. Of course, this will involve some development effort.
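The "easiest" approach can be sketched as follows. The file format (`key=value` lines) is an assumption; in practice the script would fetch the text object with curl or urllib rather than holding it in a string:

```python
# Sketch of parsing the static text object that identifies the active
# configuration. The keys and values shown are illustrative.

status_txt = """\
active_cell=cellB
app_version=2.4.1
grid=grid-prod-2
"""

def parse_status(text):
    """Parse key=value lines into a dict, ignoring blanks and comments."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        status[key.strip()] = value.strip()
    return status

status = parse_status(status_txt)
print(status["active_cell"])   # cellB
```

Because the file is plain text served over HTTP, the same object can be read by scripts, monitoring, and people alike.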
Some monitoring should be provided on the active configuration as a matter of course, and an informational alert should be sent to those responsible for the infrastructure when a change occurs. This enables immediate analysis in case the change was accidental or otherwise unplanned.
Preventing changes to active servers
Your scripting should also be aware of which configuration is active, so that changes are not made on active servers, further preventing production downtime. Scripts should use the captured information described above to identify the active configuration and map it to the specific servers that are active.
You could provide an override flag to enable actions on active servers, but this should require, at the very least, some level of user ID and password authorization.
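A minimal sketch of that guard is shown below. The cell-to-server mapping and the credential check are placeholders; a real script would verify credentials against your security registry rather than a hard-coded pair:

```python
# Sketch of a guard that refuses changes on active servers unless an
# authorized override is supplied. Server names are illustrative.

CELL_SERVERS = {
    "cellA": {"serverA1", "serverA2"},
    "cellB": {"serverB1", "serverB2"},
}

def active_servers(active_cell):
    """Map the active configuration to the set of active servers."""
    return CELL_SERVERS.get(active_cell, set())

def credentials_ok(credentials):
    # Placeholder: a real script would check the security registry.
    return credentials == ("admin", "secret")  # assumed test credential

def apply_change(server, active_cell, override=False, credentials=None):
    """Apply a change, refusing active servers without an authorized override."""
    if server in active_servers(active_cell):
        if not (override and credentials_ok(credentials)):
            raise RuntimeError("refusing change on active server " + server)
    return "change applied to " + server

print(apply_change("serverB1", active_cell="cellA"))  # inactive cell: allowed
```

With this guard in the common entry point, an accidental change to production fails loudly instead of silently causing downtime.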
Reverting to a previous configuration
The easiest way to back out to a previous deployment and configuration of the production environment is to not immediately make any changes to the “inactive” configuration after a new configuration is made active. Leave the inactive configuration in a warm ready state (that is, the servers are left running even though no traffic is flowing through that configuration). If it is determined that the new (or current) active configuration is not working as desired, you can simply point the load balancers to the previous configuration and make the inactive configuration active again. This enables troubleshooting in production without affecting the availability of the environment.
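The revert path can be sketched as a simple swap, guarded by a check that the standby is still warm. The state fields here are assumptions for illustration:

```python
# Sketch of reverting to the previous configuration: because the inactive
# cell is kept in a warm ready state, backing out is just repointing the
# load balancer. Field names are illustrative.

state = {"active": "cellA", "standby": "cellB", "standby_warm": True}

def revert(state):
    """Swap active and standby, provided the standby is still warm."""
    if not state["standby_warm"]:
        raise RuntimeError("standby is cold; cannot revert without a restart")
    state["active"], state["standby"] = state["standby"], state["active"]
    return state

revert(state)
print(state["active"])   # cellB
```

Note that the warm-standby check is the important part: once you begin applying the next round of changes to the inactive configuration, this fast revert path is no longer available.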
Providing redundancy is one strategy for sites with high availability requirements. Scripts that perform repeatable and testable tasks provide confidence that changes made in one cell can be propagated to the other cells. Scripts also manage where users are directed, whether to a particular cell or a group of cells, and ensure that the shift happens seamlessly. Upgrades and other maintenance activities can be conducted with assurance that a viable back out strategy is available. With adequate hardware isolation, any unplanned outages can be contained to one cell. Likewise, migrations and fixpack updates can be conducted independently on each cell without affecting the others.
In other words, coming up with a plan B should always be a part of plan A.