A cost-effective high availability solution on IBM Power Systems

Today’s business requires 24×7 availability for its critical applications. For most organizations, unexpected application downtime translates directly into loss of revenue, loss of business or loss of reputation. Companies want reliable hardware for their IT infrastructure, but regardless of how reliable your hardware and software are, you cannot eliminate server or application downtime completely. So, you need a high availability (HA) solution in place to minimize downtime.

The goal of high availability is to eliminate single points of failure (SPOF) in your IT environment. Eliminating SPOFs requires a careful study of each hardware and software component involved in your IT setup and necessitates providing redundancy at every layer. Organizations whose infrastructure is built on IBM Power Systems servers can use redundant network adapters, redundant host bus adapters (HBAs), redundant virtual input/output (VIO) servers, and so forth, to eliminate SPOFs at the hardware level within a physical server (frame). But there is always a chance that the entire frame may go down. For such scenarios, IBM Power Systems clients can use local HA solutions like IBM PowerHA, Oracle RAC (for Oracle database servers), SUSE-HA (for SUSE Linux) and the like. To handle entire data center failure, clients use their disaster recovery solutions.

Redundant servers for HA means increased cost of hardware, software and maintenance. The goal of this blog post is to discuss how we can minimize the cost of local HA in a PowerHA scenario without compromising on HA capabilities.

Most common implementation scenarios of IBM PowerHA

Figure 1: PowerHA active-passive configuration

In a PowerHA active-passive configuration, shown in figure 1, the application is running on LPAR-1 on frame-1 and is moved to LPAR-2 on frame-2 when LPAR-1 or frame-1 fails. So, the application is highly available.

Figure 2: PowerHA mutual takeover configuration

In a PowerHA mutual takeover configuration, shown in figure 2, there are two applications, A and B. A is active on LPAR-1 on frame-1 and B is active on LPAR-2 on frame-2. If LPAR-1 or frame-1 fails, application A moves to LPAR-2 on frame-2, and if LPAR-2 or frame-2 fails, application B moves to LPAR-1 on frame-1.

In both scenarios, you need hardware resources like CPU and memory for both the active and the standby nodes. For example, if LPAR-1 has 20 CPU cores and 400 GB of memory, LPAR-2 needs the same amount of CPU and memory if you want full performance after failover in the active-passive setup. In a mutual takeover scenario, if application A requires 20 cores/400 GB and application B also requires 20 cores/400 GB, then both LPAR-1 and LPAR-2 need 40 cores/800 GB so that if one LPAR fails, the other LPAR can handle the load of both applications.
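The sizing rule above can be written out as a small calculation. This is only an illustrative sketch in Python: the `Demand` type and the two helper functions are hypothetical names invented for this post, not part of PowerHA or any IBM tool, and the figures are the 20 cores/400 GB example from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Demand:
    """CPU and memory an application needs at full performance (hypothetical type)."""
    cores: int
    memory_gb: int

    def __add__(self, other: "Demand") -> "Demand":
        return Demand(self.cores + other.cores, self.memory_gb + other.memory_gb)


def active_passive(app: Demand) -> tuple:
    """Return (active LPAR size, standby LPAR size).

    The standby LPAR mirrors the active one so the application
    keeps full performance after failover.
    """
    return app, app


def mutual_takeover(app_a: Demand, app_b: Demand) -> tuple:
    """Return (LPAR-1 size, LPAR-2 size).

    Either LPAR may end up hosting both applications,
    so each one is sized for the combined load.
    """
    combined = app_a + app_b
    return combined, combined


app_a = Demand(cores=20, memory_gb=400)
app_b = Demand(cores=20, memory_gb=400)

print("active-passive :", active_passive(app_a))
print("mutual takeover:", mutual_takeover(app_a, app_b))
```

In both cases the total provisioned capacity is twice what the applications consume in normal operation, which is the cost this post sets out to reduce.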

The scenarios shown in figure 1 and figure 2 are for a single cluster, but the same applies for multiple clusters on the same set of frames. For example, frame-1 could have 10 LPARs and frame-2 could have 10 LPARs, and each LPAR may be acting as either active or standby. In summary, only 50 percent of resources are used, and 50 percent of resources are kept idle for HA standby. Depending on the applications used, you may also need double the application licenses for this kind of HA setup, increasing cost further.

In large Power Systems environments, some clients have hundreds of Power servers and thousands of LPARs. If this is your organization’s environment, how can you optimize your HA design to reduce hardware and application licensing costs? The following PowerHA design can help in such a case.

Many-to-one PowerHA design to save on hardware and software costs

Figure 3: Many-to-one PowerHA configuration

In the design shown in figure 3, we have a large number of LPARs on hundreds of frames acting as cluster active nodes, and one or two frames are dedicated to all cluster standby LPARs. Optionally, you may want to use the idle CPU/memory of the standby frames to run some non-critical workloads when no failover is in progress.

In the case of failure of any cluster active node, a standby node on a standby frame is activated, and production continues. Because a large number of standby nodes are configured on a single standby frame, each is initially given only a small CPU/memory allocation, and PowerHA uses the Dynamic LPAR (DLPAR) feature of the Power server to dynamically increase the CPU/memory to the desired value during failover. Thus, you can replace 50 percent standby capacity for HA with around 10 percent, without losing any HA capabilities. If multiple frames fail at the same time, the standby frame won’t have adequate resources to start all LPARs at full size, and the situation must be treated like a data center failure. In this case, you need to switch to DR, just as you would with a regular PowerHA solution in the active-passive or mutual-takeover configurations.

So, this design can help you bring down the standby capacity needed for local HA from 50 percent of total hardware requirements to around 10 percent, and save around 40 percent of infrastructure and software licensing costs.
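To see where figures like these can come from, here is a rough sizing sketch in Python. Everything in it is an assumption made for illustration: the function names are hypothetical, the environment is 10 active frames with 10 LPARs of 20 cores each, each idle standby LPAR is assumed to hold 0.5 cores, and the standby frame is sized to survive the failure of any one active frame. Real sizing depends on your workloads and LPAR minimums.

```python
def standby_cores_needed(frames, min_cores_per_standby):
    """Cores the standby frame needs to absorb the failure of any ONE frame.

    frames: list of lists; each inner list holds the full core demand
            of the active LPARs on one frame.
    min_cores_per_standby: small footprint an idle standby LPAR keeps
            (assumed value, grown via DLPAR only on failover).
    """
    total_lpars = sum(len(frame) for frame in frames)
    worst_case = 0.0
    for frame in frames:
        # Standby LPARs for the failed frame grow to full size.
        failed_over = sum(frame)
        # All other standby LPARs stay at their idle footprint.
        idle = (total_lpars - len(frame)) * min_cores_per_standby
        worst_case = max(worst_case, failed_over + idle)
    return worst_case


def standby_share(frames, min_cores_per_standby):
    """Standby capacity as a fraction of total (active + standby) cores."""
    active = sum(sum(frame) for frame in frames)
    standby = standby_cores_needed(frames, min_cores_per_standby)
    return standby / (active + standby)


# Hypothetical environment: 10 frames x 10 LPARs x 20 cores.
frames = [[20] * 10 for _ in range(10)]
active_cores = sum(sum(frame) for frame in frames)

many_to_one_total = active_cores + standby_cores_needed(frames, 0.5)
one_to_one_total = 2 * active_cores  # every active LPAR has a full-size twin

print(f"standby share : {standby_share(frames, 0.5):.1%}")
print(f"hardware saved: {1 - many_to_one_total / one_to_one_total:.1%}")
```

With these assumed numbers, the standby frame needs 245 cores against 2,000 active cores, which is about 11 percent of the total, and the overall core count is roughly 44 percent lower than a one-to-one design. The memory calculation would follow the same pattern.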

Need help on planning and designing your Power server environment?

If you’re due for a server refresh, talk to IBM Systems Lab Services. We can provide consolidation design and HA planning to help you to optimize the required hardware and software while keeping availability and performance at the best possible level. If you want to engage IBM Systems Lab Services, you can contact your IBM client rep or reach out to IBM Systems Lab Services directly.