Provide high availability and disaster recovery in IBM Bluemix
What's the difference between HA and DR, and how can you keep your Bluemix apps up and running even in a disaster?
This tutorial was written using a previous version of the IBM Bluemix® interface. Given the rapid evolution of technology, some steps and illustrations may have changed.
When it comes to applications deployed in the cloud, perhaps the most fundamental question that's asked about non-functional requirements is "How do I ensure that my application stays running even if something fails?" This is as true of applications deployed on IBM Bluemix® as it is of any other cloud platform, but the answer to the question can vary depending upon the particular variety of IBM Bluemix the application is deployed on.
Since its inception, the Bluemix platform has evolved from one platform type to many:
- Bluemix Public offers a shared public cloud environment that's hosted in IBM SoftLayer Data Centers, providing hundreds of services across mobile, IoT, Watson, and many other service types.
- Bluemix Dedicated offers an isolated customer-specific environment, which, while still hosted in an IBM SoftLayer Data Center, provides the added level of isolation many enterprises require.
- Bluemix Local provides the same level of cloud platform agility and management as the others, but is hosted within a client data center.
All three varieties of Bluemix are delivered as a service, with IBM responsible for maintaining the platform. In the case of Bluemix Public and Dedicated, IBM is also responsible for the hardware and virtualization infrastructure supporting Bluemix. The real variation occurs with Bluemix Local, in which the customer is responsible for keeping the underlying hardware and software up and running, and therefore faces an additional question: "What happens when the data center is down or unreachable?"
In this article, we describe how high availability (HA) and disaster recovery (DR) solutions can be implemented for applications running in Bluemix Local and Dedicated, as well as runtimes and services in all Bluemix platforms. We'll examine the differences between HA and DR, and then discuss considerations for the Bluemix platforms and applications. We also describe some possible architectures for an HA/DR solution for Bluemix Local.
What are HA and DR?
Although the terms high availability and disaster recovery are often used interchangeably, they are two distinct concepts:
- High availability is a characteristic of a system that aims to ensure an agreed-upon level of operational performance for a higher than normal period. Building HA into a system normally entails designing the system so as to avoid single points of failure by adding redundancy to the components. Often HA is described as the ability to keep a system functioning within a geographical region barring a natural or human-caused disaster that can affect an entire data center.
- Disaster recovery is a set of policies and procedures that enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. DR is the practice of ensuring that a particular system can be returned to a working state within a specified period of time (the Recovery Time Objective, or RTO) while losing at most a limited number of in-flight transactions (the Recovery Point Objective, or RPO).
In the case of a Platform as a Service, such as the PaaS components of Bluemix, HA is achieved by ensuring that all components are duplicated to guarantee there is no single point of failure. However, if these components are run within a single region, a DR issue can arise if all the hardware in that region becomes unavailable (which might happen due to networking issues). In order to achieve DR, you need to specify the procedures to guarantee platform availability within the RTO, even if an entire region becomes unavailable.
Some Bluemix programming model terminology
Applications written in Bluemix can be built using any combination of Cloud Foundry Runtimes, Docker Containers, or Virtual Machines. In reality, only the Cloud Foundry and Docker parts constitute a Platform as a Service, so in this article we'll focus on those types of applications and systems. Before we dive into the detailed technical considerations for the Bluemix Platform, it's necessary to define some terms used in Bluemix and Cloud Foundry in order to facilitate the discussion.
- An application is what's deployed in Cloud Foundry. It's defined by a manifest file and may contain multiple runtimes and services.
- A buildpack provides framework and runtime support for the application. Examples include the Java™ Liberty buildpack and the Node.js buildpack.
- A runtime represents the instantiation of a buildpack, as part of the application deployment.
- A service represents an external resource that can be used by an application.
- A Droplet Execution Agent (DEA) schedules and manages the running of runtime instances inside warden containers. (Note that this is true of Cloud Foundry version 2, which is currently the version on which Bluemix is based. The Diego version of Cloud Foundry differs in some ways, which we will address in a future article.)
All of this comes together when an application is constructed. Each application consists of one or more runtime instances, each of which runs inside a warden container within the confines of a DEA. Likewise, each runtime instance relies on one or more service instances. (Service instances are defined and shared at the space level in Bluemix.)
In the Cloud Foundry platform, the combination of health managers and BOSH agents provides many levels of high availability, as described in "The Four Levels of HA in Pivotal CF" by Cornelia Davis. Since the platform is already highly available, we really need only to focus on what an application needs to do to take advantage of these capabilities.
Building applications with Docker in Bluemix is similar to building applications with the Cloud Foundry Runtimes. In this case, there are two new terms we need to introduce:
- A container in the Bluemix Container service runs a single instance of a Docker image. You can create single containers either through the Bluemix GUI or the command line.
- A container group in the Bluemix Container service hosts multiple copies of the same Docker image. Container groups have built-in recovery capabilities.
Just as with the Cloud Foundry Runtimes, you construct an application using Docker containers with Bluemix service instances. Again, the platform provides HA, but the application needs to take certain steps to take advantage of this capability.
Now that you understand the basic architecture of a Bluemix application, the considerations for HA in the Bluemix environments become clear. Essentially, you need to consider the HA aspects for Cloud Foundry runtimes, Docker containers, and the services. We'll consider the runtime aspects first.
Although IBM ensures that the shared components of the Bluemix Public environment remain highly available, there are some considerations to be made for any application to be highly available. As we mentioned earlier, an application is made up of runtime instances and service instances, so you have to consider the HA capacity of both segments in order to achieve HA across your entire application.
In order to make an application available in the face of a potential runtime instance failure, the user needs to deploy multiple instances of the runtime. This can be done in several ways. The first is to use the cf scale command to set the number of available runtime instances.
Alternatively, you can specify the number of instances directly in the manifest file. In order to survive any single point of failure in Bluemix Public, you need to set that number to at least two. However, in order to survive other potential failure modes, you might want to set it to a higher number. (In general, we've found that the more instances the better.)
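As a minimal sketch of the first option, the following Python helper builds the cf scale invocation for a given instance count. The helper names, the application name my-app, and the enforced floor of two instances are illustrative assumptions; the -i flag is the Cloud Foundry CLI option for setting the instance count.

```python
import subprocess
from typing import List

MIN_HA_INSTANCES = 2  # assumption: the minimum named above for surviving a single failure


def build_scale_command(app_name: str, instances: int) -> List[str]:
    """Return the argv for `cf scale APP -i N`, refusing counts below the HA floor."""
    if instances < MIN_HA_INSTANCES:
        raise ValueError(
            f"need at least {MIN_HA_INSTANCES} instances for HA, got {instances}"
        )
    return ["cf", "scale", app_name, "-i", str(instances)]


def scale_app(app_name: str, instances: int) -> None:
    """Run the command; assumes the cf CLI is installed and already logged in."""
    subprocess.run(build_scale_command(app_name, instances), check=True)


if __name__ == "__main__":
    # Only prints the command; call scale_app() to actually run it.
    print(" ".join(build_scale_command("my-app", 3)))
```

Keeping the command construction separate from its execution makes the instance floor easy to check in a deployment script before anything is changed on the platform.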
Another possibility is to use the Auto-Scaling service in Bluemix to let it scale your application automatically up and down, according to demand. See "Using the Bluemix Auto-Scaling service" later in this article for factors to consider when using Auto-Scaling to achieve HA.
Likewise, if your application is deployed using the IBM Container Service instead of Cloud Foundry Runtimes, there are similar capabilities that are built into the IBM Container Service. For instance, any application that is deployed using the Container Service should be deployed into a Container Group in order to achieve HA within a single Bluemix Public Region. If you create a Container Group with at least two instances for your application, then your application will continue to function even when a single container instance fails. This should be considered to be the minimum deployment configuration for production applications using the IBM Container Service. For more information on the benefits and limitations of the IBM Container Service and Container Groups, see the IBM Bluemix Container Service documentation.
When it comes to the Bluemix Public services, the quality of service (QoS) and availability vary among different services. Bluemix services provide many plans and approaches, offering alternatives for the user to guarantee the needed QoS.
Because Bluemix Dedicated runs on the same infrastructure as Bluemix Public, the considerations described above apply to Bluemix Dedicated in terms of runtime, with some slight differences. In Bluemix Dedicated (and Bluemix Local as well), the number of DEAs that are available is limited by the total purchased capacity of the Bluemix Dedicated or Local installation. This means that instead of the myriad DEAs that are deployed in a Bluemix Public region, you might have only a handful of DEAs deployed into your Dedicated or Local installation. This could affect the number of runtime instances you might want to allocate for very important applications. For example, in a standard 64GB Bluemix Dedicated installation, four VMs are allocated for DEAs. In a scenario where one of those VMs is brought down for maintenance, you would have only three DEAs available to service work. Therefore, to avoid losing availability when an application suffers an unexpected runtime failure during DEA maintenance, you might want to deploy your application with a minimum of three instances: one can be lost to maintenance and one to an unexpected failure, and a third still remains to serve requests.
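The reasoning behind the three-instance recommendation can be written as simple arithmetic. This is only a sketch under a worst-case assumption that is not stated by the platform itself: every planned outage and every unexpected failure takes down a different instance of the application. The function name is hypothetical.

```python
def min_instances(planned_outages: int, unplanned_failures: int) -> int:
    """Smallest instance count that still leaves one instance running.

    Worst-case assumption: each outage or failure removes a different
    instance of the application at the same time.
    """
    if planned_outages < 0 or unplanned_failures < 0:
        raise ValueError("outage counts must not be negative")
    return planned_outages + unplanned_failures + 1


if __name__ == "__main__":
    # One DEA down for maintenance plus one unexpected runtime failure.
    print(min_instances(planned_outages=1, unplanned_failures=1))
```

With one DEA in maintenance and one unexpected failure, the function returns three, which matches the minimum suggested above. With no planned outage and a single failure it returns two, the minimum given earlier for Bluemix Public.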
Just as the runtime considerations are slightly different from those of Bluemix Public, Bluemix Dedicated provides some additional flexibility in terms of services, considering that the environment is exclusive to one customer. Customers have the choice to work with IBM to ensure a particular quality of service and availability for dedicated services. In addition, customers can choose to set up their own highly available services (most likely in SoftLayer, to mitigate the latency between the Bluemix platform and the service) and expose them in Bluemix, using service brokers.
All of the considerations that we have discussed for Bluemix Dedicated with regard to VM and DEA allocation also apply to Bluemix Local. In addition, there is another aspect to consider: Bluemix Local is installed on the customer's existing IaaS environment. This means that customers who experience hardware failures are responsible for replacing that hardware. In situations where hardware needs to be replaced, or when the IaaS (hypervisor) infrastructure itself is undergoing upgrades, Bluemix Local is subject to additional planned outages that customers do not have to plan for in Bluemix Public or Dedicated.
In addition to the core Cloud Foundry platform, the Bluemix Local platform includes several built-in services such as autoscaling, and optionally may include local versions of services such as Cloudant. As each service has a different implementation and different characteristics, we can't categorically state that all services provide HA; however, chances are good that they do provide it.
When it comes to service integration, Bluemix Local may provide more flexibility than the Public or Dedicated options in that it can be integrated with many existing services. Many enterprise services (such as enterprise-grade DB2 or Oracle databases) probably already have HA capability and can be registered as Bluemix services using the service broker capability (which is also available in Dedicated).
Another interesting aspect of Bluemix Local and Dedicated is their ability to use syndicated services, available in Bluemix Public and Dedicated. Syndicated services are configured in a Bluemix Local environment and exposed to the user, like a regular Bluemix Local service. The beauty of using syndicated services is that they provide HA as they inherit it from the Bluemix Public environment. The drawback is that these services are not hosted behind the client's firewall, which might restrict their use.
As stated above, high availability is about avoiding single points of failure within a given region. Disaster recovery is a harder problem, in that it's about surviving the catastrophic failure (or loss of availability) of an entire region. In order to have such an "always-on" scenario, it's necessary to deploy multiple Bluemix environments, which can be a combination of Public, Dedicated, or Local platforms. For example, a customer could do one of the following:
- use the Bluemix Public environment in Dallas and London
- deploy two different Bluemix Dedicated environments (each in a different SoftLayer location)
- deploy two different Bluemix Local environments in separate customer-owned data centers
Core Cloud Foundry
The Bluemix team is in the process of introducing availability zones into its Cloud Foundry fabric for its public sites. As of the publication of this article, there are multiple CF availability zones in the Dallas region, so any application deployed in that region with more than one instance will have its instances spread across at least two zones. We will roll out this support to the other regions over time. From the standpoint of the core Cloud Foundry platform (Cloud Controller, health manager, DEAs, routers, and so on), Bluemix provides an acceptable level of high availability. But the environment is still limited to the overall availability of the region. If the entire region is unreachable (as in the case of a natural disaster or, more likely, a local loss of network connectivity), then that region of Bluemix will be unavailable. Later in this article, in "Application considerations," we describe some architectures that help address this issue for the various Bluemix versions.
Considerations for Bluemix Local and Dedicated
Currently, availability zones are not yet implemented for Bluemix Local and Dedicated. Therefore, any DR solution must involve multiple Bluemix environments installed in different data centers. Because of that, there are two issues that we have to add to the HA considerations described earlier:
- Cross-site IP Sprayer or DNS routing. Although each Bluemix environment has its own IBM DataPower Gateway appliance, in order to keep the solution available when a data center is lost, the solution needs a global load balancer to either dispatch the requests among the runtimes or to re-route the DNS requests. Note that you can implement this as either an active-active or an active-passive solution. Which one you choose depends upon the design of your application. (For an example of setting this solution up as active-active, see "Configure and run a multi-region Bluemix application with IBM Cloudant and Dyn," by Lee Surprenant.)
- A DevOps process to keep the runtime levels in sync among all the environments.
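The routing decision that a global load balancer makes in the two modes can be sketched as follows. This is an illustration only, not the behavior of any specific product: the function name, the mode strings, and the region names are assumptions, and the health map stands in for whatever health checks the load balancer actually runs.

```python
from typing import Dict, List


def route_targets(health: Dict[str, bool], mode: str, primary: str) -> List[str]:
    """Return the regions that should currently receive traffic.

    health  -- region name -> result of the latest health check
    mode    -- "active-active" or "active-passive"
    primary -- preferred region in active-passive mode
    """
    healthy = [region for region, ok in health.items() if ok]
    if mode == "active-active":
        return healthy                # spread requests over every healthy region
    if mode == "active-passive":
        if health.get(primary, False):
            return [primary]          # normal operation: primary only
        return healthy[:1]            # failover: first healthy standby, if any
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    status = {"dallas": False, "london": True}
    print(route_targets(status, "active-passive", primary="dallas"))
```

In active-passive mode the standby receives traffic only after the primary fails its health check, which is why the runtime levels in both environments must already be in sync when that happens.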
Another aspect to consider is session affinity. If the application uses an intermediate cache layer, such as Redis, it's important to be aware that such a service might not synchronize the information across Bluemix environments.
The situation gets even more complex in the case of an active-active configuration, where the global load balancer needs to be aware of the session affinity to ensure that the requests are directed to the original Bluemix environment. In the case of a disaster where the intermediate cache layers are not synchronized across environments, the users' sessions will be lost. The next section discusses other considerations specific to the case of Bluemix Local.
Bluemix Local HA/DR architectures
As we have seen, a single Bluemix Local environment doesn't guarantee availability of your application in every scenario, as it is restricted to a single region. In order to achieve such overall availability, you need multiple Bluemix environments. In this section, we describe how to achieve better availability with an architecture of multiple Bluemix environments.
A clear choice for obtaining a DR-capable solution with Bluemix is to deploy multiple Bluemix Local environments in different regions. That way, if one region becomes unavailable, the solution will still be available, by directing the traffic to the other Bluemix Local environments. However, this solution addresses recovery only for the runtimes. Just like the core Cloud Foundry components, locally deployed services are bound to one region and don't provide a disaster recovery mechanism. That must be handled separately, as you'll see below. But first, let's explore another alternative.
Bluemix Public or Dedicated as a backup for Bluemix Local
To protect a single Bluemix Local against failures in a region, you might consider using Bluemix Public or Dedicated to temporarily host the applications and services.
One appealing advantage of Bluemix Local is that it is hosted behind the client's firewall, allowing the placement of sensitive information and the integration of in-house services. If those options are important, using Bluemix Public or Dedicated might not be viable, and a solution with multiple Bluemix Local environments might be the only option.
Multi-region architectures for Bluemix Local
Now, let's discuss the possible configurations for a solution with multiple Bluemix Local environments. For this discussion, we classify the Bluemix Local components into services and runtimes.
Ideally in a multi-Local solution, you want to use all the environments simultaneously. However, that depends on the characteristics of the services and runtimes, and also the proximity (and network latency) among the Local environments.
In this discussion, we are assuming that there are two Bluemix Local environments, although the same principles can be applied to more than two environments. For the purposes of this discussion, we'll call a single Bluemix Local (or Dedicated) installation a "region," to maintain compatibility with the terminology used in Bluemix Public.
In the simplest case, many services are stateless and do not persist data, making it possible for them to run in tandem with the same services in another Bluemix Local environment. An example of this would be the Watson Text-to-Speech or Speech-to-Text service. If a solution relies only on this kind of service, then it can run as a truly active-active solution, with the services and runtimes deployed to both environments. However, even in this scenario, you need to make sure your application is constructed with appropriate timeout and retry logic to deal with the potential for failure in one or the other environment.
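A minimal sketch of such timeout and retry logic is shown below. It assumes the application knows one endpoint per environment and receives a callable that performs the actual request with a timeout; the function name, parameter names, and default values are all hypothetical, and a real client would catch narrower exception types than Exception.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    endpoints: Sequence[str],
    call: Callable[[str, float], T],
    timeout_s: float = 2.0,
    attempts_per_endpoint: int = 2,
    backoff_s: float = 0.5,
) -> T:
    """Try each endpoint in order, retrying with a growing delay before moving on."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        for attempt in range(attempts_per_endpoint):
            try:
                # `call` is expected to enforce the timeout itself.
                return call(endpoint, timeout_s)
            except Exception as error:  # sketch only: narrow this in real code
                last_error = error
                if attempt < attempts_per_endpoint - 1:
                    time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all endpoints failed") from last_error
```

The request function is passed in rather than hard-coded so that the same retry policy can wrap any stateless service call, and so that the policy can be exercised without a network.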
In most cases, however, the solution requires the use of stateful services, such as a Cloudant or Mongo Service. Often, these services are not aware of other Bluemix Local environments and are not prepared to replicate or synchronize data between service instances running in separate Bluemix Local environments, making the creation of a full active-active solution unfeasible. An alternative is described in the next section.
If the service can't run in an active-active configuration, you need to consider how to replicate the data from the active service to the passive one, so that it is available in the case of a disaster. For this discussion, we'll consider a service that persists data, such as a MySQL service.
One way to replicate the data is to keep regular backups, storing them outside the data center where the active Bluemix Local environment is running. In the case of a disaster, this backup can be restored in the passive environment, either manually or through an automated process. In cases with a tight RTO, this solution might be unacceptable, due to the delay in restoring the database in the passive environment. Furthermore, it requires creating backups frequently enough to keep data loss within the RPO. Here are some examples of how backups are supported by different services:
- In SQLDB, Bluemix users have the ability to schedule backups as needed in the premium plan. Note that SQLDB is available only for Bluemix Public, or it can be federated into Bluemix Dedicated or Local.
- IBM DB2 on Cloud can be backed up at any time and the backup moved onto IBM's Swift Object storage, which can then be used to restore the data either onto the same or a different (potentially remote) instance.
- Postgres on Compose has a fixed 24-hour RPO for its backups, but you can restore from the backups at any time, even to a different data center.
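Whether a backup-based approach fits a given pair of objectives can be estimated with a rough calculation. The sketch below treats the backup interval as the worst-case data loss and the restore plus switchover time as the worst-case recovery time; it ignores the time needed to detect the disaster and to take the backup itself. The class and function names, and the sample figures other than the 24-hour interval mentioned above, are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackupPlan:
    backup_interval_h: float   # hours between consecutive backups
    restore_time_h: float      # hours to restore a backup in the passive environment
    switchover_time_h: float   # hours to redirect traffic once the restore is done


def worst_case_data_loss_h(plan: BackupPlan) -> float:
    """Data written since the last backup is lost, so the interval bounds the loss."""
    return plan.backup_interval_h


def worst_case_recovery_h(plan: BackupPlan) -> float:
    """Recovery needs the restore and the traffic switch to complete."""
    return plan.restore_time_h + plan.switchover_time_h


def meets_objectives(plan: BackupPlan, rpo_h: float, rto_h: float) -> bool:
    """True when both the RPO and the RTO are satisfied in the worst case."""
    return worst_case_data_loss_h(plan) <= rpo_h and worst_case_recovery_h(plan) <= rto_h


if __name__ == "__main__":
    daily = BackupPlan(backup_interval_h=24, restore_time_h=3, switchover_time_h=1)
    print(meets_objectives(daily, rpo_h=24, rto_h=4))
    print(meets_objectives(daily, rpo_h=1, rto_h=4))
```

A plan built on daily backups can satisfy a 24-hour RPO but never a one-hour RPO, however fast the restore is, which is the point at which continuous replication becomes the more suitable option.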
Another option is to perform continuous data replication, which is often the most reasonable option in an active-passive model. However, this also puts restrictions and limitations on the set of services that can be used. Not every data service available for Bluemix offers multi-region data replication. In these cases, data replication may not be supported by the plans available in Bluemix; it may require you to use an external service provider connection. As with most other situations, your solution is service-dependent. Here are examples of four services that currently provide data replication:
- The DB2 on Cloud Service is compatible with and can host a full DB2 HADR solution. When you choose this service, you can pick the options that you want for your DB2 implementation, including specifying whether you want your servers to be virtual or bare metal, and whether to include features such as DB2 HADR. Detailed instructions for setting up DB2 HADR on DB2 on Cloud are available online in the DB2 support documentation.
- The SQL Database Service on Bluemix also provides high availability in its premium support plan. Likewise, instructions for using the alternate server are available on the SQL Database Service site. However, you should note that this solution provides high availability within a data center and not disaster recovery across data centers. It will address the failure of any single server, but will not address entire region failures.
- Postgres on Compose.io allows you to set up high availability within a single data center using etcd as the shared configuration repository for determining which of two Postgres nodes is the leader (see "High Availability for PostgreSQL, Batteries Not Included," by Chris Winslett). This solution does not support setting up cross–data center disaster recovery solutions in the same way. At this time, all Compose.io services require separate contracts and configuration outside Bluemix, so configuring Postgres for HA would be done when you specify that service.
- The Cloudant Service solves both the HA and DR problems: Cloudant is a naturally distributed system that automatically stores data across multiple servers, making it immune to single-point-of-failure problems, so high availability is a built-in capability. Cloudant also supports cross-region replication for topologies that need to support disaster recovery scenarios, as it allows for multi-site active-active configurations on a per-database basis. However, if you need this cross-region replication, it will need to be set up outside of Bluemix as you work out a separate solution with Cloudant.
If your services can't run in an active-active configuration, an alternative approach is to have the services run in an active-passive mode, with the runtimes running in an active-active mode, accessing the active services in only one Bluemix Local environment at a time.
This topology requires the Bluemix Local environments to have fast network connectivity, as the runtimes in one region are accessing the services in another. You must know the performance characteristics of your application in order to determine if the latency caused by accessing services in a different data center will be acceptable or not.
With this setting, you are effectively using both Bluemix Local environments, and the runtimes can be deployed considering that half of the workload will go to each environment.
If an active-active configuration of the runtimes is not viable, you should then consider the options for an active-passive configuration.
In this scenario, the runtimes are deployed to both regions, but the load balancer is directing traffic to just one region. If necessary, the data is replicated between the active region and the passive one.
This solution offers the benefit of having the passive environment ready to go when a disaster happens. In this case, the disaster procedure involves only reconfiguring the load balancers and the data replication services.
The drawback of this solution is that an entire environment is kept running and up to date, but is not being used for either development or in processing customer traffic. The alternative below might present a more effective solution.
In this scenario, the production runtimes are deployed to both regions, but are not started in the passive region. As part of the disaster procedure, the runtimes must be started promptly, so that the user will be able to recover faster. Note that in this case the overall cost of a Local or Dedicated solution would be the same as in a hot standby mode, since you pay for the capacity of the solution. However, you could easily use your cold standby side for other purposes, such as development and test or production staging, so long as those applications can be shut down promptly. Although such a solution may be more cost-effective overall, the need to start the runtimes in the passive region quickly might represent a problem in terms of platform availability.
Limited performance in case of a disaster
An intermediate solution between the hot and cold standby is to keep a limited number of runtime instances running in the passive region and increase the number in the case of a disaster. That way, the user wouldn't suffer from any lack of availability, but would probably experience slow performance while the other instances are started.
Using the Bluemix Auto-Scaling service
In the case of an active-active configuration (whether runtime or service level), an effective way to keep the same performance during a disaster is to use the Bluemix Auto-Scaling service. With that, you can deploy the runtime in both regions, then configure the SLA metrics (CPU or memory utilization, for example), and let the platform scale up and down, as needed. In the case of a disaster, all the traffic will be directed to a single region, increasing the runtime load. Bluemix will then be responsible for scaling up the number of instances to keep the performance within acceptable bounds. Once the disaster has passed, the runtime will scale down to a normal number of instances.
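The kind of rule described here can be illustrated with a small threshold-based function. This is not the Auto-Scaling service's API or its actual algorithm; the function name, the CPU thresholds, and the instance bounds are all assumed values chosen for the example.

```python
def desired_instances(
    current: int,
    cpu_percent: float,
    scale_out_above: float = 80.0,
    scale_in_below: float = 30.0,
    minimum: int = 2,
    maximum: int = 10,
) -> int:
    """Return the instance count after applying one threshold-based scaling step."""
    if cpu_percent > scale_out_above:
        target = current + 1      # under pressure: add an instance
    elif cpu_percent < scale_in_below:
        target = current - 1      # mostly idle: remove an instance
    else:
        target = current          # within bounds: leave as is
    # Never drop below the HA floor or exceed the purchased capacity.
    return max(minimum, min(maximum, target))


if __name__ == "__main__":
    # All traffic lands on one region after a disaster, so CPU climbs.
    print(desired_instances(current=2, cpu_percent=95.0))
```

Keeping the lower bound at two or more preserves the high availability minimum discussed earlier, even when the application is idle.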
There are a number of additional application-layer considerations that should be taken into account when building applications that are designed for high availability and disaster recovery. First, you need to carefully consider the needs of each application separately. Many organizations try to create a one-size-fits-all solution for DR and HA, which can be expensive and unnecessary. Therefore, your first step should be to create a classification system. Not all apps are mission critical or need DR. Limited availability for some apps should be acceptable. By classifying your applications into different buckets based on their requirements, you will probably find that you can begin your cloud journey by starting with those that have less rigorous needs.
We've already covered what can be done at the data services layer to help guarantee that an application can continue to function even in the face of the loss of a service or of one or more application runtimes. However, another consideration is that service availability extends beyond IBM-provided services. In many cases, an application will build on services provided by an on-premises service implementation hosted in a corporate data center. Remember that your application is only as resilient as its weakest link; if a back-end SOA service is not multi-site-enabled or highly available, it does not matter how highly available the applications built on it are.
All of these solutions only address how to make the runtime stateless—that is, to store all of its state in a back-end database (SQL or NoSQL). When you look at what it takes in the application to make that layer resilient in the face of failure, you find that there are other considerations to take into account as well. Essentially, there are three different cases we need to examine:
- The simplest case, and the one that offers the least resilience, is to store all of your application data in a single copy of a database in Bluemix. That is the most traditional mechanism for building applications, and it can be combined with traditional HA techniques such as data replication and DR techniques like backup and recovery (see above). The downside of this approach is that it limits the set of potential databases you can use to those that support HA use cases through their infrastructure. The upside is that it avoids any question of partitioning your data.
- The next most complicated case is to store your data in separate copies of the database in separate regions in Bluemix. This requires partitioning your data in an active-active model, and it would require the ability to do either backup and restore or data replication in an active-passive model. The upside of this model is that your application is never entirely down, even after the failure of an entire data center or region. However, it would require that your users log back in to the application, at the least, and may result in data loss or the need to reconcile data between regions during a recovery period. In the case of an active-passive model, you need to consider the latency involved in a data replication or backup approach, because this will affect the RPO of your solution.
- A final case, one that may turn out to be the path of least resistance for existing applications, is to "punt" the problem and build your application as a hybrid cloud application in which the data is stored on your own premises, on existing databases using any HA and DR procedures your existing IT team has in place. That doesn't so much solve the problem as it defers it, but it may be easiest in the short term.
Taking a step back from these layering considerations, you should also make sure that you're following a couple of principles for building cloud-native applications that may influence the way that your applications behave:
- Let your runtime environment inform your application design. This seems simple, but in fact it's at the heart of a revolution in how applications are built. In the past, when teams were limited in the runtimes and environments that they could choose, they often made compromises driven by the limitations or strengths of those environments. For example, when application build times were long and application runtimes were expensive and heavyweight, it made more sense to put lots of functionality into a single application runtime. In more modern applications, you can separate application functionality into different runtimes, which means that even if part of your application infrastructure fails, not all of the application functionality will be lost.
- Your application must take some responsibility for resiliency and high availability. For instance, the application should implement probes (sometimes called synthetic transactions) to let external monitoring systems determine the up/down/slow state of each application component. Likewise, applications can take advantage of well-known design patterns, such as Circuit Breaker and Bulkhead, to reduce the impact of component failure.
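As an illustration of the Circuit Breaker pattern mentioned above, here is a minimal single-threaded sketch. The class names, the failure threshold, and the cool-down period are assumptions made for the example; a production application would more likely rely on an established library and would need to handle concurrent callers.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency marked as down")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each call to a back-end service in a breaker like this lets the application return a fallback response quickly while the dependency is down, rather than tying up runtime instances on requests that are likely to time out.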
In this article we have examined the most common ways of achieving high availability in Bluemix applications, and we have looked at issues around setting up Bluemix and Bluemix services to address disaster recovery needs. We have only scratched the surface of this complex set of issues, but we hope this article can help you understand the different options available in implementing HA and DR capabilities.
- Auto-Scaling for Bluemix
- The Four Levels of HA in Pivotal CF
- Configure and run a multiregion Bluemix application with IBM Cloudant and Dyn
- High Availability for PostgreSQL, Batteries Not Included
- Restoring Pivotal Cloud Foundry After Disaster
- IBM Bluemix Containers documentation
- Zero Downtime Deployment and Scaling in CF
- Top 9 rules for cloud applications
- Scaling applications in IBM Bluemix