Several factors affect the availability of services, and most of them are related to infrastructure failures. These failures include not only unrecoverable hardware outages, but also recoverable OS or middleware failures.
Not so long ago, the most common approach to high availability was to deploy infrastructures with the highest possible Mean Time To Failure (MTTF), which required expensive systems and assumed that it was possible to write error-free software applications. It was also assumed that some degree of downtime was acceptable, with vendors boasting of the number of 9's that they could support (e.g. 99.999% availability). In today's always-on Internet, any downtime of a major service becomes headline news. The traditional approach is no longer applicable, and a new approach has to be considered.
Given the requirement to reduce infrastructure costs, service providers are using commodity hardware. Given also the requirement to reduce operational costs, hardware failures are commonly dealt with by directly replacing the failed component rather than by manual debugging and recovery performed by skilled (and expensive) administrators. Thus, to maintain the objective of continuous availability of the service, the Cloud system must be built to expect failure of the underlying infrastructure, and not only for temporary periods: it must assume that components will disappear forever. This cannot be limited to hardware components, because no matter how well a software element is tested, unexpected edge conditions will appear at some point in time. So, to guarantee continuous availability, a Cloud solution must also expect its own components to fail.
Given that we are forced to expect failure, the high MTTF approach is no longer valid, and instead we have to increase availability by flipping the approach to minimizing Mean Time To Recovery (MTTR). The quicker the system can recover from failure, the higher the availability of the service will be. Given however that even a tiny percentage of downtime is no longer acceptable, we also need a means to maintain service availability during the recovery process. One way of doing this is through providing redundancy of all critical services within the Cloud solution.
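The effect of MTTR and redundancy on availability can be illustrated with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The following is a minimal sketch: the failure and recovery figures are hypothetical and chosen only for illustration, and the redundancy calculation assumes that replicas fail independently, which real deployments only approximate.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of N replicas when one surviving replica is enough.

    Assumes the replicas fail independently of each other.
    """
    return 1.0 - (1.0 - single) ** replicas


# Hypothetical commodity node that fails about once every 1000 hours.
slow_recovery = availability(1000.0, 10.0)  # manual repair taking 10 hours
fast_recovery = availability(1000.0, 0.1)   # automated restart taking 6 minutes

# Three redundant copies of the slowly recovering node.
three_replicas = redundant_availability(slow_recovery, 3)

print(f"slow recovery:  {slow_recovery:.6f}")
print(f"fast recovery:  {fast_recovery:.6f}")
print(f"three replicas: {three_replicas:.8f}")
```

With the MTTF held constant, shortening recovery raises availability, and adding redundancy raises it further because the service stays up while any single replica is recovering.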
SmartCloud Provisioning is designed according to the principles of Recovery-Oriented Computing (ROC): it is based on a highly distributed, redundant and robust infrastructure, with near-zero downtime and automated recovery across heterogeneous platforms. It does not require expensive systems, and can run on a relatively low-cost commodity infrastructure.
The key factors that allow SmartCloud Provisioning to be a low-touch and robust cloud infrastructure are the following:
- the infrastructure is as stateless as possible: this avoids issues related to single points of failure
- management agents are deployed on the physical nodes of the infrastructure (compute nodes and storage nodes) and are connected in a peer-to-peer network to form a self-monitoring and self-managing infrastructure
- core services are redundant: they are deployed in clusters to tolerate individual faults
- master images are replicated in multiple copies across the storage nodes in the storage cluster; this tolerates hardware failures of the storage nodes in the cluster as well as network failures when accessing one copy of the image
- hypervisor (compute) nodes are deployed via a stateless boot so that it becomes easier to re-deploy a failing hypervisor by simply rebooting it and getting a fresh new copy of the hypervisor image. This also allows easy deployment of new nodes if needed, to augment the capacity of the infrastructure
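The image-replication point in the list above can be sketched as a simple fallback across copies. This is an illustration only, not the product's actual logic: the function `pick_replica`, the node names, and the `placement` and `reachable` maps are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_replica(
    image_id: str,
    placement: Dict[str, List[str]],
    reachable: Dict[str, bool],
) -> Optional[str]:
    """Return the first reachable storage node holding a copy of the image.

    Returns None when no copy can be reached.
    """
    for node in placement.get(image_id, []):
        if reachable.get(node, False):
            return node
    return None


# Hypothetical layout: one master image replicated on three storage nodes.
placement = {"master-image-1": ["storage-a", "storage-b", "storage-c"]}

# storage-a is down (hardware or network failure); the other copies are fine.
reachable = {"storage-a": False, "storage-b": True, "storage-c": True}

source = pick_replica("master-image-1", placement, reachable)
print(source)
```

As long as at least one copy is reachable, a deployment request can still be served, which is why losing a single storage node does not interrupt provisioning.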
Let's consider some typical failure scenarios that can happen in a real environment, and see how SmartCloud Provisioning is designed to tolerate them and react appropriately.
The first example is related to the management agents that SmartCloud Provisioning uses to perform the standard provisioning operations.
Management agents are deployed on both the compute nodes and the storage nodes and are organized in dynamic hierarchies, where a leader (manager) is dynamically elected. The leader is just the entry point for distributing requests across the infrastructure and the coordinator of any operation, but this role does not imply any special information being associated with the agent itself (stateless infrastructure): any agent can be a leader.
All the agents have a watch-dog mechanism that is used to prevent, detect and correct failures; they also monitor their neighbours and can start simple actions to fix other agents' issues.
So, if an agent fails, the watch-dog mechanism tries to restart it. If the watch-dog is not able to restart the agent, its neighbours try some simple actions to restart it. If the agent still cannot be restarted, the system keeps on working without that node, thanks to the redundant infrastructure.
If the failing agent was a leader, and it cannot be restarted, the managed agents can re-elect their leader dynamically, without losing any information.
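The escalation and re-election described above can be sketched as follows. This is a simplified simulation under stated assumptions, not the product's implementation: the `Agent` fields, the `recover` and `elect_leader` functions, and the "lowest id wins" election rule are all hypothetical, and restart outcomes are modelled as plain flags.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    agent_id: int
    alive: bool = True
    watchdog_can_fix: bool = True   # simulated outcome of a local restart
    neighbour_can_fix: bool = True  # simulated outcome of a peer-driven restart


def recover(agent: Agent, live_neighbours: List[Agent]) -> str:
    """Escalate recovery: local watch-dog first, then neighbours, then give up."""
    if agent.watchdog_can_fix:
        agent.alive = True
        return "watchdog"
    if live_neighbours and agent.neighbour_can_fix:
        agent.alive = True
        return "neighbour"
    return "abandoned"  # the system keeps working without this node


def elect_leader(agents: List[Agent]) -> Optional[Agent]:
    """Any live agent qualifies as leader; the lowest id wins (assumed rule)."""
    live = [a for a in agents if a.alive]
    return min(live, key=lambda a: a.agent_id) if live else None


agents = [Agent(1), Agent(2), Agent(3)]
leader = elect_leader(agents)

# The leader fails and neither the watch-dog nor the neighbours can restart it.
leader.alive = False
leader.watchdog_can_fix = False
leader.neighbour_can_fix = False
outcome = recover(leader, [a for a in agents if a.alive])

# The leader role carries no state, so re-election only needs the live agents.
new_leader = elect_leader(agents)
print(outcome, new_leader.agent_id)
```

Because the election depends only on which agents are alive, nothing has to be copied from the failed leader before a new one takes over.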
Another example is related to failures either in a storage node or in a compute node.
If a storage node fails, thanks to the redundant deployment and to the multiple copies of the same image available in the storage cluster, the deployment of VMs can continue without issues, and the leader agent will try to restart the failing node.
If a compute node fails, the leader detects the failure and stops sending requests to that node. It also tries to restart the node, forcing a fresh copy of the compute node to be re-deployed via PXE boot.
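The leader's handling of a failed compute node can be sketched as below. The `Leader` class, its method names, and the round-robin dispatch policy are assumptions made for this illustration; the actual PXE reboot is represented only by moving the node into a `rebooting` set.

```python
from typing import Iterable, Set


class Leader:
    """Hypothetical leader that routes requests only to healthy compute nodes."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self.healthy: Set[str] = set(nodes)
        self.rebooting: Set[str] = set()
        self._next = 0  # round-robin counter

    def report_failure(self, node: str) -> None:
        """Stop routing to the node and mark it for a fresh PXE re-deployment."""
        self.healthy.discard(node)
        self.rebooting.add(node)  # stands in for triggering the PXE boot

    def report_recovered(self, node: str) -> None:
        """The node came back with a fresh hypervisor image; use it again."""
        self.rebooting.discard(node)
        self.healthy.add(node)

    def dispatch(self, request: str) -> str:
        """Pick a healthy node for the request in round-robin order."""
        if not self.healthy:
            raise RuntimeError("no healthy compute nodes available")
        candidates = sorted(self.healthy)
        node = candidates[self._next % len(candidates)]
        self._next += 1
        return node


leader = Leader(["c1", "c2", "c3"])
leader.report_failure("c2")

# Requests issued while c2 is being re-deployed never reach it.
targets = [leader.dispatch(f"vm-{i}") for i in range(4)]
print(targets)

leader.report_recovered("c2")
```

Once the node has rebooted from a fresh image it simply rejoins the pool, since no per-node state had to be preserved.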
If you're interested in trying the SmartCloud Provisioning product, you can download a trial version from the following link: