July 15, 2014 | Written by: Alexei Karve
With shared hosting and overcommitment, higher virtual machine (VM) density can cause noisy neighbor problems, where a small number of applications monopolize shared resources such as network bandwidth, CPU, RAM and disk input/output (I/O). Disk capacity keeps increasing, but read and write speeds do not keep pace, so the same I/O performance is spread over more data and more applications.
The cloud provider can monitor hypervisors and overcome the noisy neighbor problem by dynamically moving workloads across physical servers in order to ensure that every application gets resources assigned to the instance. This process incurs an overhead. Cloud consumers can increase both the reliability and availability of their applications by leveraging the resiliency of cloud-based resources with data centers in multiple regions and availability zones. Clients will see faster access and better performance with a content delivery network that distributes content where it is needed. In this blog post, I explain some of the functionality your cloud provider may offer that enables a collaborative approach between the provider and the client for building scalable and fault-tolerant applications in a cloud.
When your application runs into resource limits and does not scale out, you have to scale up. SoftLayer allows you to change the memory, CPU and storage on your servers. You can start with a smaller set of resources and then order additional memory or CPU for your servers or move to larger systems as the need arises. There is, however, a limit to how much you can scale up the resources and it may be cost prohibitive. Most of these resize operations require downtime. When you resize the primary disk on a dedicated server, the cloud provider may reinstall the operating system.
With the proper partitions, OpenStack allows the resizing of ephemeral storage without loss of data and will move the VM to another hypervisor in the process. You can set the Nova configuration option allow_resize_to_same_host to allow resizing on the same host. If supported, you could take advantage of hot-add RAM and hot-plug CPU functionality, so you do not need to shut down your virtual machine or application. Balloon drivers installed in each virtual machine let the hypervisor reclaim memory from guests, shifting a memory shortage from the host (where the shortage exists) to the VMs. CPU frequency scaling enables the operating system to scale the CPU frequency up or down in order to save power. CPUs can also be dynamically disabled and re-enabled on a running Linux system.
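As a rough illustration of the last point, Linux exposes CPU hot-plug state through sysfs: the file /sys/devices/system/cpu/online lists the online CPUs in a form such as "0-3,5", and writing 0 or 1 to cpuN/online takes a CPU offline or brings it back. The Python sketch below assumes that layout; the helper names are my own, and changing CPU state requires root privileges.

```python
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as "0-3,5" into [0, 1, 2, 3, 5]."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def online_cpus(sysfs: Path = SYSFS_CPU) -> list[int]:
    """Read the CPUs the kernel currently reports as online."""
    return parse_cpu_list((sysfs / "online").read_text())


def set_cpu_online(cpu: int, online: bool, sysfs: Path = SYSFS_CPU) -> None:
    """Write 1 or 0 to cpuN/online (needs root; cpu0 often cannot go offline)."""
    (sysfs / f"cpu{cpu}" / "online").write_text("1" if online else "0")
```

On a test machine you might call set_cpu_online(3, False), confirm that 3 no longer appears in online_cpus(), and then re-enable it with set_cpu_online(3, True).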
Replication or cloning
A cloud enables capturing custom images with pre-installed applications. A benefit of exploiting virtual machine images is the ability to replicate the image to different environments. These clones may require some reconfiguring of the applications or reactivation on first boot. A good scale-out architecture can self heal without the requirement for extra redundant servers. A SoftLayer Flex-Image is a platform-neutral imaging solution that captures an image and gives you the ability to replicate it on another instance. This is available for use and replication on both dedicated and SoftLayer CloudLayer Computing Instances (CCI). It expands the options for how you mix and match your servers in the cloud. Scaling up to dedicated servers would be beneficial if the application reaches a workload utilization that requires more raw horsepower for your processor-intensive and disk I/O-intensive workloads and you want to avoid sharing resources.
The hybrid approach would be to have on-premises cloud with cloud bursting to meet spikes in demand. When demand recedes, you bring down the provisional servers in the hosted cloud, thus only incurring the temporary utilization costs rather than purchasing surplus on-premises hardware. IBM Platform LSF or Platform Symphony software provisioned on SoftLayer and the on-premises infrastructure expands capacity as needed by seamlessly bursting jobs from on-premises to secure off-premises resources in the cloud. Music Mastermind uses a hybrid cloud solution with global cloud bursting to SoftLayer for meeting global traffic demands required for its active users, hosting and application development needs.
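The sizing arithmetic behind cloud bursting is simple enough to sketch. The Python function below is a hypothetical helper, not part of Platform LSF, Platform Symphony or any SoftLayer API; it assumes demand and capacity are measured in the same unit (for example, jobs per hour) and that every burst server adds the same capacity.

```python
import math


def burst_servers_needed(demand: float,
                         on_prem_capacity: float,
                         per_server_capacity: float) -> int:
    """Return how many hosted servers cover demand beyond on-premises capacity."""
    if per_server_capacity <= 0:
        raise ValueError("per_server_capacity must be positive")
    overflow = demand - on_prem_capacity
    if overflow <= 0:
        return 0  # on-premises capacity is enough; release any burst servers
    # Round up: a partially used server still has to be provisioned.
    return math.ceil(overflow / per_server_capacity)
```

When the result drops back to zero, the provisional servers can be brought down so that you pay only for the temporary utilization.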
A cloud may expose functionality to enable disaster recovery. Multi-region support and failover strategies enable transitions to alternate sites in different geographical locations if a major catastrophe strikes a particular area. With Virtualized Server Recovery (VSR), SoftLayer businesses will be able to replicate entire systems in near real time, including system files, databases, applications and user data, in a way that is independent of the make and model of the underlying hardware.
OpenStack saves images as full snapshots into the image service Glance. The OpenStack Orchestration service Heat and telemetry service Ceilometer work together to provide autoscaling functionality to expand and contract resources for composite cloud applications (stacks). This allows quick replication of additional servers to meet increased demand.
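To make the autoscaling idea concrete, here is a minimal sketch of the decision an alarm-driven scaling policy makes. It is plain Python rather than Heat template syntax or the Ceilometer API; the ScalingPolicy fields, the CPU thresholds and the one-server step are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    min_servers: int
    max_servers: int
    scale_out_above: float  # average CPU % that triggers adding a server
    scale_in_below: float   # average CPU % that triggers removing a server


def desired_servers(current: int, avg_cpu: float, policy: ScalingPolicy) -> int:
    """Return the server count after one alarm evaluation."""
    if avg_cpu > policy.scale_out_above:
        target = current + 1
    elif avg_cpu < policy.scale_in_below:
        target = current - 1
    else:
        target = current
    # Clamp to the bounds of the scaling group.
    return max(policy.min_servers, min(policy.max_servers, target))
```

In a real stack, the metering service would supply the measured average and the orchestration service would launch or delete the servers; the gap between the two thresholds keeps the group from oscillating.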
When migrating to a cloud, your on-premises VM is not necessarily identical to your new cloud instance. Migrating VMs between cloud providers may not be straightforward; however, it is possible by either copying and converting the image from source to target hypervisor or selecting an equivalent base image on the target and reinstalling the application stack using an Image Construction and Composition tool. SoftLayer allows seamless migration between virtual and physical environments. For example, if you run into resource limits that are I/O bound, you can move to more powerful dedicated servers with solid-state drives (SSDs) and RAID.
In OpenStack, migration provides a scheme to move instances from one OpenStack compute node to another. Migration is useful for redistributing the load among the available hypervisors. There are two types of migration:
- Non-live is where the instances will be shut down for the move to another hypervisor.
- Live migration is where the instance will be kept running. There are two types of live migration: shared storage-based live migration and block live migration. Live migration offers extreme versatility but may result in degraded performance during the migration.
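The choice among these options can be summarized as a small decision function. This is only a sketch of the reasoning in the list above; the enum and function names are hypothetical and do not correspond to OpenStack API calls.

```python
from enum import Enum


class Migration(Enum):
    NON_LIVE = "non-live"
    LIVE_SHARED_STORAGE = "live (shared storage)"
    LIVE_BLOCK = "live (block)"


def choose_migration(must_stay_running: bool, shared_storage: bool) -> Migration:
    """Pick a migration type from the two questions that matter most."""
    if not must_stay_running:
        # Downtime is acceptable, so the instance can be shut down and moved.
        return Migration.NON_LIVE
    if shared_storage:
        # Disks are already visible to both hypervisors.
        return Migration.LIVE_SHARED_STORAGE
    # No shared storage: the disk contents must be copied during the move.
    return Migration.LIVE_BLOCK
```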
Evacuation and rebuild
If a hypervisor needs to be taken down for maintenance, you need to empty the source hypervisor by moving virtual machines to other target hypervisors. If the OpenStack compute service is deployed with a shared file system, it preserves the user disk data on the evacuated server. Rebuilding instances is required when something goes horribly wrong. The instance is booted from a new disk, but preserves its configuration including the IP address. On servers in SoftLayer, a rebuild would mean reloading the operating system on your server. In this process, you could re-partition the drives or even change the operating system and re-purpose the server. A rescue kernel lets you bring a server online in order to troubleshoot system problems that would normally only be resolved by an OS reload.
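Emptying a hypervisor is essentially a placement problem. The sketch below shows one simple greedy approach in Python, assuming RAM is the only constraint; it is not how the OpenStack scheduler works, and the function name and inputs are hypothetical.

```python
def plan_evacuation(vms: dict[str, int], free_ram: dict[str, int]) -> dict[str, str]:
    """Map each VM to a target hypervisor, or raise if one cannot be placed.

    vms: VM name -> RAM required (MB) for VMs on the host being emptied.
    free_ram: target hypervisor name -> free RAM (MB).
    """
    remaining = dict(free_ram)  # work on a copy; leave the caller's data alone
    plan: dict[str, str] = {}
    # Place the largest VMs first so small ones do not crowd them out.
    for vm, need in sorted(vms.items(), key=lambda item: item[1], reverse=True):
        host = max(remaining, key=remaining.get, default=None)
        if host is None or remaining[host] < need:
            raise RuntimeError(f"no target hypervisor has {need} MB free for {vm}")
        plan[vm] = host
        remaining[host] -= need
    return plan
```

Always picking the target with the most free memory spreads the evacuated load, which leaves headroom on each host; a production scheduler would also weigh CPU, disk and affinity rules.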
Snapshot or checkpointing
If you have a virtual desktop and something goes wrong, you may want to restore to a previous working state. If you have taken snapshots of your VM instances, you can revert to a working state. Volume managers and file systems such as the Linux Logical Volume Manager (LVM), IBM General Parallel File System (GPFS), Z File System (ZFS) and others provide snapshot capability. Most common are copy-on-write (COW) snapshots, which are a good fit if you need a point-in-time backup taken very quickly for short-term recovery needs. This enables recovery of files that users accidentally delete, or a rollback of an entire system to a pre-update configuration.
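The reason copy-on-write snapshots are so fast is that taking one copies nothing; old data is preserved only when a block is first overwritten. The toy Python class below illustrates that idea on an in-memory list of blocks. It is a teaching sketch with names of my own choosing, not how LVM, GPFS or ZFS implement snapshots internally.

```python
class CowVolume:
    """Toy block volume with copy-on-write snapshots (illustrative only)."""

    def __init__(self, blocks: list[bytes]) -> None:
        self._blocks = list(blocks)
        # Each snapshot stores only the original content of blocks
        # that were overwritten after the snapshot was taken.
        self._snapshots: list[dict[int, bytes]] = []

    def snapshot(self) -> int:
        """Take a snapshot; no data is copied at this point."""
        self._snapshots.append({})
        return len(self._snapshots) - 1

    def write(self, index: int, data: bytes) -> None:
        """Overwrite a block, saving the old content for snapshots that need it."""
        old = self._blocks[index]
        for saved in self._snapshots:
            saved.setdefault(index, old)  # copy only on the first write
        self._blocks[index] = data

    def read(self, index: int) -> bytes:
        return self._blocks[index]

    def read_snapshot(self, snap_id: int, index: int) -> bytes:
        """Read a block as it was when the snapshot was taken."""
        return self._snapshots[snap_id].get(index, self._blocks[index])

    def rollback(self, snap_id: int) -> None:
        """Restore the live volume to the snapshot's state."""
        for index, data in self._snapshots[snap_id].items():
            self._blocks[index] = data
        # Later snapshots describe states that no longer exist.
        del self._snapshots[snap_id + 1:]
        self._snapshots[snap_id].clear()
```

A typical use would be to call snapshot() before an update, apply the update through write(), and call rollback() if the update misbehaves. The cost of the snapshot grows only with the number of blocks changed afterward, which is why it suits short-term recovery better than long-term backup.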
A cloud may not provide you with everything you want. It can, however, provide you with what you need. The cloud can give you the building blocks to develop scalable and resilient applications. Share your thoughts in the comments below or engage in the conversation with me on Twitter @aakarve. I look forward to hearing about how you take advantage of the ingredients provided by a cloud to accomplish resiliency and fault tolerance.