Enable virtual machine high availability in IBM SmartCloud Enterprise+

SLA-based HA for System x and System p in the cloud

High availability (HA) is an essential feature of cloud infrastructure. This article gives you an overview of the multifaceted approach that IBM SmartCloud Enterprise+ takes to ensuring HA. It also provides HA implementation details for virtual machines that run on System x® and System p® platforms.

Bhanu P Tholeti (btholeti@in.ibm.com), Systems Engineer and Architect, IBM

Bhanuprakash has worked in the software industry for the past 10 years on various technologies and products such as application development for Pocket PCs, web-based applications, video streaming solutions, and products like Tivoli Workload Scheduler, WebSphere Data Interchange, Tivoli Service Automation Manager, and Tivoli Provisioning Manager. He is part of IBM SmartCloud Enterprise+, working on cloud infrastructures and hypervisors.



K. Sowjanya Chakravarthi (csowjany@in.ibm.com), Systems Engineer, IBM

Sowjanya CK has worked at IBM for more than 6 years on various products. He is involved in porting Tivoli Provisioning Manager to z/Linux, Go Symphony plugin development, and SCEplus development.



18 September 2012


High availability (HA) — a term frequently associated with cloud infrastructure solutions — essentially means business continuity with minimal machine downtime. Specifically, HA enablement in any cloud infrastructure should have these objectives:

  • Reducing planned downtime
  • Preventing unplanned downtime
  • Rapid recovery from outages
  • Continuous availability

The modern hypervisors that underlie cloud infrastructures provide most of the features and functionality that make it possible to achieve HA. This article briefly explains how IBM SmartCloud Enterprise+ reduces planned downtime, prevents unplanned downtime, recovers from outages, and ensures continuous server availability. It then describes the HA implementation in IBM SmartCloud Enterprise+ for virtual machines (VMs) that run on VMware on IBM System x and for AIX logical partitions (LPARs) on IBM System p.

Reducing planned downtime

Planned downtime is usually scheduled for purposes of software maintenance or releases, upgrades, or scheduled equipment repairs. Most cloud providers schedule some planned downtime, but because their business is based on providing high uptime, planned downtime is kept to a minimum.

IBM SmartCloud Enterprise+ patches VMs automatically with security and non-security updates to the OS. It deploys the updates in predefined regular cycles, with the customer determining which VMs are patched in each cycle, and requires no manual intervention. This fully automated patching reduces planned downtime considerably and keeps the VMs available for business use most of the time.

Preventing unplanned downtime

Unplanned server downtime in a cloud environment can have several causes. Chief among them are failures in hypervisor infrastructure, OS failures, and network failures.

IBM SmartCloud Enterprise+ handles most of these common failures with minimal downtime. As you'll read later in this article, monitoring agents on System x and a native daemon on System p can detect OS failures, and VMware heartbeat time intervals on System x and native daemons on System p can detect network failures.

Rapid recovery from outages

With outages that are due to unplanned downtime, recovery speed depends on the nature of the failure. Outages can result from host-platform failures or storage failures, as well as from OS or network failures. Outages caused by host-platform or storage failure can result in a higher magnitude of data and runtime loss if the cloud provider hasn't planned for them adequately.

Failover mechanisms in IBM SmartCloud Enterprise+ enable quick recovery from host-platform and storage failures. All the workload on a failed host platform is distributed to other host platforms with minimal downtime. Storage failures are handled with mirrored datastores. All the data in a VM is replicated in two datastores; if one datastore fails, the VM can be up and running with the duplicate datastore.

Continuous availability

Reducing planned and unplanned downtime and recovering rapidly from outages all contribute to continuous availability, whereby the server (in a Platform-as-a-Service cloud) is live most of the time, with minimal downtime. Continuous availability can be achieved by:

  • Configuring the HA features in the underlying hypervisors properly
  • Using OS-provided features for some failure detection
  • Running monitoring services that watch the OS for failures
  • Monitoring applications for application high availability

IBM SmartCloud Enterprise+ uses most of the hypervisor-provided HA features, such as failover mechanisms over host platforms, restart priority, heartbeat intervals, OS monitoring and failure detection, and crash detection.

SLA-based HA

HA configuration in IBM SmartCloud Enterprise+ is based on the service-level agreement (SLA) that you as a customer choose for your particular VMs. For VMs on System x and System p platforms, IBM SmartCloud Enterprise+ defines four SLA levels:

  • Platinum
  • Gold
  • Silver
  • Bronze

The Platinum SLA has the highest restart priority and the shortest timeouts in error situations. Gold, Silver, and Bronze have progressively lower restart priorities and longer timeouts. The rest of this article explains these priorities and timeouts in detail. Note: The SLAs include an infrastructure component as well as a VM component. This article covers the VM component only.

Enabling HA for System x

The VMware vSphere High Availability (VMware HA) feature in IBM SmartCloud Enterprise+ enables automatic HA configuration for VMs provisioned on IBM System x. Its two key features are restart priority and heartbeat.

Restart priority

VM restart priority values resolve resource contention. The priority determines the preference that VMware HA gives to a VM if sufficient capacity is not available to power on all failed VMs. High-priority VMs on a host get preference over low-priority VMs.

The valid restart-priority values for a single VM's HA configuration are disabled, high, medium, and low.
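The following sketch illustrates how a restart priority can resolve contention when capacity is short. It is a simplified model under stated assumptions, not VMware HA's actual placement algorithm: the RestartPlanner class, the FailedVm type, and the notion of abstract capacity "units" are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified model of restart-priority ordering; NOT VMware HA's actual algorithm.
final class RestartPlanner {

    // Declared from highest to lowest preference, so ordinal() gives the restart order.
    enum Priority { HIGH, MEDIUM, LOW, DISABLED }

    // Hypothetical description of a failed VM and the capacity it needs.
    static final class FailedVm {
        final String name;
        final Priority priority;
        final int requiredUnits;

        FailedVm(String name, Priority priority, int requiredUnits) {
            this.name = name;
            this.priority = priority;
            this.requiredUnits = requiredUnits;
        }
    }

    // Returns the names of the VMs that can be powered on, in restart order.
    static List<String> plan(List<FailedVm> failed, int availableUnits) {
        List<FailedVm> ordered = new ArrayList<>(failed);
        // List.sort is stable, so VMs with equal priority keep their input order.
        ordered.sort(Comparator.comparingInt((FailedVm vm) -> vm.priority.ordinal()));

        List<String> restarted = new ArrayList<>();
        int remaining = availableUnits;
        for (FailedVm vm : ordered) {
            if (vm.priority == Priority.DISABLED) {
                continue; // HA restart is switched off for this VM
            }
            if (vm.requiredUnits <= remaining) {
                restarted.add(vm.name);
                remaining -= vm.requiredUnits;
            }
        }
        return restarted;
    }

    public static void main(String[] args) {
        List<FailedVm> failed = new ArrayList<>();
        failed.add(new FailedVm("bronze-vm", Priority.LOW, 4));
        failed.add(new FailedVm("platinum-vm", Priority.HIGH, 4));
        failed.add(new FailedVm("gold-vm", Priority.MEDIUM, 4));
        // Only 8 units are free, so one of the three VMs cannot be restarted.
        System.out.println(plan(failed, 8));
    }
}
```

In this model the low-priority VM is the one left powered off when capacity runs out, which is the preference that the restart priority is meant to express.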

Heartbeat time interval

The VMware VM health-monitoring feature in IBM SmartCloud Enterprise+ continuously checks for a response from a VM's heartbeat service, which runs in VMware Tools on every VM on a given host. If the heartbeat service does not respond to the health-monitoring service within a configurable timeout interval, the VM is considered failed, and the corresponding reset action is performed.

The shorter the heartbeat timeout interval, the sooner a failed VM is detected and reset. The heartbeat timeout interval is shortest for the Platinum SLA, followed by Gold, Silver, and Bronze.
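The timeout check described above can be sketched as follows. This is a minimal model, not the VMware implementation: the HeartbeatMonitor class and its method names are hypothetical, and the timestamps are passed in explicitly so that the logic is easy to follow. The 30-second and 180-second timeouts are the Platinum and Bronze values from Table 1.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of heartbeat-timeout detection; not the VMware implementation.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    HeartbeatMonitor(long timeoutSeconds) {
        this.timeoutMillis = timeoutSeconds * 1000L;
    }

    // Called whenever a heartbeat arrives from the VM.
    void recordHeartbeat(String vmName, long nowMillis) {
        lastHeartbeat.put(vmName, nowMillis);
    }

    // A VM counts as failed once no heartbeat has arrived within the timeout.
    boolean isFailed(String vmName, long nowMillis) {
        Long last = lastHeartbeat.get(vmName);
        if (last == null) {
            return false; // never seen: nothing to time out yet
        }
        return nowMillis - last > timeoutMillis;
    }

    public static void main(String[] args) {
        HeartbeatMonitor platinum = new HeartbeatMonitor(30);
        HeartbeatMonitor bronze = new HeartbeatMonitor(180);
        platinum.recordHeartbeat("vm-1", 0L);
        bronze.recordHeartbeat("vm-1", 0L);
        long after45s = 45_000L; // 45 seconds of silence from the VM
        System.out.println("Platinum reset needed: " + platinum.isFailed("vm-1", after45s));
        System.out.println("Bronze reset needed: " + bronze.isFailed("vm-1", after45s));
    }
}
```

After 45 seconds without a heartbeat, the Platinum timeout has already expired while the Bronze timeout has not, which is why a shorter interval leads to an earlier reset.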

Table 1 shows the restart-priority and heartbeat values configured for each SLA level:

Table 1. Restart priorities and heartbeat values for VMs on System x
SLA        Restart priority   Heartbeat timeout interval (seconds)*
Platinum   high               30
Gold       medium             60
Silver     low                120
Bronze     low                180

*IBM recommends these values. Vendors, cloud administrators, or users can adjust the timeout intervals, depending on the prevailing environment and workload conditions.

In addition to the settings in Table 1, Platinum VMs are allocated storage in the mirrored datastores, which keeps the VMs continuously available even if a storage device fails.
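One way to carry the Table 1 settings in code is a simple lookup keyed by SLA level. The sketch below is illustrative only: the SystemXSla enum and its accessor names are hypothetical, and the values are the IBM-recommended ones from Table 1, with the mirrored-datastore flag set for Platinum only.

```java
// Table 1 as a lookup. The enum and its accessor names are hypothetical.
enum SystemXSla {
    PLATINUM("high", 30, true),
    GOLD("medium", 60, false),
    SILVER("low", 120, false),
    BRONZE("low", 180, false);

    private final String restartPriority;
    private final int heartbeatTimeoutSeconds;
    private final boolean mirroredDatastore;

    SystemXSla(String restartPriority, int heartbeatTimeoutSeconds, boolean mirroredDatastore) {
        this.restartPriority = restartPriority;
        this.heartbeatTimeoutSeconds = heartbeatTimeoutSeconds;
        this.mirroredDatastore = mirroredDatastore;
    }

    String restartPriority() {
        return restartPriority;
    }

    int heartbeatTimeoutSeconds() {
        return heartbeatTimeoutSeconds;
    }

    boolean usesMirroredDatastore() {
        return mirroredDatastore;
    }

    public static void main(String[] args) {
        // Print the settings for every SLA level.
        for (SystemXSla sla : values()) {
            System.out.println(sla + ": restart priority " + sla.restartPriority()
                    + ", heartbeat timeout " + sla.heartbeatTimeoutSeconds() + " s"
                    + ", mirrored datastore " + sla.usesMirroredDatastore());
        }
    }
}
```

Because the timeouts are recommendations that administrators can adjust, a real deployment would more likely read these values from configuration than hard-code them.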

Implementation with VMware vSphere APIs

VMware provides flexible and easy-to-use vSphere APIs for programmatically configuring required HA configuration settings.

To configure the restart priority, IBM SmartCloud Enterprise+ uses the reconfigureComputeResource_Task API and a number of vSphere data objects. The code segment in Listing 1 shows that the ClusterConfigSpecEx data object is passed to the reconfigureComputeResource_Task method of the VIPort interface:

Listing 1. Configuring restart priority programmatically
// Types come from the vSphere Web Services SDK (com.vmware.vim25).
// Assumed to be supplied by the caller: con (an established connection),
// clsMor (cluster MOR), vmMor (VM MOR), and restartPriority (from the SLA).

// Initialize the ClusterConfigSpecEx data object and the subobjects
// required for setting the restart priority.
ClusterConfigSpecEx spec = new ClusterConfigSpecEx();
ClusterDasVmConfigSpec[] clusterDasVmConfigSpec = new ClusterDasVmConfigSpec[1];
clusterDasVmConfigSpec[0] = new ClusterDasVmConfigSpec();
spec.setDasVmConfigSpec(clusterDasVmConfigSpec);
ClusterDasVmConfigInfo clusterDasVmConfigInfo = new ClusterDasVmConfigInfo();
clusterDasVmConfigSpec[0].setInfo(clusterDasVmConfigInfo);
ArrayUpdateOperation arrayUpdateOperation = ArrayUpdateOperation.add;
clusterDasVmConfigSpec[0].setOperation(arrayUpdateOperation);

// The VM managed object reference (MOR) must be provided as the key for ClusterDasVmConfigInfo.
clusterDasVmConfigInfo.setKey(vmMor);

// Set the restart priority for the VM.
ClusterDasVmSettings clusterDasVmSettings = new ClusterDasVmSettings();
clusterDasVmConfigInfo.setDasSettings(clusterDasVmSettings);

// The restart priority value is obtained from the SLA (see Table 1).
clusterDasVmSettings.setRestartPriority(restartPriority);

// Apply the change to the cluster; the final argument (true) requests an incremental update.
ManagedObjectReference taskMor
   = con._service.reconfigureComputeResource_Task(clsMor, spec, true);

Configuring the heartbeat interval also uses the reconfigureComputeResource_Task API and a number of vSphere data objects. The code segment in Listing 2 shows the ClusterConfigSpecEx data object being passed to the VIPort interface's reconfigureComputeResource_Task method:

Listing 2. Configuring the heartbeat interval programmatically
// Types come from the vSphere Web Services SDK (com.vmware.vim25).
// Assumed to be supplied by the caller: con (an established connection),
// clsMor (cluster MOR), vmMor (VM MOR), and heartbeatInterval (from the SLA).

// Initialize the ClusterConfigSpecEx data object and the subobjects
// required for setting the heartbeat interval.
ClusterConfigSpecEx spec = new ClusterConfigSpecEx();
ClusterDasVmConfigSpec[] clusterDasVmConfigSpec = new ClusterDasVmConfigSpec[1];
clusterDasVmConfigSpec[0] = new ClusterDasVmConfigSpec();
spec.setDasVmConfigSpec(clusterDasVmConfigSpec);
ClusterDasVmConfigInfo clusterDasVmConfigInfo = new ClusterDasVmConfigInfo();
clusterDasVmConfigSpec[0].setInfo(clusterDasVmConfigInfo);
ArrayUpdateOperation arrayUpdateOperation = ArrayUpdateOperation.add;
clusterDasVmConfigSpec[0].setOperation(arrayUpdateOperation);

// The VM managed object reference (MOR) must be provided as the key for ClusterDasVmConfigInfo.
clusterDasVmConfigInfo.setKey(vmMor);

// Set the heartbeat interval for the VM.
ClusterDasVmSettings clusterDasVmSettings = new ClusterDasVmSettings();
clusterDasVmConfigInfo.setDasSettings(clusterDasVmSettings);
ClusterVmToolsMonitoringSettings clusterVmToolsMonitoringSettings =
   new ClusterVmToolsMonitoringSettings();
clusterDasVmSettings.setVmToolsMonitoringSettings(clusterVmToolsMonitoringSettings);

// The heartbeat interval is obtained from the SLA (see Table 1).
clusterVmToolsMonitoringSettings.setFailureInterval(heartbeatInterval);

// Apply the change to the cluster; the final argument (true) requests an incremental update.
ManagedObjectReference taskMor
   = con._service.reconfigureComputeResource_Task(clsMor, spec, true);

Enabling HA for System p

The properties that enable HA on System p are priority hang detection, lost I/O hang detection, and crash detection.

Table 2 shows the values configured for these HA features in each SLA:

Table 2. SLA settings for HA-enabling properties on System p
SLA        Priority problem timeout (seconds)*   Lost I/O timeout (seconds)*   Crash detection
Platinum   10                                    20                            enable
Gold       20                                    40                            enable
Silver     35                                    60                            enable
Bronze     60                                    180                           enable

*IBM recommends these values. Vendors, cloud administrators, or users can adjust the timeout intervals, depending on the prevailing environment and workload conditions.

Priority hang detection

All processes (or, more precisely, their threads) run at a priority. This priority is in the range 0-126, with 0 the highest priority and 126 the lowest. The default priority for all threads is 60. Any user can lower the priority of a process by using the nice command. Anyone with root authority can also raise a process's priority.

The kernel scheduler always picks the highest-priority runnable thread to put on a CPU. It is therefore possible for a sufficient number of high-priority threads to completely tie up the machine such that low-priority threads can never run. If the running threads are at a priority higher than the default of 60, this can lock out all normal shells and logins to the point where the system appears hung. The system hang detection feature provides a mechanism to detect this situation and give the system administrator a means to recover from it. This feature is implemented as a daemon (shdaemon) that runs at the highest process priority. This daemon queries the kernel for the lowest-priority thread run over a specified interval. If the priority is above a configured threshold, the daemon can take one of several actions. Each of these actions can be independently enabled, and each can be configured to trigger at any priority and over any time interval.
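Because the priority scale is numerically inverted, the detection rule is easy to misread. The sketch below models only the comparison described above; it is not shdaemon's code, and the PriorityHangCheck class, the looksHung method, and the idea of passing in the observed priorities as an array are hypothetical simplifications.

```java
// Simplified model of the priority-hang check described above; not shdaemon's code.
final class PriorityHangCheck {

    // Priorities are numerically inverted: 0 is the highest, 126 the lowest.
    // observedPriorities holds the priority of each thread that ran in the interval.
    static boolean looksHung(int[] observedPriorities, int thresholdPriority) {
        if (observedPriorities.length == 0) {
            return false; // nothing observed, nothing to conclude
        }
        int lowestPriorityRun = observedPriorities[0];
        for (int priority : observedPriorities) {
            // A larger number means a lower priority.
            lowestPriorityRun = Math.max(lowestPriorityRun, priority);
        }
        // "Above the threshold" in priority terms means numerically smaller:
        // no thread at or below the threshold priority got any CPU time.
        return lowestPriorityRun < thresholdPriority;
    }

    public static void main(String[] args) {
        int threshold = 60; // the default thread priority
        int[] starved = {20, 35, 40};   // only high-priority threads ran
        int[] healthy = {20, 60, 90};   // default and low-priority threads also ran
        System.out.println("Starved interval hung: " + looksHung(starved, threshold));
        System.out.println("Healthy interval hung: " + looksHung(healthy, threshold));
    }
}
```

In the first interval the lowest-priority thread that ran was at priority 40, which is a higher priority than the threshold of 60, so threads at the default priority were locked out for the whole interval.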

System hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system using the shconf command:

$: shconf -l prio -a pp_reboot=enable -a pp_rto=<priority hang timeout based on SLA>

Lost I/O hang detection

AIX can also detect I/O hang conditions and try to recover from them, based on user-defined actions.

I/O errors can cause the I/O path to become blocked, further affecting I/O on that path. In these circumstances it is essential that the OS alert the user and execute user-defined actions. As part of lost I/O detection and notification, the shdaemon, with the help of the Logical Volume Manager, monitors the I/O buffers over a period of time and checks whether any I/O has been pending for too long. If the wait time exceeds the threshold defined in the shconf file, a lost I/O is detected and further actions are taken. Information about the lost I/O is recorded in the error log. Depending on the settings in the shconf file, the system might also be rebooted to recover from the lost I/O situation.
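The threshold check at the heart of lost I/O detection can be sketched in a few lines. This is a simplified model and not the actual shdaemon or Logical Volume Manager code: the LostIoCheck class, the request identifiers, and the map of issue times are hypothetical. The 20-second timeout used in the example is the Platinum value from Table 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified model of lost I/O detection; not the actual shdaemon or LVM code.
final class LostIoCheck {

    // pendingSinceMillis maps a hypothetical request id to the time it was issued.
    // Returns the id of the first request pending longer than the timeout, or null.
    static String findLostIo(Map<String, Long> pendingSinceMillis,
                             long nowMillis, long timeoutSeconds) {
        long timeoutMillis = timeoutSeconds * 1000L;
        for (Map.Entry<String, Long> entry : pendingSinceMillis.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Long> pending = new LinkedHashMap<>();
        pending.put("write-17", 0L);       // issued at t = 0 s
        pending.put("read-42", 25_000L);   // issued at t = 25 s
        long now = 30_000L;                // current time: t = 30 s
        // With a 20-second timeout, write-17 has been pending too long.
        System.out.println("Lost I/O: " + findLostIo(pending, now, 20));
    }
}
```

A real detector would then log the event and, if so configured, trigger the reboot action.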

Lost I/O hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system, via the shconf command:

$: shconf -l lio -a lio_reboot=enable -a lio_to=<lost I/O timeout based on SLA>

Crash detection

If the OS crashes, an automatic restart should be enabled for continuity. Crash detection in IBM SmartCloud Enterprise+ enables a reboot by using the chdev command to set the autorestart attribute of the sys0 system device to true.

$: chdev -l sys0 -a autorestart=true
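The three System p settings can be generated per SLA level from the Table 2 values. The sketch below only assembles the command strings shown in this article and does not execute anything. The SystemPSla enum and its haCommands method are hypothetical names, and the sketch assumes that the priority detector passed to shconf -l is named prio, as in the AIX shconf documentation.

```java
import java.util.Arrays;
import java.util.List;

// Table 2 as a lookup that renders the HA commands for one SLA level.
// The enum and method names are hypothetical; no command is executed here.
enum SystemPSla {
    PLATINUM(10, 20),
    GOLD(20, 40),
    SILVER(35, 60),
    BRONZE(60, 180);

    private final int priorityTimeoutSeconds;
    private final int lostIoTimeoutSeconds;

    SystemPSla(int priorityTimeoutSeconds, int lostIoTimeoutSeconds) {
        this.priorityTimeoutSeconds = priorityTimeoutSeconds;
        this.lostIoTimeoutSeconds = lostIoTimeoutSeconds;
    }

    // Returns the priority hang, lost I/O, and crash detection commands, in that order.
    List<String> haCommands() {
        return Arrays.asList(
                "shconf -l prio -a pp_reboot=enable -a pp_rto=" + priorityTimeoutSeconds,
                "shconf -l lio -a lio_reboot=enable -a lio_to=" + lostIoTimeoutSeconds,
                "chdev -l sys0 -a autorestart=true");
    }

    public static void main(String[] args) {
        // Print the commands that would be run for a Gold VM.
        for (String command : GOLD.haCommands()) {
            System.out.println(command);
        }
    }
}
```

Keeping the timeouts in one table-like structure makes it straightforward to regenerate the commands when an administrator adjusts the recommended values.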

Conclusion

The HA features discussed in this article make IBM SmartCloud Enterprise+ one of the most reliable cloud offerings in the enterprise market, promising business continuity at all times. If your HA requirements change, it's easy to scale from a lower SLA level to a higher one.
