SLA options for your VM in IBM SmartCloud Enterprise+ (Part 2)

Welcome back! In my earlier post, SLA options for your VM in IBM SmartCloud Enterprise+ (Part 1), we discussed the SLA types and how they define HA for VMware virtual machines. In this post I’ll discuss the SLA options and HA for LPARs on System p.

High availability (HA) of AIX LPARs in IBM SmartCloud Enterprise+ (SCE+) is achieved through configuration settings in the AIX operating system itself, which has mature mechanisms for this purpose. As part of HA enablement, SCE+ configures priority hang detection, lost I/O hang detection and crash detection. Let’s look at what these features are and how they enable an HA environment.

Priority hang detection

All processes (also known as threads) run at a priority. This priority is in the range 0-126, with 0 the highest priority and 126 the lowest. The default priority for all threads is 60. Any user can lower the priority of a process by using the nice command. Anyone with root authority can also raise a process’s priority.

The kernel scheduler always picks the highest-priority runnable thread to put on a CPU. A sufficient number of high-priority threads can therefore tie up the machine completely, so that low-priority threads never run. If the running threads are at a priority higher than the default of 60, this can lock out all normal shells and logins to the point where the system appears hung.

The system hang detection feature provides a mechanism to detect this situation and gives the system administrator a means to recover from it. The feature is implemented as a daemon (shdaemon) that runs at the highest process priority. The daemon queries the kernel for the lowest-priority thread that ran during a specified interval. If that priority is above a configured threshold, meaning that no thread at or below the threshold priority got to run, the daemon can take one of several actions. Each action can be enabled independently, and each can be configured to trigger at any priority and over any time interval.
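The decision rule described above can be illustrated with a small shell sketch. This is not how shdaemon is implemented: the function name priority_hang_detected and its arguments are made up for illustration, and the priorities are passed in by hand rather than queried from the kernel.

```shell
#!/bin/sh
# Illustrative model of the priority hang rule; not the shdaemon implementation.
# Usage: priority_hang_detected THRESHOLD PRIO [PRIO...]
# Each PRIO is the priority of a thread that ran during the interval
# (0 = highest priority, 126 = lowest).
priority_hang_detected() {
    threshold=$1
    shift
    lowest=-1   # numerically largest value seen = lowest-priority thread that ran
    for p in "$@"; do
        if [ "$p" -gt "$lowest" ]; then
            lowest=$p
        fi
    done
    # A numerically smaller value is a higher priority, so "above the
    # threshold" means lowest < threshold. No samples means no verdict.
    [ "$lowest" -ge 0 ] && [ "$lowest" -lt "$threshold" ]
}
```

For example, priority_hang_detected 60 10 20 30 succeeds because nothing at priority 60 or lower got CPU time, while priority_hang_detected 60 10 20 90 fails because a low-priority thread did run.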

System hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system, via the shconf command:

$: shconf -l prio -a pp_reboot=enable -a pp_rto=<priority hang timeout based on SLA>

Lost I/O hang detection

AIX can also detect I/O hang conditions and try to recover from them, based on user-defined actions.

I/O errors can cause the I/O path to become blocked, which affects all further I/O on that path. In these circumstances it is essential that the OS alert the user and execute user-defined actions. As part of lost I/O detection and notification, the shdaemon, with the help of the Logical Volume Manager, monitors the I/O buffers over a period of time and checks whether any I/O has been pending for too long. If the wait time exceeds the threshold defined in the shconf configuration, a lost I/O is detected and further actions are taken. Information about the lost I/O is recorded in the error log. Depending on the shconf settings, the system might also be rebooted to recover from the lost I/O situation.

Lost I/O hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system, via the shconf command:

$: shconf -l lio -a lio_reboot=enable -a lio_to=<lost I/O timeout based on SLA>
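Because the timeout differs per SLA, it can help to assemble and review the command before running it as root. The helper below is a hypothetical dry-run sketch: the name shconf_reboot_cmd is not part of AIX, it only prints the command rather than executing it, and it assumes the timeout is a non-negative whole number. The value 5 in the usage note is a placeholder, not an SCE+ SLA value.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints a shconf command instead of running it.
# Arguments: detection name, reboot attribute, timeout attribute, timeout value.
shconf_reboot_cmd() {
    name=$1
    reboot_attr=$2
    timeout_attr=$3
    timeout=$4
    # Assumption: the timeout must be a non-negative whole number.
    case "$timeout" in
        '' | *[!0-9]*)
            echo "timeout must be a non-negative whole number" >&2
            return 1
            ;;
    esac
    printf 'shconf -l %s -a %s=enable -a %s=%s\n' \
        "$name" "$reboot_attr" "$timeout_attr" "$timeout"
}
```

Calling shconf_reboot_cmd lio lio_reboot lio_to 5 prints the lost I/O command with the placeholder filled in, and a non-numeric timeout is rejected before anything reaches the system.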

Crash detection

If the OS crashes, it should restart automatically to maintain continuity. IBM SmartCloud Enterprise+ enables this with the chdev command, which sets the autorestart attribute of the system device (sys0) to true:

$: chdev -l sys0 -a autorestart=true
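To confirm that the change took effect, the attribute can be read back with lsattr. The sketch below is one possible check, not part of the SCE+ tooling: the function name autorestart_is_enabled is made up, and it assumes the usual lsattr layout in which the attribute name is the first field and its current value is the second.

```shell
#!/bin/sh
# Reads `lsattr -El sys0 -a autorestart` output on stdin and succeeds only
# if the current value of the autorestart attribute is "true".
# Assumption: field 1 is the attribute name and field 2 is its value.
autorestart_is_enabled() {
    awk '
        $1 == "autorestart" { value = $2 }
        END {
            if (value == "true") exit 0
            exit 1
        }
    '
}
```

On an AIX LPAR it could be used as: lsattr -El sys0 -a autorestart | autorestart_is_enabled && echo "autorestart is on"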

The table below shows the values configured for these HA features in each SLA:

[Table: SLA settings for HA-enabling properties on System p]

IBM recommends these values. Vendors, cloud administrators, or users can adjust the timeout intervals, depending on the prevailing environment and workload conditions.

For more details about these features, refer to the AIX Information Center.
