SLA options for your VM in IBM SmartCloud Enterprise+ (Part 2)

Welcome back! In my earlier post, SLA options for your VM in IBM SmartCloud Enterprise+ (Part 1), we discussed the SLA types and how they define HA for VMware virtual machines. In this post I'll discuss the SLA options and HA for LPARs on System p.

High availability (HA) of AIX LPARs in IBM SmartCloud Enterprise+ (SCE+) is achieved through configuration of the AIX operating system itself, which has mature mechanisms for this purpose. SCE+ configures priority hang detection, lost I/O hang detection and crash detection as part of HA enablement. Let's look at what these features are and how they enable an HA environment.

Priority hang detection

All processes (also known as threads) run at a priority. This priority is numerically inverted: it lies in the range 0-126, with 0 the highest priority and 126 the lowest. The default priority for all threads is 60. Any user can lower the priority of a process by using the nice command. Anyone with root authority can also raise a process's priority.

The kernel scheduler always picks the highest-priority runnable thread to put on a CPU. It is therefore possible for a sufficient number of high-priority threads to completely tie up the machine such that low-priority threads can never run. If the running threads are at a priority higher than the default of 60, this can lock out all normal shells and logins to the point where the system appears hung. The system hang detection feature provides a mechanism to detect this situation and give the system administrator a means to recover from it. This feature is implemented as a daemon (shdaemon) that runs at the highest process priority. This daemon queries the kernel for the lowest-priority thread run over a specified interval. If the priority is above a configured threshold, the daemon can take one of several actions. Each of these actions can be independently enabled, and each can be configured to trigger at any priority and over any time interval.
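Because the priority numbers are inverted, the threshold check is easy to misread. Here is a minimal shell sketch of the comparison described above. The function name is_priority_hang and the strict comparison at the boundary are assumptions made for illustration; this is not the shdaemon implementation.

```shell
#!/bin/sh
# Illustrative sketch only, not the shdaemon source.
# is_priority_hang LOWEST_SEEN THRESHOLD
#   LOWEST_SEEN: numeric priority of the lowest-priority thread that ran
#                during the interval (0 = highest priority, 126 = lowest).
#   THRESHOLD:   configured priority threshold.
# Returns 0 (true) when even the lowest-priority thread that ran was at a
# higher priority than the threshold, i.e. had a numerically smaller value.
# Assumption: the boundary case (equal values) is treated as "no hang".
is_priority_hang() {
    lowest_seen=$1
    threshold=$2
    [ "$lowest_seen" -lt "$threshold" ]
}

# Example: only threads at priority 40 or better ran; the threshold is 60.
if is_priority_hang 40 60; then
    echo "hang suspected: nothing at priority 60 or lower got CPU time"
fi
```

In the example, a thread at the default priority of 60 never got onto a CPU during the interval, which is exactly the situation in which normal shells and logins appear hung.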

System hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system, via the shconf command:

$: shconf -l prio -a pp_reboot=enable -a pp_rto=<priority hang timeout based on SLA>
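As a rough sketch of how this step might be scripted, the function below builds the command for a given reboot timeout and only prints it, so that it can be reviewed before anyone runs it as root on the LPAR. The helper name build_prio_hang_cmd and the example timeout value are hypothetical; the real timeout comes from the SLA.

```shell
#!/bin/sh
# Dry-run sketch: prints the shconf command instead of executing it.
# build_prio_hang_cmd is a hypothetical helper, not part of AIX.
build_prio_hang_cmd() {
    timeout=$1
    # Reject empty, non-numeric, or zero timeout values.
    case $timeout in
        ''|*[!0-9]*|0)
            echo "invalid timeout: '$timeout'" >&2
            return 1
            ;;
    esac
    echo "shconf -l prio -a pp_reboot=enable -a pp_rto=$timeout"
}

# Example with a placeholder value; substitute the timeout for your SLA.
build_prio_hang_cmd 5
```

Printing rather than executing keeps the sketch safe to try on any machine; on an AIX system the printed line could then be run directly once the value has been checked.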

Lost I/O hang detection

AIX can also detect I/O hang conditions and try to recover from them, based on user-defined actions.

I/O errors can cause the I/O path to become blocked, further affecting I/O on that path. In these circumstances it is essential that the OS alert the user and execute user-defined actions. As part of lost I/O detection and notification, the shdaemon, with the help of the Logical Volume Manager, monitors the I/O buffers over a period of time and checks whether any I/O has been pending for too long. If the wait time exceeds the threshold configured through shconf, a lost I/O is detected and further actions are taken. Information about the lost I/O is recorded in the error log. Depending on the shconf settings, the system might also be rebooted to recover from the lost I/O situation.
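The threshold check described above can be pictured with a small shell sketch. It is purely illustrative: the function name find_lost_io, the wait times, and the threshold are made up, and it does not reflect how the shdaemon or the Logical Volume Manager actually inspect I/O buffers.

```shell
#!/bin/sh
# Illustrative sketch only, not how shdaemon or the LVM is implemented.
# find_lost_io THRESHOLD WAIT...
#   Prints each pending-I/O wait time that exceeds THRESHOLD.
#   Returns 0 if at least one lost I/O was found, 1 otherwise.
find_lost_io() {
    threshold=$1
    shift
    found=1
    for wait_time in "$@"; do
        if [ "$wait_time" -gt "$threshold" ]; then
            echo "lost I/O: pending for $wait_time (threshold $threshold)"
            found=0
        fi
    done
    return $found
}

# Example: hypothetical wait times checked against a threshold of 600.
find_lost_io 600 12 750 600 || echo "no lost I/O detected"
```

Only the 750 entry is reported, because a wait time equal to the threshold does not exceed it.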

Lost I/O hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system, via the shconf command:

$: shconf -l lio -a lio_reboot=enable -a lio_to=<lost I/O timeout based on SLA>

Crash detection

If the OS crashes, an automatic restart should be enabled for continuity. Crash detection in IBM SmartCloud Enterprise+ enables this reboot via the chdev command, which sets the autorestart attribute of the system device (sys0) to true.

$: chdev -l sys0 -a autorestart=true
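To confirm the setting afterwards, the attribute can be read back with lsattr. The sketch below assumes the usual lsattr output layout (attribute name, then current value, then description); the helper name autorestart_value and the sample line are illustrative, so check the output on your own system.

```shell
#!/bin/sh
# autorestart_value is a hypothetical helper. It reads lsattr-style output
# on stdin and prints the value column of the autorestart line.
# Assumed layout: <attribute> <value> <description...> <user-settable>
autorestart_value() {
    awk '$1 == "autorestart" { print $2 }'
}

# On an AIX LPAR this would be fed from the real command:
#   lsattr -E -l sys0 -a autorestart | autorestart_value
# Here a sample line is used so the sketch runs anywhere.
sample='autorestart true Automatically REBOOT OS after a crash True'
printf '%s\n' "$sample" | autorestart_value
```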

The table below shows the values configured for these HA features in each SLA:

[Table: SLA settings for HA-enabling properties on System p]

IBM recommends these values. Vendors, cloud administrators, or users can adjust the timeout intervals, depending on the prevailing environment and workload conditions.

For more details about the above features refer to the AIX infocenter.
