SLA options for your VM in IBM SmartCloud Enterprise+ (Part 2)

Welcome back! In my earlier post, SLA options for your VM in IBM SmartCloud Enterprise+ (Part 1), we discussed the SLA types and how they define the HA for VMware virtual machines. In this post I’ll discuss the SLA options and HA for LPARs on System p.

High availability (HA) of AIX LPARs in IBM SmartCloud Enterprise+ (SCE+) is achieved through configuration settings in the AIX operating system itself, which has some of the finest mechanisms for this purpose. SCE+ configures priority hang detection, lost I/O hang detection and crash detection as part of HA enablement. Let’s discuss what these features are and how they enable the HA environment.

Priority hang detection

All processes (more precisely, their threads) run at a priority. This priority is in the range 0-126, with 0 the highest priority and 126 the lowest. The default priority for all threads is 60. Any user can lower the priority of a process with the nice command. Anyone with root authority can also raise a process’s priority.
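As a small illustration of lowering a priority, the sketch below wraps nice in a helper. The helper name run_low_priority is hypothetical and exists only for this example. It assumes the standard nice -n form, where a positive increment makes the command less favored by the scheduler.

```shell
# run_low_priority is a hypothetical helper name, not an AIX command.
# Usage: run_low_priority INCREMENT COMMAND [ARGS...]
# A positive increment lowers the scheduling priority of COMMAND;
# any user may do this, while raising priority needs root authority.
run_low_priority() {
    rlp_increment=$1
    shift
    nice -n "$rlp_increment" "$@"
}

# Example: run a harmless command at a lowered priority.
run_low_priority 10 echo "running at lowered priority"
```

The wrapper adds nothing beyond nice itself. It only makes the direction of the change explicit at the call site.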

The kernel scheduler always picks the highest-priority runnable thread to put on a CPU. It is therefore possible for a sufficient number of high-priority threads to tie up the machine so completely that low-priority threads never run. If the running threads are at a priority higher than the default of 60, this can lock out all normal shells and logins to the point where the system appears hung.

The system hang detection feature provides a mechanism to detect this situation and gives the system administrator a means to recover from it. The feature is implemented as a daemon (shdaemon) that runs at the highest process priority. The daemon queries the kernel for the lowest-priority thread that ran over a specified interval. If that priority is higher than a configured threshold (that is, numerically lower), the daemon can take one of several actions. Each of these actions can be enabled independently, and each can be configured to trigger at any priority and over any time interval.
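The check described above can be sketched as a small shell function. This is only an illustration of the comparison, not how shdaemon is implemented. The function name is_priority_hang and the sample values are assumptions made for the example.

```shell
# Hypothetical helper: decide whether an interval looks like a priority hang.
# $1 = numeric priority of the lowest-priority thread that ran in the interval
# $2 = configured threshold (60 is the default thread priority)
# Smaller numbers mean higher priority, so a value below the threshold means
# that no thread at or below the threshold priority got any CPU time.
is_priority_hang() {
    [ "$1" -lt "$2" ]
}

# Example with made-up numbers: nothing less favored than priority 40 ran.
if is_priority_hang 40 60; then
    echo "hang suspected: only threads above priority 60 ran"
fi
```

With a lowest observed priority of 60 or more, threads at the default priority are still being scheduled, so the function reports no hang.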

System hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system, via the shconf command:

$: shconf -l prio -a pp_reboot=enable -a pp_rto=<priority hang timeout based on SLA>

Lost I/O hang detection

AIX can also detect I/O hang conditions and try to recover from them, based on user-defined actions.

I/O errors can cause the I/O path to become blocked, further affecting I/O on that path. In these circumstances it is essential that the OS alert the user and execute user-defined actions. As part of lost I/O detection and notification, the shdaemon, with the help of the Logical Volume Manager, monitors the I/O buffers over a period of time and checks whether any I/O has been pending for too long. If the wait time exceeds the threshold configured through shconf, a lost I/O is detected and further actions are taken. Information about the lost I/O is recorded in the error log. Depending on the shconf settings, the system might also be rebooted to recover from the lost I/O situation.
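The threshold comparison can be illustrated with a short shell sketch. It is a simplified model of the idea, not the shdaemon or Logical Volume Manager logic. The function name find_lost_io, the wait times and the 600-second threshold are all made up for the example.

```shell
# Hypothetical helper: report pending I/O wait times that exceed a threshold.
# $1 = threshold in seconds; remaining arguments = wait times in seconds.
# Prints each wait time over the threshold; returns 0 if any was found.
find_lost_io() {
    flio_threshold=$1
    shift
    flio_found=1
    for flio_wait in "$@"; do
        if [ "$flio_wait" -gt "$flio_threshold" ]; then
            echo "$flio_wait"
            flio_found=0
        fi
    done
    return "$flio_found"
}

# Example with made-up wait times and a made-up 600-second threshold.
if find_lost_io 600 12 745 3 >/dev/null; then
    echo "lost I/O detected: log the event and apply the configured action"
fi
```

In this model a wait time equal to the threshold does not count as lost, because the text describes the wait as exceeding the threshold.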

Lost I/O hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system, via the shconf command:

$: shconf -l lio -a lio_reboot=enable -a lio_to=<lost I/O timeout based on SLA>

Crash detection

If the OS crashes, it should restart automatically to maintain continuity. IBM SmartCloud Enterprise+ enables this with the chdev command, which sets the autorestart attribute of the system device sys0 to true:

$: chdev -l sys0 -a autorestart=true
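To review the three settings together before applying them, a dry-run helper along these lines may be useful. It only prints the commands and runs nothing. The function name print_ha_commands is hypothetical, and the timeout values in the example are placeholders rather than IBM's per-SLA values. The priority problem name is written as prio, which is how the AIX shconf documentation spells it.

```shell
# Hypothetical dry-run helper: prints the three HA commands from this post
# without executing them.
# $1 = priority hang reboot timeout, $2 = lost I/O timeout
# Both values are supplied by the caller; none are taken from an SLA table.
print_ha_commands() {
    for pha_value in "$1" "$2"; do
        case "$pha_value" in
            ''|*[!0-9]*)
                echo "timeouts must be whole numbers" >&2
                return 1
                ;;
        esac
    done
    echo "shconf -l prio -a pp_reboot=enable -a pp_rto=$1"
    echo "shconf -l lio -a lio_reboot=enable -a lio_to=$2"
    echo "chdev -l sys0 -a autorestart=true"
}

# Example with placeholder timeouts (not real SLA values).
print_ha_commands 5 600
```

Printing instead of executing keeps the sketch safe to try on any machine. On an AIX LPAR the printed lines would still need to be run with root authority.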

The table below shows the values configured for these HA features in each SLA:

Table: SLA settings for HA-enabling properties on System p

IBM recommends these values. Vendors, cloud administrators, or users can adjust the timeout intervals, depending on the prevailing environment and workload conditions.

For more details about these features, refer to the AIX Information Center.
