Welcome back! In my earlier post, SLA options for your VM in IBM SmartCloud Enterprise+ (Part 1), we discussed the SLA types and how they define HA for VMware virtual machines. In this post I’ll discuss the SLA options and HA for LPARs on System p.
High availability (HA) of AIX LPARs in IBM SmartCloud Enterprise+ (SCE+) is achieved through settings made directly in the AIX operating system, which has some of the finest mechanisms for this purpose. As part of HA enablement, SCE+ configures priority hang detection, lost I/O hang detection and crash detection. Let’s discuss what these features are and how they enable the HA environment.
Priority hang detection
All processes (more precisely, their threads) run at a priority. This priority is in the range 0-126, with 0 the highest priority and 126 the lowest. The default priority for all threads is 60. Any user can lower the priority of a process by using the nice command. Anyone with root authority can also raise a process’s priority.
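As a small illustration of lowering a priority with nice, the sketch below compares the nice value of the current shell with that of a child started through nice -n 5. It assumes a POSIX-style ps that accepts -o nice= and -p; the variable and function names are only for illustration, and the script reads the nice value rather than the AIX priority number itself.

```shell
#!/bin/sh
# Print the nice value of this shell ($$ is the shell's own PID).
current_nice() {
    ps -o nice= -p $$ | tr -d ' '
}

base=$(current_nice)

# Start a child shell with its nice value raised by 5 (that is, a lower
# scheduling priority) and let it report its own nice value.
lowered=$(nice -n 5 sh -c 'ps -o nice= -p $$' | tr -d ' ')

echo "baseline nice: $base, after nice -n 5: $lowered"
```

An ordinary user can only raise the nice value this way; giving nice a negative increment needs root authority, as described above.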
The kernel scheduler always picks the highest-priority runnable thread to put on a CPU. It is therefore possible for a sufficient number of high-priority threads to completely tie up the machine such that low-priority threads can never run. If the running threads are at a priority higher than the default of 60, this can lock out all normal shells and logins to the point where the system appears hung. The system hang detection feature provides a mechanism to detect this situation and give the system administrator a means to recover from it. This feature is implemented as a daemon (shdaemon) that runs at the highest process priority. This daemon queries the kernel for the lowest-priority thread run over a specified interval. If the priority is above a configured threshold, the daemon can take one of several actions. Each of these actions can be independently enabled, and each can be configured to trigger at any priority and over any time interval.
System hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system, via the shconf command:
$: shconf -l prio -a pp_reboot=enable -a pp_rto=<priority hang timeout based on SLA>
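Because the timeout differs per SLA, it can help to generate the command from a single input and review it before running it as root. The following is a minimal dry-run sketch, not the SCE+ tooling: prio_hang_cmd is a hypothetical helper name, the value 300 is a placeholder rather than a recommended setting, and the detection name prio is the one shconf uses for priority hang detection. The helper only prints the command; it never calls shconf.

```shell
#!/bin/sh
# Build (but do not run) the shconf command for priority hang detection.
# $1 = reboot timeout, in whatever unit shconf expects for pp_rto.
#      The value comes from your SLA; nothing here is a recommended setting.
prio_hang_cmd() {
    timeout=$1
    case $timeout in
        ''|*[!0-9]*)
            # Reject empty or non-numeric input before building the command.
            echo "timeout must be a whole number" >&2
            return 1
            ;;
    esac
    echo "shconf -l prio -a pp_reboot=enable -a pp_rto=$timeout"
}

# Example: print the command for a placeholder timeout of 300.
prio_hang_cmd 300
```

Printing the command first makes it easy to confirm the value against the SLA before changing a live LPAR.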
Lost I/O hang detection
AIX can also detect I/O hang conditions and try to recover from them, based on user-defined actions.
I/O errors can block an I/O path, which affects all further I/O on that path. In these circumstances it is essential that the OS alert the user and execute user-defined actions. As part of lost I/O detection and notification, the shdaemon, with the help of the Logical Volume Manager, monitors the I/O buffers over a period of time and checks whether any I/O has been pending for too long. If the wait time exceeds the threshold configured through shconf, a lost I/O is detected and further actions are taken. Information about the lost I/O is recorded in the error log. Depending on the shconf settings, the system might also be rebooted to recover from the lost I/O situation.
Lost I/O hang detection is configured in IBM SmartCloud Enterprise+ with an action to reboot the system, via the shconf command:
$: shconf -l lio -a lio_reboot=enable -a lio_to=<lost I/O timeout based on SLA>
Crash detection
If the OS crashes, an automatic restart should be enabled for continuity. Crash detection in IBM SmartCloud Enterprise+ enables this reboot via the chdev command, which sets the autorestart attribute of the system device (sys0) to true.
$: chdev -l sys0 -a autorestart=true
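To confirm that the change took effect, the attribute can be read back with lsattr. The sketch below is a hedged example rather than part of the SCE+ procedure: check_autorestart is a hypothetical helper name, and it assumes the -F value format option of lsattr to print only the attribute value. On any host that is not AIX it simply reports that the check was skipped.

```shell
#!/bin/sh
# Report whether automatic restart after a crash is enabled on this LPAR.
check_autorestart() {
    # lsattr and sys0 exist only on AIX, so skip the check elsewhere.
    if [ "$(uname -s)" != "AIX" ]; then
        echo "skipped: not an AIX host"
        return 0
    fi
    # Read only the value of the autorestart attribute of sys0.
    value=$(lsattr -E -l sys0 -a autorestart -F value)
    if [ "$value" = "true" ]; then
        echo "autorestart is enabled"
    else
        echo "autorestart is NOT enabled (current value: $value)"
        return 1
    fi
}

check_autorestart
```

The non-zero return code in the disabled case lets the check be used in a monitoring script or a post-provisioning validation step.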
The table below shows the values configured for these HA features in each SLA:
IBM recommends these values. Vendors, cloud administrators, or users can adjust the timeout intervals, depending on the prevailing environment and workload conditions.
For more details about these features, refer to the AIX Information Center.