AIX kernel recovery

Beginning with AIX 6.1, the kernel can optionally recover from errors in selected routines, avoiding an unplanned system outage.

Kernel recovery is disabled by default. If kernel recovery is enabled, the system might pause for a short time during a kernel recovery action. This time is generally less than two seconds. The following actions occur immediately after a kernel recovery action:
  • The system console displays the following message:
    ------------------------------------------------------------------------- 
            A kernel error recovery action has occurred. A recovery log
             has been logged in the system error log. 
    -------------------------------------------------------------------------
  • AIX adds an entry to the error log. You can send the error log data to IBM® for service, similar to sending data from a full system termination. The following is a sample recovery error log entry:
    LABEL:          RECOVERY 
    Date/Time:       Fri Feb 16 14:04:17 CST 2007
    Type:            INFO
    Resource Name:   RMGR
    Description
    Kernel Recovery Action
    Detail Data
    Live Dump Base Name 
    RECOV_20070216200417_0000
    Function Name
    w_clear
    FRR Name
    w_init_clear_frr
    Symptom String 
    273
    EEEE00009627A072
    F10001001B18BBC0
    w_clear+D0
    wdog0030+288
    test_index+4C
    Recovery Log Data
    0001 0000 0000 0000 F000 0000 2FFC AEB0 0000 0111 0000 0000 0000 0000 0021 25BC
    8000 0000 0002 9032 EEEE 0000 9627 A072 F100 0100 1B18 BBC0 0000 0000 0000 0000
    0000 0001 0000 0000 0006 0057 D2FF 8C00 0001 0148 0500 0000 8000 0000 0002 9032
    .....
    
  • AIX generates a live dump. By default, live dump data is stored in the /var/adm/ras/livedump directory in a file named RECOV_timestamp_sequence, where timestamp specifies the time of the kernel recovery occurrence and sequence specifies the number of times that kernel recovery has been invoked. You can send live dump data to IBM for service, similar to sending data from a full system termination. For more information about live dumps, see live dumps in Kernel Extensions and Device Support Programming Concepts.
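
The RECOV_timestamp_sequence naming described above can be split into its parts with a short script, for example when sorting collected dumps before sending them for service. The following Python sketch is illustrative only. It assumes the layout seen in the sample base name RECOV_20070216200417_0000, that is, a 14-digit timestamp in YYYYMMDDhhmmss form followed by a numeric sequence field; that layout is inferred from the sample, not from a format specification. In the sample, the timestamp is six hours ahead of the CST Date/Time field, which suggests UTC, but that is also an inference. The function name parse_recovery_dump_name is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the sample base name
# RECOV_20070216200417_0000: 14-digit timestamp, then a numeric sequence.
_RECOV_NAME = re.compile(r"^RECOV_(\d{14})_(\d+)$")


def parse_recovery_dump_name(base_name):
    """Split a live dump base name into (timestamp, sequence).

    Hypothetical helper; raises ValueError if the name does not
    match the assumed RECOV_timestamp_sequence layout.
    """
    match = _RECOV_NAME.match(base_name)
    if match is None:
        raise ValueError("not a RECOV_timestamp_sequence name: %r" % base_name)
    # The timestamp is returned as a naive datetime; no time zone is applied.
    timestamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    sequence = int(match.group(2))
    return timestamp, sequence


if __name__ == "__main__":
    ts, seq = parse_recovery_dump_name("RECOV_20070216200417_0000")
    print(ts.isoformat(), seq)  # 2007-02-16T20:04:17 0
```

Names that carry a file extension or any other suffix are rejected by this sketch; strip the suffix first if your files have one.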

Attention: Some functions might be lost after a kernel recovery, but the operating system remains in a stable state. If necessary, shut down and restart your system to restore the lost functions.

Memory and processor considerations

AIX maintains data on the status of kernel recovery during mainline kernel operations. When kernel recovery is enabled, additional processor instructions are required to maintain the data and additional memory is required to save the data. The impact to processor usage is minimal. Additional memory consumption can be determined by the following equation, where maxthread is the maximum number of threads running on the system and procnum is the number of processors:

memory required = 4 KB x maxthread + 128 KB x procnum
As shown in the following example, a system with 16 processors and a maximum of 1000 threads consumes an additional 6048 KB:
4 x 1000 + 128 x 16 = 6048 KB
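
For capacity planning across several system sizes, the same equation can be evaluated in a few lines of Python. This is a minimal sketch of the arithmetic only: the function name kernel_recovery_memory_kb is hypothetical, and the two arguments correspond to maxthread and procnum in the equation.

```python
def kernel_recovery_memory_kb(maxthread, procnum):
    """Return the additional memory, in KB, used when kernel recovery is enabled.

    Applies the documented equation:
        memory required = 4 KB x maxthread + 128 KB x procnum
    """
    if maxthread < 0 or procnum < 0:
        raise ValueError("maxthread and procnum must be non-negative")
    return 4 * maxthread + 128 * procnum


if __name__ == "__main__":
    # 16 processors and a maximum of 1000 threads:
    # 4 x 1000 = 4000 KB, 128 x 16 = 2048 KB
    print(kernel_recovery_memory_kb(1000, 16))  # 6048
```

The per-thread term dominates on systems with many threads and few processors; in the 16-processor example, it accounts for 4000 KB of the total.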