Dynamic memory guarding
AIX systems are designed to be resilient in regards to memory errors. Memory error resilience is the result of both hardware and operating system-level recoveries.
There are multiple ways to categorize memory errors, but for the purposes of this discussion, memory errors are classified as recoverable and non-recoverable errors.
Recoverable errors result in data located in specific locations being retrievable, and unrecoverable errors result in a loss of data from the specific location in question. Unrecoverable errors are typically resolved by using hardware redundancy in the memory subsystem, or by masking the area in question from use during boot time of the operating system.
AIX supports resilience as a means of preventing recoverable memory errors from becoming unrecoverable errors through a technique known as Dynamic Memory Guarding. Dynamic Memory Guarding is based on support provided by the hardware. Hardware provides mechanisms for the detection of and recovery from errors (such as memory scrubbing and error correcting circuits (ECC)). Hardware can provide mechanisms for avoiding future unrecoverable errors as well, including redundant bit steering.
As a complement to these hardware mechanisms, the hardware can inform the operating system about errors best handled through Dynamic Memory Guarding. This is done by identifying areas of memory to be deallocated. The AIX operating system uses this information to mask off the memory area in question and to stop using it. The operating system will move any data currently contained in the memory area in error to another memory area, and then stop using the memory page that contains the memory location in error. This memory guarding is done by the operating system without any user intervention and is transparent to end users and applications.