If you have been doing IBM® AIX® systems administration or SAN administration for any length of time, you are probably intimately familiar with disk errors, file system problems, and Logical Volume Manager (LVM) failures. What should you do when one of these situations arises? Or, better yet, how do you prevent them from happening in the first place?
This article looks at such situations: when good disks go bad. It starts with an overview of disk errors and how to categorize them. Then, it moves on to hardware concepts and ways of architecting well-designed, redundant environments. From there, it discusses solutions for handling those situations when crises do arise.
Categorizing disk errors
I use two main areas to categorize disk errors on AIX systems: impact and duration. Impact measures the potency of the disk errors and how they affect servers. In other words, "How bad is this going to hurt?" Duration measures the length of time or persistence of the disk errors plus recovery time—or, "How long is this going to hurt?"
Impact can be broken into four main levels:
- Loss of availability - A loss of availability occurs when storage resources go offline or are disconnected from their managing servers. The data on the disks is not compromised, but the disks cannot be accessed. Examples include file systems being unmounted or Fibre Channel adapters being disconnected.
- Loss of data - Data cannot be written to or read from a disk because of a logical or physical problem. Examples include LVM write errors.
- Loss of data across multiple disks - In this instance, it is not just one disk that has encountered a loss of data but a number of disks. This situation typically occurs when logical volumes are striped across disks and one fails.
- Loss of data across multiple servers - With the widespread use of SAN technology, it is possible for a single piece of disk hardware to be compromised to the point where multiple servers are affected with a loss of data.
Duration can likewise be broken down into four main levels:
- Temporary - This type of disk error is the rare, one-off hiccup that poses no true threat. It shows up once in the server's errpt facility and then is gone. Examples include a bad block reallocation.
- Intermittent - Intermittent errors show up on an irregular basis and can be indicative of a nascent problem, such as when a hard disk logs a series of write errors, showing that the drive may fail.
- Regular - As if scheduled by a cron job itself, problems that occur on a weekly, daily, hourly, or minute-by-minute interval pose a serious risk to servers and can have widespread detrimental effects.
- Permanent - There is no easy or feasible way to come back from this type of error. Short of replacing hardware, you cannot recover from this situation.
By cross-referencing these two metrics in a table, you can get a good picture of the criticality of disk errors and how they affect servers. Figure 1 provides an example of such a table.
Figure 1. Cross-referencing impact and duration of disk errors
Figure 1 shows a four by four table. The columns represent the duration of a problem, increasing in time from left to right. The rows represent the impact of a problem, increasing in severity from the bottom to the top. The cells in the table are color-coded along the spectrum, moving from blue and green in the lower-left corner, indicating lesser degrees of problems (such as temporary loss of availability) to orange and red in the upper-right corner (indicating higher degrees of problems such as permanent loss of data across multiple servers).
In my experience, anything beyond the green area is a severe and serious issue and will likely cause loss of productivity or business. I have only seen a few instances when servers have gotten up into the yellow areas, and those came with drastic consequences.
It's never a matter of if a disk failure will occur but rather a question of when it will happen. No disk has ever been guaranteed to work indefinitely. The goal of any good systems administrator is to avoid being a victim of the mean time between failure value of hardware and find a way to mitigate the risk of a disk failure.
The three main objectives for any AIX or SAN administrator are to maximize availability, performance, and redundancy. You'll want your storage environment to be available—both to ensure that the data can be accessed on demand and to ensure that there is sufficient disk space to contain all the data. The disk has to have good performance so that applications do not get held up by any I/O wait. And the disk needs to have redundancy so that a failure of resources does not impair the server's ability to function.
Typically, each maximization has a trade-off with at least one of the other areas. A storage environment that maximizes availability and performance usually skimps on redundancy, because the disk resources are optimized for speed and for using every last byte available. One that focuses on availability and redundancy will likely have slower reads and writes, because the objective is long-term stability. And a solution that is heavy on performance and redundancy takes up more space to achieve that high I/O and to double up on reads and writes, decreasing availability in terms of how much space is left.
With AIX, there are many practical ways of putting preventative measures in place. Here are a few general concepts that every administrator should know:
- Avoid single points of failure. Never, ever build an environment where the loss of a solitary resource would impair the environment. Such an architecture would include a single hard disk, a single Fibre Channel adapter, or a single power source for any piece of equipment. Inevitably, that resource will die at the most inopportune time.
- RAID technology is a great way to maximize resources. Many years ago, engineers developed a way of gathering cheap storage devices into larger groupings through RAID technology. AIX has several levels of RAID technology incorporated into it at no additional cost; these can be employed at the software level, such as striping (RAID 0) and mirroring (RAID 1). Depending on the type of disk subsystem in use, other options may be available, such as striping with parity (RAID 5), mirrored stripes (RAID 0+1), or striped mirrors (RAID 1+0/RAID 10).
- Use effective LVM strategies to segregate data. The worst mistake an administrator can make is to put all of a server's resources (operating system, third-party applications, paging space, and application data) within a single volume group. Doing so has all sorts of bad consequences, including poor performance, massive system backups, reduced manageability, and increased odds of failure. Each facet of the server should be evaluated, isolated, and put into its own volume group and type of storage. For example, a large database server might be designed to have internal mirrored rootvg disks, SAN storage for application data and paging space, and some solid-state disks for archive logging and high-I/O transactions (a sketch follows this list).
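As a concrete illustration of that segregation, the sketch below creates a dedicated, mirrored volume group for application data. All of the names (hdisk2, hdisk3, datavg, datalv, /data) and the sizes are hypothetical placeholders, not a prescription:

    mkvg -y datavg hdisk2 hdisk3            # dedicated volume group on two disks
    mklv -y datalv -t jfs2 -c 2 datavg 64   # 64-LP logical volume with two copies (RAID 1)
    crfs -v jfs2 -d datalv -m /data -A yes  # create the file system and mount it at boot
    mount /data

Keeping application data in its own volume group like this keeps rootvg backups small and lets the data disks be exported, moved, or re-mirrored independently of the operating system.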
Let's look at strategies for various types of storage used on AIX servers.
Internal hard drives
The most common form of storage in AIX, internal hard drives are typically used for root volume group disks and servers with smaller footprints. When using internal hard drives, the first step should always be to have at least two disks per volume group and to mirror the hard drives with the mirrorvg command. If the server is a large IBM System p® machine, choose disks across multiple drawers to maximize redundancy in case a piece of hardware such as a backplane fails. Also, to optimize performance, it's wise to examine the layout of the logical volumes on the disk with the lspv -l and lspv -p commands to keep higher-I/O areas on the outside edge of the disks and logical volumes contiguous.
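A minimal sketch of that procedure, assuming a second internal disk named hdisk1 has just been added alongside hdisk0 in rootvg (the disk names are examples):

    extendvg rootvg hdisk1               # add the second disk to the root volume group
    mirrorvg rootvg hdisk1               # create a second copy of every logical volume
    bosboot -ad /dev/hdisk1              # rebuild the boot image on the new mirror
    bootlist -m normal hdisk0 hdisk1     # allow the system to boot from either disk
    lspv -l hdisk0                       # list the logical volumes placed on the disk
    lspv -p hdisk0                       # show the partition layout by disk region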
Small SAN storage
Smaller storage subsystems, such as direct-attached IBM FAStT disk drawers or older small SAN technology, are affordable solutions for environments in which more than internal disk space is needed to hold larger amounts of data. In these situations, it is important to manage the configuration of the environment closely, because there may be some single points of failure along the way. The storage should be optimized with a proper RAID configuration, such as a RAID 5 setup with a hot spare disk. There should be two adapters that can access the drawer to maintain availability and redundancy on the server side. And the proper software drivers, such as multipath I/O (MPIO) or the Subsystem Device Driver Path Control Module (SDDPCM), should be installed and kept up to date so that the disks are presented clearly to the server.
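As a quick way to confirm that the redundancy is actually in place, you can check that every SAN disk is reachable over more than one path. The device names below are examples; lspath is the generic MPIO view, while pcmpath applies only where SDDPCM is installed:

    lsdev -Cc disk        # the SAN hdisks should all be in the Available state
    lspath                # every disk should show at least two Enabled paths
    lspath -l hdisk2      # paths for a single disk, one per adapter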
Large SAN storage
In larger SAN storage environments, where multiple servers access a number of storage devices, such as IBM System Storage® DS8300 devices, through director-class switches, there are typically dedicated SAN administrators who manage the disk resources. But from an AIX perspective, systems administrators can help by doing things like choosing multiple dual-port Fibre Channel cards to communicate with different fabrics and improve throughput. If Virtual I/O (VIO) Server technology is in use, N_Port ID virtualization (NPIV) can enable multiple servers with lower I/O needs to communicate through the same adapter, reducing the number of slots assigned to LPARs. SAN boot technology provides extremely rapid build and boot times for LPARs, especially when done with the Network Installation Manager (NIM).
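From the AIX side, a few commands are enough to confirm what the SAN administrators should be seeing on the fabric; fcs0 here is just an example adapter name:

    lsdev -Cc adapter | grep fcs   # list the physical or virtual Fibre Channel adapters
    lscfg -vl fcs0                 # show the adapter's WWPN (the "Network Address" field) for zoning
    lsdev -Cc disk                 # confirm which SAN LUNs the server currently sees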
The effects of a disk failure can vary from a mild interruption to a complete server failure. So, what do you do when you encounter a failure?
The first step is to check the accessibility of the disk resources, starting at the highest available level and moving downward. If the server is still up and running, are the file systems still mounted when viewed with the df or mount command? If not, is the volume group accessible with varyonvg, or has it lost quorum? Are the disks themselves still in an Available state, or does lsdev show them in a Defined state? Do SAN storage commands like pcmpath query adapter show that Fibre Channel devices are offline or missing? Is the server simply down and sitting in a Not Activated state when viewed through the Hardware Management Console? Or is the larger System p machine or SAN subsystem powered down? Never assume that because one type of resource is accessible, all similar resources will be accessible; be thorough in your investigation.
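A minimal top-down checklist along those lines might look like the following; datavg is a placeholder volume group name, and the pcmpath command applies only where SDDPCM is installed:

    df -k                    # are the file systems still mounted and responding?
    mount                    # cross-check exactly what is mounted
    lsvg -o                  # which volume groups are currently varied on?
    lsvg datavg              # quorum, state, and stale-partition details for a suspect volume group
    lsdev -Cc disk           # are the hdisks Available, or have they dropped to Defined?
    pcmpath query adapter    # SDDPCM only: are the Fibre Channel paths online?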
The second step is to check the integrity of the resources, starting at the lowest available level and moving upward. Did your server boot successfully, or were there errors as the system started, such as LED codes hanging at numbers like 552, 554, or 556 (corrupted file systems, JFS, or Object Data Manager [ODM])? If the system is up and running, do disk resources come back online in the Available state when you run the cfgmgr command? Can the volume groups be activated with the varyonvg command? Do the file systems mount cleanly? Is the data you expect to see present within the file systems, or are files missing or corrupted?
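Working back up the stack, the same idea as a bottom-up sketch, again with datavg and /data as placeholder names:

    cfgmgr                   # rediscover devices; missing disks should return to Available
    lsvg -o                  # did the volume group vary on during boot?
    varyonvg datavg          # if not, activate it manually
    lsvg -l datavg           # look for logical volumes left closed or marked stale
    mount /data              # mount each file system and confirm the data looks right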
The third step is to correct problems with the resources on a case-by-case basis. Here are tips I have used over the years to fix problems:
- File systems. In my experience, this is the most common type of disk error out there. It doesn't take much to make a superblock dirty, cause fragmentation, mess up inodes, or cause repeating JFS errors in the errpt. Even a full file system can mess things up. The best strategy for fixing file system issues is also the simplest: the file system check command (fsck). In these situations, I unmount the file systems and run fsck -y against them until they come back with no errors before mounting them again. Sometimes, I will be extra thorough and unmount all the file systems in a volume group, doing this with a small loop in a shell script in case there is a latent problem (see the sketch after this list).
- Volume groups. When the problem exceeds the file system realm, it often moves to the volume group level. Sometimes, the problem is at the ODM level and can be corrected with synclvodm. In a pinch, I have deactivated volume groups with varyoffvg, exported them with exportvg, and then re-imported them with importvg to get them recognized properly. But I always back up the /etc/filesystems file and record the disks' physical volume identifiers (PVIDs) beforehand to keep the mount order preserved.
- Physical volumes. Speaking of PVIDs, I've seen disks go missing and then come back onto the server with different PVIDs. It helps to record the disk information somewhere else periodically for comparison in case such a thing occurs. When it does, I usually delete the disks from the server with rmdev -dl, re-detect them with cfgmgr, and then export and re-import the volume group.
- SAN connections. There are occasions when worldwide names (WWNs) aren't communicated end-to-end across the SAN fabric, such as with NPIV on VIO servers. I sometimes take the Fibre Channel adapters offline with pcmpath set adapter offline, define or check the WWNs manually, and then bring the adapters back online. I've also had to go to the extremes of chasing cables and checking lights at the other end to make sure no physical problem exists.
- Boot problems. If a server won't boot after a disk failure, the first thing I often do is un-map or disconnect all disks from the server except for the root volume group. It can take the System Management Services (SMS) firmware a considerable amount of time to boot if it has to probe hundreds of disks while trying to find the one or two rootvg disks. I boot the system from a NIM server in maintenance mode to run triage and fix file systems, re-create the boot logical volume with the bosboot command, or access the root volume group to fix configuration files like /etc/filesystems. Also, after a server comes up, the file systems that have problems are usually those left in a closed state while the others around them mount up fine.
- Recovery. Finally, when something is broken and truly needs
to be fixed, make sure that the parts being exchanged are as close to the
original equipment as possible. In this way, you minimize the need to manipulate
things like file system sizes or software drivers that could compound the time
to get things up and working again. I always recommend making good system mksysb images and using products like IBM Tivoli® Storage Manager for those occasions when data gets lost and cannot be recovered in the true worst-case situation.
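The shell-script loop mentioned in the file systems tip above can be as small as the sketch below. It assumes a volume group named datavg (a placeholder) whose file systems can safely be taken offline, and it relies on lsvgfs to list the file systems that belong to the volume group:

    #!/bin/ksh
    # Unmount, check, and remount every file system in a volume group.
    VG=datavg                            # placeholder name; stop the application first
    for fs in $(lsvgfs $VG); do
        umount $fs
        # Re-run fsck until it exits cleanly, that is, no errors remain to repair
        until fsck -y $fs; do
            echo "Re-checking $fs ..."
        done
        mount $fs
    done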
The best way to minimize the impact and duration of problems when good disks go bad is not to be reactive when they occur but rather to maximize availability, performance, and redundancy in your AIX environments and prevent errors from happening in the first place. But when failures do occur (because failure is inevitable), validate the accessibility and integrity of your resources, and come up with an incremental plan for fixing them and getting your server fully running once more.
- For more information about LVM, check out AIX Logical Volume Manager from A to Z: Introduction and Concepts.
- For more information about troubleshooting LVM, see AIX Logical Volume Manager from A to Z: Troubleshooting and Commands and the IBM eServer Certification Study Guide AIX 5L Problem Determination Tools and Techniques.
- Learn how to use the AIX Logical Volume Manager to perform SAN storage migrations (Chris Gibson, developerWorks, July 2010).
- Check out the IBM Redbook Problem Solving and Troubleshooting in AIX 5L.