When good disks go bad

Preventing and recovering from disk failures

It's never a matter of if a disk will fail, but when. So, what do you do when you're awakened at 2 o'clock in the morning because of file system, LVM, or SAN errors on an IBM AIX server? Or, better yet, how do you prevent them from waking you up in the first place? This article looks at strategies for managing disk resources to maximize availability, performance, and redundancy and provides techniques on how to recover from failures when good disks go bad.


Christian Pruett, Senior Systems Administrator, Freelance

Christian Pruett is a senior UNIX systems administrator with more than 14 years of experience with AIX, Sun Solaris, Linux, and HP-UX in a wide variety of industries, including computing, agriculture, and telecommunications. He is the co-author of two IBM Redbooks on AIX, has served as a UNIX book reviewer for O'Reilly Publishing, and has worked on several of the IBM AIX certification exams. He resides in Colorado with his wife and two children. You can reach Christian at pruettc@gmail.com.



20 September 2011


Introduction

If you have been doing IBM® AIX® systems administration or SAN administration for any length of time, you are probably intimately familiar with disk errors, file system problems, and Logical Volume Manager (LVM) failures. What should you do when one of these situations arises? Or, better yet, how do you prevent them from happening in the first place?

Frequently used acronyms

  • I/O: Input/output
  • JFS: Journaled file system
  • LPAR: logical partition
  • RAID: Redundant array of independent disks
  • SAN: Storage area network

This article looks at such situations—when good disks go bad. It starts with an overview of disk errors and how to categorize them. Then, it moves on to hardware concepts and ways of architecting well-designed, redundant environments. From there, it discusses how to recover when crises do arise.


Categorizing disk errors

I use two main areas to categorize disk errors on AIX systems: impact and duration. Impact measures the potency of the disk errors and how they affect servers. In other words, "How bad is this going to hurt?" Duration measures the length of time or persistence of the disk errors plus recovery time—or, "How long is this going to hurt?"

Impact can be broken into four main levels:

  • Loss of availability - A loss of availability occurs when storage resources go offline or are disconnected from their managing servers. The data on the disks is not compromised, but the disks cannot be accessed. Examples include file systems being unmounted or Fibre Channel adapters being disconnected.
  • Loss of data - Data cannot be written to or read from a disk because of a logical or physical problem. Examples include LVM write errors.
  • Loss of data across multiple disks - In this instance, it is not just one disk that has encountered a loss of data but a number of disks. This situation typically occurs when logical volumes are striped across disks and one fails.
  • Loss of data across multiple servers - With the widespread use of SAN technology, it is possible for a single piece of disk hardware to be compromised to the point where multiple servers are affected with a loss of data.

Duration can likewise be broken down into four main levels:

  • Temporary - This type of disk error is the rare, one-off hiccup that poses no true threat. It shows up once in the server's errpt facility and then is gone. Examples include a bad block reallocation.
  • Intermittent - Intermittent errors show up on an irregular basis and can be indicative of a nascent problem, such as when a hard disk logs a series of write errors, showing that the drive may fail.
  • Regular - As if scheduled by a cron job itself, problems that occur on a weekly, daily, hourly, or minute-by-minute interval pose a serious risk to servers and can have widespread detrimental effects.
  • Permanent - There is no easy or feasible way to come back from this type of error. Short of replacing hardware, you cannot recover from this situation.

By cross-referencing these two metrics in a table, you can get a good picture of the criticality of disk errors and how they affect servers. Figure 1 provides an example of such a table.

Figure 1. Cross-referencing impact and duration of disk errors
Table showing a cross-referencing of impact and duration of disk errors

Figure 1 shows a four by four table. The columns represent the duration of a problem, increasing in time from left to right. The rows represent the impact of a problem, increasing in severity from the bottom to the top. The cells in the table are color-coded along the spectrum, moving from blue and green in the lower-left corner, indicating lesser degrees of problems (such as temporary loss of availability) to orange and red in the upper-right corner (indicating higher degrees of problems such as permanent loss of data across multiple servers).

In my experience, anything beyond the green area is a severe and serious issue and will likely cause loss of productivity or business. I have only seen a few instances when servers have gotten up into the yellow areas, and those came with drastic consequences.


Preventative measures

It's never a matter of if a disk failure will occur but rather a question of when it will happen. No disk has ever been guaranteed to work indefinitely. The goal of any good systems administrator is to avoid being a victim of the hardware's mean time between failures (MTBF) and to find ways to mitigate the risk of a disk failure.

The three main objectives for any AIX or SAN administrator are to maximize availability, performance, and redundancy. You'll want your storage environment to be available—both to ensure that the data can be accessed on demand and to ensure that there is sufficient disk space to contain all the data. The disk has to have good performance so that applications do not get held up by any I/O wait. And the disk needs to have redundancy so that a failure of resources does not impair the server's ability to function.

Typically, each maximization has a trade-off with at least one of the other areas. A storage environment that maximizes availability and performance usually skimps on redundancy, because the disk resources are optimized for speed and for using every last byte available. One that focuses on availability and redundancy will likely have slower reads and writes, because the objective is long-term stability. And a solution that is heavy on performance and redundancy takes up more space to deliver high I/O and to double up on reads and writes, which decreases availability in terms of usable space.

With AIX, there are more practical ways of putting preventative measures in place. Here are a few general concepts that every administrator should know:

  • Avoid single points of failure. Never, ever build an environment where the loss of a solitary resource would impair the environment. Such an architecture would include a single hard disk, a single Fibre Channel adapter, or a single power source for any piece of equipment. Inevitably, that resource will die at the most inopportune time.
  • RAID technology is a great way to maximize resources. Many years ago, engineers developed a way of gathering cheap storage devices into larger groupings through RAID technology. AIX has many levels of RAID technology incorporated into it for no additional cost; these technologies can be employed at a software level, such as striping (RAID 0) and mirroring (RAID 1). Depending on the type of disk subsystem in use, other options may be available, like striping with parity (RAID 5), striping mirrors (RAID 0 + 1), or mirrored stripes (RAID 1 + 0/RAID 10).
  • Use effective LVM strategies to segregate data. The worst mistake an administrator can make is to put all of a server's resources—operating system, third-party applications, paging space, and application data—within a single volume group. Doing so has all sorts of bad consequences, including poor performance, massive system backups, reduced manageability, and increased odds of failure. Each facet of the server should be evaluated, isolated, and put into its own volume group and type of storage. For example, a large database server might be designed to have internal mirrored rootvg disks, SAN storage for application storage and paging space, and some solid-state disks for archive logging and high-I/O transactions. (A minimal sketch of this segregation and of software mirroring follows this list.)
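
As a minimal sketch of the software mirroring and volume group segregation described above, the following commands mirror rootvg to a second internal disk and create a separate data volume group. The disk names hdisk1 through hdisk3, the volume group name datavg, and the mount point /data are placeholders; substitute the devices and names in your own environment.

    # Mirror the root volume group to a second internal disk (software RAID 1)
    extendvg rootvg hdisk1
    mirrorvg rootvg hdisk1
    bosboot -ad /dev/hdisk1            # rebuild the boot image on the new copy
    bootlist -m normal hdisk0 hdisk1   # allow booting from either disk

    # Keep application data out of rootvg in its own volume group
    mkvg -y datavg hdisk2 hdisk3
    mklv -y datalv -t jfs2 datavg 64             # 64 logical partitions for the data
    crfs -v jfs2 -d datalv -m /data -A yes       # create /data and mount it at boot

Separating data this way keeps rootvg backups (mksysb) small and lets application storage be exported and re-imported independently of the operating system.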

Let's look at strategies for various types of storage used on AIX servers.

Internal hard drives

The most common form of storage in AIX, internal hard drives are typically used for root volume group disks and servers with smaller footprints. When using internal hard drives, the first step should always be to have at least two disks per volume group and to mirror the hard drives with the mirrorvg command. If the server is a large IBM System p® machine, choose disks across multiple drawers to maximize redundancy in case a piece of hardware such as a backplane fails. Also, to optimize performance, it's wise to examine the layout of the logical volumes on the disks with the lspv -l and lspv -p commands, keeping higher-I/O areas on the outer edge of the disks and keeping logical volumes contiguous.
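
For instance, a quick review of mirroring status and logical volume placement on an internal disk might look like the commands below; hdisk0 and hd2 are examples only.

    lsvg -l rootvg     # each logical volume should show two copies (PPs = 2 x LPs)
    lspv -l hdisk0     # which logical volumes live on this disk
    lspv -p hdisk0     # placement by region, from outer edge through inner edge
    lslv -m hd2        # map one logical volume's partitions to physical locations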

Small SAN storage

Smaller storage subsystems, such as direct-attached IBM FAStT disk drawers or older small SAN technology, are affordable solutions for environments that need more than internal disk space to hold larger amounts of data. In these situations, it is important to manage the configuration of the environment closely, because there may be some single points of failure along the way. The storage should be optimized with a proper RAID configuration, such as a RAID 5 setup with a hot spare disk. There should be two adapters that can access the drawer to maintain availability and redundancy on the server side. And the proper software drivers, such as Multipath I/O (MPIO) or the Subsystem Device Driver Path Control Module (SDDPCM), should be installed and kept up to date so that the disks are presented properly to the server.
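
A brief health check of such a configuration, assuming an MPIO-managed disk named hdisk4, might look like this:

    lsdev -Cc adapter | grep fcs                      # both adapters should be Available
    lspath -l hdisk4                                  # every path should report Enabled
    lsattr -El hdisk4 -a reserve_policy -a algorithm  # confirm multipathing attributes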

Large SAN storage

In larger SAN storage environments, where multiple servers access a number of storage devices, such as IBM System Storage® DS8300 devices, through director-class switches, there are typically dedicated SAN administrators who manage disk resources. But from an AIX perspective, systems administrators can help by doing things like choosing multiple dual-port Fibre Channel cards to communicate with different fabrics and improve throughput. If Virtual I/O (VIO) Server technology is in use, N_Port ID virtualization (NPIV) can enable multiple client LPARs with lower I/O needs to communicate through the same physical adapter, reducing the number of slots assigned to LPARs. SAN boot technology provides extremely rapid build and boot times for LPARs, especially when done with the Network Installation Manager (NIM).
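
On an NPIV client LPAR, for example, you can verify that the virtual Fibre Channel adapter is present and note the WWPN it presents to the fabric; fcs0 is a placeholder here.

    lsdev -Cc adapter | grep fcs     # physical or virtual Fibre Channel adapters
    lscfg -vpl fcs0 | grep Network   # the Network Address field is the WWPN
    fcstat fcs0                      # link status and traffic statistics for the port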


Recovery steps

The effects of a disk failure can vary from a mild interruption to a complete server failure. So, what do you do when you encounter a failure?

The first step is to check the accessibility of the disk resources, starting at the highest available level and moving downward, using the errpt as a guide where possible. If the server is still up and running, are the file systems still mounted when viewed with a df or mount command? If not, is the volume group accessible with lsvg or varyonvg, or has it lost quorum? Are the disks themselves still in an Available state, or does lsdev -Cc disk show them in a Defined state? Do SAN storage commands like lspath or pcmpath query adapter show that Fibre Channel devices are offline or missing? Is the server simply down and sitting in a Not Activated state when viewed through the Hardware Management Console? Or is the larger System p machine or SAN subsystem powered down? Never assume that because one type of resource is accessible, all similar resources are accessible; be thorough in your investigation.
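
A top-down accessibility check along these lines, assuming a volume group named datavg mounted at /data, might look like the following:

    errpt | more                       # review recent hardware and LVM events first
    df -g /data; mount | grep /data    # is the file system still mounted?
    lsvg -o                            # which volume groups are varied on?
    lsvg datavg                        # quorum, stale partitions, and volume group state
    lsdev -Cc disk                     # Available versus Defined disks
    lspath | grep -v Enabled           # any Failed or Missing SAN paths?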

The second step is to check the integrity of the resources, starting at the lowest available level and moving upward. Did your server boot up successfully, or were there errors as the system started, such as LED messages hanging with numbers like 552, 554, or 556 (corrupted file systems, JFS, or Object Data Manager [ODM])? If the system is up and running, do disk resources come back online in the Available state if you run the cfgmgr command? Can the volume groups be activated with a varyonvg command? Do the file systems mount cleanly? Is the data you expect to see present within the file systems, or are files missing?
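
Working back up the stack, a bottom-up integrity check might proceed as follows; datavg and /data are again placeholders:

    cfgmgr                                        # rediscover devices; disks should return to Available
    lsvg -o | grep -w datavg || varyonvg datavg   # activate the volume group if it is offline
    fsck -y /data                                 # verify file system integrity before mounting
    mount /data && ls /data                       # mount and spot-check that the expected data is there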

The third step is to correct problems with the resources on a case-by-case basis. Here are tips I have used over the years to fix problems:

  • File systems. In my experience, this is the most common type of disk error out there. It doesn't take much to make a superblock dirty, cause fragmentation, mess up inodes, or cause repeating JFS errors in the errpt. Even a full file system can mess things up. The best strategy for fixing file system issues is the simplest: the file system check command (fsck). In these situations, I unmount the file systems and run fsck -y against them until they come back with no errors before mounting them again. Sometimes, I am extra thorough and unmount all file systems in a volume group, then check them with a small loop in a shell script in case there is a latent problem (see the sketch after this list).
  • Volume groups. When the problem exceeds the file system realm, it often moves to the volume group level. Sometimes, the problem is at the ODM level and can be corrected with a syncvg or synclvodm. In a pinch, I have deactivated volume groups with varyoffvg, exported them with exportvg, and then re-imported them with importvg to get them recognized properly. But I always back up the /etc/filesystems file and record the physical volume identifiers (PVIDs) beforehand to keep the mount order preserved.
  • Physical volumes. Speaking of PVIDs, I've seen disks go missing, and then come back onto the server with different PVIDs. It helps to record the disk information somewhere else periodically for comparison in case such a thing occurs. When it does, I usually delete the disks from the server with rmdev -dl, re-detect them with cfgmgr, and then export and re-import the volume group.
  • SAN connections. There are occasions when worldwide names (WWNs) aren't communicated end-to-end across the SAN fabric, such as with NPIV on VIO servers. I sometimes disable the Fibre Channel adapters by running pcmpath set adapter offline, define or check the WWNs manually, and then turn the adapter back on. I've also had to go to the extremes of chasing cables and checking lights at the other end to make sure no physical problem exists.
  • Boot problems. If a server won't boot after a disk failure, the first thing I often do is un-map or disconnect all disks from the server except for the root volume group. It can take System Management Services (SMS) a considerable amount of time to boot if it has to probe hundreds of disks while trying to find the one or two rootvg disks. I boot the system from a NIM server in maintenance mode to run triage and fix file systems, re-create the boot logical volume with the bosboot command, or access the root volume group to fix configuration files like /etc/filesystems. Also, after a server comes up, the file systems that have problems are usually those left in a closed state while the others around them mount fine.
  • Recovery. Finally, when something is broken and truly needs to be fixed, make sure that the parts being exchanged are as close to the original equipment as possible. In this way, you minimize the need to manipulate things like file system sizes or software drivers that could compound the time to get things up and working again. I always recommend making good system backups—both mksysb images and using products like IBM Tivoli® Storage Manager—for those occasions when data gets lost and cannot be recovered in the true worst-case situation.
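
The small shell loop mentioned in the file systems tip above might look like the sketch below, which unmounts, checks, and remounts every file system in a volume group. The volume group name datavg is a placeholder, and the loop assumes no processes still hold the file systems open.

    #!/bin/ksh
    # Unmount, fsck, and remount every file system in the volume group
    VG=datavg
    for FS in $(lsvgfs $VG); do
        umount $FS
        fsck -y $FS
        mount $FS
    done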

Conclusion

The best way to avoid the impact and duration of problems when good disks go bad is not to react to problems as they occur but to maximize availability, performance, and redundancy in your AIX environments so that errors are prevented in the first place. But when failures do occur—because failure is inevitable—validate the accessibility and integrity of your resources, and follow an incremental plan for fixing them and getting your server fully running once more.
