The IBM Storage and Storage Networking Symposium concludes today. As typical for manysuch conferences, it ended at noon, so that people can catch airline flights.
- TS1120 Tape Encryption - Customer Experiences
Jonathan Barney had implemented many deployments of tape encryption, and shared hisexperiences at two customer locations.
- The first company had decided to implement their EKM servers on dedicated 64-bitWindows servers. They had three sites, one in Chicago, Alphareta, and New York City,each with two EKM servers. Each library had a single TS3500 tape library, and pointedto four EKM servers, two local, and two remote.
The clever trick was managing the keystore. They decided that EKM-1 was their trustedsource, made all changes to that, and then copied it to the other five EKM servers.His team deployed one site at a time, which turned out to be ok, but he would notrecommend it. Better to design your complete solution, and make sure that all librariescan access all EKM servers.
This company decided to have a single key-label/key-pair for all three locations, but change it every 6 months. You have to keep the old keys for as long as you have tapesencrypted with those keys, perhaps 10-20 years.The customer found the IBM encryption implementation "elegant" and it can be easily replicated to a fourth site if needed.
- The second company had both z/OS and Sun Solaris. Initially they planned to have botha hardware-based keystore on System z, and software-based keystore on Sun, but they realized that System z version was so much more secure and reliable, that it made nosense to have anything on the Sun Solaris platform.
On System z, they had two EKM images, and used VIPA to ensure load balancing fromthe library. Tapes written from z/OS used DFSMS Data Class to determine which tapesare encrypted and which aren't. All Tapes written from Sun Solaris were encryptied, written to a separate logical library partition of the TS3500, which in turn contactedthe System z for the EKM management to provide the keys to use for the encryption.
The "gotcha" for this case was that when they tested Disaster Recovery, they had torecover the two EKM servers first, before any other restores could take place, and thistook way too long. Instead, they developed a scaled-down 10-volume "rescue recovery" z/OS image that would contain the RACF database and all EKM related software to actas the keystore during a disaster recovery. Anytime they make updates, they only haveto dump 10 volumes to tape. Restore time is down to only 2 hours.
He gave this advice to deploy tape encryption:
- Some third party z/OS security products, like Computer Associates Top Secret orACF2, require some PTFs to work with the EKM. The latest IBM RACF is good to go.
- Getting IP support from IOS to OMVS requires IPL.
- At one customer, an OMVS monitor software program killed the EKM because it wasn'tin their list of "acceptable Java programs". They updated the list and EKM ran fine.
- DO not update EKM properties file while EKM is running. EKM keeps a lot of stuffin memory, and when it is recycled, copies this back to the EKM properties file, reversing any changes you may have done. It is best to shut down EKM, update theproperties file, then start up EKM back up again. This is why you should always haveat least two EKM servers for redundancy.
- TSM for Linux on System z
Randy Larson from our Tivoli group presented this session.There is a lot of interest in deploying IBM Tivoli Storage Manager backup and archivesoftware on Linux for System z. Many customers are already invested in a mainframeinfrastructure, may have TSM for z/OS or z/VM, and want the newer features and functions that are available for TSM on Linux.
TSM has special support for Lotus Domino, Oracle, DB2 and WebSphere Application Servers.TSM clients can send backup data to a TSM server internally via Hipersockets, a virtualLAN feature on the System z platform that uses shared memory to emulate TCP/IP stack.
One of the big questions is whether to run Linux as guests under z/VM, or natively onLPAR. The general deployment is to carve an LPAR and run Linux natively untilyour server and storage administration staff have taken z/VM training classes. Oncetrained, they can easily move native LPAR images to z/VM guests. Unlike VMware that takesa hefty 40% overhead on x86 platforms to manage guests, z/VM only takes 5-10% overhead.
For the TSM database and disk storage pools, Randy recommends FC/SCSI disk, with ext3 file system, combined with LVM2 into logical volumes. ECKD disk and reiserfsworks too. Avoid use of z/VM minidisks. Under LVM2, consider 32KB stripes for the TSM database, and 256KB stripes for the disk storage pools. For multipathing, usefailover rather than multibus method. Read IC45459 before you activate "directio".
The TSM for Linux on z is very much like the TSM on AIX or Windows, and not like theTSM for z/OS. For tape, TSM for Linux on z does not support ESCON/FICON attached tape,you need to use FC/SCSI attached tape and tape libraries. TSM owns the library anddrives it uses, so give it a logical library partition separate from z/OS. ForSun/StorageTek customers, TSM works with or without the Gersham Enterprise Distrbu-Tape(EDT) software. Use the IBM-provided drivers for IBM tape. For non-IBM tape, TSM providessome drivers that you can use instead.
That wraps up my week. This was a great conference! If you missed it, look for the one in Montpelier, France this October. Check out the list of IBM Technical Conferencesto find others that might interest you.
technorati tags: IBM, Storage, Networking, Symposium, tape, encryption, EKM, keystore, z/OS, Solaris, ECKD, TSM, AIX, TS1120, Montpelier
The IBM Storage and Storage Networking Symposium in Las Vegas continues ...
- N series and VMware
Jeff Barnett presented how VMware manages disk image files in its VMfs repository, and how N series offersa better alternative. Virtual machines can access N series volumes directly.
- Business Continuity with System i
Allison Pate presented the various Business Continuity options for System i. Many customersuse internal storage for System i, but this then hampers Business Continuity efforts. Instead,you can have IBM System Storage DS8000 or DS6000 series disk systems provide disk mirroringbetween clustered systems.
- DR550 Introduction
There was a lot of interest in DR550, one of our many compliance storage solutions. Ron Henkhauspresented an overview of our DR550 and DR550 Express offerings. Unlike the competitive disk-onlysolutions, such as the EMC Centera, the DR550 allows you to attach an automated tape library, managing large amounts of fixed content data at a much lower cost point. It also has encryption, for both diskand tape data.
- Open Systems Disk Management
Siebo Friesenborg presented the various steps needed to troubleshoot performance problemswith open systems, including the use of "iostat" on AIX systems as an example, and the stepsyou can take to make formal Service Level Agreements (SLA) between the IT department and thevarious lines of business.
- IBM Encryption - TS1120 and LTO-4 encryption comparison
Tony Abete presented TS1120 and LTO-4 encryption techniques. Deploying encryption is more thanjust choosing a tape drive. There are a variety of factors involved, such as whether to managethe keys from the application, the operating system, or the library manager. You need policiesto decided when to encrypt tapes and when not to, generating your keys, storing them, and sharingthem with your business partners, suppliers and service providers with which you send tapes.
I can tell that many people are feeling like they are "drinking from a firehose".IBM's success in storage reaches out to so many different aspects of information management,a variety of industries, and disciplines as varied as regulatory compliance and medical imaging.
technorati tags: IBM, storage, symposium, NAS, Vmware, N series, Allison Pate, Ron Henkhaus, DR550, express, Business Continuity, iostat, AIX, SLA, TS1120, tape, drive, LTO, LTO-4, Tony Abete, encryption, key, management, drinking firehose
I am back at "the Office" for a single day today. This happens often enough I need a name for it.Air Force pilots that practice landing and take-offs call them "Touch and Go", but I think I needsomething better. If you can think of a better phrase, let me know.
This week, I was in Hartford, CT, Somers, NY and our Corporate Headquarters in Armonk, in a varietyof meetings, some with editors of magazines, others with IBMers I have only spoken to over the phone andfinally got a chance to meet face to face.
I got back to Tucson last night, had meetings this morning in Second Life, then presented "InformationLifecycle Management" in Spanish to a group of customers from Mexico, Chile, and Brazil. We have a great Tucson Executive Briefing Center, and plenty of foreign-language speakers to draw from our localemployees here at the lab site.
Sunday, I leave for Las Vegas for our upcoming IBM Storage and Storage Networking Symposium. We will cover the latest in our disk, tape, storage networking and related software.Do you have your tickets? If you plan to attend, and want to meet up with me, let me know.
technorati tags: The Office, IBM, ILM, Tucson, Executive, Briefing, Center, Spanish, Las Vegas, storage, networking, symposium, Dwight Schrute, Gun Show, T-shirt
Wrapping up my week's discussion on Business Continuity, I've had lots of interest in myopinion stated earlier this week that it is good to separate programs from data
, and thatthis simplifies the recovery process, and that the Windows operating system can fit in a partition as small asthe 15.8GB solid state drive
we just announced for BladeCenter. It worked for me, and I will use this post to show you how to get it done.
Disclaimer: This is based entirely on what I know and have experienced with my IBM Thinkpad T60 running Windows XP, and is meant as a guide. If you are running with different hardware or different operating system software, some steps may vary.
(Warning: Windows Vista apparently handles data, Dual Boot, andPartitions differently. These steps may not work for Vista)
For this project, I have a DVD/CD burner in my Ultra-Bay, a stack of black CDs and DVDs, and a USB-attached 320GB external disk drive.
- Step 0 - Backup your system
I find it amusing that this is ALWAYS the first step, but nobody provides any instruction.I will assume we start with a single C: drive with an operational Windows operating system, intermixed programs and data. If you have a Thinkpad, you should have "IBM Rescue and Recovery" program already installed, but is probably down-level. Mine was version 2.0 -- Yikes! Download IBM Rescue and Recovery Version 4.0 for Windows XP and Windows 2000,and reboot to make it fully installed.
Make TWO backups. First, make a bootable rescue CD and backup to several DVDs. Second, backup to a large external 320GB USB-attached disk drive. IBM Rescue and Recovery does compression, so a 60GB drive that is mostly full might take about 8-10 DVDs, have plenty on hand. If you have to recover, boot from CD, and restore from the USB-attached drive. If that doesn't work, you have the DVDs just in case.
If you are suitably happy with your backups, you are ready for step 1. For added protection, you can use a Linux LiveCD to backup your entire drive. I suggestSysRescCD, which is designed to be a rescue CD and can do backups and restores.
First, figure out if your drive is "hda" or sda. The "dmesg" command below shows that mine is "sda", with output like this:
tpearson@tpearson:~$ dmesg | grep [hs]d[ 7.968000] SCSI device sda: 117210240 512-byte hdwr sectors (60012 MB)[ 7.968000] sda: Write Protect is offI like to backup the master boot record to one file, and then the rest of the C: drive to a series of 690MB compressed chunks. These can be directed to the USB-attached drive, and then later burned onto CDrom, or pack 6 files per DVD.Most USB-attached drives are formatted to FAT32 file system, which doesn't support any chunks greater than 2GB, so splitting these up into 690MB is well below that limit.
[ 7.968000] sda: Mode Sense: 00 3a 00 00[ 7.968000] SCSI device sda: write cache: enabled, read cache: enabled
dd if=/dev/sda of=/media/USBdrive/master.MBR bs=512 count=1dd if=/dev/sda1 conv=sync,noerror | gzip -c | split -b 690m - /media/USBdrive/master.gz.To recover your system, just reverse the process:
cat /media/USBdrive/master.gz.* | gzip -dc | dd of=/dev/sda1dd if=/media/USBdrive/master.MBR of=/dev/sda bs=512 count=1You can learn more about these commands here and here.
- Step 1 - Defrag your C: drive
From Windows, right-click on your Recycle Bin and select "Empty Recycle Bin".
Click Start->Programs->Accessories->System Tools->Disk Defragmenter. Select C: drive and push the Analyze button. You will see a bunch of red, blue and white vertical bars. If there are any greenbars, we need to fix that. The following worked for me:
- Right-click "My Computer" and select Properties. Select Advanced, then press "Settings" buttonunder Performance. Select Advanced tab and press the "Change" button under Virtual Memory.Select "No Paging File" and press the "Set" button. Virtual memory lets you have many programs open, moving memory back and forth between your RAM and hard disk.
- Click Start->Control Panel->Performance and Maintenance->Power Options. On the Hibernate tab,make sure the "Enable Hibernation" box is un-checked. I don't use Hibernate, as it seems likeit takes just as long to come back from Hibernation as it does to just boot Windows normally.
- Reboot your system to Windows.
If all went well, Windows will have deleted both pagefile.sys and hiberfil.sys, the twomost common unmovable files, and free up 2GB of space. You can run just fine without either of these features, but if you want them back, we will put them back on Step 6 below.
Go back to Disk Defragmenter, verify there are no green bars, andproceed by pressing the "Defragment" button. If there are still some green bars,you can proceed cautiously (you can always restore from your backup right?), or seek professional help.
- Step 2 - Resize your C: drive
When the defrag is done, we are ready to re-size your file system. This can be done with commercial software like Partition Magic.If you don't have this, you can use open source software. Burn yourself the Gparted LiveCD.This is another Linux LiveCD, and is similar to Partition Magic.
Either way, re-size the C: drive smaller. In theory, you can shrink it down to 15GB if this is a fresh install of Windows, and there is no data on it. If you have lots of data, and the drive wasnearly full, only resize the C: drive smaller by 2GB. That is how much we freed upfrom the unmovable files, so that should be safe.
You could do steps 2 and 3 while you are here, but I don't recommend it. Just re-size C:press the "Apply" button, reboot into Windows, and verify everything starts correctly before going to the next step.
- Step 3 - Create Extended Paritition and Logical D: drive
You can only have FOUR partitions, either Primary for programs, or Extended for data. However, theExtended partition can act as a container of one or more logical partitions.
Get back into Partition Magic or Gparted program, and in the unused space freed up from re-sizing inthe last step, create a new extended/logical partition. For now, just have one logical inside theextended, but I have co-workers who have two logical partitions, D: for data, and E: for their e-mailfrom Lotus Notes. You can always add more logical partitions later.
I selected "NTFS" type for the D: drive. In years past, people chose the older FAT32 type, but this has some limitations, but allowed read/write capability from DOS, OS/2, and Linux.Windows XP can only format up to 32GB partitions of FAT32, and each file cannot be bigger than 2GB.I have files bigger than that. Linux can now read/write NTFS file systems directly, using the new NTFS-3Gdriver, so that is no longer an issue.
- Step 4 - Format drive D: as NTFS
Just because you have told your partitioning program that D: was NTFS type, you stillhave to put a file system on it.
Click Start->Control Panel->Performance and Maintenance->Computer Management. Under Storage, select Disk Management. Right-click your D: drive and choose format.Make sure the "Perform Quick Format" box is un-checked, so that it peforms slowly.
- Step 5 - Move data from C: to D: drive
Create two directories, "D:\documents" and "D:\notes\data", either through explorer, or in a commandline window with "MKDIR documents notes\data" command.
Move files from c:\notes\data to d:\notes\data, and any folder in your "My Documents" over to d:\documents.
(If you have more data than the size of the D: drive, copy over what you can, run another defrag, resize your C: drive even smaller with Partition Magic or Gparted, Reboot, verify Windows is still working,resize your D: bigger, and repeat the process until you have all of your data moved over.)
To inform Lotus Notes that all of your data is now on the D: drive, use NOTEPAD to edit notes.ini and change the Directory line to "Directory=D:\notes\data". If you have a special signature file, leave it in C:\notes directory.
Once all of your data is moved over to D:\documents, right-click on "My Documents" and select Properties. Change the target to "D:\documents" and press "Move" button. Now, whenever you select "My Documents", youwill be on your D: drive instead.
- Step 6 - Take A Fresh Backup
If you use IBM Tivoli Storage Manager, now would be a good time to re-evaluate your "dsm.opt" file that listswhat drives and sub-directories to backup. Take a backup, and verify your data is being backed up correctly.
With the USB-attached, backup both C: and D: drives. I leave my USB drive back in Tucson. For a backup copywhile traveling, go to IBM Rescue and Recovery and take a C:-only backup to DVD. Make sure D: drive box is un-checked. Now, if I ever need to reinstall Windows, because of file system corruption or virus, I can do this from my one bootable CD plus 2 DVDs, which I can easily carry with me in my laptop bag, leaving all my data on the D: drive in tact.
In the worst case, if I had to re-format the whole drive or get a replacement disk, I can restore C: and thenrestore the few individual data files I need from IBM Tivoli Storage Manager, or small USB key/thumbdrive,delaying a full recovery until I return to Tucson.
Lastly, if you want, reactivate "Virtual Memory" and "Hibernation" features that we disabled in Step 1.
As with Business Continuity in the data center, planning in this manner can help you get back "up and running"quickly in the event of a disaster.
technorati tags: IBM, Business Continuity, Windows, XP, BladeCenter, solid, state, disk, backup, Linux, sysresccd, LiveCD, dd, gzip, split, Tivoli, Storage Manager, USB, Lotus Notes, NTFS, NTFS-3G, FAT32, primary, extended, logical, partition, magic, gparted
Continuing this week's theme on Business Continuity, I will use this post to discuss this week'sIBM solid state disk announcement
.This new offering provides a new way to separate programs from data, to help minimizedowntime and outages normally associated with disk drive failures.
Until now, the method most people used to minimize the amount of data on internalstorage was to use disk-less servers with Boot-Over-SAN, however, not all operating systems, and not all disk systems, supported this.
In April, the BladeCenter HS21 XM blade server introduced the option to have oneIBM 4GB Flash Memory Device that used the USB2.0 protocol. The 4GB drive can be usedto boot 32-bit and 64-bit versions of Linux, such as Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES)), but not Windows. Linux is incredibly small operating system. You can bootversions from a USB key/thumbdrive (64MB) or CD (700MB) image, so it makes sense that a 4GB flash drive based on USB protocol was a good fit for Linux.
Windows, however, is not supported, because of the small 4GB size and USB protocol limitations. For Windows, you would add a SAS drive, you boot from this hard drive, and use the 4GB Flash drive for data only.
So what's new this time? Here's a quick recap of July 17 announcement. For the IBM BladeCenter HS21 XM blade servers, new models of internal "disk" storage:
- Single drive model
A single 15.8GB solid-state disk drive, based on SATA protocol. In addition to theLinux operating systems mentioned above, the capacity and SATA protocols allowsyou to boot 32-bit and 64-bit versions of Windows 2003 Server R2, with plans in placeto other platforms in the future, such as VMware. I am able to run my laptop Windows with only 15GB of C: drive, separating my data to a separate D: partition, so this appears to be a reasonable size.
- Dual drive model
The dual drive fits in the space of a single 2.5-inch HDD drive bay.You can combine these in either RAID 0 or RAID 1 mode.
- RAID 0 gives you a total of 31.6GB, but is riskier. If you lose either drive,you lose all your data. Michael Horowitz of Cnet covers the risks of RAID zerohere andhere.However, if you are just storing your operating system and application, easily re-loadable from CD or DVD in the case of loss, then perhaps that is a reasonable risk/benefit trade-off.
- RAID 1 keeps the capacity at 15.8GB, but provides added protection. If you loseeither drive, the server keeps running on the surviving drive, allowing you to schedule repair actions when convenient and appropriate. This would be the configuration I would recommend for most applications.
Until recently, solid state storage was available at a price premium only. Flash prices have dropped 50% annually while capacities have doubled. This trend is expected to continue through 2009.
According to recent studies from Google and Carnegie Mellon, hard drives fail more oftenthan expected. By one account, conventional hard disk drives internal to the server account for as much as 20-50% of component replacements.IBM analysis indicates that the replacement rate of a solid state drive on a typical blade server configuration is only about 1% per year, vs. 3% or more mentionedin the these studies for traditional disk drives.
Flash drives use non-volatile memory instead of moving parts, so less likely to break down during high external environmental stress conditions, like vibration and shock, or extreme temperature ranges (-0C° to +70°C) that would make traditional hard disks prone to failure.This is especially important for our telecommunications clients, who are always looking for solutions that are NEBS Level 3 compliant.
Last year, I mentioned that flash drives could provide only a limited number of write and erase cycles, but today's new advances in wear-leveling algorithms have nearly eliminated this limitation.
As with any SATA drive, performance depends on workload.Solid state drives perform best as OS boot devices, taking only a few secondslonger to boot an OS than from a traditional 73GB SAS drive. Flash drives also excel in applications featuring random read workloads, such as web servers. For random and sequential write workloads, use SAS drives instead for higher levels of performance.
Part of IBM's Project Big Green, these flash drives are very energy efficient. Thanks to sophisticated power management software, the power requirement of the solid state drive can be 95 percent better than that of a traditional 73GB hard disk drive. These 15.8GB drives use only 2W per drive versus as much as 10W per 2.5” hard drive and 16W per 3.5” hard drive. The resulting power savings can be up to 1,512 watts per server rack, with 50% heat reduction.
So, even though this is not part of the System Storage product line, I am very excitedfor IBM. To find out if this will work in your environment, go to the IBM Server Provenwebsite that lists compatability with hardware, applications and middleware, or review the latest Configuration and Options Guide (COG).
technorati tags: IBM, Business, Continuity, solid, state, flash, disk, drive, announcement, blade, server, BladeCenter, H21, XM, 4GB, Flash, Memory, Device, USB2.0, Linux, RedHat, RHEL, Novell, SUSE, SLES, Windows, Project, Big Green, SATA, SAS, energy, efficient, efficiency, performance, NEBS, telecommunications, boot-over-SAN, Google, Carnegie Mellon, study, Vmware
Continuing this week's theme on Business Continuity, I thought I would explore more on the identification of scenarios to help drive appropriate planning. As I mentioned in my last post
, this should be done first.
A recent post in Anecdote talks about the long list of cognitive biases which affect business decision making. This list is a good explanation of why so many people have a difficult time identifying appropriate recovery scenarios as the basis for Business Continuity planning. Their "cognitive biases" get in the way.
Again, using my IBM Thinkpad T60 laptop as an example, here are a variety of different scenarios:
- Corrupted File System
Some file systems are more fragile than others. If your NTFS file system gets corrupted, you might be able to run
CHKDSK C: /F but this just puts damaged blocks into dummy files, it doesn't really repair your files back to their pre-damage level.All kinds of things can damage the file system, including viruses, software defects, and user error.
I keep my programs and data in separate file systems. C: has my Windows operating system and applications, and D: holds my pure data. If one file system is corrupted, the other one might be in tact, mitigating the risk.
- Hard Disk Crash
Hopefully, you will have temporary read/write errors to provide warning prior to a complete failure. In theory, if I kept a spare hard disk in my laptop bag, I could swap out the bad drive with the good drive. I don't have that. The three times that I have had a disk failure all occurred while I was in Tucson.
Instead, I keep the few files I need for my trip on a separate USB key, and carry bootable Live CD, which allows you to boot entirely from CDrom drive, either to run applications, or perform rescue operations.
The latest one that I am trying out is Ubuntu Linux, which has OpenOffice 2.2 that can read/write PowerPoint, Word, and Excel spreadsheets; Firefox web browser; Gimp graphics software; and a variety of other applications, all in a 700MB CDrom image. I even have been able to get Wireless (Wi-Fi) working with it, and the process to create your own customized Live CD with the your own application packages is fairly straightforward. Combined with a writeable USB key, you can actually get work done this way. Special thanks to IBM blogger Bob Sutor for pointing me to this.
(If you have a DVD-RAM drive, there are bigger Live CDs from SUSE and RedHat Fedora that provide even more applications)
- Laptop Shell Failure
This might catch some people by surprise. I have had the keyboard, LCD screen, or some essential port/plug fail on my laptop. The disk drive and CDrom drive work fine, but unless you have another "laptop" to stick them into, they don't help you recover. This can also happen if the motherboard fails, or the battery is unable to hold a charge.
IBM provides a 24-hour turn around fix. Basically, IBM sends me a laptop shell, no drive, no CDrom, with instructions to move the disk drive and CDrom drive from your broken shell, to the new shell, then send the bad shell back in the same shipping box.
Here, again, I am thankful that I keep my key files on an USB key. Often I travel with other IBMers, and can borrow their laptop to make presentations, check my e-mail, or other work, until I can get my replacement shell. In you are travelling outside the US, you might be able to move your disk drive into a colleague's laptop, access the data, copy it to your USB key or burn a copy on CD or DVD.
In a data center, many outages are really "failures to access data", but the data is safe. For example, power outages, network outages, and so on, can prevent people from using their IT systems, but the data is safe when these are re-established.
- Temporary Separation
At times, I have been temporarily separated from my laptop. Three examples:
- A higher level executive had technical difficulties with his laptop, and usurped mine instead.
- A colleague forgot his power supply for his laptop, and borrowed my laptop instead. (I wish there were a standard for laptop power plug connectors)
- Customs agents confiscate your laptop, give you a receipt, and eventually you get it back.
In all cases, I was glad that no "recovery" was required, and that the few files I needed were on my USB key. A few times, I was able to get by on the machines available at the nearest Internet Cafe, in the meantime.
With some imagination, you can recognize that this scenario is similar to the previous one for laptop shell failure.Here is a good example that you can identify different scenarios, and then later discover they have similar properties in terms of recovery, and can be treated as one.
- Permanent Separation
Laptops are stolen every day. Luckily, I've only had this happen twice to me in my career at IBM, and I managed to get a replacement soon enough. The key lesson here is to keep your USB key and recovery media in separate luggage.I know it is more convenient to keep all computer-related stuff in one place, but a thief is going to take your whole laptop bag, to make sure that all cables and power supplies are included, and is not going to leave anything behind. That would just slow them down.
In each case, some brainstorming, or personal experience, can help identify scenarios, identify what makes them unique from a recovery perspective, and plan accordingly. If you looking to create or upgrade your Business Continuity plan, give IBM a call, we can help!
technorati tags: IBM, Business, Continuity, plan, plans, planning, Thinkpad, T60, laptop, NTFS, CHKDSK, hard disk crash, USB, key, Live, CD, LiveCD, DVD, Ubuntu, Linux, SUSE, RedHat, Fedora, shell, failure
This week and next I am touring Asia, meeting with IBM Business Partners and sales repsabout our July 10 announcements
Clark Hodge might want to figure out where I am, given the nuclearreactor shutdowns from an earthquake in Japan. His theory is that you can follow my whereabouts just by following the news of major power outages throughout the world.
So I thought this would be a good week to cover the topic of Business Continuity, which includes disaster recovery planning. When making Business Continuity plans, I find it best to work backwards. Think of the scenarios that wouldrequire such recovery actions to take place, then figure out what you need to have at hand to perform the recovery, and then work out the tasks and processes to make sure those things are created and available when and where needed.
I will use my IBM Thinkpad T60 as an example of how this works. Last week, I was among several speakers making presentations to an audience in Denver, and this involved carrying my laptop from the back of the room, up to the front of the room, several times. When I got my new T60 laptop a year ago, it specifically stated NOT to carry the laptop while the disk drive was spinning, to avoid vibrations and gyroscopic effects. It suggested always putting the laptop in standby, hibernate or shutdown mode, prior to transportation, but I haven't gotten yet in the habit of doing this. After enough trips back and forth, I had somehow corrupted my C: drive. It wasn't a complete corruption, I could still use Microsoft PowerPoint to show my slides, but other things failed, sometimes the fatal BSOD and other times less drastically. Perhaps the biggest annoyance was that I lost a few critical DLL files needed for my VPN software to connect to IBM networks, so I was unable to download or access e-mail or files inside IBM's firewall.
Fortunately, I had planned for this scenario, and was able to recover my laptop myself, which is important when you are on the road and your help desk is thousands of miles away. (In theory, I am now thousands of miles closer to our help desk folks in India and China, but perhaps further away from those in Brazil.) Not being able to respond to e-mail for two days was one thing, but no access for two weeks would have been a disaster! The good news: My system was up and running before leaving for the trip I am on now to Asia.
Following my three-step process, here's how this looks:
- Step 1: Identify the scenario
In this case, my scenario is that the file system the runs my operating system is corrupted, but my drive does not have hardware problems. Running PC-Doctor confirmed the hardware was operating correctly. This can happen in a variety of ways, from errant application software upgrades, malicious viruses, or in my case, picking up your laptop and carrying it across the room while the disk drive is spinning.
- Step 2: Figure out what you need at hand
All I needed to do was repair or reload my file sytem. "Easier said than done!" you are probably thinking. Many people use IBM Tivoli Storage Manager (TSM) to back up their application settings and data. Corporate include/exclude lists avoid backing up the same Windows files from everyone's machines. This is great for those who sit at the same desk, in the same building, and would be given a new machine with Windows pre-installed as the start of their recovery process. If on the other hand you are traveling, and can't access your VPN to reach your TSM server, you have to do something else. This is often called "Bare Metal Restore" or "Bare Machine Recovery", BMR for short in both cases.
I carry with me on business trips bootable rescue compact discs, DVDs of full system backup of my Windows operating system, and my most critical files needed for each specific trip on a separate USB key. So, while I am on the road, I can re-install Windows, recover my applications, and copy over just the files I need to continue on my trip, and then I can do a more thorough recovery back in the office upon return.
- Step 3: Determine the tasks and processes
In addition to backing up with IBM TSM, I also use IBM Thinkvantage Rescue and Recovery to make local backups. IBM Rescue and Recovery is provided with IBM Thinkpad systems, and allows me to backup my entire system to an external 320GB USB drive that I can leave behind in Tucson, as well as create bootable recovery CD and DVDs that I can carry with me while traveling.
The problem most people have with a full system backup is that their data changes so frequently, they would have to take backups too often, or recover "very old" data. Most Windows systems are pre-formatted as one huge C: drive that mixes programs and data together. However, I follow best practice, separating programs from data. My C: drive contains the Windows operating system, along with key applications, and the essential settings needed to make them run. My D: drive contains all my data. This has the advantage that I only have to backup my C: drive, and this fits nicely on two DVDs. Since I don't change my operating system or programs that often, and monthly or quarterly backup is frequent enough.
In my situation in Denver, only my C: drive was corrupted, so all of my data on D: drive was safe and unaffected.
When it comes to Business Continuity, it is important to prioritize what will allow you to continue doing business, and what resources you need to make that happen. The above concepts apply from laptops to mainframes. If you need help creating or updating your Business Continuity plan, give IBM a call.
technorati tags: IBM, July, announcements, earthquake, Japan, nuclear reactor, power, outage, business, continuity, disaster, recovery, plan, plans, planning, IBM, Thinkpad, T60, laptop, Windows, Denver, BSOD, VPN, India, China, Brazil, help desk, Asia, Tivoli, Storage, Manager, TSM, BMR, external, USB, bootable, CD, DVD, separating, programs, data, Clark Hodge
Seth Godin has an interesting post titled Times a Million
.He recounts how many people determine the fuel savings of higher-mileage cars to be only $300-$900 per year,and that this is not enough to motivate the purchase of a more-efficient vehicle, such as a hybrid orelectric car. Of course, if everyone drove more efficient vehicles, the benefits "times a million" wouldbenefit everyone and the world's ecology.
When I discuss storage-related concepts, many executives mistakenly relate them to the one area of information technologythey know best: their laptop. Let's take a look at some examples:
- Information Lifecycle Management
Information Lifecycle Management (ILM) includes classifying data by business value, and then using this to determineplacement, movement or deletion. If you think about the amount of time and effort to review the files on yourindividual laptop, and to manually select and move or delete data, versus the benefits for the individual laptopowner, you would dismiss the concept. Most administrative tasks are done manually on laptops, because automatedsoftware is either unavailable or too expensive to justify for a single owner.
In medium and large size enterprises, automated software to help classify, move and delete data makes a lot of sense.Executives who decide that ILM is not for their data center, based on their experiences with their laptop, are losingout on the "times a million" effect.
- Energy Efficiency
Laptops have various controls to minimize the use of battery, and these controls are equally available when pluggedin. Many users don't bother turning off the features and functions they don't need when plugged in, because theyfeel the cost savings would only amount to pennies per day.
Times a million, energy savings do add up, and options to reduce the amount used per server, per TB of data stored, not only save millions of dollars per year, but can also postpone the need to build a new data center, or upgrade the electrical systems in your existing data center.
- Backup and Disaster Recovery planning
I am not surprised how many laptops do not have adequate backup and disaster recovery plans. When executives thinkin terms of the time and effort to backup their data, often crudely copying key files to CDrom or USB key, and worryingabout the management of those copies, which copies are the latest, and when those copies can be destroyed, theymight reject deploying appropriate backup policies for others.
Times a million, the collected data stored on laptops could easily be half of your companies emails and intellectual property. Products like IBM Tivoli Storage Manager can manage a large number of clients with a few administrators,keeping track of how many copies to keep, and how long to keep them.
So, next time you are looking at technology or solutions for your data center, don't suffer from "Laptop Mentality". Focus instead on the data center as a whole.
technorati tags: Seth Godin, IBM, storage, ILM, energy, efficiency, backup, disaster recovery, Tivoli Storage Manager, laptop
NetworkWorld has compiled interlude with storage videos
, a follow up to last year's Yikes! Exploding Servers
I've blogged about some of these videos already, but since there are probably a few out there buying the brand new Apple iPhone looking for YouTube videos to play on them, these links might provide some exampleentertainment on your new handheld device.
Next week has "Fourth of July" Independence Day holiday in the USA smack in the middle of the week, so I suspect the blogosphereto quiet down a bit. So whether you are working next week or not, in the USA or elsewhere, take some time to enjoy your friends and family.
technorati tags: NetworkWorld, storage, videos, HP, IBM, EMC, HDS, Sun,exploding, servers, Apple, iPhone, YouTube
Last week, I opined that Monday's IDC announcement "IBM #1 in combined disk and tape storage hardwaresales for 2006" was in part because of a resurgence of interest in tape
, with four specific examples. There was a lot of reaction and reflection fromboth sides.
- On the one side...
EMC blogger Mark Twomey at Storagezilla admits that perhapsTape Isn't Dead after all,is perhaps the best place to put long-term archive data, but not for backup? EMC's "creative marketing types" put out this Fun With Tape video that I found amusing. (It asks for a first name,last name, and e-mail address, which are then embedded into the resulting video itself, and perhaps forwarded to your nearest EMC sales rep, so answer according to your wishes for privacy).
The "mummy wrapped in tape media" seems to be a common theme, and shows up again in LiveVault'svideo with John Cleese, which makes the same argument asthe EMC video above, namely: switch your backups from tape to disk because we are a disk-only vendor.
- ... and on the other side
JWT over at DrunkenData asks Which is greener, disk or tape?Tape is, of course, by a long shot, and an essential part of IBM's Big Green initiative, a project to invest$1US Billion dollars per year for data centers to be more efficient for power and cooling.
Sun/StorageTek blogger Randy Chalfant questions the Death of Tape, and argues thatdisk-only solutions suffer from atrophy.The results he posts from a survey of 200 customers are similar to those we've seen with customers using IBM TotalStorage Productivity Center, our software to help evaluate data usage, and identify misuse, in your data center.
To my readers in the USA, United Kingdom, Ireland, South Africa, China and Japan, and a few other countries, Happy Father's Day!
technorati tags: IBM, IDC, combined, disk, tape, storage, announcement, EMC, dead, LiveVault, video, John Cleese, JWT, DrunkenData, Sun, StorageTek, STK, Randy Chalfant, TotalStorage, Productivity Center, Fathers Day, Big, Green, initiative
One of the differences between IBM and the other storage vendors is that IBM is also in the business of middleware, application-aware backup software, and advanced copy services. This allows IBM to put togethersolutions that work to address specific challenges for our clients.
IBM has written a whitepaper on a cleverVSS Snapshot Backup for Exchange using IBM Tivoli Storage Manager and the point-in-time copy capabilities of IBM System Storage disk systems.
A problem in the past was that each vendor's point-in-time copy method had its own unique proprietary interface.Microsoft Developed Volume Shadow Copy Services (VSS) as a common interface front-end to resolve this concern.IBM Tivoli Storage Manager for Mail can invoke standard VSS interfaces, and this in turn can invoke FlashCopyon the IBM System Storage SAN Volume Controller, DS8000 series, or DS6000 series disk system.
You might be thinking: Wouldn't it have been less effort to just have TSM for Mail invoke IBM proprietary interfaces,rather than having to put full VSS support into TSM for mail, and then full VSS support into IBM's various disksystems? Perhaps, but IBM doesn't decide to do things because it is the cheapest way, we focus on what is theright way, and in this case, customers now have more choices, then can use TSM for Mail with IBM or non-IBM disksystems that support the VSS interface, and IBM disk systems can be employed into other uses for VSS snapshot.
Of course, we would like our clients to consider both TSM and IBM System Storage disk systems for a combined solution,not because they are required to make the solution work, but because both are best-of-breed, and whitepapers likethis show how they can provide synergy working together.
technorati tags: IBM, Tivoli, Storage, Manager, for, Mail, TSM, Microsoft, Windows, VSS, Exchange, SVC, DS8000, DS6000, FlashCopy, snapshot, whitepaper
I'm wrapping up my week in Latin America.
Yesterday morning, the entire country of Colombia suffered their worst black-out (power outage) in 22 years. 98% of the country was out for 4 1/2 hours.This is just 5 months after an outage that hit 25% of the country, December 7, 2006.Ironically, this one happened the week I am here explaining the need for Business Continuity plans to IBM Business Partners from Argentina, Peru, Velenzuela, Ecuador and Colombia. As is oftenthe case, people often need a real example to recognize the need for planning is important.
It reminded me of the Northeast Black-out of 2003 that impacted USA and Canada. I was speaking to a crowd of 800 people at the SHARE conference in Washington D.C. when it happened, and hundreds of pagers and cell-phones went off all at the same time. Although we were outside the effected area and had plenty of lighting, we ended up canceling therest of my talk, and many people left immediately to help execute their business continuity plans.Of course, terrorism was immediately assumed, but a final report showed that it was initiated in Ohiodue to overgrown trees, and then propagated due to a software bug to hundreds of other plants.
According to this morning's Bogota newspaper, "El Tiempo", nobody knows the root cause of yesterday's outage. Immediately, the country's leftist rebels were blamed, but now the leading theory is that it was initiated byoperator error (a technician touching something he shouldn't have), and then propagated by a faulty distribution system.
Another example of the need for a robust and resilient infrastructure, and appropropriate business continuity plans.
technorati tags: Colombia, black-out, Northeast, El Tiempo, Business Continuity, Infrastructure
I survived my first day at SNW Spring 2007
.This is my first time at SNW, but it is very much like many of the other conferences I have been to.It officially started Monday morning with pre-conferencetutorials and primer break-outsessions that covered storage fundamentals, but I didn't arrive until late Monday night due to highwind conditions at the Phoenix airport that delayed my travel.
Tuesday started out with main tent sessions. Ron Milton, VP of ComputerWorld that puts on this conference,and Vincent Franceschini, Chairman of the Board for SNIA, kicked off the event.It didn't take them long to get into the alphabet soup: ILM, ITIL, SMI-S, XAM, IMA, MMA, DDF,MF, DMF, IPSF, SSIF, and SRM.Several hundred people had "voting devices" so that they could participate in "informal" surveys.
Q1. What was the greatest need?
- 37% Storage Resource Management (SRM) tools
- 19% Storage Virtualization
- 19% Information Lifecycle Management (ILM)
- 14% Integration with other management tools
- 11% Compliance storage for regulations
Q2. What are people doing to address storage infrastructure complexity?
- 33% Deploying new SRM and SAN management tools
- 26% Adopting "Storage as a Service" methodology
- 22% Deploying new storage virtualization technologies
- 8% Hiring more staff
- 9% (complexity was not an issue)
The first keynote speaker was Cora Carmody, CIO of SAIC. In the late 1980s and early 1990s, I did a lot of work with SAIC here in San Diego, and so IBM sent me to San Diego quite frequentlyfor face-to-face meetings with them. Her talk was cryptically titled "Jumbo Shrimp, InformationManagement, and the Mark of the Beast." Coming up with good titles is important. Some of herkey points:
- "Information management" was as much an oxymoron as "jumbo shrimp" or "military intelligence".(SAIC is a general contractor for the US Military, so this was especially funny).
- Computer data needs both "ownership" and "stewardship".
- Gartner analyst reports that 50% of digital information for a business resides in personal files onindividual PCs.
- PAN-StaRRs project is ingesting 10TB per week of astronomical data.
- TeraTEXT(R) project is a non-relational database that supports a large mix of structured and unstructured content.
- The next "Y2K" crisis for the USA is changing from 3-digit to 4-digit area codes for our telephone numbers.
- Battery size and life have not advanced as fast as we need
- There has been little progress in "User Interface" ease of use
- Formats and standards are picked for the most part by the winning vendors, and it is the silence of themarketplace that lets them get away with this.
- We are overly reliant on an inherently insecure medium.
- The "mark of the beast" refers to exciting new technologies based on "presence awareness". For example,some hotels now are able to check you into the hotel as you drive up in your car, based on your car's licenseplate. Some 24-hour gyms use your fingerprint as your entry credentials, eliminating the need to staff peopleat the front desk.
IBM's own Barry Rudolph, presented "Storage in an Age of Inconvenient Truths", dressed up like Oscar-winner andformer USA Vice President Al Gore. Barry's focus was on the growingconcern of over environmental Power and Cooling issues in the data center. According to IDC, the cost of power and cooling an individual server, over its lifetime, now exceeds its acquisition cost. Storage devices are not as bad as servers in this regard. Data centers now consume 1.2% of the worlds energy.
Over lunch, I heard Tony Asaro from ESG present "The Need for Highly Virtualized Storage Systems withina Virtualized Data Center." His concern is that there is still a "heavy touch" required to manage storage.Without virtualization, your data center is less than the sum of its parts. Although IBM has been doingstorage virtualization since 1974, Tony mentioned that most storage vendors were "late to the party".He argues that "internal virtualization" inside storage arrays is not enough, you need "external virtualization"(like the IBM System Storage SAN Volume Controller) to virtualize your entire infrastructure.What storage administrators would like is for storage to have consumer levels of "ease of use", and today'snon-virtualized storage environments are nowhere near that.
"The great advantage [the telephone] possesses over every other form of electrical apparatus consists in the fact that it requires no skill to operate the instrument."
- Alexander Graham Bell, 1878
I attended a few break-out sessions in the afternoon.
- Ralph Wescott, Pacific Northwest National Library
Ralph presented "Crisis of Capacity" which covered the drastic actions he had to take to handle power and coolingin their expanding data center during their summer months, where temperatures peak up to 105 degrees. This included creating "hot" and "cold" aisles onhis raised floor by re-organizing the perforated floor tiles, and doing a better job standardizing how cables areconnected to the back of racks and up through the ceiling to maximize airflow. An amp-meter on each power strip was used to measure the powerused at each rack, which allowed them to better prioritize their efforts. Their Air Conditioning unit was only 12inches from the concrete floor, and raising it to 18 inches greatly reduced noise and vibration. Adding a second AC unit made a world of difference. Finally, they eliminatedKVMs, because people who use KVMs break other parts of thedata center. His rule of thumb: the cooling requirements will be 50% of the rated power requirements for equipment.
- Terry Yoshi, Intel internal IT department, as a member of the SNIA's end user council
Terry presented "Taming the SAN Complexity". The problem with "complexity" as a concept is that it is very subjective, difficult to quantify, and therefore difficult to manage. He presented complexity in four areas:Organizational structure of the company as a whole; skill sets required of the IT staff; business process andprocedures; and technology. Dealing with complexity is a battle between Old School (because we've always doneit this way) and New School (because it is new and different technology). Storage Area Networks are inherentlya "shared resource", and the increased complexity is a direct result of the low reliability of the componentsand devices it is composed of. People should focus on the "Total Cost of Ownership" (TCO) for a SAN, and not just the initial acquisitionprice of SAN gear.He was not a fan of the "dual/multiple" vendor strategy that many companies employto reduce costs. His suggestion that things should be tried out first on your "test SAN" caused some chuckles,as few have such a thing. Finally, he suggested not only documenting "Best Practices" and "Best Known Methods"but also things that have been found not to work, his do-not-try-this-at-home list.
- Tony Antony, Cisco marketing manager for Optical products
This was an overview of the technologies available for long distance connections for disaster recovery,business continuity, and resilience. He covered three levels.
His rule of thumb: one buffer credit for every kilometer at 2Gbps speed (for every 2km at 1Gbps).
- IP - Fibre Channel of IP (FCIP) offers the greatest "global" distance but forces people into asynchronous mirroring.
- SONET/SDH - SONET is what we call it in the USA, and SDH is what it is called in other countries. This provides state-to-state or "out-of-region" distances, which is ideal to meet certain government regulations for homeland defense. He suggests this is offered when dark fiber or DWDM is not available.
- DWDM/CWDM - this is using a prism to run multiple colors of light through a single fiber optic cable. CWDM ischeaper, but only handles 8 signals per cable. DWDM can handle 32 to 160 signals per cable, but is more expensive.
The day ended at the "Expo". I hung out at the IBM booth to help answer questions and network with others.
technorati tags: IBM, SNW, Ron Milton, ComputerWorld, Vincent Franceschini, SNIA, SAIC, Barry Rudolph, Al Gore, Inconvenient Truth, presence awareness, Tony Asaro, ESG, Alexander Graham Bell, Ralph Wescott, Pacific Northwest National Library, Terry+Yoshi, Intel
The blogosphere has quieted down a bit over the two papers on MTBF estimates for Disk Drive Modules (DDM).One article on SearchStorage.com by Arun Taneja asksIs RAID passé?
Disk capacity is growing at a faster rate than DDM reliability. During the hours to rebuild a DDM, companies are at risk of additional failures that could require recovery from a copy, or result in data loss, depending on how well your Business Continuity (BC) plan is written and followed.
I'll discuss two comments in particular.
Joerg Hallbauer felt I did not address all the issues raised:
... The problem with that is that it's the DISK ARRAY that determines when a drive has failed an starts the rebuild process. That IS under the control of IBM, specifically the controller. But more importantly, it effects my risk of data loss.
As I see it, my risk of data loss with RAID-5 is influenced by two main factors. 1 - The drive replacement rate and 2 - The rebuild time (which to a great extent is a function of the drive size) both of which IBM has some control over.
So, I think that the question in my mind is, what's the tipping point? Where does the risk of using RAID-5 protection exceed what I'm willing to accept, and I need to move to some other protection mechanism like RAID-6? Is it when the rebuild times exceed 12 hours? 24 hours? 48 hours?
Also, I wonder why IBM isn't publishing some information to help me make these kinds of decisions?
Bill Todd felt I was not technical enough:
Oh, dear - while Tony doesn’t seem to be parrying vigorously (as Seagate, Hitachi, and Chunk were doing), his contribution sounds more like IBM marketing than the kind of detailed, technical response one might have hoped for
... well, he *is* a manager, and a marketing one at that, so perhaps we shouldn’t expect more).
Both are fair comments. Disk arrays do run microcode to assist or perform the RAID function, detect failures and start the rebuild process, and so clever designs to support spare disks, process the rebuild quickly, and so on, can differentiate one vendor's offering from another.
On the issue of what does IBM provide to help its clients make the right decisions for their environments, Jon William Toigo at DrunkenData points his readers to IBM's Business Continuity Self-Assessment tool. In normal data center conditions, DDMs will fail, and a Business Continuity plan shouldbe written and developed to handle this fact. Using 2-site and 3-site mirroring, complemented with versions of tape backups, can help address some of these concerns and mitigate some of the risks involved with using disk systems.
For those who want a more technical answer, IBM has just published a series of IBM Redbooks.
Each client's situation is different, so no simple answer is possible. However, IBM does have a lot of experience in this area, and would be glad to help you write or update your existing Business Continuity plan.
technorati tags: IBM, disk, MTBF, estimates, papers, Arun Taneja, Jon William Toigo, StorageMojo, Business Continuity, Redbooks
Well, this week I am in Maryland, just outside of Washington DC. It's a bit cold here.
Robin Harris over at StorageMojo put out this Open Letter to Seagate, Hitachi GST, EMC, HP, NetApp, IBM and Sun about the results of two academic papers, one from Google, and another from Carnegie Mellon University (CMU). The papers imply that the disk drive module (DDM) manufacturers have perhaps misrepresented their reliability estimates, and asks major vendors to respond. So far, NetAppand EMC have responded.
I will not bother to re-iterate or repeat what others have said already, but make just a few points. Robin, you are free to consider this "my" official response if you like to post it on your blog, or point to mine, whatever is easier for you. Given that IBM no longer manufacturers the DDMs we use inside our disk systems, there may not be any reason for a more formal response.
- Coke and Pepsi buy sugar, Nutrasweet and Splenda from the same sources
Somehow, this doesn't surprise anyone. Coke and Pepsi don't own their own sugar cane fields, and even their bottlers are separate companies. Their job is to assemble the components using super-secret recipes to make something that tastes good.
IBM, EMC and NetApp don't make DDMs that are mentioned in either academic study. Different IBM storage systems uses one or more of the following DDM suppliers:
- Seagate (including Maxstor they acquired)
- Hitachi Global Storage Technologies, HGST (former IBM division sold off to Hitachi)
In the past, corporations like IBM was very "vertically-integrated", making every component of every system delivered.IBM was the first to bring disk systems to market, and led the major enhancements that exist in nearly all disk drives manufactured today. Today, however, our value-add is to take standard components, and use our super-secret recipe to make something that provides unique value to the marketplace. Not surprisingly, EMC, HP, Sun and NetApp also don't make their own DDMs. Hitachi is perhaps the last major disk systems vendor that also has a DDM manufacturing division.
So, my point is that disk systems are the next layer up. Everyone knows that individual components fail. Unlike CPUs or Memory, disks actually have moving parts, so you would expect them to fail more often compared to just "chips".
If you don't feel the MTBF or AFR estimates posted by these suppliers are valid, go after them, not the disk systems vendors that use their supplies. While IBM does qualify DDM suppliers for each purpose, we are basically purchasing them from the same major vendors as all of our competitors. I suspect you won't get much more than the responses you posted from Seagate and HGST.
- American car owners replace their cars every 59 months
According to a frequently cited auto market research firm, the average time before the original owner transfers their vehicle -- purchased or leased -- is currently 59 months.Both studies mention that customers have a different "definition" of failure than manufacturers, and often replace the drives before they are completely kaput. The same is true for cars. Americans give various reasons why they trade in their less-than-five-year cars for newer models. Disk technologies advance at a faster pace, so it makes sense to change drives for other business reasons, for speed and capacity improvements, lower power consumption, and so on.
The CMU study indicated that 43 percent of drives were replaced before they were completely dead.So, if General Motors estimated their cars lasted 9 years, and Toyota estimated 11 years, people still replace them sooner, for other reasons.
At IBM, we remind people that "data outlives the media". True for disk, and true for tape. Neither is "permanent storage", but rather a temporary resting point until the data is transferred to the next media. For this reason, IBM is focused on solutions and disk systems that plan for this inevitable migration process. IBM System Storage SAN Volume Controller is able to move active data from one disk system to another; IBM Tivoli Storage Manager is able to move backup copies from one tape to another; and IBM System Storage DR550 is able to move archive copies from disk and tape to newer disk and tape.
If you had only one car, then having that one and only vehicle die could be quite disrupting. However, companies that have fleet cars, like Hertz Car Rentals, don't wait for their cars to completely stop running either, they replace them well before that happens. For a large company with a large fleet of cars, regularly scheduled replacement is just part of doing business.
This brings us to the subject of RAID. No question that RAID 5 provides better reliability than having just a bunch of disks (JBOD). Certainly, three copies of data across separate disks, a variation of RAID 1, will provide even more protection, but for a price.
Robin mentions the "Auto-correlation" effect. Disk failures bunch up, so one recent failure might mean another DDM, somewhere in the environment, will probably fail soon also. For it to make a difference, it would (a) have to be a DDM in the same RAID 5 rank, and (b) have to occur during the time the first drive is being rebuilt to a spare volume.
- The human body replaces skin cells every day
So there are individual DDMs, manufactured by the suppliers above; disk systems, manufactured by IBM and others, and then your entire IT infrastructure. Beyond the disk system, you probably have redundant fabrics, clustered servers and multiple data paths, because eventually hardware fails.
People might realize that the human body replaces skin cells every day. Other cells are replaced frequently, within seven days, and others less frequently, taking a year or so to be replaced. I'm over 40 years old, but most of my cells are less than 9 years old. This is possible because information, data in the form of DNA, is moved from old cells to new cells, keeping the infrastructure (my body) alive.
Our clients should approach this in a more holistic view. You will replace disks in less than 3-5 years. While tape cartridges can retain their data for 20 years, most people change their tape drives every 7-9 years, and so tape data needs to be moved from old to new cartridges. Focus on your information, not individual DDMs.
What does this mean for DDM failures. When it happens, the disk system re-routes requests to a spare disk, rebuilding the data from RAID 5 parity, giving storage admins time to replace the failed unit. During the few hours this process takes place, you are either taking a backup, or crossing your fingers.Note: for RAID5 the time to rebuild is proportional to the number of disks in the rank, so smaller ranks can be rebuilt faster than larger ranks. To make matters worse, the slower RPM speeds and higher capacities of ATA disks means that the rebuild process could take longer than smaller capacity, higher speed FC/SCSI disk.
According to the Google study, a large portion of the DDM replacements had no SMART errors to warn that it was going to happen. To protect your infrastructure, you need to make sure you have current backups of all your data. IBM TotalStorage Productivity Center can help identify all the data that is "at risk", those files that have no backup, no copy, and no current backup since the file was most recently changed. A well-run shop keeps their "at risk" files below 3 percent.
So, where does that leave us?
- ATA drives are probably as reliable as FC/SCSI disk. Customers should chose which to use based on performance and workload characteristics. FC/SCSI drives are more expensive because they are designed to run at faster speeds, required by some enterprises for some workloads. IBM offers both, and has tools to help estimate which products are the best match to your requirements.
- RAID 5 is just one of the many choices of trade-offs between cost and protection of data. For some data, JBOD might be enough. For other data that is more mission critical, you might choose keeping two or three copies. Data protection is more than just using RAID, you need to also consider point-in-time copies, synchronous or asynchronous disk mirroring, continuous data protection (CDP), and backup to tape media. IBM can help show you how.
- Disk systems, and IT environments in general, are higher-level concepts to transcend the failures of individual components. DDM components will fail. Cache memory will fail. CPUs will fail. Choose a disk systems vendor that combines technologies in unique and innovative ways that take these possibilities into account, designed for no single point of failure, and no single point of repair.
So, Robin, from IBM's perspective, our hands are clean. Thank you for bringing this to our attention and for giving me the opportunity to highlight IBM's superiority at the systems level.
technorati tags: IBM, Seagate, Hitachi, HGST, EMC, NetApp, HP, HDS, Sun, Google, CMU, DDM, Fujitsu, MTBF, MTTF, AFR, ARR, JBOD, RAID, Tivoli, SVC, DR550, CDP, FC, SCSI, disk, tape, SAN,