Special thanks to Anthony Vandewerdt, who sent me the version of this presentation that he had planned to present in Australia next week. I "smartened it up" (whatever the opposite of "dumbed it down" is) for the technical audience.
Recovery procedures for single and double drive failures. A double drive failure on an XIV typically involves less recovery effort than on traditional RAID5-based disk systems, and in many cases results in no data loss whatsoever. I provided details on this in my blog post [Double Drive Failure Debunked: XIV Two Years Later], so no need to repeat myself here.
Replacing the Automatic Transfer Switch (ATS) non-disruptively. To support either single-phase or three-phase power sources, the XIV uses an ATS to take two independent power feeds and distribute them out to the three Uninterruptible Power Supplies (UPS).
Built-in Migration capability to copy data off other disk systems over to the XIV.
Configuring Synchronous and Asynchronous mirroring using either the Fibre Channel or Internet Protocol ports.
Optimizing the use of XIV for VMware, AIX and other operating systems.
The IBM XIV Storage System is quite popular in New Zealand, with four times more boxes sold per capita than any other country in the Asia Pacific region. I covered both the A14 model and the new Gen3 model.
Business Continuity/Disaster Recovery (BC/DR) Update: Lessons, Planning, Solutions
My colleague Vic Peltz from IBM Almaden presented lessons learned from Hurricane Katrina and various other natural disasters. Unlike traditional presentations that focus on technology, Vic took a different approach, focusing on people and procedures. I was here last year when the earthquake hit Christchurch on the South Island, so I was well aware that BC/DR was top of mind for many of the attendees. Throughout this week, I have felt tremors, and many of the locals told me that these happen all the time.
Introduction to IBM Storwize V7000
I knew I was in trouble when the request for me to present Storwize sounded like something from [Mission Impossible]:
"Good morning, Mr. Pearson. Your mission, should you choose to accept it, involves presenting Storwize V7000 in Auckland, New Zealand. You may also present the Storwize V7000 Unified, but it is essential that you not cover the SAN Volume Controller or SONAS products from which they are based upon, as you will not have enough time. The audience is very technical, so be careful. As always, should any questions come up that you cannot answer, the conference coordinators will disavow all knowledge of your actions, nor reimburse your laundry charges. This message will self-destruct in five seconds."
Well, I accomplished my mission in 75 minutes. I was able to cover the block-only version of the IBM Storwize V7000, with support for clustering the control enclosures, expansion drawers and external storage virtualization. I then spent a few minutes on the block-and-file Storwize V7000 Unified, which adds support for CIFS, NFS, HTTPS, FTP and SCP protocols through two new "file modules", with integrated support for backup and anti-virus checking. I covered both IBM Easy Tier for sub-LUN automated tiering between Solid-State Drives (SSD) and spinning disk, as well as Active Cloud Engine for file-based movement between disk and tape.
Last week, US President Barack Obama declared September 2011 as "National Preparedness Month". Here is an excerpt of the press release:
Whenever our Nation has been challenged, the American people have responded with faith, courage, and strength. This year, natural disasters have tested our response ability across all levels of government. Our thoughts and prayers are with those whose lives have been impacted by recent storms, and we will continue to stand with them in their time of need. This September also marks the 10th anniversary of the tragic events of September 11, 2001, which united our country both in our shared grief and in our determination to prevent future generations from experiencing similar devastation. Our Nation has weathered many hardships, but we have always pulled together as one Nation to help our neighbors prepare for, respond to, and recover from these extraordinary challenges.
In April of this year, a devastating series of tornadoes challenged our resilience and tested our resolve. In the weeks that followed, people from all walks of life throughout the Midwest and the South joined together to help affected towns recover and rebuild. In Joplin, Missouri, pickup trucks became ambulances, doors served as stretchers, and a university transformed itself into a hospital. Local businesses contributed by using trucks to ship donations, or by rushing food to those in need. Disability community leaders worked side-by-side with emergency managers to ensure that survivors with disabilities were fully included in relief and recovery efforts. These stories reveal what we can accomplish through readiness and collaboration, and underscore that in America, no problem is too hard and no challenge is too great.
Preparedness is a shared responsibility, and my Administration is dedicated to implementing a "whole community" approach to disaster response. This requires collaboration at all levels of government, and with America's private and nonprofit sectors. Individuals also play a vital role in securing our country. The National Preparedness Month Coalition gives everyone the chance to join together and share information across the United States. Americans can also support volunteer programs through www.Serve.gov, or find tools to prepare for any emergency by visiting the Federal Emergency Management Agency's Ready Campaign website at [www.Ready.gov] or [www.Listo.gov].
In the last few days, we have been tested once again by Hurricane Irene. While affected communities in many States rebuild, we remember that preparedness is essential. Although we cannot always know when and where a disaster will hit, we can ensure we are ready to respond. Together, we can equip our families and communities to be resilient through times of hardship and to respond to adversity in the same way America always has -- by picking ourselves up and continuing the task of keeping our country strong and safe.
NOW, THEREFORE, I, BARACK OBAMA, President of the United States of America, by virtue of the authority vested in me by the Constitution and the laws of the United States, do hereby proclaim September 2011 as National Preparedness Month. I encourage all Americans to recognize the importance of preparedness and observe this month by working together to enhance our national security, resilience, and readiness.
IBM has several webinars to help you prepare for upcoming disasters.
Today, September 8, at 4pm EDT, IBM is hosting a [CloudChat on Business Resilience] that will focus on resiliency and continuity in the cloud, a timely topic considering the recent weather events on the East Coast of the U.S. This chat will include Richard Cocchiara, IBM Distinguished Engineer and CTO, IBM Business Continuity and Resiliency Services (@RichCocchiara1), and Patrick Corcoran, Global Business Development, IBM Business Continuity and Resiliency Services (@PatCorcoranIBM).
Don't think you can afford Disaster Recovery planning? Next week, September 13, I will be joined by a few other experts to discuss freeing up much-needed funds from your tight IT budget by being more efficient. The webinar [Taming Data Growth Made Easy] is part of IBM's "IT Budget Killer" series.
Lastly, on September 21, IBM will host the webinar [Planning for Disaster Recovery in a Power Environment: Best Practices to Protect Your Data]. This will cover principal lessons learned from disasters like Hurricane Katrina and the World Trade Center, local and regional considerations for Disaster Recovery planning, setting Recovery Time Objectives (RTOs), and best practices for automation, mirroring and multi-site operational efficiencies. A customer case study from the University of Rochester Medical Center (URMC) will help reinforce the concepts, with a discussion of how a major hospital ensures Business Continuity via contingency planning using IBM Power Systems. The speakers include Steve Finnes, Worldwide Offering Manager for IBM Power Systems; Vic Peltz, Consulting IT Architect for WW Business Continuance Technical Marketing; and Rick Haverty, Director of IT Infrastructure at University of Rochester Medical Center (URMC).
Hopefully, you will find these webinars useful and informative!
During lunch, people were able to take a look at our solutions. Here are Dan Thompson and Brett Cooper striking a pose.
Hyper-Efficient Backup and Recovery
The afternoon was kicked off by Dr. Daniel Sabbah, IBM General Manager of Tivoli software. He started with some shocking statistics: 42 percent of small companies have experienced data loss, and 32 percent have lost data forever. IBM has a solution that offers "Unified Recovery Management", which combines periodic backups, frequent snapshots, and remote mirroring.
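To make the "Unified Recovery Management" idea a bit more concrete, here is a toy sketch in Python of how those three protection layers might be expressed as a single policy. The field names are my own invention for illustration, not an actual TSM or IBM API:

```python
# Illustrative only: one policy combining the three protection layers
# named above. Field names are hypothetical, not a real IBM/TSM API.
protection_policy = {
    "periodic_backup": {"frequency": "daily", "target": "disk_storage_pool"},
    "snapshots":       {"frequency": "hourly", "copies_retained": 24},
    "remote_mirror":   {"mode": "asynchronous", "site": "secondary_dc"},
}

# A unified view means one place to answer: "how is this workload protected?"
for layer, settings in protection_policy.items():
    print(f"{layer}: {settings}")
```

The point of unifying the three layers is exactly this kind of single view: each layer covers a different failure mode (logical corruption, operational error, site loss), and managing them together avoids gaps between them.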
IBM Tivoli Storage Manager (TSM) was introduced in 1993 as the first backup software solution to support backup to disk storage pools. Today, TSM is also part of Cloud Computing services, including IBM Information Protection Services. IBM announced today a new bundle called IBM Storwize Rapid Application Backup, which combines the IBM Storwize V7000 midrange disk system, Tivoli FlashCopy Manager, and implementation services with a full three-year hardware and software warranty. This could be used, for example, to protect a Microsoft Exchange email system with 9,000 mailboxes.
IBM also announced that its TS7600 ProtecTIER data deduplication solutions have been enhanced to support many-to-many bi-directional remote mirroring. Last year, the University of Pittsburgh Medical Center (UPMC) reported averaging a 24x data deduplication factor in their environment using IBM ProtecTIER.
"You are out of your mind if you think you can live without tape!"
-- Dick Crosby, Director of System Administration, Estes
The new IBM TS1140 enterprise-class tape drive processes 2.3 TB per hour and provides a density of 1.2 PB per square foot. The new 3599 tape media can hold 4 TB of data uncompressed, or up to 10 TB at a 2.5x compression ratio.
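As a quick sanity check of those media numbers, the 10 TB figure is simply the native capacity multiplied by the quoted compression ratio:

```python
# Back-of-the-envelope check of the 3599 media capacity quoted above.
native_capacity_tb = 4.0   # uncompressed capacity per cartridge, from the text
compression_ratio = 2.5    # compression ratio quoted in the text

effective_tb = native_capacity_tb * compression_ratio
print(f"Effective capacity: {effective_tb:.0f} TB per cartridge")  # 10 TB
```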
The United States Golf Association [USGA] uses IBM's backup cloud, which manages over 100 PB of data from 750 locations across five continents.
Customer Testimonial - Graybar
Randy Miller, Manager of Technical System Administration at Graybar, provided the next client testimonial. Graybar is an employee-owned company focused on supply-chain management, serving as a distributor for electrical, lighting, security, and power and cooling equipment.
Their problem was that they had 240 different locations, and expecting local staff to handle tape backups was not working out well, so they centralized their backups to their main data center. In the event that a system fails in one of their many remote locations, they can rebuild a new machine at the main data center across a high-speed LAN, and then ship it overnight to the remote location. The result: the remote location has a system up and running by 10:30am, faster than local staff could have managed while trying to figure out how to recover from tape. In effect, Graybar implemented a "private cloud" for backup in the 1990s, long before the concept was "cool" or "popular".
In 2001, they had an 18 TB SAP ERP application data repository. To back this up, they took it down for one minute per day, six days a week, and for 15 minutes on Sundays. The result was just under 99.8 percent availability. To fix this, they switched to XIV and now use snapshots, which are non-disruptive and do not impact application performance.
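If you are wondering where that availability figure comes from, here is the arithmetic, using the downtime schedule described above:

```python
# Graybar's old backup window: 1 minute/day for six days, 15 minutes on Sunday.
weekly_downtime_min = 6 * 1 + 15        # 21 minutes of downtime per week
minutes_per_week = 7 * 24 * 60          # 10,080 minutes in a week

availability = 1 - weekly_downtime_min / minutes_per_week
print(f"Availability: {availability:.4%}")  # 99.7917%, just under 99.8 percent
```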
Over 85 percent of the servers at Graybar are virtualized.
Their next challenge is Disaster Recovery. Currently, they have two data centers, one in St. Louis and the other in Kansas City. However, in the aftermath of Japan's earthquakes, they realized there is a nuclear power plant between their two locations, so a single incident could impact both data centers. They are working with IBM, their trusted advisor, to investigate a three-site solution.
This week, May 15-22, I am in Auckland, New Zealand, teaching the IBM Storage Top Gun sales class. Next week, I will be in Sydney, Australia.
I'm down here in Australia, where the government has been stalled for the past two weeks, a state formally known as being managed by a [Caretaker government]. Apparently, there is a gap between the outgoing administration and the incoming administration, and the caretaker government does as little as possible until the new regime takes over. They are still counting votes, including in some cases dummy ballots known as "donkey votes", the Australian version of the hanging chad. Three independents are also trying to decide which major party they will support to finalize the process.
While we are on the topic of stalled governments, I feel bad for the state of Virginia in the United States. Apparently, one of their supposedly high-end enterprise-class EMC Symmetrix DMX storage systems, supporting 26 different state agencies in Virginia, crashed on August 25, and now, more than a week later, many of those agencies are still down, including the Department of Motor Vehicles and the Department of Taxation and Revenue.
Many of the articles in the press on this event have focused on what this means for the reputation of EMC. Not surprisingly, EMC says that this failure is unprecedented, but really it is just one in a long series of failures from EMC. It reminds me of EMC's last public failure a few months ago, when a dual-controller CLARiiON halted another company's operations. There is nothing unique in the physical equipment itself; all IT gear can break or be taken down by some outside force, such as a natural disaster. The real question, though, is why EMC and the state government have not been able to restore operations many days after the hardware was fixed.
In the Boston Globe, Zeus Kerravala, a data storage analyst at Yankee Group in Boston, is quoted as saying that such a high-profile breakdown could undermine EMC’s credibility with large businesses and government agencies. “I think it’s extremely important for them,’’ said Kerravala. “When you see a failure of this magnitude, and their inability to get a customer like the state of Virginia up and running almost immediately, all companies ought to look at that and raise their eyebrows.’’
Was the backup and disaster recovery solution capable of meeting the scale and service-level requirements of vital state agencies? Had they tested their backups to ensure they were running correctly, and had they tested their recovery plans? Were they monitoring the success of recent backup operations?
Eventually, the systems will be back up and running, fines and penalties will be paid, and perhaps the guy who chose to go with EMC might feel bad enough to give back that new set of golf clubs, or whatever ridiculously expensive gift EMC reps might offer to government officials these days to influence the purchase decision making process.
(Note: I am not accusing any government employee in particular working at the state of Virginia of any wrongdoing, and mention this only as a possibility of what might have happened. I am sure the media will dig into that possibility soon enough during their investigations, so no sense in me discussing that process any further.)
So what lessons can we learn from this?
Lesson 1: You don't just buy technology, you are also choosing to work with a particular vendor
IBM stands behind its products. Choosing a product strictly on its speeds and feeds misses the point. A study IBM conducted with Mercer Consulting Group back in 2007 found that only 20 percent of the storage purchase decision was based on technical capabilities. The other 80 percent was based on "wrapper attributes", such as who the vendor was, their reputation, and their service, support and warranty options.
Lesson 2: Losing a single disk system is a disaster, so disaster recovery plans should apply
IBM has a strong Business Continuity and Recovery Services (BCRS) group to help companies and government agencies develop their BC/DR plans. In the planning process, various possible incidents are identified, recovery point objectives (RPO) and recovery time objectives (RTO) are set, and appropriate action plans are documented for how to deal with them. For example, if the state of Virginia had an RPO of 48 hours and an RTO of 5 days, then when the failure occurred on August 25, they could have recovered data as of August 23 (48 hours prior to the incident) and been up and running by August 30 (five days after the incident). I don't personally know what RPO and RTO they planned for, but it certainly seems they have already missed them.
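The date arithmetic in that example is straightforward. Here is a minimal sketch, assuming the hypothetical RPO of 48 hours and RTO of 5 days used above (the year is my assumption for illustration):

```python
from datetime import datetime, timedelta

incident = datetime(2010, 8, 25)   # failure date from the text; year assumed
rpo = timedelta(hours=48)          # recovery point objective: max data loss
rto = timedelta(days=5)            # recovery time objective: max downtime

recover_data_as_of = incident - rpo    # worst case: data as of August 23
back_in_service_by = incident + rto    # services restored by August 30
print(f"Data recovered as of: {recover_data_as_of:%B %d}")
print(f"Back in service by:   {back_in_service_by:%B %d}")
```

The RPO bounds how much recent data you are willing to lose; the RTO bounds how long you are willing to be down. Both are planning targets agreed on in advance, not guarantees.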
Lesson 3: BC/DR Plans only work if you practice them often enough
Sadly, many companies and government agencies make plans but never practice them, so they have no idea whether the plans will work as expected or are fundamentally flawed. Just as fire drills force everyone to stop what they are doing and vacate the office building, anyone with an IT department needs to practice BC/DR plans often enough to ensure that the plan itself is solid, and that the people involved know what to do and their respective roles in the recovery process.
Lesson 4: This can serve as a wake-up call to consider Cloud Computing as an alternative option
Are you still doing IT in your own organization? Do you feel all of the IT staff have been adequately trained for the job? If your biggest disk system completely failed, not just a minor single or double drive failure, but a huge EMC-like failure, would your IT department know how to recover in less than five days? Perhaps this will serve as a wake-up call to consider alternative IT delivery options. The advantage of big Cloud Service Providers (Microsoft, Google, Yahoo, Amazon, SalesForce.com and of course, IBM) is that they are big enough to have worked out all the BC/DR procedures, and have enough resources to switch over to in case any individual disk system fails.
The BP oil spill in the Gulf of Mexico is a good reminder that all organizations should practice the execution of their contingency plans. In this most recent case, the [Deepwater Horizon] oil platform had an explosion on April 20, resulting in oil spewing out at an estimated 19,000 barrels per day. While some bloggers have argued that BP failed to plan, and therefore planned to fail, I find that hard to believe. How can a billion-dollar multinational company not have contingency plans?
The truth is, BP did have plans. Karen Dalton Beninato of New Orleans' City Voices discusses BP's Gulf of Mexico Regional Oil Spill Response Plan (OSRP) in her article [BP's Spill Plan: What they knew and when they knew it]. A [redacted 90-page version of the OSRP] is available on their website.
The plan indicates that it may take 30 days for oil from a deep offshore leak to reach the shoreline, giving OSRP participants plenty of time to take action.
(Having former politicians [blame environmentalists] for this crisis does not help much either. At least deep offshore rigs give you 30 days to react to a leak before the oil reaches the shoreline; having oil rigs closer to shore would only shorten this time to react. Allowing onshore oil rigs does not mean oil companies would discontinue their deep offshore operations; there are thousands of oil rigs in the Gulf of Mexico. Extracting oil in the beautiful Arctic National Wildlife Refuge [ANWR] might be safer, but it does not eliminate the threat entirely, and any leak there would be damaging to the local plants and animals in the same manner.)
So perhaps the current crisis was not the result of a lack of planning, but inadequate practice and execution. The same is true for IT Business Continuity / Disaster Recovery (BC/DR) plans. In all cases, there are four critical parts:
The planning team needs to anticipate every possible incident, determine the risks involved and the likelihood of impact, and either accept them, or decide to mitigate them. This can include natural disasters (hurricanes, fires, floods) and technical issues (computer viruses, power outages, network disruption).
Mitigation can involve taking backups, having replicated copies at a remote location, creating bootable media, training all of the appropriate employees, and having written documented procedures. IBM's Unified Recovery Management approach can protect your entire IT operations, from laptops of mobile employees, to remote office/branch office (ROBO) locations, to regional and central data centers.
When was the last time you practiced your Business Continuity / Disaster Recovery plan? I have seen this done at a variety of levels. At the lowest level, it is all done on paper, in a conference room, with all participants talking through their respective actions. These are often called "walk-throughs". At the highest level, you turn off power to your data center --on a holiday weekend to minimize impact to operating revenues-- and have the team bring up applications at the alternate site.
As many as 80 percent of these BC/DR exercises are considered failures, in the sense that, had a real disaster occurred, the participants are convinced they would not have achieved their Recovery Time Objective (RTO) targets. However, they are not complete failures if they help improve the plans, identify new incidents that were not previously considered, and train the participants in recovery procedures.
The last part is execution. In my career, I have been onsite for many Disaster Recovery exercises, as well as after real disasters have occurred. It no longer surprises me how many people assume that if they have plans in place, have made preparations, and run one to three practice drills per year, the actual "execution" will follow directly. While the book [Execution] by Bossidy and Charan is not focused on IT BC/DR plans per se, it is a great read on how to manage the actual execution of any kind of business plan. I have read this book and recommend it.
If you have not tested your IT department's BC/DR plan lately, perhaps it's time to dust off your copy, review it, and schedule some time for practice.