IBM SAN Volume Controller (SVC) has offered Fibre Channel Storage Virtualization since June 2003. Two SVC nodes communicate with each other via fibre channel to form a high availability I/O group. They then communicate with the storage that they virtualize via Fibre Channel and with the hosts they serve that virtual storage to, via Fibre Channel. When IBM added real-time (metro mirror) and near real-time (global mirror) replication it was also done using Fibre Channel, with each SVC cluster communicating to the other by connecting using fibre channel protocol transported over dark fibre with or without a WDM or via FCIP (Fibre Channel over IP) routers.
Each Fibre Channel port on an SVC node can act as a SCSI initiator to backend storage and as a SCSI target to hosts, while also communicating with its peer nodes over those same ports. With every generation of SVC node these ports got faster, going from 2 Gbps to 4 Gbps to 8 Gbps. In SVC firmware V5.1 IBM added iSCSI capability to the SVC using the two 1 Gbps Ethernet ports in each node. This allowed each node to also be an iSCSI target to LAN attached hosts.
When the Storwize V7000 came out in Oct 2010 it offered all of this capability, plus offered two fundamental changes to the design.
Firstly the two controllers in a Storwize V7000 can communicate with each other across an internal bus, eliminating the need to zone them together (or even attach the Storwize V7000 to Fibre Channel fabrics).
The other more obvious difference is that a Storwize V7000 comes with its own disks, which it communicates with via multi-lane 6 Gbps SAS.
When IBM added 10 Gbps Converged Enhanced Ethernet adapters to the SVC and to the Storwize V7000, these adapters operated as iSCSI targets, allowing clients to access their volumes via a high-speed iSCSI network. In V6.4 code IBM allowed these adapters to also be used for FCoE (Fibre Channel over Ethernet). These are also effectively SCSI target ports, allowing hosts that use CEE adapters to connect to the SVC or V7000 over a converged network.
If you have a look at the Configuration limits page for SVC and Storwize V7000 version 6.4 (the Storwize V7000 one is here), you will see this interesting comment:
"Partnerships between systems, for Metro Mirror or Global Mirror replication, do not require Fibre Channel SAN connectivity and can be supported using only FCoE if desired"
So does this mean we can stop using FCIP routers to achieve near real-time replication between SVC clusters or Storwize V7000s? The short answer is most likely not. Let's look at why...
The whole reason Fibre Channel became the standard method to interconnect Enterprise Storage to Enterprise hosts is simple: packet loss is prevented by buffer credit flow control. Frames are not allowed to enter a Fibre Channel network unless there are buffers in the system to hold them. Frames are normally only dropped if there is no destination to accept them. Fibre Channel is a highly reliable, scalable and mature architecture. When we extend Fibre Channel over a WAN we do not want to lose this reliable nature, so we use FCIP routers like Brocade 7800s that continue to ensure frames are reliably delivered in order, from one end point to another.
Converged enhanced ethernet allows Fibre Channel to be transported inside enhanced ethernet frames. The one fundamental that CEE brings to the table is the same principle that a frame should not enter the network without a buffer to hold it. Extending FCoE over distance has the same challenge: the moment you start moving those frames over a WAN connection you need to ensure frames are not lost due to congestion. How do we do this? The same way we did with Fibre Channel: we use Dark Fibre, we use WDMs or we use routers. The same issues and requirements exist.
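To make the "no frame enters the network without a buffer to hold it" principle concrete, here is a toy Python model of credit-based flow control. It is only a sketch: the class and function names are invented for illustration, and it models a single sender and receiver rather than a real Fibre Channel or CEE fabric.

```python
from collections import deque


class CreditLink:
    """Toy model of credit-based flow control (hypothetical, for illustration)."""

    def __init__(self, receiver_buffers):
        # The sender starts with one credit per receive buffer.
        self.credits = receiver_buffers
        self.receive_queue = deque()

    def try_send(self, frame):
        """Send only if a credit (a free receive buffer) is available."""
        if self.credits == 0:
            return False  # the sender waits; the frame is held, not dropped
        self.credits -= 1
        self.receive_queue.append(frame)
        return True

    def receiver_process_one(self):
        """The receiver frees one buffer and hands a credit back to the sender."""
        if not self.receive_queue:
            return None
        frame = self.receive_queue.popleft()
        self.credits += 1
        return frame


def run_demo(frame_count=5, receiver_buffers=2):
    link = CreditLink(receiver_buffers)
    pending = deque("frame-%d" % i for i in range(frame_count))
    delivered = []
    stalls = 0
    while pending or link.receive_queue:
        # The sender pushes as many frames as its credits allow.
        while pending and link.try_send(pending[0]):
            pending.popleft()
        if pending:
            stalls += 1  # out of credits: the sender pauses instead of losing a frame
        frame = link.receiver_process_one()
        if frame is not None:
            delivered.append(frame)
    return delivered, stalls


if __name__ == "__main__":
    delivered, stalls = run_demo()
    print("delivered in order:", delivered)
    print("times the sender had to wait:", stalls)
```

The sender stalls several times but every frame arrives, in order. Stretch the link over a long WAN and the credits take longer to come back, which is why distance needs either enough buffers or a device that manages the link for you.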
For more information on FCoE over distance check out this fantastic Q&A from Cisco:
It's a story that has been repeated many times: You buy a shiny new storage system..... and it is beautiful.
Then... a disk fails, which takes just the tiniest bit of shine off.
No problem you declare! You place a service call and the disk is replaced. So far so good.
But then as the vendor service representative is walking out the door, it suddenly occurs to you... hey, that person is taking away the failed disk. Doesn't that disk have my data on it?
The short answer is that unless you have purchased self encrypting drives or are encrypting your data prior to writing it, then that failed drive will almost certainly contain some readable data. How readable will depend on the product. If the disk contains de-duplicated compressed data, it would present a great (but I suppose not insurmountable) challenge to any would-be data snooper. But a failed disk removed from a standard RAID array would contain data in sequential chunks (that are perhaps 256 KB in size). Whether that would be useful is another question.
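As a rough illustration of what "sequential chunks" means, the Python sketch below lists which byte ranges of a volume would sit on one member disk. It assumes plain round-robin striping with no parity rotation and a 256 KB chunk, which is a simplification; real RAID controllers lay data out in more complicated ways, and the function name is made up for this example.

```python
CHUNK_SIZE = 256 * 1024  # bytes; the "perhaps 256 KB" figure mentioned above


def chunks_on_disk(disk_index, data_disks, volume_bytes, chunk_size=CHUNK_SIZE):
    """Return the (start, end) byte ranges of the volume stored on one disk.

    Assumption: simple round-robin striping across `data_disks` disks,
    ignoring parity and any vendor-specific layout.
    """
    ranges = []
    total_chunks = -(-volume_bytes // chunk_size)  # ceiling division
    for chunk in range(disk_index, total_chunks, data_disks):
        start = chunk * chunk_size
        end = min(start + chunk_size, volume_bytes)
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    # Second disk (index 1) of four, holding part of a 2 MiB volume.
    for start, end in chunks_on_disk(1, 4, 2 * 1024 * 1024):
        print("bytes %d to %d" % (start, end - 1))
```

In this simplified layout a single disk holds only every fourth 256 KB piece of the volume, so small files may be recoverable whole while larger ones would be full of holes.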
So what to do?
First up, every responsible vendor takes great pains to ensure failed hard drives are not simply thrown in the dumpster or sold in job lots. As Railcorp in Australia found out the hard way (when they started selling off the media they had in the lost and found department) not controlling media with client data is a very bad idea. Instead responsible vendors usually return failed drives either to the original manufacturer (to get a warranty rebate) or to a reutilization center (either their own, or a third-party). In either case, there is a financial benefit to them to do this. The shipment will be done in a secure fashion and any disk drive that can be repaired will be thoroughly wiped. If not it will be securely destroyed. Again, all the major vendors should be able to produce a policy document explaining how this is done. For the majority of clients out there, I personally think this is good enough.
But what if you don't think this is good enough? What if your data is way too sensitive to take any risks?
Simple answer: Keep the failed disks.
A quick Google search came up with lots of easy to find programs from most major storage vendors. Just search for something like disk retention service (retention is the key word here). Here are some examples:
The only fly in the ointment is that these services are generally not free... and if you realize this only after the first drive has failed, you may find yourself negotiating with your vendor on price, well after the main purchase is complete. The only exception I have found so far is that IBM Australia lets you retain failed drives for free, provided the machine is covered by a Service Pac.
Of course maybe you knew this already and have always retained failed drives, but now your store-room is slowly filling with failed disks. Now what? Well I do not suggest you do this, but I sure laughed while watching it (sorry if there is an advertisement before-hand):
Instead, do a Google search for secure hard disk shredding or secure hard disk recycling. Examples I found in Australia very quickly (I have not contacted or dealt with either of these) included this one and this one. I am sure there are plenty of choices out there.
IBM has today announced a whole swag of planned new features across the entire IBM Storage product line. You can read the announcement letter here and I have also dropped the text at the bottom of this blog post (to save you clicking on the link).
It's a very impressive list, but to home in on a few of the more exciting offerings:
IBM Easy Tier will be enhanced to cache hot data in SSD storage installed in a client server. Looks like it will initially be a combination of DS8700/DS8800 and AIX or Linux servers. I am sure there are plenty who will immediately think of EMC VFCache, so I am keen to get more details so I can see how the two compare. If you are curious in the meantime, check out this EMC fact sheet and then read this fascinating interview with the CMO of FusionIO.
A new high density storage module will be made available, initially I suspect for the DS8800. This is a really important step as we are seeing a lot of new technologies emerging in the SSD space. This is because the technical requirements of SSD don't always line up with the architectures of existing storage controllers, so a custom built enclosure designed just for SSD makes perfect sense.
The IBM XIV will be enhanced with the ability to cluster multiple XIVs together and migrate volumes non-disruptively between them. The non-disruptive volume migration is a great new feature which should definitely help with swapping XIVs out as new models come available.
There are plenty of other new features as well, so check out the announcement letter reproduced below:
IBM® intends to support a number of new enhancements to a variety of IBM storage systems in the future. These enhancements will leverage innovative research on intelligent algorithms, automation, and virtualization that is being incorporated into products in the IBM storage portfolio. The statements of direction highlighted here are intended to provide a glimpse into the IBM storage roadmap for selected product capabilities.
IBM intends to deliver:
Advanced Easy Tier™ capabilities on selected IBM storage systems, including the IBM System Storage® DS8000® , designed to leverage direct-attached solid-state storage on selected AIX® and Linux™ servers. Easy Tier will manage the solid-state storage as a large and low latency cache for the "hottest" data, while preserving advanced disk system functions, such as RAID protection and remote mirroring.
An application-aware storage application programming interface (API) to help deploy storage more efficiently by enabling applications and middleware to direct more optimal placement of data by communicating important information about current workload activity and application performance requirements.
A new high-density flash storage module for selected IBM disk systems, including the IBM System Storage DS8000 . The new module will accelerate performance to another level with cost-effective, high-density solid-state drives (SSDs).
IBM intends to extend IBM Active Cloud Engine™ capabilities to:
Allow files on selected NAS devices to be virtualized by SONAS and Storwize® V7000 Unified. Virtualization capabilities provide access across a unified global namespace, while facilitating transparent file migrations in parallel with normal operations. This capability will help provide customer investment protection as clients continue to leverage their existing NAS assets while exploiting the capabilities of IBM Active Cloud Engine .
Enable file collaboration globally via IBM Active Cloud Engine . This capability will help enhance productivity where users at geographically dispersed locations can both share and modify the same file.
IBM intends to deliver Cloud features to SONAS and Storwize V7000 Unified to support:
Web Storage Services, a standards-based object store and API that implements the Cloud Data Management Interface (CDMI) standard from Storage Networking Industry Association (SNIA) to support the implementation of storage cloud services.
Self-service portal designed to speed storage provisioning, monitoring, and reporting.
IBM intends to support an increased scalability of capacity, performance, and host bandwidth by clustering IBM XIV® Gen3 systems together and providing the capability to migrate volumes across the cluster without disrupting applications. Management of the cluster will remain simple with consolidated views and shared configurations across the systems. These capabilities are intended to help clients address the scalability and management requirements for effective cloud computing.
IBM intends to extend NAS data retention enhancements for IBM Storwize V7000 Unified and IBM SONAS to provide file "immutability" to help support file integrity from the time the file is designated as immutable through its lifecycle. Immutability is intended to secure files from inadvertent or malicious change or deletion.
IBM intends to enable Real-time Compression for block and file workloads on Storwize V7000 Unified systems. This enhancement is designed to help clients experience the same high-performance compression for active primary block and file workloads on Storwize V7000 Unified that is being announced for block workloads on Storwize V7000. IBM Storwize V7000 Real-time Compression is designed to deliver enhanced storage efficiency with potential benefits including lower storage acquisition cost (because of the ability to purchase less hardware), reduced storage growth, and lower rack space, power, and cooling requirements.
All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information in the above paragraphs are intended to outline our general product direction and should not be relied on in making a purchasing decision. The information is for informational purposes only and may not be incorporated into any contract. This information is not a commitment, promise, or legal obligation to deliver any material, code, or functionality. The development, release, and timing of any features or functionality described for our products remains at our sole discretion.
One common question that I hear on a regular basis regards the availability of an SRA for VMware SRM 5.0 when using Storwize V7000 or IBM SVC running V6.3 firmware. This combination is currently unsupported as per the alert found here.
The good news is that there are now IBM SRAs available for clients running SRM in combination with V6.3 firmware. While this combination is still not listed on the VMware support matrix found here, you can download the SRAs direct from IBM if your need is urgent.
A few weeks ago I wrote a piece called How to spot an old IBMer. It was a sort of reminiscence about my early days with IBM but it turned out to be one that really struck a chord with many Big Blue veterans. In fact the response was overwhelming: I have never received more hits or more comments for anything I have written. It was also pleasing that these responses were almost universally positive.
So it's ironic that today I am becoming an ex-IBMer.
Yes it's time for me to move on, so I have decided to try something new. I am joining a really exciting IT startup called Actifio.
So for all of you who have worked with me and helped me over the past 23 years: Thank you. It has been an honor to work at IBM and I wish Big Blue and all who continue to work there, nothing but success and happiness.
So you need to do some disk performance testing? Maybe some benchmarking? What tools are out there to help you out? Well I am glad you asked... here are some that I use on my daily travels:
IOmeter is an old classic, with emphasis on the word old. At time of writing, the most recent update was from 2006. However it remains very popular mainly because it is free and easy to use.
Some tips when using IOmeter:
On Windows, IOmeter needs to be run as an Administrator. Not doing so seems to be the most common mistake people make (if you are not running as Administrator you don't see any drives). You can only run one instance of IOmeter in Windows, which means if multiple users log on to the same server, only one user can run IOmeter. You also really need to run IOmeter with multiple workers and a queue depth (or number of outstanding I/Os) greater than one. If you don't, you will not be able to drive the storage to saturation. For instance here are some results running 75% read I/O, 0% random, 4 KB blocks on a Windows 2008 machine with 4 workers. Each test ran against the same 128 GB volume on a Storwize V7000 backended by 4 x 300 GB SSDs in a RAID10 array. In each case I let the machine run for 10 minutes before taking the screen capture to ensure the performance was steady state and not peaking.
Firstly I used a queue depth of one. Aggregate performance was around 27000 IOPS.
Then I used a queue depth of 10. Aggregate performance was around 81000 IOPS.
I then used a queue depth of 20. Aggregate performance was around 113000 IOPS.
What I am trying to show is that taking the defaults (one worker with a queue depth of 1) will not drive the storage to a useful value for comparison... you need to do some tuning and some experimenting to get valid results. At some point increasing queue depths will not improve performance (it may actually decrease it).
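One way to read those three results is through Little's Law (outstanding I/Os = IOPS x average latency). The Python sketch below works out the average latency implied by each run. It assumes the queue depth setting applies per worker, so 4 workers at a queue depth of 10 means 40 outstanding I/Os; that is my reading of the test setup rather than something the screen captures confirm.

```python
WORKERS = 4

# (queue depth per worker, aggregate IOPS) from the three runs above
RESULTS = [(1, 27000), (10, 81000), (20, 113000)]


def implied_latency_ms(queue_depth, iops, workers=WORKERS):
    """Little's Law rearranged: latency = outstanding I/Os / IOPS."""
    outstanding = queue_depth * workers  # assumption: queue depth is per worker
    return outstanding / iops * 1000.0


if __name__ == "__main__":
    for queue_depth, iops in RESULTS:
        latency = implied_latency_ms(queue_depth, iops)
        print("queue depth %2d: %6d IOPS, about %.2f ms per I/O"
              % (queue_depth, iops, latency))
```

Under that assumption the implied latency climbs from roughly 0.15 ms to roughly 0.71 ms as the queue depth goes from 1 to 20, while IOPS grows only about fourfold. Each extra outstanding I/O buys less throughput and more waiting, which is why there is a point where raising the queue depth stops helping.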
There is an alternative to IOmeter called IOrate (created by an EMC employee). It is also very popular and appears to still be in active development. It is not unusual to see IBM performance whitepapers that used IOrate to generate the workload.
This is a fairly recent tool that I have not had a chance to try out (due to time pressures). The tool uses virtual machines under VMware to generate the I/O and includes some very nice workload capture and playback tools as well as reporting tools.
Jetstress is a benchmarking tool created by Microsoft to simulate Microsoft Exchange workloads. I like the fact you can configure it to run for very long periods and it has a more real world feel about it than just running empty I/Os. You can get the base software here, but you will also need some files from a Microsoft Exchange install DVD (or from an installed instance of Microsoft Exchange). If you cannot get to those files you cannot complete the startup process inside Jetstress.
Oracle offer a tool on their website called Orion, which will simulate the workload of an Oracle database. You can get the tool from here (although you will need to create a free Oracle user account before you can download it).
SDelete is not a benchmarking tool or a performance modelling tool. But it is a great way to generate I/O with very little effort. Just create a new drive in Windows and then run SDelete against it with the -c parameter. This parameter is used for secure deletion, so generates random patterns (which is real traffic - albeit 100% sequential writes). The syntax is like:
(updated April 20, 2012 - I found in version 1.6 of SDelete the meaning of the -z and -c parameters got swapped. In version 1.6 if you want random patterns use -c, if you want zeros use -z. In previous versions it is the other way around!).
Just doing file copies is probably the worst way to generate benchmarks, especially as a single copy is usually a single threaded operation.
I am sure there are plenty of other tools out there to generate benchmarks and simulate workload. My main concern with many of them is that synthetic (artificial) workloads do not reflect real world workloads.
Right now I am working on giving a client a recommended version of firmware for their Cisco MDS Fibre Channel switches. For FICON, the recommendations are easy, but for Open Systems there are so many choices. So what am I going to recommend?
FICON Switches and Directors
For FICON switches, sticking to the FICON (IBM Mainframe Fibre Connection) recommended versions (which are determined by the IBM System z Mainframe team), is a very good strategy. The best place to get these is here (standard IBM logon is required). Just look along the right hand column for the release letters.
The SAN-OS and NX-OS release notes found on the Cisco website also show recommended versions for FICON. For instance have a look at the FICON recommendations table in the release notes for version 5.2.2a that you can find here. The upgrade path is just below the table I have linked to. This link will get outdated over time (as newer versions come out), but you can list all the release notes here.
If you are using an IBM TS7700 you should also be aware of this page on the IBM Techdocs site.
So based on current versions, if you are running SAN-OS 3.3.1c or below you need to move to 4.2.7b (as per the non-disruptive upgrade path). I strongly recommend you get to at least version 4.2.7b and start planning to move to release 5.2.2 (provided your hardware supports it).
For open systems attached Fibre Channel switches there are a number of versions to choose from. There are five things to consider:
Being on the very latest version has a small potential risk (of undiscovered bugs). However being on very old versions has a greater implicit risk (of being exposed to KNOWN bugs). Just because you have not hit a bug yet does not protect you from potential issues, especially if your SAN is growing.
Your hardware. Some older Generation hardware is not supported at higher levels (for example Supervisor-1 cards cannot go past SAN-OS 3.3.5b) but later generation hardware is not supported at lower levels (for example Fabric 3 modules need NX-OS 5.2.2). The Cisco recommended versions page is the best place to confirm this.
End of life. As SAN-OS reached end of development in 2011, 3.3.5b is the best choice for all hardware that cannot upgrade to NX-OS. However be aware that some Cisco Generation 1 hardware (such as 2 Gbps capable hardware) will go end of service in September 2012 (for example Supervisor-1 cards and MDS 9120 switches). Links for this are below. Of course your service provider may choose to offer support beyond the Cisco end of life date, but instead of updating code, maybe you should be updating hardware.
You need to also upgrade your Fabric Manager to at least the same or a higher version than your switches are running. One important thing to be aware of is that from version 5.2, Cisco Fabric Manager has been merged into a new product called Cisco Data Center Network Manager (DCNM).
If you work (or have worked) for IBM then you have probably met many old timers. IBMers who have been with the company for 25 years or more (or even 50!).
But how do you spot an old IBMer?
Is it by the cut of their suit? Not sure about that anymore.
An IBM General Systems Division marketing rep in New Jersey in 1978.
It's certainly not by their extensive beards.
Development of the 3800 printer, taken in the early 1970s by Ray Froess (http://www.froess.com/IBM/3800printer.htm)
Is it by the size of their laptop? I hope not!
IBM 5100 Portable Computer (1975)
No... you can spot them by their use of certain words and phrases.
Here are a few I can think of... you may know more. Try this out as a test on someone who you think is an old IBMer and see how they go:
1) While showing a powerpoint presentation they keep saying they are showing foils (despite having not seen an overhead projector in over 10 years).
2) They refer to disk storage as DASD (pronounced Dazz-Dee).
3) They still call a Sales Rep a Marketing Rep (check out Buck Rodgers' book The IBM Way to see why).
4) They refer to their inbox as their reader (see #6 below).
5) They refer to the IBM corporate personnel database as callup (it has been a Web based application called BluePages for around 15 years).
6) If you say I will PROFS you (or I will send you a PROFS mail), they don't blink an eyelid (PROFS was IBM's mainframe-based mail system, replaced by OfficeVision, which was replaced by Lotus Notes in the 1990s).
7) If you say you F4ed or PF4ed an email... they know what you mean (it meant that you deleted it in PROFS/OfficeVision).
8) They reveal they are a veteran of IBM Typewriters by regaling you with their knowledge of Selectric Rotate Tapes.
It is ironic that only days after I wrote that 497 is the IT number of the beast, I learn that Linux has another unfortunate number: 208.
The reason for this is a defect in the internal Linux kernel used in recent firmware levels of SVC, Storwize V7000 and Storwize V7000 Unified nodes. This defect will cause each node to reboot after 208 days of uptime. This issue exists in unfixed versions of the 6.2 and 6.3 level of firmware, so a large number of users are going to need to take some action on this (except those who are still on a 4.x, 5.x, 6.0 or 6.1 release). If you have done a code update after June 2011, then you are probably affected. This means that if you are an IBM client you need to read this alert now and determine how far you are into that 208 day period. If you are an IBMer or an IBM Business Partner, you need to make sure your clients are aware of this issue, though hopefully they have signed up for IBM My Notifications and have already been notified by e-mail.
In short what needs to happen is that you must:
Determine your current firmware level.
Check the table in the alert to determine if you are affected at all, and if so, how far you are potentially into the 208 day period.
Prior to the 208 day period finishing, either reboot your nodes (one at a time, with a decent interval between them) or install a fixed level of software (as detailed in the alert).
To give you an example of the process, my lab machine is on software version 22.214.171.124 which you can see in the screen capture below. So when I check the table in the alert, I see that version 126.96.36.199 was made available on January 24, 2012, which means the 208 day period cannot possibly end before August 19, 2012.
SAN Volume Controller and Storwize V7000 Version 6.3

| Date release was made available | Earliest possible date that a system running this release could hit the 208 day reboot |
| 30 November 2011 | 25 June 2012 |
| 24 January 2012 | 19 August 2012 |
Regardless, I need to know the uptime of my nodes, so I download the Software Upgrade Test Utility (in case you have an older copy, we need at least version 7.9) and run it using the Upgrade Wizard (NOTE! We are NOT updating anything here, just checking):
I launch the Upgrade Wizard, use it to upload the tool and follow the prompts to run it, so that I get to see the output of that tool. The output in this example shows the uptime of each node is 56 days, so I have a maximum of 152 days remaining before I have to take any action. At this point I select Cancel. You can run this tool as often as you like to keep checking uptime.
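The date arithmetic in this example is easy to repeat for your own systems. The short Python sketch below reproduces the two calculations used above: the earliest date a given release could hit the reboot, and the days left given a node's uptime. The function names are just for illustration, and the alert remains the authoritative source for release dates and what to do if a node has already passed 208 days.

```python
from datetime import date, timedelta

REBOOT_AFTER_DAYS = 208


def earliest_reboot_date(release_date):
    """A node cannot have been up for longer than its code level has existed."""
    return release_date + timedelta(days=REBOOT_AFTER_DAYS)


def days_remaining(uptime_days):
    """Days left before a node with this uptime reaches the 208 day mark."""
    return max(REBOOT_AFTER_DAYS - uptime_days, 0)


if __name__ == "__main__":
    print(earliest_reboot_date(date(2012, 1, 24)))  # 2012-08-19
    print(days_remaining(56))                       # 152
```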
Note if you are on 6.1 or 6.2 code you may see a timeout error when running the tool, especially for the first time. If you do see an error, please follow the instructions in the section titled "When running the the upgrade test utility v7.5 or later on Storwize V7000 v6.1 or v6.2" at the Test Utility download site.
As per the Alert:
If you are running a 6.0 or 6.1 level of firmware, you are not affected.
If you are running a 6.2 level of firmware, the fix level is v188.8.131.52 which is available here for Storwize V7000 and here for SVC.
If you are running a 6.3 level of firmware, the fix level is v184.108.40.206 which is available here for Storwize V7000 and here for SVC.
If you are using a Storwize V7000 Unified, the fix level is v220.127.116.11 which is available here.
You should keep checking the alert to find out any new details as they come to hand. If you are curious about Linux and 208 day bugs, try this Google search.
*** Updated April 4, 2012 with links to fix levels ***
If you have any questions or need help, please reach out to your IBM support team or leave me a comment or a tweet.
*** April 10: The IBM Web Alert has been updated with new information on what to do if your uptime has actually gone past 208 days without a reboot. In short you still need to take action. Please read the updated alert and follow the instructions given there. ***
We just updated our Cisco MDS9509s to NX-OS 4.2.7b (from Cisco SAN-OS 3.3.1c) and now we are getting emails from this source: GOLD-major.
The actual message looks like this:
Time of Event:2012-03-05 15:07:21 GMT+00:00
Message Name:GOLD-major
Message Type:diagnostic
System Name:xxxx
Contact Name:xxx@xxx.com
Contact Email:xx@xxx.com
Contact Phone:+61-3-xxxx-xxxx
Street Address:x Road, xxxx, VIC, Australia
Event Description:RMON_ALERT
WARNING(4) Falling:iso.18.104.22.168.22.214.126.1.10.18366464=2401032512 <= 4680000000:135, 4
Event Owner:ifHCOutOctets.fc4/5@w5c260a03c162
So who is GOLD-major?
GOLD actually stands for Generic OnLine Diagnostics. From Cisco's website: GOLD verifies that hardware and internal data paths are operating as designed. Boot-time diagnostics, continuous monitoring, and on-demand and scheduled tests are part of the Cisco GOLD feature set. GOLD allows rapid fault isolation and continuous system monitoring. GOLD was introduced in Cisco NX-OS Release 4.0(1). GOLD is enabled by default and Cisco do not recommend disabling it.
So in our example GOLD is actually reporting a major event (to do with exceeded thresholds, in this example utilisation on interface fc4/5).
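If these emails start piling up, the useful parts are easy to pull out with a script. The Python sketch below parses the alert text shown above to extract the direction, the sampled value, the threshold and the interface. The regular expressions are based only on this one sample message, so treat the format as an assumption (in particular, I have only seen the Falling form with `<=`; the Rising form with `>=` is a guess).

```python
import re

# Patterns inferred from the single sample message in this post (an assumption).
ALERT_RE = re.compile(
    r"(?P<direction>Rising|Falling):(?P<oid>[^=\s]+)="
    r"(?P<value>\d+)\s*(?P<op><=|>=)\s*(?P<threshold>\d+)"
)
OWNER_RE = re.compile(
    r"Event Owner:\s*(?P<counter>[^.\s]+)\.(?P<interface>[^@\s]+)@"
)


def parse_rmon_alert(text):
    """Return the key fields of an RMON_ALERT message, or None if it does not match."""
    alert = ALERT_RE.search(text)
    owner = OWNER_RE.search(text)
    if alert is None or owner is None:
        return None
    return {
        "direction": alert.group("direction"),
        "value": int(alert.group("value")),
        "threshold": int(alert.group("threshold")),
        "counter": owner.group("counter"),
        "interface": owner.group("interface"),
    }


SAMPLE = (
    "WARNING(4) Falling:iso.18.104.22.168.22.214.126.1.10.18366464="
    "2401032512 <= 4680000000:135, 4 "
    "Event Owner:ifHCOutOctets.fc4/5@w5c260a03c162"
)

if __name__ == "__main__":
    print(parse_rmon_alert(SAMPLE))
```

For the sample message this reports a Falling event on fc4/5 for the ifHCOutOctets counter (the standard 64-bit count of octets sent out of an interface), with the sampled value 2401032512 at or below the threshold of 4680000000.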
Most clients using Cisco MDS switches are now moving to NX-OS (over SAN-OS, the name Cisco used for MDS firmware between version 1 and version 3) so this question will become more common. I am working on a post that discusses recommended versions (and the sunsetting of SAN-OS), so expect something soon. If on the other hand you are thinking.... how do I set up call home on a Cisco MDS switch? The information for NX-OS is here.
Curiously my brain cannot help itself, when I hear Gold Major I think it means Gold Leader which leads me to Red Leader which leads me to Red October. Maybe it's just me? Enjoy:
Because if a product uses a 32 bit counter to record uptime, and that counter records a tick every 10 msec, then that 32-bit counter will overflow after approximately 497.1 days. This is because a 32 bit counter equates to 2^32, which equals 4,294,967,296 ticks. If a tick is counted every 10 msec, we create 8,640,000 ticks per day (100*60*60*24). So after 497.102696 days, the counter will overflow. What happens next depends on good programming: normally the counter just starts again, but worst case a function might stop working or the product might even reboot.
Fortunately we are seeing fewer and fewer of these issues but just occasionally one still slips out. Recently IBM released details of a 994 day reboot bug in the ESM code of some of their older disk enclosures (EXP100, EXP700 and EXP710). Details about this bug can be found here. What I find interesting is the number of days it takes to occur, since 994 is actually 497 times two. This suggests that this product records a tick every 20 msec. This meant we got past 497 days without an issue but hit a problem after exactly double that number. So if you still have these older storage enclosures, you will need to reboot the ESMs (after checking the alert).
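The overflow arithmetic is simple enough to check for yourself. This small Python sketch computes how many days an unsigned counter lasts for a given tick interval; the function name is just for illustration, and the 20 msec case is the guess made above about the ESM code, not a documented fact.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24


def days_until_overflow(counter_bits, tick_ms):
    """Days before an unsigned counter of this width wraps at one tick per tick_ms."""
    ticks = 2 ** counter_bits
    ticks_per_day = MS_PER_DAY / tick_ms
    return ticks / ticks_per_day


if __name__ == "__main__":
    print(round(days_until_overflow(32, 10), 6))  # 497.102696
    print(round(days_until_overflow(32, 20), 6))  # 994.205393
```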
I googled 497 to see what images that number brings up and was amazed to find the M-497 jet powered train. More details on this rather interesting attempt at speeding up the commute home can be found here and here. It adds a whole new meaning to keeping behind the yellow line.
If you have combined vSphere 5.0 with XIV, then you may want to try out the new IBM Storage Provider for VMware VASA (vSphere Storage APIs for Storage Awareness). You can download the installation instructions, the release notes and the current version of the IBM VASA provider from here. Clearly because VASA is introduced in vSphere 5.0 your VMware vCenter also needs to be on version 5.0.
Now IBM have had a vCenter plugin for a very long time (which I have written about here, here and here) and while you still need that plugin if you want to do storage volume creation and mapping from within vCenter (as opposed to using the XIV GUI), the VASA provider makes storage awareness more native to vCenter. This is a very important step. It means instead of using vendor added icons and tabs (like the IBM Storage icon and the IBM Storage tab that are added by the IBM Storage Management Console for vCenter), you just use the default vCenter tabs.
Right now version 1.1.1 of the IBM VASA provider delivers information about storage topology, capabilities, and state, as well as events and alerts to VMware. This means you will see new additional information in three tabs: Storage Views, Alarms and Events.
After installing and setting up the VASA provider, select your VMware cluster in vCenter, go to the Storage Views tab and select the view Show all SCSI Volumes (LUNs). You will see four columns with more information. The Committed, Thin Provisioned, Storage Array and Identifier on Array information (indicated with red arrows) comes straight from the XIV (hit the Update button at upper right if you are not seeing anything yet). This is really useful information as it lets you correlate the SCSI ID of a LUN to an actual volume on a source array. Here is a cut-down view of that extra information:
If you want a larger screen capture you can find one here.
The Tasks & Events and Alarms tabs will also now contain events reported by the VASA provider, such as thin provisioning threshold alerts (although if you have just installed the provider you may see nothing new, as nothing has occurred yet to provoke an alert or event).
As usual I have some handy tips on the steps you will need to take to get VASA going:
First up you will need to identify a virtual machine to run the provider on (or just create a new one). I chose to deploy a new instance of Windows 2008 from a template. Because the VASA provider communicates to vCenter via an Apache Tomcat server listening on port 8443, that port needs to be free and unblocked. This also means you should not run the VASA provider in the same instance of Windows as the vCenter server (see below for more information as to why).
Download the IBM Storage Provider for VMware VASA as per the link above (use version 1.1.1, see the user comments in this post for details about a bug in version 1.1.0).
Install the provider in the Windows VM you created in step 1. The tasks are detailed in the Installation Instructions, but it is a simple follow-your-nose application installation. As per most XIV software packages, it will install a runtime environment (xPYV which is Python) as part of the install.
Now we need to define the credentials that VMware vCenter will use to authenticate to the IBM VASA Storage Provider. These should be unique (and are not an XIV userid and password - this is only between vCenter and the provider software). In my example I use xivvasa and pa55w0rd. The truststore password is used to encrypt the username and password details (so that they are not stored in plain text). Open a Windows command prompt (make sure to right select and open it as an Administrator) and enter the following commands:
cd "C:\Program Files (x86)\IBM\IBM Storage Provider for VMware VASA\bin"
vasa_util register -u xivvasa -p pa55w0rd -t changeit
Don't close the command prompt, because we now need to define the XIV to the IBM VASA provider.
You need the IP address of your XIV and a valid user and password on the XIV that can be used to logon to the XIV. So in this example my XIV is using 10.1.60.100 and I am using the default admin username and password (which I know does not set a good example). This is the command you need to run:
If this command fails, reporting that your firmware is invalid, you are probably using the original 1.1.0 version of the VASA provider. Go back to the IBM Fix Central website and make sure you have the latest version (at least version 1.1.1). If it reports that the firmware cannot be read, make sure you are running the Command Prompt as an Administrator.
Once you have successfully added the XIV to the provider, you need to restart the Apache Tomcat service. Do this by starting the services.msc panel and looking for the Apache Tomcat IBMVASA service as pictured below. Stop it and then start it. Once you have done that you can log off from the VASA VM.
Now connect to your vSphere Client (which needs to be on at least version 5.0.0) and from the Home panel, open the Storage Providers panel. Then select the option to Add a new provider. The URL needs to include the correct port number (by default 8443), so it will look something like this (where the provider is running on 10.1.60.193). Note also that the VASA provider version number is in the URL, so if you upgrade the provider you will need to change the URL (currently v1.1.1):
The Login and password should match the user id and password you defined in step 4 (remember it is not logging into the XIV, it is logging into the VASA provider).
If you get a message saying your user id and password are wrong, you probably forgot to stop and start Apache in step 6 above. If you succeed you should see a new provider listed. Highlight the provider and select sync to update the last sync time.
Your setup tasks are now all completed. Now go and explore the panels I detailed above to see what new information you have available to your vCenter server.
Why a separate server for the VASA provider?
The IBM VASA provider uses Apache Tomcat, which by default listens on port 8443. However, since vCenter already has a service listening on port 8443, we have a clash. I googled and found the Dell and NetApp VASA providers also listen on port 8443 and they also recommend separate servers. I noted Fujitsu's provider uses a different port but still requires a separate server. So it seems if you have multiple vendors you will either have to spin up a separate server for each vendor's provider, or start playing with changing the port number. The installation instructions for the IBM VASA Provider explain how to change the default port number if you are truly keen.
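If you want to confirm the clash (or the lack of one) before you install, a quick TCP probe will do it. What follows is only a minimal sketch in Python: the function name port_is_listening is mine, and 127.0.0.1 simply means "the machine this script runs on". The only value taken from the text above is the default port 8443.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8443 is the default port named in the text; the host is an assumption
    if port_is_listening("127.0.0.1", 8443):
        print("Port 8443 is already in use on this machine")
    else:
        print("Port 8443 looks free")
```

Run it on the candidate VM before installing the provider (you want it to report the port as free). The same probe pointed at the provider VM's address after restarting Tomcat should report the port as in use.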
I always laugh when people say to me: I wouldn't know what to blog about!
When you work in pre-sales support, you constantly get asked questions and each one of them could be the subject of a new blog post. Right now the most common question I am getting is:
I am implementing VMware Site Recovery Manager (SRM). One of the components I need is a vendor-specific Storage Replication Adapter (SRA). I have searched IBM's website but cannot find them. Where are they?
So the short answer is: you get them from the VMware SRM download site. However before downloading, there is a key task that absolutely needs to be performed:
Visit the VMware vCenter Site Recovery Manager Storage Partner Compatibility Matrix. This site will confirm what products are supported by each version of SRM. You can find it here, but clearly you need to check back regularly to ensure you have the latest information.
Now find your storage device in the matrix and confirm what firmware levels are supported. This is really important. For example, the Feb 27, 2012 edition of the matrix tells me that the Storwize V7000 is supported for SRM version 5.0, but only when running Storwize V7000 firmware version 6.1 or 6.2. This is significant because if you upgrade to version 6.3 you are not supported. In fact that combination doesn't actually work yet, as detailed here. Clearly something you need to be aware of when planning firmware updates.
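If you want to build this check into your firmware upgrade planning, the logic is simple enough to script. The sketch below is illustrative only: the SUPPORTED table and the function names are hypothetical, and the single entry in it is just the Storwize V7000 example quoted above from the Feb 27, 2012 matrix. The real data must always come from the current VMware matrix.

```python
# Illustrative data only: this one entry is the example quoted in the text
# (Feb 27, 2012 edition of the matrix). Always check the current matrix.
SUPPORTED = {
    ("Storwize V7000", "5.0"): {"6.1", "6.2"},
}


def major_minor(firmware: str) -> str:
    """Reduce a version such as '6.2.0.4' to its 'major.minor' prefix."""
    return ".".join(firmware.split(".")[:2])


def is_supported(product: str, srm_version: str, firmware: str) -> bool:
    """Return True if the firmware level is listed for this product and SRM."""
    levels = SUPPORTED.get((product, srm_version), set())
    return major_minor(firmware) in levels


if __name__ == "__main__":
    for level in ("6.2.0.4", "6.3.0.1"):
        ok = is_supported("Storwize V7000", "5.0", level)
        print(level, "supported" if ok else "NOT supported")
```

An unknown product or SRM version comes back as not supported, which is the safe default when planning an upgrade.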
So where are the SRAs? On each of the pages below use the Show Details button to see what version SRAs are being shipped with that SRM (although sometimes the pages take a few days between an SRA being added and the page being updated):
There are a few more questions I routinely get asked:
Does IBM actually have an SRA download site?
The answer is yes, but it is an FTP site only for SRAs written by IBM. It is principally a repository for older SRAs and beta SRAs but you can also find the current SRAs on it. You can find the site here. Note however that it is NOT the official source. For that you need to use the VMware site.
What about the SRA for LSI/Engenio based products like the DS4800?
These used to also be found on the LSI site, but since LSI sold Engenio to NetApp, they are no longer available from the LSI or NetApp websites. You need to download the current version from the VMware sites listed above. There is a version for SRM 5 on the VMware download site.
What about nSeries SRAs?
If you need an nSeries SRA, again you should go to the VMware download pages. There are separate SRAs listed and available for IBM nSeries (as opposed to an SRA for NetApp branded filers).
What about an SRA for XIV with SRM version 5?
The answer: The SRA for XIV with SRM 5 (and 5.0.1) is now available from VMware. If you have access to download SRM, you will be able to download SRA version 2.1.0. It is the same SRA for both XIV Generation 2 and Gen3.
What about an SRA for Storwize V7000 and SVC version 6.3 code?
The answer: It is coming. We are working to make it available as soon as possible. I will update this post as soon as I have a date for you (we are talking weeks, not months).
*** Update March 23, 2012 - Added details on SRM 5.0.1 ***
Many years ago I picked up a book that literally blew my mind. It was The Cuckoo's Egg by Clifford Stoll and it's a genuine classic, a true tale of hackers and how one was tracked down in the very early days of the internet.
Now the story is about events in 1986, so it captures the state of technology at the time (which rather dates the book), but wow, what a great story.
So why mention the book? Well apart from the fact that it is well worth a read, the key issue that Clifford saw again and again was default passwords. The hacker would identify a target and then try to logon using default IDs and default passwords, usually with great success.
Now I have blogged in the past about the determined (but often ignored) way that Brocade switches berate you into changing default passwords. But pretty well all products need to do this, as they all have the same issue (and a truly problematic counter-point). You absolutely need to do two things with every product in your data center:
Change the default passwords on every device you deploy.
Record what those passwords got set to (preferably using a logical or physical password safe).
Now don't laugh, but forgotten/lost passwords on data center kit (like switches) are a VERY common problem. When I worked in the IBM Storage Support team I took calls EVERY WEEK from clients who had devices they could not log on to, for all manner of reasons. For some, supplying them with the default passwords saved them (and condemned their employer?), but for others they needed much more detailed assistance.
My preferred solution to this challenge is to use external authentication (like LDAP) but being able to reset passwords with an external tool is also a nice option to have available.
The reason I started thinking about this is a nice tool IBM offer for the Storwize V7000 called the Initialization Tool that you can download from here. Using this tool you can reset the password of the Superuser ID on a Storwize V7000 back to the default (passw0rd). The tool runs on a USB key. After requesting the tool to help you to reset the superuser password, you insert the USB key into the Storwize V7000, wait for the orange indicator light on the relevant node canister to stop blinking and the task is complete. Then put the USB key back into your laptop and run the init tool again to get a completion report that should look like this:
This is great to rescue customers who have lost their passwords, but the question then gets raised: Can I block this?
My first response is: if you are concerned about unauthorized people with malicious intent placing USB keys into your Storwize V7000, then don't let them into your computer room (presuming you can spot them by the colour of the hat they are wearing). If that is not an option, lock the rack that the Storwize V7000 resides in (change control does have its benefits). If that is not an option, there is one more alternative, but it is a tad extreme.
What we can do is prevent password reset via USB key (or in the case of the SVC, via the front panel). We do this by issuing the following CLI command: setpwdreset -disable
In the following example, I confirm that password reset is possible (value 1), I then disable it and confirm that password reset is no longer possible (value 0). If curious I could then get some help on that command:
Only if your paranoia is matched by your attention to detail.
My reason for hesitating to recommend it is simple: If you prevent password reset and then forget your password (and have no other local Security Administrator accounts), you have locked the door and thrown away the key. Far better to physically lock the rack.
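If you do decide to go down this path, the check, disable and check again sequence described above could be scripted along these lines. Treat it as a rough sketch under some loud assumptions: only setpwdreset -disable appears in this post, so the -show flag and the "Password status: [1]" output format are my assumptions and should be verified against the CLI reference for your code level. The run argument is a hypothetical callable that sends one CLI command to the system (over SSH, for example) and returns the text that comes back.

```python
import re
from typing import Callable


def password_reset_enabled(show_output: str) -> bool:
    """Parse the status value (1 = reset possible, 0 = reset blocked).

    Assumes output shaped like 'Password status: [1]'.
    """
    match = re.search(r"\[([01])\]", show_output)
    if match is None:
        raise ValueError("unexpected output: %r" % show_output)
    return match.group(1) == "1"


def disable_password_reset(run: Callable[[str], str]) -> bool:
    """Check, disable if needed, then check again.

    Returns True if password reset ends up disabled.
    'setpwdreset -show' is an assumed form of the status query.
    """
    if password_reset_enabled(run("setpwdreset -show")):
        run("setpwdreset -disable")
    return not password_reset_enabled(run("setpwdreset -show"))
```

Passing the command runner in as a function keeps the logic testable without touching a real system, which seems wise for a command that can lock you out.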
In the end though, your company needs to set a policy that is actively enforced (with no exceptions). So get to it.
The updated XIV GUI that supports version 11.1 of the XIV software (which adds support for SSD Read Cache) is now available for download. This brings the XIV GUI to version 3.1 and you can download it for Windows, Mac, Linux, Solaris, AIX and HP-UX from here.
So what benefits will you get?
The new GUI will display information about the SSD read cache. For instance the statistics panel will now also report on SSD cache hits (as opposed to memory cache hits). The GUI will also display the presence and health of the SSD in each module (presuming they have been ordered for that machine). You can clearly see that it is located at the rear of the module!
It supports the IPv6 protocol. So if your XIV system has code level 11.1.0 or above, you can manage that XIV over an IPv6 connection (after using the updated GUI to define the new addresses).
The GUI can now manage up to 81 systems from a single console. Yes you read that right: 81 systems. So let's think about that: IBM would only take the GUI to that number if there were clients who were approaching that number. Outstanding!
Enhanced search and filtering. Allows you to search across all managed devices and also filter what gets displayed. The search function is really nice. You get to it from the View menu as shown: In this example I search for the term test and get a considerable number of hits. If you notice the first column uses some very nice icons to indicate the resource type (such as a volume, pool or host cluster):
The GUI now displays un-mapped LUNs as a separate category. This is also a very nice enhancement.
One other change is that if you start the XIV GUI in demo mode it now also displays an XIV Gen3 (so you can see the Gen3 patch panel).
If you are running Generation 2 XIVs (on 10.x.x code) you will benefit from those last three improvements so there is something for everyone.