I recently had a client ask me if I had seen this problem in Cisco Device Manager: Device Manager was showing them 100% utilisation for CPU on one of their MDS9509s. I had a look at the show tech-support output and curiously show process cpu showed practically no CPU usage at all. I suggested a display problem and sure enough, Cisco confirmed it:
Symptom: The show system resources command shows high CPU usage even when there is not
much activity on the switch. In one instance, the CPU utility (user and kernel)
was always 100 percent.
Conditions: You might see this symptom 248 days after the system came up
Curiously the Cisco tech support person stated that in fact a CP switchover every 497 days would prevent the issue reoccurring. This is curious because 248 days is close to half of 497 days. And 497 is IT's number of the beast.
The reason that 497 is a problem number is because of the use of a 32 bit counter to record uptime. If you record a tick for every 10 msec of uptime, then a 32-bit counter will overflow after approximately 497.1 days. This is because a 32 bit counter equates to 2^32, which can count 4,294,967,296 ticks. Because a tick is counted every 10 msec, we create 8,640,000 ticks per day (100*60*60*24). So after 497.102696 days, the counter will overflow. What happens next depends on good programming.
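The arithmetic above can be reproduced with a short Python sketch. The 10 msec tick and the 32-bit counter come from the text; the signed-counter case is only my assumption about why 248 days (roughly half of 497) might show up, and is not something Cisco confirmed.

```python
# Uptime counter overflow, assuming one tick every 10 msec (from the text).
TICK_MS = 10
SECONDS_PER_DAY = 24 * 60 * 60
TICKS_PER_DAY = SECONDS_PER_DAY * 1000 // TICK_MS  # 8,640,000 ticks per day


def days_until_overflow(counter_bits: int, signed: bool = False) -> float:
    """Days of uptime before a tick counter of the given width overflows.

    A signed counter loses one bit to the sign, so it wraps at half the range.
    (The signed case is an assumption used to illustrate the 248 day figure.)
    """
    usable_bits = counter_bits - 1 if signed else counter_bits
    return (2 ** usable_bits) / TICKS_PER_DAY


unsigned_days = days_until_overflow(32)
signed_days = days_until_overflow(32, signed=True)

print(f"unsigned 32-bit counter: {unsigned_days:.6f} days")
print(f"signed 32-bit counter:   {signed_days:.6f} days")
```

If the signed assumption holds, the two numbers are simply the same counter read two different ways.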
Some classic bugs can be found here, here, here and here. Most of these bugs are old and will almost certainly not affect anybody. But remain on notice: 497 day bugs are still possible. Just Google the search argument: 497.1 day bug.
Now let me be clear: I am not aware of any active disruptive, bring-down-your-business type 497 day bugs. The sky is not falling. But historically many vendors' products have had 497 day bugs, some of them nasty. I ponder whether we should schedule a switch reboot every 496 days just to avoid the possibility of a 497 day bug. It's an interesting idea. I certainly endorse staggering initial switch reboots by at least an hour, so that a simultaneous 497 day reboot bug (should one be lurking) would not reboot every switch in every fabric at the same time. And in case you think I am picking on Cisco, when I looked at the client switch in question, it was showing a kernel uptime of 562 days, 23 hours, 35 minutes, 24 seconds. That's some solid uptime.
Back from a short break (for Easter and the School Holidays) to three great pieces of news:
- A new series of Doctor Who is screening.
- Will and Kate's wedding went off without a hitch (it's not often I get to yell at the dog to stop barking at possums because there is a Royal Wedding on).
- The VAAI driver for XIV has an official download link.
Ok... maybe the Royal Wedding has no place in my blog, but the VAAI link is very appreciated.
Two ways to get to the driver:
- Get it directly from here
- Go to Fix Central and select it from the download list: http://www-933.ibm.com/support/fixcentral/
Remember, your XIV needs to be on 10.2.4a firmware, so you need to be talking to your IBM Service Representative to schedule a concurrent firmware update before you turn the VAAI functions on.
Now if you're going, um... what is VAAI and how does it help? Check this blog post out:
If you're asking, hey what else will 10.2.4a code bring me?
- How about better write performance?
- How about QoS?
- 10.2.4a code also brings the ability to do 'truck' initialization of an async pair (which lets you pre-load an async secondary for faster initial mirroring, or to convert from sync to async without re-mirroring all your data).
- It also lets you format a snapshot, which means you can keep a snapshot in place and mapped to a host, but it will not consume any space.
Last week IBM released Version 2 of the management plug-in for VMware vCenter. The main benefit of Version 1 (the previous release) was that it allowed you to map your datastores to XIV volumes (i.e. which XIV volume equates to which VMware datastore). This was very handy (especially if you were not paying attention as you allocated volumes to your VMware farm), but you still needed the (very easy to use) XIV GUI as well as (obviously) vCenter to manage your landscape end to end.
With the release of Version 2 of the XIV plug-in, we suddenly have the tantalizing possibility that the VMware administrator will not need to talk to their storage administrator or turn to the XIV GUI for day to day operations.
Well Version 2 offers a new and improved graphical user interface (GUI), as well as brand new and powerful management features and capabilities, including:
- Full control over XIV‐based storage volumes (LUNs), including volume creation, resizing, renaming, migration to a different storage pool, mapping, unmapping, and deletion.
- Easy and integrated allocation of volumes to VMware datastores, used by virtual machines that run on ESX hosts or datacenters.
- The ability to monitor capacity, snapshots and replication.
So from vCenter you can now for instance map yourself some new volumes to create data stores, or re-size existing ones. You can also confirm that each of your datastores is being mirrored.
You can get the plug-in free of charge from here:
There is a users guide here. I urge you to download it and have a read. The Users Guide contains lots of really good examples of how the plug-in can be used with some great screen captures. The release notes are here and also make for very good reading.
I honestly think every VMware installation should be using this plug-in. But I am curious about how it will affect the responsibility divide. If you're a one-person shop, the chances are that you love your XIV quite simply because you don't need to administer it. The XIV leaves you free to focus on your VMware farm, rather than fret about hot spots or hot spares or RAID groups. For you, this plug-in just makes your life even easier.
But what about larger companies? Firstly, it's important to understand that to perform storage administration, the vCenter plug-in will need an XIV userid that has Storage Admin privileges. Why is this significant? Well what if the team who manage the XIV and the team who manage VMware are not the same people? What if they are different teams, who maybe have different managers, who may work in different buildings or different cities? What if they work for different companies? Do plug-ins like this one erode the lines and bring these teams together? Or are the functional divides still too strong?
I would love to hear your experiences, both in using the plug-in.... and tearing down the walls.
For someone who blogs so frequently about the IBM XIV, I will let you in on a little pet hate of mine: The XIV uses decimal volume sizes.
The XIV GUI and CLI have the user create volumes using decimal sizing, meaning 1 GB = 1,000,000,000 bytes (1000 to the power of three).
Nearly every host system out there (i.e. Windows, AIX, Linux, VMware, Solaris) displays volume sizes in binary, meaning 1 GiB = 1,073,741,824 bytes (1024 to the power of three).
This disparity has a quirky consequence. If the XIV says a volume is 17 GB, the host that uses that volume says it is 16 GiB (which the host often then mis-states as GB). This doesn't mean there is a loss of space, this isn't headroom or formatting. It's just a different way of counting bytes. It's not a road block and it's easy to understand and work with. But it is a little annoying. (Then again, so is my 32 GB iPhone reporting it has 29.3 GB of space).
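For anyone who wants to check the two ways of counting, here is a small Python sketch of the conversion. The function names are my own. Note that exactly 17,000,000,000 bytes is a touch under 16 GiB, while exactly 16 GiB is about 17.18 decimal GB, so the 17 GB figure is presumably a rounded display value (that reading is an assumption on my part).

```python
# Decimal (GB) versus binary (GiB) sizing, as defined in the text.
GB = 1000 ** 3   # 1,000,000,000 bytes
GIB = 1024 ** 3  # 1,073,741,824 bytes


def gb_to_gib(size_gb: float) -> float:
    """Convert a decimal gigabyte figure to binary gibibytes."""
    return size_gb * GB / GIB


def gib_to_gb(size_gib: float) -> float:
    """Convert a binary gibibyte figure to decimal gigabytes."""
    return size_gib * GIB / GB


print(round(gb_to_gib(17), 2))  # 15.83
print(round(gib_to_gb(16), 2))  # 17.18
```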
The other point is that the IBM SVC, Storwize V7000, DS8000 and DS3000/DS4000/DS5000 families have always used binary sizing (even if their respective interfaces use the term GB as opposed to GiB - yet another pet hate of mine and the Storage Buddhist).
So what's the point of this rant?
The IBM XIV Storage System management GUI (Version 3.0) will support the creation of volumes in Gigabytes (GB), Gibibytes (GiB) or Blocks (where each block is 512 bytes).
So this is a really good change.
The new GUI has not hit the download site yet... but I will be sure to tell you as soon as it has!
*** Update 08/09/2011 - corrected GUI version from 2.5 to 3.0, removed some confusing terms ***
I have some great news regarding VAAI support for XIV.
Let me detail the current situation:
- VMware has approved the IBM driver for VAAI and we can now release it to the public. The IBM_VAAIP_MODULE plugin will be available shortly from the ibm.com website. When the release URL is available I will update this post. In the meantime you can get the driver from your XIV TA (Technical Assistant) or IBM Account Team. If they have somehow missed the news, get them to talk to their XIV Product Manager (or they can always talk to me!).
- The VMware Hardware Compatibility guide found here shows that VMware support the three VAAI primitives with XIV, if you are using ESX 4.1 or ESX 4.1 U1 and your XIV is on firmware release 10.2.4 or higher.
- XIV firmware release 10.2.4a is available for install. Installation of this firmware is non-disruptive (concurrent) and will be performed by IBM.
- The VAAI driver and installation of the 10.2.4a code are all supplied free of charge.
So what should your plan be?
- Ensure VAAI is disabled on your ESX hosts.
- Talk to your local XIV TA or IBM Service Representative (SSR) and arrange to have 10.2.4a firmware installed.
- When 10.2.4a code is installed, you can then begin installing the VAAI driver on each of your ESX 4.1 servers. You will need to reboot each server to install the driver.
A question I get routinely asked relates to Windows disk partition alignment with XIV. If you don't know what I am talking about, take some time to read these very useful pages from our friends at Microsoft. Once you have had a look, come on back and read my perspective.
Disk Partition Alignment (Sector Alignment): Make the Case: Save Hundreds of Thousands of Dollars
Back already? Hopefully you now know that disk partition alignment is all about starting an IO at a logical block address that best matches how the underlying hardware stores your data. So now you're wondering, what does this have to do with XIV? Well XIV has two concepts that relate to this: cache and partitions.
XIV cache (the server memory used to speed up reads and writes) is organised into 4 KB blocks (which is nice and small).
So the XIV cache does not care about disk alignment.
But when it comes to writing to and reading from disk, the XIV writes data into chunks of consecutive logical block addresses (LBAs) that we call partitions. These partitions are 1 MiB in size. What does that concept mean? It means the magical number for XIV is 1024 KB or 1 MB (actually KiB and MiB, but for the sake of ease, I will stick to the naming used by Microsoft). Given this number is fairly large (other hardware often aligns to 32 KB, 64 KB or 256 KB), for XIV this reduces the potential impact of misaligned partitions. Which is good.
Correct Windows Disk Alignment could give up to a 7% performance improvement when using an offset of 1024 KiB (1 MiB). I need to be clear, that's not a guaranteed improvement of 7%. It's a maximum possible improvement. Your particular server will see an improvement somewhere between 0% and 7%. It depends on your workload patterns. The more small and random your workload, the more useful setting the 1024 KB offset will be. The more sequential your workload, the less useful it will be, as only the first and last parts of an I/O could potentially be misaligned. This misalignment could equate to a tiny percentage of extra work for the XIV. Sadly there is no metric you can display to detect how much impact misalignment is actually having.
So should you do it? The good news is that new volumes created under Windows 2008 prefer the 1 MB boundary. So a fresh install should already be using the correct values. The bad news is that volumes created under earlier Windows Operating Systems (such as Windows 2000 and 2003) will almost certainly be misaligned, and correcting the alignment is destructive to the data in the partition.
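To picture why misalignment only touches some IOs, here is a toy Python sketch that counts how many fixed-size IOs straddle a 1 MiB partition boundary for a given starting offset. The 1 MiB partition size comes from the text. The 63-sector legacy offset, the 64 KiB IO size and the function name are assumptions I picked for illustration, and this counts boundary crossings only; it does not model XIV performance.

```python
PARTITION_BYTES = 1024 * 1024  # XIV partition size from the text (1 MiB)


def fraction_straddling(offset_bytes: int, io_bytes: int, io_count: int = 4096) -> float:
    """Fraction of back-to-back IOs that cross a 1 MiB partition boundary.

    IOs are assumed to be laid out contiguously from the start of the
    Windows partition, which itself begins at offset_bytes on the volume.
    """
    straddles = 0
    for k in range(io_count):
        start = offset_bytes + k * io_bytes
        end = start + io_bytes - 1
        # An IO straddles if its first and last byte fall in different partitions.
        if start // PARTITION_BYTES != end // PARTITION_BYTES:
            straddles += 1
    return straddles / io_count


legacy_offset = 63 * 512       # assumed legacy offset of 63 sectors (31.5 KiB)
aligned_offset = 1024 * 1024   # the 1024 KB offset recommended above

print(fraction_straddling(legacy_offset, 64 * 1024))   # 0.0625
print(fraction_straddling(aligned_offset, 64 * 1024))  # 0.0
```

In this toy case one IO in sixteen touches two partitions when misaligned, and none do when aligned.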
How to check alignment at the host? Here is an example:
I start diskpart:
Microsoft DiskPart version 6.1.7600
Copyright (C) 1999-2008 Microsoft Corporation.
On computer: ANTHONYV-PC
I list my disks. In this example I have two disks installed in my laptop. I select disk 0:
DISKPART> list disk
Disk ###  Status         Size     Free     Dyn  Gpt
--------  -------------  -------  -------  ---  ---
Disk 0    Online          238 GB  5724 MB
Disk 1    Online          232 GB  1024 KB
DISKPART> select disk 0
Disk 0 is now the selected disk.
Now I list the partitions and see the offset for each one.
DISKPART> list partition
Partition ###  Type              Size     Offset
-------------  ----------------  -------  -------
Partition 1    Primary            100 MB  1024 KB
Partition 2    Primary            232 GB   101 MB
Partition 1 has an offset of 1024 KB, which is 1 MB, which is perfect for XIV. Partition 2 has an offset of 101 MB, which is still on the 1 MB boundary (it was pushed there by the combination of the size of the first partition (100 MB) and its offset (1 MB)). So this is perfect.
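The check I just did by eye can be sketched in Python. The helper names are hypothetical, and the units follow the Microsoft convention used above (KB meaning 1024 bytes). Keep in mind that diskpart displays rounded values, so treat this as a first-pass check rather than proof of alignment.

```python
# Units as diskpart displays them (binary, despite the KB/MB/GB labels).
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def offset_to_bytes(text: str) -> int:
    """Turn a diskpart-style offset such as '1024 KB' into bytes."""
    value, unit = text.split()
    return int(value) * UNITS[unit.upper()]


def is_aligned(offset_text: str, boundary: int = 1024 * 1024) -> bool:
    """True if the offset sits on the given boundary (default 1 MiB for XIV)."""
    return offset_to_bytes(offset_text) % boundary == 0


# The two offsets from the listing above, plus a hypothetical misaligned one.
for offset in ("1024 KB", "101 MB", "32 KB"):
    print(offset, is_aligned(offset))
```

Passing boundary=64 * 1024 would run the same check against a 64 KB boundary.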
For an example of how to create a partition with the correct offset, check out this how-to document, that also provides some good follow on reading:
What about other IBM products?
The IBM SVC and Storwize V7000 prefer 64 KB (or larger) offsets as documented here:
Why? Because the SVC and Storwize V7000 use a concept of grains, where each grain is usually 64KB or 256KB in size.
The DS8000 (regardless of model) also prefers 64 KB offsets. The DS8000 uses the concept of logical tracks where each logical track is 64 KB.
The DS3000/DS4000/DS5000 range allows the user to set the segment size of a logical volume on creation. The setting that you define should match the segment size defined for the logical drive being used. In the example below, it is 64 KB.
What about VMware?
The answers are no different. Misalignment can indeed make a difference to client performance. Check this link from NetApp and this document from VMware:
For an EMC perspective, check out this link from someone I respect a great deal, Chad Sakac:
I searched around looking for an image to highlight the theme of alignment. I found this image in the IBM archives for the IBM Mass Storage Facility announced back in 1974. I am sure this product had some interesting alignment challenges.
(edited 24/5/2011 --> removed old Visio Stencils link).
VisioCafe has been updated with IBM's latest official stencils for use with Microsoft Visio. These include all models of the Storwize V7000, including the newest models: The 2076-312 and 2076-324 (which have the dual port 10 Gbps iSCSI card).
Here is the link to VisioCafe. The Storwize V7000 stencils are in both the IBM-Disk as well as the IBM-Full packages.
Remember you can also find my XIV stencils here:
Requests for Visio stencils are one of the most common comments I receive.
More are coming so your requests are being heard!
Over on my Wordpress blog, I have posted an entry on migrating a Linux RHEL host from EMC to XIV.
If that subject interests you, check out my article here:
The XIV 10.2.4 release notes report performance improvements that are worth investigating. Two of the reported improvements listed are:
- Improved write hit performance with small blocks
- Improved write caching performance
I visited a client running 10.2.4 to see if these could be detected in the XIV performance statistics. In this client's case, the upgrade occurred on Feb 14. First up I wanted to show that in the period I am examining there was no major variation in write IO. In other words, before and after the code load, I wanted to confirm the client performed the same level of IO.
Having confirmed that the write IOPS did not vary over the period in question, did the latency change? Here we have some good news. Firstly the latency for Write Hits improved (slightly). A write hit is a write into a 1MB partition that already has some data in cache. It is faster than a write miss because some of the address allocation work has already been done. Write hits and misses both hit cache as I explained here. You can see a change on Feb 14 (when the code was updated):
I then looked at the latency for write misses. Again the latency dropped. This suggests that cache operations in general are being handled faster.
I then started thinking... are we getting more write cache hits? The answer was YES! This is curious because the client normally does not have much control over where they actually write data to... Clearly the XIV firmware is managing the write cache in a more efficient manner. This is good not only because write hits normally have lower latency than write misses, but also because a write hit can save us destaging a block of data to disk. This is because a write hit could involve over-writing data that had not yet been destaged to disk. So two writes to the same LBA would only result in one write to backend disk.
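The over-write case can be pictured with a toy Python model. To be clear, this is not how the XIV firmware works internally. It is a minimal sketch that assumes every write lands in cache and nothing is destaged until the end, and it only counts over-writes of the same LBA as hits.

```python
def destages_needed(write_lbas):
    """Toy write-back cache: return (backend_writes, overwrite_hits).

    Assumes nothing is destaged until all writes have arrived, so a second
    write to an LBA that is still dirty in cache replaces the first one.
    """
    dirty = set()
    overwrite_hits = 0
    for lba in write_lbas:
        if lba in dirty:
            overwrite_hits += 1  # data not yet destaged gets over-written in cache
        dirty.add(lba)
    return len(dirty), overwrite_hits


# Three host writes, two of them to the same LBA.
print(destages_needed([100, 200, 100]))  # (2, 1)
```

Three host writes turn into two backend writes, which is the saving described above.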
So in conclusion, the upgrade to 10.2.4 code resulted in a measurable improvement in write IO performance at a real world client. Nice!
It's easy to make a fool of yourself.
It's not hard to do.
All you need is a moment of inattention combined with a massive assumption. In fact assumptions can bring you undone at any time. A former manager of mine introduced me to the saying: To assume is to make an ass of you and me.
So what was the assumption this time?
One of our business partners sold a client two new XIVs and 4 new IBM SAN40Bs (40 port fibre channel switches). So far so good. When you order the SAN switches you have a choice of ordering 4 Gbps capable SFPs (SFPs are the fibre optic sub assemblies that you plug your cables into) or 8 Gbps capable SFPs. There was a time when the 8 Gbps SFPs were much more expensive than the 4 Gbps, but today they are about 75% of the price of the 4 Gbps. So it makes sense to buy the faster SFPs. But you need to ensure that all the HBAs at the client site are at least 2 Gbps capable, because 8 Gbps SFPs are tri-rate and can only go at 2, 4 or 8 Gbps. Sure enough an assumption was made that this was not an issue... but it was. The client has WDMs that run at 1 Gbps and upgrading those WDMs would be a significant expense.
So I got to thinking... could I force the SFP to 1 Gbps?
If I display the 8 Gbps SFP it reports it is capable of 200, 400, 800 MBps which is code for 2, 4 or 8 Gbps.
But maybe I could force it to 1 Gbps?
Sadly all I did was break the port. A port in Mod_Inv status means the SFP is in an invalid state. This is not going to work.
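The mismatch boils down to a set intersection, which can be sketched in Python. The 200, 400, 800 MBps figures for the 8 Gbps SFP come from the switch output described above. The 100, 200, 400 MBps figures for a 4 Gbps SFP are my assumption about what such an SFP would report, and the function name is hypothetical.

```python
def common_speeds(sfp_mbps, device_gbps):
    """Link speeds (in Gbps) that both the SFP and the attached device support.

    The switch reports SFP speeds in MBps, where 100 MBps stands for 1 Gbps.
    """
    sfp_gbps = {mbps // 100 for mbps in sfp_mbps}
    return sorted(sfp_gbps & set(device_gbps))


SFP_8G = [200, 400, 800]  # as reported by the switch (2, 4 or 8 Gbps)
SFP_4G = [100, 200, 400]  # assumed: a 4 Gbps SFP that also supports 1 Gbps
WDM = [1]                 # the client's WDMs run at 1 Gbps

print(common_speeds(SFP_8G, WDM))  # []
print(common_speeds(SFP_4G, WDM))  # [1]
```

An empty list means there is no speed the two ends can agree on, no matter what you set on the port.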
So what to do? We could not just move the old SFPs into the new switch, as the new 8 Gbps capable Brocade switches only accept Brocade approved SFPs. The only solution was to make it right and swap four of the Brocade 8 Gbps SFPs with Brocade 4 Gbps SFPs. Fortunately as we needed only four, I was able to swap them with little expense or hassle (I contacted our local Brocade rep who happily helped us out).
The end point was a happy client and a lesson re-learnt..... 1 into 8 does not go.
I am curious though... is there much 1 Gbps gear still out there? Is this a common issue?
Over on Wordpress, I have just published an article on SNMP and XIV.
Given some funky formatting, I have decided not to paste it into this blog.
If you're interested in monitoring an XIV with SNMP, please head over to here:
A friend of mine sent me a direct message on Twitter that pointed out something interesting.... A blog post I had written on SDDPCM had been copied word for word by another site. A little bit of googling revealed that in fact it had been picked up by two sites. Here is the original, and the copies are here and here.
What bothered me was not that the content was copied without any obvious (well, obvious to me) attempt to acknowledge the original author. In fact in both cases, the copied text included a link to another blog entry I had written, so an alert reader would pick up that the content had come from someone else (still... a little acknowledgement doesn't hurt). To begin with, I was also not concerned with the re-use of my work. After all, I am writing this to be helpful, so if you think something I have written is helpful... and you spread the word... that work is even more helpful (but hey that's what Twitter is for... right?). But then it occurred to me... by copying the article without a link back to the original source (mine), if I find a mistake and I update my blog post, those corrections will not flow to the clones. So this potentially undermines my efforts to be helpful.
I also noticed that in each case, the clones had advertisements by Google. Does this mean Google and/or these other bloggers are actually making money from copying my content? Hmmm... acknowledgement is one thing... a cheque is even nicer.
Or am I reading too much into this?
Still... message to Anthony... if you push content into the public domain you have to be prepared for this.
After tweeting about this, I did learn it is possible to insert sentences into your content that you could then monitor for with Google Alerts. I don't plan to do this myself, but it's certainly worth being aware of. This of course also presumes the cloners don't detect these sentences and delete them.
I am very curious to know of similar experiences. Has this happened to you? Did you do anything about it? Were you happy with the result?
When IBM first released the Storwize V7000, we announced it was capable of supporting ten enclosures, but would on initial release support only five. We stated that this restriction would be lifted in Q1.
The good news is that this restriction is indeed now lifted by the release of Storwize V7000 software version 184.108.40.206, which is available for download from here:
You should also check out this link:
Storwize V7000 6.1.0 Configuration Limits and Restrictions
This new level also contains an additional enhancement which I think users will really like, called Critical Fix Notification. The new Critical Fix Notification function enables IBM to warn Storwize V7000 and SVC users if we discover a critical issue in the level of code that they are using. The system will warn users when they log on to the GUI using an internet connected web browser. It works only if the browser being used to connect to the Storwize V7000 or SVC, also has access to the Internet. (The Storwize V7000 and SVC systems themselves do not need to be connected to the Internet.) The function cannot be disabled (which is a good thing) and each time we display a warning, it must be acknowledged (with the option to not warn the user again for that issue).
As I blogged previously, VAAI support for XIV has two dependencies:
- 10.2.4a code
- VMware certified driver
Both of these things are very close to release....
In the meantime I have had the chance to demonstrate the uncertified VAAI driver with XIV 10.2.4 code, just to see what effect it has.
And what is the effect?
VAAI dramatically reduces the amount of work that the vSphere 4.1 server needs to do to get things done.
The XIV implementation of VAAI provides the three fundamentals of VAAI:
- Full clone, copying data from one logical unit (LUN based) to another without writing to the ESX server.
- Block Zeroing, assigning zeros to large storage areas without actually sending the zeros to the storage system.
- Hardware Assisted locking, locking a particular range of blocks in a shared logical unit (providing exclusive access to these blocks), instead of using SCSI reservation that locks the entire logical unit.
To test VAAI with XIV, I did two things: a VMDK migration (a Storage vMotion) and VMDK cloning. I used the vSphere client to time how long the operation took and XIV Top to see how much IO was being generated by the vSphere server. Now please understand, these numbers and timings are based on a lab environment. The speed and peaks will vary from client to client and install to install.
Firstly the migration: I performed a migration of a VMDK from one data store to another. The migration without VAAI took 42 seconds as can be seen from the screen capture below:
The migration generated a peak of 135 MBps of traffic being written to the target volume as can be seen from XIV Top:
I then turned on VAAI and did the same migration. I won't document the process to install the VAAI driver, as it will be different when the certified version is released. However after the driver was installed, I could turn VAAI on and off by toggling these settings from 0 to 1 and back again:
I then did another VMDK migration with VAAI enabled. This time the migration took 19 seconds (as opposed to 42 seconds), so an immediate improvement occurred.
When I checked XIV Top, there was no IO at all! In other words the vMotion was done with no apparent load on the vSphere HBAs or the SAN. I feel silly showing this screen capture, but this is what I saw.... nothing.
I then did a VMDK clone. The Data store was on XIV, VAAI was not enabled. There was no other IO running on the ESX server. The clone took 40 seconds (as reported by vCenter):
The clone generated a peak of 230 MBps for around 50 seconds (as reported by XIV Top)
I then activated VAAI again and repeated the clone. Now the clone took 15 seconds (as reported by vCenter), so that's 25 seconds faster (more than 50%).
The clone generated a peak of 2 MBps for around 20 seconds (as reported by XIV Top). Almost no fibre channel IO was thus generated by the clone.
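For the record, the time savings work out as follows. This is just the arithmetic on the lab timings quoted above (42 versus 19 seconds for the migration, 40 versus 15 seconds for the clone); the function name is my own.

```python
def percent_faster(before_seconds: float, after_seconds: float) -> float:
    """Reduction in elapsed time, as a percentage of the original time."""
    return (before_seconds - after_seconds) / before_seconds * 100


migration_saving = percent_faster(42, 19)
clone_saving = percent_faster(40, 15)

print(f"migration: {migration_saving:.1f}% less time")  # 54.8%
print(f"clone:     {clone_saving:.1f}% less time")      # 62.5%
```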
As I have blogged before, I will be repeating this whole exercise once I have real live customers running this configuration, so expect further updates.
Things have been pretty revolting lately, and I am not talking about Tunisia or Egypt or Libya (though actually they could equally apply to my story).
What I am talking about is mother nature, and she is pretty angry with us right now.
In the last few months Australia and New Zealand have seen massive floods in Queensland, Victoria and Western Australia, destructive cyclones hitting Queensland and Western Australia, ferocious bush fires in Western Australia and most recently, a massive earthquake in New Zealand.
The personal loss of life and of property has been shocking and tragic. Each of these events has reminded me how quickly everything we hold dear can be taken away in an instant... by an event over which you have no control.
Which leads me to storage clouds....
If something can be stored electronically, then it can be stored in a cloud. A cloud that is hopefully well backed up, and far away from your own personal location. And no this is not an advertisement... it's a suggestion....
Given the events of the last few months, I have started using a storage cloud provider to protect my photos, my music and my insurance information.
I looked for cloud storage providers who:
- Offered a tool that, when installed on my laptop/PC, automatically backs up the contents of selected folders. This means I don't have to remember to back up. It should happen automatically.
- Offered a way of accessing the backed up data from anywhere.
- Were reasonably priced.
I considered the following uses:
- Backup all my photos.
- Backup all my music.
- Digitize my insurance documents and back them up. Scan in all my receipts and some photos of the contents of each room of the house. That way if the house burns down... I have a base to work off.
- Scan in important documents that I could not easily replace.
Let me give you an example of a document I would never want to have to replace....
My son is practicing to get his driver's license. In Victoria you need 120 hours of driving experience recorded in a log book. This log book needs to be filled in every time he drives the car. If the log book is lost... those 120 hours would need to be driven again. I cannot tell you how hard it is to find 120 hours of driving opportunities (and I heartily support the 120 hours scheme!). Even if you did feel inclined to create fake entries to recreate the book (which is illegal), frankly creating 120 hours of fake driving log entries would be very hard work. To make things worse... where am I storing this booklet? In the car of course (which is the most convenient place to store it). So what happens if the car is stolen? There goes the logbook.... So the plan I work on is that every time a page is filled up, I scan that page as an image stored on my laptop. The image goes into a folder that is automatically backed up to the cloud. Yes it does depend on my being diligent, but the actual process of copying the file somewhere else is automatic. Now I have 3 copies... the original, the scanned image on my laptop and a third (automatically created) copy way off in the cloud somewhere.
As for personal recommendations:
1) Get 2 GB free on Dropbox. This is a great point solution and a great way to dip your toes in.
2) Get 1 GB free on Google Docs. This is a great tool to share files with others.
3) Try 15 days free on Carbonite. These guys look like good value for money.
Are there others? Yes there are... Mozy is one I have seen recommended. There is also Amazon S3. I am sure there are plenty more....
Have there been issues with storage cloud providers? A quick search reveals stories like: Flickr deleted a user's data and Carbonite lost data due to hardware failure. Still... I have no plans to store my ONLY copy of data in the cloud. For me it's a backup medium... not a primary storage location.
Are you convinced?
Are you already using the cloud?
Or are you thinking it's too expensive or too insecure?
Better still, have you already been saved by the cloud?
Oh... and my son? He is on 89 driving hours... 31 to go....