
Comments (2)

1 localhost commented Trackback

Hi Barry,

Wow - that is quite a magnum opus, but well done! I have to confess that I'm still in the process of digesting it... very thought-provoking, though. It's a good set of criteria for examining storage virtualization - I personally might add "native performance" (which is a loaded term, I know!) as a possible cornerstone, but that's one we can debate at a later time.

I was pondering cornerstone 3 and your discussion of it.

First, the assertion that: "Such services can reside below a cache, but require a preparation step where the cache is flushed for those disks - thus ensuring you have a consistent image of what is about to be 'flashed'."

I wasn't clear whether you were talking about the array itself or the virtualization engine. The assertion is true if the I/O terminates at the virtualization engine's cache from the viewpoint of the host operating system; then the disk copy has to be made consistent by flushing that cache before a copy operation. Invista does not terminate the I/O - the virtualized array does - so whatever mechanism the array uses to make its data consistent is the relevant one. This interesting facet of out-of-band split-path architectures makes it possible, for example, to run SRDF on volumes under Invista control. So here the array's brains don't quite get blown out!

If it's the array you were referring to, there are other methods to create a dependent-write-consistent copy of data on a Symmetrix (a "TimeFinder consistent split") without flushing the Symmetrix cache, because the internal BCVs are full real-time mirrors when established, not copies. This lets the microcode "hold" I/Os before a split, ensuring the creation of a dependent-write-consistent "crashed" image, which is a fully restartable copy of the data.

To be fair, there is some background copy activity after the split, but the volumes are available and addressable right after the split. I thought FlashCopy had a similar mechanism, but I could be mistaken, as I believe FlashCopy is implemented as a copy, not a mirror. I know HDS has a mechanism to produce consistent copies on their arrays as well.

Secondly: if I understand correctly, your point about "sparse" copy services is that because the virtualization engine provides a "sparse copy" (copy-on-first-write) image, it must update the metadata at all locations in the architecture to reflect the fact that a changed block has been moved, right? So a "snapshot", like FlashCopy NOCOPY or a TimeFinder SNAP, presumably.

This is akin to the classic cache-coherency problem, true for any clustered system, SMP, or distributed architecture. The usual solution is to ensure that the metadata updates happen faster than the data updates - hence fast interconnects, distributed lock managers, and similar mechanisms. Even in an in-band appliance architecture, multiple cluster nodes need such metadata updates to happen faster than data updates.

I believe your contention is that such inter-node metadata coherency is fairly mainstream and fast in an n-way cluster, but a lot harder and slower on an intelligent inline switch blade in a split-path virtualization engine. I would submit that it's not apparent, at least to me, why you believe this is so. Here's my thinking.

Say I have a virtual LUN A and want to make a "sparse" copy of it on a virtual LUN B. My virtualization engine will need to generate a memory structure for B that mimics source LUN A's data through pointers. Until I actually change a block on A, all I need is the metadata map in memory, and the real storage needed to support B is zero, right?

Now I change a block with a write I/O. In a split-path architecture - Invista for definiteness (I'm fairly sure Invista does NOT support "sparse" copies as of now, so this is hypothetical) - the CPC holds the metadata, and copies of it sit on the ASICs (DPCs) in the intelligent SAN fabric blades. So when the write is done to the volume on A, Invista would retrieve the block on A to be changed, write it out to a "save" pool of some sort, modify the memory structure defining B to reflect the pointer change to the save area, merge the modification into the block, and write the modified block out to A (maybe not necessarily in that strict order).

The way I see it, that's multiple physical I/Os to complete the operation: a read from A, a write of the unmodified block to the save area, and a write of the modified data to A. There is one metadata update - a pointer change from the block on A to its new location in the save area.

Given three physical I/Os with latencies on the order of disk speeds (milliseconds), I am hard pressed to believe that the metadata update, which happens at wire and memory speeds (microseconds at best), will be slower than the data update.

So while one might argue that this happens at backplane or cluster-interconnect speeds in an in-band appliance, I don't believe wire speeds (between CPC and DPC for Invista) will pose a limitation for metadata updates in a split-path scenario. From a metadata-semantics perspective, this mechanism is no more complex than a standard metadata update such as new volume creation.

Am I missing something?

My pet peeve about in-band solutions (Storagezilla - I agree with you) is that they do blow out the brains of any array below. I think one should be able to invoke such copy services using array functionality (like SRDF or PPRC), and the ideal virtualization engine would be able to pick and choose where such functions are invoked. I'm not sure any of these three classes of solutions does that well or completely as of today... :^(

Clearly much room for improvement for all.

Would love to hear your take on this! I will continue to noodle on your thoughts...

Cheers, K.
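P.S. To make that write sequence concrete, here is a minimal Python sketch of the copy-on-first-write path I'm describing. Everything in it (the SparseCopy and SavePool names, dictionaries standing in for disks) is purely illustrative - a toy model under my assumptions above, not Invista's or any vendor's actual implementation.

    class SavePool:
        """Toy append-only pool standing in for the 'save' area."""
        def __init__(self):
            self._slots = []

        def allocate(self):
            self._slots.append(None)
            return len(self._slots) - 1

        def store(self, loc, data):
            self._slots[loc] = data

        def fetch(self, loc):
            return self._slots[loc]


    class SparseCopy:
        """Virtual LUN B presents LUN A's original contents via pointers."""
        def __init__(self, source, save_pool):
            self.source = source        # block number -> data for LUN A
            self.save_pool = save_pool  # holds preserved original blocks
            self.metadata = {}          # block number -> location in save pool

        def write_source(self, block, new_data):
            """Host write to LUN A after the sparse copy was taken."""
            if block not in self.metadata:
                # Data path: physical I/Os at disk latencies (milliseconds).
                original = self.source[block]        # read old block from A
                loc = self.save_pool.allocate()
                self.save_pool.store(loc, original)  # write it to the save area
                # Metadata path: one pointer update at memory speed (microseconds).
                self.metadata[block] = loc
            self.source[block] = new_data            # write modified block to A

        def read_copy(self, block):
            """Read from LUN B: preserved block if A has changed, else A itself."""
            if block in self.metadata:
                return self.save_pool.fetch(self.metadata[block])
            return self.source[block]


    # Usage: B costs no real storage until A is modified.
    lun_a = {0: "old-0", 1: "old-1"}
    lun_b = SparseCopy(lun_a, SavePool())
    lun_b.write_source(0, "new-0")
    assert lun_b.read_copy(0) == "old-0"   # B still shows the original block
    assert lun_b.read_copy(1) == "old-1"   # unchanged blocks read straight from A
    assert lun_a[0] == "new-0"             # A carries the new data

The point of the sketch is the ratio: three data-path operations at disk speed against one in-memory pointer update.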

2 localhost commented Permalink

Kartik,

Thanks for taking the time to read and comment.

The first point: "Such services can reside below a cache, but require a preparation step where the cache is flushed for those disks - thus ensuring you have a consistent image of what is about to be 'flashed'."

I should have made it clear I was talking about a cache in the virtualization device itself. As you state, anything that terminates the I/O has to provide this facility, and in SVC's implementation there is a short "prepare" step where this is done. The virtual LUNs go 'write through' until the snap or FlashCopy is actually triggered; from then on, as you thought, they are available and addressable.

Thanks also for the clarification on how this is done on Symmetrix - it's always good to hear how other vendors implement their solutions.

The second point, regarding "sparse" copies. My main point is really the complexity of doing this in an n-way distributed split-path design, due to what I understand are limitations of the line cards: they do not have very much local buffer or meta-data storage capacity (unless I am mistaken). But why would this matter, you are probably thinking? Let me clarify.

The key piece, and the one missing from the example you provided, is how the device knows whether it has to do more than just write the new data. Usually there is a bitmap in meta-data that records whether a "grain" on LUN A has already been split to LUN B. This is independent of the virtualization meta-data. LUNs A and B exist in the virtualization map at the start. When a new write comes in for A, the old "grain" is split - copied to B - while the new write is merged into A. At the same time the bitmap is updated to say this grain has been split, so any subsequent writes just go to A. It is the consistency of this bitmap that I was referring to, not updates to the virtualization map itself. One major advantage of a caching virtualization device is that the underlying read, merge, and write operations are hidden and do not add additional latency to the source (LUN A) I/O operations.

Expanding on the bitmap updates: if you plan to do a true n-way solution, every intelligent blade may need a copy of every bitmap. Usually this is a bitmap per LUN, and with large numbers of LUNs this requires a fairly large amount of bitmap space. Even if you have 'owner' nodes for the bitmaps (i.e. they only reside on one or two physical devices), you need to be able to synchronise between them and very quickly establish what needs to be done, so as not to affect source write I/O with any additional latency (unless you provide a cache).

Your final point was about in-band solutions blowing away underlying copy services. SVC does provide 'image mode' vdisks, which maintain a one-to-one mapping between virtual and managed LBAs. Combined with the ability to disable the cache for any given vdisk, you can still make use of your underlying copy services. It just means that you don't get some of the benefits of using virtualization in the first place.

Barry
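P.S. A rough Python sketch of the grain-and-bitmap split I'm describing, to show why it is the bitmap (not the virtualization map) whose consistency matters. The names, the grain size, and the dictionary "disks" are illustrative only - this is a toy model, not SVC internals.

    GRAIN_SIZE = 4  # blocks per grain; real grains are much larger (e.g. 64 KB+)

    class FlashCopyMap:
        """Per-relationship split bitmap: source LUN A -> target LUN B."""
        def __init__(self, source, target, num_grains):
            self.source = source               # block number -> data for LUN A
            self.target = target               # block number -> data for LUN B
            self.split = [False] * num_grains  # the bitmap whose consistency matters

        def write_source(self, block, data):
            """Host write to A: split the grain to B first if not already done."""
            grain = block // GRAIN_SIZE
            if not self.split[grain]:
                # Copy the old grain from A to B, then mark it split so
                # subsequent writes to this grain go straight to A.
                start = grain * GRAIN_SIZE
                for b in range(start, start + GRAIN_SIZE):
                    if b in self.source:
                        self.target[b] = self.source[b]
                self.split[grain] = True
            self.source[block] = data          # merge the new write into A

        def read_target(self, block):
            """Read from B: copied grain if split, otherwise redirect to A."""
            if self.split[block // GRAIN_SIZE]:
                return self.target.get(block)
            return self.source.get(block)


    # Usage: only the overwritten grain is physically copied to B.
    lun_a = {0: "a0", 1: "a1", 4: "a4"}
    fc = FlashCopyMap(lun_a, {}, num_grains=2)
    fc.write_source(0, "a0-new")
    assert fc.read_target(0) == "a0"   # grain 0 was split before the write landed
    assert fc.read_target(4) == "a4"   # grain 1 untouched, read redirected to A

In an n-way design, every blade that can receive a write to A needs an up-to-date view of that split list (or a fast way to ask its owner) before it can decide whether to do one I/O or three.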