
Comments (28)

Comment 1

This comment will surely surprise you:

CONGRATULATIONS!

This is an extremely important achievement of the sort I personally envisioned back when we started on our path to flash. And there will undoubtedly be several more milestones as solid-state persistent storage technologies make their way into commercial storage solutions. You and the Almaden team are to be commended.

But this really isn't an arms race, or even an inter-vendor battle. Practical use cases must align the cost of the technology with both the performance and capacity requirements. I suspect there's lots more work for us all to do before we have just the right balance of RAM, NAND and spinning rust that will be required for mass-market appeal.

One note about my presentation, though - "EFD" doesn't stand for EMC flash drive - the "E" is for "Enterprise." True, many of our field folk have taken to referring to the STEC drives as EMC flash drives, but that's more because EMC is still the only place you can buy them in an array.

Oh - and those mixed workload issues you not-so-subtly hinted at... there are ways to mitigate those with a little bit of old-fashioned innovation and some integration between the drive and the array microcode. You guys might not have figured that out yet.

But honestly, congrats on the accomplishment, and thanks for joining in the efforts to make flash a commercial reality!

Comment 2

I thought that the SVC 4.x code was rated at something under 300,000 IOPS with something like 8 nodes? How did you get past that?

Comment 3

BarryB,

THANKS! Look out for more news soon. And thanks for the correction re EFD.

OSSG,

As with all things performance, it is never that simple.

The 8-node SPC-1 cluster did achieve just under 300K SPC-1 IOPS. However, SPC-1 is closer to an 8K 40/60 workload.

In my internal benchmarking, a 70/30 4K all-miss workload will achieve around 120K IOPS per node pair. This is with the cache enabled, so writes are being mirrored. If we run a pure read-miss workload, then we get just over 200K IOPS per node pair.

As the Fusion-io cards give excellent response time, we could disable the cache in SVC for these tests. This reduces the work each node has to do, and the traffic on the fabric, as there is no write mirroring in progress. It pushes the 70/30 4K number to almost that of a read-miss workload - assuming the backend storage can cope - which in this case it could.

The SVC cluster is running a subtly modified version of the SVC code which removes some of the configuration limits. As I've said before, the cluster code is designed for >64 node clusters, but our official GA test and support statement is for up to 8 nodes...
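For a rough sense of scale, here's a quick back-of-envelope in Python using just the per-node-pair figures above, and assuming roughly linear scaling across node pairs (an assumption for illustration, not a measured result):

    # Back-of-envelope scaling from the per-node-pair numbers quoted above.
    # Linear scaling across node pairs is assumed for illustration only.
    iops_70_30_cached = 120_000   # 70/30 4K all-miss, cache enabled, per node pair
    iops_read_miss    = 200_000   # pure 4K read miss, per node pair
    target            = 1_000_000

    print(target / iops_read_miss)     # ~5 node pairs for a read-miss-like workload
    print(target / iops_70_30_cached)  # ~8-9 node pairs for 70/30 with cache enabled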

Comment 4

Which begs the question: if SVC is essentially in complete passthrough mode, where's the value-add? Wouldn't a company be able to just buy the same number of SSDs and attach them directly to hosts to get the same performance?

Comment 5

So this particular config had only SSD behind it, but that's not going to be a real-life config for some years. You're forgetting all the benefits that SVC brings: you can attach normal HDD-based controllers too, and can migrate hot / not-so-hot data between HDD and SSD and back; you can FlashCopy, Mirror, Thin Provision etc; and you can still use the cache for the HDD products, since we provide per-vdisk control of caching. Not to mention the single point of management and provisioning from Tier 0 through Tier 3, depending on the needs of the application / host.

Comment 6

PS. One other thing we had to spend a lot of time tuning was the optimal data rate / queue depth to be maintained at the flash devices themselves. You want to keep the flash busy enough to get the best out of the available channels, while not overloading it and causing potential issues with the algorithms performing garbage collection, wear leveling etc. This work has been done, and the SVC backend queuing algorithms configured to sustain workloads within the optimal ranges (as we do for all storage controllers we support). Thus SVC is performing this work for you: you don't need to tune each host in turn for the workload required - it's handled by SVC.
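For readers wondering what that looks like in practice, here is a minimal, purely illustrative sketch (not SVC code) of the general idea - cap the number of in-flight I/Os per backend device so the flash stays busy without being driven past the point where garbage collection and wear leveling start to hurt latency. The queue depth of 24 is a made-up tunable:

    # Illustrative only - not SVC internals. Cap concurrent I/Os per backend
    # device; callers block until a slot frees up, keeping the device busy
    # but never pushing it past the chosen in-flight limit.
    import threading

    class BackendQueue:
        def __init__(self, queue_depth=24):              # hypothetical tunable
            self._slots = threading.BoundedSemaphore(queue_depth)

        def submit(self, io_fn, *args):
            """Run one I/O, waiting for a free slot first."""
            with self._slots:
                return io_fn(*args)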

Comment 7

Yeah, that's cool of course. But how are you going to replace the PCI flash card if it fails? And, by the way, you haven't implemented any RAID on those cards?

Comment 8

True, the card itself does not support RAID. However, SVC 4.3.0 introduced Virtual Disk Mirroring, so you can mirror across two flash controllers, which not only protects against controller failure, but also allows you to replace a card should it fail - while maintaining online access.

Comment 9

I was at the DS5000 announcement today - a great future is ahead of us! The solid state disk story will now receive a boost since the first results are in...

Can't wait to be able to test it myself!!!

greetings

ps: SVC is not only able to mirror the VDisks, you can also do RAID 0 with it. With these two functions you can do a sort of RAID 1+0 across the SSD disks.

Comment 10

Barry, thanks for your comment correcting my error in saying this was an SPC-1 benchmark. My bad.

Comment 11

Congratulations - obviously this is what Mr Legg was hinting at when he dropped in to see me recently while I was quizzing him over flash disks. However, do you not feel that VDM for flash controllers might be a bit of overkill, i.e. is there any intention to support different RAID levels at the SVC level? Obviously at that point you've pretty much built a completely abstracted disk controller, and that begs a number of questions!

Comment 12

OK, not to pile on, but since the issue was raised by the OSSG:

You've demonstrated that you can get over 1M IOPS with some number "N" greater than 8 specially-tuned SVC nodes operating in pass-through mode with very low latencies (albeit with no RAID protection for the flash devices, it seems).

What is the impact on IOPS and latencies if you are using all the "value-add" features you say justify using the SVC instead of JBOD: migrations, FlashCopy, Mirror, thin provisioning, RAID protection, etc.?

Don't get me wrong - it is very interesting to know how fast you can go without any of the features turned on. The real question, however, is how fast you can go in a more realistic operating situation.

Comment 13

MartinG,

Obviously I'm not at liberty to confirm or deny any future plans, thoughts, ideas, concepts or such on a public forum such as this; next time Steve is with you, ask for the roadmap details.

BarryB,

So maybe I wasn't clear about the 'tuning' side - we do the same 'tuning' for every storage controller - including DMX, CX, DS4K, DS8K etc - to ensure we get the best out of it. So from that point of view we did nothing new.

Passthrough is a bit of a strong term. The cache was disabled and working in write-through mode, but all the code stack is still there and I/O is processed through the system as normal - striping, virtualizing etc. This is SVC code as installed at any customer today.

Advanced functions will depend on what the source and target storage are, so a 700MB/s-capable Fusion-io source can obviously read a lot quicker than a 100MB/s-capable HDD target can write, whereas a migrate from flash to flash would be able to sustain much higher rates. As with any additional workload (such as that generated by advanced functions), the backend has to be able to sustain the combined throughput - application I/O and function I/O - so I would expect an increase in response time and a drop in top-end throughput unless you added more backend capability. (I'm sure this is the case with your EFDs in a DMX too.) As the SVC node hardware used was not running even close to saturation point, there is plenty of MIPS left to ensure that 1M at similar response times would still be possible - given adequate backend flash capability.

PS There are a few nice side effects of flash when using SEV (as I'm sure you are aware), especially for a fine-grained solution such as ours, which can seriously reduce any performance impacts when using SEV - even when the vdisks are provisioned from traditional HDD.
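To make the migration-rate point concrete, a trivial sketch using only the figures quoted above (illustration, not a measurement):

    # The sustained migrate rate is bounded by the slower side, and whatever
    # the migrate consumes comes out of the same backend budget as application I/O.
    flash_read_mb_s = 700    # Fusion-io source, per the figures above
    hdd_write_mb_s  = 100    # HDD target

    migrate_rate = min(flash_read_mb_s, hdd_write_mb_s)
    print(f"flash -> HDD migrate tops out around {migrate_rate} MB/s; "
          "flash -> flash could run much closer to the full 700 MB/s")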

Comment 14

Thanks for the explanation. But I'm still thinking that when you're running FlashCopy, Mirroring, or mirroring with cache enabled, there must be additional memory-to-memory copies and redirections that the CPUs have to handle compared with "pass-through" mode. Thus, it's not enough to add more flash drives; the load on the processors is increased by the "features", right?

Am I misunderstanding this?

And yes indeed, the real beauty of flash drives is that they can support a much higher access density than HDD can. Thin Provisioning can leverage this to deliver capacity efficiency without sacrificing performance (a true challenge on hard drives). And with flash it's no longer necessary to stripe database index tables across dozens of drives... it truly makes performance tuning a whole new ball game.

Comment 15

So by doing cache mirroring between nodes, there are no extra memory accesses on a given node; it's simply the same buffer that is submitted onto the fabric (we can have multiple references to the same memory block). There is obviously additional code to run when doing FlashCopy or Mirroring, but when you enable the cache, this hides the additional latency of copy-on-write operations, or of doing two writes to the backend in the mirroring case.

So yes, there is a longer code path through the node when you run advanced functions, and yes, that requires more CPU processing, but as I stated above we still had plenty of MIPS free for such processing, and doing the 4K-style I/O we won't hit any bandwidth limits internally. Remember, one of the key benefits of using fast-moving Intel planar technology is that we can ride the technology curve. So 1.33GHz FSB, DDR2, PCIe etc etc - these SMP multi-core boxes may be "thin" in the sense of 1U, but bandwidth within a single node is not an issue.
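A toy illustration of that "multiple references, no extra copy" point (purely illustrative - nothing to do with the actual SVC implementation):

    # The host's write buffer is wrapped, not copied: one reference goes to the
    # local write cache, another to the fabric send path for mirroring.
    payload = bytearray(b"host write data" * 256)

    cache_ref  = memoryview(payload)   # held by the local write cache
    fabric_ref = memoryview(payload)   # queued for the partner node

    assert cache_ref.obj is fabric_ref.obj   # same memory block, two references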

Comment 16

Gentlemen,

This has been a heady and informative debate. Keep it up!

I filed a short précis of it at

http://www.eetimes.com/news/latest/showArticle.jhtml;?articleID=210300295

Rick

Comment 17

Thx Rick.

Fusion-io has today released a press release covering the work we have been doing together:

http://www.fusionio.com/PDFs/Pressrelease_IBM_Fusion.pdf

Comment 18

Yes, it was the Fusion-io release that turned my attention to this page.

One question: you say you prefer a custom approach to SSDs on servers. Did you make any modifications to the Fusion-io cards beyond the use of your virtualization software?

Comment 19

Sorry for the delay - it looks like the latest upgrades to the blog software have resulted in it not working with Firefox... (being investigated)

So I had to run (expletive deleted) to add this...

The Fusion-io ioDrive is unmodified.

There is more debate over on BarryB's blog, and some clarification of the 'points' he's making.

Comment 20

Barry,

Very interesting results.

Could you clarify how many Fusion-io drive cards were used for this test?

Also, what was the usable capacity of the "Virtual Disks" that the 1 Million IOPS was run across?

Comment 21

Hi FGordon,

During the benchmarking we had 41 ioDrives running behind the SVC cluster. The virtual disks were created using the full available capacity of the formatted ioDrives. There is no benefit to short-stroking flash drives as there is with HDD, because there is no corresponding seek time.

Barry

Comment 22

Thanks Barry,

Most performance flash drives are formatted to use only a percentage (e.g. 60%) of the raw flash capacity. This increases the spare flash available for erasing and garbage collection. This is not "short stroking", but more "free" flash does improve performance.

Perhaps the better question is: what were the Fusion-io drives formatted to support... how many GBytes each?

Comment 23

Understood.

These were 160GB cards, which we formatted at 100GB usable. Fusion-io markets its cards (today) at full raw capacity values; the default format of a 160GB card is ~132GB. As you state, the more 'free' capacity, the better the performance. We used a performance vs capacity trade-off to settle on the 100GB value. If less performance is needed, more capacity can be provisioned and response time remains the same.
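The spare-capacity arithmetic implied by those figures, for anyone who wants it spelled out:

    # Spare flash held back at each format size of a 160 GB (raw) ioDrive,
    # using the figures quoted above.
    raw = 160
    for usable in (132, 100):          # default format vs. the benchmark format
        spare = raw - usable
        print(f"{usable} GB usable -> {spare} GB spare ({spare / raw:.0%} of raw) "
              "for erase blocks and garbage collection")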

Comment 24

So why are you working with Fusion-io when the Violin 1010 can demonstrate 1 Million IOPS in a 2U unit plus a server (1, 2, or 4U)?

I saw this at LinuxWorld and these guys seem to be on top.

Comment 25

Pete,

Is that a DRAM- or flash-based version? And as far as I know that number is READS ONLY, and I think from what I've seen it uses 512-byte blocks. Which is itself impressive; however, it's only really fair to compare apples with apples - so a similar 70/30 workload at 4K.

If we had only reported reads at 512 bytes we would have just about tipped the 5 Million IOPS mark.

I agree though, the Violin is an interesting box, especially if they can take it beyond PCIe attach.

Comment 26

Barry,

A single Linux host can get 1 Million IOPS with an arbitrary read/write mix from Violin's Memory Appliance populated with our DRAM VIMMs. To achieve the maximum IOPS, we use 1KB access sizes.

For a 70/30 workload at 4KByte, a Memory Appliance delivers over 400K IOPS.

-Donpaul Stephens

Comment 27

Hi Donpaul,

Interesting - how big was the Linux box? Does Violin present itself as a block device to the OS?

I got almost 1 Million IOPS from a small Linux box myself, until I realised I'd forgotten to use the O_DIRECT flag when opening the device. So it is possible from memory, but it's interesting that you can get the same over the PCI bus.

Thanks for the info.

Barry
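For anyone who hasn't been bitten by that one: without O_DIRECT, reads from a block device can be satisfied from the Linux page cache, so the benchmark ends up measuring memory rather than the device. A minimal sketch of the fix (the device path is a placeholder; O_DIRECT needs block-aligned buffers, which an anonymous mmap provides):

    # Open a block device bypassing the page cache, so reads hit the device.
    import mmap, os

    BLOCK = 4096
    fd = os.open("/dev/sdX", os.O_RDONLY | os.O_DIRECT)   # placeholder device path

    buf = mmap.mmap(-1, BLOCK)     # anonymous mmap is page-aligned, which
                                   # satisfies O_DIRECT's alignment requirements
    os.preadv(fd, [buf], 0)        # read the first 4 KiB straight from the device
    os.close(fd)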

Comment 28

Hi Barry, thanks for posting this on a site where people can ask questions and people like me can clarify potential misconceptions about our products. The Violin 1010 can present itself to the OS as a block, character, or SCSI device via our open-source driver.

We get between 700,000 and 1,250,000 IOPS on single- and dual-socket systems (1U & 2U servers). Applications (multi- or single-threaded) can access the device via the block driver by way of the buffer cache and/or directly via the O_DIRECT flag.

For 4KByte accesses, you can connect two Violin 1010s to a server and you'll get 700,000+ IOPS to 800GB of RAID-protected memory. That is a more typical configuration, and is the cat's meow for analytics.

Given your work in the flash arena, you'll probably be more interested in the Flash capability in the same 2U platform, which uses the same driver as our DRAM system. If you'll be at SC08 in Austin next month, you'll see a single 2U system with 4TB of Flash capacity. My guess is you could build a QuickSilver system with 40 TBytes usable capacity with RAID protection, much lower cost per TB, lower response times and a lot more IOPS in a similar rack space to the configuration you described. Flash does change the game for performance storage.

Feel free to contact me directly and I'll stop using your site for free advertising :-)

Cheers,
Donpaul C. Stephens
President & Founder, Violin Memory, Inc.