To say I started a bit of a ruckus is an understatement.
48 hours later, I thought I'd share with you what we've learned through this exercise.
What Happened First
I published a blog post showing differences in usable capacity between three popular midtier vendors. Not surprisingly, the results showed that the EMC CX4 was more capacity efficient than HP's EVA or NetApp's FAS.
We designed the configs around 120 usable disks, and set them up for multiple instances of a demanding application, e.g. Exchange. We did the best we could with published documentation. We made no effort to game anything whatsoever.
And we offered to set the record straight if we made a mistake.
What Happened Next
A lot of blog hits and a lot of comments is what happened next, including a pick-up from The Register as well as Blocks and Files.
Lots of angry comments from vendor employees. A few users chiming in. And, somewhere in the noise, a few useful nuggets came out of the discussion, which I want to share below.
BTW, if you ever feel like interacting with someone on a blog or forum, might I suggest an effort to be somewhat polite and courteous? It works well in the real world, and it makes sense online as well.
The NetApp FAS
The "capacity efficiency variable" here is snap reserves. Sure, there's overhead for RAID 6 DP versus RAID 5 (an academic argument to be sure), and filesystems, and whatnot, but the big glaring standout is snap reserves.
To this day, almost all published documentation recommends a 100% snap reserve for demanding block-oriented applications. In addition, the system defaults to this, so that's what most people end up running.
This was confirmed by several users (go read the comments), was not refuted by NetApp (yet!), which means that -- so far -- the results largely stand.
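To make the snap reserve math concrete, here's a quick back-of-envelope sketch. The pool size is a number I made up for illustration -- not from anyone's sizing tool -- but the arithmetic is the point: a reserve equal to the LUN size means every gigabyte presented to a host consumes roughly two gigabytes of the pool.

# Back-of-envelope sketch of the snap reserve effect described above.
# The pool size below is a hypothetical number for illustration only.

def lun_capacity_gb(pool_gb, snap_reserve_pct):
    """GB of LUNs a pool can present if each GB of LUN needs an extra
    snap_reserve_pct% set aside as snapshot reserve."""
    return pool_gb / (1.0 + snap_reserve_pct / 100.0)

pool_gb = 10000  # hypothetical usable pool after RAID overhead

for pct in (0, 20, 100):
    print("%3d%% reserve -> %6.0f GB presentable as LUNs"
          % (pct, lun_capacity_gb(pool_gb, pct)))

Run it and a 100% reserve cuts the presentable capacity of that hypothetical pool in half, which is exactly the effect that showed up in our comparison.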
The HP EVA
A different scenario played out when discussing the EVA.
Here, the "capacity efficiency variable" is the use of disk groups. Each disk group is almost like a virtual storage array -- it shares common sparing, for example. EVA software combines all members of a disk group into a single, large pool, from which you carve virtual disks for application use.
The more disk groups you use, the more capacity goes to hot spare-style protection. Obviously, HP customers have a vested interest in using as few disk groups as possible in the interests of space efficiency.
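To see why, here's a rough sketch of the sparing math. The disk size and the "two disks' worth per group" protection level are my assumptions for illustration, not HP sizing guidance, but the shape of the result is what matters: the reserved capacity grows linearly with the number of disk groups.

# Rough sketch of how distributed-sparing overhead scales with disk groups.
# Disk size and protection level are illustrative assumptions, not HP guidance.

TOTAL_DISKS = 120
DISK_GB = 450                # assumed per-disk capacity
SPARE_DISKS_PER_GROUP = 2    # assumed protection level: ~2 disks' worth reserved per group

def sparing_overhead_gb(num_groups):
    """Capacity reserved for hot spare-style protection across all disk groups."""
    return num_groups * SPARE_DISKS_PER_GROUP * DISK_GB

raw_gb = TOTAL_DISKS * DISK_GB
for groups in (1, 2, 7):
    overhead = sparing_overhead_gb(groups)
    print("%d group(s): %5d GB reserved for sparing (%.1f%% of raw)"
          % (groups, overhead, 100.0 * overhead / raw_gb))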
HP stated that -- for our exercise -- 1 or 2 disk groups should have been used, and not 7. If they are correct as stated, our results are wrong, and we need to go recalc a bit.
But I'm not 100% satisfied, and here's why.
Certain HP documentation clearly states that in certain situations, customers will want to create multiple disk groups for performance reasons. For example, a database in one disk group, a transaction log in another disk group. Or putting sequentially accessed data in one disk group, and randomly accessed data in another disk group.
The concept is sometimes called "performance isolation", i.e. minimizing contention between demanding applications. On a traditional (i.e. non-virtualized) array, this is pretty easy to do: simply carve up LUN groups and hand them out for different purposes. No one will step on anyone else's spindles.
But this is a bit harder with the EVA -- everything is spread around. Carve up virtual disks and hand them out, and their underlying sectors end up on all the spindles. This sort of approach offers great performance (all the spindles are put to use), but presumes that you've got enough spindle I/O to go around.
And with no ability to isolate, say, one application being backed up to disk (B2D) from another, interactive application, you'll have the potential of applications stepping on each other from time to time.
If you don't have enough spindle I/O for some reason, you'll experience performance contention -- the actions of one application will directly affect performance on other applications. If this happens on an EVA, you have a few choices: (1) buy more spindles (disks), (2) buy another array, or (3) create additional disk groups to isolate applications from each other.
So, if you plan to load up your EVA with several performance-intensive applications, and you don't want them stepping on each other, there's a case that can be made (unofficially confirmed) that you'll want more than the 1 or 2 disk groups that HP is offering up. And of course, that means more overhead to support each disk group.
But there's more.
HP documentation also points out the desirability of having multiple disk groups for availability isolation. Since all applications use all disks in a disk group, a failed disk that can't be recovered puts a neat hole in *all* your applications, and not just one.
Let me say that again.
If you are running a single disk group (as recommended occasionally by HP), and a disk fails that can't be recovered with hot sparing, etc. -- you run the risk of *every* application, file system, etc. having a problem and needing to be recovered.
I don't know about you, but I'd be strongly motivated to use a significant number of disk groups to protect myself from that scenario. I wouldn't want an unrecoverable disk failure in, say, my file system to take down Exchange, Oracle, SAP, SQL Server, etc. etc.
And neither would you, I think.
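A toy illustration of the blast radius, if you'll indulge me. The application mix and the even spread across groups are hypothetical assumptions on my part, but the point should be clear: fewer disk groups means more applications exposed to any single unrecoverable failure.

# Toy illustration of availability isolation: how many applications share the
# blast radius when one disk group has an unrecoverable failure.
# Assumes these applications are spread evenly across groups -- hypothetical.

APPLICATIONS = ["Exchange", "Oracle", "SAP", "SQL Server",
                "file serving", "backup-to-disk"]

def apps_exposed(num_groups):
    """Average number of applications living in the disk group that failed."""
    return float(len(APPLICATIONS)) / num_groups

for groups in (1, 2, 6):
    print("%d disk group(s): ~%.1f application(s) affected by a single bad group"
          % (groups, apps_exposed(groups)))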
All of a sudden, our recommendation for 7 disk groups doesn't look as bad as it first did.
So, I'm going to have to rephrase the discussion with our friends at HP as follows:
"What are the recommended number of disk groups for an EVA with 120 usable disks where the customer has 6 or 7 demanding applications, and desires a significant degree of performance isolation and availability isolation?".
I bet the answer isn't one or two disk groups ...
There were those who thought that there was no way a vendor should be making one of these comparisons, that it should be done by an independent party. I'd agree, but the problem is coming up with an independent party these days. Any suggestions?
There were those who argued that the protection levels should have been different on the configs. We went with what each vendor recommended, or tried to. You're free to debate the merits of RAID 6 versus RAID 5 with proactive global hot spares separately ...
And there were those who thought we should have benchmarked each of these configs to ensure we would get equivalent performance from each. Sorry, I don't think that's possible.
But several people thought it was a useful -- though flawed -- discussion. And that's all we really wanted.
Well, based on what we've seen so far, it's unlikely that NetApp will change their recommendations (or their defaults) for FAS in these environments.
They're wedded to RAID DP (fine) and are forced to support 100% snap reserves, since running out of snap reserve means a vicious application crash. Oh, I'm sure we'll see some posturing from them, but nothing substantive to correct our work.
HP's a different case. Their design works well for one or two applications, but doesn't look so good when one starts loading up multiple applications that demand a degree of isolation.
And, since that's a subjective discussion, we probably need to go back and rework the numbers for EVA showing 3 or 4 disk groups, and not 7.
Thanks to all who chimed in ...