Sooner or later, data has to land on a physical device. And, with great oversimplification, there are two schools of thought about how best to do this.
One school of thought says data placement matters, and you should have the option of precise control over where data lands. The other says it really shouldn't matter.
Both approaches have their pros and cons. But as storage technologies transition to much faster (and slower) devices, I think there are going to be some interesting vendor positioning exercises going on ...
So, What's This All About?
How you land your data on a physical device can really impact storage performance.
Write it and read it sequentially -- fast. Write it and read it randomly -- not so fast.
Put data on outside tracks -- fast. Put data on inside tracks -- not so fast.
Stay within a device's IOPs envelope -- fast. Exceed its IOPs capability -- not so fast.
And so on. It can get pretty complicated.
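To put some very rough numbers on it -- these are purely illustrative figures I'm assuming, not measurements from any particular drive -- here's a quick back-of-the-envelope sketch of why random I/O hurts so much on a spinning disk:

    # Back-of-the-envelope disk model. All numbers are illustrative assumptions,
    # roughly in the ballpark of a 15K RPM enterprise drive.
    avg_seek_ms   = 3.5     # average seek time
    avg_rotate_ms = 2.0     # roughly half a rotation at 15K RPM
    transfer_mb_s = 100.0   # sustained sequential transfer rate
    io_size_kb    = 8       # a typical small-block I/O

    # Sequential: the head barely moves, so you get close to the raw transfer rate.
    seq_mb_s = transfer_mb_s

    # Random: every I/O pays a seek plus rotational latency before it moves any data.
    ms_per_random_io = avg_seek_ms + avg_rotate_ms + (io_size_kb / 1024.0) / transfer_mb_s * 1000.0
    random_iops = 1000.0 / ms_per_random_io
    random_mb_s = random_iops * io_size_kb / 1024.0

    print(f"random IOPs envelope : ~{random_iops:.0f} IOPs")    # ~180
    print(f"random throughput    : ~{random_mb_s:.1f} MB/s")    # ~1.4 MB/s
    print(f"sequential throughput: ~{seq_mb_s:.0f} MB/s")       # ~100 MB/s

Same drive, roughly two orders of magnitude difference in delivered throughput, purely based on where and how the data lands.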
Historically, enterprise storage users -- in the interest of optimizing performance -- wanted to have very precise control of how they carved up their storage, and a very thorough understanding of what kind of data would land on which devices, and which portions of the device.
More performance could be achieved by harnessing multiple spindles to appear as one -- more IOPs, more bandwidth -- but there were tradeoffs everywhere.
But large arrays support multiple workloads. Some need performance, others just need capacity. So working the tradeoffs between the two can be a fine art at times.
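If it helps to picture the "multiple spindles as one" idea, here's a minimal sketch of the general round-robin striping concept -- not any particular vendor's layout algorithm, just the shape of it:

    # Minimal striping sketch: logical blocks spread round-robin across N spindles,
    # so independent I/Os tend to land on different drives and the IOPs and
    # bandwidth of the set add up. Stripe depth and spindle count are made up.
    STRIPE_DEPTH_KB = 64
    SPINDLES = 8

    def spindle_for(offset_kb):
        """Map a logical offset (in KB) to the spindle that holds it."""
        stripe_number = offset_kb // STRIPE_DEPTH_KB
        return stripe_number % SPINDLES

    # Eight concurrent 64 KB reads at consecutive stripe offsets hit eight different drives.
    offsets = [i * STRIPE_DEPTH_KB for i in range(8)]
    print([spindle_for(o) for o in offsets])   # -> [0, 1, 2, 3, 4, 5, 6, 7]

Eight drives' worth of IOPs and bandwidth behind a single logical volume -- but every other workload you place on those same spindles is now competing for the very same heads.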
And you'll see this "ability to control things when you need to" thinking evident in EMC array products, such as DMX and CX.
Not Everyone Needs This Level Of Control
Other vendors looked at the problem differently. Storage should just be a pool, they thought. We'll have our array software randomize any data between all the available spindles. It'll be easier to manage, and we'll get great performance from our array.
And you'll see this thinking embedded in products from NetApp, and in newer offerings from smaller vendors such as EqualLogic, XIV and Compellent, to name a few.
I'd agree that -- yes -- some aspects of storage are easier to manage when everything is one giant, spindle-randomized pool. But you do give up a few things in the process.
First, it's devilishly hard to get performance optimization and isolation in these environments. If you're supporting multiple different uses of a single array, you'd like to have a portion that's high performance, maybe another region that's medium performance, and perhaps a low-performance, high-capacity region.
You'd like to be able to partition and isolate spindles, as well as cache, processor and other I/O hardware. And if everything is designed as one ginormous pool, you'll have to get very clever about how you set things up.
Or end up buying multiple arrays for different workloads, which is what seems to happen more often. That can get expensive.
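To make the isolation idea concrete, here's a toy sketch of what carving an array into isolated regions looks like conceptually. This isn't any real array's configuration syntax -- the tier names and drive counts are made up -- it just shows the shape of the problem:

    # Toy model of carving one 96-drive array into isolated performance regions.
    # Tier names, spindle counts and RAID choices are illustrative only.
    tiers = {
        "high_perf": {"spindles": list(range(0, 32)),  "raid": "RAID 10"},
        "medium":    {"spindles": list(range(32, 64)), "raid": "RAID 5"},
        "capacity":  {"spindles": list(range(64, 96)), "raid": "RAID 5"},
    }

    workloads = {
        "oltp_db":     "high_perf",   # latency-sensitive, gets dedicated spindles
        "file_shares": "medium",
        "backups":     "capacity",
    }

    # The whole point: no two tiers share a spindle, so the backup stream
    # can't steal head movements from the OLTP database.
    all_spindles = [s for t in tiers.values() for s in t["spindles"]]
    assert len(all_spindles) == len(set(all_spindles)), "tiers overlap!"

In a fully randomized pool, there's no equivalent knob to turn -- every workload shares every spindle, whether you want it to or not.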
Second, there's something deceptive about the initial performance you'll see on these "spindle-randomizing" arrays. Imagine one of these arrays with, say, 96 drives, and you put a single, small application on it -- maybe a benchmark test?
And, don't you know, you've got all those spindles behind your workload, and -- it flies! It's amazing!
Now, have some fun. Load it up to capacity. Throw multiple workloads at it. A very different performance picture emerges. I can't count how many times I've seen vendors pull this trick -- they'll configure the "other guy's" array with a small number of spindles to reach a given capacity, and use all of theirs.
Unfortunately, EMC is usually the "other guy" ...
Aggregate performance of multiple competing workloads in a full array is relatively hard to test. But it's also the real world we live in.
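The arithmetic behind the trick is simple enough. Using the same sort of illustrative numbers as before -- call it roughly 180 random IOPs per 15K spindle:

    # Why the "empty array" benchmark looks so good. Illustrative numbers only.
    IOPS_PER_SPINDLE = 180            # rough figure for a 15K drive doing random I/O

    # One small workload with all 96 spindles behind it:
    print(96 * IOPS_PER_SPINDLE)      # ~17,280 IOPs available to that lone workload

    # The "other guy's" array, configured with just enough spindles for capacity:
    print(16 * IOPS_PER_SPINDLE)      # ~2,880 IOPs -- same capacity, 6x less performance

    # Now load the big pool up with, say, a dozen competing workloads:
    print(96 * IOPS_PER_SPINDLE / 12) # ~1,440 IOPs each, plus seek contention on top

Same hardware, very different story once the array is doing real work.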
We've Always Believed That Customers Should Have Choices
Want to configure your array as one big, spindle-randomizing pool? Sure, we can do that. We don't think it's usually a good idea, but you can if you want.
Want to establish different regions with different performance characteristics, and keep them from interfering with each other? We can do that too ... and, in the real world, this is the dominant use case by far.
Yes, it takes some extra work. And some understanding of your different workloads. And some basic understanding of the tradeoffs involved. We can make it easier, but no one has created the magic array -- yet!
Newer Storage Technology Makes This Even More Interesting ...
So, you're probably aware that there are some new options out there -- higher highs, lower lows.
In terms of uber-performance, there are enterprise flash drives. Very fast. Also very expensive, but the price is coming down quickly. That being said, it'll probably never be as cheap as the cheapest disk.
So, let's say that you've got an application or two that could really use a 30x or so IOPs boost. You'd like to buy the smallest amount of the expensive stuff to get the performance boost you're looking for. That means you're going to have to isolate an application (or a portion of an application) to very specific storage devices.
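Here's a rough sizing sketch of that thinking. The IOPs figures and drive counts are placeholders, not specs or a quote:

    # Rough tiering arithmetic: put only the hot slice on flash.
    # All figures are illustrative placeholders.
    hot_iops_needed   = 20000      # the I/O-hungry application (or its hot portion)
    iops_per_fc_drive = 180        # 15K spinning drive doing random I/O
    iops_per_flash    = 5400       # ~30x the spinning drive, per the boost above

    drives_if_spinning = -(-hot_iops_needed // iops_per_fc_drive)   # ceiling division
    drives_if_flash    = -(-hot_iops_needed // iops_per_flash)

    print(drives_if_spinning)   # ~112 spinning drives just to hit the IOPs target
    print(drives_if_flash)      # ~4 flash drives -- if you can pin the hot data to them

The economics only work if you can point exactly the right data at those few expensive drives -- which is precisely the kind of placement control we've been talking about.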
Good luck doing that with NetApp and WAFL. So much so that they've decided to use flash as a sort of aggregate cache accelerator at the array level -- with much less compelling results, I'd argue. Ditto for other storage arrays based on the same design principle.
If you're interested, you might appreciate this well-written and well-researched article on the subject.
The same effect plays out on the other end of the spectrum -- very cost-effective storage. 1TB drives are commonplace -- we'll see higher capacities before too long -- and we're seeing more use cases for data deduplication and other forms of capacity reduction. Not to mention drive spin-down.
None of these make the drives any faster ... so you're going to want to isolate workloads that can live comfortably at these lower levels of performance.
As a matter of fact, if you think out a few years, it's not hard to imagine only two types of storage media:
-- enterprise-class flash drives for the stuff that needs to go fast, and
-- multi-TB drives that are deduped and spun down for everything else
So, Here's The Argument
Sure, these "spindle-randomizing" arrays are an interesting variation on a theme. And I'm sure that there are places where they're useful.
But if your goal is to support multiple service levels at multiple cost points, you're probably going to end up having to use multiples of these arrays, each targeted at a different performance/cost tradeoff point.
And that can get expensive: acquisition costs, management costs, and the effort associated with moving things back and forth.
No, I think that the marketplace will want storage arrays that can handle multiple types of workloads -- simultaneously -- each with a different performance/cost tradeoff. And do so in such a way that workloads don't step on each other, it's easy to move things back and forth, and so on.
I can make a reasonable argument that this is largely true today.
But -- fast forward just a short bit of time -- and I think it'll be even more obvious.