Sooner or later, data has to land on a physical device. And, at the risk of great oversimplification, there are two schools of thought about how best to do this.
One school of thought is that data placement matters, and there should be the option for precise control of where data lands. And there's another school of thought that it really shouldn't matter.
Both approaches have their pros and cons. But as storage technologies transition to much faster (and slower) devices, I think there are going to be some interesting vendor positioning exercises going on ...
So, What's This All About?
How you land your data on a physical device can really impact storage performance.
Write it and read it sequentially -- fast. Write it and read it randomly -- not so fast.
Put data on outside tracks -- fast. Put data on inside tracks -- not so fast.
Stay within a device's IOPs envelope -- fast. Exceed its IOPs capability -- not so fast.
And so on. It can get pretty complicated.
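As a back-of-the-envelope illustration of the IOPs-envelope point, here's a quick sizing sketch. The per-drive numbers and the RAID 5 write penalty are rough rules of thumb, not measurements from any particular product:

```python
# Rough spindle-count sizing sketch -- illustrative numbers only.
# A 15K RPM FC drive is often quoted at roughly 180 random IOPS;
# a 7.2K RPM SATA drive at roughly 80. Your drives will vary.

def spindles_needed(workload_iops, per_drive_iops, write_fraction, raid_write_penalty):
    """Estimate how many spindles a random workload needs once the
    RAID write penalty (e.g. ~4 back-end I/Os per host write for RAID 5)
    is taken into account."""
    backend_iops = workload_iops * ((1 - write_fraction) + write_fraction * raid_write_penalty)
    return -(-backend_iops // per_drive_iops)   # ceiling division

# Example: 5,000 host IOPS, 30% writes, RAID 5, 15K FC drives
print(spindles_needed(5000, 180, 0.30, 4))   # -> 53.0 spindles just to stay inside the envelope
```

Exceed that envelope -- or land the workload on the wrong drives -- and the "not so fast" column above is where you end up.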
Historically, enterprise storage users -- in the interest of optimizing performance -- wanted very precise control over how they carved up their storage, and a very thorough understanding of what kind of data would land on which devices, and on which portions of those devices.
More performance could be achieved by harnessing multiple spindles to appear as one -- more IOPs, more bandwidth -- but there were tradeoffs everywhere.
But large arrays support multiple workloads. Some need performance, others just need capacity. So working the tradeoffs between the two can be a fine art at times.
And you'll see this "ability to control things when you need to" thinking evident in EMC array products, such as DMX and CX.
Not Everyone Needs This Level Of Control
Other vendors looked at the problem differently. Storage should just be a pool, they thought. We'll have our array software randomize any data across all the available spindles. It'll be easier to manage, and we'll get great performance from our array.
And you'll see this thinking embedded in products from NetApp, and in the newer offerings from smaller vendors such as EqualLogic, XIV and Compellent, to name a few.
And I'd agree that -- yes -- some aspects of storage are easier to manage when everything is one giant, spindle-randomized pool. But you do give up a few things in the process.
First, it's devilishly hard to get performance optimization and isolation in these environments. If you're supporting multiple uses of a single array, you'd like to have a portion that's high performance, maybe another region that's medium performance, and perhaps a low-performance, high-capacity region.
You'd like to be able to partition and isolate spindles, as well as cache, processors and other I/O hardware. And if everything is designed as one ginormous pool, you'll have to get very clever about how you set things up.
Or end up buying multiple arrays for different workloads, which is what seems to happen more often. That can get expensive.
Second, there's a certain deceptive nature to the initial performance seen on these "spindle randomizing" arrays. Imagine one of these arrays with, say, 96 drives, and you put a single, small application on it -- maybe a benchmark test?
And, don't you know, you've got all those spindles behind your workload, and -- it flies! It's amazing!
Now, have some fun. Load it up to capacity. Throw multiple workloads at it. A very different performance picture emerges. I can't count how many times I've seen vendors pull this trick -- they'll configure the "other guy's" array with a small number of spindles to reach a given capacity, and use all of theirs.
Unfortunately, EMC is usually the "other guy" ...
Aggregate performance of multiple competing workloads in a full array is relatively hard to test. But it's also the real world we live in.
We've Always Believed That Customers Should Have Choices
Want to configure your array as one big, spindle-randomizing pool? Sure, we can do that. We don't think it's usually a good idea, but you can if you want.
Want to establish different regions with different performance characteristics, and keep them from interfering with each other? We can do that too ... and, in the real world, this is the dominant use case by far.
Yes, it takes some extra work. And some understanding of your different workloads. And some basic understanding of the tradeoffs involved. We can make it easier, but no one has created the magic array -- yet!
Newer Storage Technology Makes This Even More Interesting ...
So, you're probably aware that there are some new options out there -- higher highs, lower lows.
In terms of uber-performance, there are enterprise flash drives. Very fast. Also very expensive, but the price is coming down very fast. That being said, it'll probably never be as cheap as the cheapest disk.
So, let's say that you've got an application or two that could really use a 30x or so IOPs boost. You'd like to buy the smallest amount of the expensive stuff to get the performance boost you're looking for. That means you're going to have to isolate an application (or a portion of an application) to very specific storage devices.
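To make that sizing logic concrete, here's a toy sketch of the "buy the smallest amount of the expensive stuff" math. It assumes the classic skewed-access pattern where a small slice of capacity receives most of the I/O; the percentages are illustrative assumptions only:

```python
# Toy illustration of why precise placement matters for flash economics.
# Assumption (illustrative only): 10% of the capacity serves 80% of the I/O.

TOTAL_CAPACITY_GB = 10_000
HOT_CAPACITY_FRACTION = 0.10     # the "hot" slice of the data
HOT_IO_FRACTION = 0.80           # share of total I/O that slice receives
FLASH_SPEEDUP = 30               # the ~30x IOPs boost mentioned above

hot_gb = TOTAL_CAPACITY_GB * HOT_CAPACITY_FRACTION

# If the hot slice can be isolated and pinned to flash, the overall
# I/O speedup (by Amdahl-style reasoning) is:
effective_speedup = 1 / ((1 - HOT_IO_FRACTION) + HOT_IO_FRACTION / FLASH_SPEEDUP)

print(f"Flash to buy: {hot_gb:.0f} GB out of {TOTAL_CAPACITY_GB} GB")
print(f"Overall I/O speedup with precise placement: ~{effective_speedup:.1f}x")
# If you *can't* isolate the hot slice to specific devices, you end up
# buying flash for far more of the capacity to get a comparable result.
```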
Good luck doing that with NetApp and WAFL. So much so that they've decided to use flash as a sort of aggregate cache accelerator at the array level -- with much less compelling results, I'd argue. Ditto for other storage arrays based on the same design principle.
If you're interested, you might appreciate this well-written and well-researched article on the subject.
The same effect plays out on the other end of the spectrum -- very cost-effective storage. 1TB drives are commonplace -- we'll see higher capacities before too long -- and we're seeing more use cases for data deduplication and other forms of capacity reduction. Not to mention drive spin-down.
None of these make the drives any faster ... so you're going to want to isolate workloads that can live comfortably with these lower levels of performance.
As a matter of fact, if you think out a few years, it's not hard to imagine only two types of storage media:
-- enterprise-class flash drives for the stuff that needs to go fast, and
-- multi-TB drives that are deduped and spun down for everything else
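If that two-tier world comes to pass, a toy placement policy might look something like the sketch below. The tier names, thresholds and example datasets are my own illustrative assumptions, not any product's behavior:

```python
# Minimal two-tier placement sketch: flash for hot data, big deduped /
# spun-down drives for everything else. Thresholds are arbitrary.

FLASH_TIER = "enterprise-flash"
CAPACITY_TIER = "multi-TB-dedupe-spindown"

def choose_tier(avg_iops_per_gb, latency_sensitive):
    """Pick a tier for a dataset based on its I/O density and sensitivity."""
    if latency_sensitive or avg_iops_per_gb > 1.0:
        return FLASH_TIER
    return CAPACITY_TIER

# Example datasets (hypothetical numbers)
datasets = {
    "oracle-redo-logs": (25.0, True),
    "home-directories": (0.05, False),
    "backup-staging":   (0.01, False),
}
for name, (density, sensitive) in datasets.items():
    print(f"{name:20s} -> {choose_tier(density, sensitive)}")
```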
So, Here's The Argument
Sure, these "spindle-randomizing" arrays are an interesting variation on a theme. And I'm sure that there are places where they're useful.
But if your goal is to support multiple service levels at multiple cost points, you're probably going to end up having to use multiples of these arrays, each targeted at a different performance/cost tradeoff point.
And that can get expensive: acquisition costs, management costs, effort associated with moving things back and forth.
No, I think that the marketplace will want storage arrays that can handle multiple types of workloads -- simultaneously -- each with a different performance/cost tradeoff. And do so in such a way that workloads don't step on each other, it's easy to move things back and forth, and so on.
I can make a reasonable case that this is largely true today.
But -- fast forward just a short bit of time -- and I think it'll be even more obvious.
Been having a few conversations recently about this very subject: the impact of larger disks (with fewer IOPs available per terabyte), the impact of fast flash disks, and what this really means going forward. I am coming to the conclusion that we need two tiers of disk; I call them 'Too Good Performance' and 'Just Good Enough Performance'. You might want to consider them the performance tier and the capacity tier.
A high degree of intelligence and autonomic behaviour needs to be built into the 'array' (an array being a single array or perhaps a number of virtualised arrays), with chunks of data being moved around based on I/O characteristics. This might be at a LUN level or a sub-LUN level; perhaps it is almost time for us to stop talking about LUNs? You might even decide to move data based on observed behaviour: you know that a certain piece of data becomes hot between 2 and 3 in the morning, so you move it then and move it back once its hot period is over.
Not asking for too much am I?
Posted by: Martin G | July 17, 2008 at 05:02 AM
No, you're not asking for too much ... ;-)
Posted by: Chuck Hollis | July 17, 2008 at 10:06 AM
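For what it's worth, a minimal sketch of the kind of autonomic, sub-LUN mover Martin is describing -- promote chunks that run hot, demote them once their hot period is over -- might look something like this. All the names and thresholds here are hypothetical, purely for illustration:

```python
# Hypothetical sketch of an autonomic sub-LUN tiering loop.
# Chunks are tracked by recent I/O rate; hot chunks get promoted to the
# performance tier, and demoted again once their hot period is over.

from dataclasses import dataclass

PROMOTE_IOPS = 500    # per-chunk threshold to move up (arbitrary)
DEMOTE_IOPS = 50      # per-chunk threshold to move back down (arbitrary)

@dataclass
class Chunk:
    chunk_id: int
    tier: str           # "performance" or "capacity"
    recent_iops: float  # measured over the last sampling interval

def rebalance(chunks):
    """One pass of the mover: promote hot chunks, demote cold ones."""
    moves = []
    for c in chunks:
        if c.tier == "capacity" and c.recent_iops >= PROMOTE_IOPS:
            c.tier = "performance"
            moves.append((c.chunk_id, "promote"))
        elif c.tier == "performance" and c.recent_iops <= DEMOTE_IOPS:
            c.tier = "capacity"
            moves.append((c.chunk_id, "demote"))
    return moves

# Example: chunk 7 went hot (say, the 2-3 a.m. window Martin mentions)
chunks = [Chunk(7, "capacity", 1200.0), Chunk(8, "performance", 10.0)]
print(rebalance(chunks))   # [(7, 'promote'), (8, 'demote')]
```

The hard part, of course, isn't the loop -- it's knowing whether the observed heat will persist long enough to justify the cost of the move.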
Hi!
Chuck, I think your argument is flawed because it's based on four flawed assumptions:
1. Humans can always do better than software in a dynamic environment.
2. Storage virtualization cannot perform.
3. Orthodox architectures are the only right answer.
4. Automatic and Optimized Data Placement only adds value as a performance optimization.
You can check out a longer version of my rebuttal here:
http://blogs.netapp.com/extensible_netapp/2008/07/chuck-almost-ge.html
Posted by: kostadis roussos | July 19, 2008 at 08:51 PM
Hi K --
I saw your rebuttal, and I think you're still missing the mark. I see the hand waving and protestations; I don't see the facts.
Tell me, exactly, how a NetApp shared device would automagically detect a high I/O stream, such as an Oracle transaction log, and direct it to the appropriate device?
With the current state of storage technology, don't we need a human being to understand that situation ahead of time?
Sure, I'll give you and your architectural brethren the benefit of the doubt on data that you create, e.g. snaps et al. But that's just part of the picture, isn't it?
Now, play along with me for just a moment.
Let's assume that EFD is very, very expensive. And that its benefit is very very great.
Wouldn't you want a mechanism to direct the storage traffic that could benefit from this performance towards EFD, and traffic that doesn't need it away from it?
Let's face it -- you guys can't do this simple, yet fundamental trick at a LUN or other object level, can you?
Live by your architectural decisions, die by your architectural decisions, I'd offer.
Yes, you've got a hunk of flash sitting on your PCI bus. And you've waved your hands and mumbled some gobbledygook about caching metadata, or something like that. And one of your team has claimed Exchange might be 40% faster, but wasn't really sure.
Now, compare that with a PHYSICAL implementation where we know EXACTLY where the data is (on a flash drive) and we can be REALLY PRECISE as to the exact performance differential.
And, if one wants this extreme level of performance with VMware's VMFS (or another layer of virtualization, such as Invista), it's pretty straightforward to do this sort of alignment at either a device or pool level.
I just can't wait to see how NetApp and the rest of the spindle-randomizers are going to get out of this pickle.
And a strident blog post won't do it, folks.
Posted by: Chuck Hollis | July 20, 2008 at 12:21 PM
Hi!
Chuck, I think, you once again don't get it.
But at least I understand more of how little you understand of our system design.
Our systems do provide a way for a storage administrator to assign data to physical disks. What we do not require the administrator to do is map the physical disk block to a logical container.
So in your example, using a single FAS system you can choose to put your Oracle log files on an aggregate containing FC disks and your data files on an aggregate containing SATA disks.
How you can take what is a memory expansion unit (http://media.netapp.com/documents/netapp-performance-acceleration-module-datasheet.pdf)
and use that as proof that we do not provide a mechanism for storage architects to assign applications to classes of storage is a mystery to me.
kostadis
Posted by: kostadis roussos | July 20, 2008 at 05:29 PM
So, why aren't you guys supporting real enterprise flash drives?
Keep in mind, the "memory expansion" you reference was positioned as "NetApp Supports Flash Memory" in the press.
And how might you explain the NetApp reference in the article I referenced in the initial post?
Aggregates are a compromise, I'd offer.
You still don't get fine-grained control of what gets written where, do you?
And we've all seen the differences in performance between a partially empty aggregate and a full one.
Unfortunately, with a NetApp device, you just can't turn this off, can you? There's no way to expose native LUNs to applications -- period. There's always some software in the way, even if you don't want it.
And as more people will want greater control of their storage performance characteristics over time -- accelerated by interest in enterprise flash -- I'd consider this an architectural disadvantage.
Good luck trying to convince knowledgeable storage administrators that it's no big deal.
Posted by: Chuck Hollis | July 20, 2008 at 07:21 PM
Well, Jay Kidd finally came clean for NetApp --
http://blogs.netapp.com/jay/2008/07/flash-forward.html
Ever the skeptical one, I'll believe it when I see it!
Posted by: Chuck Hollis | July 22, 2008 at 03:56 PM
Chuck, it seems to me like you are missing what customers are looking for on a few points.
I don't think Storage Admins are looking for "more control" over their storage arrays. They are already overwhelmed with their pages of spreadsheets detailing which applications own which hyper. They also seem to be tired of fighting the tug-of-war between spreading applications across hypers from multiple RAID groups to get better performance and filling in those same RAID groups with other applications in order to keep utilization high. In large SAN environments you end up with poor performance, low utilization and a headache.

Instead, they seem to like the concept of pooling array resources, setting LUN/volume priorities based on their application needs, and letting the array manage the service levels so they can focus on their business, not on managing technology.
I also don't think your customers are looking at ways to buy the newest, fastest technology. Not in this economy. Those days are over. Instead of EFD, they are trying to buy more of the larger, cheaper, slower disks, then pool them and set priorities to get the performance they need without having to buy small isolated expensive EFDs.
I also think you may be a bit misguided in your comment on VMware. The single disk I/O queue on VMFS would bottleneck long before any disk contention on today's disk drives. NFS eliminates this single disk I/O queue and allows for greater throughput to the disk drives, but is sadly not natively available on those non-intelligent-layout arrays. Once you eliminate the single disk I/O queue and start using EFDs, the only way you can concentrate enough I/Os to require the performance of EFDs is through some data concentration mechanism (like de-duplication), which also is not natively available on those non-intelligent-layout arrays.
It seems like the NetApp strategy -- to have EFDs when the market is demanding them, and to already have the virtualization in place to make use of them when the time comes -- is better than the strategy of trying to generate market demand with massive $$$ investments and vague solutions that don't really require EFD technology when given any scrutiny.
Posted by: Mike | July 22, 2008 at 11:52 PM
Masterful misdirection! If you're working for NetApp's marketing department, keep up the good work -- very well written and put together.
I'd invite others to read Mike's "arguments" carefully, and see where the logic doesn't hold. I saw about a half dozen places -- what can you find?
Posted by: Chuck Hollis | July 25, 2008 at 01:16 AM
Chuck,
you shouldn't post your rants about a competitor if you clearly don't know even the most fundamental things about their systems. Using a single NetApp controller for both cheap ATA storage and high-end FC storage is normal for most NetApp customers. And of course they can decide where they want to put their data. Yes, they are working with 'virtualized' storage pools (called aggregates), but they have more than one. And in the future there will just be one more pool for the flash drives. What's the big deal? I don't need to decide on which SSD I want to put my data; all I tell the system is that I want it on my SSD storage pool. No idea why you think it would be hard to do that on a NetApp system; it is designed for this kind of thing.
Regards
Arne
Posted by: Arne Rohlfs | July 27, 2008 at 11:01 AM
I don't think you get my point on what it usually takes for fine-grained performance optimization.
Maybe you've never had to do this with a storage array. Lucky you.
Most storage architects would agree that any form of abstraction (including file systems) defeats the ability to precisely place data when you need to. Vendors tirelessly argue that their method is "better", but there's usually no way to turn it off if you need to.
The fact that storage objects are pooled, rather than subdivided, can be problematic when trying to parcel out small bits of high performance storage (e.g. flash) to the precise places where it's needed.
Must be nice to not have this problem in your environment.
Posted by: Chuck Hollis | July 27, 2008 at 04:27 PM
Just back from leave, and speaking as an end-user (i.e. customer): we are having to deal with more and more storage with the same amount of staff or less. We currently manage 250TB per FTE (2.5 petabytes of storage in total), and grew last year at 112%. I don't have time for a lot of fine-grained data placement; I don't have time for moving data between tiers manually; but I also don't have the budget to put everything on Tier 2 (let alone Tier 1). I need vendors to come up with better automagic optimizers which migrate hot spots in the background. I would like to move away from LUNs and simply allocate capacity (but NFS/CIFS is not the answer for this) and let the array/virtualisation controller handle it. No vendor has this at the moment, but the one that gets it, gets my attention.
Posted by: Martin G | August 06, 2008 at 04:46 AM
Amen!!!
So, quick question: are we talking mostly file data here? If so, there's a half-decent approach to some of this with file virtualization.
But -- on a broader note -- you're right. There ought to be a pool of different tiers, and the storage figures out where things need to go at different times.
Posted by: Chuck Hollis | August 06, 2008 at 07:36 AM
Chuck, I just saw this blog and enjoyed reading it. But I've got to join others in saying you're wrong here. Storage virtualization has handled multiple drive speeds well for years. When SSDs really take off it'll handle those fine too. (http://www.communities.hp.com/online/blogs/datastorage/archive/2008/08/07/data-placement-who-s-architecture-is-really-broken.aspx)
But the real issue is that virtualization in the arrays is going to be a must to keep complexity under control. With it, storage professionals have the foundation to avoid mundane tasks and focus on adding real value. Without it, you hold contests to try to get everyday tasks done in a reasonable time. You even award cars to somebody who can do it fast (EMC World Contest Link). But with the EVA everybody can do it fast! (EVA4400 Video Link). And that's going to be a growing need for years.
Always happy to continue the conversation.
Craig
Posted by: Craig Simpson | August 07, 2008 at 06:26 PM
OK, it's now official: 100% of EMC's storage competitors appear not to understand the underlying issue here, or want to pretend it doesn't exist.
That's fine with me.
As HP and other vendors start to attempt to put these devices into their existing products, they'll find out the hard way.
BTW, your comment does not strike me as a "conversation".
Posted by: Chuck Hollis | August 08, 2008 at 08:48 AM
Hi Chuck,
I think you should provide specific details of some configurations where your fine-grained control provides significant benefits. My feeling is that such instances are a vanishingly small percentage of the total, and that in general the management overhead required not only wastes a lot of time, but actually gets in the way of performance: many admins will not have the technical ability to truly improve performance with these 'features' (but they will think they do), and even if they did, as requirements change, many initial configs will not get modified in a timely fashion -- or at all -- to reflect changing requirements.
I get the impression you are grasping at straws trying to defend a storage architecture which is antiquated. Others have already pointed out that different service levels are possible for virtualized arrays (EVA and perhaps NetApp), which is the area where you appear to be trying to claim an advantage. The SSD argument you bring up seems a red herring; just because EMC has implemented something one way doesn't define in stone how it must be done.
I have managed various arrays, and EVA is by far the easiest and best to manage. Knowledge of what's going on internally is still important, and I think HP needs to do much more to make in depth information readily available, but it is extremely easy to get good balanced performance from the array because it tends naturally to use everything available to it maximally. Contrast this to an array where drives just sit idle for months waiting for a manager to assign them to a task. This is what happens in real life.
Posted by: yfeefy | August 18, 2008 at 02:04 PM
Really, now ... just because you've never had a problem with your specific use case doesn't mean that's true for everyone else, does it?
One thing I will admit -- there are a lot of people out there who aren't that challenged by data placement issues, and are comfortable with the array virtualizing all the data all the time.
These are the same people who probably won't be interested in things like enterprise flash drives, because -- well -- they don't have any particular storage performance issues.
Lucky them.
Still, I for one would want the ability to turn the darn stuff off if I felt I needed to for one reason or another.
But, if you're still not convinced, here's an invitation ...
Go into any larger enterprise shop. Go find the big hairy applications that run the business. You know, the ones where someone gets fired if it runs slow.
And I bet you won't find a single spindle-randomizing array -- from any vendor -- behind any of them.
Cheers!
Posted by: Chuck Hollis | August 18, 2008 at 05:05 PM
Hi Chuck,
Rather than an invitation, I'd rather see the details about how a sizable benefit is achieved with manual "fine-grained placement". Apart from outer-edge placement -- which could be done automatically or manually with virtual arrays, although doing it manually, e.g. on a per-LUN basis, is not implemented on the EVA I'm most familiar with (why not, HP?) -- I suspect substantially the same can be done with a virtual array, albeit with different techniques. That goes for the utilization of SSDs as well. It's a different philosophy, but the sysadmin still has a lot of control.
I understand the argument about large firms with large apps running large high-end arrays (which are not virtual arrays), and I don't believe it's relevant to this discussion. The main reasons for this are the size and speed of the hardware (and the commensurate price tag), which have nothing to do with fine-grained placement, and, well, politics. Just because they've thrown a lot of money at the problem doesn't mean you won't find plenty of suboptimal configurations here.
Posted by: yfeefy | August 19, 2008 at 05:40 PM
We're a fairly big organisation, and we have a "spindle-randomizing" FAS and a more traditional HP EVA. You're right: our critical apps are on the traditional SAN.
However, we wish it wasn't so. The fact is the FAS is newer technology and, as I'm sure you know, the most critical apps are the last to be moved to new technology, especially in larger organisations.
I guess to use traditional SANs you either invest heavily in training, or buy in consultancy. On the other hand I've had no trouble getting what I need from the FAS, with no formal training whatsoever.
Posted by: Chris | September 03, 2008 at 09:39 AM
Imagine a virtualized array able to manage its address space in fairly fine-grained chunks (chunk = several MB?), and transparently move frequently accessed chunks to the EFD and infrequently accessed chunks to the slower disks. In real-world applications, would these chunks stay hot enough, long enough, to warrant the movement to the EFD? Or would the array constantly be chasing its tail? Said another way, are we talking about EFD residencies for these chunks measured in minutes, hours, days? I realize it's all application- and data-demographic dependent, but what are some of the more typical scenarios?

And if it's all automatically done by the array, then I suppose it's possible that application 2 all of a sudden gets hot and its data forces application 1 out of the EFD -- but what if the service level for application 1 was the most critical? How would the array know? So it would seem that doing placement manually has some advantages when you are trying to control service levels for specific workloads. Doing it automatically is good if the overall goal is just to maximize overall performance, but not any specific workload. The best of both worlds would be automation with knowledge of the workload service level agreements.
Posted by: Peloton | September 05, 2008 at 03:39 AM
Peloton -- you're thinking in the right direction, I believe.
The trick is in the effectiveness of the algorithms against real-world usage patterns.
If you think about today's DMX (large, non-volatile cache) and its algorithms, it's a reasonable predecessor of what might be needed for what you describe.
The good news? We've got 15+ years experience understanding different workloads, and the algorithmic effectiveness of putting things in cache vs. spinning disk.
Not that there isn't an enormous amount of work still to be done for this use case, but at least there's some work to build on ...
Posted by: Chuck Hollis | September 05, 2008 at 07:55 AM
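To make Peloton's "best of both worlds" point concrete, here's a toy sketch of automatic chunk placement that also respects per-application service levels, so a suddenly-hot low-priority workload can't simply evict a critical one from flash. Everything here -- names, scoring, thresholds -- is a hypothetical illustration, not a description of any shipping array:

```python
# Toy sketch: automatic tiering constrained by service-level priority.
# Candidates for the limited flash tier are ranked by a score that combines
# observed heat with the application's service level, so application 2
# going hot doesn't automatically push a more critical application 1 out.

FLASH_CAPACITY_CHUNKS = 4   # chunks that fit in the flash tier (tiny, for illustration)

# (chunk_id, application, recent_iops, sla_weight) -- higher weight = more critical
chunks = [
    ("c1", "app1-oltp",    900,  3.0),
    ("c2", "app1-oltp",    700,  3.0),
    ("c3", "app2-reports", 1500, 1.0),
    ("c4", "app2-reports", 1400, 1.0),
    ("c5", "app3-archive", 200,  0.5),
]

def flash_residents(chunks, capacity):
    """Pick which chunks live on flash: observed heat weighted by SLA priority."""
    ranked = sorted(chunks, key=lambda c: c[2] * c[3], reverse=True)
    return [c[0] for c in ranked[:capacity]]

print(flash_residents(chunks, FLASH_CAPACITY_CHUNKS))
# ['c1', 'c2', 'c3', 'c4'] -- app1's chunks stay resident despite lower raw IOPs,
# because their SLA weight boosts their score.
```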