Of Scalable File Systems and HPC ...
One of the more interesting new discussions in our industry is around HPC (high performance computing) environments, and the inevitable question that follows: how will you get to your information?
Lots of vendors piling in. It's getting pretty noisy. I wrote a bit earlier on NAS evolution, but this is a more focused discussion.
What most people don't know is that EMC is pretty active in this space, and we've got more than a few showcase examples of how our thinking (and our products) apply.
So, What's This All About?
HPC has come a long way recently. More and more interesting business problems can be solved with a boatload of compute power -- the list gets longer and longer.
And scale-out commodity server technologies have gotten so much better in terms of performance, cost and scalability.
It's not just exotic applications anymore, it's getting much more mainstream.
But every compute challenge is an information challenge just waiting to happen, and HPC is no different.
Information Matters, Even For HPC
Right now, there seem to be three different architectural approaches for building enormous, high-performance filesystems for HPC environments.
One approach says put the file system on the compute elements themselves. Here we find Lustre (owned by Sun), ZFS (ownership a bit in dispute right now) and Ibrix -- maybe a few others that I forgot.
There seem to be a few hardy souls trying this in the storage nodes themselves, i.e. making the file serving elements appear as one by peering them. Isilon and the NetApp ONTAP GX seem to roughly fit here, maybe a few others as well.
And there are a few of us who want to do this in the network itself -- notably Panasas and EMC's Celerra MPFS.
By the way, MPFS stands for Multi-Path File System.
Understanding MPFS takes just a moment, and it's easier with pictures, but I'm too lazy to dig one up, so let me try this with plain words.
Imagine a standard set of arrays -- maybe CLARiiONs.
Imagine a Celerra NAS server off to the side, maybe a small one, maybe a big one.
And imagine that each compute server can access information two ways: everything looks like a file system, but when an application opens something, it switches to very efficient block-level access, using Ethernet, FC or in some cases a combination of InfiniBand and FC.
The Celerra head is nothing more than a metadata server. It doesn't get in the way of the data path, which is nice. And when we put this in customer environments, they're usually delighted.
Delighted with the performance. Delighted with the manageability. Delighted with the cost. And delighted with the availability.
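To make the split-path idea concrete, here's a minimal sketch in Python. It is not EMC code and not the real MPFS protocol: the names MetadataServer, BlockArray and SplitPathClient are invented for illustration, and the "array" is just an in-memory dictionary. The only point is to show the shape of the design, where the client asks the metadata server where a file lives and then pulls the blocks straight from storage.

```python
from dataclasses import dataclass, field


@dataclass
class BlockArray:
    """Hypothetical stand-in for a shared array every compute node can reach."""
    blocks: dict = field(default_factory=dict)  # block id -> bytes

    def read_block(self, block_id: int) -> bytes:
        return self.blocks[block_id]


@dataclass
class MetadataServer:
    """Hypothetical stand-in for the NAS head: it maps files, it moves no data."""
    file_maps: dict = field(default_factory=dict)  # path -> (block ids, length)
    map_requests: int = 0

    def get_block_map(self, path: str):
        self.map_requests += 1
        return self.file_maps[path]


class SplitPathClient:
    """Client that separates the control path from the data path."""

    def __init__(self, mds: MetadataServer, array: BlockArray):
        self.mds = mds
        self.array = array

    def read_file(self, path: str) -> bytes:
        # Control path: one small request to the metadata server.
        block_ids, length = self.mds.get_block_map(path)
        # Data path: blocks come directly from the array, bypassing the server.
        data = b"".join(self.array.read_block(b) for b in block_ids)
        return data[:length]  # trim padding in the last block


def demo():
    array = BlockArray(blocks={7: b"HPC ", 3: b"data", 9: b"!\x00\x00\x00"})
    mds = MetadataServer(file_maps={"/results/run1": ([7, 3, 9], 9)})
    client = SplitPathClient(mds, array)
    return client.read_file("/results/run1"), mds.map_requests
```

In this toy version the metadata server handles exactly one request per file read, no matter how many blocks the file spans, which is the property that keeps it out of the data path.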
But it's early days in this marketplace, and it's pretty clear that everyone is starting to discover the pros and cons of each.
It's About Big And Fast, Isn't It?
Yes, and no. We've been in a lot of face-off benchmarks against the other guys. Sometimes we win, sometimes they win. A few of the vendors never seem to win (usually the do-it-in-the-storage guys), but that's another story.
I'll simplify the discussion by saying that -- sooner or later -- everyone's approach can be made to run real fast. And we'll probably see more benchmarketing in this arena.
Everybody thinks getting real big is important as well. 32, 64, 128 TB and more. Not clear if anyone actually needs that, but I'm sure it's important in people's thinking.
But the interesting discussion, at least from what I can tell, has more to do with cost and availability.
Performance is very interesting, but people don't have unlimited budgets.
And, not to state the obvious, the performance of a system where you can't get to the information is exactly zero.
The Cost Equation
Since we're talking lots of compute nodes here, that's an obvious lever for cost optimization. Ethernet is cheaper than FC.
And, interestingly enough, IB can be cheaper than 10Gb Ethernet -- if you've already bought an InfiniBand Host Channel Adapter (HCA) to do your server-to-server clustering.
Why? The incremental cost for the external attachment is exactly zero. You've already got a fast IB pipe that you're not even beginning to use. Now, you'll have to have an adventurous spirit when it comes to things like management and whatnot, but it's an interesting option nevertheless.
There's another lever here that has to be examined as well, and that's client software. The compute-side file system guys usually want a nice hunk of complex code on every compute node. There are the visible costs of that approach, and the harder-to-get-to costs in terms of memory utilization, CPU utilization, admin and support costs, etc.
I would argue that these guys are not advantaged for that part of the discussion.
MPFS currently uses a modified NFS client that EMC sells and supports to do the split-pathing. We're very excited about pNFS (parallel NFS), which is part of NFSv4.1, even to the extent that we're funding client development for Linux, Windows and a few other environments, so we can get out of proprietary clients.
Certainly the footprint is less with this kind of approach. And having a supported client come free (or near-free) as part of the operating system is nice.
I'd also want to point out that the administration and operational side should be much easier using an asymmetric metadata server as we do with MPFS. There's one point of control for the whole shebang, rather than hundreds (or thousands).
And Then There's The Availability Discussion ...
Someone told me once that a Ferrari isn't very fast when it's sitting in the shop getting fixed. Neither is an uber-expensive HPC environment that can't access data.
And some of these environments are really, really big from a storage perspective.
We had a rather sophisticated customer once observe that sooner or later, everything breaks. The question in his mind was: what then? Of course, he wanted the usual no-single-point-of-failure architecture, as well as decent data protection, but there was more.
One of the things he liked about the MPFS asymmetric approach is that the most likely failure mode was to fall back to plain NAS rather than SAN access -- it would run, but more slowly. He shared a few experiences where the failure mode for the client-side file systems was a bit more ugly. And he wouldn't even think about the clustered storage nodes.
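That "degrade, don't die" behavior is easy to sketch. The Python below is an illustration only, with made-up names (DirectPathError, read_with_fallback) and no relation to how the actual MPFS client is written: try the fast direct path first, and if it's unavailable, serve the same read through the plain NAS path and report that you're running degraded.

```python
class DirectPathError(Exception):
    """Raised when the hypothetical direct block path is unavailable."""


def read_with_fallback(path, read_direct, read_via_nas):
    """Try the direct path first; fall back to the NAS path if it fails.

    Returns (data, path_used) so callers can tell when they are degraded.
    """
    try:
        return read_direct(path), "direct"
    except DirectPathError:
        # Slower, but the application keeps running.
        return read_via_nas(path), "nas"


def demo_fallback():
    files = {"/results/run1": b"HPC data!"}

    def healthy_direct(path):
        return files[path]

    def broken_direct(path):
        raise DirectPathError("SAN link down")

    def via_nas(path):
        return files[path]

    return (
        read_with_fallback("/results/run1", healthy_direct, via_nas),
        read_with_fallback("/results/run1", broken_direct, via_nas),
    )
```

Returning which path was used is a deliberate choice in this sketch: a slow-but-working system is only a good failure mode if someone notices and fixes the fast path.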
Creating huge file systems was another concern. Imagine a 32 or 64 or 128 TB file system. Now, imagine recovering it.
Wouldn't it be a better idea to have a large logical filesystem built from smaller filesystems, each of which could be individually backed up or recovered? More than a few people have made that observation, and, once again, it's something you'd find in MPFS.
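Here's a rough sketch of that idea in Python, assuming a simple prefix-based namespace. The LogicalFileSystem class and its methods are hypothetical and say nothing about how MPFS or Celerra actually implement it; the members are just in-memory dictionaries. What it shows is that losing and restoring one member doesn't touch the others.

```python
class LogicalFileSystem:
    """One namespace stitched together from smaller member file systems."""

    def __init__(self):
        self.members = {}  # mount prefix -> {relative path: bytes}

    def add_member(self, prefix, contents):
        self.members[prefix] = dict(contents)

    def _locate(self, path):
        # Longest matching prefix wins, so nested mounts resolve correctly.
        matches = [p for p in self.members if path.startswith(p + "/")]
        if not matches:
            raise FileNotFoundError(path)
        prefix = max(matches, key=len)
        return prefix, path[len(prefix) + 1:]

    def read(self, path):
        prefix, rel = self._locate(path)
        return self.members[prefix][rel]

    def backup_member(self, prefix):
        # Copy only one member, not the whole namespace.
        return dict(self.members[prefix])

    def restore_member(self, prefix, snapshot):
        self.members[prefix] = dict(snapshot)


def demo_recovery():
    fs = LogicalFileSystem()
    fs.add_member("/proj/a", {"run1": b"aaa"})
    fs.add_member("/proj/b", {"run1": b"bbb"})
    snapshot = fs.backup_member("/proj/a")
    fs.members["/proj/a"].clear()        # simulate losing one member
    still_ok = fs.read("/proj/b/run1")   # the other member is untouched
    fs.restore_member("/proj/a", snapshot)
    return still_ok, fs.read("/proj/a/run1")
```

The recovery unit here is one member, so the time to restore scales with the size of the piece that broke rather than with the size of the whole namespace.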
So, There's More To The Discussion
For example, how do tiering and archiving fit in? Does data dedupe make sense for these environments? What's the ideal end-state for server/storage connectivity? Is unmanaged IB sufficient, or will things like FCoE be more appealing?
Too soon to tell.
But it's an interesting discussion, to be sure.
