If you're surfing our little corner of the blogosphere, you might have noticed that a spontaneous debate has broken out about the idea of "web 2.0 storage": what it is, what it isn't, and so on.
One catalyst seems to be the availability of all sorts of free (or low cost) over-the-network storage schemes from Google, Amazon and others. Another catalyst was IBM trying to position its recent acquisition of XIV as a way to revive its storage business.
I'm sure there's more that's been written around this, but here's the point: I think this "web 2.0 storage" discussion, as currently framed, is a big head-fake.
And, while I have no problem with competitors running off on a wild goose chase, I think there's an important cadre of IT thinkers that will want to focus on what's going to be important here.
A Bit Of Background
If you're interested in the back story, please take a moment to see what I've written before about "information clouds" and what makes this type of information ("active content") very different than other forms of information.
Once again, I'd like to borrow heavily from Clive Bearman (a fellow EMCer) and outline EMC's thoughts around "cloud storage", which is turning out to be a much more powerful and useful set of concepts than what's out there today.
And it ain't "web 2.0 storage", folks. At least, not how it's being described by others.
What It's Not
Some people seem to associate this webby, cloudy stuff with some of the more popular offerings from folks such as Amazon and Google, i.e. web 2.0 storage is simply cost-effective storage over the web.
Nothing wrong with that, but it's not particularly interesting. All you need is some decently cheap storage, a nice internet connection, and some sort of SLA from your service provider that it's going to be there when you need it.
Not to mention a business model that works ;-)
I think we'll see a lot of this sort of thing, but -- frankly -- it's not very interesting from a technology perspective. Nor will it be particularly useful to most customers, I believe.
What It Is
Think for a moment about what YouTube might need: millions of video streams, distributed around the globe, to millions of users, on a variety of devices.
When something gets unpredictably popular, it needs a very high service level. But most of the stuff is in the "long tail" of infrequently accessed, but nice to have, content.
Or, let's say you're a global wireless carrier looking at delivering video or audio streams the same way. Or you're a large-scale information publisher in one of the verticals. Or maybe you've got an interesting government application, of which there are very many.
You think about the problem very, very differently.
And it's not stuff that traditional IT organizations usually think about.
So, I'm going to leave the "web 2.0 storage" discussion to others (have fun), and spend my time on something a little more interesting and challenging -- cloud storage.
Cloud Storage Is Massive
Very massive. We're routinely encountering new requirements where terms like "gigabyte" and "terabyte" are not useful; the discussion starts at "many petabytes" and goes up from there.
We tend to think of all this stuff sitting in a data center somewhere, but for this model, it just doesn't work. Nobody can afford a single data center that's large enough to put all this stuff into (no, not even Google). More importantly, no one can afford the network pipes that'll be needed in a single place to feed everything into, or out of.
No, what you'll need is the ability to place these devices in locations around the world, and have them operate as a single entity: a single global name space, and -- more importantly -- the ability to ingest content from anywhere, and move content to popular places depending on traffic and interest.
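As a rough illustration of the "single entity" idea, the sketch below models a global namespace that accepts content at any site and copies an object to a site once demand from there crosses a threshold. Everything in it (the `GlobalNamespace` class, the site names, the `hot_threshold` parameter) is a hypothetical stand-in for illustration, not a description of any real product.

```python
from collections import defaultdict


class GlobalNamespace:
    """Toy model: one namespace spanning many sites (all names hypothetical)."""

    def __init__(self, sites, hot_threshold=3):
        self.sites = set(sites)
        self.hot_threshold = hot_threshold   # reads from a site before we copy there
        self.replicas = defaultdict(set)     # object key -> sites holding a copy
        self.reads = defaultdict(int)        # (key, site) -> read count

    def ingest(self, key, site):
        """Accept content at whichever site the writer happens to be near."""
        if site not in self.sites:
            raise ValueError(f"unknown site: {site}")
        self.replicas[key].add(site)

    def read(self, key, site):
        """Serve a read and return the site that served it."""
        if site not in self.sites:
            raise ValueError(f"unknown site: {site}")
        if key not in self.replicas:
            raise KeyError(key)
        holders = self.replicas[key]
        self.reads[(key, site)] += 1
        if site in holders:
            return site
        # Assumption: any holder can serve the read; pick one deterministically.
        served_from = sorted(holders)[0]
        if self.reads[(key, site)] >= self.hot_threshold:
            holders.add(site)  # content has become popular here, so place a copy
        return served_from


ns = GlobalNamespace(["tokyo", "london", "boston"])
ns.ingest("video-123", "boston")
for _ in range(3):
    ns.read("video-123", "tokyo")
print(sorted(ns.replicas["video-123"]))  # ['boston', 'tokyo']
```

A real system would also have to evict cold copies and weigh network cost, which this toy ignores.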
Presenting storage as blocks (e.g. LUNs) won't scale. Presenting storage as files won't scale. You'll need an object-oriented approach with rich semantics -- nothing else will work at this uber-massive scale.
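To make the contrast a bit more concrete, here is a minimal sketch of what "objects with rich semantics" could look like: each item carries its own metadata, and the store can answer questions about content that a block address or a file path cannot express. The `StoredObject` and `ObjectStore` names and the metadata fields are invented for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)  # e.g. content type, retention hints


class ObjectStore:
    """Toy object store: flat ID space plus queryable metadata (hypothetical API)."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        # Assumption: the ID is derived from the content itself,
        # so no directory tree or LUN offset is involved.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = StoredObject(data, dict(metadata))
        return object_id

    def get(self, object_id):
        return self._objects[object_id].data

    def find(self, **criteria):
        """Return IDs of objects whose metadata matches every given key/value pair."""
        return [
            oid for oid, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = ObjectStore()
clip = store.put(b"video bytes", content_type="video", retain_days=30)
store.put(b"sensor reading", content_type="telemetry")
print(store.find(content_type="video") == [clip])  # True
```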
It goes without saying that costs matter, but in a very different way. Take any small cost (hardware, software, energy, administration, etc.), multiply it by a very large number, and you have a very large cost.
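A quick back-of-the-envelope calculation shows the effect. The figures below are made-up placeholders chosen only to keep the arithmetic easy to follow, not real prices.

```python
# Hypothetical numbers, for illustration only.
GB_PER_PB = 1_000_000            # decimal units: 1 PB = 1,000,000 GB

capacity_pb = 10                 # assumed size of the deployment
extra_cost_per_gb_month = 0.01   # assumed "small" cost: one cent per GB per month

monthly = capacity_pb * GB_PER_PB * extra_cost_per_gb_month
yearly = monthly * 12

print(f"${monthly:,.0f} per month, ${yearly:,.0f} per year")
# $100,000 per month, $1,200,000 per year
```

One extra cent per gigabyte is invisible on a departmental array; at ten petabytes it is a seven-figure annual line item.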
Cloud Storage Is Autonomous
If you can imagine many petabytes with billions of objects in hundreds of locations and millions of users, you can see that management becomes an entirely different proposition.
The environment must be self-tuning, and automatically react to surges in demand. It must be self-healing and self-correcting at a massive scale -- like the internet, no single scenario of failures can bring it down.
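One way to picture "self-healing" is a reconciliation loop that compares the copies that should exist with the copies that are actually reachable, and repairs the difference without anyone typing a command. The sketch below is a simplified assumption of how such a loop might look; `heal`, `min_copies`, and the site names are hypothetical.

```python
def heal(replicas, healthy_sites, min_copies=2):
    """Return a repaired replica map plus the list of actions taken.

    replicas: dict mapping object key -> set of sites holding a copy
    healthy_sites: set of sites currently reachable
    """
    repaired, actions = {}, []
    for key, sites in replicas.items():
        alive = sites & healthy_sites
        if not alive:
            # No surviving copy: nothing to copy from, so flag it rather than guess.
            actions.append(("lost", key, None))
            repaired[key] = set()
            continue
        # Candidate targets: healthy sites that do not yet hold the object.
        candidates = sorted(healthy_sites - alive)
        source = sorted(alive)[0]
        while len(alive) < min_copies and candidates:
            target = candidates.pop(0)
            actions.append(("copy", key, f"{source}->{target}"))
            alive = alive | {target}
        repaired[key] = alive
    return repaired, actions


replicas = {"video-123": {"boston", "tokyo"}, "clip-9": {"london"}}
repaired, actions = heal(replicas, healthy_sites={"boston", "london", "paris"})
print(actions)
# [('copy', 'video-123', 'boston->london'), ('copy', 'clip-9', 'london->boston')]
```

In practice a loop like this would run continuously and at many sites at once; the point is only that the repair decision is made by software, not by an operator watching a console.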
The idea of a bunch of administrators sitting glued to multiple consoles, watching indicators and firing off commands -- well, that just won't work here. Not only is it hard for people to react fast enough, but no one can afford that much human capital to keep things running smoothly.
Cloud Storage Is Universal
We might keep thinking "browser access", but that's only one of many potential models for global information ingestion and access.
What about my set-top box? A mobile iTunes device? Ingestion of sensor data? RFID? VOIP phones? Security cameras? Or maybe satellites?
Thinking only in browser-oriented terms is way too limiting, I think. Give yourself some time and fully ponder all the different ways information could be gathered and distributed on a massive, global scale, and you'll start to realize the sheer size of the appeal.
Cloud Storage Is Infrastructure
What do I mean by that? Infrastructure is a platform on which to build other, more useful things. It takes care of all the hard stuff, so that its users can focus on the interesting and useful stuff. It's dependable. It's available. It's got to be delivered as a commercially-available, carrier-class product supported by a vendor. Just like power, and phones, and ... well, you get the idea.
High school science projects need not apply.
The Bottom Line
Maybe I'm the one who's wrong. Maybe "web 2.0 storage" as postulated by others will be where all the action is in the future.
Or, maybe, we're seeing one of those head-fakes that routinely happen in this industry, where IT vendors head off in one direction while emerging customer requirements head off in another.
Cloud storage.
We'll see, won't we?
All good points, but it's useful to ground this discussion in the definition of Web 2.0, as originally laid out in Tim O'Reilly's article at http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html.
Of all the elements listed in the "Web as Platform", the ones which apply to the storage cloud, either strongly or weakly, are:
1. Granular addressability of content
2. Emergent user behavior not predetermined
3. The right to remix
4. Data as the "Intel inside"
If the data is restricted to a single corporation or home user backing up data to the cloud, I'd say that items 3 and 4 don't apply either.
Therefore, it's really inappropriate to use the moniker "Web 2.0" with these applications.
Posted by: Peter Quirk | January 12, 2008 at 06:28 PM
Hi
I really liked your article and wondered what solutions you have seen, or are seeing, being built to deliver petabytes of data in a self-tuning infrastructure.
Mostly what we see is storage companies selling ever-larger boxes with some software to assist ingestion to box A, B, or C according to a given frequency, data type, or location.
What you are describing is ingestion from all forms of data devices and flexibly storing it for access back to those devices or others.
How do you envision the tools and architecture to deliver to that need?
Really really interested...
Posted by: Web2what | April 24, 2008 at 06:43 AM