Well, the story of data deduplication in our little industry just expanded a bit, especially in the context of this week's announcement at EMC World.
By now, I think that most people are coming to a roughly similar set of beliefs around data deduplication, or 3D as it's now being abbreviated.
The more things change, the more they stay the same.
The Idea
In case you missed the back story, 3D is a simple but powerful idea: spot the inherent redundancies within data objects (e.g. files), and you can potentially save on storage.
The usual caveats associated with any magical technology apply: the cost of finding these redundancies may outweigh the benefits, many of your data sets may not have enough redundancy to be worth the trouble, be careful how you implement it or you'll end up with something difficult to manage, and so on.
And, let's face it: with ballooning storage requirements, the idea of squeezing the excess fat out of ginormous data sets is just too good to pass up without taking a good, hard look at what it might be able to do.
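To make the idea concrete, here's a minimal sketch -- my own toy illustration, not how any particular product implements it -- of fixed-size chunk deduplication: fingerprint each chunk, store each unique chunk exactly once, and keep a "recipe" of references that can rebuild the original.

```python
import hashlib

def dedupe(stream: bytes, chunk_size: int = 4096):
    """Toy fixed-size-chunk dedupe: store each unique chunk only once."""
    store = {}    # fingerprint -> chunk bytes, kept only for the first occurrence
    recipe = []   # ordered fingerprints needed to rebuild the original stream
    for i in range(0, len(stream), chunk_size):
        chunk = stream[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)
        recipe.append(fp)
    return store, recipe

def rebuild(store, recipe) -> bytes:
    """Reassemble the original stream from the chunk store and the recipe."""
    return b"".join(store[fp] for fp in recipe)

data = b"A" * 8192 + b"B" * 4096 + b"A" * 8192      # plenty of repetition
store, recipe = dedupe(data)
assert rebuild(store, recipe) == data
print(len(data), "logical bytes stored as",
      sum(len(c) for c in store.values()), "unique bytes")   # 20480 -> 8192
```

Real products are far more sophisticated -- variable-size chunking, global indexes, careful handling of hash collisions -- but the space savings all come from that same basic trick.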
Backup Is The Logical Starting Point
Backup data sets usually have all sorts of redundancies in them. Do multiple full backups and you'll be copying much of the same data each time. Do an incremental backup and you'll get changed files, but still get a lot of data blocks you already have.
Hence the strong interest in using 3D for backups ... it's a sweet spot, especially if you're looking to help bridge the cost gap between backup-to-disk and backup-to-tape.
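To see why backup is the sweet spot, here's a back-of-the-envelope calculation; the retention count and change rate below are made-up numbers purely for illustration.

```python
# Illustrative numbers only -- not measured results.
fulls = 20            # weekly full backups retained
capacity_tb = 1.0     # size of the data set being protected
change_rate = 0.02    # fraction of data assumed to change between fulls

raw_tb = fulls * capacity_tb                                        # what plain disk or tape holds
deduped_tb = capacity_tb + (fulls - 1) * capacity_tb * change_rate  # first full + changed blocks only
print(f"raw: {raw_tb:.0f} TB  deduped: {deduped_tb:.2f} TB  "
      f"ratio: {raw_tb / deduped_tb:.1f}:1")
# -> raw: 20 TB  deduped: 1.38 TB  ratio: 14.5:1
```

Your actual ratio depends entirely on your retention policy and change rate -- a theme we'll come back to below.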
Consensus Belief #1 -- Many Different Approaches
When a given technology is relatively new, there's a certain set of people who try to figure out the "best" approach or product. There's a belief (desire?) to pick the "winner".
Unfortunately, as the product category fills out with more options, these people despair as it becomes clear that no single offering is best for all use cases.
To draw from past examples, I believe there is no single best remote replication approach, nor backup approach, nor storage management approach, nor storage networking approach.
It's not because vendors are being ornery; it's because there's enough variability in the use cases that the market will support all these different approaches. And, if you're going to be an industry leader, you're going to have to invest in multiple ways of doing essentially the same thing.
I think data deduplication is going through this same sort of "option expansion". And anyone who wants to be a serious player is going to have to offer many different 3D options to support all the major use cases.
As of this announcement, EMC offers 4 or 5 different 3D approaches, depending on how you count. And, as I've stated before, I think it's best to think of 3D as a feature, and not a product category.
Some examples:
Source deduplication: Avamar does global deduplication at the source, prior to data being sent over the wire. As such, it does really well with things like VMware and remote file systems -- anywhere the network might be an issue.
Target deduplication (in-line): the new DL D3D products announced this week do dedupe "on the fly" at the target. Great for working with existing backup applications where you've got decent LAN bandwidth -- say, backing up file servers in a data center -- and a longer backup window to work with.
Target deduplication (post processing): these same products can be teamed up with our existing disk libraries to offer after-the-fact data reduction. Great for when you've got a very tight backup window to work with, but still want to squeeze the excess out of your data streams. I'm thinking things like demanding SAP and Oracle instances.
Single-instance archiving (pre-backup): just to be complete, I usually include EMC's single instancing capabilities (things like EmailXtender and Centera) that take data out of the backup stream entirely, and do data reduction on the archive, simply by spotting exact copies of objects.
Although single instancing may not seem as "good" as intra-object dedupe (redundancies within a file), it does a great job of reducing data sprawl on archives, where newer forms of dedupe aren't attractive because either (a) the information is already stored in compressed and/or encrypted form (for example, images) and finding candidate byte patterns is unlikely, or (b) compliance regs make people uncomfortable with the idea of intermixing portions of data objects in archives. (There's a sketch contrasting the two approaches after this list.)
There's also far less of a performance penalty for ingesting new data. As such, I continue to see single-instancing playing an important role in the overall dedupe picture.
And, if I was making a complete list of EMC's data dedupe capabilities, I'd probably include RecoverPoint, which does an interesting form of data deduplication prior to replicating remotely.
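Here's the sketch I promised above, contrasting whole-object single instancing with sub-file (chunk-level) dedupe. The mailbox, file names and attachment data are all invented purely for illustration.

```python
import hashlib

def single_instance_bytes(objects: dict) -> int:
    """Whole-object single instancing: store each distinct object exactly once."""
    unique = {hashlib.sha256(body).hexdigest(): body for body in objects.values()}
    return sum(len(body) for body in unique.values())

def chunk_dedupe_bytes(objects: dict, chunk_size: int = 4096) -> int:
    """Intra-object dedupe: hash fixed-size chunks across all objects,
    so even partially overlapping files end up sharing storage."""
    unique = {}
    for body in objects.values():
        for i in range(0, len(body), chunk_size):
            chunk = body[i:i + chunk_size]
            unique.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return sum(len(c) for c in unique.values())

attachment = b"Q3 forecast " * 1000
mailbox = {
    "alice.msg": attachment,
    "bob.msg": attachment,                # exact copy: single instancing catches it
    "carol.msg": attachment + b" edited", # near-copy: only chunk-level dedupe helps
}
print("single-instance bytes:", single_instance_bytes(mailbox))
print("chunk-dedupe bytes:   ", chunk_dedupe_bytes(mailbox))
```

Exact copies are caught either way; the near-copy is where chunk-level dedupe earns its keep -- and where it also pays the extra hashing and index work at ingest that I mentioned above.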
Consensus Belief #2 -- Your Mileage May Vary
By now, enough 3D products have been exposed to enough real-world data sets to realize that data deduplication has its limitations.
Data sets with high change rates don't see much 3D benefit on backups, for example. And data that's already compressed or encrypted -- hence effectively randomized -- may defeat whatever clever algorithm the 3D product uses.
Now, that being said, there are still many places where the value is there -- it just makes sense to try it out on your data before getting too far in the process.
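As a toy illustration of why mileage varies, here's a quick comparison (my own made-up example) of how much duplication a chunk-level approach can find in a repetitive text stream versus a random stream standing in for data that's already compressed or encrypted.

```python
import hashlib, os

def unique_chunk_fraction(data: bytes, size: int = 4096) -> float:
    """Fraction of chunks that are unique: 1.0 means nothing to deduplicate."""
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return len({hashlib.sha256(c).hexdigest() for c in chunks}) / len(chunks)

repetitive = b"weekly status report\n" * 50000    # dedupe-friendly
randomized = os.urandom(len(repetitive))          # stand-in for compressed/encrypted data

print("repetitive text:", unique_chunk_fraction(repetitive))   # well below 1.0
print("random bytes:   ", unique_chunk_fraction(randomized))   # 1.0 -- nothing to find
```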
Consensus Belief #3 -- No Free Lunch On Performance
Let's face it -- finding all those redundant data snippets and replacing them with pointers and/or stubs takes real CPU power. And that work has to be done somewhere, by something, at some time.
As an example, Avamar uses the "free" cycles of the server to do its pattern matching magic, and sends the resulting reduced stream over the network, meaning that the target storage device has to do little to no work to store it.
Conversely, devices that attempt to do in-line processing of data streams are essentially limited by processor bandwidth.
I noted that in a recent announcement, a 3D vendor was bragging about a new performance level. I compared that to what a native (e.g. non-dedupe) disk library could do, and the difference was roughly that of an automobile compared to a jet airliner.
Driving is fine for shorter trips; but for longer trips, most people prefer flying.
And then there's post-processing, which I think will turn out to be more popular than people think. It gives you the raw speed of disk libraries (usually measured in GB per sec, not MB), but then can selectively reclaim space.
Personally, I think we'll see a lot of these "bimodal" deployments: one function focused on blazing speed for backup and recovery, and a second engine running in the background, scanning saved data sets and looking for redundancy opportunities. Or, if you've got the time and performance isn't an issue, go straight to dedupe -- assuming your data sets are compressible.
As far as 3D technologies for primary storage, I think that'll be a longer haul. If performance didn't matter, it wouldn't be a problem -- but it's hard for me to come up with use cases where the concepts of "primary storage" coincide with "performance doesn't matter".
But I'm sure at least one vendor will try and split this hair.
Consensus Belief #4 -- Vendor Sparring Has Just Begun
I noted with a wry smile the jousting between EMC and other bloggers about who's got the best product, etc. Scott Waterhouse, in particular, seems willing to take on all comers.
[warning: be careful before you decide to go a few rounds with him -- he knows his stuff pretty well]
But if I put all the jousting aside, and look a bit deeper, I tend to think EMC has some structural advantages in this arena.
For example:
- Most 3D backup targets need a storage array. We happen to make very good ones (CLARiiONs) that are well suited to this particular purpose in terms of performance, availability, expandability, functionality and so on.
- EMC has more than one approach to a customer's backup problem, which means we can bring a very wide range of competing approaches to bear on a particular situation.
- EMC has a slew of pre-backup archiving capabilities that move important data out of the stream prior to backup. Nothing like dramatically shrinking the problem before you tackle it ...
- EMC also has a non-trivial number of backup and recovery experts who can assess a situation, work with what you have, propose different approaches, and -- of course -- get it all installed and working to your satisfaction. We've even added managed services for people who don't want to staff the day-to-day aspects of this.
- And, finally, we've got the enterprise-class mission-critical support most people demand when considering this stuff. Hard to replicate that.
And that's where we are today -- never mind the ginormous R+D spend and voracious appetite for acquisition.
The Game Is On!
This much is clear -- 3D technology is pretty hot, and customers show no sign of losing interest in the foreseeable future. As a result, just about every storage or backup vendor on the planet is going to have to play at some level.
Some will do a really good job at it; others won't. Still others won't show up in the game until very, very late.
Some will offer a point product or two for "checklist compatibility"; others will invest broadly across their portfolio as a core capability that's needed virtually everywhere.
And if you think about it, many traditional storage players haven't even come to the game yet. Wonder what they're gonna do?
Who said storage was boring?
Sounds like EMC is really covering all the bases for deduplication, Chuck. One note: I never heard the "3D" terminology before. Is this some new EMC insider jargon?
Posted by: Stephen Foskett | May 21, 2008 at 01:44 PM
Hi Stephen --
There are more bases to cover, we're not done yet ... !!!
I think there's a natural tendency to shorten concepts to acronyms in our industry.
Remember Y2K?
Posted by: Chuck Hollis | May 21, 2008 at 06:04 PM
Hi Chuck. Yahknow, I sure appreciate your words here with a flavor of education on the subject matter, and I enjoy reading your posts in general. However, in this case I must comment; I have little faith in the expertise portrayed by EMC on the whole regarding their cumulative 15 minutes spent in the dedupe space. I'm also fairly certain that flogging this well-established marketplace, with dominating players like Data Domain, NetApp, Sepaton, and others, with this new jargon "3D" will only serve to muddy the waters for end users. Perhaps that's not without plan. In any case, I found a SNIA vendor-neutral tutorial to be very beneficial in explaining to my ops staff what deduplication is about, and where it fits. Covers EMC technologies as well. Respectfully - Bjorn
http://www.snia.org/education/tutorials/2008/spring/data-management
Posted by: Bjorn Svenheld | May 22, 2008 at 07:09 AM
Hi Bjorn --
I think SNIA does a great job on providing (relatively) vendor-neutral education, which is very important during the early phases of a new technology entering the market.
As far as the 3D terminology, it's nothing more than a convenience -- easier than typing (or saying) "data deduplication" over and over and over again. No attempt to muddy the waters, just an attempt at shorthand.
I had assumed that everyone had started to use it, since I hear the phrase "3D" from customers pretty frequently now. I guess I was a bit early to pick up on it.
Posted by: Chuck Hollis | May 22, 2008 at 11:11 AM