Data DeDupe -- Product or Feature?
Ah, you've got to love this bloggy world we're living in.
Today's post was driven by industry speculation that EMC and Quantum might be doing something.
Which, of course, led to a broader discussion on data deduplication.
And, inevitably, I felt I should weigh in and attempt a tiny bit of clarification here.
Data Deduplication Is Hot
You know something is hot when the words used to describe it get compressed.
Over the last few years, the term "data deduplication" has been shortened to "data dedupe" and, most recently, "D3".
Guys, this is exactly the kind of behavior that got us into trouble with the whole Y2K thing ...
On a more serious note, the potential of squeezing the redundant data out of our growing storage farms is something that just can't be ignored. It's just too big an idea.
For The Record
I'm not going to comment regarding the rumors that perhaps EMC and Quantum are going to do something together.
Frankly, in the big scheme of things, it doesn't really matter whether we are or we aren't.
But, along the way, a few folks tried to embellish the story by speculating that we weren't happy with Avamar (fact: we're extremely happy), or we weren't happy with FalconStor (again, not true), or something similar.
I guess that sort of thing makes a "news" article more entertaining, but that doesn't mean it's accurate.
We've Seen This Before
As part of my job, I do my best to help people think about these things in a useful way. And I've found some luck in pointing to a historical precedent here, namely compression.
Data compression has been around for a long while, and we've all got some experience with it. There are places it works really well, and places where it doesn't. And we've all learned that compressing a compressed object is an exercise in futility.
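For anyone who wants to see the "compressing a compressed object" point for themselves, here is a minimal sketch using Python's standard zlib module. The word list and sizes are made up purely for illustration; the only claim is that a second pass over already-compressed bytes gains essentially nothing.

```python
import random
import zlib

# Hypothetical sample data: repetitive text built from a small vocabulary.
rng = random.Random(0)
words = ["backup", "storage", "block", "file", "tape", "disk", "array", "copy"]
text = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

once = zlib.compress(text, 9)   # first pass: lots of redundancy to remove
twice = zlib.compress(once, 9)  # second pass: input is already compressed

print(f"original:         {len(text)} bytes")
print(f"compressed once:  {len(once)} bytes")
print(f"compressed twice: {len(twice)} bytes")
```

The first pass should shrink the text dramatically, while the second pass should leave the size roughly where it was, give or take a few bytes of overhead.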
We also all know there's no free lunch -- that there's a price to pay in either performance, complexity, or both, depending on the circumstances.
That being said, we see compression all through the IT landscape -- everything from JPG files and PowerPoint, to network communications, etc.
More importantly, there's no real market for data compression products anymore.
It's a feature you just expect to see in certain products.
And, if it's a good product, there are choices for different types of compression, including just turning it off if it's getting in the way of things. The smartest products can look at what's being compressed and try to do the best job possible, or integrate with other functions in a useful manner.
Data Deduplication Is A Feature
By analogy, that's how EMC sees data deduplication over time -- it's a feature you expect in the product, not the product itself.
We can already point to many places where it makes sense -- client-side backups, target-side backups, file systems, over replication links, large content repositories -- and I bet we find a few more over time.
Dave Raffo over at SearchStorage picked up on this in his recent blog post -- he and I would agree about this.
And one has to seriously wonder about the long-term viability of a company entirely focused on target-side backup deduplication just as you'd wonder about a specialized data compression product company.
Guys, it's a feature. Not a product, and certainly not a market.
The Bottom Line
Right now, all the interest is in the technology and the players. Fine, that's what you would expect when the technology is relatively new, and all the vendors are working hard to make their products support a new feature set.
But, it won't be too long before the focus will shift -- away from the technology itself, and towards helping customers figure out where it makes sense, and where it doesn't.
Part of that will require having the feature widely available in many products. Part of that will be integration with things like management layers and other parts of the infrastructure. And a lot of that will involve having good, experienced people and tools to analyze customer environments and make recommendations.
But in the meantime, I guess it makes for good reading ...



"Data Deduplication Is Hot" i agree!
We are about to redesign our NetWorker installation and are thinking a lot about dedup. Where does it fit, and where is it only a waste of CPU cycles? We need to find the wins of dedup and keep away from the pitfalls. In the backup environment there is a clearer and easier adoption path for dedup, and I'm sure it will be standard for all the backup vendors in the future, maybe not TSM....
But when does it arrive in the primary storage area? If we can trust the functions, I would find it very nice to have our ~1000 Windows 2003 virtual hosts in VMware and 200 Windows 2003 physical SAN hosts all share the common blocks. And of course those commonly and frequently shared blocks should all fit in the cache or be on an SSD in the DMX that serves all 1200 Windows hosts. This would save us the disk space of ~1200 x 10GB = ~12 TB used only for our boot disks.
Dedup and thin provisioning are the two things that can save us customers $ spent on physical disk and power. And as you say, this should be a feature, not a product.
So I'll sit back and relax and see what you in the industry can create for us customers.
Thanks for writing
Posted by: Lars Albinsson | March 19, 2008 at 04:28 PM
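Lars's boot-disk arithmetic can be worked through in a few lines. This is a rough sketch, not a sizing tool: the host count and 10 GB boot disk come from his comment, while the shared_fraction value (how much of each boot disk is identical across hosts) is a purely hypothetical assumption, as is the function name.

```python
def boot_disk_savings_gb(hosts, boot_disk_gb, shared_fraction):
    """Estimate capacity saved when blocks common to all boot disks are stored once.

    shared_fraction is an assumed share of each boot disk that is identical
    on every host; real environments would need to be measured.
    """
    total = hosts * boot_disk_gb
    shared_per_host = boot_disk_gb * shared_fraction
    # After dedup: one copy of the shared blocks plus each host's unique blocks.
    after = shared_per_host + hosts * (boot_disk_gb - shared_per_host)
    return total, total - after


# 1200 hosts x 10 GB from the comment above; 0.8 is an illustrative guess.
total_gb, saved_gb = boot_disk_savings_gb(hosts=1200, boot_disk_gb=10, shared_fraction=0.8)
print(f"raw boot-disk capacity: {total_gb / 1000:.1f} TB")
print(f"estimated savings:      {saved_gb / 1000:.2f} TB")
```

The raw figure matches the ~12 TB in the comment. The savings land somewhat below that, since one copy of the shared blocks and every host's unique blocks still have to live somewhere.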
Hi Lars -- great comment.
Not that we'd be working on anything like that ;-)
Posted by: Chuck Hollis | March 19, 2008 at 06:23 PM
De-dupe is a feature and a tech-linguistic tradition. Acronyms are a form of de-dupe, as is the word de-dupe itself, as you point out. Avamar and FalconStor should be de-duped as A-mar and F-stor. We saw what happened recently to N-tap and I hate to think of what's going to become of Dell EqualLogic (Delgic??)
I agree that it's a feature, just as virtualization is a feature. Of course, features can succeed as special-purpose products if they are implemented poorly in competitors' platform-type products.
By the way, are you a hockey fan? If you want to talk about something that's hot, it would be the SJ Sharks.
Posted by: MarcFarley | March 20, 2008 at 12:11 PM
I agree that data deduplication is hot; however, I would like to put it in what I believe is a bigger context. To me it is not just a way to improve or optimize storage, it is a critical part of any information architecture.
Data deduplication should be a core feature in EMC Documentum, which would put the use of it up on the business side. Since everything stored in Documentum is an object, which may or may not have an attachment in the form of a document, that object can be exposed to users in one or many folders. The key thing is that these linkages open up interesting ways of using data deduplication.
Imagine a corporate environment with thousands of users. A lot of important documents in the company will be used many times by many people in many different contexts. Since many of them are likely used as references in different projects, such as corporate strategy, marketing documents and so forth, they are essentially read-only. However, since a lot of users need them, the exact same document will be imported many times, not only taking up unnecessary space but also creating problems when these documents are updated.
So my solution would be to have a job running on import that tells the user that this particular document is already available and asks whether they want to use the existing one instead. That results in a link to that document being created in their Folder or Project space.
We can also continuously run a job that reports on the current status of the repository, to see how many duplicates we have and what kind of content is duplicated the most. If we want a proactive Knowledge Management function, they can either consolidate that directly or create tasks for users asking them if they agree to deduplicate some of their content.
This will further help companies manage vital documents and reduce the confusion over which document is the correct and current one.
From a technology standpoint, the first step would be to use a simple hash function to find exact duplicates, but the next step should be to use the vector-based indexing technology found in both Autonomy and FAST ESP to also detect similarity levels and possibly use that for further refinement of similar content. That way we can find the same content stored in different formats, such as Word and PDF.
Posted by: Alexandra Larsson | March 21, 2008 at 12:00 AM
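The first step Alexandra describes, finding exact duplicates with a simple hash function, might look something like the sketch below. It is only an illustration: the function name, the choice of SHA-256, and the in-memory "repository" dictionary are all assumptions, and nothing here uses a real Documentum API. The second step (similarity detection across formats) is not shown.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(documents):
    """Group document names by the SHA-256 digest of their content.

    `documents` maps a (hypothetical) document path to its raw bytes.
    Returns a list of groups, each holding the names that share identical content.
    """
    by_digest = defaultdict(list)
    for name, content in documents.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Made-up sample repository: two imports of the same strategy document.
repository = {
    "projects/alpha/strategy.doc": b"Corporate strategy 2008",
    "projects/beta/strategy-copy.doc": b"Corporate strategy 2008",
    "marketing/brochure.doc": b"Product brochure",
}

duplicates = find_exact_duplicates(repository)
for group in duplicates:
    print("identical content:", ", ".join(group))
```

A byte-level hash only catches identical files. The same report saved once as Word and once as PDF would hash differently, which is exactly why the similarity step is proposed as a follow-on.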
Hi Alexandra -- you bring up good points.
First, many people use the single-instancing function of Centera for all sorts of repositories: Documentum, email attachments, filesystems, etc. Doesn't matter how many "copies" there are in the environment, there's only one instance of the object which is centrally managed. If you're not doing that in your environment today, you could be, as literally thousands of customers are doing the same.
Second, there is sometimes the opportunity to capitalize on small differences between stored objects to get further savings. Some kinds of use cases work well, others don't. In particular, file types that are already compressed (e.g. pictures in JPEG, PowerPoint presentations, MP3 audio, video streams) do not usually respond well to this kind of data deduplication.
We always do a bit of analysis on large data sets to make sure we understand the cost savings potential of the technology before we wade in with a solution.
Thanks for writing!
Posted by: Chuck Hollis | March 21, 2008 at 08:38 AM
Data Dedup a feature?
This may be EMC's perspective on this technology, but Data Domain has built quite a nice business out of selling a feature. I would argue that in the greater scheme of storage management and organization it is a feature, but when you design and build a product to do this feature very well - as in Data Domain's case - you are able to have a successful product line built chiefly around that feature.
3PAR did the same with thin provisioning mentioned in a previous blog comment. Their time has largely passed but by exploiting the lack of this feature and convincing customers that this was a must have, they were able to compete successfully with the major storage companies like EMC who even now have not yet introduced thin provisioning on their Clariion line (yes, I know it's on the roadmap).
Posted by: mgbrit | March 26, 2008 at 05:37 PM
I think we're talking about different things.
I can't argue with you that the first vendor to come up with a cool feature gets a lot of attention, and can build a nice (initial) business. Plenty of examples of that in our business, right?
But when everyone has pretty much the same feature, then what?
Certainly, that's an interesting question for the vendor in question (hint: sell yourself!), but what about customers who've made long-term investments in one product or another?
Example: what happens to TMS (the RamSan guys) when everyone has enterprise flash drives that are nearly as fast, a lot cheaper, and have a bunch of cool features? Or Gear6 for that matter?
Or what happens to Copan when everyone learns how to do intelligent drive spin-down?
Just for the record, I'm not hearing any good news from the 3PAR guys these days, they've been pretty quiet. As have Pillar and a few others.
If you're a VC, get in and get out. If you're a user of these technologies, you may have a different perspective.
Thanks for writing!
Posted by: Chuck Hollis | March 26, 2008 at 05:46 PM