
March 19, 2008


Lars Albinsson

"Data Deduplication Is Hot" -- I agree!

We are about to redesign our NetWorker installation and are thinking a lot about dedup: where does it fit, and where is it only a waste of CPU cycles? We need to find the wins of dedup and steer clear of the pitfalls. In the backup environment adoption is clearer and easier, and I'm sure dedup will become standard for all the backup vendors in the future -- maybe not TSM....

But when does it arrive in the primary storage area? If we can trust the functionality, I would find it very nice to have our ~1000 Windows 2003 virtual hosts in VMware and 200 Windows 2003 physical SAN hosts all share the common blocks. And of course those commonly and frequently shared blocks should all fit in the cache or sit on an SSD in the DMX that serves all 1200 Windows hosts. This would save us the ~1200 x 10 GB = ~12 TB of disk space used just for our boot disks.
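A back-of-the-envelope sketch of the savings being described. The 90% commonality figure below is a hypothetical assumption for illustration, not a measured number -- real boot images diverge with patches and drivers:

```python
# Estimate block-level dedup savings when a fraction of each boot disk
# is identical across hosts and therefore stored only once.

def dedup_savings_gb(hosts: int, boot_disk_gb: float, shared_fraction: float) -> float:
    """Return GB saved versus storing every boot disk in full."""
    raw = hosts * boot_disk_gb
    shared = boot_disk_gb * shared_fraction          # stored once for everyone
    unique = boot_disk_gb * (1 - shared_fraction)    # stored per host
    deduped = shared + hosts * unique
    return raw - deduped

# At 90% commonality, ~1200 hosts x 10 GB saves roughly 10.8 TB of the 12 TB.
print(dedup_savings_gb(1200, 10, 0.9))
```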

Dedup and thin provisioning are the two things that can save us customers money spent on physical disk and power. And as you say, this should be a feature, not a product.

So I'll sit back and relax and see what you in the industry can create for us customers.

Thanks for writing

Chuck Hollis

Hi Lars -- great comment.

Not that we'd be working on anything like that ;-)


De-dupe is a feature and a tech-linguistic tradition. Acronyms are a form of De-dupe as is the word De-dupe itself as you point out. Avamar and Falconstor should be de-duped as A-mar and F-stor. We saw what happened recently to N-tap and I hate to think of what's going to become of Dell EqualLogic (Delgic??)

I agree that it's a feature, just as virtualization is a feature. Of course, features can succeed as special-purpose products if they are implemented poorly in competitors' platform-type products.

By the way, are you a hockey fan? If you want to talk about something that's hot, it would be the SJ Sharks.

Alexandra Larsson

I agree that data deduplication is hot; however, I would like to put it in what I believe is a bigger context. To me it is not just a way to improve or optimize storage -- it is a critical part of any information architecture.

Data deduplication should be a core feature in EMC Documentum, which would put its use up on the business side. Since everything stored in Documentum is an object -- which may or may not have an attachment in the form of a document -- that object can be exposed to users in one or many folders. The key thing is that these linkages open up interesting ways of using data deduplication.

Imagine a corporate environment with thousands of users. Many important documents in the company will be used many times, by many people, in many different contexts. Since many of them are likely used as references in different projects -- corporate strategy, marketing documents, and so forth -- they are essentially read-only. However, since a lot of users need them, the exact same document will be imported many times, not only taking up unnecessary space but also creating problems when these documents are updated.

So my solution would be a job running on import that highlights to the user that this particular document is already available and asks whether they want to use the existing one instead. If they agree, a link to that document is created in their Folder or Project space.

We can also continuously run a job reporting on the current status of the repository to see how many duplicates we have and what kind of content is duplicated the most. If we want a proactive Knowledge Management function, it can either consolidate those duplicates directly or create tasks asking users if they agree to deduplicate some of their content.

This will further help companies manage vital documents and reduce the confusion about which document is the correct and updated one.

From a technology standpoint, the first step would be to use a simple hash function to find exact duplicates, but the next step should be to use the vector-based indexing technology found in both Autonomy and FAST ESP to also detect the level of similarity and possibly use that for further refinement of similar content. That way we can find the same content stored in different formats, such as Word and PDF.
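The "first step" above can be sketched in a few lines. This is a minimal illustration using SHA-256 over a plain directory tree -- the repository layout is hypothetical, and the similarity-detection "next step" would require a vector index and is not shown:

```python
# Group files by content hash; any group with more than one member
# is a set of byte-identical duplicates.

import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(root: str) -> dict[str, list[Path]]:
    """Return digest -> paths for every set of exact duplicates under root."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {d: ps for d, ps in groups.items() if len(ps) > 1}
```

A real import hook would hash the incoming document and look the digest up before storing, rather than scanning after the fact.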

Chuck Hollis

Hi Alexandra -- you bring up good points.

First, many people use the single-instancing function of Centera for all sorts of repositories: Documentum, email attachments, filesystems, etc. Doesn't matter how many "copies" there are in the environment, there's only one instance of the object which is centrally managed. If you're not doing that in your environment today, you could be, as literally thousands of customers are doing the same.
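The single-instancing idea can be modeled simply: objects are keyed by their content hash, so repeated "copies" resolve to one stored instance. This is an illustrative in-memory toy, not Centera's actual API:

```python
# Content-addressed store: identical content is stored once, and every
# additional "copy" just increments a reference count on the same address.

import hashlib

class SingleInstanceStore:
    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}   # digest -> the one stored copy
        self._refs: dict[str, int] = {}        # digest -> reference count

    def put(self, data: bytes) -> str:
        """Store data once; repeated puts of identical content return the same address."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._objects:
            self._objects[digest] = data
        self._refs[digest] = self._refs.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> bytes:
        return self._objects[digest]

store = SingleInstanceStore()
a1 = store.put(b"quarterly report")
a2 = store.put(b"quarterly report")   # second "copy": same address, no new storage
assert a1 == a2 and len(store._objects) == 1
```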

Second, there is sometimes the opportunity to capitalize on small differences between stored objects to get further savings. Some kinds of use cases work well, others don't. In particular, file types that are already compressed (e.g. pictures in JPEG, PowerPoint presentations, MP3 audio, video streams) do not usually respond well to this kind of data deduplication.

We always do a bit of analysis on large data sets to make sure we understand the cost savings potential of the technology before we wade in with a solution.

Thanks for writing!


Data Dedup a feature?
This may be EMC's perspective on this technology, but Data Domain has built quite a nice business out of selling a feature. I would argue that in the greater scheme of storage management and organization it is a feature, but when you design and build a product to do that feature very well -- as in Data Domain's case -- you are able to build a successful product line chiefly around it.

3PAR did the same with thin provisioning, mentioned in a previous blog comment. Their moment has largely passed, but by exploiting the lack of this feature and convincing customers it was a must-have, they were able to compete successfully with major storage companies like EMC, which even now has not introduced thin provisioning on its CLARiiON line (yes, I know it's on the roadmap).

Chuck Hollis

I think we're talking about different things.

I can't argue with you that the first vendor to come up with a cool feature gets a lot of attention and can build a nice (initial) business. Plenty of examples of that in our business, right?

But when everyone has pretty much the same feature, then what?

Certainly, that's an interesting question for the vendor in question (hint: sell yourself!), but what about customers who've made long-term investments in one product or another?

Example: what happens to TMS (the RamSan guys) when everyone has enterprise flash drives that are nearly as fast, a lot cheaper, and have a bunch of cool features? Or Gear6 for that matter?

Or what happens to Copan when everyone learns how to do intelligent drive spin-down?

Just for the record, I'm not hearing any good news from the 3PAR guys these days; they've been pretty quiet. As have Pillar and a few others.

If you're a VC, get in and get out. If you're a user of these technologies, you may have a different perspective.

Thanks for writing!


Hi Chuck, I have written a simple, very basic tutorial on dedupe. Please let me know how I can improve it and whether I should add more references. Thanks -- I have also used some quotes from you.


Chuck Hollis

  • Chuck Hollis
    SVP, Oracle Converged Infrastructure Systems

    Chuck now works for Oracle, and is now deeply embroiled in IT infrastructure.

    Previously, he was with VMware for 2 years, and EMC for 18 years before that, most of them great.

    He enjoys speaking to customer and industry audiences about a variety of technology topics, and -- of course -- enjoys blogging.

    Chuck lives in Vero Beach, FL with his wife and four dogs when he's not traveling. In his spare time, Chuck is working on his second career as an aging rock musician.

    Warning: do not ever buy him a drink when there is a piano nearby.

    Note: these are my personal views, and aren't reviewed or approved by my employer.
