
March 01, 2007


Ernst Lopes Cardozo

Dear Chuck,
Thank you so much. Ever since the double whammy of the disk drive articles by Google and CMU, I have been miserable. The thought that our enterprise drives could be just as unreliable as any PC-drive kept me up at night.
Then I read your blog and the sun came back into my life. Of course, I should have known. It all depends! There are these myriads of factors that make all the difference. The chance that my drives are in the same condition, experience the exact same temperature swings, mechanical abuses and usage patterns as Google’s is infinitesimally small. The fact that CMU’s drives failed as well makes no difference. Heck, even if everybody’s drives fail, I will still know I’m protected, since my case IS different. My mileage will vary!

If CMU lost their faith in the superiority of Enterprise drives, that’s just their problem. I just know that more expensive drives are more reliable. Because we LOVE them more!

I feel your pain when I see your fine company implicated in a conspiracy to make bad drives. Such accusations after you spared none of our money.

Up in Smoke
Truth is, your blog echoes the rebukes from the tobacco industry, which always rejected studies that showed smoking is bad for our health. We have seen how that ended: the studies were largely right, the industry was wrong and they knew it all along. They had no need to conspire, since they had a common motive: profit.

If the industry wants more money for their enterprise drives, then it's up to the industry to prove to us that they hold more value. If your datasheets claim that they are more reliable, then it is up to you to prove that they are. At this point we, the jury, have serious doubts.

Chuck Hollis

DJ McFadden -- if you'd like a reply to your comment, please leave a valid email address. The one you left doesn't seem to work.


Chuck Hollis


Thanks for the comment, although I have to admit that the dripping sarcasm is a bit much to wade through.

I think you might have missed a key point I'm trying to make.

The studies -- as presented -- don't give us enough data to make a definitive conclusion one way or another.

Way too many variables in play, over too long a time, against a technology base that shifted during the study, and continues to shift as we debate this.

Agree or disagree, that's my take.

Now, if based on these studies, you feel there's enough evidence to start using PC-class drives for your most critical enterprise data, well, that's your choice now, isn't it?

I'm sorry if I wasn't able to boil this down to a simple yes-or-no proposition for you.

And if you'd like to believe in a conspiracy theory, well, that's your privilege as well.

As I said before, you're giving the vendor community way too much credit here ;-)

Thanks for reading!

Bill Todd

Tastes vary, I guess: *I* found Ernst's sarcasm both highly entertaining and precisely on target (especially the comparison to the case of the tobacco companies - you've even got their tone down pat).

Google and CMU are of course only the latest industry players to have highlighted the fact that real-world disk MTBFs very often just don't come anywhere near measuring up to manufacturers' claims: the Internet Archive and the apparently late (and very much missed) Jim Gray at Microsoft made similar observations a year or two ago. While those studies didn't directly address the issue of whether 'enterprise' drives at least maintained their *relative* claimed MTBF advantage over their lesser brethren, your own observations would seem to apply to this area as well as to those you attempted to turn them to: we really don't know nearly enough about the details of manufacturers' test conditions to have any certainty at all that *given identical environmental conditions similar to those likely to be encountered in real-world installations* enterprise drives would demonstrate any significant superiority in this area.

"Just trust us: we're the experts" doesn't fly these days quite as well as it used to - something to do with gross abuses of such trust across such a wide spectrum of industry and society as a whole. Here's hoping that we'll see a great deal more such data, even if it may not have all the variables controlled quite as perfectly as you (and for that matter we) might wish for (since, after all, the real world tends not to be quite that controlled either). In that vein, we'd of course welcome any that EMC might choose to contribute as well.

- bill

Ernst Lopes Cardozo

My sarcasm was prompted by the feeling that your post was completely beside the point. The Google and CMU studies did indeed do little to explain why their drives failed more often. Does that invalidate their results? I don't think so. The way I see it is quite simple: the industry produces goods, accompanied by a spec sheet that includes an MTBF. The ones I read did not place specific restrictions on the conditions under which those numbers should be valid. If the reliability of a sufficiently large population of drives, under diverse but generally professional conditions, then deviates so enormously from what the manufacturers said we could expect, then that does cast serious doubt on the validity of the MTBF claims. Your comments seem to suggest that the researchers or users have to explain or even prove the cause of these early failures. I see that differently.
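To make the comparison between a spec-sheet MTBF and an observed failure rate concrete, the MTBF can be converted into the annual failure rate it implies. The sketch below is illustrative only: it assumes a constant failure rate (an exponential lifetime model) and drives powered on around the clock, and the MTBF values in the loop are hypothetical round numbers, not figures taken from any datasheet or from either study.

```python
import math

# Assumption: drives run continuously, so one year is 8760 power-on hours.
HOURS_PER_YEAR = 24 * 365


def implied_afr(mtbf_hours: float) -> float:
    """Annual failure rate implied by an MTBF under a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse conversion: the MTBF that an observed annual rate would imply."""
    if not 0.0 < afr < 1.0:
        raise ValueError("afr must be strictly between 0 and 1")
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


if __name__ == "__main__":
    # Hypothetical datasheet values, chosen only to show the scale involved.
    for mtbf in (500_000, 1_000_000, 1_500_000):
        print(f"MTBF {mtbf:>9,} h -> implied AFR {implied_afr(mtbf):.2%}")
```

Under these assumptions an MTBF of 1,000,000 hours implies an annual failure rate a little below 1%. Running `mtbf_from_afr` on a field-observed annual rate gives the MTBF that rate would correspond to, which is one way to put the two numbers on the same scale.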

After reading a couple of your blogs, I must say you have a way of making complex things simple and approachable. That is a very useful skill. This time, I think you used it to defend a position that can only harm you and your company in the long run.

I wish you the very best.

Chuck Hollis

Ernst, Bill

Seems to be a lot of emotion on this topic. I'm not quite sure why.

Just to re-iterate, no one is disparaging the study. It went about as far as it could given their constraints.

However, my personal opinion -- and that of others I respect -- is that it's very hard to drive to a definitive conclusion from the data presented.

I explained my reasons why.

There seems to be some sort of expectation that EMC (or any other vendor, for that matter) should somehow be obligated to explain the results of the two studies.

Guys, that's not our job to do. I heard that we contributed data to at least one of the studies, in the interests of scientific research, but that's about the limit of our participation.

I think, like all areas of scientific inquiry, you can read the papers and reach your own conclusions.

I've made mine -- there are way too many variables in play, over too long a time, to definitively conclude one thing or another.

If you disagree, fine.


Here's the interesting thing: the security industry generates a lot of paper pointing out holes in core technologies all the time. It's followed by a lot of people expressing their opinions, and there are disagreements, but ultimately they all sit down and work the problem, not the blame.

The storage industry contributes to the CMU research, you'll find EMC thanked in the acknowledgements along with many other vendors, and we have a bunch of people who want a public hanging because they have a personal axe to grind with a company.

Anyone using a tobacco industry analogy in this case is clearly incapable of reading the research and is obviously just a troll who should be ignored. EMC rips down every failed drive and analyses every support situation extensively to ensure quality, even though it does not manufacture drives itself.

It should also be noted that Seagate have publicly stated that 43% of all drives returned to them as failed units are found to be defect free. Google's report is even more damning, quoting up to 60% of returned drives found to be defect free upon analysis. So why were they shown as failed in the first place?

This is a complex issue with many moving parts; the idiots looking to paint someone, anyone, as "the bad guy" don't appear to see all the pieces in play.

I've read both papers; they were derived from customer failure records and don't appear to contain any follow-up information from the drive failure analysis cycle.

If a controller marks a drive as suspect and kicks it out of a RAID group, but the manufacturer finds that it's defect free and operating within tolerances, does that count as a failed drive? Under the CMU and Google report conditions it does, as the drive was replaced; for the manufacturer it doesn't, because the drive is functioning as designed.
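The gap between the two definitions can be put in numbers. This is a minimal sketch, assuming that a fixed fraction of replaced drives would be judged defect free on return. The 43% and 60% fractions are the figures quoted above; the 3% annual replacement rate is a hypothetical input chosen for illustration, not a number from either paper, and the function name is invented for this example.

```python
def manufacturer_view_rate(replacement_rate: float, no_defect_fraction: float) -> float:
    """Annual rate the manufacturer would count as true failures.

    replacement_rate: fraction of drives replaced per year (customer's view).
    no_defect_fraction: fraction of returned drives found defect free.
    """
    if not 0.0 <= replacement_rate <= 1.0:
        raise ValueError("replacement_rate must be between 0 and 1")
    if not 0.0 <= no_defect_fraction <= 1.0:
        raise ValueError("no_defect_fraction must be between 0 and 1")
    return replacement_rate * (1.0 - no_defect_fraction)


if __name__ == "__main__":
    observed = 0.03  # hypothetical annual replacement rate
    for ndf in (0.43, 0.60):  # no-defect-found fractions quoted in this comment
        rate = manufacturer_view_rate(observed, ndf)
        print(f"no-defect fraction {ndf:.0%}: {observed:.1%} replaced, "
              f"{rate:.2%} counted as failed by the manufacturer")
```

The same population therefore yields two different "failure rates" depending on who is counting, which is why the two sides can look at the same drives and disagree.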

How many "failed" drives in these reports were the result of bad controllers or faulty cabling? We don't know and the reason we don't know is that it probably would have been insanely expensive to send that many drives to be examined to the required level. It's cheaper to bin them.

Those are just some of the contextual elements in play. I reject the precepts put forth by Bill and Ernst, since they appear to ignore the fact that without that drive failure analysis data we don't actually know what happened to those "failed" drives.

Chuck Hollis

Thanks to all who wrote and expressed their thoughts.

I'm going to close comments on this one; in the meantime, I found this article in eWeek by David Morgenstern rather interesting.



The comments to this entry are closed.

Chuck Hollis

  • Chuck Hollis
    SVP, Oracle Converged Infrastructure Systems

    Chuck now works for Oracle, and is now deeply embroiled in IT infrastructure.

    Previously, he was with VMware for 2 years, and EMC for 18 years before that, most of them great.

    He enjoys speaking to customer and industry audiences about a variety of technology topics, and -- of course -- enjoys blogging.

    Chuck lives in Vero Beach, FL with his wife and four dogs when he's not traveling. In his spare time, Chuck is working on his second career as an aging rock musician.

    Warning: do not ever buy him a drink when there is a piano nearby.

    Note: these are my personal views, and aren't reviewed or approved by my employer.
