In some corners of our industry, technology discussions can turn into interesting studies of human behavior.
Even though EMC’s capabilities span a very broad range these days, we still get involved in our fair share of storage technology debates. Some are relevant, others aren’t.
And, since I once was a crusty, knuckle-dragging storage guy, occasionally I feel compelled to wade into certain areas.
So today, let's talk a bit about disk drive failure rates.
What Happened
Maybe you saw the interesting white paper from a team at Google.
They tracked a population of disk drives over a period of five years, and concluded “hey, the data doesn’t really match up to what we might have thought”.
Fair enough.
And then the blogging started. Responses to responses. Vendor posturing.
Many of us took a look at this and thought “sheesh, what’s the big deal?”
So here are a couple of thoughts on the discussion.
Sometimes disk drives don’t fail as you’d expect.
That’s about the only thing you can conclude from the white paper.
Trying to cross the chasm from that observation to speculation about why it might be the case just isn't supported, at least in the opinion of several people.
Why is that?
Well, there are just too many uncontrolled variables. We have too little information about drive specifics, how they were used, how they were maintained (if at all), and so on.
Anyone who’s worked in the storage industry knows that there are dozens and dozens of variables that can affect the life of a drive.
Externally: how was the drive mounted, in what orientation, with how much vibration, at what duty cycle, was it on a long FC loop, did it see temperature spikes, did it get firmware updates, and so on and so forth. Long list here.
Internally: what rev of the mechanicals, the media substrate, the head composition, the lubricant, the firmware, the interface, and were the drives screened? Another long list here.
You get the idea – there are just too many potentially significant variables in play here to conclude much of anything tangible. And if you think those sorts of variables are held relatively constant over a long period of time, well, that’s just not the case.
Complicating matters is the fact that disk drive technology is evolving incredibly rapidly, so even if you could make some sort of sense of the historical study, it’d be unlikely to apply in the future.
Maybe a non-technical analogy is in order ...
Let’s say you ran a human mortality study by tagging everyone who passed through the Atlanta airport on a given Wednesday.
And you followed them for five years, and somehow your data didn’t match up with what you’d expect.
You could come up with all sorts of wild speculation around airports being suspect, airplanes being suspect, connecting flights being suspect, airlines being suspect, Wednesdays being suspect, Atlanta being suspect, the government lying to us, and so on.
None of which would have much of a valid basis for conclusion, right?
All you could say is that your data wasn’t quite what you expected, and more research is called for. That seems to be the case here.
Some people have their pattern recognition circuitry turned way up.
All of us have an instinct to detect and recognize patterns. It’s a key part of human intelligence.
But our capabilities are not perfect. Sometimes there’s a pattern, sometimes there’s not.
We see castles in the clouds, faces in the moon, and so on. Sometimes it turns a bit darker, and we become convinced of things that the data simply doesn't support.
Thank god for statistics and the scientific method.
In some of the blog posts, I think certain people had their pattern recognition circuitry turned up a bit too high. Either that, or they thought that by being controversial, they could increase their presence in the community.
Do I think that the white paper findings are interesting, and deserve more study?
Certainly.
Do I think there is a conspiracy among vendors to mislead the public?
Don’t be ridiculous.
You guys are giving us way too much credit here.
Some vendors will capitalize on just about anything.
Thinking back to the post 9/11 era, I remember all the tacky marketing campaigns from data protection vendors with the unspoken message “it could happen again!”.
I find that sort of marketing very repugnant.
In one of the more strident blogs, NetApp took the opportunity once again to position themselves as both concerned citizens and thought leaders on this “important industry topic.” Lots of misrepresentation, skewing of the facts, etc. It didn’t come off too well for them, at least from my perspective.
I didn’t have to respond in detail, as others have taken care of that this time.
At least they’re consistent.
The job of storage vendors is to protect users from disk drive failures.
Components fail. Sometimes they fail as you’d expect. Sometimes they don’t.
Storage array vendors can use a wide array of techniques (and we’re not just talking simplistic RAID) to protect against disk drive failures.
And trust me, once you wade into the arcane details, there’s a wide variability in how different vendors attack the problem.
Some approaches cost more than others. So there’s always a useful discussion around what you think you need, and what you think you can afford.
And, as disk drives evolve rapidly (which they’ve been doing for several years, and show every sign of continuing to do), the pros and cons of different approaches will vary over time.
What made sense a few years ago might not make sense today. Such is life in the technology world. Conventional wisdom changes faster than we’d like.
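To make that concrete, here's a rough back-of-the-envelope sketch. Every input is an illustrative assumption (drive count, failure rate, rebuild window), and the model ignores correlated failures, unrecoverable read errors, and plenty else. But it shows why the calculus shifts as drives get bigger and rebuilds get longer.

```python
# Back-of-the-envelope RAID-5 data-loss estimate. Illustrative only:
# every input below is an assumption, and the model ignores correlated
# failures, unrecoverable read errors, and much else besides.

HOURS_PER_YEAR = 8760

def raid5_annual_loss_probability(drives, afr, rebuild_hours):
    """Rough annual chance of losing a single-parity group: one drive
    fails, then a second drive in the same group fails before the
    rebuild finishes."""
    rebuilds_per_year = drives * afr
    second_failure_during_rebuild = (drives - 1) * afr * (rebuild_hours / HOURS_PER_YEAR)
    return rebuilds_per_year * second_failure_during_rebuild

# Yesterday's small drives and short rebuilds...
print(raid5_annual_loss_probability(drives=8, afr=0.03, rebuild_hours=4))
# ...versus today's big drives and day-long rebuild windows.
print(raid5_annual_loss_probability(drives=8, afr=0.03, rebuild_hours=24))
```

Same drives, same group size, and the annual odds of losing the group go up sixfold, just from the longer rebuild window. That's the sort of shift that moves conventional wisdom.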
Once again, it depends.
We all should avoid any discussion that starts with “the only right way to …” Very few topics in storage, or technology (or life!) are that simple.
Avoid these people, please.
A Couple Of Final Thoughts
So, I guess we have preliminary data that says, sometimes, disk drives don’t fail like you’d expect them to.
That would mean they join the very long list of technology components that seem to exhibit the same behavior: my cell phone, my cable connection, the motherboard in that three-year-old PC I own, that damn plasma TV I bought a few years back that’s now useless, my growing collection of iPod bricks, and so on.
For me, I guess the real surprise is that -- people were surprised.
Dear Chuck,
Thank you so much. Ever since the double whammy of the disk drive articles by Google and CMU, I have been miserable. The thought that our enterprise drives could be just as unreliable as any PC drive kept me up at night.
Then I read your blog and the sun came back into my life. Of course, I should have known. It all depends! There are these myriad factors that make all the difference. The chance that my drives are in the same condition, and experience the exact same temperature swings, mechanical abuses, and usage patterns as Google’s, is infinitesimally small. The fact that CMU’s drives failed as well makes no difference. Heck, even if everybody’s drives fail, I will still know I’m protected, since my case IS different. My mileage will vary!
If CMU lost their faith in the superiority of Enterprise drives, that’s just their problem. I just know that more expensive drives are more reliable. Because we LOVE them more!
I feel your pain when I see your fine company implicated in a conspiracy to make bad drives. Such accusations, after you spared none of our money!
Up in Smoke
Truth is, your blog echoes the rebukes from the tobacco industry, which always rejected studies showing smoking is bad for our health. We have seen how that ended: the studies were largely right, the industry was wrong, and they knew it all along. They had no need to conspire, since they had a common motive: profit.
If the industry wants more money for their enterprise drives, then it's up to the industry to prove to us that they hold more value. If your datasheets claim that they are more reliable, then it is up to you to prove that they are. At this point we, the jury, have serious doubts.
Posted by: Ernst Lopes Cardozo | March 01, 2007 at 01:00 PM
DJ McFadden -- if you'd like a reply to your comment, please leave a valid email address. The one you left doesn't seem to work.
Thanks!
Posted by: Chuck Hollis | March 01, 2007 at 01:57 PM
Ernst
Thanks for the comment, although I have to admit that the dripping sarcasm is a bit much to wade through.
I think you might have missed a key point I'm trying to make.
The studies -- as presented -- don't give us enough data to make a definitive conclusion one way or another.
Way too many variables in play, over too long a time, against a technology base that shifted during the study, and continues to shift as we debate this.
Agree or disagree, that's my take.
Now, if based on these studies, you feel there's enough evidence to start using PC-class drives for your most critical enterprise data, well, that's your choice now, isn't it?
I'm sorry if I wasn't able to boil this down to a simple yes-or-no proposition for you.
And if you'd like to believe in a conspiracy theory, well, that's your privilege as well.
As I said before, you're giving the vendor community way too much credit here ;-)
Thanks for reading!
Posted by: Chuck Hollis | March 01, 2007 at 02:06 PM
Tastes vary, I guess: *I* found Ernst's sarcasm both highly entertaining and precisely on target (especially the comparison to the case of the tobacco companies - you've even got their tone down pat).
Google and CMU are of course only the latest industry players to have highlighted the fact that real-world disk MTBFs very often don't come anywhere near measuring up to manufacturers' claims: the Internet Archive and the apparently late (and very much missed) Jim Gray at Microsoft made similar observations a year or two ago. Those studies didn't directly address whether 'enterprise' drives at least maintained their *relative* claimed MTBF advantage over their lesser brethren, but your own observations would seem to apply to this question as well as to the ones you aimed them at: we really don't know nearly enough about the details of manufacturers' test conditions to have any certainty at all that, *given environmental conditions similar to those likely to be encountered in real-world installations*, enterprise drives would demonstrate any significant superiority.
"Just trust us: we're the experts" doesn't fly these days quite as well as it used to - something to do with gross abuses of such trust across such a wide spectrum of industry and society as a whole. Here's hoping that we'll see a great deal more such data, even if it may not have all the variables controlled quite as perfectly as you (and for that matter we) might wish for (since, after all, the real world tends not to be quite that controlled either). In that vein, we'd of course welcome any that EMC might choose to contribute as well.
- bill
Posted by: Bill Todd | March 02, 2007 at 02:01 AM
Chuck,
My sarcasm was prompted by the feeling that your post went completely beside the point. The Google and CMU studies did indeed do little to explain why their drives failed more often. Does that invalidate their results? I don't think so. The way I see it is quite simple: the industry produces goods, accompanied by a spec sheet that includes an MTBF. The ones I read did not place specific restrictions on the conditions under which those numbers should be valid. If the reliability of a sufficiently large population of drives, under diverse but generally professional conditions, deviates so enormously from what the manufacturers said we could expect, then that sheds serious doubt on the validity of the MTBF claims. Your comments seem to suggest that the researchers or users have to explain, or even prove, the cause of these early failures. I see that differently.
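To put a number on that expectation, here is the usual back-of-the-envelope conversion from a datasheet MTBF to the annualized failure rate a buyer would naively expect. A minimal sketch: the observed replacement rate I plug in below is an illustrative placeholder, not a figure from either study.

```python
# Convert a datasheet MTBF claim into the annualized failure rate (AFR)
# a buyer would naively expect. All numbers here are illustrative.

HOURS_PER_YEAR = 8760

def expected_afr(mtbf_hours):
    """Naive AFR implied by a datasheet MTBF figure."""
    return HOURS_PER_YEAR / mtbf_hours

datasheet_afr = expected_afr(1_000_000)  # an assumed "enterprise" MTBF claim
print(f"Datasheet implies an AFR of about {datasheet_afr:.2%}")

observed_arr = 0.03  # placeholder: an assumed 3% annual replacement rate
print(f"An observed rate of {observed_arr:.0%} is {observed_arr / datasheet_afr:.1f}x higher")
```

If the gap is really of that order, it seems to me the burden of explaining it rests with the people who printed the MTBF, not with the people who measured the failures.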
After reading a couple of your blogs, I must say you have a way of making complex things simple and approachable. That is a very useful skill. This time, I think you used it to defend a position that can only harm you and your company in the long run.
I wish you the very best.
Posted by: Ernst Lopes Cardozo | March 02, 2007 at 05:07 AM
Ernst, Bill
Seems to be a lot of emotion on this topic. I'm not quite sure why.
Just to reiterate, no one is disparaging the study. It went about as far as it could, given the researchers' constraints.
However, my personal opinion -- and those of others I respect -- is that it's very hard to drive to a definitive conclusion from the data presented.
I explained my reasons why.
There seems to be some sort of expectation that EMC (or any other vendor, for that matter) should somehow be obligated to explain the results of the two studies.
Guys, that's not our job to do. I heard that we contributed data to at least one of the studies, in the interests of scientific research, but that's about the limit of our participation.
I think, like all areas of scientific inquiry, you can read the papers and reach your own conclusions.
I've made mine -- there are way too many variables in play, over too long a time, to definitively conclude one thing or another.
If you disagree, fine.
Posted by: Chuck Hollis | March 02, 2007 at 08:10 AM
Here's the interesting thing: the security industry generates a lot of paper pointing out holes in core technologies all the time. It's followed by a lot of people expressing their opinions, and there are disagreements, but ultimately they all sit down and work the problem, not the blame.
The storage industry contributed to the CMU research (you'll find EMC thanked in the acknowledgements along with many other vendors), and yet we have a bunch of people who want a public hanging because they have a personal axe to grind with a company.
Anyone using a tobacco industry analogy in this case is clearly incapable of reading the research and is obviously just a troll who should be ignored. EMC rips down every failed drive and analyses every support situation extensively to ensure quality, even though it does not manufacture drives itself.
It should also be noted that Seagate have publicly stated that 43% of all drives returned to them as failed units are found to be defect free. Google's report is even more damning, quoting up to 60% of returned drives found to be defect free upon analysis. So why were they shown as failed in the first place?
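Run the arithmetic on that and you see how much it matters. A quick sketch: the annual replacement rate below is an assumed placeholder, while the defect-free fractions are the Seagate and Google figures just quoted.

```python
# If a big fraction of "failed" drives turn out defect-free on teardown,
# the replacement rate in these studies overstates the true hardware
# failure rate. The 3% replacement rate is an assumed placeholder.

def hardware_failure_rate(replacement_rate, defect_free_fraction):
    """Replacement rate discounted by returns that analysis later clears."""
    return replacement_rate * (1 - defect_free_fraction)

replacement_rate = 0.03  # assumed annual replacement rate
print(hardware_failure_rate(replacement_rate, 0.43))  # Seagate's 43% -> ~1.7%
print(hardware_failure_rate(replacement_rate, 0.60))  # Google's 60%  -> ~1.2%
```

Same field data, and the true hardware failure rate could be roughly half the headline number, or less, depending on whose defect-free figure you use.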
This is a complex issue with many moving parts, the idiots looking to paint someone, anyone, as "the bad guy" don't appear to see all the pieces in play.
I've read both papers, they were derived from customer failure records and don't appear to contain any followup information from the drive failure analysis cycle.
If a controller marks a drive as suspect and kicks it out of a RAID group, but the manufacturer finds that it's defect free and operating within tolerances, does that count as a failed drive? Under the CMU and Google report conditions it does, because the drive was replaced; but for the manufacturer it doesn't: the drive is functioning as designed.
How many "failed" drives in these reports were the result of bad controllers or faulty cabling? We don't know and the reason we don't know is that it probably would have been insanely expensive to send that many drives to be examined to the required level. It's cheaper to bin them.
Those are just some of the contextual elements in play. I reject the precepts put forth by Bill and Ernst, since they appear to ignore the fact that without that drive failure analysis data, we don't actually know what happened to those "failed" drives.
Posted by: Storagezilla | March 02, 2007 at 11:29 PM
Thanks to all who wrote and expressed their thoughts.
I'm going to close comments on this one; in the meantime, I found this article in eWeek by David Morgenstern rather interesting.
http://www.eweek.com/article2/0,1895,2099467,00.asp
Cheers!
Posted by: Chuck Hollis | March 05, 2007 at 08:54 AM