I felt I owed everyone a post about an illustrative incident that happened over the last few days.
As we move to more software-based storage stacks, some interesting questions are arising about the new responsibilities of the various parties involved: software vendors, hardware suppliers, partners, and end users.
There’s a new model emerging for building storage solutions, and we now have a well-documented example of some of the inevitable bumps along the way.
Last Saturday morning, as I was drinking my morning coffee, I saw that Twitter had a number of tweets referring to a formidably titled Reddit post “My VSAN Nightmare”.
Lovely, I thought.
What I found was a well-documented story of what happened when Jason Gill replaced a failed VSAN node and brought a new set of disks into his cluster. It’s worth reading in its entirety here. You can feel his pain.
For the TL;DR crowd -- Jason added new hardware as he was replacing a failed server, and the cluster became IO saturated for many hours as it rebuilt unprotected objects. After working through its backlog, the cluster became usable again, with no data loss or corruption. But it was a rough stretch there for a while.
As I tell people, there's no bad day in IT quite like a Bad Storage Day.
Time To Mobilize
As I work closely with the VSAN team, I immediately started an email chain, but I wasn't the first. The good news is that VMware’s support organization (GSS) had already engaged near-flawlessly and gotten the customer back up and running. We also had access to a good sampling of logs and traces — essential when you’re trying to figure out what’s going on.
One of our lead engineers (Dinesh) spent his weekend with others getting to the bottom of what happened. By Sunday morning, we had our unequivocal smoking gun: the IO controller being used (a Dell H310) couldn’t keep up with the load generated by the rebuild. VSAN first tried to throttle the rebuild, and then started to throttle application loads — to a degree that made the cluster very unresponsive.
The official response is here on Reddit — again, worth reading if you’re interested.
Now What?
A bit more background will help with the picture. First, Jason was running a moderate configuration: not a lot of disks, and not a lot of IO. He said he was quite delighted with the performance of his VSAN cluster when things were running normally. All good.
More to the point, the choice of IO controller really matters when it comes to sustained IO performance. Even if you have a flash drive doing the caching — and enough spindles behind it — everything flows through the IO controller. These controllers manage their work using a queue of outstanding IOs, which are dispatched as needed to the various storage devices.
In a nutshell, when it comes to sustained IO, queue depth matters. Low-end controllers may have a queue depth of only 25; better ones more than 1000. The deeper the queue, the better the IO controller does at keeping all the storage devices busy. Queue depth issues won’t manifest themselves unless you’re trying to drive a lot of IO — say, during a node re-protect?
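To make that concrete, here is a quick back-of-envelope sketch in Python: a toy model based on Little's law, with every number assumed purely for illustration, showing how a shallow controller queue caps sustained IOPS long before the devices behind it do.

```python
# Toy model of a controller queue as the bottleneck for sustained IO.
# Little's law: throughput = outstanding IOs / average latency.
# All numbers below are illustrative assumptions, not measured values.

def sustained_iops_ceiling(controller_queue_depth, devices, per_device_depth, avg_io_latency_s):
    """Rough upper bound on how many IOs/sec the controller can keep in flight."""
    # The controller can't have more IOs outstanding than its queue allows,
    # and the devices behind it can't absorb more than their own queues allow.
    outstanding = min(controller_queue_depth, devices * per_device_depth)
    return outstanding / avg_io_latency_s

# Eight devices behind one controller, ~5 ms average IO latency (assumed):
print(sustained_iops_ceiling(25,  8, 32, 0.005))   # shallow queue: ~5,000 IOPS ceiling
print(sustained_iops_ceiling(600, 8, 32, 0.005))   # deeper queue: ~51,200 IOPS ceiling
```

With a queue depth of 25, the controller itself becomes the choke point; the disks and flash behind it are left waiting for work.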
The VSAN software did exactly what it was designed to do — again, almost flawlessly. When the new storage capacity appeared, it waited the prescribed 60 minutes, and then started to reprotect objects that had an FTT=1 (failures to tolerate) policy. During a reprotect, VSAN may want to copy a *lot* of unprotected objects — potentially many terabytes — and do so relatively quickly.
But the hardware has to be there to support that.
When the IO controller got saturated, VSAN first backed down the reprotect (as designed), but insisted on minimum forward progress. That requirement for a minimum rate of progress is a desirable design goal, as the storage objects are now unprotected and completely vulnerable to a second failure. In this case, when insufficient forward progress was being made, VSAN then started to throttle production applications — again, as designed.
And that’s when the problems started.
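Here is a minimal sketch of that apportioning logic in Python. To be clear, this is not VSAN's actual algorithm, and the minimum-progress floor below is an assumed number; it simply illustrates how guaranteeing the rebuild a floor on a saturated controller eventually has to come out of the applications' share.

```python
# Sketch of the described policy: back off the rebuild first, but guarantee it
# a minimum rate; only then squeeze application IO. Not VSAN's real code.

REBUILD_MIN_IOPS = 500   # assumed floor for re-protect forward progress

def apportion_iops(controller_capacity_iops, app_demand_iops, rebuild_demand_iops):
    """Split a saturated controller's capacity between applications and a rebuild."""
    # First preference: give the applications what they are asking for.
    app_share = min(app_demand_iops, controller_capacity_iops)
    rebuild_share = min(rebuild_demand_iops, controller_capacity_iops - app_share)

    # But the rebuild must make minimum forward progress -- the data is
    # unprotected until it finishes -- so claw back from the apps if needed.
    if rebuild_share < REBUILD_MIN_IOPS:
        shortfall = min(REBUILD_MIN_IOPS, rebuild_demand_iops) - rebuild_share
        app_share = max(0, app_share - shortfall)
        rebuild_share += shortfall
    return app_share, rebuild_share

# An entry-level controller topping out around 5,000 IOPS (assumed):
print(apportion_iops(5000, app_demand_iops=4800, rebuild_demand_iops=3000))
# -> (4500, 500): applications get throttled so the rebuild keeps moving
```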
What Really Went Wrong
On the VMware VSAN HCL, we list the Dell H310 card Jason was using. After all, it works fine — up to a point. As Jeramiah Dooley points out, there’s a difference between “it works” and “it meets my needs”.
In a completely separate document, we state that VMware recommends a minimum queue depth of 256 for “optimum performance”. There is no information published by VMware on the queue depth of the Dell H310 we list, nor any explicit warning about the implications of using this entry-level card.
Making matters worse, IO controller vendors don’t routinely disclose the queue depth of their products: the usual procedure is to either scour the ‘net or run a specific utility. As an example, Duncan Epping has the beginnings of a list here.
You have to know what to look for — and also know what might happen if you don’t.
And, as a final twist of the knife, it seems that you can re-flash this card with the original LSI firmware, bringing the queue depth all the way up to 600 :-/
In my opinion, that’s a lot to ask from people who might be new to storage.
Jason — like perhaps many others — took a look at the VMware VSAN HCL, found a suitable (and inexpensive) IO controller card, built his cluster — and was quite pleased with the application-level performance. He was unaware that — during an extended rebuild operation — the IO load would increase dramatically, potentially bringing his lightly loaded cluster to its knees.
That’s not good.
The same kind of problem could potentially arise from — say — having insufficient IOPS (spindles) in a disk group. Even when the flash cache is fast enough, all those writes made during a reprotect need to end up on disk sooner or later. And those 4TB NL-SAS drives aren’t exactly speed demons.
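Some rough arithmetic (every number below is an assumption, purely for scale) shows why: even a modest amount of unprotected data takes hours to destage through a handful of spindles.

```python
# Back-of-envelope rebuild time once everything has to land on spinning disk.
# All figures are assumptions for illustration, not sizing guidance.

data_to_reprotect_tb = 4            # unprotected data after the node failure (assumed)
drives = 6                          # spindles absorbing the rebuild writes (assumed)
effective_mb_per_s_per_drive = 40   # rebuild traffic is far from pure sequential (assumed)

total_mb = data_to_reprotect_tb * 1024 * 1024
hours = total_mb / (drives * effective_mb_per_s_per_drive) / 3600
print(f"~{hours:.1f} hours of sustained rebuild IO")   # roughly 5 hours, before any throttling
```

And that is with the spindles doing nothing else; in real life they are also serving the applications.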
That being said, if someone knows exactly what they’re doing, using low-end IO controllers and/or low-performance storage devices should be supported. That’s one of the appeals of software-based storage — choose your own hardware.
What To Do?
The most restrictive solution would be to limit the HCL to only devices that offered sufficient application and rebuild performance, and have some sort of exception process if desired. But that feels overly restrictive to me.
A better solution would be to clearly mark low-performance HCL entries (maybe with a turtle logo?), explicitly warn people that these devices may not offer enough performance for applications and (especially) rebuild scenarios, along with a link to more information: queue depths, why they matter, VSAN’s reprotection behavior and impact on sizing guidelines, and so on.
Growing Pains
Coming as I did from the array side of the storage business, I can tell you this sort of thing doesn’t usually happen there. In addition to sizing for capacity and application performance, recovery and rebuild scenarios are fully considered as part of the storage sizing exercise.
That being said, I have been witness to a few spectacular incidents when the guidelines weren’t followed, and similar applications-on-their-knees outcomes resulted.
At the end of the day, the hardware resources have to be there to do the work required — and that includes failure scenarios.
What’s fundamentally changed with the software storage model is that the system builder (e.g. customer or partner) is now responsible for sizing their environment correctly — and not the hardware vendor.
And now we as storage software vendors need to arm everyone with information and warnings to help them make the right choices — and protect them from making poor ones.
Much more work to do here.
My heart goes out to Jason Gill for his decidedly unpleasant experience. We all want to thank him for thoroughly documenting it for others to learn from.
------------------
Great honest article. Keeping my eye on VSAN.
Posted by: BjornHouben | June 05, 2014 at 05:24 PM
Sometimes people ask me, why is it that storage seems to be behind the "software-defined" curve compared to other parts of the technology stack. This is a great example of the reason:
Storage is really damn hard.
No bad day in IT quite like a bad storage day is the truth. We are all getting better at it (scale out/share nothing architectures for example) but collectively there's a long way to go.
Kudos to Dinesh and the entire VSAN team. No one likes to see issues happen, but I bet there are a lot of customers impressed with the response.
Posted by: Jeramiah Dooley | June 07, 2014 at 02:24 PM