Life is full of learning experiences, and we had one yesterday.
A minor patch to our environment exposed underlying database corruption, which resulted in our internal social platform being unavailable for almost a full business day.
The backups? They were corrupt as well.
Thanks to the exceptional effort of everyone involved, nothing significant was lost.
Sure, there are lessons to be learned on proper support practices for important applications (and our social platform is now one of those), but there are other lessons to be learned as well.
First, A Bit Of Perspective
I've been selling to mission-critical IT environments for most of my adult career.
I understand what can happen, how it happens, and what should be done about it. I am probably not as expert as someone who actually *runs* a mission-critical IT shop, but I'm close enough.
And, believe me, in terms of "bad IT days", data corruption is one of the biggies. Especially if your backups have the same problem. Usually -- unless superhero efforts are applied -- this means that you're gonna lose some data. Thankfully, we didn't lose too much.
So, what lessons did we learn?
#1 -- The Impact Was Stunning
Len wrote a great post on "The Air That I Breathe", and I think it's an apt analogy.
All day long, it was hard for many of us to get business done, simply because the platform wasn't available. It was pretty much in the same league as "email unavailable".
So, at what point did this social platform go from "nice to have" to "need to have"? There wasn't a defined point that I can see; it just kind of snuck up on us.
People were resilient, and adapted -- that's what we all do anyway. But the outage had a huge impact on a lot of people's workdays, and it didn't do anything to help establish confidence in the platform.
#2 -- At Some Point, Declare Your Social Platform As Mission Critical
We didn't do that.
As a result, we didn't get the same operational procedures that EMC's top-tier applications get. I'm *not* blaming the IT guys -- they have a scheme for categorizing applications, and ours wasn't in the appropriate tier.
Why does that matter? More scrutiny and extra effort are applied to make sure that the application is always available -- and usually at significant additional cost.
Some of the investments that top-tier applications get include:
- advanced test, dev, and staging environments to allow quick rollback if there's a problem
- snapping off disk copies of your database and running consistency checks before they go to tape or another backup device
- HA failover of servers, storage -- or even physical locations!
- maintenance at off-hours, rather than prime time
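That second item -- checking a snapshot *before* it becomes the backup -- is the one that would have caught our problem, and it can be sketched in a few lines of shell. To be clear, this is a minimal illustration, not anyone's actual procedure: the checksum comparison stands in for a real database consistency checker, and all the paths and filenames are invented for the example.

```shell
# Sketch: snap off a copy of the "database", verify the snapshot, and
# only then archive it. A real shop would use an array- or volume-level
# snapshot and the database's own consistency-check utility; the
# checksum here is just a stand-in.
set -eu

WORK=$(mktemp -d)
printf 'application data\n' > "$WORK/db.dat"   # stand-in for the live database

cp "$WORK/db.dat" "$WORK/db.snap"              # point-in-time "snapshot"

# The consistency check runs against the snapshot, not the live copy
live=$(sha256sum "$WORK/db.dat"  | cut -d' ' -f1)
snap=$(sha256sum "$WORK/db.snap" | cut -d' ' -f1)

if [ "$live" = "$snap" ]; then
    tar -czf "$WORK/backup.tar.gz" -C "$WORK" db.snap
    echo "check passed: snapshot sent to backup"
else
    echo "check FAILED: snapshot quarantined, backup skipped" >&2
    exit 1
fi
```

The point is the ordering: the check happens before the copy becomes the backup, so a corrupted snapshot never silently replaces your last good one -- which is exactly the trap we fell into.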
Well, now we have a case to elevate the category, so to speak.
And probably a willingness to spend more $$$ to keep this from happening again.
#3 -- Vendors In This Space Will Need To Revisit Their Processes
EMC sells mission-critical hardware and software for a living. We know what top-tier customer support looks like -- it's an integral part of our business.
You never can get good enough at this stuff, trust me.
Now, we're not blaming anyone here, but I think it's safe to say that we were exercising our software vendor's support processes in a unique and unexpected manner. We had 10,000+ users down, and things were pretty bleak there for a while.
Everyone pitched in and helped once an emergency was declared, but it was pretty clear that it was an immature process, relatively speaking.
If you're a vendor in this space, and you're convincing customers that your product is essential to their business -- and a customer does what you told them to do and now has their entire company running on your stuff -- you're going to have to start thinking like a mission-critical vendor, and invest appropriately.
Everything breaks now and then -- it's what technology does.
What can't break are the service and support processes: problem escalation, expert triage, advanced notice of potential problem areas, proactive preventative fixes ... the whole ball of wax.
I've Seen This Movie Before
I've been talking to IT shops for a very long time, and I'm thinking back to five or so years ago when I was trying to convince them that email wasn't a "nice to have" anymore, it was becoming a "need to have".
From a technology perspective, that meant things like having enough performance, ensuring continuous availability, guarding against application or data corruption, archiving and retention strategies, making email available to mobile users, securing the environment, etc.
Most people thought I was crazy at the time; a few listened. But that's the world we now live in with corporate email.
Funny -- that sounds like just the same list of things we're gonna have to do with our social platform.