OK, it's no secret, I'm a *huge* VMware fan. They've been able to take their underlying strength in hypervisors, and move into all sorts of useful adjacent areas: desktop, software development, systems management and so on.
For quite a while, I've been tracking SRM -- Site Recovery Manager -- as another one of these VMware-based "game changers".
And, today, it's been announced.
So why do I think this is such a big deal?
The Basics
To understand why SRM is so compelling, you'll have to understand a bit about remote replication.
Whether it's called business continuity, disaster recovery or remote recovery -- the idea is simple: to be able to run your core IT applications at a remote site should the need present itself.
Decades ago, remote recovery was something only the biggest IT shops could afford; but with every passing year, costs have come down to the point where organizations large and not-so-large are interested in a remote data center that's able to run the business should it be needed.
And, simply put, SRM drives down many of the costs associated with remote recovery.
If you're already doing remote recovery, you can keep doing what you're doing for potentially less money, or decide to protect additional applications, or provide better protection for what you already have, or any combination of the above.
And, if you're not doing remote recovery (but want to!), maybe SRM has made the idea workable for your organization.
Costs Of Remote Recovery
Let's face it -- when you're considering remote recovery, you're usually signing up for an expensive proposition:
- additional servers and storage (plus a data center to put them in!)
- network bandwidth to replicate data from source to target as it changes
- and a ton of continuing effort to ensure that your environment can recover gracefully
Sure, there are other costs involved, but -- just for the sake of discussion -- let's keep it to these three biggies.
Simply put, SRM is a management package for VMware that understands what you're trying to do with remote recovery, and leverages ESX's properties to do all of this far better than we could ever do in the physical world.
Fewer Resources
One of the big costs when considering remote recovery is the "Noah's Ark" phenomenon -- if you have 45 servers of different flavors at your primary data center, well then, you're probably going to want a precise copy of the same 45 servers at your recovery data center.
Add a server at the primary, add the same one at the secondary. Upgrade or patch an OS at the primary, better do the same at the secondary.
In the physical world, you wanted your recovery site to be a precise twin of your production site -- since any difference (no matter how minute) had this ugly way of getting you in trouble when you tried to recover, besides being more difficult to manage. A lot of people put a lot of effort into making sure the two ends look as similar as possible.
ESX breaks that barrier. Since VMware provides an abstraction from the physical hardware, we've had many customers who run significantly different server configurations at each end. At the primary end is the newer, bigger, faster stuff -- and, at the remote end, is the older, smaller and slower stuff.
The ESX hypervisor isolates any hardware differences from the applications. Applications and their supporting software can't tell what they're running on -- it's all been abstracted.
Turns out that -- depending on how large your environment is -- this humble capability can be a pretty big deal.
Network Bandwidth
Getting the network issues sorted out between primary and secondary can approach true "rocket science" in larger environments.
Now, VMware's SRM doesn't actually try to replicate data from primary to secondary -- it uses an abstraction layer to communicate with an underlying replication engine from another vendor.
That's good, because there are dozens of different flavors of data replication out there from different vendors (including a whole slew from EMC), and to attempt to say that there's one "best" way to do it means you don't really understand the problem.
For example, there are different flavors of synchronous (data is committed at the remote site before proceeding). Many flavors of asynchronous (short to long gap between updates at primary and updates at secondary). Point-in-time (data is current as of a specified time stamp). CDP (continuous data protection -- the ability to rewind to an arbitrary point).
And, of course, all the combinations of the above, including things like multi-point (more than two locations), consistency groups (making sure multiple apps never get out of sync), local/remote scenarios, and much more.
It all can get very complicated very, very quickly.
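To make the synchronous-versus-asynchronous distinction concrete, here's a minimal toy sketch in Python. It is not how any real replication product works internally; the `Write` class, the `surviving_writes` function and the fixed `lag_seconds` are all made up for illustration. It only shows what ends up at the remote site when the primary fails.

```python
from dataclasses import dataclass


@dataclass
class Write:
    t: float       # seconds since start when the primary accepted the write
    payload: str


def surviving_writes(writes, failure_time, mode, lag_seconds=0.0):
    """Return the writes the remote site holds when the primary fails.

    mode="sync":  a write is acknowledged only after the remote has it,
                  so every write accepted before the failure survives.
    mode="async": the remote trails the primary by lag_seconds (a
                  simplifying assumption), so writes accepted in the last
                  lag_seconds before the failure are lost.
    """
    if mode == "sync":
        cutoff = failure_time
    elif mode == "async":
        cutoff = failure_time - lag_seconds
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [w for w in writes if w.t <= cutoff]


# Ten writes, one per second, at t = 0..9; the primary fails at t = 9.
writes = [Write(t=float(i), payload=f"txn-{i}") for i in range(10)]
sync_copy = surviving_writes(writes, failure_time=9.0, mode="sync")
async_copy = surviving_writes(writes, failure_time=9.0, mode="async", lag_seconds=3.0)
print(len(sync_copy), len(async_copy))
```

In this toy model the synchronous copy holds all ten writes, while the asynchronous copy with a three-second lag holds seven. Real products trade that gap against distance, latency and link cost, which is a big part of why there are so many flavors.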
To work with SRM, participating vendors have to create an adaptor between their replication product and VMware SRM, plus do a boatload of testing and certification -- more on this below.
I think VMware did something very smart here -- rather than try to solve all needs with their own replication product, they partnered with industry and added value to all the stuff that was already out there.
Oracle -- are you listening?
Time And Effort
Most people underestimate the time and effort required for a successful remote recovery environment.
Of course, there's the up-front effort of just making the stuff work. You've got applications and their relationships to discover, you've got some traffic sizing to do, you've got to get the remote side provisioned and configured, shake out your network link, and -- of course -- test your recovery at the other side to the point you're absolutely confident with it.
I've seen that SRM's abstraction of applications as virtual machines makes all of that significantly easier. Virtualized apps are self-contained, you've got additional tools to measure how they're changing information (a key factor in sizing any network link), it's easier to provision and configure the remote side, and -- of course -- failover testing can be done in a fully virtualized environment -- at the same time production applications are running normally.
And it's this last feature that's the uber-win, in my mind.
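A quick aside on the link-sizing point above, since it's just arithmetic: once you know how much data each VM changes per day, you can get a rough average bandwidth figure. The sketch below is a back-of-the-envelope calculation in Python. The per-VM change rates and the 1.3 headroom factor are invented for illustration, and it assumes decimal gigabytes and asynchronous replication spread evenly over the window.

```python
def required_mbps(changed_gb_per_day, replication_window_hours=24.0, headroom=1.3):
    """Average link speed (megabits/s) needed to ship a day's changed data.

    changed_gb_per_day: data changed per day, in decimal gigabytes (10**9 bytes)
    replication_window_hours: hours per day the link is available for replication
    headroom: multiplier for bursts and protocol overhead (an assumed value)
    """
    bits = changed_gb_per_day * 8 * 10**9
    seconds = replication_window_hours * 3600
    return bits / seconds / 10**6 * headroom


# Hypothetical measured change rates per VM, in GB/day.
vm_change_gb = {"exchange": 40.0, "sql": 25.0, "web": 5.0}
total_gb = sum(vm_change_gb.values())
print(f"{total_gb:.0f} GB/day -> {required_mbps(total_gb):.1f} Mbit/s")
```

Treat the result as a floor, not a design. Synchronous replication has to be sized to the peak write rate rather than the daily average, and compression or deduplication in the replication product will change the numbers.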
The Challenge Of DR Testing
The old saying goes that any DR plan is only as good as your last (successful) test. Indeed, most DR efforts fall over in one of two ways: the team either spends an inordinate amount of effort doing continual testing (not good), or -- worse -- neglects to test often enough, the result being an insurance policy that doesn't pay off when you really need it.
What makes DR testing so hard? It's usually very disruptive and very complex.
If the equipment at the remote end is being used for something else (usually the case), it has to be brought down and brought up again in the new configuration. That can be disruptive for whatever work's being done at the remote site.
Not to mention reconfiguring servers, storage, networks, etc. And, of course, running some tests to make sure that everything's up and accessible -- should this have been a real emergency.
The uber-cool thing I saw in SRM was the ability to create a 'virtual remote recovery image' -- storage, virtual machines, even network connections -- and test this as a walled-off virtual entity -- perhaps even co-resident with production apps that might be running at the remote site.
This ability to logically encapsulate and test a complete recovery scenario in a set of virtual machines is simply huge, in my book. You can do testing easier, better and far more frequently than is possible in a physical environment.
It's that simple.
So, What's EMC's Angle On All Of This?
Glad you asked. When we first started hearing about SRM, we quickly realized this was potentially big stuff for our customers, and we knew we wanted to play bigtime. From a purely personal perspective, this seemed like one of the biggest game-changers in the remote recovery space in a very long time.
As a result, we've been taking SRM very seriously for quite some time.
For vendors to play with VMware's SRM, the first step is to write an adapter between your remote replication product and SRM. We did just that for DMX (SRDF et al.), Celerra (files), CLARiiON (MirrorView) as well as RecoverPoint.
RecoverPoint is particularly interesting in this discussion for several reasons: first, it provides CDP (continuous data protection) in addition to more traditional flavors of local and remote replication, and -- second, it's relatively agnostic to the underlying storage platform: EMC's or someone else's.
Just like our Avamar backup product has an extremely strong affinity for VMware in the marketplace, RecoverPoint is turning out much the same for local and remote replication.
Keep in mind, not every remote replication product has programmatic access to its features, so this integration may be especially hard for certain vendors, although I'm sure we may hear words to the contrary.
Next, you've got to do a bunch of production-level test and qual with VMware, and have them review the results. Not a big deal for EMC, given our usual enterprise-class approach to such things. Might be a bit harder for other vendors, though.
But we've gone farther. We've established specific professional services to help customers who want to either move to a virtualized SRM replication environment, or -- perhaps -- consider remote replication for the first time. These services are a natural evolution to our decade-plus history of working on large business continuity projects.
And, finally -- we've extended the solutioneering work we've done already with Exchange, SQL Server, Oracle, SAP et al. running in VMs and now added remote replication to the characterization suite, making sure we can understand and account for all the application-specific nuances that you find in both physical and virtual environments.
If you're ready to go with SRM, so are we.
The Bottom Line
Virtualization is a game-changer for IT -- I think we're all seeing that -- and VMware is leading the parade by a comfortable margin. Now you've got yet another compelling reason to add to your existing list for virtualizing your production environment -- better, faster and cheaper remote recovery.
And, if you're a remote recovery expert, I'd suggest you get a quick tutorial on just how much SRM and virtual recovery techniques change the game as compared to physical server recovery techniques.
Because, before too long, I think that's going to be what just about everybody wants to do.

Very interesting post. I really enjoyed reading it.
Posted by: Sonny | July 27, 2008 at 04:58 PM
This looks like really cool stuff.
1) For 100% uptime SRM will have to have the built-in ability to ensure end-to-end commit management. This will entail intelligent interaction between the business application software, the database system and SRM - is this already possible/in place for the apps you mentioned?
2) I assume user sessions are not an issue?
Posted by: hennie grobler | September 18, 2008 at 10:08 AM
I think I have answers to this:
1) The "end to end commit management" is not provided by SRM per se, rather it's provided by the underlying replication technology that ensures consistency of multiple transactional databases.
In storage lingo this is called "consistency groups" or "federated databases". Many EMC replication products have supported this notion for a while, and it does work with SRM, although the management constructs may not be 100% integrated at this time.
2) As I understand things, the answer is "it depends". I think in certain scenarios (e.g. web apps) session integrity is maintained. Other user-app protocols may time out and have to be reinitiated.
And now I'm at the absolute limit of my understanding ... :-)
Posted by: Chuck Hollis | September 18, 2008 at 10:28 AM
SRM does look promising. You just need to wait for IT managers around the world to join in on the celebration.
Posted by: software test consulting | July 21, 2010 at 03:21 PM
Dear Chuck,
In conjunction with a DRC site with SRM,
What is your suggestion for cascaded replication (multi-site DRC)? Can we do MirrorView/S between local and bunker, and MirrorView/A between bunker and remote?
Or Celerra Replication V2 between bunker and remote?
Or RecoverPoint between bunker and remote?
Do you have any document comparing the pros and cons of these?
BTW, I am also a fan of VMware :-p
regards,
Taufik Kurniawan
Posted by: Taufik Kurniawan | July 14, 2011 at 03:44 AM
Hi Taufik
Those are good questions, however I'm not even going to try and answer them in the confines of a blog comment.
Designing and deploying an effective yet efficient multi-site DRC scheme is a process that takes multiple conversations and deeper expertise than mine.
This is the sort of thing you don't want to get wrong ...
I will offer up one personal observation, though: the vast majority of large-scale multi-site disaster recovery setups out there use EMC's SRDF.
Sorry I can't be more helpful. The best approach would be to either (a) find and download the plethora of relevant reference materials on EMC.com and Powerlink, or (b) contact an EMC or EMC partner DR/BC specialist.
-- Chuck
Posted by: Chuck Hollis | July 14, 2011 at 08:10 AM