One of the most basic things you can do with information is move it or copy it.
It goes by many names: replication, backup, recovery, migration, cloning ... the list is endless.
In the spirit of Eskimos having many words for "snow", we in IT have many words to describe the same basic process -- making a copy of information.
The landscape here has changed dramatically in just a few months -- there are entirely new categories of functionality -- and use cases -- springing up. And there's convergence going on -- what were once separate discussions are now getting intermingled and combined in very tangled ways.
Since EMC has always had a strong capability in replication (one that's still growing rapidly!), I thought I'd use this post to unpack some of the big themes and ideas, and describe how many of the conversations are merging.
In the beginning
Way back when, the way you made a copy of information was to use a copy command. Copy the information from here to there, or from disk to tape (e.g. backup). When networks became prevalent (yes, I came from a world before networks), it was the same model.
It used server resources. It was pretty manual and error-prone. It wasn't automatic or continuous. It wasn't particularly fast. Over time, lots of different ways emerged to have servers copy data: better tools, putting it in the file system, or maybe having the database do it.
But, from my point of view, it was server-centric and had its limitations -- and still does. Most importantly, if you needed to replicate a bunch of servers (maybe 10 or more), doing it one server at a time wasn't really practical. And, of course, when a server is moving data, it isn't giving the application all the performance it could.
Replication moves to storage
EMC changed the model in 1994 by introducing storage-based replication (SRDF). Instead of asking every server to move its own data, the idea was to use the power of the storage array to handle the chores for you. The primary use case was business continuance (recovery of IT operations at a remote site in the event of the unthinkable).
Turned out there were a whole lot of other use cases, including data center consolidation, on-line data migrations to new arrays, etc. The alternative was to somehow orchestrate all your servers to replicate the same information at the same time, which just didn't scale too well.
EMC followed that up a few years later with TimeFinder, which did the same sort of copying, but locally, within a single array. The first use case was accelerating backups and recoveries: make a disk-based copy, then dribble it to tape at your leisure, and maybe save the last disk copy for a quick recovery.
Turned out there were a whole lot of other use cases for this one as well, including helping people with application development and testing, creating reporting images of transactional databases, and so on.
And when you combined the two (remote and local) it turned out you could have even more fun either protecting your data, doing more with it, or -- hopefully -- both.
After everyone saw the fun that EMC was having in the marketplace, every storage vendor decided it was a pretty good idea too, and now it's devolved into a FUD-fest as vendors try to differentiate in ever-more-obscure ways [please note ... I still believe that EMC has a clear lead here in many aspects ... and that ain't just marketing hoo-hah].
But, up to just a year or so ago, there were two mainstream choices -- use your servers to move data, or use your storage array.
And then it started to get real interesting ...
Disruptive Change #1 -- Virtual Tape Libraries, or Disk Libraries
As disk costs have continued to fall, and databases have gotten ever-larger, more and more people have been interested in using disk as a backup target, instead of tape. But, unless you were willing to revisit your entire backup process and implement disk-based replication, all you saw was a ton of work (and cost) in getting there.
Enter the disk library (or VTL). The idea was simple: precisely emulate what a tape library looks like (using disk) so that none of the backup applications had to change, but you could get the backup performance -- and, more importantly, the recovery performance -- of disk.
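To make the emulation idea concrete, here's a minimal sketch in Python -- purely illustrative, the class and method names are mine and not any vendor's API. The backup application sees a tape-like load/write/rewind interface, while the bits actually land in plain disk files.

```python
# Toy sketch of the VTL idea: present a tape-like, sequential interface to the
# backup application, but store each "cartridge" as an ordinary disk file.
# Class and method names are illustrative only.
import os

class VirtualTape:
    def __init__(self, path):
        self.path = path              # one disk file stands in for one cartridge
        self._fh = None

    def load(self):
        self._fh = open(self.path, "ab+")   # "mount" the cartridge

    def write_block(self, data: bytes):
        self._fh.write(data)                # sequential appends, just like tape

    def rewind_and_read(self) -> bytes:
        self._fh.seek(0)
        return self._fh.read()              # recovery runs at disk speed, not tape speed

    def unload(self):
        self._fh.close()

class VirtualTapeLibrary:
    def __init__(self, directory, slots=8):
        os.makedirs(directory, exist_ok=True)
        self.slots = [VirtualTape(os.path.join(directory, f"tape{i:03}.img"))
                      for i in range(slots)]
```

The point of the sketch: nothing on the backup-application side has to change, because the "tapes" still behave like tapes.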
EMC entered this market a few years back (CLARiiON Disk Library) with one of the first offerings in this space. Last time I checked, we had more market share than all the other folks combined (don't know if this is still true). Today, there are perhaps a dozen vendors with this sort of solution.
But, as everyone knows, one of the things you do with tape is ship it off-site. And if you're using disk to replace tape, well then, it needs to replicate its virtual tape image off-site.
Another use case was born for replication, but with different characteristics.
Disruptive Change #2 -- Intelligent SANs as Replication Engines
Replication -- at scale -- can burn a lot of CPU cycles. One of the advantages that large storage arrays have here (at least EMC ones) is a surplus of CPU to throw at the problem. For a while, there was some interest in server-style appliances that could do replication, but they have their limitations.
The emergence of intelligent SANs (dedicated ASIC logic for frame-level processing) not only opened the door to storage virtualization (EMC's Invista, plus other similar products that are slowly coming to market), but also created a new home for replication functionality.
The example here is EMC's RecoverPoint. It can use the SSM (Storage Services Module) in a Cisco MDS to act as a selective data splitter, and do so at near-wire speeds. Scalability is good. If you're doing remote replication, you'll need an IP connection anyway, so there are some performance and cost synergies there.
Certain interesting forms of replication (e.g. async) require the engine to save state: what's been copied, what hasn't and so on. Sounds like a simple problem, but when you're replicating 20TB that's a lot of state that changes quickly and has to be preserved.
Turns out that the best approach here seems to be to use an out-of-band enhanced appliance to act as a traffic cop to save state, direct the intelligent SAN on what to do, and so on. Best of both worlds.
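To give a feel for what that "state" looks like, here's a toy sketch -- my own illustration, not RecoverPoint's design -- of a dirty-region map: the bookkeeping an async engine has to keep current so the traffic cop knows what still needs to be shipped.

```python
# Minimal sketch of async replication state: which regions of the volume have
# changed since they were last shipped to the remote side. A real engine tracks
# this at much larger scale and persists it; this only shows the idea.
class DirtyRegionTracker:
    def __init__(self, volume_size, region_size=1 << 20):   # 1 MB regions
        self.region_size = region_size
        nregions = (volume_size + region_size - 1) // region_size
        self.dirty = bytearray(nregions)     # 1 = changed since last replication pass

    def record_write(self, offset, length):
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        for r in range(first, last + 1):
            self.dirty[r] = 1

    def regions_to_ship(self):
        # the out-of-band appliance walks this list and tells the splitter
        # which regions to send to the remote side next
        return [r for r, d in enumerate(self.dirty) if d]

    def mark_shipped(self, region):
        self.dirty[region] = 0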
Putting replication in the network has some other advantages as well. First, it's largely storage-independent. One of the problems had been that if you wanted advanced remote replication, you'd have to buy the same product at both ends. Now you've got all sorts of interesting use cases where the disk at one end can be very different than at the other.
The second architectural advantage is that it's a lot easier to implement consistency (aka "congroups" in the patois). If you have multiple applications that are related, you want to make sure they don't get out-of-sync at the other end. You need someplace that knows about all the participants, all the replication, and can make sure A doesn't get ahead of B. Otherwise, nasty data salad when you recover.
Today, a lot of vendors put that particular feature in the array. EMC has taken the additional step of offering coordinating software, running on servers, that can synchronize across multiple (albeit EMC) arrays if needed. Putting this functionality in an intelligent SAN device means that orchestrating across multiple applications and multiple arrays gets easier -- until you need multiple switches, that is!
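Here's a rough sketch of the congroup idea itself: a single coordinator stamps every write across the member volumes with one global sequence, so the target applies them in exactly that order. The classes and the target.write() interface are hypothetical, just to show the principle, not any product's implementation.

```python
# Illustrative consistency-group coordinator: one global write ordering across
# all member volumes, applied in that order at the target.
class ConsistencyGroup:
    def __init__(self, volumes):
        self.volumes = set(volumes)
        self._seq = 0
        self._journal = []          # (sequence number, volume, offset, data)

    def capture_write(self, volume, offset, data):
        assert volume in self.volumes
        self._seq += 1              # one global ordering across every member volume
        self._journal.append((self._seq, volume, offset, data))

    def apply_to_target(self, target):
        # drain strictly in sequence order, so the remote image is always a
        # crash-consistent view of the whole group -- never "A new, B old"
        for _, volume, offset, data in sorted(self._journal):
            target.write(volume, offset, data)   # target is a hypothetical remote interface
        self._journal.clear()
```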
Now, in all fairness, I can't put the maturity of a relatively new EMC product like RecoverPoint up against the industry's benchmark SRDF. This is more about trends and directions, not about what's here today.
Disruptive Change #3 -- Continuous Data Protection, or CDP
For years, there have been three basic modes for remote replication: (1) point-in-time -- a consistent image as of, say, 3AM, (2) asynchronous, which means you're a bit behind the source application, but not too far, and (3) synchronous -- no chance of data loss between source and target.
As you go from 1 to 2 to 3, the protection levels go way up, but so do the network and related costs. Big customers who approach it right will use the same infrastructure (e.g. SRDF) to do all three, and change it around as requirements shift.
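For the curious, here's a stripped-down illustration of the trade-off between modes (2) and (3). Everything here is hypothetical, not any product's logic: the difference boils down to when the application gets its write acknowledged.

```python
# Sketch of the sync/async trade-off. The local/remote objects and their
# write() methods are placeholders for illustration only.
def write_synchronous(local, remote, offset, data):
    local.write(offset, data)
    remote.write(offset, data)      # wait for the remote array before returning
    return "ack"                    # zero data loss if the source site dies right now

def write_asynchronous(local, remote_queue, offset, data):
    local.write(offset, data)
    remote_queue.append((offset, data))   # shipped to the target in the background
    return "ack"                    # fast, but anything still queued is at risk
```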
But now there's a fourth model -- CDP. It's basically TiVo for your data center -- you can rewind any core application to any point in time, if you like. I think it's wildly misnamed (maybe "incremental replication" would be a better term), but it's too late for that now.
The idealized problem it solves is massive database corruption. The traditional database approach is to have a clean, checkpointed image on hand (maybe using local or remote replication), and keep a log file around to replay it if needed.
Works up to a point, but if you're dealing with multiple applications, or can't wait for a bazillion log entries to play through since your last checkpoint, you're interested in this technology. The idea is that you can view multiple virtual recovery points (maybe hundreds or thousands), pick the one you want, and quickly recover back to that point.
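A toy sketch of that idea, under the assumption of a simple block-level write journal -- again, my own illustration, not how any shipping CDP product is built:

```python
# Sketch of CDP: journal every write with a timestamp, then rebuild the volume
# image as of any chosen moment by replaying the journal up to that point.
import time

class CDPJournal:
    def __init__(self, baseline: dict):
        self.baseline = dict(baseline)   # last clean checkpoint (block -> data)
        self.entries = []                # (timestamp, block, data), append-only

    def record_write(self, block, data):
        self.entries.append((time.time(), block, data))

    def image_as_of(self, when):
        # replay only the writes that happened up to the chosen recovery point
        image = dict(self.baseline)
        for ts, block, data in self.entries:   # journal is already in time order
            if ts > when:
                break
            image[block] = data
        return image
```

In practice the history window is bounded and the journaling happens at the block level in the data path, but the "pick a point, replay up to it" mechanic is the heart of it.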
Will it set the world on fire? No, at least not initially. But it's an important capability that more and more customers are thinking about, especially as allowable recovery time windows get ever-shorter, and databases get ever-larger.
EMC's offering in this area is RecoverPoint (yes, the same one which uses the intelligent SAN, described above). It does the typical replication modes, and also offers CDP. An example of convergence -- CDP becomes part of the broader replication discussion -- and, in this case, moves to the intelligent SAN.
Disruptive Change #4 -- Data De-Duplication
Sooner or later, everyone realizes that there's a huge amount of redundancy in the information we store, manage and protect. And one of the hottest areas today is around finding and eliminating these redundancies to lower costs and speed things up.
EMC's first foray into this space was with Centera a few years back. As part of its CAS model, it introduced the idea of a single-instance store: if two people stored the exact same object, it would detect that, and handle it transparently. Big win.
Later, we added the same functionality to our email management product (emailXtender) to handle those massive PowerPoints we all email around to each other. There's similar functionality for the file system world as well.
But there was an opportunity to do more -- break each file (or object) into "chunks", and look for redundancy at the chunk level. Maybe I picked up a PowerPoint and just changed the name of the presenter -- the rest stayed unmodified. Or someone changed the access date of a file and nothing else.
The biggest use case for this seemed to be backup and recovery -- the normal approach of using incrementals just couldn't detect and handle this. And even though I discount the breathless hype of some of the smaller vendors in this space, there's something to this.
EMC acquired Avamar to meet this need. Unlike most data-dedupe approaches, it's client-side. That means all the heavy lifting is done BEFORE the information is sent to the target device (disk in this case).
Now, we could argue architecturally about where it's better to do the de-dupe work, front end or back end, and where the cheaper CPU cycles might be. But add a long-distance network into the equation (a client backing up to a remote site somewhere -- the dominant model going forward) and it's pretty much decided in my mind what customers will want.
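For illustration, here's roughly what client-side, chunk-level de-dupe looks like in miniature. I've used fixed-size chunks and a made-up send() callback to keep it short; real products (Avamar included) use more sophisticated variable-length chunking, which this doesn't attempt to show.

```python
# Client-side, chunk-level de-duplication sketch: split the data into chunks,
# hash each one, and only send chunks the target hasn't already stored.
import hashlib

CHUNK_SIZE = 64 * 1024

def chunks(data: bytes):
    for i in range(0, len(data), CHUNK_SIZE):
        yield data[i:i + CHUNK_SIZE]

def backup(data: bytes, seen_hashes: set, send):
    """seen_hashes: digests already at the target; send: ships one new chunk."""
    recipe = []                           # how to rebuild this backup from chunks
    for c in chunks(data):
        digest = hashlib.sha256(c).hexdigest()
        if digest not in seen_hashes:     # the heavy lifting happens client-side,
            send(digest, c)               # BEFORE anything crosses the WAN
            seen_hashes.add(digest)
        recipe.append(digest)
    return recipe
```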
BTW, Avamar has the same "look at all the past instances at once, choose your favorite" type of feature that RecoverPoint offers. Another opportunity for convergence.
Where does that leave us?
If you've made it this far, you're probably looking for the big "what does it mean?" wrap-up. Me too.
So let me summarize the big points:
- The use cases for replication are converging
Every organization needs different ways to protect different information, but needs a single infrastructure to do it.
Whether it's local or remote replication, sync vs. async vs. CDP, backup to disk vs. backup to tape, data-dedupe vs. single instancing vs. compression, aspects of storage virtualization -- you're gonna want all of them sooner or later, but you won't want a complex landscape of different point solutions.
One backbone that does it all would be nice.
- Sooner or later, it's all going to disk
I don't want to get into the whole tape-is-dead debate. It's not dead, it's just becoming increasingly unattractive.
Even the tape vendors have figured it out.
- The logical place to do most of this is in the network
No one likes a lot of goop on their servers: agents, tasks, proxies, etc.
And, at the same time, making this stuff work across multiple storage arrays is architecturally very complex without the notion of an intelligent network device.
Network processors are designed for low latency, high bandwidth, great scalability and bulletproof reliability. Much nicer shopping from that side of the parts aisle than the server appliance side.
Bottom line -- I think over the next few years, you'll see more of the new functionality -- and converged use cases -- showing up as an adjunct to intelligent SANs or networks. It isn't going to happen all at once, but -- slowly, over time -- it seems as inevitable as global warming.
So, with all of that, maybe it puts all of the industry (and EMC!) activity into a bit of perspective. Hope you were able to follow it all -- thanks.