February 11, 2008

Information Infrastructure for DW and BI

Well, it turns out we're finding ourselves in more and more conversations with customers about what they're doing with data warehousing and business intelligence.

Wherever it started, they're coming to the realization that this is now an important part of their landscape.

Whether there are a few warehouses that became business critical, or the darn things are proliferating everywhere, we're now getting asked to help out on a more frequent basis.

So I thought I'd use this post to share what we're seeing, and what we're doing to help.

From Little Acorns Giant Oak Trees Grow

Now, DW and BI have been around for quite a while -- nothing new there.

The idea is pretty simple -- take transactional data from a variety of sources, massage it, and use it to make business decisions.  The DW/BI industry has its own language and terms, but the underlying concepts are fairly straightforward to understand.

What's not as straightforward is how these DW and BI applications have evolved along several important dimensions.

The first DW/BI crowd has been around for a while -- early adopters (retail comes to mind) who were all over this stuff, and have been so for many years.  These are the folks who are pushing the boundaries of the technology.

But not all power users are early adopters.  We've also met more than a few online business models that absolutely depend on DW/BI to make daily (even hourly!) business decisions.

Their warehouses are big and getting bigger.  More performance translates directly into better business decisions.  The output of the DW/BI environment is now business critical -- no significant downtime is acceptable any longer.  "Operational" DW/BI means that the stakes have risen.

And the freshness of information really matters -- they're looking for faster ways to get information into, and out of, their DW/BI environments.  It's evolved to the point where simply mapping the business processes that drive the DW/BI environment (and are driven by it) is a fascinating topic in itself.

And, of course, EMC has been talking to these people for quite a while, because (obviously) there's a lot of storage involved, and topics like performance, cost, manageability, recoverability, security, etc. matter a lot to them.

But we've become acquainted with a few other folks as well.

One group I'm now routinely meeting is the "DW and BI proliferation crowd".  From their perspective, it seems that everyone and their brother wants their own flavor of DW and BI to help them run their part of the business.  They're now facing a random collection of tools, technologies and supporting infrastructure, and they'd like to make life a bit easier for themselves, as well as for their business users.

Another group is the "our DW and BI environment is now really, really important" crowd.  Maybe it didn't start that way, but it became that way.  Like any other "successful" IT application, life comes at you fast!

Performance is now an issue, as is availability, and backup, and maybe remote replication.  These DW/BI environments are essentially large information repositories with some servers attached, so the storage discussion gets interesting for them.

So, What Does EMC Do Here?

A lot, actually.

Let's start with performance.  Not surprisingly, storage performance can be a big factor in overall DW/BI performance.

Most people think they understand the I/O profiles associated with DW/BI, but we've found there's an amazing variety of I/O patterns when we go look at real-world implementations.  Yes, there's a fair amount of sequential bandwidth-type access, but there are a few environments that are very random I/O oriented.

Some people think that write speed doesn't matter with DW/BI, but that isn't true in at least two cases: massive updates of DW/BI environments (they do need to be updated, don't you know?) as well as the formation of large analysis cubes, which can drive a ton of write I/Os.

Our bread-and-butter offering in this space is the CLARiiON CX series -- it has an attractive I/O profile (not to mention other factors!) that makes it very appealing to a broad swath of DW/BI use cases.  Recently, our engineering group did a bunch of optimization work around Oracle, SQL Server and UDB scale-out implementations, resulting in a very attractive "building block" approach where server and storage are scaled in aligned chunks.

But part of the power of EMC's portfolio is our breadth of offerings.  We've encountered some environments where a cache-rich, replication-heavy engine is called for, and that's where the DMX does well.  And more recently, we've been encountering DW/BI environments that are using a file-oriented access method, leading to an MPFS implementation.

Very recently, I've been asked if enterprise flash drives are a big deal in the DW/BI space.  I'd offer that they're not such a good fit for storing a big DW (costs, I/O profile, etc.), but if you've got a huge, I/O-intensive data cube that you build regularly, and the business wants a bunch more performance, it might be worth investigating.

Something for everyone here.

All of that storage goodness won't be much use unless you can get to your information, right?  And we're finding that backup, recovery and replication are becoming more popular topics with the DW/BI crowd.

Here's the problem.  If your DW/BI environment is large, or important, tape probably ain't gonna cut it for you.  Simply do the math on a ginormous DW environment, and if you can't see your users waiting days (or weeks!) for a backup or recovery to complete, you're going to be looking at some sort of a disk-based backup/recovery solution.
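To make "do the math" a bit more concrete, here's a minimal back-of-the-envelope sketch.  The warehouse size and the two sustained throughput figures are purely illustrative assumptions (they aren't measurements of any tape or disk product), so plug in your own numbers.

```python
def transfer_hours(size_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate in MB/s."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput_mb_s must be positive")
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / throughput_mb_s / 3600


if __name__ == "__main__":
    warehouse_tb = 100.0  # hypothetical warehouse size
    # Hypothetical sustained aggregate rates, in MB/s
    for label, rate in [("slower path", 100.0), ("faster path", 1000.0)]:
        hours = transfer_hours(warehouse_tb, rate)
        print(f"{label}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under these assumed figures, the slower path takes well over a week for a single full pass, and a recovery has to move the same data back again.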

Two popular approaches here to consider.  One is to use local replication technology (full copies, not snaps!) to make a second instance, and then stream that off to tape.  The thinking is that, if there's a problem, you've got the last known good image sitting on disk if you need it.

The reason that space-saving snaps aren't as popular as you might think is simple: spindle contention.  Most people want to run a consistency check on the DW prior to shipping it to tape.  That's another process beating on the same spindles that users are trying to access.  The same problem comes when you try and back it up -- you're fighting with users for spindle access. 

Over the last few years, we've seen a fair number of EMC's disk libraries end up in DW/BI environments.  Why?  They're smokin' fast on backups, and -- if you use space-saving incremental backups -- recovery time is just as fast as if you'd done a full backup. 

If you're thinking data-dedupe here, fine, but from what we've seen, performance matters in this particular application, and -- once again -- there's no free lunch with target-side dedupe.

And then there's remote replication.  Now, I know many of you who are reading this are wondering "who would spring for a remote replica of a big, honkin' DW?".  The answer is: more people than you'd think.

The logic is obvious -- many organizations depend on having the DW/BI environment around to make near real-time business decisions.  If it isn't available, it's hard to conduct business.  So the DW/BI environment gets added to the list of "business critical applications" that are candidates for remote recovery and business continuity.

The good news is that async replication generally works just fine here, specifically EMC products like SRDF/A, MirrorView/A and RecoverPoint.  Most DW/BI environments aren't continually updated, and having a backlog of updates to send over the wire isn't necessarily a terrible thing, so incremental bandwidth requirements are pretty moderate.
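A rough way to sanity-check the bandwidth point is to take the amount of data changed by a load and divide it by the time you're willing to let the backlog drain.  The sketch below does just that; the 500 GB load and the drain windows are hypothetical, and it ignores compression, protocol overhead and anything specific to SRDF/A, MirrorView/A or RecoverPoint.

```python
def required_mbit_s(changed_gb: float, drain_hours: float) -> float:
    """Average link rate (Mbit/s) needed to ship changed_gb within drain_hours."""
    if drain_hours <= 0:
        raise ValueError("drain_hours must be positive")
    megabits = changed_gb * 1000 * 8  # decimal GB -> MB -> megabits
    return megabits / (drain_hours * 3600)


if __name__ == "__main__":
    nightly_load_gb = 500.0  # hypothetical volume changed by one batch update
    for hours in (1, 12):    # hypothetical acceptable drain windows
        rate = required_mbit_s(nightly_load_gb, hours)
        print(f"drain in {hours:>2} h: about {rate:.1f} Mbit/s")
```

The longer the backlog is allowed to trickle across, the smaller the link you need, which is why a batch-updated warehouse tends to be a forgiving candidate for async replication.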

ILM Again!

Now, at first blush, you might wonder what ILM could possibly have to do with a DW/BI environment, other than perhaps a creative marketing spin.

Well, we're finding that in many environments, people do a ton of analysis on their data warehouses.  The result is lots and lots of data cubes, reports, test cases, etc. that generally end up in a file system.

Sometimes, dozens or hundreds of terabytes outside the DW proper.  And, of course, no one wants to throw anything away.  Ever.

And, as a result, we've been able to come up with some nifty tiered file serving environments that move less-popular data objects to lower-performing (i.e. more cost-effective) storage, or introduce something like a Centera on the back end for long-term archiving and single instancing.
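The idea behind this kind of tiering can be sketched as a simple age-based rule.  Everything here is an assumption for illustration: the tier names, the 30-day and 365-day thresholds, and the use of last-access time as the only signal.  It isn't how any particular EMC product's policy engine works.

```python
from datetime import datetime, timedelta


def pick_tier(last_access: datetime, now: datetime,
              warm_after_days: int = 30,
              archive_after_days: int = 365) -> str:
    """Pick a (hypothetical) storage tier from how long ago an object was read."""
    age = now - last_access
    if age >= timedelta(days=archive_after_days):
        return "archive"      # long-term, rarely read
    if age >= timedelta(days=warm_after_days):
        return "capacity"     # cheaper, lower-performing disk
    return "performance"      # recently used, keep it fast


if __name__ == "__main__":
    now = datetime(2008, 2, 11)
    for name, last in [("this_weeks_cube", datetime(2008, 2, 1)),
                       ("q4_report", datetime(2007, 12, 1)),
                       ("old_test_case", datetime(2006, 1, 1))]:
        print(name, "->", pick_tier(last, now))
```

A real policy would likely look at more than access age (size, owner, retention rules), but the shape of the decision is the same.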

Maybe this sounds pretty complex to you.  Well, in many cases, that's true.

So, over the years, EMC has developed a DW/BI practice that can help assess the environment, make some performance recommendations, help figure out how to best back it up, or replicate it, and so on.

The Interplay Between The Business and IT

I've met a few customers with a story line that goes something like this ...

Everyone is doing this stuff, and they're picking their favorites.  I've got a Teradata in my shop, and a Netezza, and Oracle DW, and SAP BI, and a ton of SQL Server stuff, and lord-knows-how-many large file systems with this sort of stuff slopping around.

And it ain't fun.

Well, we can't help you with the proliferation of different databases and tools (sorry!) but we can bring some logic and consistency to how you store it, how you manage it, how you tier it, how you back it up, how you replicate it, how you make it perform, and so on.

Not surprising, when you think about it.  The business user sees the application, and doesn't think about the infrastructure requirements.

I'd argue that, in these situations, IT needs to have a rather loud say as to what comes into the shop.  Trust me, most business users of DW/BI aren't exactly worrying about backup, or growth, or managing the environment, and so on.

Advanced Topics

There's a whole list of topics we're getting into -- maybe not everyone's cup of tea, but they might be interesting to you.

Database cleansing -- e.g. providing analytics while protecting sensitive personal information -- seems to be a growth area.

We've encountered more than a few customers who want to encapsulate their data cubes in a secure, portable wrapper and share with others outside the organization, leading to a combination of VMware and RSA solutions.

There's another cluster of customers who are using Documentum and eRoom to build repositories and collaborative workflows around the output of DW/BI environments.  Don't just hammer the database when you want a question answered, take a look and see if someone has already done it!

And probably a bunch more that I'm missing!

The Bottom Line

I think there are some safe predictions I can make here.

First, we'll probably see more DW/BI in the future, and not less.

More and more IT organizations will be asked to provide the same consistent service levels, flexibility, operational efficiencies -- and information security -- for DW/BI as they've been asked to provide for their other business-critical applications.

And I think we'll be having even more conversations with customers about this topic in the future.

Comments

Chuck,

I appreciate that you raised this topic.

I would like to add that there is a joint solution from EMC and Fujitsu Siemens Computers in the SAP BI space that is running very successfully.

A short overview can be found here:

http://server-uk.imrworldwide.com/cgi-bin/b?cg=COM-complete&ci=siemensfujitsu&tu=http://www.fujitsu-siemens.com/Resources/43/830666871.pdf

Hi Hermann

Yes, you're quite right! Part of my team worked with the FSC team to make this happen, and I hear it is very successful.

Thanks for the comment!

Chuck Hollis


  • Chuck Hollis has been with EMC for 12 years, and is Vice President of Technology Alliances at EMC. He frequently speaks to customer audiences about a variety of technology topics, and can usually be counted on for an interesting point of view. He lives in Holliston, MA with his wife, three kids and two dogs when he's not travelling. Chuck enjoys piano, mountain biking and skiing -- in that order.
