One of the most difficult questions I get asked by customers is painfully straightforward: how should we manage the explosive data growth going on in our organizations?
I wish I had a simple recipe -- something like: do these three ridiculously easy things, and you're set.
No such luck.
You'd probably think that EMC would have some sort of semi-official position paper with deep insights on this thorny topic, but -- alas -- that's not the case either. Truth be told, we tend to be somewhat divided amongst ourselves on the topic.
So, in an effort to be helpful, let me share with you how I'm answering the question when asked.
What Makes This So Hard
I've had a long time to think about this one. Every few years, it seems, my perspective evolves. It probably will evolve again before long.
But one of the things I've learned along the way is that the best way to understand a problem is to get as deep as possible into root causes -- and to do so philosophically, i.e. without blaming one party or another for the woes at hand. Things are the way they are for a reason. Understand the reasons, and you'll have a good idea what the answer needs to look like.
So here's my best attempt at root cause analysis -- and it's much more than "we're generating a lot of data".
Enterprise Information Tends To Have Uncertain Value
For me, one major root cause of this problem is dead simple: while it's clear that data has intrinsic value (otherwise we wouldn't capture it), there's no easy way to get at a standard measure of data value.
Because we don't know a given data set's true value, we're not quite sure what to do with it, or -- more to the point -- how much effort and resource to spend on storing it, protecting it and so on. That inherent uncertainty leads to all sorts of ugly consequences.
Now, of course, there are many examples where business people have a fairly clear view of the value of certain data elements -- and are thus more willing to devote resources to classifying it, storing it, managing it, protecting it and making it available. Since those specific data sets have a well-understood value, there's far less debate about what to do.
But those tend to be the exceptions, rather than the rule.
Unfortunately, the vast majority of enterprise data sets have highly subjective value, and -- worse -- there's no one in charge of assigning a more precise value. That means that there's also a great deal of uncertainty around whether to keep it around, throw it away, archive it, and so on.
Over time, I see this clearly leading to all sorts of lengthy debates around familiar topics: we have too much data, we're spending too much on storage and data protection, how come we can't delete most of it, etc. etc.
Conversely, when the value of a given data set is well understood, there's far less debate. But since the process of determining a given data set's value is itself an expensive proposition (think cross-functional teams, meetings, reviews, ongoing process, etc.) sometimes it's just simpler and easier to throw more capacity at the problem.
Is it any wonder why I work for a storage vendor?
Split Accountability For The Problem
Another important source of difficulty is what I call split accountability across different functions.
Simply put, the people who generate and use the data -- and the IT people responsible for storing and protecting it -- often work in two completely separate organizations -- different goals, different missions, different priorities, etc.
Most business people have only a dim awareness of just how expensive and difficult it is to protect and manage large quantities of data.
Besides, isn't that storage stuff really an IT problem?
True costs are rarely exposed. Options and choices are usually poorly understood by those generating and using the data. Inefficiency is the inevitable result. IT budgets and staff get strained. Frustrations grow and grow.
And, over time, it gets worse and worse until there's some sort of flashpoint that forces a heated discussion.
Occasionally these teams come to our briefing center and want very much to blame us for their woes. We're sympathetic and empathetic -- up to a point -- but at some point they've got to take accountability for their own challenges. It isn't our data, folks.
Externalizing your frustration to this vendor or that vendor might make you feel good -- for a while -- but doesn't really change things.
Information Value Has Changing Context Over Time
Once you delete data, it's gone. It's gone for good -- or should be. It's a permanent and largely irrevocable decision. Sure, there are certain situations when you want specific data elements gone for good, but those tend to be the exception, rather than the rule.
The challenging observation is that -- more and more frequently -- old data becomes more valuable over time in new contexts.
As only one tiny example, consider the current revolution in big data analytics and predictive models -- a vast consumer of historical data sets.
Bring in a data scientist, and one of the first things they're going to ask for is access to as many historical data sets as possible. You did keep them, didn't you?
In this digital world, you rarely want to throw anything away if you can avoid it. Put differently, a data set of minimal present value might become much more valuable in some future (and unforeseen) context.
And, as we all know, once data is gone, it's gone. So there's a continually increasing strategic incentive to keep it around -- just in case.
I Feel The Need To Also State The Obvious
Let's also not forget the painfully obvious: as individuals and organizations, we're generating an awful lot of data, and the rate of generation appears to be increasing exponentially -- not linearly.
Interesting exercise: a few months ago, I totaled up all the storage I was using (home and work) for all of my personal stuff -- files, content, etc. 13TB. I was floored. Just don't ask me to delete any of it, OK?
Also -- just to be obvious -- don't be looking for a simplistic rescue from product technologies, especially if you're in a larger organization. No matter what the vendor is telling you :)
Yes, storage tech is getting cheaper -- a lot cheaper. But aggregate storage expenditures continue to increase year-over-year: the cheaper it gets, the more people consume and spend. That's been true for many decades, and shows no signs of changing.
You might be tempted to think that something like dedupe, or thin provisioning, or some other cool storage efficiency technology might be "the answer". Sure, increasing storage efficiency helps, but demand always has this way of filling up available capacity. It was true 20 years ago in the storage business, it's true today, and it's a safe bet it's going to be true 20 years from now.
The bottom line? Great technology can help, but it isn't an answer unto itself.
And, finally -- just to be balanced -- there are many well-managed data sets out there: ones where there are clear business owners, data is reasonably classified, business value is well understood, processes are in place, and so on.
But those data sets aren't really what's causing the major headaches -- it's all that other stuff.
Time For A Quick Checkpoint
So consider the typical IT team ...
Multiple floods of corporate data are coming at them at an ever-increasing pace. Of course, they're expected to store, protect and make available these data sets on demand -- but do so with budgets and staffing that aren't keeping pace, and are unlikely to change anytime soon.
Business users aren't engaged. IT folks are getting justifiably frustrated. Nobody's listening. IT people resent being put into the lose-lose position of being forced to decide what's important and what's not.
Heck, shouldn't that be a business decision?
So let me share with you what seems to work.
Start With A Simple Concept: The Storage Service Catalog
The idea of a storage service catalog is nothing new; it's been around for quite a while. These discussions seemed to be all the rage five or more years ago, but have since become less frequent. No, I don't know why that is.
The concept is devastatingly simple: create a number of standardized storage service buckets (performance, availability, protection, cost, etc.).
Expose those services (and costs) in such a way that they're incredibly easy to understand and consume.
Above the service catalog abstraction, create every incentive possible for people to easily understand their choices, intelligently consume the service, and automagically move their stored data from premium levels to the lowest tolerable service level as a default.
Below the service catalog abstraction, use the best available technology and process to deliver the required service level at the lowest total cost: hardware, software, energy, labor, etc.
Just to be clear, storage service levels always have abstract names like Gold, Silver, Platinum and not Thinly Provisioned 300GB 15K RAID5 or perhaps 10Gb Deduped NAS. Once you start exposing the specific technology you're using, you're toast, 'cause everyone is going to have an opinion :)
When data needs to be stored, it's assigned to one of the standard storage service levels, which should ideally change over time as data ages. Any organizational concerns (security, compliance, data protection, geographic location, etc.) are simply baked into the service definition. The beautiful thing about this is that there's no way to consume the storage service without the other stuff being embedded.
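Just to make the abstraction concrete, here's a minimal sketch (in Python) of what a catalog definition might look like. Everything below -- the tier names, the attributes, the prices -- is purely hypothetical, not anything we publish:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StorageService:
    """One bucket in the catalog -- defined by service attributes, never by technology."""
    name: str                   # abstract tier name (Gold, Silver...)
    iops_target: int            # performance promise
    availability: float         # e.g. 0.9999 uptime
    retention_days: int         # data protection promise
    cost_per_gb_month: float    # fully loaded: hardware, software, energy, labor

# Hypothetical tiers -- your attributes and prices will differ
CATALOG = [
    StorageService("Platinum", 50_000, 0.99999, 365, 0.90),
    StorageService("Gold",     20_000, 0.9999,   90, 0.45),
    StorageService("Silver",    5_000, 0.999,    30, 0.15),
    StorageService("Bronze",      500, 0.99,      0, 0.04),
]
```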
Got the over-simplified picture? Now let's dig a little deeper ...
The Bifurcated Storage Team
If you think about this concept, it implies that there are actually *two* primary storage functions.
The first team is familiar -- they're behind the scenes in the storage factory: working diligently to improve service levels, standardize workflows, speed requests, drive down the cost and efforts associated with delivering the storage service catalog, and so on. Conceptually, their mission is very straightforward.
They are all about service delivery -- not service consumption. They can -- if they choose -- measure their performance against external storage services, of which there are literally dozens.
But -- in this model -- there's also a small team out in front of the service catalog: making sure people understand it, driving policies around its aggressive usage, making sure the buckets fit 90% of business requirements, reporting back to various stakeholders on service consumption and service levels delivered, and so on.
In essence, they're the internal "go to market" for the storage service.
It's that last key bit that's missing so often, and can lead to extreme dysfunction. Without someone to drive intelligent consumption, expose choices and costs, create new services, and so on -- there's no engagement or interaction with the people consuming storage across the organization.
As a result, people using the service make all sorts of poor choices: too much, too little, etc.
One ugly outcome is that storage quickly becomes the quintessential "free" corporate resource. And we all know what happens when something expensive and valuable is perceived as "free".
Now, take these two core functions -- and add in someone with an architectural bent, some finance resources to price out cost-to-serve and do some comparative benchmarking, perhaps some process expertise, a few of the usual vendor handlers, and you're pretty much good to go for a storage team that's structurally prepared to face the future at anywhere from moderate to ginormous scale.
In a nutshell, you've organized for success.
What I Like About The Storage Service Catalog Approach
This is not abstract theory -- I've seen it done well in more progressive (and usually larger) storage settings. Based on my experience, meetings with these sorts of proficient storage teams are entirely different from meetings with teams that haven't gotten around to organizing along these lines.
I can spot the difference in about 60 seconds. They are extremely precise and articulate about what they want from vendors like EMC. There is a minimum of BS-ing on both sides.
For me, it's vendor heaven :)
Here's one thing I like: when the inevitable budget crunch comes, the IT team is in a glorious position. All they have to do is state the painfully obvious: here are our storage services, here's how they're being used, here's how we're encouraging users to make good choices, here's how we're dropping our cost-to-serve over time, here's how we compare favorably with external alternatives.
Put this way, the gap between demand and supply isn't an IT problem; it's a business problem. IT is just delivering the competitive storage services the business is asking for.
By the way, in case you haven't made the connection yet, this is the familiar ITaaS concept -- just applied to the storage domain. Even if the rest of the IT function isn't yet on the ITaaS bandwagon, it often makes perfect sense for the storage function to organize itself in this fashion.
Tricks, Shortcuts and Cheats
I've made a personal collection of how various IT teams have figured out clever angles to this approach. There doesn't seem to be any shortage of clever people in IT ...
One that I really like is the "automatic default". Dump an arbitrary data set on IT, and it automatically gets the most cost-effective policy: minimal performance, almost no data protection, automatically deleted after 90 days, etc.
The defaults are widely communicated, so there's no valid excuse that someone didn't know.
The idea is to force at least some sort of interaction between the business user and IT where the business user has to sit down and think through exactly what their requirements might be -- and what resulting costs will be incurred. Otherwise, you get the bare minimums.
I like that.
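Here's a rough sketch of how that automatic default might be wired up. The tier name and the 90-day retention are illustrative, lifted from the example above:

```python
from datetime import date, timedelta

# Published defaults -- widely communicated, so nobody can claim surprise
DEFAULT_TIER = "Bronze"
DEFAULT_RETENTION = timedelta(days=90)

def assign_policy(dataset: str, requested_tier: str | None = None) -> dict:
    """Dump a data set on IT without engaging, and it falls through
    to the cheapest bucket with an automatic deletion date attached."""
    if requested_tier is not None:
        # The owner sat down, thought through requirements, accepted the cost
        return {"dataset": dataset, "tier": requested_tier}
    return {
        "dataset": dataset,
        "tier": DEFAULT_TIER,
        "delete_on": (date.today() + DEFAULT_RETENTION).isoformat(),
    }
```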
I've also seen many IT shops use something like EMC's FAST technology, dedupe, compression, thin provisioning, etc. to -- well -- skim a little on the margins.
User gets allocated 300GB of "Gold" storage -- very fast, very protected, etc.
Behind that, there's an array of automated storage technologies that create the perception of physical, premium storage resources without the resources being actually allocated until they're used.
Not that anyone would ever ask for something in a big hurry and then not get around to using all of it.
In one sense, I see this sort of virtual/physical spoofing as completely legit and even beneficial -- the resources are there when you need them. Just make sure you have decent operational processes in place for when your users collectively claim their resources.
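What does a "decent operational process" look like? At minimum, something watching the gap between what's been promised and what physically exists. A toy sketch, with a made-up 80% alert threshold:

```python
# Toy thin-provisioning watchdog: the pool is oversubscribed on purpose,
# so track actual consumption and raise a flag well before the promises
# made to users exceed what can physically be delivered.
def check_pool(allocated_gb: float, used_gb: float, physical_gb: float,
               alert_at: float = 0.80) -> str:
    ratio = allocated_gb / physical_gb   # how far promises exceed reality
    fill = used_gb / physical_gb         # how much of reality is consumed
    if fill >= alert_at:
        return f"WARNING: {ratio:.1f}x oversubscribed, pool {fill:.0%} full -- add capacity"
    return f"OK: {ratio:.1f}x oversubscribed, pool {fill:.0%} full"
```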
Should storage service functions show physical resources used? No -- there's nothing good that I can see coming from doing so. Server hugging and array hugging are both bad behaviors.
As a business consumer of IT technology, I fully expect this sort of virtualness from all my IT providers -- it helps them keep their costs down -- which I do appreciate.
I also like it when I meet storage teams that publish a "rate card", and compare their costs and services against external options like Amazon or Rackspace.
The internal IT choice doesn't have to be the cheapest or the best -- it just has to be in a somewhat reasonable range.
As a business user of IT, I'd much rather work with my internal team than strike up a relationship with an external vendor. Besides, showing external comparisons appears to keep discussions with business users rather short and to-the-point.
One likely result -- you'll get back big chunks of your calendar.
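And a rate card doesn't have to be fancy to be useful -- something even as simple as this sketch starts the right conversation. Every number here is a placeholder, not a real internal cost or published cloud price, and the 25% tolerance is just one way to encode that "reasonable range":

```python
# Illustrative rate card -- every price is a placeholder ($/GB-month),
# not a real internal cost or published cloud price.
RATE_CARD = {
    "Gold":   {"internal": 0.45, "external": 0.55},
    "Silver": {"internal": 0.15, "external": 0.12},
    "Bronze": {"internal": 0.04, "external": 0.03},
}

for tier, r in RATE_CARD.items():
    # "Reasonable range": internal doesn't have to win, just stay within ~25%
    verdict = "competitive" if r["internal"] <= 1.25 * r["external"] else "needs work"
    print(f"{tier}: internal ${r['internal']:.2f} vs external ${r['external']:.2f} -- {verdict}")
```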
There is one part of the problem that doesn't have a good answer, and that's "orphan data".
Projects come and go, and it's not uncommon for a part of the business to simply walk away from once-important data sets. There's no one you can easily track down to ask what to do with the data -- or to offer up an internal cost center to pay for it.
No one has a good answer for handling orphan data, other than to recognize that it happens, and to allocate a fixed amount of resources for preserving unowned data to cover the potential risks or lost opportunities associated with deletion. Even with this approach, demand eventually outstrips the assigned resources, and someone has to decide whether to allocate more -- or simply delete the unowned data.
Establishing A Process For Continual Improvement
Make no mistake, this isn't a do-it-once and never-have-to-do-it-again sort of exercise.
The services in the storage catalog have to be continually re-evaluated. The business might have new regulatory requirements, or a set of new geographic concerns, or something similar. There needs to be a strong discipline to periodically sit down and dissect the service catalog -- and make adjustments as needed.
Behind the scenes, the operational team is signing up for continually improving service levels -- as well as continually declining cost-to-serve.
Every so often, fairly significant new technologies come along that have the potential of moving the needle for the back-end storage team.
The storage factory is continually reporting out their key metrics that their customers care about. Like cost-to-serve. Or time-to-provision. Or time-to-resolve a service delivery issue.
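That kind of reporting can start as humbly as a recurring scorecard. A sketch -- the metric names come from above, but the actuals and targets are invented for illustration:

```python
# Hypothetical monthly scorecard -- metric names are the ones the
# factory's customers care about; numbers invented for illustration.
SCORECARD = {                             # (actual, target)
    "cost-to-serve ($/GB-month)": (0.21,  0.25),
    "time-to-provision (hours)":  (4.0,   8.0),
    "time-to-resolve (hours)":    (30.0, 24.0),
}

for metric, (actual, target) in SCORECARD.items():
    status = "on track" if actual <= target else "missed -- needs attention"
    print(f"{metric}: {actual} vs target {target} ({status})")
```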
I have a handful of examples where I've seen our customers rapidly evolve both their storage service catalog offered, as well as the mechanisms for delivering those services. I get to see two snapshots, maybe six or twelve months apart. Usually, the rate of progress is quite exceptional -- once they get moving in the right direction, that is.
At the same time, they are continually improving how they expose easy-to-consume storage services to more people: VMware admins, Oracle admins, Exchange admins, developers and more.
It's cool stuff. And I'm not just saying that because I'm a hopeless storage geek.
Where Does That Leave Us?
If I had to put all of this into a quick soundbite, I'd probably say that the trick to managing data growth is helping the business to make smart choices.
- Create attractive options that are easy to understand and consume.
- Invest in creating incentives for people to engage and make smart choices.
- Think in terms of continual process improvement around how those standardized storage services are delivered and consumed.
- Don't be afraid to compare yourselves to other alternative sources of those services.
- Organize as a service, not a silo.
And enjoy the ride -- because there's gonna be a heckuva lot more data coming at you very soon :)