Here's the cool stuff: enterprise IT groups are moving beyond the 1990s model of structured data warehouses and starting to think in terms of more accessible and cost-effective "data lakes": vast repositories that serve as landing zones for any and all enterprise data that might prove interesting in the future.
The motivations are clear: information is the new asset, and its value can only increase when new uses are found. Yes, applications still "own" their information, but it's also an asset that's owned by the organization at large. Data lakes hold the promise of making it far easier to span organizational and temporal boundaries and harvest all that latent analytical goodness.
But, as with any valuable shared asset, it's going to need to be managed: what goes in, what comes out -- and who can play.
And when new important roles emerge, technology innovation is not far behind.
This Isn't Your Father's Data Warehouse
Data warehouses still have their role, but the notion of a data lake is quite different: more of an agnostic repository than a single source of structured, cleansed truth disgorging operational reports.
In a data lake, large data sets -- in any form, from any source -- are landed unmolested in a centrally managed place. These data sets are then thoughtfully exposed to potential consumers via metadata or a catalog: what the data is, how it was gathered, etc.
Consumers choose what they're interested in, and are responsible for cleansing, transforming, etc. Very often, that work produces yet another data set which goes back to the lake, and the process repeats itself.
Yes, these can get quite big -- think petabyte to multi-petabyte scale. But -- at one level -- it's just storage hardware. It's relatively easy.
What sets a data lake apart from a simple file dumping ground is its management. You want to be able to authenticate users and use cases. You'd like your users to be able to quickly find data sets of interest, and easily add their own. Since it's an expensive resource, you'd like a sense of who's using what so you can optimize your efforts.
And all of this has to be done while balancing the needs of the users against the needs of the organization as a whole.
Do You Have Time For A Quick Story?
Here in Massachusetts, the largest freshwater lake nearby is Lake Winnipesaukee, up in New Hampshire. Beautiful, pristine water surrounded by mountains. If you have a boat, that's where you go.
It can be a challenging beast: very large and complex, glacial rocks everywhere hiding just below the waterline, all sorts of choke points, and many thousands of boats on a busy weekend.
The State of New Hampshire understands what a valuable resource it is, and manages it appropriately.
One example is licensing -- you need a boating license to use the lake, and getting one is not trivial. There's a long training course, followed by a written exam. Not everyone passes. Alcohol is verboten, there are reasonable speed limits -- and everywhere you go, the lake wardens aren't far away.
The results are great: the lake can be crowded, but there are very few accidents -- other than me bending my propeller more than once. Strict environmental controls ensure water quality.
And -- let's not forget -- the state has figured out it can levy massive property taxes on upscale lakeshore property. Everyone wins.
Here's the point: there's much more to Lake Winnipesaukee being successful than simply having a lot of water.
It's a managed resource.
Back To The Data Lake
So, without getting over-complicated, let's attempt to tease out the basic management functions associated with a data lake.
First, you'll want to make it easy to land data sets on your lake.
If people have to go through a lengthy process to land data, you'll end up with a data puddle rather than a lake -- and much lower value.
That implies a low-friction, low-or-no-cost self-service model, exposed as something simple like a file system. Since part of the anticipated value will result from others using the data, you don't want to amortize all the costs back to the people storing the data -- they are not the ultimate beneficiaries.
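To make "low-friction" concrete, here's a minimal Python sketch -- paths and names entirely hypothetical -- of what a self-service landing call might look like:

```python
# A minimal sketch of self-service landing, assuming the lake is exposed
# as a plain file system. LAKE_ROOT and land_dataset() are hypothetical.
import shutil
import uuid
from pathlib import Path

LAKE_ROOT = Path("/lake/landing")  # assumed mount point for the lake

def land_dataset(local_path: str, owner: str) -> str:
    """Copy a file into the landing zone and return a data set id.

    Deliberately minimal: no schemas, no approval workflow -- landing
    data should cost the producer almost nothing.
    """
    dataset_id = uuid.uuid4().hex
    dest = LAKE_ROOT / owner / dataset_id
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy2(local_path, dest / Path(local_path).name)
    return dataset_id
```

The point isn't the code; it's that the producer's obligation stops at a copy and an id.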
Second, you'll have to give people control over their data.
Not everyone is always comfortable with sharing, so that needs to be respected. But, at the same time, there needs to be a central authority that understands what is being stored, with the primary goal of staying on the lookout for additional value that might be exploited at some point.
The uncomfortable reality is that this implies the user being responsible -- at least to some degree -- for providing minimal metadata: source, context, format, etc. Make the metadata capture requirements too onerous, and people won't use your service. Make them too lightweight, and it'll be hard to extract value later.
My initial thought is that we'll see a two-tier approach evolve: a quick metadata capture on ingestion, followed by more rigorous categorization and exposure if the data set proves itself valuable over time. Let the internal marketplace decide ...
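Purely as an illustration -- the field names below are mine, not a proposed standard -- the two tiers might look something like this:

```python
# A sketch of the two-tier idea: a handful of required fields at ingest,
# richer cataloging only if the data set earns it over time.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QuickMetadata:
    # Tier 1 -- captured in seconds at ingest time
    source: str              # where the data came from
    context: str             # one-line description of why it exists
    fmt: str                 # e.g. "csv", "parquet", "raw logs"
    owner: str
    derived_from: Optional[str] = None  # parent data set id, if any

@dataclass
class CatalogEntry(QuickMetadata):
    # Tier 2 -- added later, once usage shows the data set matters
    schema_doc: str = ""     # column-level documentation
    quality_notes: str = ""  # known gaps, sampling caveats, etc.
    tags: List[str] = field(default_factory=list)

def promote(quick: QuickMetadata, schema_doc: str, tags: List[str]) -> CatalogEntry:
    """Upgrade a tier-1 record to a full catalog entry."""
    return CatalogEntry(**vars(quick), schema_doc=schema_doc, tags=tags)
```

Note the derived_from field: that's how a cleansed or transformed data set going back into the lake stays connected to its parent.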
Third, you're going to need a number of mechanisms by which data sets can be discovered.
Having lived through this before, I will argue that there will be no one best way to do this. Some users will prefer the neat and ordered catalog, others will want to go browsing, still others will want to rely on search functions, etc.
I will argue that the more distinct ways that data sets can be discovered, the better.
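As a toy illustration of "more than one way in" -- the catalog structure here is a placeholder, not a design -- the same metadata could back direct lookup, keyword search, and simple browsing:

```python
# A toy catalog offering several routes to the same data sets. All
# names are hypothetical.
from typing import Dict, List

# dataset_id -> metadata dict (source, context, owner, tags, ...)
catalog: Dict[str, dict] = {}

def lookup(dataset_id: str) -> dict:
    """The neat-and-ordered route: you know exactly what you want."""
    return catalog[dataset_id]

def search(term: str) -> List[str]:
    """The search route: naive keyword match across the metadata."""
    term = term.lower()
    return [ds_id for ds_id, meta in catalog.items()
            if term in meta.get("context", "").lower()
            or term in " ".join(meta.get("tags", [])).lower()]

def browse(owner: str) -> List[str]:
    """The browsing route: wander the lake one producer at a time."""
    return [ds_id for ds_id, meta in catalog.items()
            if meta.get("owner") == owner]
```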
Fourth, there's going to have to be a quick and simple way to validate proper use by consumers.
For example -- while I might know that payroll data exists, I certainly can't use it.
While the existence of a particular data set should be general knowledge, access to its contents might fall into one of three categories: open (generally OK for authorized users of the platform), permission-based (owner wants to evaluate each request), or restricted (better have a really good reason for asking for this).
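Those three categories map naturally onto a very simple access check. The sketch below is an illustration of the idea, not anyone's actual security model:

```python
# The three access categories described above; the enum values and
# flags are illustrative assumptions, not a real product's API.
from enum import Enum

class Access(Enum):
    OPEN = "open"              # generally OK for authorized platform users
    PERMISSION = "permission"  # owner evaluates each request
    RESTRICTED = "restricted"  # better have a really good reason

def can_read(access: Access,
             approved_by_owner: bool = False,
             has_restricted_clearance: bool = False) -> bool:
    """Everyone may know a data set exists; its contents are another matter."""
    if access is Access.OPEN:
        return True
    if access is Access.PERMISSION:
        return approved_by_owner
    return has_restricted_clearance  # Access.RESTRICTED
```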
Finally, just as web managers want to know how people use their site, the wardens of the enterprise data lake will want their *own* tools to understand who's using what, what's popular, where the resources are going, etc.
A petabyte-class data set that's sitting around unused for months deserves a bit of investigation, for example.
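That warden's-eye view could start as simply as recording reads and flagging what's big and idle. A hypothetical sketch, with made-up thresholds:

```python
# Log every read, then flag large data sets nobody has touched in months.
# The 1 PB / 90 day thresholds are assumptions for illustration.
import time
from typing import Dict, List

last_read: Dict[str, float] = {}   # dataset_id -> unix time of last access
size_bytes: Dict[str, int] = {}    # dataset_id -> stored size in bytes

def record_read(dataset_id: str) -> None:
    last_read[dataset_id] = time.time()

def stale_report(min_size: int = 10**15, max_idle_days: int = 90) -> List[str]:
    """Petabyte-class data sets unread for ~3 months deserve a look."""
    cutoff = time.time() - max_idle_days * 86400
    return [ds for ds, size in size_bytes.items()
            if size >= min_size and last_read.get(ds, 0) < cutoff]
```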
Drawing A Convenient Line
Just to be clear, I'm separating the information-centric functions from the compute-oriented ones. A rich data lake is inevitably well-integrated with a compute farm, plenty of analysis and reporting tools, a convenient consumption portal, etc. -- a combination often described as "business-analytics-as-a-service", or BAaaS.
While that's certainly part of the discussion, I think it's important to think about the information repository separately. I will observe that the discussions I've been in so far tend to overemphasize the compute and tools aspects, and underestimate the information repository aspects.
After all, the environment will only be as good as the data it contains.
A Light Touch Required By The Data Management Team
A common (and unfortunate) pattern I've observed is when the existing data management team assumes responsibility for the new data lake.
Yes, they bring valuable domain knowledge, but it comes packaged in a world view derived from the old model: structured and cleansed data, clean hierarchies, rigorous taxonomies, authoritative access models, and everything else that comes with it.
That's not the game here -- it's more bottom-up and demand-driven.
Just like cloud and IT-as-a-service turned traditional IT models on their heads by focusing on ease-of-consumption and being responsive, the next-gen data management teams will need to learn to do the same.
We don't want a recreation of the last model simply using newer technology.
We've Got Tools To Do This, Right?
Err, not really. If there are good answers out there, I'm not aware of them.
Sure, there are bits and pieces here and there that could be lashed together with a bunch of work (I'm sure that sounds like big fun to someone out there), but there's a paucity of pragmatic, off-the-shelf frameworks that reflect the new data lake model rather than simply extending the legacy one.
But I don't think that will last for long.
There's so much VC money out there -- and so many bright people -- that I'm guessing we'll start seeing a new generation of "enterprise information management frameworks" that look very 2.0-ish: built to handle unstructured data, focused on ease-of-consumption, very lightweight models, etc.
No Excuses
But I wouldn't use the lack of tools to justify a lack of focus.
From my perspective, the physics are inarguable. Most every business is becoming an information business. That means creating platforms and processes that enable the use of *all* information: regardless of source, regardless of consumer.
There's no way we can build enough explicit point-to-point bridges between producers and consumers, so we have to start thinking in terms of a new class of repository -- one that doesn't demand that every new business requirement be predicted and justified six months ahead of time.
Put differently, just about every business will want to make sharing and re-use of large data sets as easy and as transparent as possible: just as email made business communications as easy and transparent as possible -- although not without a few side effects.
New requirements = new important roles.
And I think the guardians of our new data lakes will turn out to be very important indeed.