We all speak loftily of IT architectures. But I think the reality is more IT archaeology -- layers upon layers of systems and interfaces, built on top of each other over time, each dependent on the layers below it.
Whether this happens through natural evolution, or by explicit intent (e.g. SOA), the resulting landscape is the same -- it's damnably complex.
And today, I'd like to try and make a case that the fundamental technology required to manage service levels in these environments must take -- and is taking -- a big step forward.
It started simple enough ...
Since the beginning of IT, we have always thought in terms of discrete applications and users of those applications. We'd think in terms of order processing, or maybe financial reporting, or whatever.
One way of viewing enterprise IT management is that it grew up around this model. Put differently, when you dig into Tivoli, or CA-Unicenter, or other frameworks, what you see is primarily an application-oriented model.
Yes, there's awareness of other elements of IT infrastructure, but they're either thought of as belonging to an application, or (worse) free-floating entities whose relationships are poorly understood.
Then it got complicated ...
Over time, the whole concept of an "application" came up for grabs. Is the application the back-end customer database? Or is it the end-user workflow that uses that resource and a half-dozen others? I can tell you which one the user cares about.
The move to shared infrastructure caused new problems. You have a shared network. Maybe you've consolidated storage, so that's shared as well. Hard to tell what bit of infrastructure supports what back-end database which supports what end-user composite application, no?
So IT organizations took some basic steps.
- They'd use domain-specific monitoring tools to look at layers of the infrastructure: storage, networks, servers, databases and so on. But these tools couldn't talk about the relationships and interactions between the layers, just the elements themselves.
- They'd make sure that all alerts got sent to a common console, and try to add filtering to eliminate the torrent of irrelevant messages. But the messages weren't correlated, and it became increasingly difficult to get to root cause in a sea of red "alert" messages.
- They'd write scripts and code to handle specific situations they'd encounter, patch on top of patch, layer on top of layer. Every time a new situation came up, they'd write more code. But if the IT landscape changed, all the code would have to change as well. More and more effort got expended, with less and less to show for it.
Then it got grim ...
As more and more composite applications emerged, the flood waters started to rise. IT organizations had to spend more and more time on conference calls (with all the different domain specialists on the line) to try and sort through where the problem might actually be.
Application down time became dominated by time spent trying to isolate the problem, rather than fixing the root cause itself. Once IT figured out where the real problem was, they could fix it pretty quickly, but it was taking longer and longer to get to the root cause.
Users (and their business management) got unhappy with declining IT service levels (from their perspective). New SLA targets and "incentives" started showing up in many IT compensation plans. IT people were spending more time finding and fixing problems than working on new solutions and functionality.
IT started to push back on the business when they were asked for new applications and new functionality -- simply because they felt that the foundation wasn't strong enough to support another layer.
Does this apply to you?
Maybe you're reading this at arm's length, and think "hey, we've got things under control". And maybe you do, today.
But if you step back and say "what's the trend here?", you'd probably agree that, overall, we're going to see:
- more shared infrastructure
- more "composite applications" that are assemblages of traditional applications
- ever-higher demands by business users for predictable service levels
So, if it's not a problem today, it could be tomorrow. And if you're seeing symptoms of this problem already, the prognosis is not good, unless you start to think about it differently.
Some core concepts about the next-gen service level management model
First, in this environment, relationships matter more than the things themselves. Knowing how each layer supports and interacts with its adjacent layers becomes more useful (and important) than understanding the entire layer and all of its elements.
The classic example is networking, because it's pretty evolved. Anyone who does network management knows that they're the first stop in triage because everything connects through them. But, most of the time, they say "gee, the network's working fine, must be something either above me or below me".
From the end-user view, what are all the logical components that make up this application (software, hardware, etc.), what are their relationships, and what is their status?
Or, from an infrastructure view, here's a big consolidated storage array, what are all the pieces above me (server, database, network, logic, etc.), and how do they ultimately get perceived by the end user?
It's easy enough to figure out when an end-user application is sick (they complain), or a large piece of shared infrastructure is sick (it complains). What about all the goop in between?
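To make that concrete, here's a minimal sketch of what I mean by a relationship model: a small dependency graph you can walk in both directions, down from an application to everything it rests on, and up from a piece of shared infrastructure to every application it ultimately serves. The entity names and the Python below are purely illustrative, not any particular product.

```python
# A minimal, hypothetical relationship model: each entity maps to the
# entities it depends on. Real tooling builds something far richer, but
# the shape of the answer is the same -- a graph, not a list.
DEPENDS_ON = {
    "order-entry-app": ["app-server-1", "customer-db"],
    "billing-app":     ["app-server-2", "customer-db"],
    "app-server-1":    ["network-core"],
    "app-server-2":    ["network-core"],
    "customer-db":     ["db-server", "network-core"],
    "db-server":       ["storage-array"],
    "network-core":    [],
    "storage-array":   [],
}

def everything_below(entity):
    """End-user view: all the infrastructure an application ultimately rests on."""
    seen, stack = set(), [entity]
    while stack:
        for dep in DEPENDS_ON.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

def everything_above(entity):
    """Infrastructure view: every entity that ultimately depends on this one."""
    seen, stack = set(), [entity]
    while stack:
        current = stack.pop()
        for parent, deps in DEPENDS_ON.items():
            if current in deps and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(everything_below("order-entry-app"))  # the stack under one application
print(everything_above("storage-array"))    # everything a shared array touches
```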
So, try this on for size. Imagine you have a well-monitored environment. And let's say that there's a database server that's used by hundreds of applications. The database server has an HBA drop, so performance is now off by 25%. By the way, this may or may not result in an alert being generated.
Who could raise an alert in this situation? Let's see: the server, the database, the message bus, any of the hundreds of consumer applications, maybe the storage array. Let's say you've got a torrent of hundreds of alerts scrolling across your enterprise console. Where would you like to start? And, oh by the way, the real offender hasn't checked in: it's dead!
It's easy to see that the relationships between the entities matter more than the entities themselves. If you understood how everything was related, you'd have a basis for correlating the alarms, and finding the guilty party.
But how to do that?
You're going to need a model, not a repository. A repository is a collection of things. A model shows relationships and interactions. But how does it get built, and how does it get used?
One choice is to take an extensive inventory of everything on the floor, document all the relationships, and code it into a model. Let me help you here: don't even think about it.
Why? Because the IT landscape changes organically, despite everyone's best efforts in change control, ITIL and the like. Someone reconfigures the storage or the network. There's a new client of the database. Someone sets up a link to move information between two entities. IT infrastructure is dynamic and organic, and trying to pour concrete over it in an effort to slow things down will only result in IT falling farther and farther behind in keeping up with the business.
You're going to need a way to automatically discover and populate the model. Every application, every relationship, every dependency, every component, every version of code, and so on. Hi-def and real-time. And you're going to want to do it without loading YAA (yet another agent) on everything you own.
[not surprisingly, that explains our nLayers acquisition -- now Smarts Application Discovery Manager -- which sits off the mirror port of a network -- and if you let it, will do precisely that, and track changes over time. I know you all see lots of demos, but you've got to see this one. I was speechless.]
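To give a flavor of how agentless discovery can work in principle (and this is only a rough sketch of the general idea, not a description of how ADM is actually built), consider that if you can watch traffic off a mirror port, the dependency map falls out of who keeps talking to whom. The flow records, names, and threshold below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical flow records as they might be observed off a mirror port:
# (client, server, server_port). In reality you'd be parsing packets or
# flow exports; these tuples are stand-ins for illustration.
observed_flows = [
    ("app-server-1", "customer-db", 1521),
    ("app-server-1", "customer-db", 1521),
    ("app-server-2", "customer-db", 1521),
    ("app-server-1", "msg-bus", 61616),
    ("laptop-42",    "customer-db", 1521),  # one-off connection, likely noise
]

def infer_dependencies(flows, min_observations=2):
    """Call A dependent on B only if the conversation repeats over time."""
    counts = defaultdict(int)
    for client, server, port in flows:
        counts[(client, server, port)] += 1
    deps = defaultdict(set)
    for (client, server, port), n in counts.items():
        if n >= min_observations:
            deps[client].add((server, port))
    return deps

for client, servers in infer_dependencies(observed_flows).items():
    print(client, "->", sorted(servers))
```

Run it over days instead of five records and the "hi-def" picture builds itself, with no agents installed anywhere.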
You're then going to have to use the model to correlate root cause. I see hundreds of alarms and alerts, but -- because I have a model of how everything is related -- I can create a large space of all potential error syndromes that would result from all potential root-cause errors. When a specific error syndrome comes in, I can match it against the universe of all possible errors, find a fuzzy match, and -- bingo! -- that's the problem that has to be fixed.
[that's exactly the core technology behind Smarts -- thousands of customers use it to do exactly that]
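For the curious, here's a toy version of that correlation idea -- my own simplification for this post, not the actual Smarts codebook -- where each candidate root cause has a predicted symptom signature, and the incoming symptoms get matched to the closest signature even when the true offender never raises an alert itself:

```python
# Toy "codebook": for each potential root cause, the set of symptoms the
# model predicts it would generate. In a real system this is derived from
# the relationship model; it's hard-coded here for illustration.
CODEBOOK = {
    "db-server HBA failure":    {"db slow", "app-1 timeouts", "app-2 timeouts",
                                 "storage path alert"},
    "network-core congestion":  {"app-1 timeouts", "app-2 timeouts",
                                 "msg-bus backlog"},
    "customer-db table lock":   {"db slow", "app-1 timeouts"},
}

def most_likely_cause(observed, codebook):
    """Fuzzy match: reward explained symptoms, penalize missing and spurious ones."""
    def score(predicted):
        missing  = len(predicted - observed)   # expected but never arrived (maybe the source is dead)
        spurious = len(observed - predicted)   # arrived but not explained by this cause
        return len(predicted & observed) - 0.5 * (missing + spurious)
    return max(codebook, key=lambda cause: score(codebook[cause]))

# The failed HBA itself never raises an alert, but its neighbors do.
observed = {"db slow", "app-1 timeouts", "app-2 timeouts"}
print(most_likely_cause(observed, CODEBOOK))   # -> "db-server HBA failure"
```

The weights here are arbitrary; the point is that you match the pattern of symptoms against the universe of predicted patterns, rather than chasing any single alert.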
OK, I'm listening, but where are the gaps?
The first obstacle is cynicism: "we've already got a bunch of tools deployed in our shop, and they seem to work all right."
OK, if the world is great for you, I can't offer you anything new or better, can I?
The second obstacle is ownership. Who owns end-to-end service levels in a distributed environment? If no one owns the problem, then no one has the mission to go fix it. And, surprisingly, that's the case more often than not.
The third obstacle is action. These technologies can pinpoint the problem in real time, but they can't do anything about it (yet) -- they can't change the state of the IT infrastructure to automatically resolve the problem. But I would argue that most IT organizations are pretty good at fixing problems -- once they know where the real problem lies.
Now, without getting too technical, there are a couple of special cases where this approach won't work particularly well (e.g. we run everything on the mainframe, or the application stack was specifically architected to solve this kind of problem, etc.) but -- generically speaking -- I think it has the potential to add another "9" to your overall SLA, e.g. from 90% to 99%, or from 99% to 99.9%.
[OK, maybe I'm stretching here, but the results have been nothing short of dramatic with the people I've talked to.]
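Just to put rough numbers on what another "9" means (plain arithmetic, nothing product-specific):

```python
# What another "9" means in downtime over a year of 8,760 hours.
for availability in (0.90, 0.99, 0.999):
    downtime_hours = 8760 * (1 - availability)
    print(f"{availability:.1%} uptime -> ~{downtime_hours:,.0f} hours of downtime per year")
# 90.0% -> ~876 hours; 99.0% -> ~88 hours; 99.9% -> ~9 hours
```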
So, what will happen to the existing enterprise framework vendors?
First, let's be clear. No one is going to throw away anything they've built, if they can avoid it. And customers have poured bazillions of dollars into implementing and enhancing these environments.
Indeed, most of the deployments we've seen have been layered: the agents from the enterprise framework feed the model, the model makes decisions, and it informs the enterprise management framework about where it thinks the real problem might be. The stuff you have stays in place; it just gets a whole lot smarter.
But, from a pure strategy perspective, these folks have a major headache on their hands. The core problem they're trying to solve has changed, which means a lot of the fundamental architectural decisions they've embedded in their products have to change as well.
Some will try and fight the argument, others will try to position this technology as a bolt-on. But I can't see any of them having the stones to pull out a fresh sheet of paper and re-think their entire approach to the problem. The reason the world was created in seven days is that there wasn't a legacy problem, right?
So, I'd ask you, take a look at your IT archaeology and ask the question -- does this sound like you?
Spot on!
Couldn't agree more.
Posted by: Noons | January 25, 2007 at 12:46 AM