Ric Telford on the state of autonomic computing today

The challenges of developing with, and for, next-generation autonomic systems

Level: Introductory

developerWorks staff (dwinfo@us.ibm.com), Editorial Staff, IBM

02 Dec 2004

This question and answer article features Ric Telford, Director for Autonomic Computing at IBM. developerWorks talked with Ric about the state of autonomic computing today, and the challenges of developing with, and for, next-generation autonomic systems.

Ric Telford, the Director for Autonomic Computing at IBM, has many years of experience developing emerging businesses and deploying new technologies, including on-demand and utility computing, and security. His newest passion is autonomic computing.

developerWorks: Your job title is officially "Director of Architecture and Technology for Autonomic Computing." What does that encompass?

Ric Telford: Well, as Paul Horn said almost three years ago, autonomic computing is a call to arms to the industry. We have to be better about making our components and IT systems more self-managing, because the cost of maintenance, operations, and administration of our systems today is greatly outstripping the actual cost of the components themselves -- and this equation won't close in the long term.

So, we have an organization to help drive that, and my part of the organization is architecture, technology, and standards. The architecture is really to drive the IBM internal deployment and specifications for how we make our own products autonomic. We take the set of architectures, and take those pieces that are fundamental to a heterogeneous multivendor environment, and propose those as standards. Then we work at the industry level to help drive these autonomic computing standards across the industry.

In support of the architecture and standards work, I also have responsibility for ensuring that the actual implementations of what we're proposing get built and developed. And, we are constantly advancing the technology forward. I do that through a number of different avenues. I have a joint program with IBM Research, because a lot of this technology is rooted in fundamental research that we rely heavily on our research division to deliver. I have programs with universities and the academic community at large, and government agencies -- who all can help in the mission to drive autonomic computing across the industry by focusing some of their research on self-managing systems. I also have a small technology team that does proofs of concept, prototypes, and demos to start showing the real value of how we can take these autonomic computing technologies and apply them to real business problems. So, it's the technical side of the autonomic computing mission that I'm held responsible for.

dW: You mentioned that this drive started about three years ago. Was there a crucial event that took place to spur it -- did systems suddenly become much more complex, or has the complexity been building up over the whole 40-50 years of the modern computing era? Or was it something else?

RT: The short answer is yes, there was an important event that happened, which was the advent of the Internet. Now, clearly, complexity in IT systems has been building over the last 40-50 years. When you have a mainframe in one room, where all the computational work goes on, and the only thing that's distributed is the presentation of the information out to "green screens" (or terminals) -- that's pretty simple compared to the next wave, which was client-server computing. [In client-server computing], distributed servers and then smart clients and computational capabilities are spread out across your enterprise. That's more CPUs, if you will, that need to be managed, monitored, maintained, and so on.

Siloed: to be a separate entity that does not mix with other entities.

But they were still somewhat siloed applications; you had to have a certain kind of client connected to a certain kind of server. What made it exponentially worse was the Internet. Now you had a common way to come into any application, and the expectation was that you could create applications of aggregated data from multiple sources. So what do we see now when we go into, say, your online banking system? I know of one large bank that said, I think, that they have 22 different data sources just to populate your profile page (not their home page, but rather, when you log in, your profile page or your customized home page). 22 data sources! Some of these are ancient back-end proprietary banking systems, some of them are IMS databases, some of them kick off CICS transactions, there are relational stores, and data in Oracle. So you're lashing all this stuff together now, and dealing with the complexity involved in having so many systems tied together to give one unified view. If there is ever a break in the chain on one of those, you can essentially prevent the whole application from running.

dW: It's funny that you mention banking in particular. Very often, when I go into a bank or talk to my bank or credit card companies on the phone, I am told that their systems are down.

RT: I tell you, that's my favorite example, because that's where I've seen the most data aggregation in applications. You know, my bank had a problem -- and they're also an IBM customer, so I'm their customer, they're my customer, and I have a good relationship with them -- and I mentioned it and they said,

"You know what that problem was? We had a small database, and all it does is save your preferences -- your customized settings for your home page -- and that database had a failure. And so the whole Web site just didn't come up at all. It went to retrieve the data, it couldn't get the data, so the whole application crashed."

It's that kind of thing -- having all these different potential points of failure -- and you've got to make sure that it all works.

dW: Well, with that we have set up the problem. How does autonomic computing fix it?

RT: The vision of autonomic computing is self-managing systems: to put more smarts into the components and the systems management and the applications themselves, to both detect and prevent -- and then self-cure -- problems. It's a combination of a lot of technology. But the net is that they all come to bear around ensuring that either failure doesn't happen, or the system doesn't go into sub-optimal mode, or in some cases, that the proper configuration is done so things that can essentially be root causes for problems down the road don't occur. Or if problems do occur, that they are mitigated in a way that keeps the whole application from coming down. So there is no one silver bullet technology -- it is a combination of things we are working on.
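The mitigation idea, where one failing data source does not bring the whole application down, can be sketched in a few lines of Python. This is only an illustration based on the bank anecdote above, not anything from the bank's system or from IBM: the names `load_preferences`, `BrokenStore`, and `DEFAULT_PREFERENCES` are hypothetical, and a plain dictionary stands in for the preferences database.

```python
# Hypothetical sketch: none of these names come from a real banking system.
DEFAULT_PREFERENCES = {"theme": "standard", "layout": "default"}


class PreferencesStoreDown(Exception):
    """Raised when the (hypothetical) preferences database is unreachable."""


class BrokenStore:
    """Stand-in for a preferences database that is currently failing."""

    def __getitem__(self, user_id):
        raise PreferencesStoreDown("preferences database unavailable")


def load_preferences(user_id, store):
    """Return saved preferences, or defaults if the store fails or has no entry."""
    try:
        return store[user_id]
    except (PreferencesStoreDown, KeyError):
        # Mitigate: fall back to defaults instead of failing the whole page.
        return dict(DEFAULT_PREFERENCES)


def render_profile_page(user_id, store):
    """Build the page even when the optional preferences source is down."""
    prefs = load_preferences(user_id, store)
    return f"profile for {user_id} using {prefs['theme']} theme"
```

With a healthy store the saved settings are used; with `BrokenStore()` the page still renders, only without the customization.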

We talk about self-management in four categories because they relate very closely to how customers think about maintaining and managing their IT systems:

Self-configuration addresses a huge problem, which is in the tuning and setting of the configuration of all these pieces of IT infrastructure that need to work together to make a system work optimally. In other words, it's not just about the Web site not coming up; it's about getting responses back to you, and having a reasonable response time.

Self-healing is being able to do better problem determination, autonomic root cause analysis, and then ultimately self-healing.

Self-optimization is adapting to changing workloads. One cause of problems in systems today is spiky workloads -- rapid changes in the amounts of work. You want to be able to have the optimal amount of resources applied based on the workload, whether it be within one database, across a farm of databases, or across an entire Web application. Whatever the workload happens to be, you want to make sure you have enough resources. But, you also want to make sure you're not wasting resources, which is, today, the way that people optimize. They do something called overprovisioning. Even if most of the time it only takes two servers to do a job, but sometimes it takes 10 servers, they just keep 10 servers online all the time. Most of the time you're basically wasting eight servers. So self-optimization is all about making sure you have the right amount of resources, and no more than you need. Hence "optimal" (laughs).
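To put rough numbers on the overprovisioning example, here is a small Python sketch. The daily workload trace is an assumption made up for illustration (22 quiet hours that need two servers and 2 peak hours that need 10); only the "two servers versus 10 servers" figures come from the interview.

```python
def wasted_server_hours(demand_per_hour, provisioned):
    """Sum the idle server-hours when `provisioned` servers stay online."""
    return sum(provisioned - needed for needed in demand_per_hour)


# Assumed trace: 22 quiet hours needing 2 servers, 2 peak hours needing 10.
demand = [2] * 22 + [10] * 2

# Static overprovisioning keeps enough servers online for the peak all day.
peak = max(demand)
static_waste = wasted_server_hours(demand, provisioned=peak)
total_capacity = peak * len(demand)

print(f"{static_waste} of {total_capacity} server-hours sit idle")
```

Under this assumed trace, most of the provisioned capacity is idle, which is the gap that matching resources to the workload is meant to close.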

Self-protecting systems are all about guarding the infrastructure from external threats, and having better ways to allow systems to detect potential things like viruses, or denial of service attacks, illegal access, and so on.

dW: If we take self-protection, and look at things like firewalls, TCP wrappers, or tripwire, is it possible to say that these are examples of autonomic elements? Where you can say, "This is what we need to do for other areas of IT?" Or, do they not actually fit in with what autonomic computing is all about?

RT: The question is, "Are security mechanisms today synonymous with self-protect?"

dW: Well, are they going down the right road, maybe more than other examples of things we already have today? For instance, a firewall can be at least somewhat autonomic?

RT: Well, I haven't really thought about it that way, but I don't think so. We have examples of self-managing. I can point to lots of different technologies today, in each of these disciplines, that are good isolated examples of behaviors we want to make more common across other components or other applications.

dW: Maybe it would be better to talk about those than my example.

MVS: Multiple Virtual Storage, a mainframe operating system.

RT: OK. The thing to remember about autonomic computing is on one hand, it's nothing new. On the other hand, it's doing old things in a new way, so it all depends how you look at it. But if you go to Poughkeepsie, and you talk about autonomic computing and give them examples of behaviors that are desirable in an IT environment, they say, "We invented that 30 years ago in MVS™" -- and that's true. Those are some of the behaviors we point to, such as the way that z/OS® has a common model for the logging of events, and so on. If you are running on z/OS, there are rules by which you do that, such as what you need to log, and how you need to log events that go on your system. This makes it really easy to correlate events from across different components of the application stack, and to do root cause analysis. So problem determination is not perfect on the z, but there are a lot of capabilities there that you don't have elsewhere.

We want to take those sorts of best practices, and make them work in a heterogeneous multivendor environment as well as they work in a homogeneous single-vendor environment like the z/OS environment. And that's part of what autonomic computing is about, and why standards are important.

We submitted a standard called Common Base Event. It is pretty simple in concept, but pretty powerful in capability. Basically, the idea is that everybody should start logging events that are going on in their infrastructure, whatever they happen to be -- an SAP application, an Oracle database, a Sun server, a Cisco router -- in a common way. If we could all get this done in a common way, using the common situation model that we defined (which is not just a syntax, but also has some semantics, or meaning, that you describe in a common way inside the event), that will make it so much easier to do problem determination, root cause analysis, and self-healing systems -- just by this one initiative, where everyone starts recording what is going on in a common way.
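As a rough picture of what "logging in a common way" means, the Python sketch below defines one event shape that any component could emit. The field names and the situation vocabulary are assumptions chosen for illustration; they are not taken from the Common Base Event specification.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class CommonEvent:
    """Illustrative common event record; fields are assumptions, not the CBE schema."""
    created: str    # ISO 8601 timestamp
    source: str     # component the event is about
    severity: int   # higher means more severe (assumed scale)
    situation: str  # shared vocabulary, e.g. "START", "STOP", "CONNECT_FAILED"
    message: str    # free text for humans


def make_event(source, severity, situation, message, now=None):
    """Build an event; `now` can be injected to keep timestamps reproducible."""
    now = now or datetime.now(timezone.utc)
    return CommonEvent(now.isoformat(), source, severity, situation, message)


def timeline(events):
    """Merge events from any components into one time-ordered list."""
    return sorted(events, key=lambda event: event.created)
```

Because a database, a router, and an application would all produce the same record, a tool can merge them with `timeline` and compare situations without knowing each product's private log format.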

So a lot of these technologies exist in pockets. There are pockets of self-protecting technologies -- it's great that a firewall logs incoming hits against a port that's not open. It's even better if those logged events are funneled up into what we call an autonomic manager, that can maybe correlate that to mean there's a specific denial of service attack, or a probe going on against the infrastructure. So the combination of these technologies is really what can help make autonomic computing possible. The ability for events to be aggregated, then correlated, and analyzed is one of the key concepts behind autonomic computing.
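The firewall example can be sketched as a simple correlation rule: many hits on different closed ports from one source in a short time suggests a probe. This is a minimal Python sketch under stated assumptions. The event shape (timestamp, source address, port) and the thresholds are hypothetical, and a real autonomic manager would consult far richer symptom knowledge.

```python
from collections import defaultdict


def detect_port_probes(events, window_seconds=60, min_ports=5):
    """Flag sources that hit many distinct closed ports within a time window.

    `events` is an iterable of (timestamp_seconds, source_ip, port) tuples,
    a hypothetical shape for firewall "closed port" log entries.
    """
    by_source = defaultdict(list)
    for ts, src, port in sorted(events):
        by_source[src].append((ts, port))

    suspects = set()
    for src, hits in by_source.items():
        start = 0
        for end in range(len(hits)):
            # Advance the window start until the span fits in window_seconds.
            while hits[end][0] - hits[start][0] > window_seconds:
                start += 1
            distinct_ports = {port for _, port in hits[start:end + 1]}
            if len(distinct_ports) >= min_ports:
                suspects.add(src)
                break
    return suspects
```

Each logged hit on its own means little; the value comes from aggregating the hits per source and analyzing them together.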

dW: Are you happy with the way the standards process is going for this area?

RT: Yes. When you work in a standards world, which is very important to autonomic computing, you always have a balance between the rate and pace at which you want to drive architecture into your own instantiations of a standard, and the rate and pace of a standards body. But we solve that by going in knowing that we may have to change. So, we've probably implemented some features in the log trace analyzer that may not be standard yet. And when the standard is published for the Common Base Event -- if it differs from what we've already implemented -- we may continue to support what we have today, but we will make sure we are 100% standards-compliant. So, this is the fine line you walk, and it's not a problem.

dW: IBM is working with others in the industry on this -- Cisco, HP, and others. Are you happy with the progress that has been made in the last three years [since autonomic computing was launched formally as an initiative]?

RT: Yes! The interesting thing is, we have so many things going on with partners and we are all so busy, the only challenge is finding the time -- it's not the lack of interest. We have a lot of interest from people we've been working with in the industry, and there's a lot of work to do, and the only gating factor is how much time and resources we can apply to working together and moving things forward. But there is certainly no shortage of interest and excitement in this area from other companies in the IT industry that we are working with to advance the cause of autonomic computing. Because it doesn't just benefit the end customer, it benefits the IT vendors themselves. For example, with self-healing systems, one of the advantages of autonomic computing is that the number of problems that require a call in to a customer support center for Cisco, or IBM, or HP will also go down, and save us money.

What autonomic computing means to developers

developerWorks: Let's say I'm a developer (small or medium) and I want to incorporate this in my existing programs. What can I do?

Ric Telford: So let's say you are a small company, and you have a piece of software, an application that you run -- it runs on top of WebSphere® or DB2® or Oracle -- it's a financial application, or what have you?

dW: Yes.

RT: One of the things you want to do is to make the events that you create in your application -- the logs and the trace information -- leverage this Common Base Event specification. Because then all the tooling, and all the autonomic managers, and all the event correlators that we think will evolve in the industry around the CBE format will provide you benefits right off the bat.

Rational®, for example, has tooling that includes a log trace analyzer that supports the Common Base Event. Now a customer that buys your product, if they use that Rational tool, can correlate events going on in your financial application along with related events that go on in DB2 or WebSphere. In another example, a lot of these applications from small and medium software companies have a stack of middleware as prerequisites. Application developers are all hopefully getting out of the business of writing their own databases, their own transaction systems, and their own Web middleware, to focus on what they do best, whatever their application is. But they still need to prereq the installation of things like Oracle or DB2, or WebSphere, Web application servers, and so on. One of the things we're working on, that we delivered recently to the industry, is Solution Installation for Autonomic Computing, which is a specification for how you can describe your particular application in terms of a standard -- what we call an installable unit -- and aggregate it with other installable units to have one customer view of a solution installed.

To the customer, they're just installing your product. But there are mechanisms built in to install and configure the whole stack. It's a more unified solution view, rather than a set of products that have to be installed separately. We released that specification along with our partners InstallShield and ZeroG. So, that is another area that a programmer would want to look into, and figure out over time how to evolve their application and describe it as installable units.
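The idea of aggregating installable units so that a whole stack installs as one solution can be illustrated with a dependency graph. The Python sketch below is an assumption-laden illustration, not the Solution Installation specification: the unit names are made up, and the standard library's `graphlib` module (Python 3.9 and later) stands in for whatever ordering logic a real installer uses.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical installable units mapped to the units they depend on.
units = {
    "finance-app": {"app-server", "database"},
    "app-server": {"java-runtime"},
    "database": set(),
    "java-runtime": set(),
}


def install_order(units):
    """Return an order in which every unit comes after its prerequisites."""
    return list(TopologicalSorter(units).static_order())


def missing_prerequisites(units):
    """List prerequisites that are referenced but not defined as units."""
    referenced = set().union(*units.values()) if units else set()
    return sorted(referenced - set(units))
```

A check like `missing_prerequisites` is one simple form of dependency checking: it reports a prerequisite that nobody has packaged before the installation starts rather than halfway through it.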

dW: That sounds like a lot of rewriting of existing code, is that right?

RT: That's a good point. We are definitely not on the same tack as some vendors who have what we call a "rip and replace" strategy. We're not on that at all. We spend a lot of time focusing on the evolution of autonomic computing, rather than a revolution needing to happen. We have a whole set of work where we describe the steps you take as a developer, or more generally as a company or as a consumer, to evolve your data center toward autonomic computing. And that's true for developers as well.

Let me give you an example. If you have a product in the field today and it writes events, logs, and trace information in a certain format, we have something called a Generic Log Adapter available as part of the Autonomic Computing Toolkit. You don't have to rewrite your logging to the Common Base Event format, you can use the Generic Log Adapter to convert from your existing format to the Common Base Event format. The adapter is a tool; you define the rules by which you map your existing format to the Common Base Event format, and then anybody who wants to aggregate and correlate those events can. It'll do the translation on the fly, as needed, for whatever log files you need to pick up, so it doesn't require you to rewrite everything right away.
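The rule-based mapping described here can be pictured as a pattern that pulls fields out of an existing log line and re-emits them in a common shape. The Python sketch below is not the Generic Log Adapter's rule syntax. The legacy line format, the severity scale, and the output field names are all assumptions made for illustration.

```python
import re

# Hypothetical legacy format:
#   "2004-12-02 14:03:11 [ERROR] billing: payment timed out"
LEGACY_RULE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<component>[\w-]+): (?P<text>.*)"
)

# Mapping from legacy level names to a numeric severity (assumed scale).
SEVERITY = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 50}


def adapt_line(line):
    """Translate one legacy log line into a common event dict, or None."""
    match = LEGACY_RULE.fullmatch(line.strip())
    if match is None:
        return None  # the line does not fit this rule
    return {
        "created": f"{match['date']}T{match['time']}",
        "source": match["component"],
        "severity": SEVERITY.get(match["level"], 0),
        "message": match["text"],
    }


def adapt_log(lines):
    """Convert lines lazily, skipping any that the rule cannot map."""
    return (event for event in map(adapt_line, lines) if event is not None)
```

The application keeps writing its old format. Only the rule has to know about it, and `adapt_log` translates lines as they are read.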

dW: Provided I had a well-behaved application to begin with.

RT: Yes, provided you had a reasonable way that you log events today. And, we haven't really seen any that we can't map. Some are tougher than others, but we haven't seen anything that we can't map.

It's a little more complicated in the solution install arena, but we do the same thing there. As a first step, you can wrap your existing install rather than rewrite it, and you will still get some of the benefits, like dependency checking, from the Solution Install technology. So, in all cases, we're very sensitive to what we call a crawl-walk-run approach to adopting autonomic computing. We do have these crawl steps if you just want to sort of dip your toe in the water.

dW: Let's say I want to try this -- I'm going to download the Autonomic Computing Toolkit. What do I do once I've got it downloaded?

RT: Once again, it depends on the audience. If we take the same example we used earlier, you are a developer and have a product that gets used in a large enterprise, and you're developing their banking application, or --

dW: It could be that, but also someone with a new product that hasn't yet launched, or someone working on an in-house app --

RT: Sure, I understand that, but pick one: it will be different for each! (laughter).

Let's say I'm a developer in a large banking company and I write code for our Web banking app. When people come in to pay their bills online, that Web screen they see is my program; that's what I'm responsible for. So, I would tell that person there are some standards that need to be adopted inside the IT department of my company.

When we go into a large enterprise that has their own application programmers, developing Web-facing applications or whatever, we say you need to adopt a standard, the Common Base Event, as your own internal standard. Most log events that need to be analyzed come not from the middleware (not DB2, not the Cisco router), but from their own applications, the code they are writing. And what generally happens is that the guys who are writing the Java applications have their own log format, and the guys who are writing the database access code have their own log format, and they don't even have standards within their own organization. So the toolkit contains the definition of the standard, the Generic Log Adapter, and the log analyzer. It contains all the components you need to start standardizing on a common event format within your own enterprise, and then put a plan in place to, over time, convert to an internal corporate standard and start a series of activities around common logging and common log analysis.

One of the firms we're working with is a large business information company that is implementing processes for moving logs overnight to a secure area so that if they need to be correlated, they can be all brought together into one physical location. Those are the kinds of things that would start triggering the development of programmer guidelines -- internal policies for how to do the common logging and trace. Because putting what we call basic hygiene into place is critical to making the self-healing and the self-optimizing functions work. It makes it so much easier to do self-healing when it is easier to figure out what the problem is in the first place. So, putting in basic hygiene for common events is one of the things I'd be focused on if I were a developer inside a corporate IT group.

And then there are related kinds of things, the other developments of the toolkit. In the solution install arena, I'd be putting in and defining standards, and putting in programmer guidelines for how to create the installable units, and so on, for all the different groups.

Read "Meet the experts: Kathryn Britton on autonomic computing's Integrated Solutions Console" for more information.

We also, as you know, have something called the Integrated Solutions Console, which is a common way to create what we call the human computer interaction components area, where the admins meet the computer in terms of an end-user interface on the administrative console. And all of that is in the toolkit. I'd start with that for any part of my application that is an end-user interface, or that is presented to the administrator.

dW: And all of this is in the toolkit already?

RT: That is all in the toolkit.

dW: But there is more to it than a toolkit...

RT: I think there's a combination of things that developers think about with regard to autonomic computing. I believe that, number one, they think about their own applications and programs they're writing, and what they are doing to put self-managing capabilities in from the start. Sometimes that is independent of any tool that we give them; it's just the approach and methodology they use to develop their applications. A lot of autonomic capabilities that ship in IBM applications and programs like DB2 didn't require any core technology or deliverables from the autonomic group; it was just a group of programmers within DB2 that said, "How can we make this system more self-managing?" They implemented a number of capabilities that are unique to the database world, and DB2 now requires a fraction of the administration costs that it used to. What they've done in DB2 can be mapped to any programmer's application, if you think about what requires human intervention today, and ask yourself, "Do I really need that human intervention for my application?" That's one thing that programmers think about.

The second thing is the standards. As you see in the toolkit, and through other initiatives for self-managing standards, developers need to be thinking about how they can move on to that.

And the third thing, of course, is any tooling we ship in our toolkit. Developers can take advantage of, and make use of, the event logging or the Solution Install technology in their applications. These three things, really, are what they should be thinking about.

dW: Can you tell me about this MAPE loop? This is a whole new thing, right? You made it up for autonomic computing? It didn't exist as a best practice before?

MAPE: monitor, analyze, plan, and execute

RT: Correct. And it is another thing that's germane to developers. The MAPE loop is a consolidation of concepts that have been around for a long time into one coherent high-level architecture that says, "These are the steps, with an autonomic manager (in the abstract), that need to take place to make self-management decisions." And it's a model that can serve applications in any number of domains. It brings a common view of how components that are self-managed can interoperate. The MAPE loop construct essentially says that there is a set of managed resources that can be anything from an application, to a piece of middleware, to a server, to a router, within the data center --


Figure 1. MAPE control loop

dW: So it includes hardware?

RT: Oh yeah, absolutely. Anything in the data center, any discrete element, is considered a manageable resource. You know, I spend as much time with hardware people as I do with software people. Because storage area networks, clusters of UNIX servers, BladeCenters™, and so on are all manageable units, and all have points of failure, and all have ways to optimize and heal and so on. And all emit events that need to be monitored, right?

So it's the hardware, it's the middleware, it's the applications; all are discrete manageable resources that need to be monitored. Think about what a human being is paid to administer and to do; that is what is costing the IT center so much money. They have a lot of people who are constantly monitoring what's going on in all these managed resources, tuning them, configuring them and such, and trying to ensure that there are no problems. So if you want to get that functionality -- all of the things that people are doing today -- embedded in the management infrastructure of your data center, you need autonomic managers. This is essentially taking that functionality and coding it into software. The autonomic manager essentially provides:

  • A constant, closed-loop concept of monitoring what is going on in some set of managed resources
  • Analyzing the data that comes in
  • Making the plan for change and executing that change, if something needs to change

It's dependent on a set of knowledge in addition to the data it's monitoring, sort of background knowledge (the knowledge components of the MAPE loop), such as "What are the policies? How do I know what decisions to make? What are the known symptoms?" In other words, when I see a certain series of events, how do I know if that's important or not?

That's something that is called the symptom information, and is another stored form of knowledge that we're creating for the autonomic computing architecture. What are the properties and capabilities of the managed resources that I can effect? We call those the effectors.

And this is essentially the knobs and dials, if you will, of the given resource you are monitoring, and that the autonomic manager is capable of changing. So that's the model, then you instantiate that model in any number of products. Tivoli® has a product called Risk Manager that's constantly monitoring security events. We're also working on something called the Autonomic Monitoring Engine, which looks for known potential problems in a managed resource and can take action before the problems occur. For example, if a disk is filling up, it can allocate more space before it fills up, or alert the administrator that the disk is about to fill up, or both.
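The monitor, analyze, plan, and execute steps, together with the knowledge they rely on, can be sketched around the disk example. This is a minimal Python sketch under stated assumptions, not code from the Autonomic Computing Toolkit or the Autonomic Monitoring Engine: the `Disk` class, the policy values, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    """Hypothetical managed resource with one sensor and one effector."""
    capacity_gb: float
    used_gb: float

    def usage(self):
        """Sensor: what the manager can observe."""
        return self.used_gb / self.capacity_gb

    def grow(self, extra_gb):
        """Effector: the knob the manager is allowed to turn."""
        self.capacity_gb += extra_gb


# Knowledge: the policy the manager consults (values are assumptions).
POLICY = {"max_usage": 0.80, "grow_step_gb": 50.0}


def mape_cycle(disk, policy=POLICY):
    """Run one monitor-analyze-plan-execute pass; return the actions taken."""
    usage = disk.usage()                       # Monitor
    symptom = usage > policy["max_usage"]      # Analyze against the policy
    plan = []
    if symptom:                                # Plan
        plan = [("grow", policy["grow_step_gb"]),
                ("alert", f"disk at {usage:.0%}")]
    for action, arg in plan:                   # Execute
        if action == "grow":
            disk.grow(arg)
        elif action == "alert":
            print(f"ALERT: {arg}")
    return plan
```

A real manager would call `mape_cycle` on a schedule or on each incoming event. Here a single call on a disk that is 90% full both allocates more space and alerts the administrator, and a second call finds nothing to do.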

dW: Would there be any point in incorporating that in my programming even if I wasn't using the autonomic computing framework and components from the toolkit?

RT: Yes. We separate the architecture we've done from the sample implementations. We have this go on at different places. For instance, in some of the university programs, there are people writing autonomic managers that are compliant from an architectural point of view. They operate in this mode that I talked about, and they can interoperate with other autonomic managers, but they didn't use any of the code that we ship as examples in the toolkit. Or, you can pick up code we've prebuilt and include it as a component when you're building your solution. It is similar to many other things in the industry around standards -- sometimes people like to write their own implementation of an architecture or a standard, and sometimes they like to leverage someone else's implementation. Both are valid, depending on what your needs are.

dW: Thank you so much for the interview!


