Level: Introductory
developerWorks staff (dwinfo@us.ibm.com), Editorial Staff, IBM
02 Dec 2004
This question and answer article features Ric Telford, Director for Autonomic Computing at IBM. developerWorks talked with Ric about the state of autonomic computing today, and the challenges of developing with, and for, next-generation autonomic systems.
Ric Telford, the Director for Autonomic Computing at IBM, has many years
of experience developing emerging businesses and deploying new
technologies, including on-demand and utility computing, and security. His
newest passion is autonomic computing.
developerWorks: Your job title is officially "Director of Architecture and Technology
for Autonomic Computing." What does that encompass?
Ric Telford:
Well, as Paul Horn said almost three years ago, autonomic computing is a call to arms to the industry. We have to be better about making our components and IT systems more self-managing, because the cost of
maintenance, operations, and administration of our systems today is
greatly outstripping the actual cost of the components themselves --
and this equation won't close in the long term.
So, we have an organization to help drive that, and my part of the
organization is architecture, technology, and standards. The
architecture is really to drive the IBM internal deployment and
specifications for how we make our own products autonomic. We take the set of architectures, and take those pieces that are fundamental to a heterogeneous multivendor environment, and
propose those as standards. Then we work at the industry level to help
drive these autonomic computing standards across the
industry.
In support of the architecture and standards work, I also have
responsibility for ensuring that the actual implementations of what
we're proposing get built and developed. And we are constantly advancing the technology. I do that through a number of
different avenues. I have a joint program with IBM Research, because
a lot of this technology is rooted in fundamental research that we
rely heavily on our research division to deliver. I have programs with
universities and the academic community at large, and government agencies
-- who all can help in the mission to drive autonomic computing across
the industry by focusing some of their research on self-managing
systems. I also have a small technology team that does proofs of
concept, prototypes, and demos to start showing the real value of how
we can take these autonomic computing technologies and apply them to
real business problems. So, it's the technical side of the autonomic
computing mission that I'm held responsible for.
dW:
You mentioned that this drive started about three years ago. Was there
a crucial event that took place to spur it -- did the complexity
suddenly increase, or has it been building up
over the whole 40-50 years of the modern computing era? Or something
else?
RT:
The short answer is yes, there was an important event that happened,
which was the advent of the Internet. Now, clearly, complexity in IT
systems has been building over the last 40-50 years. When you
have a mainframe in one room, where all the computational work goes
on, and the only thing that's distributed is the presentation of the
information out to "green screens" (or terminals) -- that's pretty
simple compared to the next wave, which was client-server computing. [In client-server computing], distributed servers and then smart
clients and computational capabilities are spread out across your
enterprise. That's more CPUs, if you will, that need to be managed,
monitored, maintained, and so on.
[Siloed: to be a separate entity that does not mix with other entities.]
But they were still somewhat siloed applications; you had to have a certain kind of client connected to a certain kind of server. What made it exponentially worse was the Internet. Now you
had a common way to come in to any application, and the expectation was
that you could create applications of aggregated data from multiple
sources. So what do we see now when we go into, say, your online banking
system? I know of one large bank that said, I think, that they have
22 different data sources just to populate your profile page (not
their home page, but rather, when you log in, your profile page or
your customized home page). 22 data sources! Some of these are
ancient back-end proprietary banking systems, some of them are IMS
databases, some of them kick off CICS transactions, there are relational
stores, and data in Oracle. So you're lashing all this stuff together
now, and dealing with the complexity involved in having so many systems tied
together to give one unified view. If there is ever a break in the
chain on one of those, you can essentially prevent the whole
application from running.
dW:
It's funny that you mention banking in particular. I find very often
when I go into a bank or talk to my bank or credit card companies on
the phone, I am told that they are down.
RT:
I tell you, that's my favorite example, because that's where I've seen the most
data aggregation in applications. You
know, my bank had a problem -- and they're also an IBM customer, so
I'm their customer, they're my customer, and I have a good relationship
with them -- and I mentioned it and they said, "You know what that
problem was? We had a small database, and all it does is save your
preferences -- your customized settings for your home page -- and that
database had a failure. And so the whole Web site just didn't come up
at all. It went to retrieve the data, it couldn't get the data, so the
whole application crashed." It's that kind of thing -- having all these
different potential points of failure -- and you've got to make sure
that it all works.
dW:
Well, with that we have set up the problem. How does autonomic
computing fix it?
RT:
The vision of autonomic computing is self-managing systems: to put
more smarts into the components and the systems management and the
applications themselves, to both detect and prevent -- and then
self-cure -- problems. It's a combination of a lot of technology.
But the net is that they all come to bear around ensuring that either
failure doesn't happen, or the system doesn't go into sub-optimal
mode, or in some cases, that the proper configuration is done so
things that can essentially be root causes for problems down the road
don't occur. Or if problems do occur, that they are mitigated in a way
that keeps the whole application from coming down. So there is no one
silver bullet technology -- it is a combination of things we are
working on.
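Editor's note: the bank example above, where a failed preferences database kept the whole Web site from coming up, can be illustrated with a minimal sketch. The source names, the `DEFAULTS` table, and the `build_profile_page` function are all hypothetical and invented for illustration; this is one possible way to mitigate a failure in a non-critical source, not a description of any bank's or IBM's implementation.

```python
from typing import Callable, Dict

# Hypothetical data sources: each returns the data for one section of the page.
def load_balances() -> dict:
    return {"checking": 1250.00}

def load_preferences() -> dict:
    # Simulates the small preferences database being down.
    raise ConnectionError("preferences database unavailable")

# Fallback values for non-critical sources (assumed, for illustration only).
DEFAULTS: Dict[str, dict] = {"preferences": {"theme": "standard"}}

def build_profile_page(sources: Dict[str, Callable[[], dict]]) -> dict:
    """Aggregate all sources; fall back to defaults instead of failing the page."""
    page: dict = {}
    degraded = []
    for name, fetch in sources.items():
        try:
            page[name] = fetch()
        except Exception:
            if name not in DEFAULTS:
                raise  # a source with no fallback still fails the whole page
            page[name] = DEFAULTS[name]
            degraded.append(name)
    page["degraded_sections"] = degraded
    return page

page = build_profile_page(
    {"balances": load_balances, "preferences": load_preferences}
)
print(page)
```

In this sketch the page still renders with default preferences, and the list of degraded sections is the kind of information a management layer could pick up and act on.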
We talk about self-management in four categories because they relate
very closely to how customers think about maintaining and managing
their IT systems:
- Self-configuration addresses a huge
problem, which is in the tuning and setting of the configuration of
all these pieces of IT infrastructure that need to work together to
make a system work optimally. In other words, it's not just about the
Web site not coming up; it's about getting responses back to you, and
having a reasonable response time.
- Self-healing is being able to do better problem determination,
autonomic root cause analysis, and then ultimately self-healing.
- Self-optimization is adapting to changing workloads. One cause
of problems in systems today is spiky workloads -- rapid changes in
the amounts of work. You want to be able to have the optimal amount
of resources applied based on the workload, whether it be within one
database, across a farm of databases, or across an entire Web
application. Whatever the workload happens to be, you want to make
sure you have enough resources. But you also want to make sure you're
not wasting resources, which is what happens with the way people optimize today. They do something called overprovisioning. If
a job takes only two servers most of the time but sometimes
takes 10 servers, they just keep 10 servers online all the time.
Most of the time you're basically wasting eight servers. So
self-optimization is all about making sure you have the right amount
of resources, and no more than you need. Hence "optimal" (laughs).
- Self-protecting systems are all about guarding the
infrastructure from external threats, and having better ways to allow systems to detect potential threats such as viruses,
denial of service attacks, illegal access, and so on.
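Editor's note: the overprovisioning arithmetic in the self-optimization item (two servers most of the time, 10 at peak, so eight wasted) can be worked through in a short sketch. The per-server capacity of 100 requests per second and the workload samples are assumptions made up for the example; `servers_needed` is a hypothetical helper, not part of any toolkit.

```python
import math

PEAK_SERVERS = 10  # static overprovisioning: always keep the peak number online

def servers_needed(requests_per_sec: float, capacity_per_server: float) -> int:
    """Smallest number of servers that covers the current workload (at least 1)."""
    return max(1, math.ceil(requests_per_sec / capacity_per_server))

# Hypothetical workload samples in requests/sec; one server is assumed to handle 100.
workload = [150, 180, 200, 950, 160]
for load in workload:
    needed = servers_needed(load, 100)
    idle = PEAK_SERVERS - needed
    print(f"load={load:4d}  needed={needed:2d}  idle if overprovisioned={idle}")
```

Sizing to the current workload instead of the peak is the whole idea; a real system would also have to account for how long it takes to bring resources online.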
dW:
If we take self-protection, and look at things
like firewalls, TCP wrappers, or tripwire, is it possible to say
that these are examples of autonomic elements? Where you can say,
"This is what we need to do for other areas of IT?" Or, do they not
actually fit in with what autonomic computing is all about?
RT:
The question is, "Are security mechanisms today synonymous with
self-protect?"
dW:
Well, are they going down the right road, maybe more than other
examples of things we already have today? For instance, a firewall can
be at least somewhat autonomic?
RT:
Well, I haven't really thought about it that way, but I don't think
so. We have examples of self-managing behavior. I can point to lots
of different technologies today, in each of these disciplines, that are
good isolated examples of behaviors we want to make more common across
other components or other applications.
dW:
Maybe it would be better to talk about those than my example.
[MVS: Multiple Virtual Storage. A mainframe operating system.]
RT:
OK. The thing to remember about autonomic computing is on one hand,
it's nothing new. On the other hand, it's doing old things in a new
way, so it all depends how you look at it. But if you go to
Poughkeepsie, and you talk about autonomic computing and give
them examples of behaviors that are desirable in an IT environment,
they say, "We invented that 30 years ago in MVS™" -- and that's
true. Those are some of the behaviors we point to, such as the way that z/OS®
has a common model for the logging of events, and so on. If you are
running on z/OS, there are rules by which you do that, such as what you need
to log, and how you need to log events that go on your system. This makes it
really easy to correlate events from across different components of
the application stack, and to do root cause analysis. So problem
determination is not perfect on the z, but there are a lot of
capabilities there that you don't have elsewhere.
We want to take those sorts of best practices, and make them
work in a heterogeneous multivendor environment as well as they work
in a homogeneous single-vendor environment like the z/OS environment.
And that's part of what autonomic computing is about, and why
standards are important.
We submitted a standard called Common Base Event. It is pretty
simple in concept, but pretty powerful in capability. Basically, the
idea is that everybody should start logging events that are going on
in their infrastructure, whatever they happen to be -- an SAP
application, an Oracle database, a Sun server, a Cisco router -- in a
common way. If we could all get this done in a common way, using the
common situation model that we defined (which is not just a syntax, but also has some semantics, or meaning, that you
also describe in a common way inside the event), that will make
it so much easier to do problem determination, root cause analysis,
and self-healing systems, just by this one initiative, where
everyone starts recording what is going on in a common way.
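Editor's note: the value of logging "in a common way" can be sketched with a single shared event structure. The field names, the situation labels, and the severity scale below are illustrative assumptions only; they are NOT the actual Common Base Event schema, which is defined in the specification itself.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass(frozen=True)
class CommonEvent:
    # Illustrative fields only; not the real Common Base Event schema.
    timestamp: str         # ISO 8601 in UTC
    source_component: str  # which component reported the event
    situation: str         # shared vocabulary, e.g. "STOP", "CONNECT_FAILED"
    severity: int          # assumed scale: 0 (info) to 50 (fatal)
    message: str

def make_event(component, situation, severity, message, when=None):
    """Build an event; every component goes through the same constructor."""
    when = when or datetime.now(timezone.utc)
    return CommonEvent(when.isoformat(), component, situation, severity, message)

t0 = datetime(2004, 12, 2, 9, 0, 0, tzinfo=timezone.utc)
t1 = datetime(2004, 12, 2, 9, 0, 5, tzinfo=timezone.utc)
events = [
    make_event("web-app", "CONNECT_FAILED", 40, "cannot load preferences", t1),
    make_event("prefs-db", "STOP", 50, "database stopped unexpectedly", t0),
]

# Same structure everywhere, so one sort yields a cross-component timeline.
timeline = sorted(events, key=lambda e: e.timestamp)
for event in timeline:
    print(json.dumps(asdict(event)))
```

With both components reporting in one format, the database stop appears directly before the Web application's connection failure, which is the starting point for root cause analysis.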
So a lot of these technologies exist in pockets. There are pockets of
self-protecting technologies -- it's great that a firewall logs
incoming hits against a port that's not open. It's even better if
those logged events are funneled up into what we call an autonomic manager,
that can maybe correlate that to mean there's a specific denial
of service attack, or a probe going on against the infrastructure. So
the combination of these technologies is really what can help make
autonomic computing possible. The ability for events to be
aggregated, then correlated, and analyzed is one of the key concepts
behind autonomic computing.
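Editor's note: the firewall example, where logged hits against closed ports are correlated into evidence of a probe, can be sketched as a simple sliding-window check. The event tuple layout, the thresholds, and the `detect_port_scans` function are assumptions for illustration; a real autonomic manager would use far richer rules.

```python
from collections import defaultdict

def detect_port_scans(events, window_sec=60, min_ports=5):
    """Return sources that hit at least min_ports distinct closed ports
    within any window of window_sec seconds.

    events: iterable of (timestamp_seconds, source_address, port) tuples,
    a hypothetical shape for blocked-connection log entries.
    """
    by_source = defaultdict(list)
    for ts, src, port in events:
        by_source[src].append((ts, port))

    suspects = set()
    for src, hits in by_source.items():
        hits.sort()
        start = 0
        for end in range(len(hits)):
            # Shrink the window until it spans at most window_sec seconds.
            while hits[end][0] - hits[start][0] > window_sec:
                start += 1
            ports = {port for _, port in hits[start:end + 1]}
            if len(ports) >= min_ports:
                suspects.add(src)
                break
    return suspects

# Sample data uses addresses reserved for documentation.
blocked = [(t, "192.0.2.10", 20 + t) for t in range(6)]
blocked += [(0, "198.51.100.7", 23), (300, "198.51.100.7", 25)]
print(detect_port_scans(blocked))
```

Each individual log entry is unremarkable; only the aggregated, correlated view shows that one source is probing many ports in a short time.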
dW:
Are you happy with the way the standards process is going for this
area?
RT:
Yes. When you work in a standards world, which is very
important to autonomic computing, you always have a balance between
the rate and pace at which you want to drive architecture into your
own instantiations of a standard, and the rate and pace of a standards
body. But we solve that by going in knowing that we may have
to change. So, we've probably implemented some features in the log
trace analyzer that may not be standard yet. And when the standard is
published for the Common Base Event -- if it differs from what we've
already implemented -- we may continue to support what we have today,
but we will make sure we are 100% standards-compliant. So, this is the
fine line you walk, and it's not a problem.
dW:
IBM is working with others in the industry on this -- Cisco, HP, and others. Are you happy with the progress that has
been made in the last three years [since autonomic computing was launched formally as
an initiative]?
RT:
Yes! The interesting thing is, we have so many things going on with
partners and we are all so busy, the only challenge is finding the
time -- it's not the lack of interest. We have a lot of interest from
people we've been working with in the industry, and there's a lot of
work to do, and the only gating factor is how much time and resources
we can apply to working together and moving things forward. But there
is certainly no shortage of interest and excitement in this area from
other companies in the IT industry that we are working with to advance
the cause of autonomic computing. Because it doesn't just benefit the
end customer, it benefits the IT vendors themselves. For example,
with self-healing systems, one of the advantages of autonomic
computing is that the number of problems that require a call in to a
customer support center for Cisco, or IBM, or HP will also go
down, and save us money.
What autonomic computing means to developers
developerWorks:
Let's say I'm a developer (at a small or medium-sized company)
and I want to incorporate this in my existing programs. What can I do?
Ric Telford:
So let's say you are a small company, and you have a piece of software, an
application that you run -- it runs on top of WebSphere® or DB2® or
Oracle -- it's a financial application, or what have you?
dW:
Yes.
RT:
One of the things you want to do is to make the events that you create
in your application -- the logs and the trace
information -- leverage this Common Base Event specification.
Because then all the tooling, and all the autonomic managers, and all
the event correlators that we think will evolve in the industry around
the CBE format will provide you benefits right off the bat.
Rational®, for example, has tooling that includes a log trace analyzer
that supports the Common Base Event. Now a customer that buys your
product, if they use that Rational tool, can correlate events
going on in your financial application along with related events that
go on in DB2 or WebSphere. In another example, a lot of these applications from small and medium software companies have a stack of middleware as prerequisites. Application developers are all hopefully
getting out of the business of writing their own databases, their own
transaction systems, and their own Web middleware, to focus on what they
do best, whatever their application is. But they still need to prereq
the installation of things like Oracle or DB2, or WebSphere,
Web application servers, and so on. One of the things we're working
on, that we delivered recently to the industry, is
Solution Installation for Autonomic Computing, which is a specification for how
you can describe your particular application in terms of a standard --
what we call an installable unit -- and aggregate it with other
installable units to have one customer view of a solution installed.
To the customer, they're just installing your product. But there
are mechanisms built in to install and configure the whole stack. It's a more unified solution view, rather than a set of products that
have to be installed separately. We
released that specification along with our partners InstallShield and ZeroG. So,
that is another area that a programmer would want to look into, and
figure out over time how to evolve their application and describe it
as installable units.
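Editor's note: the idea of describing a product as installable units that are aggregated into one solution can be sketched as a dependency graph. The unit names and the `install_order` function are hypothetical, and this is not the format or behavior of the Solution Installation specification; it only shows how prerequisites can be checked and ordered. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical installable units: name -> units that must be installed first.
units = {
    "finance-app": {"app-server", "database"},
    "app-server": {"java-runtime"},
    "database": set(),
    "java-runtime": set(),
}

def install_order(units):
    """Check that every prerequisite is described, then order the install."""
    required = {dep for deps in units.values() for dep in deps}
    missing = required - set(units)
    if missing:
        raise ValueError(f"unsatisfied dependencies: {sorted(missing)}")
    # static_order() yields each unit only after all of its prerequisites.
    return list(TopologicalSorter(units).static_order())

print(install_order(units))
```

To the customer this is a single install of the financial application; the ordering of the middleware stack underneath is worked out from the unit descriptions.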
dW:
That sounds like a lot of rewriting of existing code, is that right?
RT:
That's a good point. We are definitely not on the same tack as some vendors who have what we call a "rip and
replace" strategy. We're not on that at all. We spend a lot of time focusing on the
evolution of autonomic computing, rather than a revolution
needing to happen. We have a whole set of work
where we describe the steps you take as a developer, or more
generally as a company or as a consumer, to evolve your data
center toward autonomic computing. But it's true for
developers as well.
Let me give you an example. If you have a
product in the field today and it writes events, logs, and trace
information in a certain format, we have something called a Generic
Log Adapter available as part of the Autonomic Computing
Toolkit. You don't have to rewrite your logging to the Common Base
Event format, you can use the Generic Log Adapter to convert from
your existing format to the Common Base Event format. The adapter is
a tool; you define the rules by which you map your
existing format to the Common Base Event format, and then anybody who wants to
aggregate and correlate those events can. It'll do the translation on
the fly, as needed, for whatever log files you need to pick up, so it
doesn't require you to rewrite everything right away.
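Editor's note: the rule-based mapping described above can be sketched with regular expressions. The two log formats, the output field names, and the `adapt` function are invented for illustration; this does not show the Generic Log Adapter's actual rule syntax or the real Common Base Event fields.

```python
import re

# Each rule pairs a pattern (with named groups) with a label for the format.
# Both formats are hypothetical examples of "existing" application logs.
RULES = [
    (re.compile(r"^(?P<ts>\S+ \S+) \[(?P<level>\w+)\] (?P<msg>.*)$"), "bracket-style"),
    (re.compile(r"^(?P<level>\w+)\|(?P<ts>[^|]+)\|(?P<msg>.*)$"), "pipe-style"),
]

def adapt(line: str, component: str):
    """Translate one existing log line into a common dictionary form.

    Returns None when no rule matches, so unmapped lines can be reviewed.
    """
    for pattern, rule_name in RULES:
        match = pattern.match(line)
        if match:
            return {
                "timestamp": match["ts"],
                "source_component": component,
                "severity": match["level"].upper(),
                "message": match["msg"],
                "matched_rule": rule_name,
            }
    return None

print(adapt("2004-12-02 09:00:05 [error] cannot load preferences", "web-app"))
print(adapt("WARN|2004-12-02T09:00:00|disk 91% full", "prefs-db"))
```

The application keeps writing its logs as before; only the rules change when a new format needs to be picked up.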
dW:
Provided I had a well-behaved application to begin with.
RT:
Yes, provided you had a reasonable way that you log events today.
And, we haven't really seen any that we can't map. Some are tougher
than others, but we haven't seen anything that we can't map.
It's a little more complicated in the solution install arena, but we
do the same thing there. As a first step, you can wrap your
existing install, so you do not have to rewrite it, and you will still
get some of the benefits, like dependency checking, from the Solution
Install technology. So, in all cases, we're very sensitive to what we call a crawl-walk-run approach to adopting autonomic computing. We do have these crawl steps if you just want
to dip your toe in the water.
dW:
Let's say I want to try this -- I'm going to download the
Autonomic Computing Toolkit. What do I do once I've got it downloaded?
RT:
Once again, it depends on the audience. If we
take the same example we used earlier, you are a developer and
have a product that gets used in a large enterprise, and you're
developing their banking application, or --
dW:
It could be that, but also someone with a new product that hasn't yet
launched, or someone working on an in-house app --
RT:
Sure, I understand that, but pick one: it will be
different for each! (laughter).
Let's say I'm a developer in a large banking company and I write code for our Web banking
app. When people come in to pay their bills online, that Web screen
they see is my program; that's what I'm responsible for. So,
I would tell that person there are some standards that need to be
adopted inside the IT department of my company.
When we go into a large enterprise
that has their own application programmers, developing Web-facing
applications or whatever, we say you need to adopt a standard, the
Common Base Event, as your own internal standard. Most log events that need to be
analyzed come not from the middleware (not DB2, not the Cisco router), but from their own applications, the code they are writing. And what generally
happens is that the guys who are writing the Java applications have
their own log format, and the guys who are writing the database access
code have their own log format, and they don't even have standards
within their own organization. So the toolkit contains the definition
of the standard, the Generic Log Adapter, and the
log analyzer. It contains all the components you need to start
standardizing on a common event format within your own enterprise, and
then put a plan in place to, over time, convert to an internal corporate
standard and start a series of activities around common logging and
common log analysis.
One of the firms we're working with is a large business information
company that is implementing processes for moving logs overnight
to a secure area so that if they need to be correlated, they can be
all brought together into one physical location. Those are the kinds
of things that would start triggering the development of programmer guidelines -- internal policies for how to do the common logging and trace. Because putting what we call basic hygiene into place is critical to making the
self-healing and the self-optimizing functions work. It makes it so
much easier to do self-healing when it is easier to figure out what
the problem is in the first place. So, putting in basic hygiene for
common events is one of the things I'd be focused on if I were a developer inside a
corporate IT group.
And then there are related kinds of things, the other developments of
the toolkit. In the solution install arena, I'd be putting in and
defining standards, and putting in programmer guidelines for how to
create the installable units, and so on, for all the different groups.
We also, as you know, have something called the Integrated Solutions
Console, which is a common way to create what we call the human-computer
interaction components: the area where the admins meet the
computer, in terms of an end-user interface on the administrative
console. And all of that is in the toolkit. I'd start with that for
any part of my application that is an end-user interface, or that is
presented to the administrator.
dW:
And all of this is in the toolkit already?
RT:
That is all in the toolkit.
dW: But there is more to it than a toolkit...
RT:
I think there's a combination of things that
developers think about with regards to autonomic computing. I believe
that, number one, they think about their own applications and programs
they're writing, and what are they doing to put self-managing
capabilities in from the start. Sometimes that is independent of
any tool that we give them; it's just the approach and methodology
they use to develop their applications. A lot of autonomic
capabilities that ship in IBM applications and programs like
DB2 didn't require any core technology or deliverables from the
autonomic group; it was just a group of programmers within DB2 who
said, "How can we make this system more self-managing?" They
implemented a number of capabilities that are unique to the database
world, and DB2 now requires a fraction of the administration costs
that it used to. What they've done in DB2 can be mapped to any
programmer's application, if you think about what requires human
intervention today, and ask yourself, "Do I really need that human
intervention for my application?" That's one thing that
programmers think about.
The second thing is the standards. As you
see in the toolkit, and through other initiatives for self-managing
standards, developers need to be thinking about how they can move on to that.
And the third thing, of course, is any tooling we ship in our toolkit. Developers can take advantage of, and make use of, the event logging or the Solution Install technology in their applications. These
three things, really, are what they should be thinking about.
dW:
Can you tell me about this MAPE loop? This is a whole new thing,
right? You made it up for autonomic computing? It didn't exist as a
best practice before?
[MAPE: monitor, analyze, plan, and execute]
RT:
Correct. And it is another thing that's germane to developers. The
MAPE loop is a consolidation of concepts that have been around for a
long time into one coherent high-level architecture that says, "These
are the steps, with an autonomic manager (in the abstract), that need to take place to make self-management decisions." And
it's a model that can serve applications in any number of domains. It brings a common view of how components that are
self-managed can interoperate. The MAPE
loop construct essentially says that there is a set of managed
resources that can be anything from an application, to a piece of
middleware, to a server, to a router, within the data center --
Figure 1. MAPE control loop
dW:
So it includes hardware?
RT:
Oh yeah, absolutely. Anything in the data center, any discrete
element, is considered a manageable resource. You know, I spend as
much time with hardware people as I do with software people. Because
storage area networks, clusters of UNIX servers, BladeCenters™, and so
on are all manageable units, and all have points of failure, and
all have ways to optimize and heal and so on. And all emit events that
need to be monitored, right?
So it's the hardware, it's the middleware, it's the applications;
all are discrete manageable resources that need to be monitored. Think about what a human being is paid to administer and to do; that is what is costing the IT center so much money. They have a lot
of people who are constantly monitoring what's going on in all these
managed resources, tuning them, configuring them and such, and trying to ensure that there are no problems. So if you want to get
that functionality -- all of the things that people are doing today -- embedded in the management infrastructure of your data center, you
need autonomic managers. This is essentially taking that
functionality and coding it into software. The autonomic manager essentially
provides:
- Constant, closed-loop monitoring of what is
going on in some set of managed resources
- Analysis of the data that comes in
- Planning for change, and
execution of that change if something needs to change
It's dependent on a set of knowledge in
addition to the data it's monitoring, sort of background knowledge
(the knowledge components of the MAPE loop), such as "What are the policies? How do I know what decisions to make? What are
the known symptoms?" In other words, when I see a certain series of
events, how do I know if that's important or not?
That's something that is called the symptom information, and is
another stored form of knowledge that we're creating for the autonomic
computing architecture. What are the properties and capabilities of
the managed resources that I can affect? We call those the effectors.
These are essentially the knobs and dials, if you will, of the
given resource you are monitoring that the autonomic manager is
capable of changing. So that's the model; then you instantiate that
model in any number of products. Tivoli® has a product called Risk
Manager that's constantly monitoring security events. We're also
working on something called the Autonomic Monitoring Engine, which
looks for known potential problems in a managed resource
and can take action before the problems occur. For example, if a disk is
filling up, it can allocate more space before it fills up, or alert
the administrator that the disk is about to fill up, or both.
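Editor's note: the monitor, analyze, plan, and execute steps can be sketched using the disk example from the answer above. Everything here is hypothetical: the `Disk` class, the 90% threshold, and the 50 GB growth step are assumptions chosen for illustration, and the code is not taken from the Autonomic Computing Toolkit or the Autonomic Monitoring Engine.

```python
from dataclasses import dataclass

@dataclass
class Disk:
    """Hypothetical managed resource: used_gb is the sensor, grow() the effector."""
    capacity_gb: float
    used_gb: float

    def grow(self, extra_gb: float) -> None:
        self.capacity_gb += extra_gb

# Knowledge: the policy the manager consults (values are assumptions).
KNOWLEDGE = {"max_utilization": 0.90, "grow_by_gb": 50.0}

def monitor(disk: Disk) -> float:
    """Monitor: read the sensor and return utilization between 0.0 and 1.0."""
    return disk.used_gb / disk.capacity_gb

def analyze(utilization: float, knowledge: dict) -> bool:
    """Analyze: does the data match a known symptom (disk about to fill up)?"""
    return utilization > knowledge["max_utilization"]

def plan(symptom: bool, knowledge: dict) -> list:
    """Plan: choose actions; here, both alert the administrator and add space."""
    if not symptom:
        return []
    return [("alert", "disk is about to fill up"), ("grow", knowledge["grow_by_gb"])]

def execute(disk: Disk, actions: list, alerts: list) -> None:
    """Execute: apply each planned action through the resource's effector."""
    for kind, arg in actions:
        if kind == "grow":
            disk.grow(arg)
        elif kind == "alert":
            alerts.append(arg)

def mape_iteration(disk: Disk, knowledge: dict, alerts: list) -> list:
    """One pass of the loop; a real manager would repeat this continuously."""
    actions = plan(analyze(monitor(disk), knowledge), knowledge)
    execute(disk, actions, alerts)
    return actions

disk, alerts = Disk(capacity_gb=100.0, used_gb=95.0), []
print(mape_iteration(disk, KNOWLEDGE, alerts), disk.capacity_gb)
print(mape_iteration(disk, KNOWLEDGE, alerts), disk.capacity_gb)
```

The first pass finds the disk at 95% and both alerts and grows it; the second pass finds it back under the threshold and does nothing, which is the closed-loop behavior the model describes.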
dW:
Would there be any point in incorporating that in my programming even
if I wasn't using the autonomic computing framework and components
from the toolkit?
RT:
Yes. We separate the architecture we've done from the sample
implementations. We have this go on at different places. For
instance, in some of the university programs, there are people writing
autonomic managers that are compliant from an architectural point of
view. They operate in this mode that I talked about, and they can
interoperate with other autonomic managers, but they didn't use any of
the code that we ship as examples in the toolkit. Or, you can pick up
code we've prebuilt and include it as a component when you're
building your solution. It is similar to many other things in the
industry around standards -- sometimes people like to write their own
implementation of an architecture or a standard, and sometimes they like
to leverage someone else's implementation. Both are valid, depending
on what your needs are.
dW: Thank you so much for the interview!
Resources
- To learn more about autonomic computing, visit the developerWorks Autonomic computing zone. You'll find
technical documentation, how-to articles, education, downloads, product
information, and more.
- You can download the Autonomic Computing Toolkit to take advantage of the tools discussed in this article.
- Keep up with news, documentation, and downloads at the IBM Autonomic computing home page.
- The articles An autonomic computing roadmap and Understand the autonomic manager concept (developerWorks, February 2004) provide information about self-CHOP, the MAPE Loop, and more.
- For more about the Integrated Solutions Console, read Meet the experts: Kathryn Britton on autonomic computing's Integrated Solutions Console (developerWorks, August 2004) or download the Integrated Solutions Console.